Problem

The current blake3 crate leaves a lot of single-core performance on the table for message sizes below 8 KiB. Namely, it doesn't SIMD-parallelize hashing for small messages.

As a PoC, I've rewritten a BLAKE3 scheduler from scratch with a modified AVX2 backend:
https://github.com/firedancer-io/firedancer/tree/ripatel/fd_blake3/src/ballet/blake3
When hashing many independent 2 KiB messages concurrently, my implementation does 25 Gbps, while the C implementation does ~7 Gbps.
I would like to contribute my changes back to this official library.
My code is Apache-2.0 licensed, so feel free to copy from it.
Suggested Changes
There are three major pieces required:
Adapt the SIMD backends to process each lane independently, namely:
- Support lane masking (for when one AVX lane finishes hashing before another does). Currently, all lanes are always active.
- Support independent chunk counters and flags. The current backend assumes a contiguous range of chunks or parents.
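To make the two backend requirements concrete, here is a minimal sketch of what a per-lane job layout could look like. All names and the layout are illustrative assumptions on my part, not the crate's actual types:

```rust
// Hypothetical per-lane state for a widened backend. Unlike the current
// backend, each lane carries its own chunk counter and flags, and a bitmask
// tracks which lanes still have work.
const SIMD_DEGREE: usize = 8; // e.g. AVX2

struct LaneBatch {
    counters: [u64; SIMD_DEGREE],  // independent chunk counter per lane
    flags: [u8; SIMD_DEGREE],      // independent domain flags per lane
    remaining: [u32; SIMD_DEGREE], // chunks left to compress in each lane
}

impl LaneBatch {
    // Bit i is set iff lane i still has work; a masked compression step
    // would skip (or discard the output of) the cleared lanes.
    fn active_mask(&self) -> u8 {
        let mut mask = 0u8;
        for (i, &rem) in self.remaining.iter().enumerate() {
            if rem > 0 {
                mask |= 1 << i;
            }
        }
        mask
    }
}

fn main() {
    let batch = LaneBatch {
        counters: [0, 5, 2, 0, 0, 0, 0, 9],
        flags: [0; SIMD_DEGREE],
        remaining: [2, 0, 1, 0, 0, 0, 0, 3],
    };
    assert_eq!(batch.active_mask(), 0b1000_0101);
    println!("{:#010b}", batch.active_mask());
}
```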
Adapt the scheduler to dispatch operations from multiple in-flight hash calculations to the SIMD backend concurrently.
I'm not sure whether the hash tree scheduling algorithm proposed in the BLAKE3 paper is capable of this. When queueing operations for an in-flight hash state, if there are not enough chunks to hash in parallel to meet the SIMD degree, the algorithm should yield to the next in-flight hash state before actually starting to hash.
I've rewritten the scheduler from scratch, but it requires log2(chunk_cnt) * simd_degree * 32 bytes of working space per hash state. The algorithm I came up with is unfortunately much more complex than the elegant stack-based one in the paper.
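For a sense of scale, that working-space bound can be evaluated directly (the 32 presumably being the 32-byte chaining value; the concrete numbers below are examples I chose, not figures from the original post):

```rust
// Working-space estimate per in-flight hash state, per the stated bound
// log2(chunk_cnt) * simd_degree * 32 (32 = bytes per chaining value).
fn working_space_bytes(chunk_cnt: u64, simd_degree: u64) -> u64 {
    assert!(chunk_cnt > 0);
    // ceil(log2(chunk_cnt)); BLAKE3 chunks are 1 KiB each
    let levels = 63 - chunk_cnt.next_power_of_two().leading_zeros() as u64;
    levels * simd_degree * 32
}

fn main() {
    // A 1 MiB message = 1024 chunks; AVX2 simd_degree = 8:
    assert_eq!(working_space_bytes(1024, 8), 2560); // 2.5 KiB per state
    // A 2 KiB message = 2 chunks:
    assert_eq!(working_space_bytes(2, 8), 256);
    println!("{}", working_space_bytes(1024, 8));
}
```

So for small messages the per-state overhead stays modest, which matters when many states are in flight at once.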
Adapt the high-level API to tell the scheduler when there are multiple in-flight hash states.
The simplest way is a new function call: fn blake3_multi(messages: &[&[u8]]) -> Vec<[u8; 32]>
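A sketch of that one-shot API shape, with the per-message hashing stubbed out by a placeholder fold (`hash_one` is my invention, not the blake3 crate's API; a real implementation would feed all messages to the batch scheduler so chunk compressions from different messages fill the SIMD lanes):

```rust
// Placeholder "hash": XOR-fold of the message bytes. Stands in for a real
// BLAKE3 computation so the API shape is runnable in isolation.
fn hash_one(msg: &[u8]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for (i, byte) in msg.iter().enumerate() {
        out[i % 32] ^= *byte;
    }
    out
}

// Proposed batch entry point: hash many independent messages in one call.
// Here it loops sequentially; the real version would batch across messages.
fn blake3_multi(messages: &[&[u8]]) -> Vec<[u8; 32]> {
    messages.iter().map(|m| hash_one(m)).collect()
}

fn main() {
    let digests = blake3_multi(&[&b"abcd"[..], &b"1234"[..]]);
    assert_eq!(digests.len(), 2);
    assert_eq!(digests[0], hash_one(b"abcd"));
    println!("{} digests", digests.len());
}
```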
Another way is to use thread-local storage to keep track of streaming operations on the current thread:
s1 := Blake3::new()
s2 := Blake3::new()
s1.append("abcd"); // registers this append operation as a thread-local
s2.append("1234");
hash2 := s2.fini(); // finds that s1 is also queued via thread-locals, so hashes both s1 and s2
hash1 := s1.fini(); // no-op! the result is already available
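The thread-local scheme above could be sketched in Rust roughly as follows. The hashing itself is replaced by a placeholder fold, and all names here are assumptions for illustration, not the crate's API; the point is the registration/drain flow, where `fini` on any state hashes every pending state in one batch and caches the other results:

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Placeholder "hash": XOR-fold of the buffered bytes. A real implementation
// would run the batched SIMD scheduler over all queued states at once.
fn placeholder_hash(data: &[u8]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for (i, byte) in data.iter().enumerate() {
        out[i % 32] ^= *byte;
    }
    out
}

thread_local! {
    // Streaming states registered on this thread, keyed by id.
    static PENDING: RefCell<HashMap<u64, Vec<u8>>> = RefCell::new(HashMap::new());
    // Results computed as a side effect of an earlier batch.
    static RESULTS: RefCell<HashMap<u64, [u8; 32]>> = RefCell::new(HashMap::new());
    static NEXT_ID: RefCell<u64> = RefCell::new(0);
}

struct Blake3 {
    id: u64,
}

impl Blake3 {
    fn new() -> Self {
        let id = NEXT_ID.with(|n| {
            let mut n = n.borrow_mut();
            *n += 1;
            *n
        });
        PENDING.with(|p| p.borrow_mut().insert(id, Vec::new()));
        Blake3 { id }
    }

    // Registers this append operation against the thread-local queue.
    fn append(&mut self, data: &[u8]) {
        PENDING.with(|p| {
            p.borrow_mut().get_mut(&self.id).unwrap().extend_from_slice(data);
        });
    }

    fn fini(self) -> [u8; 32] {
        // No-op case: an earlier fini already hashed this state in its batch.
        if let Some(h) = RESULTS.with(|r| r.borrow_mut().remove(&self.id)) {
            return h;
        }
        // Drain ALL pending states and hash them together (sequentially here,
        // as a stand-in for the SIMD batch scheduler); cache the others.
        let drained: Vec<(u64, Vec<u8>)> =
            PENDING.with(|p| p.borrow_mut().drain().collect());
        let mut mine = [0u8; 32];
        for (id, buf) in drained {
            let h = placeholder_hash(&buf);
            if id == self.id {
                mine = h;
            } else {
                RESULTS.with(|r| r.borrow_mut().insert(id, h));
            }
        }
        mine
    }
}

fn main() {
    let mut s1 = Blake3::new();
    let mut s2 = Blake3::new();
    s1.append(b"abcd");
    s2.append(b"1234");
    let hash2 = s2.fini(); // batches s1 and s2 in one pass
    let hash1 = s1.fini(); // result already computed; just a lookup
    assert_eq!(hash1, placeholder_hash(b"abcd"));
    assert_eq!(hash2, placeholder_hash(b"1234"));
    println!("ok");
}
```

One design caveat with this approach: the thread-local registry couples otherwise independent hasher objects, so it only pays off when several states genuinely overlap on one thread.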
riptl changed the title from "Consider multi-block batching" to "Consider multi-message batching" on Feb 23, 2024.