- Most efficient prefetching distance - 2 Updates
- How to get mantissa of long double? - 18 Updates
- Tricky ... - 5 Updates
| Bonita Montero <Bonita.Montero@gmail.com>: Oct 04 04:30PM +0200 There's the Unix-command wc which counts words and lines. And the wc-implementation from the current GNU core utilities contain an optional very tricky AVX-implementation. This improves the speed of wc on my Linux-computer by factor 29. I improved this algorithm further to partition the data in three parts which I handle interleaved, i.e. 32-byte-chunks synchronously from each part and then I increment the common offset by 32. This is while the original-algorithm has a depencency-chain which limits out of order execution. I partitioned the data in three but not in four parts because there wouldn't be enough integer-registers - I need 14 in my interleaving-loop. With four parts I'd have much regiseter spilling and reloading which gains a performance similar to the original-algorithm. The speedup of the interleaved code over the original wc-algorithm is about 60%. The reason why I tell this is that I benchmarked the code under different conditions. I also improved the trivial algorithm with prefetching and partitioning and it could be run with either switched on like the AVX-code. This are the results on my Linux Ryzen 7 1800X: trivial / non-interleaved / non-prefetched 1 thread: 468MB/s trivial / non-interleaved / prefetched 1 thread: 492MB/s trivial / interleaved / non-prefetched 1 thread: 778MB/s trivial / interleaved / prefetched 1 thread: 694MB/s AVX / non-interleaved / non-prefetched 1 thread: 13731MB/s AVX / non-interleaved / prefetched 1 thread: 13757MB/s AVX / interleaved / non-prefetched 1 thread: 19722MB/s AVX / interleaved / prefetched 1 thread: 23558MB/s As you can see manual prefetching gives only a little gain for the trivial non-interleaved code, and it even drops with the trivial interleaved / prefetched code over the trivial interleaved / non -prefetched code. But for the AVX-code there's a significant speedup of the interleaved / prefetched code over the interleaved / non -prefetched code. So there are cases where prefetching gives a significant speedup. With interleaving there are more complex memory access patterns and I suspect that the prefetcher doesn't work that good under such conditions. If you're interested in the code. The relevant functions are the lambdas trivialSpaceCount and avxSpaceCount. The code is compilable only with C++20. #if defined(_MSC_VER) #include <Windows.h> #elif defined(__unix__) #include <pthread.h>
Subscribe to:
Post Comments (Atom)
|
No comments:
Post a Comment