Monday, October 4, 2021

Digest for comp.lang.c++@googlegroups.com - 25 updates in 3 topics

Bonita Montero <Bonita.Montero@gmail.com>: Oct 04 04:30PM +0200

There's the Unix-command wc which counts words and lines. And the
wc-implementation from the current GNU core utilities contain an
optional very tricky AVX-implementation. This improves the speed
of wc on my Linux-computer by factor 29.
I improved this algorithm further to partition the data in three
parts which I handle interleaved, i.e. 32-byte-chunks synchronously
from each part and then I increment the common offset by 32. This
is while the original-algorithm has a depencency-chain which limits
out of order execution. I partitioned the data in three but not in
four parts because there wouldn't be enough integer-registers - I
need 14 in my interleaving-loop. With four parts I'd have much
regiseter spilling and reloading which gains a performance similar
to the original-algorithm.
The speedup of the interleaved code over the original wc-algorithm
is about 60%.
The reason why I tell this is that I benchmarked the code under
different conditions. I also improved the trivial algorithm with
prefetching and partitioning and it could be run with either
switched on like the AVX-code. This are the results on my Linux
Ryzen 7 1800X:
 
trivial / non-interleaved / non-prefetched
1 thread: 468MB/s
trivial / non-interleaved / prefetched
1 thread: 492MB/s
trivial / interleaved / non-prefetched
1 thread: 778MB/s
trivial / interleaved / prefetched
1 thread: 694MB/s
AVX / non-interleaved / non-prefetched
1 thread: 13731MB/s
AVX / non-interleaved / prefetched
1 thread: 13757MB/s
AVX / interleaved / non-prefetched
1 thread: 19722MB/s
AVX / interleaved / prefetched
1 thread: 23558MB/s
 
As you can see manual prefetching gives only a little gain for the
trivial non-interleaved code, and it even drops with the trivial
interleaved / prefetched code over the trivial interleaved / non
-prefetched code. But for the AVX-code there's a significant speedup
of the interleaved / prefetched code over the interleaved / non
-prefetched code. So there are cases where prefetching gives a
significant speedup.
With interleaving there are more complex memory access patterns
and I suspect that the prefetcher doesn't work that good under
such conditions.
 
If you're interested in the code. The relevant functions are the
lambdas trivialSpaceCount and avxSpaceCount. The code is compilable
only with C++20.
 
#if defined(_MSC_VER)
#include <Windows.h>
#elif defined(__unix__)
#include <pthread.h>

No comments: