Bonita Montero <Bonita.Montero@gmail.com>: Mar 25 03:14PM +0100

I had some performance issues because of a lot of modulo operations in self-built hashing where the number of buckets couldn't be a power of two. So I asked myself what the throughput of a modulo operation is when the hash occupies the whole bit span, i.e. the most significant bit is always set, and the modulo operand is very small, so that the subtract-and-shift engine keeps running for a lot of cycles.

I'm XOR-ing the result of the modulo-3 operation with the next value to make each calculation dependent on the one before; most CPU time should be spent in the division itself, so the extra XOR overhead shouldn't be significant. All other operations will be executed OoO-parallel on your CPU, so the division and the XOR operation should run back-to-back. With the code below I interleave multiple chains of the described XOR-DIVs through an unrolling helper lambda. This allows the CPU to execute multiple DIVs at once.

Surprisingly, my small HP EliteDesk 800 G2 Linux PC with a Skylake CPU gets a higher throughput than my Zen2 AMD CPU when two or more DIV chains are executed OoO-parallel. The Skylake CPU can handle two DIVs at once, whereas the Zen2 CPU can handle only one.

#include <iostream>
#include <random>
#include <chrono>
#include <vector>
#include <atomic>
#include <utility>
#include <type_traits>
#include <limits>

using namespace std;
using namespace chrono;

int main()
{
    uint64_t sum, mod = 3;
    constexpr size_t N = 0x400 / sizeof(uint64_t), ROUNDS = 1'000'000;
    mt19937_64 mt;
    // values with the most significant bit always set: [2^63, 2^64 - 1]
    uniform_int_distribution<uint64_t> uid( numeric_limits<int64_t>::min(), -1 );
    vector<uint64_t> counters( N );
    for( uint64_t &c : counters )
        c = uid( mt );
    auto testPar = [&]<size_t Par>() mutable -> double
    {
        uint64_t prevs[Par] = { 0 };
        auto start = high_resolution_clock::now();
        // calls fn.operator()<I>() for every I in the index sequence
        auto unroll = []<size_t ... Indices, typename Fn>( index_sequence<Indices ...>, Fn fn )
        {
            (fn.template operator ()<Indices>(), ...);
        };
        for( size_t r = ROUNDS; r--; )
            for( size_t i = 0; i + Par <= N; i += Par )  // <= so the last block isn't skipped
                unroll( make_index_sequence<Par>(),
                    [&]<size_t I>() { prevs[I] = (counters[i + I] ^ prevs[I]) % mod; } );
        double ns = duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count()
                    / ((double)ROUNDS * N);
        uint64_t sumPrevs = 0;
        unroll( make_index_sequence<Par>(), [&]<size_t I>() { sumPrevs += prevs[I]; } );
        sum = sumPrevs;  // keep the results observable so the chains aren't optimized away
        return ns;
    };
    using test_lambda_t = remove_reference_t<decltype(testPar)>;
    static struct { size_t par; double (test_lambda_t::*fn)(); } const tests[] =
    {
        { 1, &test_lambda_t::template operator ()<1> },
        { 2, &test_lambda_t::template operator ()<2> },
        { 3, &test_lambda_t::template operator ()<3> },
        { 4, &test_lambda_t::template operator ()<4> }
    };
    for( auto const &test : tests )
        cout << test.par << ": " << (testPar.*test.fn)() << endl;
}

Show me your results (-std:c++20 / -std=c++20)!
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page. To unsubscribe from this group and stop receiving emails from it send an email to comp.lang.c+++unsubscribe@googlegroups.com.