Monday, October 11, 2021

Digest for comp.lang.c++@googlegroups.com - 25 updates in 4 topics

Branimir Maksimovic <branimir.maksimovic@icloud.com>: Oct 11 05:23AM

> fnv blocked 64: : 3.11791 GB/s 232%
> crc64: : 0.478093 GB/s
> crc64 blocked: : 2.39144 GB/s 400%
 
Proprietary?
 
--
 
7-77-777
Evil Sinner!
with software, you repeat same experiment, expecting different results...
Branimir Maksimovic <branimir.maksimovic@icloud.com>: Oct 11 05:41AM

> fnv blocked 64: : 3.11791 GB/s 232%
> crc64: : 0.478093 GB/s
> crc64 blocked: : 2.39144 GB/s 400%
 
Here is mine :
 
 
/*=============================================================================
* Copyright : (c) 2004 by Dzenis Softic / http://www.dzeni.com
*
* Filename : vmp_crc32.c
*
* Description: Calculates crc32 of string
*
* Company : Seenetix D.O.O.
*
* Authors : B. Maksimovic
*
* $Id$
*
*===========================================================================*/
#include <stddef.h>
 
static const unsigned crc32_table[] =
{
0x00000000, 0x77073096, 0xEE0E612C, 0x990951BA,
0x076DC419, 0x706AF48F, 0xE963A535, 0x9E6495A3,
0x0EDB8832, 0x79DCB8A4, 0xE0D5E91E, 0x97D2D988,
0x09B64C2B, 0x7EB17CBD, 0xE7B82D07, 0x90BF1D91,
0x1DB71064, 0x6AB020F2, 0xF3B97148, 0x84BE41DE,
0x1ADAD47D, 0x6DDDE4EB, 0xF4D4B551, 0x83D385C7,
0x136C9856, 0x646BA8C0, 0xFD62F97A, 0x8A65C9EC,
0x14015C4F, 0x63066CD9, 0xFA0F3D63, 0x8D080DF5,
0x3B6E20C8, 0x4C69105E, 0xD56041E4, 0xA2677172,
0x3C03E4D1, 0x4B04D447, 0xD20D85FD, 0xA50AB56B,
0x35B5A8FA, 0x42B2986C, 0xDBBBC9D6, 0xACBCF940,
0x32D86CE3, 0x45DF5C75, 0xDCD60DCF, 0xABD13D59,
0x26D930AC, 0x51DE003A, 0xC8D75180, 0xBFD06116,
0x21B4F4B5, 0x56B3C423, 0xCFBA9599, 0xB8BDA50F,
0x2802B89E, 0x5F058808, 0xC60CD9B2, 0xB10BE924,
0x2F6F7C87, 0x58684C11, 0xC1611DAB, 0xB6662D3D,
 
0x76DC4190, 0x01DB7106, 0x98D220BC, 0xEFD5102A,
0x71B18589, 0x06B6B51F, 0x9FBFE4A5, 0xE8B8D433,
0x7807C9A2, 0x0F00F934, 0x9609A88E, 0xE10E9818,
0x7F6A0DBB, 0x086D3D2D, 0x91646C97, 0xE6635C01,
0x6B6B51F4, 0x1C6C6162, 0x856530D8, 0xF262004E,
0x6C0695ED, 0x1B01A57B, 0x8208F4C1, 0xF50FC457,
0x65B0D9C6, 0x12B7E950, 0x8BBEB8EA, 0xFCB9887C,
0x62DD1DDF, 0x15DA2D49, 0x8CD37CF3, 0xFBD44C65,
0x4DB26158, 0x3AB551CE, 0xA3BC0074, 0xD4BB30E2,
0x4ADFA541, 0x3DD895D7, 0xA4D1C46D, 0xD3D6F4FB,
0x4369E96A, 0x346ED9FC, 0xAD678846, 0xDA60B8D0,
0x44042D73, 0x33031DE5, 0xAA0A4C5F, 0xDD0D7CC9,
0x5005713C, 0x270241AA, 0xBE0B1010, 0xC90C2086,
0x5768B525, 0x206F85B3, 0xB966D409, 0xCE61E49F,
0x5EDEF90E, 0x29D9C998, 0xB0D09822, 0xC7D7A8B4,
0x59B33D17, 0x2EB40D81, 0xB7BD5C3B, 0xC0BA6CAD,
 
0xEDB88320, 0x9ABFB3B6, 0x03B6E20C, 0x74B1D29A,
0xEAD54739, 0x9DD277AF, 0x04DB2615, 0x73DC1683,
0xE3630B12, 0x94643B84, 0x0D6D6A3E, 0x7A6A5AA8,
0xE40ECF0B, 0x9309FF9D, 0x0A00AE27, 0x7D079EB1,
0xF00F9344, 0x8708A3D2, 0x1E01F268, 0x6906C2FE,
0xF762575D, 0x806567CB, 0x196C3671, 0x6E6B06E7,
0xFED41B76, 0x89D32BE0, 0x10DA7A5A, 0x67DD4ACC,
0xF9B9DF6F, 0x8EBEEFF9, 0x17B7BE43, 0x60B08ED5,
0xD6D6A3E8, 0xA1D1937E, 0x38D8C2C4, 0x4FDFF252,
0xD1BB67F1, 0xA6BC5767, 0x3FB506DD, 0x48B2364B,
0xD80D2BDA, 0xAF0A1B4C, 0x36034AF6, 0x41047A60,
0xDF60EFC3, 0xA867DF55, 0x316E8EEF, 0x4669BE79,
0xCB61B38C, 0xBC66831A, 0x256FD2A0, 0x5268E236,
0xCC0C7795, 0xBB0B4703, 0x220216B9, 0x5505262F,
0xC5BA3BBE, 0xB2BD0B28, 0x2BB45A92, 0x5CB36A04,
0xC2D7FFA7, 0xB5D0CF31, 0x2CD99E8B, 0x5BDEAE1D,
 
0x9B64C2B0, 0xEC63F226, 0x756AA39C, 0x026D930A,
0x9C0906A9, 0xEB0E363F, 0x72076785, 0x05005713,
0x95BF4A82, 0xE2B87A14, 0x7BB12BAE, 0x0CB61B38,
0x92D28E9B, 0xE5D5BE0D, 0x7CDCEFB7, 0x0BDBDF21,
0x86D3D2D4, 0xF1D4E242, 0x68DDB3F8, 0x1FDA836E,
0x81BE16CD, 0xF6B9265B, 0x6FB077E1, 0x18B74777,
0x88085AE6, 0xFF0F6A70, 0x66063BCA, 0x11010B5C,
0x8F659EFF, 0xF862AE69, 0x616BFFD3, 0x166CCF45,
0xA00AE278, 0xD70DD2EE, 0x4E048354, 0x3903B3C2,
0xA7672661, 0xD06016F7, 0x4969474D, 0x3E6E77DB,
0xAED16A4A, 0xD9D65ADC, 0x40DF0B66, 0x37D83BF0,
0xA9BCAE53, 0xDEBB9EC5, 0x47B2CF7F, 0x30B5FFE9,
0xBDBDF21C, 0xCABAC28A, 0x53B39330, 0x24B4A3A6,
0xBAD03605, 0xCDD70693, 0x54DE5729, 0x23D967BF,
0xB3667A2E, 0xC4614AB8, 0x5D681B02, 0x2A6F2B94,
0xB40BBE37, 0xC30C8EA1, 0x5A05DF1B, 0x2D02EF8D,
 
};
 
/* One table-driven step: fold one byte into the (reflected) CRC-32. */
inline static unsigned calc_crc32(unsigned char b, unsigned crc32)
{
    return (crc32 >> 8) ^ crc32_table[b ^ (crc32 & 0x000000FF)];
}
 
/* CRC-32 of a NUL-terminated string. */
unsigned vmp_crc32str(const char* str)
{
    unsigned crc32_tmp = 0xFFFFFFFF;

    while(*str)
    {
        crc32_tmp = calc_crc32(*str++, crc32_tmp);
    }

    crc32_tmp = ~crc32_tmp;

    return crc32_tmp;
}
 
/* CRC-32 of an arbitrary memory block. */
unsigned vmp_crc32(const void* src, size_t size)
{
    unsigned crc32_tmp = 0xFFFFFFFF;
    const unsigned char* p = (const unsigned char*)src; /* no arithmetic on void* */
    size_t i;
    for(i = 0; i < size; ++i)
    {
        crc32_tmp = calc_crc32(p[i], crc32_tmp);
    }
    crc32_tmp = ~crc32_tmp;

    return crc32_tmp;
}

/*=============================================================================
* History:
*
* $Log$
* Revision 1.2 2004/04/20 20:08:46 bmaxa
* added vmp_crc32 void version
*
* Revision 1.1 2004/04/20 19:31:53 bmaxa
* crc32 added
*
*
*===========================================================================*/
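
A minimal driver for the file above (hypothetical: it assumes the
functions are compiled as C++ or declared extern "C"; 0xCBF43926 is the
published CRC-32 check value for "123456789"):

#include <cstdio>
#include <cstddef>

unsigned vmp_crc32str( const char *str );
unsigned vmp_crc32( const void *src, std::size_t size );

int main()
{
    // both calls should print cbf43926
    std::printf( "%08x\n", vmp_crc32str( "123456789" ) );
    std::printf( "%08x\n", vmp_crc32( "123456789", 9 ) );
}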
 
 
7-77-777
Evil Sinner!
with software, you repeat same experiment, expecting different results...
Bonita Montero <Bonita.Montero@gmail.com>: Oct 11 08:09AM +0200

Am 10.10.2021 um 21:36 schrieb Chris M. Thomasson:
>> crc64 blocked:   : 2.39144 GB/s 400%
 
> Btw, if you can come up with a really fast SHA2 impl, I would be
> interested because of my experimental HMAC cipher. I have a C version:
 
SHA* is completely different and can't be improved the way I did it here.
Bonita Montero <Bonita.Montero@gmail.com>: Oct 11 08:15AM +0200

Am 11.10.2021 um 07:23 schrieb Branimir Maksimovic:
>> crc64: : 0.478093 GB/s
>> crc64 blocked: : 2.39144 GB/s 400%
 
> Proprietary?
 
Of course - but it has the same uniform distribution.
Bonita Montero <Bonita.Montero@gmail.com>: Oct 11 08:27AM +0200

Am 11.10.2021 um 07:23 schrieb Branimir Maksimovic:
>> crc64: : 0.478093 GB/s
>> crc64 blocked: : 2.39144 GB/s 400%
 
> Proprietary?
 
That's the improved "CRC64":
 
#include "crc64.h"
 
using namespace std;
 
uint64_t CRC64_ECMA182::operator ()( void const *p, size_t n,
                                     uint64_t startCrc ) const
{
    // plain byte-at-a-time, table-driven CRC-64
    uint64_t crc = startCrc;
    uint8_t const *s = (uint8_t const *)p,
                  *end = s + n;
    size_t t;
    for( ; s != end; ++s )
        t = (size_t)(crc >> 56) ^ *s,
        crc = table.t[t] ^ (crc << 8);
    return crc;
}
 
uint64_t CRC64_ECMA182::blocked( void const *p, size_t n,
                                 uint64_t startCrc ) const
{
    auto crc64_8x8 = []( uint8_t const *s ) -> uint64_t
    {
        // eight independent lane CRCs, one per 8-byte column of the block
        uint64_t crcs[8] =
        {
            table.t[s[ 0]],
            table.t[s[ 8]],
            table.t[s[16]],
            table.t[s[24]],
            table.t[s[32]],
            table.t[s[40]],
            table.t[s[48]],
            table.t[s[56]]
        };
        size_t t;
        uint8_t const *end = ++s + 7;
        do
            t = (size_t)(crcs[0] >> 56) ^ s[ 0],
            crcs[0] = table.t[t] ^ (crcs[0] << 8),
            t = (size_t)(crcs[1] >> 56) ^ s[ 8],
            crcs[1] = table.t[t] ^ (crcs[1] << 8),
            t = (size_t)(crcs[2] >> 56) ^ s[16],
            crcs[2] = table.t[t] ^ (crcs[2] << 8),
            t = (size_t)(crcs[3] >> 56) ^ s[24],
            crcs[3] = table.t[t] ^ (crcs[3] << 8),
            t = (size_t)(crcs[4] >> 56) ^ s[32],
            crcs[4] = table.t[t] ^ (crcs[4] << 8),
            t = (size_t)(crcs[5] >> 56) ^ s[40],
            crcs[5] = table.t[t] ^ (crcs[5] << 8),
            t = (size_t)(crcs[6] >> 56) ^ s[48],
            crcs[6] = table.t[t] ^ (crcs[6] << 8),
            t = (size_t)(crcs[7] >> 56) ^ s[56],
            crcs[7] = table.t[t] ^ (crcs[7] << 8);
        while( ++s != end );
        uint64_t crc = 0;
        for( size_t i = 0; i != 8; ++i )
            crc ^= crcs[i];
        return crc;
    };
    uint8_t const *s = (uint8_t const *)p;
    uint64_t crc = startCrc;
    for( uint8_t const *end = s + (n & -64); s != end; s += 64 )
        crc ^= crc64_8x8( s );
    crc ^= (*this)( s, n % 64, 0 );
    return crc;
}
 
CRC64_ECMA182::crc64_table::crc64_table()
{
    // build the 256-entry lookup table for the ECMA-182 polynomial
    uint64_t const CRC64_ECMA182_POLY = 0x42F0E1EBA9EA3693u;
    for( uint64_t i = 0; i != 256; ++i )
    {
        uint64_t crc = 0,
                 c = i << 56;
        for( unsigned j = 0; j != 8; ++j )
            crc = (int64_t)(crc ^ c) < 0 ? (crc << 1) ^ CRC64_ECMA182_POLY
                                         : crc << 1,
            c <<= 1;
        t[(size_t)i] = crc;
    }
}
 
CRC64_ECMA182::crc64_table CRC64_ECMA182::table;
 
Why does it run faster ?
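
A plausible answer, which the wc post further down spells out: the eight
independent lane CRCs break the byte-serial dependency chain, so an
out-of-order CPU can overlap the table lookups. The same effect sketched
with plain sums instead of CRCs (illustrative only, not taken from the
code above):

#include <cstddef>
#include <cstdint>

// One serial chain: every add depends on the previous one.
uint64_t sum1( uint64_t const *p, size_t n )
{
    uint64_t s = 0;
    for( size_t i = 0; i != n; ++i )
        s += p[i];
    return s;
}

// Four independent chains, like the eight crcs[] lanes above;
// assumes n % 4 == 0 to keep the sketch short.
uint64_t sum4( uint64_t const *p, size_t n )
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for( size_t i = 0; i != n; i += 4 )
    {
        s0 += p[i];
        s1 += p[i + 1];
        s2 += p[i + 2];
        s3 += p[i + 3];
    }
    return s0 + s1 + s2 + s3;
}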
Branimir Maksimovic <branimir.maksimovic@icloud.com>: Oct 11 06:29AM

>>> crc64 blocked: : 2.39144 GB/s 400%
 
>> Proprietary?
 
> Of course - but the same equal distribution.
 
Dunno, those hashing algos work much better if there is support
from hardware. But of course, faster is better :P
 
--
 
7-77-777
Evil Sinner!
with software, you repeat same experiment, expecting different results...
Branimir Maksimovic <branimir.maksimovic@icloud.com>: Oct 11 06:32AM

> }
 
> CRC64_ECMA182::crc64_table CRC64_ECMA182::table;
 
> Why does it run faster ?
Dunno, haven't had a need to calculate crc64 yet :P
Better to use a pregenerated table, don't waste time on generation :P
 
 
--
 
7-77-777
Evil Sinner!
with software, you repeat same experiment, expecting different results...
Bonita Montero <Bonita.Montero@gmail.com>: Oct 11 08:33AM +0200

Am 11.10.2021 um 08:32 schrieb Branimir Maksimovic:
 
 
>> Why does it run faster ?
> Dunno, haven't have need to calculate crc64 yet :P
> Better to generate table, don't waste time on generation :P
 
Eeeh, I'm also using a table, as you can see above.
Bonita Montero <Bonita.Montero@gmail.com>: Oct 11 08:34AM +0200

Am 11.10.2021 um 08:29 schrieb Branimir Maksimovic:
 
>> Of course - but the same equal distribution.
 
> Dunno, those hashing algos work much better if there is
> support from hardware. But, of course faster, better :P
 
There are special SSE instructions only for CRC32, but not for CRC64.
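
For reference: that SSE4.2 instruction is hard-wired to the CRC-32C
(Castagnoli) polynomial, so it doesn't even compute the classic IEEE
CRC-32 of the table code earlier in the thread. A minimal sketch of
using it (hypothetical helper; needs SSE4.2, e.g. compile with -msse4.2):

#include <nmmintrin.h>   // _mm_crc32_u8 / _mm_crc32_u64
#include <cstdint>
#include <cstddef>
#include <cstring>

uint32_t crc32c_hw( void const *p, size_t n )
{
    uint8_t const *s = (uint8_t const *)p;
    uint64_t c = 0xFFFFFFFFu;
    for( ; n >= 8; n -= 8, s += 8 )
    {
        uint64_t v;
        std::memcpy( &v, s, 8 );            // unaligned-safe load
        c = _mm_crc32_u64( c, v );          // eight bytes per instruction
    }
    for( ; n; --n, ++s )
        c = _mm_crc32_u8( (uint32_t)c, *s );
    // crc32c_hw( "123456789", 9 ) should give 0xE3069283,
    // the published CRC-32C check value.
    return (uint32_t)c ^ 0xFFFFFFFFu;
}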
Branimir Maksimovic <branimir.maksimovic@icloud.com>: Oct 11 07:46AM

>> Dunno, haven't have need to calculate crc64 yet :P
>> Better to generate table, don't waste time on generation :P
 
> Eeeh, I'm using also a table as you can see from above.
 
What do you think about following:
https://github.com/intel/isa-l/tree/master/crc
 
--
 
7-77-777
Evil Sinner!
with software, you repeat same experiment, expecting different results...
Bonita Montero <Bonita.Montero@gmail.com>: Oct 11 10:53AM +0200

Am 11.10.2021 um 09:46 schrieb Branimir Maksimovic:
 
>> Eeeh, I'm using also a table as you can see from above.
 
> What do you think about following:
> https://github.com/intel/isa-l/tree/master/crc
 
I won't check this ASM code. And I don't know why people use ASM.
C/C++ with intrinsics usually results in better code.
Branimir Maksimovic <branimir.maksimovic@icloud.com>: Oct 11 09:11AM

>> https://github.com/intel/isa-l/tree/master/crc
 
> I won't check this ASM-code. An I don't know why people use ASM.
> C / C++ and intrinsics usually result in better code.
If you want to be a hacker you have to program in ASM
(without a debugger :) )
ASM code is always the most efficient and works as tested, without
surprises :P
C/C++ you can use after you master ASM :P
IMO :P
 
--
 
7-77-777
Evil Sinner!
with software, you repeat same experiment, expecting different results...
Bonita Montero <Bonita.Montero@gmail.com>: Oct 11 11:18AM +0200

Am 11.10.2021 um 11:11 schrieb Branimir Maksimovic:
 
> ASM code is most efficient always and works as tested without
> surprises :P
 
ASM code can be faster in rare cases when you know everything about
your OoO-CPU, but in most cases the compiler generates better code.
I've seen code from clang where you might think: no ASM programmer
knows all of these optimization tricks.
Branimir Maksimovic <branimir.maksimovic@icloud.com>: Oct 04 04:36PM

> } while( remaining );
> return block;
> }
 
Talking about efficiency :P
Who will pay you for overcomplicating simple things?
 
--
 
7-77-777
Evil Sinner!
to weak you should be meek, and you should brainfuck stronger
https://github.com/rofl0r/chaos-pp
Bonita Montero <Bonita.Montero@gmail.com>: Oct 05 07:11AM +0200

Am 04.10.2021 um 21:59 schrieb Marcel Mueller:
 
> behind a pointer can significantly decrease the probability of failed
> CAS when implementing strong thread safety. But on some platforms I
> observed excessive cachline hopping with this strategy.
 
That doesn't make sense. When you prefetch, you usually process a lot
of data before you reach the point you prefetched. With CASes you
repeatedly process the same data; prefetching there is nonsense.
Branimir Maksimovic <branimir.maksimovic@icloud.com>: Oct 05 09:55AM


> Ok, you don't understand what I do.
 
Seems so :P
 
--
 
7-77-777
Evil Sinner!
to weak you should be meek, and you should brainfuck stronger
https://github.com/rofl0r/chaos-pp
Bonita Montero <Bonita.Montero@gmail.com>: Oct 05 07:08AM +0200

Am 04.10.2021 um 22:22 schrieb Branimir Maksimovic:
> }
 
> return 0;
> }
 
Ok, you don't understand what I do.
Bonita Montero <Bonita.Montero@gmail.com>: Oct 05 01:23PM +0200

Am 05.10.2021 um 11:55 schrieb Branimir Maksimovic:
 
>> data before the point you prefetched. When you have CASes you rotatedy
>> process the same data; prefetching here is nonsense.
 
> Prefetching is nonsense in HLL :P
 
No, automatic prefetching is dumb and there are a lot of patterns
it is unable to predict. With my 3-way interleaved access I've
even shown a very simple pattern where manual prefetching helps.
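
From C/C++, manual prefetching is just a hint intrinsic. A minimal
sketch of the idea (the prefetch distance is a guess; real code has to
tune it per CPU):

#include <xmmintrin.h>   // _mm_prefetch
#include <cstdint>
#include <cstddef>

uint64_t sum_bytes_prefetched( uint8_t const *p, size_t n )
{
    size_t const PF_DIST = 512;   // bytes ahead of the current position
    uint64_t s = 0;
    for( size_t i = 0; i != n; ++i )
    {
        if( (i & 63) == 0 && i + PF_DIST < n )   // once per cache line
            _mm_prefetch( (char const *)(p + i + PF_DIST), _MM_HINT_T0 );
        s += p[i];
    }
    return s;
}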
Bonita Montero <Bonita.Montero@gmail.com>: Oct 04 04:30PM +0200

There's the Unix command wc, which counts words and lines. The
wc implementation in the current GNU core utilities contains an
optional, very tricky AVX implementation. This improves the speed
of wc on my Linux computer by a factor of 29.

I improved this algorithm further by partitioning the data into three
parts which I handle interleaved, i.e. I process a 32-byte chunk from
each part synchronously and then increment the common offset by 32.
The original algorithm, by contrast, has a dependency chain which
limits out-of-order execution. I partitioned the data into three and
not four parts because there wouldn't be enough integer registers - I
need 14 in my interleaving loop. With four parts I'd have a lot of
register spilling and reloading, which would bring performance back
down to about that of the original algorithm.

The speedup of the interleaved code over the original wc algorithm
is about 60%.

The reason I mention this is that I benchmarked the code under
different conditions. I also added prefetching and partitioning to
the trivial algorithm, and either can be switched on, just as for the
AVX code. These are the results on my Linux Ryzen 7 1800X:
 
trivial / non-interleaved / non-prefetched  1 thread:   468 MB/s
trivial / non-interleaved / prefetched      1 thread:   492 MB/s
trivial / interleaved     / non-prefetched  1 thread:   778 MB/s
trivial / interleaved     / prefetched      1 thread:   694 MB/s
AVX     / non-interleaved / non-prefetched  1 thread: 13731 MB/s
AVX     / non-interleaved / prefetched      1 thread: 13757 MB/s
AVX     / interleaved     / non-prefetched  1 thread: 19722 MB/s
AVX     / interleaved     / prefetched      1 thread: 23558 MB/s
 
As you can see, manual prefetching gives only a small gain for the
trivial non-interleaved code, and the trivial interleaved code even
gets slower when prefetching is added. But for the AVX code there's a
significant speedup of the interleaved / prefetched variant over the
interleaved / non-prefetched one. So there are cases where prefetching
gives a significant speedup.
With interleaving there are more complex memory access patterns, and I
suspect that the prefetcher doesn't work that well under such
conditions.
 
If you're interested in the code: the relevant functions are the
lambdas trivialSpaceCount and avxSpaceCount. The code compiles only
as C++20.
 
#if defined(_MSC_VER)
#include <Windows.h>
#elif defined(__unix__)
#include <pthread.h>
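
The posted source breaks off here in the digest. As a rough
illustration of the 32-bytes-at-a-time comparison described above - not
Bonita's actual code - a minimal AVX2 newline counter (C++20 for
std::popcount, compile with -mavx2):

#include <immintrin.h>
#include <bit>
#include <cstdint>
#include <cstddef>

size_t count_newlines_avx2( char const *p, size_t n )
{
    __m256i const nl = _mm256_set1_epi8( '\n' );
    size_t lines = 0, i = 0;
    for( ; i + 32 <= n; i += 32 )
    {
        // compare 32 bytes at once, extract one match-bit per byte
        __m256i v = _mm256_loadu_si256( (__m256i const *)(p + i) );
        uint32_t m = (uint32_t)_mm256_movemask_epi8( _mm256_cmpeq_epi8( v, nl ) );
        lines += (size_t)std::popcount( m );
    }
    for( ; i != n; ++i )    // scalar tail for the last < 32 bytes
        lines += p[i] == '\n';
    return lines;
}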
