- faster "CRC"- / "FNV"-hashing - 13 Updates
- Most efficient prefetching distance - 6 Updates
- How to get mantissa of long double? - 5 Updates
- Tricky ... - 1 Update
| Branimir Maksimovic <branimir.maksimovic@icloud.com>: Oct 11 05:23AM

> fnv blocked 64: : 3.11791 GB/s 232%
> crc64: : 0.478093 GB/s
> crc64 blocked: : 2.39144 GB/s 400%

Proprietary?

--
7-77-777 Evil Sinner!
with software, you repeat same experiment, expecting different results... |
| Branimir Maksimovic <branimir.maksimovic@icloud.com>: Oct 11 05:41AM

> fnv blocked 64: : 3.11791 GB/s 232%
> crc64: : 0.478093 GB/s
> crc64 blocked: : 2.39144 GB/s 400%

Here is mine:

/*=============================================================================
 * Copyright    : (c) 2004 by Dzenis Softic / http://www.dzeni.com
 * Filename     : vmp_crc32.c
 * Description  : Calculates crc32 of string
 * Company      : Seenetix D.O.O.
 * Authors      : B. Maksimovic
 * $Id$
 *===========================================================================*/
#include <stddef.h>

static unsigned crc32_table[] = {
    0x00000000, 0x77073096, 0xEE0E612C, 0x990951BA, 0x076DC419, 0x706AF48F, 0xE963A535, 0x9E6495A3,
    0x0EDB8832, 0x79DCB8A4, 0xE0D5E91E, 0x97D2D988, 0x09B64C2B, 0x7EB17CBD, 0xE7B82D07, 0x90BF1D91,
    0x1DB71064, 0x6AB020F2, 0xF3B97148, 0x84BE41DE, 0x1ADAD47D, 0x6DDDE4EB, 0xF4D4B551, 0x83D385C7,
    0x136C9856, 0x646BA8C0, 0xFD62F97A, 0x8A65C9EC, 0x14015C4F, 0x63066CD9, 0xFA0F3D63, 0x8D080DF5,
    0x3B6E20C8, 0x4C69105E, 0xD56041E4, 0xA2677172, 0x3C03E4D1, 0x4B04D447, 0xD20D85FD, 0xA50AB56B,
    0x35B5A8FA, 0x42B2986C, 0xDBBBC9D6, 0xACBCF940, 0x32D86CE3, 0x45DF5C75, 0xDCD60DCF, 0xABD13D59,
    0x26D930AC, 0x51DE003A, 0xC8D75180, 0xBFD06116, 0x21B4F4B5, 0x56B3C423, 0xCFBA9599, 0xB8BDA50F,
    0x2802B89E, 0x5F058808, 0xC60CD9B2, 0xB10BE924, 0x2F6F7C87, 0x58684C11, 0xC1611DAB, 0xB6662D3D,
    0x76DC4190, 0x01DB7106, 0x98D220BC, 0xEFD5102A, 0x71B18589, 0x06B6B51F, 0x9FBFE4A5, 0xE8B8D433,
    0x7807C9A2, 0x0F00F934, 0x9609A88E, 0xE10E9818, 0x7F6A0DBB, 0x086D3D2D, 0x91646C97, 0xE6635C01,
    0x6B6B51F4, 0x1C6C6162, 0x856530D8, 0xF262004E, 0x6C0695ED, 0x1B01A57B, 0x8208F4C1, 0xF50FC457,
    0x65B0D9C6, 0x12B7E950, 0x8BBEB8EA, 0xFCB9887C, 0x62DD1DDF, 0x15DA2D49, 0x8CD37CF3, 0xFBD44C65,
    0x4DB26158, 0x3AB551CE, 0xA3BC0074, 0xD4BB30E2, 0x4ADFA541, 0x3DD895D7, 0xA4D1C46D, 0xD3D6F4FB,
    0x4369E96A, 0x346ED9FC, 0xAD678846, 0xDA60B8D0, 0x44042D73, 0x33031DE5, 0xAA0A4C5F, 0xDD0D7CC9,
    0x5005713C, 0x270241AA, 0xBE0B1010, 0xC90C2086, 0x5768B525, 0x206F85B3, 0xB966D409, 0xCE61E49F,
    0x5EDEF90E, 0x29D9C998, 0xB0D09822, 0xC7D7A8B4, 0x59B33D17, 0x2EB40D81, 0xB7BD5C3B, 0xC0BA6CAD,
    0xEDB88320, 0x9ABFB3B6, 0x03B6E20C, 0x74B1D29A, 0xEAD54739, 0x9DD277AF, 0x04DB2615, 0x73DC1683,
    0xE3630B12, 0x94643B84, 0x0D6D6A3E, 0x7A6A5AA8, 0xE40ECF0B, 0x9309FF9D, 0x0A00AE27, 0x7D079EB1,
    0xF00F9344, 0x8708A3D2, 0x1E01F268, 0x6906C2FE, 0xF762575D, 0x806567CB, 0x196C3671, 0x6E6B06E7,
    0xFED41B76, 0x89D32BE0, 0x10DA7A5A, 0x67DD4ACC, 0xF9B9DF6F, 0x8EBEEFF9, 0x17B7BE43, 0x60B08ED5,
    0xD6D6A3E8, 0xA1D1937E, 0x38D8C2C4, 0x4FDFF252, 0xD1BB67F1, 0xA6BC5767, 0x3FB506DD, 0x48B2364B,
    0xD80D2BDA, 0xAF0A1B4C, 0x36034AF6, 0x41047A60, 0xDF60EFC3, 0xA867DF55, 0x316E8EEF, 0x4669BE79,
    0xCB61B38C, 0xBC66831A, 0x256FD2A0, 0x5268E236, 0xCC0C7795, 0xBB0B4703, 0x220216B9, 0x5505262F,
    0xC5BA3BBE, 0xB2BD0B28, 0x2BB45A92, 0x5CB36A04, 0xC2D7FFA7, 0xB5D0CF31, 0x2CD99E8B, 0x5BDEAE1D,
    0x9B64C2B0, 0xEC63F226, 0x756AA39C, 0x026D930A, 0x9C0906A9, 0xEB0E363F, 0x72076785, 0x05005713,
    0x95BF4A82, 0xE2B87A14, 0x7BB12BAE, 0x0CB61B38, 0x92D28E9B, 0xE5D5BE0D, 0x7CDCEFB7, 0x0BDBDF21,
    0x86D3D2D4, 0xF1D4E242, 0x68DDB3F8, 0x1FDA836E, 0x81BE16CD, 0xF6B9265B, 0x6FB077E1, 0x18B74777,
    0x88085AE6, 0xFF0F6A70, 0x66063BCA, 0x11010B5C, 0x8F659EFF, 0xF862AE69, 0x616BFFD3, 0x166CCF45,
    0xA00AE278, 0xD70DD2EE, 0x4E048354, 0x3903B3C2, 0xA7672661, 0xD06016F7, 0x4969474D, 0x3E6E77DB,
    0xAED16A4A, 0xD9D65ADC, 0x40DF0B66, 0x37D83BF0, 0xA9BCAE53, 0xDEBB9EC5, 0x47B2CF7F, 0x30B5FFE9,
    0xBDBDF21C, 0xCABAC28A, 0x53B39330, 0x24B4A3A6, 0xBAD03605, 0xCDD70693, 0x54DE5729, 0x23D967BF,
    0xB3667A2E, 0xC4614AB8, 0x5D681B02, 0x2A6F2B94, 0xB40BBE37, 0xC30C8EA1, 0x5A05DF1B, 0x2D02EF8D,
};

/* shift the CRC one byte and fold in the next input byte via the table */
inline static unsigned calc_crc32(unsigned char b, unsigned crc32)
{
    return (crc32 >> 8) ^ crc32_table[b ^ (crc32 & 0x000000FF)];
}

unsigned vmp_crc32str(const char* str)
{
    unsigned crc32_tmp = 0xFFFFFFFF;
    while(*str) {
        crc32_tmp = calc_crc32(*str++, crc32_tmp);
    }
    crc32_tmp = ~crc32_tmp;
    return crc32_tmp;
}

unsigned vmp_crc32(const void* src, size_t size)
{
    /* arithmetic on a const void* is non-standard; walk an unsigned char*
       instead, and use size_t for the index to match the size parameter */
    const unsigned char* s = (const unsigned char*)src;
    unsigned crc32_tmp = 0xFFFFFFFF;
    size_t i;
    for(i = 0; i < size; ++i) {
        crc32_tmp = calc_crc32(s[i], crc32_tmp);
    }
    crc32_tmp = ~crc32_tmp;
    return crc32_tmp;
}

/*=============================================================================
 * History:
 *
 * $Log$
 * Revision 1.2  2004/04/20 20:08:46  bmaxa
 * added vmp_crc32 void version
 *
 * Revision 1.1  2004/04/20 19:31:53  bmaxa
 * crc32 added
 *
 *===========================================================================*/

--
7-77-777 Evil Sinner!
with software, you repeat same experiment, expecting different results... |
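For reference - the thread benchmarks FNV against CRC, but no FNV source is posted. Below is a minimal sketch of plain single-stream FNV-1a 64-bit with the standard offset basis and prime; the "blocked" variant from the numbers above is not shown in the thread, so this is only the textbook form:

#include <cstdint>
#include <cstddef>

// Textbook FNV-1a, 64-bit: XOR in one byte, then multiply by the FNV prime.
// One multiply per byte forms a single long dependency chain, which is why
// a blocked/interleaved variant can run so much faster.
uint64_t fnv1a_64( void const *p, size_t n )
{
    uint64_t h = 0xCBF29CE484222325u;    // FNV-1a 64-bit offset basis
    uint8_t const *s = (uint8_t const *)p;
    for( size_t i = 0; i != n; ++i )
    {
        h ^= s[i];
        h *= 0x100000001B3u;             // FNV 64-bit prime
    }
    return h;
}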
| Bonita Montero <Bonita.Montero@gmail.com>: Oct 11 08:09AM +0200

On 10.10.2021 at 21:36, Chris M. Thomasson wrote:
>> crc64 blocked: : 2.39144 GB/s 400%
> Btw, if you can come up with a really fast SHA2 impl, I would be
> interested because of my experimental HMAC cipher. I have a C version:

SHA* is completely different and can't be improved the way I did it. |
| Bonita Montero <Bonita.Montero@gmail.com>: Oct 11 08:15AM +0200

On 11.10.2021 at 07:23, Branimir Maksimovic wrote:
>> crc64: : 0.478093 GB/s
>> crc64 blocked: : 2.39144 GB/s 400%
> Proprietary?

Of course - but the same equal distribution. |
| Bonita Montero <Bonita.Montero@gmail.com>: Oct 11 08:27AM +0200

On 11.10.2021 at 07:23, Branimir Maksimovic wrote:
>> crc64: : 0.478093 GB/s
>> crc64 blocked: : 2.39144 GB/s 400%
> Proprietary?

That's the improved "CRC64":

#include "crc64.h"

using namespace std;

uint64_t CRC64_ECMA182::operator ()( void const *p, size_t n, uint64_t startCrc ) const
{
    uint64_t crc = startCrc;
    uint8_t const *s = (uint8_t *)p, *end = s + n;
    size_t t;
    // byte-at-a-time CRC: one table lookup per byte, serial dependency chain
    for( ; s != end; ++s )
        t = (size_t)(crc >> 56) ^ *s,
        crc = table.t[t] ^ (crc << 8);
    return crc;
}

uint64_t CRC64_ECMA182::blocked( void const *p, size_t n, uint64_t startCrc ) const
{
    // processes a 64-byte block as 8 interleaved byte-streams so the lookups
    // of the 8 partial CRCs can overlap in the out-of-order core
    auto crc64_8x8 = []( uint8_t const *s ) -> uint64_t
    {
        uint64_t crcs[8] =
        {
            table.t[s[ 0]], table.t[s[ 8]], table.t[s[16]], table.t[s[24]],
            table.t[s[32]], table.t[s[40]], table.t[s[48]], table.t[s[56]]
        };
        size_t t;
        uint8_t const *end = ++s + 7;
        do
            t = (size_t)(crcs[0] >> 56) ^ s[ 0], crcs[0] = table.t[t] ^ (crcs[0] << 8),
            t = (size_t)(crcs[1] >> 56) ^ s[ 8], crcs[1] = table.t[t] ^ (crcs[1] << 8),
            t = (size_t)(crcs[2] >> 56) ^ s[16], crcs[2] = table.t[t] ^ (crcs[2] << 8),
            t = (size_t)(crcs[3] >> 56) ^ s[24], crcs[3] = table.t[t] ^ (crcs[3] << 8),
            t = (size_t)(crcs[4] >> 56) ^ s[32], crcs[4] = table.t[t] ^ (crcs[4] << 8),
            t = (size_t)(crcs[5] >> 56) ^ s[40], crcs[5] = table.t[t] ^ (crcs[5] << 8),
            t = (size_t)(crcs[6] >> 56) ^ s[48], crcs[6] = table.t[t] ^ (crcs[6] << 8),
            t = (size_t)(crcs[7] >> 56) ^ s[56], crcs[7] = table.t[t] ^ (crcs[7] << 8);
        while( ++s != end );
        uint64_t crc = 0;
        for( size_t i = 0; i != 8; ++i )
            crc ^= crcs[i];
        return crc;
    };
    uint8_t const *s = (uint8_t *)p;
    uint64_t crc = startCrc;
    for( uint8_t const *end = s + (n & -64); s != end; s += 64 )
        crc ^= crc64_8x8( s );
    crc ^= (*this)( s, n % 64, 0 );   // serial tail for the last n % 64 bytes
    return crc;
}

CRC64_ECMA182::crc64_table::crc64_table()
{
    // builds the 256-entry lookup table bit by bit from the ECMA-182 polynomial
    uint64_t const CRC64_ECMA182_POLY = 0x42F0E1EBA9EA3693u;
    for( uint64_t i = 0; i != 256; ++i )
    {
        uint64_t crc = 0, c = i << 56;
        for( unsigned j = 0; j != 8; ++j )
            crc = (int64_t)(crc ^ c) < 0 ? (crc << 1) ^ CRC64_ECMA182_POLY : crc << 1,
            c <<= 1;
        t[(size_t)i] = crc;
    }
}

CRC64_ECMA182::crc64_table CRC64_ECMA182::table;

Why does it run faster? |
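A hedged usage sketch of the class above, assuming crc64.h declares CRC64_ECMA182 with exactly the members used in these definitions (the header itself isn't posted in the thread):

#include <cstdio>
#include <cstdint>
#include <vector>
#include "crc64.h"   // assumed header declaring CRC64_ECMA182 as used above

int main()
{
    std::vector<unsigned char> data( 1u << 20, 0xA5 );   // 1 MiB test buffer
    CRC64_ECMA182 crc;
    // Serial variant: one lookup per byte, each step depends on the last.
    uint64_t serial  = crc( data.data(), data.size(), 0 );
    // Blocked variant: 8 interleaved streams folded together with XOR - a
    // hash with CRC-like distribution, not the standard ECMA-182 CRC value.
    uint64_t blocked = crc.blocked( data.data(), data.size(), 0 );
    std::printf( "%016llx %016llx\n",
                 (unsigned long long)serial, (unsigned long long)blocked );
}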
| Branimir Maksimovic <branimir.maksimovic@icloud.com>: Oct 11 06:29AM

>>> crc64 blocked: : 2.39144 GB/s 400%
>> Proprietary?
> Of course - but the same equal distribution.

Dunno, those hashing algos work much better if there is support from hardware. But of course, faster is better :P

--
7-77-777 Evil Sinner!
with software, you repeat same experiment, expecting different results... |
| Branimir Maksimovic <branimir.maksimovic@icloud.com>: Oct 11 06:32AM

> }
> CRC64_ECMA182::crc64_table CRC64_ECMA182::table;
> Why does it run faster?

Dunno, I haven't had the need to calculate crc64 yet :P Better to use a pregenerated table - don't waste time on generation :P

--
7-77-777 Evil Sinner!
with software, you repeat same experiment, expecting different results... |
| Bonita Montero <Bonita.Montero@gmail.com>: Oct 11 08:33AM +0200

On 11.10.2021 at 08:32, Branimir Maksimovic wrote:
>> Why does it run faster?
> Dunno, I haven't had the need to calculate crc64 yet :P
> Better to use a pregenerated table - don't waste time on generation :P

Eeeh, I'm also using a table, as you can see above. |
| Bonita Montero <Bonita.Montero@gmail.com>: Oct 11 08:34AM +0200

On 11.10.2021 at 08:29, Branimir Maksimovic wrote:
>> Of course - but the same equal distribution.
> Dunno, those hashing algos work much better if there is
> support from hardware. But of course, faster is better :P

There are only special SSE instructions for CRC32, but not for CRC64. |
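For the CRC32 side, a minimal sketch of that hardware path using the SSE4.2 intrinsics. Note that _mm_crc32_u64 computes CRC-32C (the Castagnoli polynomial), not the zlib/IEEE CRC-32 of the table code earlier in the thread, so the two produce different values:

#include <cstdint>
#include <cstddef>
#include <cstring>
#include <nmmintrin.h>   // SSE4.2: _mm_crc32_u8 / _mm_crc32_u64

// Hardware CRC-32C; requires SSE4.2 (compile with -msse4.2 on gcc/clang).
uint32_t crc32c_hw( void const *p, size_t n )
{
    uint8_t const *s = (uint8_t const *)p;
    uint64_t c = 0xFFFFFFFFu;                 // standard CRC-32C initial value
    for( ; n >= 8; n -= 8, s += 8 )
    {
        uint64_t chunk;
        std::memcpy( &chunk, s, 8 );          // unaligned-safe 8-byte load
        c = _mm_crc32_u64( c, chunk );
    }
    for( ; n; --n, ++s )                      // byte-at-a-time tail
        c = _mm_crc32_u8( (uint32_t)c, *s );
    return ~(uint32_t)c;                      // final inversion
}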
| Branimir Maksimovic <branimir.maksimovic@icloud.com>: Oct 11 07:46AM

>> Dunno, I haven't had the need to calculate crc64 yet :P
>> Better to use a pregenerated table - don't waste time on generation :P
> Eeeh, I'm also using a table, as you can see above.

What do you think about the following:
https://github.com/intel/isa-l/tree/master/crc

--
7-77-777 Evil Sinner!
with software, you repeat same experiment, expecting different results... |
| Bonita Montero <Bonita.Montero@gmail.com>: Oct 11 10:53AM +0200

On 11.10.2021 at 09:46, Branimir Maksimovic wrote:
>> Eeeh, I'm also using a table, as you can see above.
> What do you think about the following:
> https://github.com/intel/isa-l/tree/master/crc

I won't check this ASM code. And I don't know why people use ASM. C / C++ and intrinsics usually result in better code. |
| Branimir Maksimovic <branimir.maksimovic@icloud.com>: Oct 11 09:11AM

>> https://github.com/intel/isa-l/tree/master/crc
> I won't check this ASM code. And I don't know why people use ASM.
> C / C++ and intrinsics usually result in better code.

If you want to be a hacker you have to program in ASM (without a debugger :) ). ASM code is always the most efficient and works as tested, without surprises :P C/C++ you can use after you master ASM :P IMO :P

--
7-77-777 Evil Sinner!
with software, you repeat same experiment, expecting different results... |
| Bonita Montero <Bonita.Montero@gmail.com>: Oct 11 11:18AM +0200

On 11.10.2021 at 11:11, Branimir Maksimovic wrote:
> ASM code is always the most efficient and works as tested,
> without surprises :P

ASM code can be faster in rare cases when you know everything about your OoO CPU, but in most cases the compiler generates better code. I've seen code from clang where you might think: no ASM programmer would know all of these optimization tricks. |
| Branimir Maksimovic <branimir.maksimovic@icloud.com>: Oct 04 04:36PM

> } while( remaining );
> return block;
> }

Talking about efficiency :P Who will pay you for overcomplicating simple things?

--
7-77-777 Evil Sinner!
to weak you should be meek, and you should brainfuck stronger
https://github.com/rofl0r/chaos-pp |
| Bonita Montero <Bonita.Montero@gmail.com>: Oct 05 07:11AM +0200

On 04.10.2021 at 21:59, Marcel Mueller wrote:
> behind a pointer can significantly decrease the probability of failed
> CAS when implementing strong thread safety. But on some platforms I
> observed excessive cacheline hopping with this strategy.

That doesn't make sense. When you prefetch you usually process a lot of data before the point you prefetched. When you have CASes you repeatedly process the same data; prefetching here is nonsense. |
| Branimir Maksimovic <branimir.maksimovic@icloud.com>: Oct 05 09:55AM

> Ok, you don't understand what I do.

Seems so :P

--
7-77-777 Evil Sinner!
to weak you should be meek, and you should brainfuck stronger
https://github.com/rofl0r/chaos-pp |
| Bonita Montero <Bonita.Montero@gmail.com>: Oct 05 07:08AM +0200

On 04.10.2021 at 22:22, Branimir Maksimovic wrote:
> }
> return 0;
> }

Ok, you don't understand what I do. |
| Bonita Montero <Bonita.Montero@gmail.com>: Oct 05 01:23PM +0200

On 05.10.2021 at 11:55, Branimir Maksimovic wrote:
>> data before the point you prefetched. When you have CASes you repeatedly
>> process the same data; prefetching here is nonsense.
> Prefetching is nonsense in an HLL :P

No, automatic prefetching is dumb and there are a lot of patterns it can't predict. With my 3-way interleaved access I've even shown a very simple pattern where manual prefetching helps. |
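A minimal sketch of the kind of manual prefetching meant here, with a hypothetical 3-way interleaved kernel and an assumed prefetch distance - the right distance is CPU-specific and is exactly what the thread title asks about:

#include <cstdint>
#include <cstddef>
#include <xmmintrin.h>   // _mm_prefetch / _MM_HINT_T0

// Hypothetical 3-way interleaved byte sum with manual prefetching.
// PF_DIST is a placeholder value; on x86, prefetching past the end of a
// buffer is harmless because prefetch instructions never fault.
uint64_t sum3Interleaved( uint8_t const *a, uint8_t const *b,
                          uint8_t const *c, size_t n )
{
    size_t const PF_DIST = 512;    // assumed prefetch distance in bytes
    uint64_t s = 0;
    for( size_t i = 0; i != n; ++i )
    {
        if( (i & 63) == 0 )        // one prefetch per 64-byte cache line
        {
            _mm_prefetch( (char const *)(a + i + PF_DIST), _MM_HINT_T0 );
            _mm_prefetch( (char const *)(b + i + PF_DIST), _MM_HINT_T0 );
            _mm_prefetch( (char const *)(c + i + PF_DIST), _MM_HINT_T0 );
        }
        s += a[i] + b[i] + c[i];   // three independent access streams
    }
    return s;
}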
| Bonita Montero <Bonita.Montero@gmail.com>: Oct 04 04:30PM +0200

There's the Unix command wc, which counts words and lines, and the wc implementation in the current GNU core utilities contains an optional, very tricky AVX implementation. This improves the speed of wc on my Linux computer by a factor of 29.

I improved this algorithm further by partitioning the data into three parts which I handle interleaved, i.e. 32-byte chunks synchronously from each part, and then I increment the common offset by 32 - whereas the original algorithm has a dependency chain which limits out-of-order execution. I partitioned the data into three and not four parts because there wouldn't be enough integer registers - I need 14 in my interleaving loop. With four parts I'd have a lot of register spilling and reloading, which gives performance similar to the original algorithm. The speedup of the interleaved code over the original wc algorithm is about 60%.

The reason why I tell this is that I benchmarked the code under different conditions. I also improved the trivial algorithm with prefetching and partitioning, and either could be switched on, like in the AVX code. These are the results on my Linux Ryzen 7 1800X:

trivial / non-interleaved / non-prefetched 1 thread:   468MB/s
trivial / non-interleaved / prefetched     1 thread:   492MB/s
trivial / interleaved     / non-prefetched 1 thread:   778MB/s
trivial / interleaved     / prefetched     1 thread:   694MB/s
AVX     / non-interleaved / non-prefetched 1 thread: 13731MB/s
AVX     / non-interleaved / prefetched     1 thread: 13757MB/s
AVX     / interleaved     / non-prefetched 1 thread: 19722MB/s
AVX     / interleaved     / prefetched     1 thread: 23558MB/s

As you can see, manual prefetching gives only a little gain for the trivial non-interleaved code, and the trivial interleaved / prefetched code even drops below the trivial interleaved / non-prefetched code. But for the AVX code there's a significant speedup of the interleaved / prefetched code over the interleaved / non-prefetched code. So there are cases where prefetching gives a significant speedup. With interleaving there are more complex memory access patterns, and I suspect that the prefetcher doesn't work as well under such conditions.

If you're interested in the code: the relevant functions are the lambdas trivialSpaceCount and avxSpaceCount. The code is compilable only with C++20.

#if defined(_MSC_VER)
    #include <Windows.h>
#elif defined(__unix__)
    #include <pthread.h>
|
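The posted listing breaks off above. As a rough illustration only, here is a minimal single-stream sketch of what an avxSpaceCount-style kernel does with AVX2 intrinsics - hypothetical, not Bonita's interleaved/prefetched version, and counting only the plain space character:

#include <cstdint>
#include <cstddef>
#include <bit>           // std::popcount (C++20, as in the posted code)
#include <immintrin.h>   // AVX2; compile with -mavx2 on gcc/clang

// Count space characters 32 bytes at a time: compare against ' ', extract
// one mask bit per byte, then popcount the mask.
size_t avxSpaceCount( char const *p, size_t n )
{
    __m256i const SPACES = _mm256_set1_epi8( ' ' );
    size_t count = 0, i = 0;
    for( ; i + 32 <= n; i += 32 )
    {
        __m256i chunk = _mm256_loadu_si256( (__m256i const *)(p + i) );
        unsigned mask = (unsigned)_mm256_movemask_epi8(
                            _mm256_cmpeq_epi8( chunk, SPACES ) );
        count += (size_t)std::popcount( mask );
    }
    for( ; i != n; ++i )   // scalar tail for the last n % 32 bytes
        count += p[i] == ' ';
    return count;
}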