- Union type punning in C++ - 19 Updates
- uint32_t is not the same as long unsigned int ? - 4 Updates
- little contest - 2 Updates
"Öö Tiib" <ootiib@hot.ee>: Jan 03 10:37PM -0800

On Saturday, 4 January 2020 00:38:08 UTC+2, Chris M. Thomasson wrote:
> ____________________________
> Everything boils down to int types, so no special operators need to be
> defined.

Sorry, you are correct, I misread or misremembered. The problem it supposedly solves is a bit confusing to me. I am not saying we do not need it, just that I do not understand in what situation I would need it.

Bonita Montero <Bonita.Montero@gmail.com>: Jan 04 08:08AM +0100

>> That's not as easy to test as you might think.
> It is not easy, that is why I wrote tests...

You didn't test what I disassembled.

Bonita Montero <Bonita.Montero@gmail.com>: Jan 04 08:34AM +0100

>> It is not easy, that is why I wrote tests...
> You didn't test what I disassembled.

Here's a little test:

#include <iostream>
#include <cstring>
#include <cstdint>
#include <chrono>

using namespace std;
using namespace chrono;

size_t const SIZE = 200;

struct S
{
    char c[SIZE];
};

void fCopy( S *a, S *b )
{
    memcpy( a->c, b->c, SIZE );
}

int main()
{
    // calling through a volatile pointer keeps the copy from being inlined or dropped
    void (*volatile pfCopy)( S *, S * ) = fCopy;
    S a = {}, b = {};
    time_point<high_resolution_clock> start = high_resolution_clock::now();
    for( size_t n = 1'000'000; n; --n )
        pfCopy( &a, &b );
    uint64_t ns = (uint64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count();
    cout << (double)ns / 1.0E6 << "ms" << endl;
}

Under Windows the execution time of the loop is 3.96ms.
Under Linux the execution time is 12.96ms.
SSE simply rules here.

Bonita Montero <Bonita.Montero@gmail.com>: Jan 04 08:43AM +0100

> Under Windows the execution-time of the loop is 3.96ms.
> Under Linux the execution time is 12.96ms.
> SSE simply rules here.

MSVC even uses AVX2 for memcpy when I enable it via compiler-switch.
That's what the fCopy-code looks like then:

    vmovups xmm0, XMMWORD PTR [rdx]
    vmovups XMMWORD PTR [rcx], xmm0
    vmovups xmm1, XMMWORD PTR [rdx+16]
    vmovups XMMWORD PTR [rcx+16], xmm1
    vmovups xmm0, XMMWORD PTR [rdx+32]
    vmovups XMMWORD PTR [rcx+32], xmm0
    vmovups xmm1, XMMWORD PTR [rdx+48]
    vmovups XMMWORD PTR [rcx+48], xmm1
    vmovups xmm0, XMMWORD PTR [rdx+64]
    vmovups XMMWORD PTR [rcx+64], xmm0
    vmovups xmm1, XMMWORD PTR [rdx+80]
    vmovups XMMWORD PTR [rcx+80], xmm1
    vmovups xmm0, XMMWORD PTR [rdx+96]
    vmovups XMMWORD PTR [rcx+96], xmm0
    vmovups xmm0, XMMWORD PTR [rdx+112]
    vmovups XMMWORD PTR [rcx+112], xmm0
    vmovups xmm1, XMMWORD PTR [rdx+128]
    sub     rcx, -128
    vmovups XMMWORD PTR [rcx], xmm1
    vmovups xmm0, XMMWORD PTR [rdx+144]
    vmovups XMMWORD PTR [rcx+16], xmm0
    vmovups xmm1, XMMWORD PTR [rdx+160]
    vmovups XMMWORD PTR [rcx+32], xmm1
    vmovups xmm0, XMMWORD PTR [rdx+176]
    vmovups XMMWORD PTR [rcx+48], xmm0
    mov     rax, QWORD PTR [rdx+192]
    mov     QWORD PTR [rcx+64], rax
    ret     0

But the performance is only slightly better with AVX2 over SSE (3.68ms).

Melzzzzz <Melzzzzz@zzzzz.com>: Jan 04 08:20AM

> Under Windows the execution-time of the loop is 3.96ms.
> Under Linux the execution time is 12.96ms.
> SSE simply rules here.

That's not a test of that code.

--
press any key to continue or any other to quit...
I enjoy nothing as much as my status as an INVALID -- Zli Zec
We are all witnesses - about 3 years of intensive propaganda is enough to drive a nation mad -- Zli Zec
In the Wild West there was not really that much violence, precisely because everyone was armed. -- Mladen Gogala

Melzzzzz <Melzzzzz@zzzzz.com>: Jan 04 08:25AM

>>> That's not as easy to test as you might think.
>> It is not easy, that is why I wrote tests...
> You didn't test what I disassembled.

It is a test of rep movs vs. non-temporal moves with SSE/AVX2.
Non-temporal moves with SSE/AVX2 would be slower.

Bonita Montero <Bonita.Montero@gmail.com>: Jan 04 09:25AM +0100

>> Under Linux the execution time is 12.96ms.
>> SSE simply rules here.
> That's not a test of that code.

It is.

Bonita Montero <Bonita.Montero@gmail.com>: Jan 04 09:25AM +0100

>> You didn't test what I disassembled.
> It is a test of rep movs vs. non-temporal moves with SSE/AVX2.
> Non-temporal moves with SSE/AVX2 would be slower.

Then you're wrong in this part of the thread.

Melzzzzz <Melzzzzz@zzzzz.com>: Jan 04 08:27AM

>> It is a test of rep movs vs. non-temporal moves with SSE/AVX2.
>> Non-temporal moves with SSE/AVX2 would be slower.
> Then you're wrong in this part of the thread.

No, I am not. For a block of that size, the gcc code is more efficient...

Bonita Montero <Bonita.Montero@gmail.com>: Jan 04 09:36AM +0100

>>> Non-temporal moves with SSE/AVX2 would be slower.
>> Then you're wrong in this part of the thread.
> No, I am not. For a block of that size, the gcc code is more efficient...

You're at the wrong place in the subthread. What I said started here:
<qunp5o$17ci$1@gioia.aioe.org> - and everything below that posting
should be related to what I said there. But you're simply a stupid
person who doesn't understand this.

Melzzzzz <Melzzzzz@zzzzz.com>: Jan 04 08:41AM

> <qunp5o$17ci$1@gioia.aioe.org> - and everything below that posting
> should be related to what I said there. But you're simply a stupid
> person who doesn't understand this.

No, you are a buffoon. I just commented on the assembly output.
It would be interesting to see what code gcc generates for
larger blocks...

Bonita Montero <Bonita.Montero@gmail.com>: Jan 04 09:49AM +0100

> No, you are a buffoon. I just commented on the assembly output.
> It would be interesting to see what code gcc generates for
> larger blocks...

But this isn't true:

Melzzzzz <Melzzzzz@zzzzz.com>: Jan 04 09:01AM

> But this isn't true:
>> For small blocks movs instructions are better. For big ones
>> opposite.

You will never learn anything, because you don't listen to those
with more experience...

Bonita Montero <Bonita.Montero@gmail.com>: Jan 04 10:02AM +0100

>>> opposite.
> You will never learn anything, because you don't listen to those
> with more experience...

I'll listen to the experience of the compiler-writers.
They surely have more experience than you.

Melzzzzz <Melzzzzz@zzzzz.com>: Jan 04 09:19AM

>> with more experience...
> I'll listen to the experience of the compiler-writers.
> They surely have more experience than you.

You are dumb. You can clearly see different code that does not support your
presumption. If it were true, the generated assembly for the same code snippet
would be the same and all compilers would be equally efficient. Not realizing
that fact makes you dumb.

Bonita Montero <Bonita.Montero@gmail.com>: Jan 04 10:38AM +0100

On 04.01.2020 at 10:19, Melzzzzz wrote:
> presumption. If it were true, the generated assembly for the same code snippet
> would be the same and all compilers would be equally efficient. Not realizing
> that fact makes you dumb.

Look at this:

#include <iostream>
#include <cstring>
#include <cstdint>
#include <chrono>

using namespace std;
using namespace chrono;

extern "C" void fAvx();
extern "C" void fMovs();

int main()
{
    using timestamp = time_point<high_resolution_clock>;
    timestamp start = high_resolution_clock::now();
    fAvx();
    uint64_t ns = (uint64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count();
    cout << (double)ns / 1.0E6 << "ms" << endl;
    start = high_resolution_clock::now();
    fMovs();
    ns = (uint64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count();
    cout << (double)ns / 1.0E6 << "ms" << endl;
}

And the assembly code for fAvx and fMovs:

_TEXT SEGMENT

fAvx PROC
    sub     rsp, 64
    mov     rdx, 1000000000
avxLoop:
    vmovups ymm0, [rsp]             ; load 32 bytes
    vmovups [rsp + 32], ymm0        ; store 32 bytes
    dec     rdx
    jnz     avxLoop
    add     rsp, 64
    ret
fAvx ENDP

fMovs PROC
    push    rsi
    push    rdi
    sub     rsp, 64
    mov     rdx, 1000000000
movsLoop:
    mov     rcx, 4                  ; 4 qwords = 32 bytes
    mov     rsi, rsp
    lea     rdi, [rsp + 32]
    rep     movsq
    dec     rdx
    jnz     movsLoop
    add     rsp, 64
    pop     rdi
    pop     rsi
    ret
fMovs ENDP

_TEXT ENDS
END

I repeatedly copy 32 bytes - _a_small_amount_of_memory_ - and according
to what you say, copying with movs should be faster here. But on my
computer fAvx is 6.6 times faster.
Any questions?

David Brown <david.brown@hesbynett.no>: Jan 04 12:45PM +0100

On 04/01/2020 01:25, Bonita Montero wrote:
>> For small blocks movs instructions are better.
>> For big ones opposite.
> The compiler knows better than stupid Melzzz.

I suspect the results will be highly dependent on details, like the
exact chip you are using, and where you draw the line between "small
blocks" and "big blocks".

Compiler writers put a lot of effort into fine-tuning this kind of thing
for different situations. But the compiler can only optimise on the
basis of the information you give it. If you want optimal code for a
particular device, you have to tell the compiler exactly which device
you want to target (and what devices you want to support, which is not
necessarily the same thing). Then it can make a stab at picking the
right kind of instructions here - whether it is SIMD, loops, rep
instructions, AVX, or whatever.

Different compilers can make different assumptions about the target
device, when this information is missing from the command lines.

So in any argument about which instruction sequence is "best", it is
entirely possible for both sides to be right and very likely that the
argument is pointless.

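The point about block size and target flags can be checked with a small harness. What follows is only a sketch, not code from the thread: the names copy_block and average_copy_ns are made up for illustration, and it reuses the volatile function pointer trick from the earlier test so that the compiler expands memcpy for a size known at compile time but cannot drop the repeated calls.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: copies exactly N bytes. Because N is a compile-time
// constant, the compiler is free to expand memcpy inline however it sees fit.
template <std::size_t N>
void copy_block(unsigned char* dst, const unsigned char* src) {
    std::memcpy(dst, src, N);
}

// Hypothetical helper: average time in nanoseconds for one N-byte copy.
template <std::size_t N>
double average_copy_ns(std::size_t iterations) {
    assert(iterations > 0);
    std::vector<unsigned char> source(N, 0x5A);
    std::vector<unsigned char> target(N, 0);
    // The volatile pointer forces a real call on every iteration.
    void (*volatile copy)(unsigned char*, const unsigned char*) = copy_block<N>;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t n = 0; n < iterations; ++n)
        copy(target.data(), source.data());
    const auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration<double, std::nano>(elapsed).count()
           / static_cast<double>(iterations);
}
```

Comparing, say, average_copy_ns<32>(1000000) with average_copy_ns<4096>(1000000), built once with default flags and once with a target option such as -march=native (gcc, clang) or /arch:AVX2 (MSVC), would show how much the answer moves with block size and target. The numbers are only meaningful for the machine they were measured on.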
Melzzzzz <Melzzzzz@zzzzz.com>: Jan 04 12:00PM

> So in any argument about which instruction sequence is "best", it is
> entirely possible for both sides to be right and very likely that the
> argument is pointless.

In this particular case gcc is right and VC is wrong.

~/examples/assembler >>> ./rdtscp 1000000
1000000 128 byte blocks, loops:4
rep movsb 0.01132903184211
rep movsq 0.01134612947368
movntdq 0.00317828842105
movntdq prefetch 0.00316638236842
movntdq prefetch ymm 0.00316397368421
~/examples/assembler >>> ./rdtscp 1 [32]
1 128 byte blocks, loops:4000000
rep movsb 0.00000002601557
rep movsq 0.00000001063161
movntdq 0.00000008210285
movntdq prefetch 0.00000008319713
movntdq prefetch ymm 0.00000008284743

Simply put, for large blocks SSE2/AVX2 is better, for small ones movs,
on both Intel and AMD. Before Zen 2, AVX2 was better on Intel, which has
256-bit data moves, while AMD prior to Zen 2 has 128-bit ones.

Bonita Montero <Bonita.Montero@gmail.com>: Jan 04 01:03PM +0100

> Simply put, for large blocks SSE2/AVX2 is better, for small ones movs,
> on both Intel and AMD.

No, movsd/q is always inferior.

Barry Schwarz <schwarzb@delq.com>: Jan 03 10:29PM -0800

On Sat, 4 Jan 2020 05:36:17 +0100, wolfgang bauer <schutz@gmx.de> wrote:
>If I remove the "l" from "lu" in the print-function, no warning appears.
>This fits to your explanation.
>And so printf("sizeof int %d\n", sizeof(int)) has 4 as result. So on a 32-Bit machine it would be 2.

Firstly, sizeof evaluates to a size_t which need not be int. Printing
it with %d is not portable, could lead to undefined behavior, and some
compilers will diagnose this. Either use %zu which is designed for
size_t or cast the value to a type that matches your format, in this
case int.

Secondly, on a 32-bit machine, sizeof(int) will almost always be 4. It
may be 2 on a 16-bit machine. It really depends on what the compiler
writer decided to do.

If your code depends on an int having a particular size, test for it
and abort if you don't get the value you want. Alternatively, you could
use the exact-size typedef for what you need. If the typedef is not
available, your code will not compile, identifying the problem
immediately. If it does compile, you know you are using the size you
want.

--
Remove del for email

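A minimal sketch of both pieces of advice, assuming a program that really does require a 32-bit int. The function name describe_size is hypothetical, and static_assert is used here as the compile-time form of "test for it and abort".

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdio>
#include <string>

// Stop the build if int is not the size this (hypothetical) program relies on.
static_assert(sizeof(int) * CHAR_BIT == 32, "this code assumes a 32-bit int");

// Formats a size_t with %zu, the conversion specifier designed for size_t.
std::string describe_size(const char* name, std::size_t size) {
    char buffer[64];
    const int written = std::snprintf(buffer, sizeof buffer, "sizeof %s = %zu", name, size);
    // snprintf returns the length it needed; make sure nothing was truncated.
    assert(written > 0 && static_cast<std::size_t>(written) < sizeof buffer);
    return buffer;
}
```

Calling describe_size("int", sizeof(int)) then yields the text without passing a size_t where printf expects an int.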
Keith Thompson <Keith.S.Thompson+u@gmail.com>: Jan 03 11:56PM -0800

James Kuyper <jameskuyper@alumni.caltech.edu> writes:
[...]
> that each standard integer type must be able to represent, and are
> incorporated by reference from the C standard. I've converted those
> ranges into bit counts to better match your question.

A quibble: the required ranges of values for the standard integer types
are copied from the C standard, but are not incorporated by reference.

> On a 64-bit machine, long is often a 64-bit type, with int being a 32
> bit type.

True, but Microsoft's compiler has 32-bit long even on 64-bit machines.

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
[Note updated email address]
Working, but not speaking, for Philips Healthcare
void Void(void) { Void(); } /* The recursive call of the void */

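The question in the subject line can be put to the compiler directly. This is a sketch with made-up helper names (width_in_bits, uint32_is_unsigned_long). It only reports what the implementation chose, and the answer legitimately differs between platforms.

```cpp
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical helper: value bits plus the sign bit, if any.
template <typename T>
constexpr int width_in_bits() {
    return std::numeric_limits<T>::digits + (std::numeric_limits<T>::is_signed ? 1 : 0);
}

// Holds wherever the exact-width typedef exists at all.
static_assert(width_in_bits<std::uint32_t>() == 32, "uint32_t is exactly 32 bits wide");
// The standard only promises a minimum for long.
static_assert(width_in_bits<long>() >= 32, "long is at least 32 bits wide");

// True only where uint32_t happens to be a typedef for unsigned long.
// Same width does not imply same type: unsigned int and unsigned long
// stay distinct types even when both are 32 bits wide.
constexpr bool uint32_is_unsigned_long() {
    return std::is_same<std::uint32_t, unsigned long>::value;
}
```

Where uint32_is_unsigned_long() is false, a format checker has grounds to complain about passing a uint32_t to %lu, even if both types have the same width.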
Jorgen Grahn <grahn+nntp@snipabacken.se>: Jan 04 08:37AM

On Sat, 2020-01-04, wolfgang bauer wrote:
> That results in this warning:
> warning: format '%lu' expects argument of type 'long unsigned int', but argument 2 has type
> 'uint32_t {aka unsigned int}'

Others have already explained the larger picture, but concretely:

If you insist on combining std::printf and the fixed-width integer
types, you need to use the PRI... macros from <cinttypes>,
particularly:

#define PRIu32 "u"

/Jorgen

--
  // Jorgen Grahn <grahn@  Oo  o.   .     .
\X/     snipabacken.se>   O  o   .

Keith Thompson <Keith.S.Thompson+u@gmail.com>: Jan 04 02:16AM -0800

> types, you need to use the PRI... macros from <cinttypes>,
> particularly:
> #define PRIu32 "u"

*OR* you can convert to a type that you know is wide enough and that
printf handles directly:

    uint32_t x;
    std::printf("x = %lu\n", (unsigned long)x);

Or, of course:

    std::cout << "x = " << x << "\n";

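The three options from this subthread side by side, as a sketch. The function names are invented for illustration, and snprintf into a local buffer stands in for printf so the results can be compared.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <sstream>
#include <string>

// Option 1: the PRIu32 macro expands to the right conversion for uint32_t.
std::string with_pri_macro(std::uint32_t x) {
    char buffer[32];
    std::snprintf(buffer, sizeof buffer, "x = %" PRIu32, x);
    return buffer;
}

// Option 2: convert to a type that is wide enough and matches %lu.
std::string with_cast(std::uint32_t x) {
    char buffer[32];
    std::snprintf(buffer, sizeof buffer, "x = %lu", static_cast<unsigned long>(x));
    return buffer;
}

// Option 3: let the stream pick the overload for the actual type.
std::string with_stream(std::uint32_t x) {
    std::ostringstream out;
    out << "x = " << x;
    return out.str();
}

// All three approaches should produce the same text for any value.
void check_formats_agree(std::uint32_t x) {
    assert(with_pri_macro(x) == with_cast(x));
    assert(with_cast(x) == with_stream(x));
}
```

Note the space between the string literal and PRIu32: since C++11 it is needed so the macro name is not read as a user-defined literal suffix.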
Jorgen Grahn <grahn+nntp@snipabacken.se>: Jan 04 08:48AM

On Sat, 2020-01-04, Melzzzzz wrote:
>> https://www.boost.org/doc/libs/1_67_0/libs/spirit/doc/html/spirit/introduction.html
>> ... and many other parser-libs.
> Having a library is necessary, as it is tedious to write a parser in C++...

You guys are talking about a subclass of languages here. I often
write parsers for simple, "flat" languages, and plain C++ works well
for that. For example:

- the container file format for JPEG images
- the TIFF file format
- linker symbols as seen in the output from 'nm -CP'

/Jorgen

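To give an idea of what plain C++ for a "flat" format can look like, here is a sketch that reads only the 8-byte TIFF file header: a byte-order mark ("II" or "MM"), the number 42, and the offset of the first image file directory. The names TiffHeader, read_uint and parse_tiff_header are hypothetical, and nothing beyond the header is handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical result type for the fixed 8-byte TIFF header.
struct TiffHeader {
    bool little_endian;
    std::uint32_t first_ifd_offset;
};

// Reads an unsigned integer of `count` bytes (at most 4) in the given byte order.
std::uint32_t read_uint(const unsigned char* p, std::size_t count, bool little_endian) {
    assert(count <= 4);
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < count; ++i) {
        // Always consume the most significant byte first.
        const std::size_t index = little_endian ? count - 1 - i : i;
        value = (value << 8) | p[index];
    }
    return value;
}

// Returns false if the buffer is too short or does not start like a TIFF file.
bool parse_tiff_header(const unsigned char* data, std::size_t size, TiffHeader& out) {
    if (size < 8) return false;
    if (data[0] == 'I' && data[1] == 'I') out.little_endian = true;
    else if (data[0] == 'M' && data[1] == 'M') out.little_endian = false;
    else return false;
    if (read_uint(data + 2, 2, out.little_endian) != 42) return false;
    out.first_ifd_offset = read_uint(data + 4, 4, out.little_endian);
    return true;
}
```

Reading byte by byte instead of casting the buffer to a struct keeps the code independent of the host's own byte order and alignment rules.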
Bonita Montero <Bonita.Montero@gmail.com>: Jan 04 11:04AM +0100

> - the container file format for JPEG images
> - the TIFF file format
> - linker symbols as seen in the output from 'nm -CP'

That has nothing to do with parsing.
