Saturday, January 4, 2020

Digest for comp.lang.c++@googlegroups.com - 25 updates in 3 topics

"Öö Tiib" <ootiib@hot.ee>: Jan 03 10:37PM -0800

On Saturday, 4 January 2020 00:38:08 UTC+2, Chris M. Thomasson wrote:
> ____________________________
 
> Everything boils down to int types, so no special operators need to be
> defined.
 
Sorry, you are correct, I misread or misremembered. The problem it
supposedly solves is a bit confusing to me. I am not saying we do
not need it, just that I do not understand in what situation I
would need it.
 
Bonita Montero <Bonita.Montero@gmail.com>: Jan 04 08:08AM +0100

>> That's not as easy to test as you might think.
 
> It is not easy, that is why I wrote tests...
 
You didn't test what I disassembled.
Bonita Montero <Bonita.Montero@gmail.com>: Jan 04 08:34AM +0100

>> It is not easy, that is why I wrote tests...
 
> You didn't test what I disassembled.
 
Here's a little test:
 
#include <iostream>
#include <cstring>
#include <cstdint>
#include <chrono>

using namespace std;
using namespace chrono;

size_t const SIZE = 200;

struct S
{
    char c[200];
};

void fCopy( S *a, S *b )
{
    memcpy( a->c, b->c, SIZE );
}

int main()
{
    // calling through a volatile function pointer keeps the compiler
    // from inlining fCopy or optimizing the loop away
    void (*volatile pfCopy)( S *, S * ) = fCopy;
    S a, b;
    time_point<high_resolution_clock> start = high_resolution_clock::now();
    for( size_t n = 1'000'000; n; --n )
        pfCopy( &a, &b );
    uint64_t ns = (uint64_t)duration_cast<nanoseconds>(
        high_resolution_clock::now() - start ).count();
    cout << (double)ns / 1.0E6 << "ms" << endl;
}
 
Under Windows the execution-time of the loop is 3.96ms.
Under Linux the execution time is 12.96ms.
SSE simply rules here.
Bonita Montero <Bonita.Montero@gmail.com>: Jan 04 08:43AM +0100

> Under Windows the execution-time of the loop is 3.96ms.
> Under Linux the execution time is 12.96ms.
> SSE simply rules here.
 
MSVC even uses AVX2 for memcpy when I enable it via compiler-switch.
That's what the fCopy-code looks like then:
 
vmovups xmm0, XMMWORD PTR [rdx]
vmovups XMMWORD PTR [rcx], xmm0
vmovups xmm1, XMMWORD PTR [rdx+16]
vmovups XMMWORD PTR [rcx+16], xmm1
vmovups xmm0, XMMWORD PTR [rdx+32]
vmovups XMMWORD PTR [rcx+32], xmm0
vmovups xmm1, XMMWORD PTR [rdx+48]
vmovups XMMWORD PTR [rcx+48], xmm1
vmovups xmm0, XMMWORD PTR [rdx+64]
vmovups XMMWORD PTR [rcx+64], xmm0
vmovups xmm1, XMMWORD PTR [rdx+80]
vmovups XMMWORD PTR [rcx+80], xmm1
vmovups xmm0, XMMWORD PTR [rdx+96]
vmovups XMMWORD PTR [rcx+96], xmm0
vmovups xmm0, XMMWORD PTR [rdx+112]
vmovups XMMWORD PTR [rcx+112], xmm0
vmovups xmm1, XMMWORD PTR [rdx+128]
sub rcx, -128
vmovups XMMWORD PTR [rcx], xmm1
vmovups xmm0, XMMWORD PTR [rdx+144]
vmovups XMMWORD PTR [rcx+16], xmm0
vmovups xmm1, XMMWORD PTR [rdx+160]
vmovups XMMWORD PTR [rcx+32], xmm1
vmovups xmm0, XMMWORD PTR [rdx+176]
vmovups XMMWORD PTR [rcx+48], xmm0
mov rax, QWORD PTR [rdx+192]
mov QWORD PTR [rcx+64], rax
ret 0
 
But the performance is only slightly better with AVX2 over SSE
(3.68ms).
Melzzzzz <Melzzzzz@zzzzz.com>: Jan 04 08:20AM


> Under Windows the execution-time of the loop is 3.96ms.
> Under Linux the execution time is 12.96ms.
> SSE simply rules here.
That's not a test of that code.
 
--
press any key to continue or any other to quit...
I enjoy nothing as much as my status as a DISABLED PERSON -- Zli Zec
We are all witnesses: about 3 years of intensive propaganda is enough to drive a nation mad -- Zli Zec
There was not that much violence in the Wild West, precisely because everyone
was armed. -- Mladen Gogala
Melzzzzz <Melzzzzz@zzzzz.com>: Jan 04 08:25AM

>>> That's not as easy to test as you might think.
 
>> It is not easy, that is why I wrote tests...
 
> You didn't test what I disassembled.
 
It is a test of rep movs vs. non-temporal moves with SSE/AVX2.
Non-temporal moves with SSE/AVX2 would be slower.
 
 
Bonita Montero <Bonita.Montero@gmail.com>: Jan 04 09:25AM +0100

>> Under Linux the execution time is 12.96ms.
>> SSE simply rules here.
 
> That's not a test of that code.
 
It is.
Bonita Montero <Bonita.Montero@gmail.com>: Jan 04 09:25AM +0100

>> You didn't test what I disassembled.
 
> It is a test of rep movs vs. non-temporal moves with SSE/AVX2.
> Non-temporal moves with SSE/AVX2 would be slower.
 
Then you're wrong in this part of the thread.
Melzzzzz <Melzzzzz@zzzzz.com>: Jan 04 08:27AM


>> It is a test of rep movs vs. non-temporal moves with SSE/AVX2.
>> Non-temporal moves with SSE/AVX2 would be slower.
 
> Then you're wrong in this part of the thread.
 
No, I am not. For a block of that size, gcc's code is more efficient...
 
Bonita Montero <Bonita.Montero@gmail.com>: Jan 04 09:36AM +0100

>>> Non-temporal moves with SSE/AVX2 would be slower.
 
>> Then you're wrong in this part of the thread.
 
> No, I am not. For a block of that size, gcc's code is more efficient...
 
You're at the wrong place in the subthread. What I said started here:
<qunp5o$17ci$1@gioia.aioe.org> - and everything below that posting
should be related to what I said there. But you're simply a stupid
person who doesn't understand this.
Melzzzzz <Melzzzzz@zzzzz.com>: Jan 04 08:41AM

><qunp5o$17ci$1@gioia.aioe.org> - and everything below that posting
> should be related to what I said there. But you're simply a stupid
> person who doesn't understand this.
 
No, you are a buffoon. I just commented on the assembly output.
It would be interesting to see what code gcc would generate for
larger blocks...
 
Bonita Montero <Bonita.Montero@gmail.com>: Jan 04 09:49AM +0100

> No, you are a buffoon. I just commented on the assembly output.
> It would be interesting to see what code gcc would generate for
> larger blocks...
 
But this isn't true:
 
Melzzzzz <Melzzzzz@zzzzz.com>: Jan 04 09:01AM


> But this isn't true:
 
>> For small blocks movs instructions are better. For big ones
>> opposite.
You will never learn anything, because you don't listen to those who are
more experienced...
 
Bonita Montero <Bonita.Montero@gmail.com>: Jan 04 10:02AM +0100

>>> opposite.
 
> You will never learn anything, because you don't listen to those who are
> more experienced...
 
I'll listen to the experience of the compiler-writers.
They surely have more experience than you.
Melzzzzz <Melzzzzz@zzzzz.com>: Jan 04 09:19AM

>> experienced...
 
> I'll listen to the experience of the compiler-writers.
> They surely have more experience than you.
 
You are dumb. You can clearly see different code that does not support your
presumption. If that were true, the generated assembly code for the same code snippet
would be the same and all compilers would be equally efficient. Not realizing
that fact makes you dumb.
 
 
Bonita Montero <Bonita.Montero@gmail.com>: Jan 04 10:38AM +0100

On 04.01.2020 at 10:19, Melzzzzz wrote:
> presumption. If that were true, the generated assembly code for the same code snippet
> would be the same and all compilers would be equally efficient. Not realizing
> that fact makes you dumb.
 
Look at this:
 
#include <iostream>
#include <cstring>
#include <cstdint>
#include <chrono>

using namespace std;
using namespace chrono;

extern "C" void fAvx();
extern "C" void fMovs();

int main()
{
    using timestamp = time_point<high_resolution_clock>;
    timestamp start = high_resolution_clock::now();
    fAvx();
    uint64_t ns = (uint64_t)duration_cast<nanoseconds>(
        high_resolution_clock::now() - start ).count();
    cout << (double)ns / 1.0E6 << "ms" << endl;
    start = high_resolution_clock::now();
    fMovs();
    ns = (uint64_t)duration_cast<nanoseconds>(
        high_resolution_clock::now() - start ).count();
    cout << (double)ns / 1.0E6 << "ms" << endl;
}
 
And the assembly code for fAvx and fMovs:
 
_TEXT SEGMENT

fAvx PROC
    sub     rsp, 64
    mov     rdx, 1000000000         ; iteration count
avxLoop:
    vmovups ymm0, [rsp]             ; load 32 bytes
    vmovups [rsp + 32], ymm0        ; store 32 bytes
    dec     rdx
    jnz     avxLoop
    add     rsp, 64
    ret
fAvx ENDP

fMovs PROC
    push    rsi                     ; rsi and rdi are callee-saved in the Windows x64 ABI
    push    rdi
    sub     rsp, 64
    mov     rdx, 1000000000         ; iteration count
movsLoop:
    mov     rcx, 4                  ; 4 qwords = 32 bytes
    mov     rsi, rsp
    lea     rdi, [rsp + 32]
    rep movsq
    dec     rdx
    jnz     movsLoop
    add     rsp, 64
    pop     rdi
    pop     rsi
    ret
fMovs ENDP

_TEXT ENDS
END
 
I repeatedly copy 32 bytes - _a_small_amount_of_memory_ - and according
to what you say, copying with movs should be faster here. But on my computer
fAvx is 6.6 times faster. Any questions?
David Brown <david.brown@hesbynett.no>: Jan 04 12:45PM +0100

On 04/01/2020 01:25, Bonita Montero wrote:
>> For small blocks movs instructions are better.
>> For big ones opposite.
 
> The compiler knows better than stupid Melzzz.
 
I suspect the results will be highly dependent on details, like the
exact chip you are using, and where you draw the line between "small
blocks" and "big blocks".
 
Compiler writers put a lot of effort into fine-tuning this kind of thing
for different situations. But the compiler can only optimise on the
basis of the information you give it. If you want optimal code for a
particular device, you have to tell the compiler exactly which device
you want to target (and what devices you want to support, which is not
necessarily the same thing). Then it can make a stab at picking the
right kind of instructions here - whether it is SIMD, loops, rep
instructions, AVX, or whatever.
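
A minimal sketch of how one might check this for a fixed-size copy like the one tested earlier in the thread. The names Block and copy_block and the file name copy.cpp are made up for illustration; -O2, -march=native and -S are ordinary GCC options, and which instruction sequence comes out depends on the compiler version and the target selected.

```cpp
#include <cassert>
#include <cstring>

// Same 200-byte payload as the struct S used earlier in the thread.
struct Block { char bytes[200]; };

// The size is a compile-time constant, so the compiler is free to expand
// memcpy inline using whatever instructions it considers best for the target.
void copy_block(Block* dst, const Block* src)
{
    assert(dst != nullptr && src != nullptr);
    std::memcpy(dst->bytes, src->bytes, sizeof dst->bytes);
}

// To compare the generated code (copy.cpp is a hypothetical file name):
//   g++ -O2 -S copy.cpp                 generic x86-64 baseline
//   g++ -O2 -march=native -S copy.cpp   tuned for the machine doing the build
```

Comparing the two assembly listings shows whether the choice of instructions changes once the compiler knows the target.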
 
Different compilers can make different assumptions about the target
device, when this information is missing from the command lines.
 
So in any argument about which instruction sequence is "best", it is
entirely possible for both sides to be right and very likely that the
argument is pointless.
Melzzzzz <Melzzzzz@zzzzz.com>: Jan 04 12:00PM


> So in any argument about which instruction sequence is "best", it is
> entirely possible for both sides to be right and very likely that the
> argument is pointless.
 
In this particular case gcc is right and VC is wrong.
~/examples/assembler >>> ./rdtscp 1000000
1000000 128 byte blocks, loops:4
rep movsb             0.01132903184211
rep movsq             0.01134612947368
movntdq               0.00317828842105
movntdq prefetch      0.00316638236842
movntdq prefetch ymm  0.00316397368421
~/examples/assembler >>> ./rdtscp 1 [32]
1 128 byte blocks, loops:4000000
rep movsb             0.00000002601557
rep movsq             0.00000001063161
movntdq               0.00000008210285
movntdq prefetch      0.00000008319713
movntdq prefetch ymm  0.00000008284743
 
Simply put: for large blocks SSE2/AVX2 is better, for
small ones movs, on both Intel and AMD.
Before Zen 2, AVX2 was better on Intel, as Intel has 256-bit
data moves while AMD prior to Zen 2 had 128-bit ones.
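
For anyone who wants to try the small-versus-large comparison on their own machine, here is a rough sketch. It is not the rdtscp program above: it times plain memcpy through a volatile function pointer (the trick used earlier in the thread) and so measures whatever the C library does, not rep movs against movntdq. The names copy_bytes and ns_per_copy are made up for this sketch.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Thin wrapper so the copy can be called through a function pointer.
void copy_bytes(char* dst, const char* src, std::size_t n)
{
    std::memcpy(dst, src, n);
}

// Average time in nanoseconds for one copy of `block` bytes, over `reps` copies.
double ns_per_copy(std::size_t block, std::size_t reps)
{
    assert(block > 0 && reps > 0);
    std::vector<char> src(block, 'x'), dst(block);
    // The volatile pointer keeps the compiler from inlining or dropping the calls.
    void (*volatile copy)(char*, const char*, std::size_t) = copy_bytes;
    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < reps; ++i)
        copy(dst.data(), src.data(), block);
    auto stop = std::chrono::steady_clock::now();
    assert(dst == src);  // the copy really happened
    return std::chrono::duration<double, std::nano>(stop - start).count()
           / static_cast<double>(reps);
}
```

Calling ns_per_copy for a range of block sizes, for example 32, 128, 4096 and 1 << 20, gives a first impression of where the crossover lies on a given chip. The numbers will vary with CPU, C library and cache state.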
 
 
Bonita Montero <Bonita.Montero@gmail.com>: Jan 04 01:03PM +0100

> Simply put: for large blocks SSE2/AVX2 is better, for
> small ones movs, on both Intel and AMD.
 
No, movsd/q is always inferior.
Barry Schwarz <schwarzb@delq.com>: Jan 03 10:29PM -0800

On Sat, 4 Jan 2020 05:36:17 +0100, wolfgang bauer <schutz@gmx.de>
wrote:
 
 
>If I remove the "l" from "lu" in the print-function, no warning appears.

>This fits your explanation.

>And so printf("sizeof int %d\n", sizeof(int)) has 4 as result. So on a 32-bit machine it would be 2.
 
Firstly, sizeof evaluates to a size_t which need not be int. Printing
it with %d is not portable, could lead to undefined behavior, and some
compilers will diagnose this. Either use %zu which is designed for
size_t or cast the value to a type that matches your format, in this
case int.
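
A small sketch of the %zu variant. The helper name format_size is made up for illustration; it uses snprintf so the result can be inspected as a string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Formats a size_t with %zu, the conversion intended for size_t.
std::string format_size(const char* label, std::size_t n)
{
    char buf[64];
    int written = std::snprintf(buf, sizeof buf, "sizeof %s = %zu", label, n);
    assert(written >= 0);  // a negative value signals an output error
    (void)written;         // silence the unused warning when NDEBUG is defined
    return buf;
}
```

For example, std::puts(format_size("int", sizeof(int)).c_str()) prints the size of int without any format mismatch.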
 
Secondly, on a 32-bit machine, sizeof(int) will almost always be 4. It
may be 2 on a 16-bit machine. It really depends on what the compiler
writer decided to do.
 
If your code depends on an int having a particular size, test for it
and abort if you don't get the value you want. Alternately, you could
use the exact size typedef for what you need. If the typedef is not
available, your code will not compile, identifying the problem
immediately. If it does compile, you know you are using the size you
want.
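
Both suggestions can be sketched as follows. The 32-bit requirement and the helper name checked_u32 are assumptions chosen for illustration; static_assert does the "test and abort" at compile time.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Assumed requirement for this sketch: int is exactly 32 bits wide.
// If it is not, the build stops here with the message below.
static_assert(sizeof(int) * CHAR_BIT == 32, "this code assumes a 32-bit int");

// std::uint32_t only exists where an exactly 32-bit unsigned type is
// available, so merely using it documents and enforces the requirement.
std::uint32_t checked_u32(unsigned long long v)
{
    assert(v <= UINT32_MAX);  // abort in debug builds instead of truncating silently
    return static_cast<std::uint32_t>(v);
}
```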
 
--
Remove del for email
Keith Thompson <Keith.S.Thompson+u@gmail.com>: Jan 03 11:56PM -0800

James Kuyper <jameskuyper@alumni.caltech.edu> writes:
[...]
> that each standard integer type must be able to represent, and are
> incorporated by reference from the C standard. I've converted those
> ranges into bit counts to better match your question.
 
A quibble: the required ranges of values for the standard integer types
are copied from the C standard, but are not incorporated by reference.
 
> On a 64-bit machine, long is often a 64-bit type, with int being a 32
> bit type.
 
True, but Microsoft's compiler has 32-bit long even on 64-bit machines.
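
A small sketch of how code can avoid depending on the width of long. The helper width_in_bits is made up for illustration.

```cpp
#include <cstdint>
#include <limits>

// Width of an integer type in bits: value bits plus the sign bit, if any.
template <typename T>
constexpr int width_in_bits()
{
    return std::numeric_limits<T>::digits
           + (std::numeric_limits<T>::is_signed ? 1 : 0);
}

// Guaranteed on every conforming implementation.
static_assert(width_in_bits<long>() >= 32, "long has at least 32 bits");

// When exactly 64 bits are needed, ask for them by name instead of using long.
static_assert(width_in_bits<std::int64_t>() == 64, "int64_t is exactly 64 bits");
```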
 
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
[Note updated email address]
Working, but not speaking, for Philips Healthcare
void Void(void) { Void(); } /* The recursive call of the void */
Jorgen Grahn <grahn+nntp@snipabacken.se>: Jan 04 08:37AM

On Sat, 2020-01-04, wolfgang bauer wrote:
 
> That results in this warning:
 
> warning: format '%lu' expects argument of type 'long unsigned int', but argument 2 has type
> 'uint32_t {aka unsigned int}'
 
Others have already explained the larger picture, but concretely:
 
If you insist on combining std::printf and the fixed-width integer
types, you need to use the PRI... macros from <cinttypes>,
particularly:
 
#define PRIu32 "u"
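
A short sketch of how the macro is used in a format string. The helper name format_u32 is made up for illustration. Note the space between the string literal and PRIu32, which C++11 and later require.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdio>
#include <string>

// Formats a uint32_t portably: PRIu32 expands to the right conversion
// specifier for this platform ("u", "lu", ...).
std::string format_u32(std::uint32_t x)
{
    char buf[32];
    int n = std::snprintf(buf, sizeof buf, "x = %" PRIu32, x);
    assert(n > 0 && n < static_cast<int>(sizeof buf));
    (void)n;  // silence the unused warning when NDEBUG is defined
    return buf;
}
```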
 
/Jorgen
 
--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
Keith Thompson <Keith.S.Thompson+u@gmail.com>: Jan 04 02:16AM -0800

> types, you need to use the PRI... macros from <cinttypes>,
> particularly:
 
> #define PRIu32 "u"
 
*OR* you can convert to a type that you know is wide enough and that
printf handles directly:
 
uint32_t x;
std::printf("x = %lu\n", (unsigned long)x);
 
Or, of course:
 
std::cout << "x = " << x << "\n";
 
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
[Note updated email address]
Working, but not speaking, for Philips Healthcare
void Void(void) { Void(); } /* The recursive call of the void */
Jorgen Grahn <grahn+nntp@snipabacken.se>: Jan 04 08:48AM

On Sat, 2020-01-04, Melzzzzz wrote:
 
>> https://www.boost.org/doc/libs/1_67_0/libs/spirit/doc/html/spirit/introduction.html
>> ... and many other parser-libs.
 
> Having a library is necessary, as it is tedious to write a parser in C++...
 
You guys are talking about a subclass of languages here. I often write
parsers for simple, "flat" languages, and plain C++ works well for
that.
 
For example:
- the container file format for JPEG images
- the TIFF file format
- linker symbols as seen in the output from 'nm -CP'
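
To give an idea of the first item, here is a rough sketch of walking the marker segments of a JPEG file in plain C++. It is an illustration only, not the code referred to above: the names Segment and list_segments are made up, and fill bytes and stand-alone markers are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Segment {
    std::uint8_t marker;   // second marker byte, e.g. 0xE0 for APP0
    std::size_t  payload;  // payload size, excluding the 2-byte length field
};

// Lists the segments between SOI and the first SOS or EOI marker.
// Returns an empty list if the data does not start with SOI or if a
// segment length is inconsistent with the amount of data present.
std::vector<Segment> list_segments(const std::vector<std::uint8_t>& d)
{
    std::vector<Segment> out;
    if (d.size() < 2 || d[0] != 0xFF || d[1] != 0xD8) return out;  // no SOI
    std::size_t i = 2;
    while (i + 4 <= d.size() && d[i] == 0xFF) {
        std::uint8_t m = d[i + 1];
        if (m == 0xD9 || m == 0xDA) break;           // EOI or start of scan
        // Big-endian length that includes the two length bytes themselves.
        std::size_t len = (std::size_t(d[i + 2]) << 8) | d[i + 3];
        if (len < 2 || i + 2 + len > d.size()) return {};
        out.push_back({m, len - 2});
        i += 2 + len;
    }
    assert(i <= d.size());
    return out;
}
```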
 
/Jorgen
 
--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
Bonita Montero <Bonita.Montero@gmail.com>: Jan 04 11:04AM +0100

> - the container file format for JPEG images
> - the TIFF file format
> - linker symbols as seen in the output from 'nm -CP'
 
That has nothing to do with parsing.