Tuesday, March 31, 2020

Digest for comp.lang.c++@googlegroups.com - 25 updates in 1 topic

Bonita Montero <Bonita.Montero@gmail.com>: Mar 31 04:09AM +0200

>> There's no way to make it simpler.
 
> Well, you could just go back to the original code, and let the
> compiler's optimizer unroll the loop for you...
 
The discussion was about the simplicity of the source and not
of the compiled code.
Melzzzzz <Melzzzzz@zzzzz.com>: Mar 31 04:38AM

> two pipelined 128 bit halves) and you said that it is wrong. I
> proved that you were wrong.
> Boy, you're so stupid.
 
I am talking about AVX throughput. 256-bit vdivpd is twice as
fast as on Haswell... my mistake.
 
 
--
press any key to continue or any other to quit...
I enjoy nothing as much as my status as a DISABLED PERSON -- Zli Zec
We are all witnesses: about 3 years of intensive propaganda is enough to drive a nation mad -- Zli Zec
In the Wild West there was not that much violence, precisely because everyone
was armed. -- Mladen Gogala
Jorgen Grahn <grahn+nntp@snipabacken.se>: Mar 31 07:33AM

On Tue, 2020-03-31, Melzzzzz wrote:
>> Boy, you're so stupid.
 
> I am talking about AVX throughput. 256-bit vdivpd is twice as
> fast as on Haswell... my mistake.
 
Twice as fast if you disregard the data cache, surely?
 
I haven't paid attention to this thread, but won't memory bandwidth
be the bottleneck in the end anyway?
 
/Jorgen
 
--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
Bonita Montero <Bonita.Montero@gmail.com>: Mar 31 09:51AM +0200

>> I am talking about AVX throughput. 256-bit vdivpd is twice as
>> fast as on Haswell... my mistake.
 
> Twice as fast if you disregard the data cache, surely?
 
Here are the VDIVPD numbers from agner.org for first-generation
Ryzen and for Haswell. Ryzen: 8-13 cycles latency, 8-9 cycles
throughput. Haswell: 19-35 cycles latency, 18-28 cycles throughput.
Coffee Lake, however, performs much better than Haswell: 13-14
cycles latency, 8 cycles throughput.
Melzzzzz <Melzzzzz@zzzzz.com>: Mar 31 07:58AM


> Twice as fast if you disregard the data cache, surely?
 
> I haven't paid attention to this thread, but won't memory bandwidth
> be the bottleneck in the end anyway?
 
Sure. I am talking when data is in cache...
 
Melzzzzz <Melzzzzz@zzzzz.com>: Mar 31 08:01AM

> throughput. Haswell: 19-35 cycles latency, 18-28 cycles throughput.
> Coffee Lake, however, performs much better than Haswell: 13-14
> cycles latency, 8 cycles throughput.
 
This is because Intel has had a 256-bit divpd unit since Skylake.
Ryzen has 128-bit units, but in pairs, so a 256-bit operation
is a single op. The only thing that puts first-gen Ryzen
behind is FMA, as it can execute only one FMA instruction per cycle...
 
 
 
Bonita Montero <Bonita.Montero@gmail.com>: Mar 31 10:06AM +0200

> This is because Intel has had a 256-bit divpd unit since Skylake.
> Ryzen has 128-bit units, but in pairs, so a 256-bit operation
> is a single op. ..
 
Do I have to write a benchmark comparing DIVPD and VDIVPD
on my 1800X?
Melzzzzz <Melzzzzz@zzzzz.com>: Mar 31 08:22AM

>> bits are single op. ..
 
> Do I have to write a benchmark comparing DIVPD and VDIVPD
> on my 1800X?
 
If you wish:
~/.../examples/assembler >>> ./latency
recip1
15.833327168900179712 0.063157919326280312 0.063157919326280328
700.059641597344125328 0.001428449721395557 0.001428449721395557
860.050613320340289648 0.001162722268331821 0.001162722268331821
12.280964395431137600 0.081426829994884368 0.081426829994884448
144.000000 16.920612
108.000000 16.134408
108.000000 16.479540
108.000000 16.828776
144.000000 17.158536
144.000000 17.091432
108.000000 17.163324
144.000000 16.072596
108.000000 17.177688
144.000000 12.160980
72.000000 10.093536
recip2
15.833327168900179712 0.063157919326280328 0.063157919326280328
700.059641597344125328 0.001428449721395557 0.001428449721395557
860.050613320340289648 0.001162722268331821 0.001162722268331821
12.280964395431137600 0.081426829994884448 0.081426829994884448
72.000000 13.325616
72.000000 13.353768
36.000000 13.353624
72.000000 13.296960
72.000000 13.292208
72.000000 13.476024
72.000000 13.329972
72.000000 13.335264
72.000000 13.297500
72.000000 13.315464
72.000000 14.205312
recip3
15.833327168900179712 0.063157919326280328 0.063157919326280328
700.059641597344125328 0.001428449721395557 0.001428449721395557
860.050613320340289648 0.001162722268331821 0.001162722268331821
12.280964395431137600 0.081426829994884448 0.081426829994884448
72.000000 9.000108
72.000000 9.066672
72.000000 9.042948
72.000000 9.023184
72.000000 9.018360
108.000000 9.027612
72.000000 9.032760
72.000000 9.024768
72.000000 9.034740
72.000000 9.000072
72.000000 9.023256
 
 
This is a latency benchmark on my 2700X. recip3 is pure divpd, while recip1
and recip2 are Newton-Raphson approximations.
As you can see, divpd is fastest, unlike on Intel, where recip1 takes 8
cycles and recip2 12 cycles (FMA is slow on Ryzen).
 
~/.../examples/assembler >>> cat latency.asm
; latency test
format elf64
public recip
public recip1
public recip2
public recip3
public _rdtsc
section '.text' executable
N = 1000000
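recip:
recip1:
; Load constants: ymm1 = 1.0 in every lane, ymm4 = magic constant
    vbroadcastsd ymm1, [one]
    vpbroadcastq ymm4, [magic]
    mov eax, N
.loop:
    vmovdqu ymm0, [rdi]             ; x = four doubles at [rdi]
    vpsubq ymm2, ymm4, ymm0         ; y = magic - bits(x), a rough 1/x
    vfnmadd213pd ymm0, ymm2, ymm1   ; e = 1 - x*y
    vfmadd132pd ymm2, ymm2, ymm0    ; y = y*e + y
    vmulpd ymm0, ymm0, ymm0         ; e = e*e
    vfmadd132pd ymm2, ymm2, ymm0    ; y = y*e + y
    vmulpd ymm0, ymm0, ymm0         ; e = e*e
    vfmadd132pd ymm2, ymm2, ymm0    ; y = y*e + y
    vmulpd ymm0, ymm0, ymm0         ; e = e*e
    vfmadd132pd ymm0, ymm2, ymm2    ; result = e*y + y
    dec eax
    jnz .loop
    vmovups [rdi], ymm0             ; store the result over the input
    ret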
 
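recip2:
; Load constants: ymm1 = 1.0 in every lane
    vbroadcastsd ymm1, [one]
    mov eax, N
.loop:
    vmovdqu ymm0, [rdi]             ; x = four doubles at [rdi]
    vcvtpd2ps xmm2,ymm0             ; narrow x to single precision
    vrcpps xmm2,xmm2                ; low-precision hardware estimate of 1/x
    vcvtps2pd ymm2,xmm2             ; y = estimate widened back to double
    vfnmadd213pd ymm0, ymm2, ymm1   ; e = 1 - x*y
    vfmadd132pd ymm2, ymm2, ymm0    ; y = y*e + y
    vmulpd ymm0, ymm0, ymm0         ; e = e*e
    vfmadd132pd ymm2, ymm2, ymm0    ; y = y*e + y
    vmulpd ymm0, ymm0, ymm0         ; e = e*e
    vfmadd132pd ymm2, ymm2, ymm0    ; y = y*e + y
    vmulpd ymm0, ymm0, ymm0         ; e = e*e
    vfmadd132pd ymm0, ymm2, ymm2    ; result = e*y + y
    dec eax
    jnz .loop
    vmovups [rdi], ymm0             ; store the result over the input
    ret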
 
recip3:
; Load constants: ymm1 = 1.0 in every lane
    vbroadcastsd ymm1, [one]
    mov eax, N
.loop:
    vmovdqu ymm0, [rdi]             ; x = four doubles at [rdi]
    vdivpd ymm0,ymm1,ymm0           ; result = 1.0 / x
    dec eax
    jnz .loop
    vmovups [rdi], ymm0             ; store the result over the input
    ret
 
_rdtsc:
    rdtscp                          ; timestamp counter in edx:eax (also writes ecx)
    shl rdx, 32
    or rax, rdx                     ; combine both halves into a 64-bit value in rax
    ret
 
section '.data' writeable align 16
align 16
one dq 3FF0000000000000h
magic dq 7FDE6238502484BAh
 
 
Bonita Montero <Bonita.Montero@gmail.com>: Mar 31 10:43AM +0200

Here's my code:
 
#define NOMINMAX
#if defined(_MSC_VER)
#include <Windows.h>
