- Invert every 2nd byte in a container of raw data - 25 Updates
Bonita Montero <Bonita.Montero@gmail.com>: Mar 31 04:09AM +0200

>> There's no way to make it simpler.

> Well, you could just go back to the original code, and let the
> compiler's optimizer unroll the loop for you...

The discussion was about the simplicity of the source and not of
the compiled code.
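The "original code" mentioned in the quoted reply is not part of this digest. As a rough illustration of the thread's task only, here is a minimal sketch that assumes "every 2nd byte" means the bytes at odd indices and that the container is a std::vector<std::uint8_t>; the function name is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the thread): flips all bits of the bytes
// at indices 1, 3, 5, ... and leaves the bytes at even indices untouched.
void invert_every_second_byte(std::vector<std::uint8_t>& data)
{
    for (std::size_t i = 1; i < data.size(); i += 2)
        data[i] = static_cast<std::uint8_t>(~data[i]);
}
```

A plain loop like this keeps the source simple and leaves any unrolling or vectorization to the optimizer, which is the trade-off the two posters are arguing about.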
Melzzzzz <Melzzzzz@zzzzz.com>: Mar 31 04:38AM

> two pipelined 128 bit halves) and you said that it is wrong. I
> proved that you were wrong.

> Boy, you're so stupid.

I am talking about AVX throughput. 256 bit vdivpd is twice as fast
as on Haswell... my err.

--
press any key to continue or any other to quit...
I enjoy nothing as much as my status as an INVALID -- Zli Zec
We are all witnesses - about 3 years of intensive propaganda is enough
to drive a nation mad -- Zli Zec
In the Wild West there was not that much violence, precisely because
everyone was armed. -- Mladen Gogala
Jorgen Grahn <grahn+nntp@snipabacken.se>: Mar 31 07:33AM

On Tue, 2020-03-31, Melzzzzz wrote:
>> Boy, you're so stupid.

> I am talking about AVX throughput. 256 bit vdivpd is twice as fast
> as on Haswell... my err.

Twice as fast if you disregard the data cache, surely?

I haven't paid attention to this thread, but won't memory bandwidth
be the bottleneck in the end anyway?

/Jorgen

--
  // Jorgen Grahn <grahn@  Oo  o.   .     .
\X/     snipabacken.se>   O  o   .
Bonita Montero <Bonita.Montero@gmail.com>: Mar 31 09:51AM +0200

>> I am talking about AVX throughput. 256 bit vdivpd is twice as fast
>> as on Haswell... my err.

> Twice as fast if you disregard the data cache, surely?

Here are the VDIVPD numbers from agner.org for the first-generation
Ryzen and for Haswell:
Ryzen: 8 - 13 cycles latency, 8 - 9 cycles throughput,
Haswell: 19 - 35 cycles latency, 18 - 28 cycles throughput.
But Coffee Lake has much higher performance than Haswell:
13 - 14 cycles latency, 8 cycles throughput.
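To put the quoted figures on a per-element basis, the sketch below divides a throughput figure by the number of doubles in one register. It assumes that the "throughput" values above are reciprocal throughputs, in cycles per 256-bit VDIVPD instruction, and that each instruction handles four doubles; the function name is hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: converts a reciprocal throughput (cycles per
// instruction) into cycles per double, assuming `lanes` doubles are
// processed by each instruction (4 for a 256-bit register).
constexpr double cycles_per_double(double recip_throughput_cycles, int lanes = 4)
{
    return recip_throughput_cycles / lanes;
}
```

Under those assumptions, 8 cycles per instruction works out to 2 cycles per double, and the quoted Haswell range of 18 to 28 cycles works out to 4.5 to 7 cycles per double.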
Melzzzzz <Melzzzzz@zzzzz.com>: Mar 31 07:58AM

> Twice as fast if you disregard the data cache, surely?

> I haven't paid attention to this thread, but won't memory bandwidth
> be the bottleneck in the end anyway?

Sure. I am talking about when the data is in cache...
Melzzzzz <Melzzzzz@zzzzz.com>: Mar 31 08:01AM

> throughput, Haswell 19-35 cycles latency, 18-28 cycles throughput.
> But Coffee Lake has a much higher performance than Haswell: 13 - 14
> cycles latency, 8 cycles throughput.

This is because since Skylake, Intel has 256 bit divpd.
Ryzen has 128 bit units but in pairs, so that 256
bits are a single op. The only thing that puts first-gen Ryzen
behind is the FMA instructions, as it can execute only one per cycle...
Bonita Montero <Bonita.Montero@gmail.com>: Mar 31 10:06AM +0200

> This is because since Skylake, Intel has 256 bit divpd.
> Ryzen has 128 bit units but in pairs, so that 256
> bits are single op. ..

Do I have to write a benchmark comparing DIVPD and VDIVPD
on my 1800X?
Melzzzzz <Melzzzzz@zzzzz.com>: Mar 31 08:22AM

>> bits are single op. ..

> Do I have to write a benchmark comparing DIVPD and VDIVPD
> on my 1800X?

If you wish:

~/.../examples/assembler >>> ./latency
recip1
15.833327168900179712 0.063157919326280312 0.063157919326280328
700.059641597344125328 0.001428449721395557 0.001428449721395557
860.050613320340289648 0.001162722268331821 0.001162722268331821
12.280964395431137600 0.081426829994884368 0.081426829994884448
144.000000 16.920612
108.000000 16.134408
108.000000 16.479540
108.000000 16.828776
144.000000 17.158536
144.000000 17.091432
108.000000 17.163324
144.000000 16.072596
108.000000 17.177688
144.000000 12.160980
72.000000 10.093536
recip2
15.833327168900179712 0.063157919326280328 0.063157919326280328
700.059641597344125328 0.001428449721395557 0.001428449721395557
860.050613320340289648 0.001162722268331821 0.001162722268331821
12.280964395431137600 0.081426829994884448 0.081426829994884448
72.000000 13.325616
72.000000 13.353768
36.000000 13.353624
72.000000 13.296960
72.000000 13.292208
72.000000 13.476024
72.000000 13.329972
72.000000 13.335264
72.000000 13.297500
72.000000 13.315464
72.000000 14.205312
recip3
15.833327168900179712 0.063157919326280328 0.063157919326280328
700.059641597344125328 0.001428449721395557 0.001428449721395557
860.050613320340289648 0.001162722268331821 0.001162722268331821
12.280964395431137600 0.081426829994884448 0.081426829994884448
72.000000 9.000108
72.000000 9.066672
72.000000 9.042948
72.000000 9.023184
72.000000 9.018360
108.000000 9.027612
72.000000 9.032760
72.000000 9.024768
72.000000 9.034740
72.000000 9.000072
72.000000 9.023256

This is a latency bench on my 2700X. recip3 is pure divpd, while
recip1 and recip2 are Newton-Raphson approximations. As you can see,
divpd is fastest, unlike on Intel, where recip1 is 8 cycles and
recip2 12 cycles (slow FMA on Ryzen).
~/.../examples/assembler >>> cat latency.asm
; latency test
format elf64
public recip
public recip1
public recip2
public recip3
public _rdtsc
section '.text' executable
N = 1000000
recip:
recip1:
; Load constants and input
    vbroadcastsd ymm1, [one]
    vpbroadcastq ymm4, [magic]
    mov eax, N
.loop:
    vmovdqu ymm0, [rdi]
    vpsubq ymm2, ymm4, ymm0
    vfnmadd213pd ymm0, ymm2, ymm1
    vfmadd132pd ymm2, ymm2, ymm0
    vmulpd ymm0, ymm0, ymm0
    vfmadd132pd ymm2, ymm2, ymm0
    vmulpd ymm0, ymm0, ymm0
    vfmadd132pd ymm2, ymm2, ymm0
    vmulpd ymm0, ymm0, ymm0
    vfmadd132pd ymm0, ymm2, ymm2
    dec eax
    jnz .loop
    vmovups [rdi], ymm0
    ret
recip2:
; Load constants and input
    vbroadcastsd ymm1, [one]
    mov eax, N
.loop:
    vmovdqu ymm0, [rdi]
    vcvtpd2ps xmm2,ymm0
    vrcpps xmm2,xmm2
    vcvtps2pd ymm2,xmm2
    vfnmadd213pd ymm0, ymm2, ymm1
    vfmadd132pd ymm2, ymm2, ymm0
    vmulpd ymm0, ymm0, ymm0
    vfmadd132pd ymm2, ymm2, ymm0
    vmulpd ymm0, ymm0, ymm0
    vfmadd132pd ymm2, ymm2, ymm0
    vmulpd ymm0, ymm0, ymm0
    vfmadd132pd ymm0, ymm2, ymm2
    dec eax
    jnz .loop
    vmovups [rdi], ymm0
    ret
recip3:
; Load constants and input
    vbroadcastsd ymm1, [one]
    mov eax, N
.loop:
    vmovdqu ymm0, [rdi]
    vdivpd ymm0,ymm1,ymm0
    dec eax
    jnz .loop
    vmovups [rdi], ymm0
    ret
_rdtsc:
    rdtscp
    shl rdx, 32
    or rax, rdx
    ret
section '.data' writeable align 16
align 16
one dq 3FF0000000000000h
magic dq 7FDE6238502484BAh
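The arithmetic in the recip1 loop body can be read as follows. With y0 the initial guess obtained by the integer subtraction magic - bits(x), and e = 1 - x*y0, the FMA and multiply sequence appears to compute y0*(1+e)*(1+e^2)*(1+e^4)*(1+e^8), which equals (1 - e^16)/x. The scalar C++ sketch below is one reading of that assembly, not the poster's code: the function name is hypothetical, and it assumes x is a positive, normal double whose exponent is far from the limits of the format.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical scalar counterpart of one lane of the recip1 loop body.
// Assumption: x is positive, normal, and of moderate magnitude.
double recip_newton(double x)
{
    assert(x > 0.0 && std::isnormal(x));
    const std::uint64_t magic = 0x7FDE6238502484BAull; // constant from latency.asm

    // Initial guess: subtract the bit pattern of x from the magic constant
    // (the vpsubq step).
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits = magic - bits;
    double y;
    std::memcpy(&y, &bits, sizeof y);

    double e = std::fma(-y, x, 1.0); // e = 1 - x*y   (vfnmadd213pd)
    y = std::fma(y, e, y);           // y *= (1 + e)   (vfmadd132pd)
    e *= e;                          // e^2            (vmulpd)
    y = std::fma(y, e, y);           // y *= (1 + e^2)
    e *= e;                          // e^4
    y = std::fma(y, e, y);           // y *= (1 + e^4)
    e *= e;                          // e^8
    return std::fma(y, e, y);        // y * (1 + e^8)
}
```

Because each step squares the error term, a rough initial guess is enough: the remaining relative error is e^16, which is why the recip1 results in the output above agree with plain division to within the last digit or so.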
Bonita Montero <Bonita.Montero@gmail.com>: Mar 31 10:43AM +0200

Here's my code:

#define NOMINMAX
#if defined(_MSC_VER)
#include <Windows.h>