Monday, May 8, 2017

Digest for comp.programming.threads@googlegroups.com - 3 updates in 2 topics

rami17 <rami17@rami17.net>: May 07 05:25PM -0400

Hello...
 
 
Capitalism is changing from commodity capital to intellectual capital.
 
Intellectual capital is where you will make more money and where you
will become richer and more prosperous.
 
 
See the following video:
 
https://www.youtube.com/watch?v=7qYmZPP4AO8
 
 
Thank you,
Amine Moulay Ramdane.
Bonita Montero <Bonita.Montero@gmail.com>: May 07 04:16PM +0200

Ok, you're an idiot on the one hand, but on the other hand you
were not even right in this case: you even underestimated
the time of a cacheline transfer.
I wrote a little program that measures the time of a 32 bit
LOCK CMPXCHG when the cacheline was in another L1 cache before
(or was changed by a sibling thread on SMT).
Here it is:
https://pastebin.com/M5W3tvHd
There are two spawned threads in this program. Both check a DWORD
value to see whether it is even or odd. The one looking for even values
increments it through LOCK CMPXCHG (InterlockedCompareExchange maps
directly to the intrinsic when you use MSVC++) when it is even, the
other when it is odd. So both play ping-pong with the cache-line. The
time to do this is taken with RDTSC (ok, RDTSC isn't absolutely
accurate on my Ryzen 1800X due to XFR). Each thread does a fixed number
of iterations and counts the successful swaps. Both threads signal
their number of ticks and successful swaps to the main thread. The
main thread then plays the "ping-pong" alone with itself for the number
of iterations minus the number of successful swaps, but does only
unsuccessful CMPXCHGs, so that it gets the overhead of the unsuccessful
CMPXCHGs in both threads. This overhead is the time of all iterations
with unsuccessful CMPXCHGs. It is subtracted from the average time of
both threads, so that we get the pure time spent doing successful
CMPXCHGs, i.e. we get the time mostly spent transferring the cachelines
between the two cores.
The program tests the overhead of core 0 versus all other cores in
the system. With my Ryzen, the overhead looks like this:
processor 0 and processor 1: 117.606
processor 0 and processor 2: 146.972
processor 0 and processor 3: 159.324
processor 0 and processor 4: 148.491
processor 0 and processor 5: 150.407
processor 0 and processor 6: 175.156
processor 0 and processor 7: 198.427
processor 0 and processor 8: 810.89
processor 0 and processor 9: 806.403
processor 0 and processor 10: 799.987
processor 0 and processor 11: 796.437
processor 0 and processor 12: 795.243
processor 0 and processor 13: 797.405
processor 0 and processor 14: 795.879
processor 0 and processor 15: 792.173
As we can see, there's a huge overhead even for core 0 with its
SMT sibling, and the overhead increases from processor 1 to processor
2, i.e. where transfers between two cores become necessary. And there's
a huge increase from processor 7 to processor 8, because Ryzen is
organized in two CCX modules connected to a crossbar that isn't able to
transfer faster than the RAM (regarding both throughput and
latency).
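
For readers who don't want to follow the pastebin link, here is a
minimal sketch of the kind of ping-pong measurement described above.
It is not the posted program: the names (pingPong, ITERATIONS), the
affinity masks and the iteration count are illustrative, and the
subtraction of the unsuccessful-CMPXCHG overhead that the real program
performs is only hinted at in a comment. It assumes MSVC++ on Windows
(InterlockedCompareExchange, __rdtsc, SetThreadAffinityMask).

#include <windows.h>
#include <intrin.h>
#include <thread>
#include <cstdio>

static volatile LONG g_value = 0;        // shared DWORD both threads fight over
static const long    ITERATIONS = 1000000;

struct Result { unsigned long long ticks; long swaps; };

// Each thread increments g_value only when it sees its own parity
// (0 = "even" thread, 1 = "odd" thread), so the cacheline ping-pongs
// between the two cores given by the affinity masks.
static Result pingPong(int parity, DWORD_PTR affinityMask)
{
    SetThreadAffinityMask(GetCurrentThread(), affinityMask);
    long swaps = 0;
    unsigned long long start = __rdtsc();
    for (long i = 0; i < ITERATIONS; ++i)
    {
        LONG seen = g_value;
        if ((seen & 1) == parity &&
            InterlockedCompareExchange(&g_value, seen + 1, seen) == seen)
            ++swaps;
    }
    return { __rdtsc() - start, swaps };
}

int main()
{
    Result r0{}, r1{};
    std::thread t0([&] { r0 = pingPong(0, DWORD_PTR(1) << 0); });
    std::thread t1([&] { r1 = pingPong(1, DWORD_PTR(1) << 1); });
    t0.join();
    t1.join();
    // The real program additionally measures the cost of the unsuccessful
    // CMPXCHGs on the main thread alone and subtracts it, leaving only the
    // cacheline-transfer time of the successful swaps.
    double avgTicks = (double)(r0.ticks + r1.ticks) / 2.0;
    long   avgSwaps = (r0.swaps + r1.swaps) / 2;
    printf("avg ticks per successful swap (uncorrected): %.3f\n",
           avgSwaps ? avgTicks / avgSwaps : 0.0);
}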
Bonita Montero <Bonita.Montero@gmail.com>: May 07 04:54PM +0200

There was a little flaw in the PingPong function.
Here is the corrected code:
https://pastebin.com/Ya3TFcB5
