- Capitalism is changing - 1 Update
- Coroutines or threads ? - 2 Updates
rami17 <rami17@rami17.net>: May 07 05:25PM -0400 Hello... Capitalism is changing from commodity capital to intellectual capital. Intellectual capital is where you will have more money and it is where you will become richer and more prosperous. See the following video: https://www.youtube.com/watch?v=7qYmZPP4AO8 Thank you, Amine Moulay Ramdane. |
Bonita Montero <Bonita.Montero@gmail.com>: May 07 04:16PM +0200 Ok, you're an idiot on the one side, but on the other side you weren't even right in this case; you even underestimated the time of a cacheline transfer. I wrote a little program that measures the time of a 32-bit LOCK CMPXCHG when the cacheline was in another L1 cache before (or was changed by a sibling thread on SMT). Here it is: https://pastebin.com/M5W3tvHd

There are two spawned threads in this program. Both check a DWORD value for whether it is even or odd. The one looking for even values increments it through LOCK CMPXCHG (InterlockedCompareExchange maps directly to the intrinsic when you use MSVC++) when it is even, the other when it is odd. So both play ping-pong with the cacheline. The time to do this is taken with RDTSC (ok, RDTSC isn't absolutely accurate on my Ryzen 1800X due to XFR). Each thread does a fixed number of iterations and counts the successful swaps. Both threads signal their number of ticks and successful swaps to the main thread.

The main thread then plays the "ping-pong" alone with itself for the number of iterations minus the number of successful swaps, but does only unsuccessful CMPXCHGs, so that it gets the overhead of the unsuccessful CMPXCHGs in both threads. This overhead is the time of all iterations with unsuccessful CMPXCHGs. It is subtracted from the average time of both threads, so that we get the pure time spent doing successful CMPXCHGs, i.e. the time mostly spent transferring the cacheline between the two cores. The program tests the overhead of core 0 versus all other cores in the system.
With my Ryzen, the overhead (in ticks) looks like this:

processor 0 and processor 1: 117.606
processor 0 and processor 2: 146.972
processor 0 and processor 3: 159.324
processor 0 and processor 4: 148.491
processor 0 and processor 5: 150.407
processor 0 and processor 6: 175.156
processor 0 and processor 7: 198.427
processor 0 and processor 8: 810.89
processor 0 and processor 9: 806.403
processor 0 and processor 10: 799.987
processor 0 and processor 11: 796.437
processor 0 and processor 12: 795.243
processor 0 and processor 13: 797.405
processor 0 and processor 14: 795.879
processor 0 and processor 15: 792.173

As we can see, there's a significant overhead even for core 0 with its SMT sibling, and the overhead increases from core 1 to core 2, i.e. where transfers between two cores become necessary. And there's a huge increase from core 7 to core 8 because the Ryzen is organized in two CCX modules connected by a crossbar that isn't able to transfer faster than the RAM (regarding both throughput and latency). |
Bonita Montero <Bonita.Montero@gmail.com>: May 07 04:54PM +0200 There was a little flaw in the PingPong function. Here is the corrected code: https://pastebin.com/Ya3TFcB5 |
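[Moderator's note: the even/odd CMPXCHG ping-pong described above can be sketched in portable C++ roughly as follows. This is a minimal approximation, not the poster's program: it uses std::atomic::compare_exchange_strong instead of the raw LOCK CMPXCHG intrinsic, std::chrono::steady_clock instead of RDTSC, and omits the per-core affinity pinning and the unsuccessful-CMPXCHG overhead subtraction; the names measurePingPong and PingPongResult are illustrative.]

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>

// Result of one ping-pong run between two threads.
struct PingPongResult {
    std::uint64_t swapsEven = 0;  // successful swaps by the "even" thread
    std::uint64_t swapsOdd  = 0;  // successful swaps by the "odd" thread
    double nsPerSwap = 0.0;       // rough wall-clock cost per successful swap
};

// Two threads share one cacheline-resident counter. Each thread only
// compare-exchanges the value when it has "its" parity, so every
// successful swap hands the cacheline over to the other thread.
PingPongResult measurePingPong(std::uint64_t iterations)
{
    std::atomic<std::uint32_t> value{0};
    PingPongResult r;

    auto worker = [&value, iterations](unsigned parity, std::uint64_t &swaps) {
        for (std::uint64_t i = 0; i < iterations; ++i) {
            std::uint32_t expected = value.load(std::memory_order_relaxed);
            if ((expected & 1u) == parity &&
                value.compare_exchange_strong(expected, expected + 1))
                ++swaps;  // count only successful exchanges
        }
    };

    auto t0 = std::chrono::steady_clock::now();
    std::thread evenThread(worker, 0u, std::ref(r.swapsEven));
    std::thread oddThread (worker, 1u, std::ref(r.swapsOdd));
    evenThread.join();
    oddThread.join();
    auto t1 = std::chrono::steady_clock::now();

    std::uint64_t totalSwaps = r.swapsEven + r.swapsOdd;
    if (totalSwaps)
        r.nsPerSwap =
            std::chrono::duration<double, std::nano>(t1 - t0).count()
            / static_cast<double>(totalSwaps);
    return r;
}
```

To reproduce the per-core table above, the original program additionally pins the two threads to fixed cores for each pairing (on Windows, something like SetThreadAffinityMask) and repeats the measurement for core 0 against every other core; without pinning, the scheduler decides which cores share the cacheline.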
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page. To unsubscribe from this group and stop receiving emails from it send an email to comp.programming.threads+unsubscribe@googlegroups.com. |