aminer68@gmail.com: Dec 12 01:44PM -0800 Hello, I am a white Arab who has lived in Canada since 1989, so read my following post about Canada: Canada ranked 3rd best country in the world for education. Read more here: https://dailyhive.com/toronto/canada-ranked-best-country-education-2019 Canada is also becoming very attractive with its remarkable economy; read more to see it: Why so many Silicon Valley companies are moving to Vancouver, Canada https://www.straight.com/tech/1261681/why-so-many-silicon-valley-companies-are-moving-vancouver And look at this interesting video about the Canadian economy: The Remarkable Economy of Canada https://www.youtube.com/watch?v=uRnIHyI02EU Thank you, Amine Moulay Ramdane. |
aminer68@gmail.com: Dec 12 12:33PM -0800 Hello, More precision about the Linux expedited sys_membarrier() and the Windows FlushProcessWriteBuffers(); read again: I have just read the following webpage: https://lwn.net/Articles/636878/ It is interesting and it says:
---
Results in liburcu:
Operations in 10s, 6 readers, 2 writers:
memory barriers in reader: 1701557485 reads, 3129842 writes
signal-based scheme: 9825306874 reads, 5386 writes
sys_membarrier expedited: 6637539697 reads, 852129 writes
sys_membarrier non-expedited: 7992076602 reads, 220 writes
---
Look at how powerful "sys_membarrier expedited" is. As you have noticed, I have already implemented my scalable asymmetric RWLocks that use the Windows FlushProcessWriteBuffers(); they are called Fast_RWLockX and LW_Fast_RWLockX. They are limited to 400 threads by default, but you can manually extend the maximum number of threads by setting the NbrThreads parameter of the constructor. You have to start your threads once and for all and keep working with them; do not start a new thread each time and then exit from it. Fast_RWLockX and LW_Fast_RWLockX do not use any atomic operations or StoreLoad-style memory barriers on the reader side, so they are scalable and very fast, and I will soon port them to Linux, where they will support both the expedited and non-expedited sys_membarrier. You can download my inventions of scalable asymmetric RWLocks, which use IPIs and are costless on the reader side, from here: https://sites.google.com/site/scalable68/scalable-rwlock Cache-coherency protocols do not use IPIs, and as a user-space developer you normally do not care about IPIs at all; one is mostly interested in the cost of cache coherency itself. However, the Win32 API provides a function that issues IPIs to all processors (in the affinity mask of the current process): FlushProcessWriteBuffers(). You can use it to investigate the cost of IPIs.
The FlushProcessWriteBuffers() API does the following:
- Implicitly executes a full memory barrier on all other processors.
- Generates an interprocessor interrupt (IPI) to all processors that are part of the current process affinity.
- Uses the IPI to "synchronously" signal all processors.
- Guarantees the visibility of write operations performed on one processor to the other processors.
- It is supported since Windows Vista and Windows Server 2008.
To investigate the cost of IPIs, I ran a simple synthetic test on a quad-core machine and obtained the following numbers:
- 420 cycles: minimum cost of FlushProcessWriteBuffers() on the issuing core.
- 1600 cycles: mean cost of FlushProcessWriteBuffers() on the issuing core.
- 1300 cycles: mean cost of FlushProcessWriteBuffers() on a remote core.
Note that, as far as I understand, the function issues an IPI to the remote core, the remote core acks it with another IPI, and the issuing core waits for the ack IPI before returning. The IPIs also have the indirect cost of flushing the processor pipeline. Thank you, Amine Moulay Ramdane. |
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page. To unsubscribe from this group and stop receiving emails from it send an email to comp.programming.threads+unsubscribe@googlegroups.com. |