comp.programming.threads@googlegroups.com | Google Groups
- I have just read the following about Intel TSX - 4 Updates
- About my new algorithm - 2 Updates
- About my previous post - 1 Update
- Here is what's important about reader-writer algorithms... - 2 Updates
- Here is my new algorithm of a scalable distributed sequential lock - 2 Updates
Ramine <ramine@1.1>: Dec 06 05:04PM -0800

Hello,

I have just read the following about Intel TSX:

"At the moment, TSX is quite slow, even if there's no contention and you don't do anything in the block. There's a lot of overhead. Using TSX naively may slow down even threaded code. Getting significant performance gains from it is non-trivial."

Read more here, a little way down the page:

http://cbloomrants.blogspot.ca/search?updated-min=2014-01-01T00:00:00-08:00&updated-max=2015-01-01T00:00:00-08:00&max-results=43

So I think my newer scalable distributed sequential lock is faster than Intel TSX in read-mostly scenarios. My newer algorithm is also a kind of general synchronization mechanism, because it works with both memory and hard disk.

Thank you,
Amine Moulay Ramdane.
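For readers who have not used TSX, a minimal RTM (Restricted Transactional Memory) sketch in C++ looks roughly like the following. The fallback spinlock and the shared counter are illustrative placeholders rather than code from the post, but even this nearly empty transaction pays the _xbegin()/_xend() round trip that the quoted text describes as overhead.

#include <immintrin.h>   // _xbegin, _xend, _xabort, _XBEGIN_STARTED (gcc/clang: compile with -mrtm)
#include <atomic>

std::atomic<bool> g_fallback_locked{false};  // illustrative fallback spinlock flag
long g_counter = 0;                          // illustrative shared data

void increment_with_rtm()
{
    unsigned status = _xbegin();                     // try to start a hardware transaction
    if (status == _XBEGIN_STARTED) {
        if (g_fallback_locked.load(std::memory_order_relaxed))
            _xabort(0xff);                           // a fallback holder is active: abort
        ++g_counter;                                 // runs transactionally
        _xend();                                     // commit; this round trip is the overhead
        return;
    }
    // Transaction aborted (conflict, capacity, interrupt, ...): take the fallback lock.
    while (g_fallback_locked.exchange(true, std::memory_order_acquire))
        _mm_pause();
    ++g_counter;
    g_fallback_locked.store(false, std::memory_order_release);
}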
"Chris M. Thomasson" <no@spam.invalid>: Dec 06 02:42PM -0800 > "Ramine" wrote in message news:m5vufu$uk6$2@dont-email.me... > Hello, > I have just read the following about Intel TSX: [...] why do you insist on multi-posting? ;^( |
Melzzzzz <mel@zzzzz.com>: Dec 07 01:21AM +0100

On Sat, 6 Dec 2014 14:42:53 -0800

> [...]
> why do you insist on multi-posting?
> ;^(

I believe it is a restriction of his news server.
"Chris M. Thomasson" <no@spam.invalid>: Dec 06 04:31PM -0800 > > why do you insist on multi-posting? > > ;^( > I believe it is restriction of his news server. That's a shame. ;^( |
"Chris M. Thomasson" <no@spam.invalid>: Dec 06 01:45PM -0800 > I don't quite understand the "or" part of your "and/or memory barrier" > statement. If there was a different algorithm which used only memory > barrier, but not RMW, would that be so much worse than RCU? Yes. Anytime you can reduce the number of memory barriers and/or atomics, the better. I have tested RCU verses the same RCU with a single MFENCE instruction on the reader side. The one without the membar slaughtered it wrt reads-per-second per-thread. > But then, if that's correct, that means that modern CPUs (whichever they > are) still don't have features that old Alpha CPUs have. And not the other > way round. Yup. On a DEC Alpha, RCU simply needs a membar for it does not honor data-dependent loads. RCU relies on data-dependences being there. |
"Chris M. Thomasson" <no@spam.invalid>: Dec 06 01:56PM -0800 > news:m5vteu$dp3$1@speranza.aioe.org... > > "Drazen Kacar" wrote in message > > news:slrnm84di2.je0.dave@fly.srk.fer.hr... [...] > Yup. On a DEC Alpha, RCU simply needs a membar for it does not honor > data-dependent loads. RCU relies on data-dependences being there. AFAICT, this is why `std::memory_order_consume' exists. |
Ramine <ramine@1.1>: Dec 06 03:51PM -0800

Hello,

I wrote this in my previous post:

"I have benchmarked the distributed reader-writer lock of Dmitry Vyukov and i have found that it is so expensive that it takes over 16 or 32 cores to make it scale as Seqlock with 4 cores"

You will say that in computer science we need proof of what I am claiming above. Here is my reasoning, using Amdahl's law:

The distributed reader-writer mutex of Dmitry Vyukov, when built over a rwlock, does not eliminate the full memory barrier on the reader side of that RWLock. This full memory barrier takes about 350 CPU cycles on my quad-core Q6600, and I think it takes around the same number of cycles on an Intel i7.

I have just run a parallel benchmark comparing a Seqlock against Dmitry Vyukov's distributed reader-writer mutex (using a rwlock) on x86. In a read-mostly scenario it takes 25 microseconds with the Seqlock and around 165 microseconds with the Dmitry Vyukov distributed reader-writer mutex. It takes 165 microseconds because it makes an expensive call to the Windows GetCurrentProcessorNumber() function, it uses full memory fences on the reader side, and it runs more code on the reader side.

This GetCurrentProcessorNumber() call and the full memory fences in Dmitry Vyukov's algorithm belong to the parallel part of Amdahl's law, so they will scale, but they are so expensive compared to the Seqlock that it takes 16 to 32 cores, on small to medium reader section sizes, to match a Seqlock on 4 cores; just compare the 165 microseconds with the 25 microseconds, there is a big difference. That is my proof. So the distributed reader-writer mutex using a rwlock is scalable, but it is much slower than a Seqlock or RCU in read-mostly scenarios. This is why I have invented my scalable distributed sequential lock, which beats the Seqlock and is as fast and as scalable as both Seqlock and RCU in read-mostly scenarios.

You can download my scalable distributed sequential lock version 1.1 from:

https://sites.google.com/site/aminer68/scalable-distributed-sequential-lock

Thank you,
Amine Moulay Ramdane.
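As a rough back-of-the-envelope reading of the numbers above (an illustrative calculation, not part of the original benchmark): 165 / 25 = 6.6, so the distributed reader-writer mutex costs about 6.6 times more per read pass than the Seqlock. If both reader sides scale roughly linearly with core count in the read-mostly case, matching 4 Seqlock cores would take on the order of 4 x 6.6 ≈ 26 cores, which is consistent with the quoted 16-to-32-core range.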
Ramine <ramine@1.1>: Dec 06 02:51PM -0800

Hello,

Here is what's important about reader-writer algorithms...

I have thought more about this subject and I think I have mastered, to a certain level, the subject of synchronization and reader-writer algorithms...

You have to be careful, my dear programmers and hardware engineers: if you take a look at the Pthread reader-writer lock, it atomically increments a shared variable on the reader side, and that makes it inefficient. On x86, for example, when you atomically increment a variable with a "lock add" in assembler, or with atomics, that "lock add" or atomic operation implies a full memory barrier, which is costly; I have measured a full memory barrier on x86 and it takes over 300 CPU cycles, and that's too costly. If we add to this the time to transfer the cache line(s) between cores it gets even more costly, and this "lock add" or atomic operation belongs to the serial part of Amdahl's law, so it will not scale. That's the weakness of the Pthread reader-writer lock, and it makes it really inefficient.

So what has Dmitry Vyukov done with his distributed reader-writer algorithm here:

http://www.1024cores.net/home/lock-free-algorithms/reader-writer-problem/distributed-reader-writer-mutex

Dmitry Vyukov has used a distributed mechanism. This distributed mechanism lets the "lock add" in assembler, or the atomics, on the reader side of the RWLock used by the distributed algorithm be run by only the group of threads belonging to the same core, which moves the "lock add" or atomics into the parallel part of Amdahl's law, so his distributed algorithm allows the "lock add" or atomics to scale effectively. But there is still a weakness in this distributed algorithm of Dmitry Vyukov: his algorithm is slow, because the "lock add" or atomics on the reader side of the RWLock used by the distributed algorithm issue a full memory barrier on every call, and this full memory barrier is expensive. I have benchmarked it and found it so expensive that it takes over 16 or 32 cores to make it scale like a Seqlock with 4 cores, even though this full memory barrier is part of the parallel part of Amdahl's law in the distributed algorithm. So I think the Dmitry Vyukov distributed reader-writer lock is inefficient, up to 32 cores, compared to the Seqlock.

Now what about the Seqlock? The Seqlock doesn't use atomic read-modify-write operations on the reader side, so it is very fast and scalable in read-mostly scenarios; it is as fast and as scalable as RCU in read-mostly scenarios. But the weakness of the Seqlock is that it can starve and livelock when a greater percentage of writers is used. This is why I have invented my scalable distributed sequential lock: it is as fast and as scalable as the Seqlock, and it avoids the weakness of the Seqlock because it doesn't starve or livelock when a greater percentage of writers is used.

Hope you have understood what I have just explained to you, because it is very important.

You can download my scalable distributed sequential lock version 1.1 from:

https://sites.google.com/site/aminer68/scalable-distributed-sequential-lock

Thank you,
Amine Moulay Ramdane.
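To make the reader-side difference concrete, here is a minimal Seqlock sketch in C++11 (the variable names are illustrative, and writers are assumed to be serialized externally). The reader performs only plain loads plus an acquire fence and retries while the sequence number is odd or has changed; there is no "lock add" style read-modify-write on the read path, and the retry loop is also where the starvation/livelock risk under heavy writing comes from.

#include <atomic>

std::atomic<unsigned> g_seq{0};    // even = stable, odd = writer active (illustrative)
std::atomic<int>      g_data{0};   // relaxed atomic keeps the sketch free of data races

int seqlock_read()
{
    unsigned s1, s2;
    int value;
    do {
        s1 = g_seq.load(std::memory_order_acquire);        // sequence before the read
        value = g_data.load(std::memory_order_relaxed);    // plain load, no locked RMW
        std::atomic_thread_fence(std::memory_order_acquire);
        s2 = g_seq.load(std::memory_order_relaxed);        // sequence after the read
    } while ((s1 & 1u) || s1 != s2);                       // retry if a writer interfered
    return value;
}

void seqlock_write(int v)
{
    // Writers must be mutually excluded (e.g. by a spinlock) outside this sketch.
    g_seq.fetch_add(1, std::memory_order_relaxed);          // odd: write in progress
    std::atomic_thread_fence(std::memory_order_release);
    g_data.store(v, std::memory_order_relaxed);
    g_seq.fetch_add(1, std::memory_order_release);          // even again: write visible
}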
Ramine <ramine@1.1>: Dec 06 02:56PM -0800

On 12/6/2014 2:51 PM, Ramine wrote:

> memory barrier is expensive , i have benchmarked it and i have
> found that it is so expensive that it takes over 16 or 32 cores to make
> it scale as Seqlock with 4 cores,

I have done this benchmark for small to medium reader sections.
Ramine <ramine@1.1>: Dec 06 01:43PM -0800

Hello,

As I have promised, here is version 1.1 of my newer algorithm, a scalable distributed sequential lock that is competitive with RCU and that beats the Seqlock on some characteristics, because it doesn't starve and it doesn't livelock when a greater percentage of writers is used.

Please take a look at the source code; the RLock() and RUnlock() methods have changed. What I am doing in RLock() is taking a copy of the "FCount6^.fcount6" counter that is incremented on the writer side, and taking this counter modulo the number of cores; when this modulo is equal to 0, I switch to a distributed algorithm on the reader side. In the RUnlock() method I test whether FCount6^.fcount6 is still equal to the old copy of FCount6^.fcount6 that had the modulo equal to 0: if they are equal, I exit by unlocking the distributed reader-writer lock; if not, I stay in the Seqlock mode.

My newer algorithm has allowed me to avoid memory barriers and atomics on the reader side, so it has become competitive with RCU. I think it can replace RCU, and it beats RCU on some characteristics, such as that it doesn't starve or livelock when there is a greater percentage of writers.

About the sequential consistency of my scalable distributed sequential lock: my algorithm works on the x86 architecture, and I think it is correct. Look at the source code of the WLock() method: since I am using a Ticket spinlock with proportional backoff on the writer side, and the Ticket spinlock uses a "lock add" assembler instruction to increment a counter in its Enter() method, and this "lock add" instruction is a barrier for both stores and loads on x86, the WLock() method is sequentially consistent and correct. Now look at WUnlock(): we don't need an "sfence", because stores are not reordered with other stores on x86, so the WUnlock() method is sequentially consistent and correct. Now look at the RLock() method: the loads inside RLock() are not reordered with the loads of the reader section, and in RUnlock(), the loads of RUnlock() are not reordered with the older loads of the critical section. So, all in all, I think my algorithm is sequentially consistent and correct on x86.

So be confident, because I have reasoned carefully, and I think my algorithm is correct and is a powerful synchronization mechanism that can replace RCU and that can replace the Seqlock, because it beats the Seqlock.

You can download my scalable distributed sequential lock version 1.1 from:

https://sites.google.com/site/aminer68/scalable-distributed-sequential-lock

Thank you,
Amine Moulay Ramdane.
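A rough C++-style paraphrase of the reader-side control flow described above may help; everything here (ReaderState, fcount6, number_of_cores, the helper calls) is a placeholder standing in for the real state, and the authoritative logic is in the version 1.1 source at the link.

#include <atomic>

// Placeholders standing in for the real lock's state -- not the actual source.
static std::atomic<unsigned> fcount6{0};      // writer-side counter (FCount6^.fcount6 in the post)
static const unsigned number_of_cores = 4;    // assumption: a quad-core machine, as in the benchmarks

enum Mode { SEQLOCK_MODE, DISTRIBUTED_MODE };

struct ReaderState {
    unsigned snapshot = 0;        // copy of the writer-side counter taken in RLock()
    Mode mode = SEQLOCK_MODE;
};

// Placeholder hooks for the two reader paths the post describes.
static void enter_distributed_reader()  { /* per-core reader lock would be taken here */ }
static void distributed_runlock()       { /* per-core reader lock would be released here */ }
static void seqlock_validate_or_retry() { /* Seqlock-style re-validation would go here */ }

void RLock(ReaderState& s)
{
    s.snapshot = fcount6.load();                 // copy the writer-incremented counter
    if (s.snapshot % number_of_cores == 0) {
        s.mode = DISTRIBUTED_MODE;               // modulo == 0: use the distributed reader path
        enter_distributed_reader();
    } else {
        s.mode = SEQLOCK_MODE;                   // otherwise read in Seqlock fashion
    }
}

void RUnlock(ReaderState& s)
{
    if (s.mode == DISTRIBUTED_MODE && fcount6.load() == s.snapshot)
        distributed_runlock();                   // counter unchanged: release the distributed reader lock
    else
        seqlock_validate_or_retry();             // otherwise stay in Seqlock mode
}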
Ramine <ramine@1.1>: Dec 06 02:07PM -0800

On 12/6/2014 1:43 PM, Ramine wrote:

> has allowed me to avoid memory barriers and atomics on the reader side,
> so it has become competitive with RCU , so i think it can replace RCU,
> and it beats RCU on some characteristics such as it doesn't starve or

I mean it beats Seqlock.