- About my parallel algorithm and NUMA... - 1 Update
- More about my scalable SeqlockX algorithm - 1 Update
- About my new SeqlockX algorithm - 2 Updates
- Scalable SeqlockX is here... - 1 Update
Ramine <ramine@1.1>: Mar 06 09:48PM -0800 Hello... We have to be smart, so follow with me please. As you have noticed, I have implemented and invented a parallel Conjugate Gradient linear system solver library. Here it is: https://sites.google.com/site/aminer68/scalable-parallel-implementation-of-conjugate-gradient-linear-system-solver-library-that-is-numa-aware-and-cache-aware

My parallel algorithm is scalable on NUMA architecture, but you have to understand my way of designing NUMA-aware parallel algorithms. The first way of implementing a NUMA-aware parallel algorithm is to implement a threadpool that schedules a job on a given thread by specifying the NUMA node explicitly, depending on which NUMA node's memory you will process; this way buys you 40% more throughput on NUMA architecture. The other way is to use a classical threadpool without specifying the NUMA node explicitly, and instead divide your parallel memory processing between the NUMA nodes (see the sketch below). This is the way I have implemented my NUMA-aware parallel algorithms: it is scalable on NUMA architecture but gives 40% less throughput. Even with 40% less throughput, I think my NUMA-aware parallel algorithms are scalable on NUMA architecture and still good enough. My next parallel sort library will also be scalable on NUMA architecture.

Where did I get this 40% figure? Please read here: "Performance impact: the cost of NUMA remote memory access. For instance, this Dell whitepaper has some test results on the Xeon 5500 processors, showing that local memory access can have 40% higher bandwidth than remote memory access, and the latency of local memory access is around 70 nanoseconds whereas remote memory access has a latency of about 100 nanoseconds." Read more here: http://sqlblog.com/blogs/linchi_shea/archive/2012/01/30/performance-impact-the-cost-of-numa-remote-memory-access.aspx

As you have noticed, in my NUMA-aware algorithms I am using my classical threadpool and I am not scheduling the jobs on an explicitly specified NUMA node; I am dividing the parallel memory processing between the NUMA nodes. By doing so you get a scalable algorithm, with 40% less throughput than a more optimized parallel algorithm whose threadpool schedules jobs on an explicit NUMA node so as to best avoid remote memory accesses, which would give you the 40% more throughput. But I think my NUMA-aware parallel algorithms that use a classical threadpool are still good enough, and if you need me to optimize my threadpool further to get that 40% more throughput, I will do it as a next project.
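[Editor's note] To illustrate the second way, here is a minimal FreePascal sketch of dividing the memory processing between the NUMA nodes using plain worker threads; the node count, the TWorker class and the doubling loop are illustrative assumptions, not code from the library:

    program NumaSplitSketch;
    {$mode delphi}
    uses
      {$ifdef unix}cthreads,{$endif}
      Classes, SysUtils, Math;

    const
      NumaNodes = 4;       { assumed node count; a real program would query the OS }
      N = 1000000;

    var
      Data: array of Double;  { the memory whose processing is divided between nodes }

    type
      TWorker = class(TThread)
      private
        FLo, FHi: Integer;    { half-open range [FLo, FHi) owned by this worker }
      protected
        procedure Execute; override;
      public
        constructor Create(ALo, AHi: Integer);
      end;

    constructor TWorker.Create(ALo, AHi: Integer);
    begin
      FLo := ALo;
      FHi := AHi;
      inherited Create(False);  { start immediately }
    end;

    procedure TWorker.Execute;
    var
      i: Integer;
    begin
      { each worker processes only its own contiguous partition; a real
        NUMA-aware version would also make sure each partition's pages are
        allocated or first touched on the worker's own node }
      for i := FLo to FHi - 1 do
        Data[i] := Data[i] * 2.0;
    end;

    var
      Workers: array[0 .. NumaNodes - 1] of TWorker;
      i, chunk: Integer;
    begin
      SetLength(Data, N);
      chunk := (N + NumaNodes - 1) div NumaNodes;
      for i := 0 to NumaNodes - 1 do
        Workers[i] := TWorker.Create(i * chunk, Min((i + 1) * chunk, N));
      for i := 0 to NumaNodes - 1 do
      begin
        Workers[i].WaitFor;
        Workers[i].Free;
      end;
      WriteLn('processed ', N, ' elements in ', NumaNodes, ' partitions');
    end.

Thank you, Amine Moulay Ramdane. |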
Ramine <ramine@1.1>: Mar 06 08:45PM -0800 Hello, I think my new scalable SeqlockX algorithm is the right tool to use if you want to replace scalable RWLocks or Dmitry Vyukov's distributed reader-writer mutex; it can even replace RCU. It has good characteristics: it doesn't livelock the readers when there are many writers, it is extremely fast, and it is more scalable than Dmitry Vyukov's distributed reader-writer mutex as the core count grows, so I think you should also port it to C++, Java and C#. As you have noticed, I have currently implemented my algorithm for the Delphi and FreePascal compilers. You can download my scalable SeqlockX and read about the algorithm from: https://sites.google.com/site/aminer68/scalable-seqlockx Thank you for your time. |
Ramine <ramine@1.1>: Mar 06 08:06PM -0800 Hello, My new SeqlockX algorithm has eliminated a weakness of the classical Seqlock algorithm: it has completely eliminated the livelock when there are many writers. This was my improvement over the classical Seqlock algorithm. I have just benchmarked my new scalable SeqlockX algorithm, and I have noticed that it is much more scalable and much faster than Dmitry Vyukov's distributed reader-writer mutex. Thank you, Amine Moulay Ramdane. |
Ramine <ramine@1.1>: Mar 06 08:19PM -0800 On 3/6/2015 8:06 PM, Ramine wrote: > Hello, > My new SeqlockX algorithm has eliminated a weakness of the classical > Seqlock algorithm: it has completely eliminated the livelock when there I mean it has completely eliminated the livelock of the readers when there are many writers... |
Ramine <ramine@1.1>: Mar 06 08:02PM -0800 Hello,

Scalable SeqlockX version 1.0

Author: Amine Moulay Ramdane.

Description:

This scalable Seqlock variant was invented by Amine Moulay Ramdane. It is much faster and more powerful than Dmitry Vyukov's distributed reader-writer mutex, because Dmitry Vyukov's distributed reader-writer lock becomes slower and slower on the writer side with more and more cores, since it transfers too many cache lines between cores on the writer side. This invention has eliminated that weakness of Dmitry Vyukov's distributed reader-writer mutex, so the writer throughput has become very fast. And my scalable SeqlockX eliminates the weakness of the classical Seqlock (sequential lock), which is livelock: it avoids livelocking the readers when there are many writers. So my scalable SeqlockX is a powerful synchronization mechanism that can replace RCU, the classical Seqlock, and Dmitry Vyukov's distributed reader-writer mutex.

If you look at my algorithm you will notice that on the reader side it looks like a classical sequential lock, but I have added a variable called FCount2^.FCount2 that allows the readers to not livelock when there are many writers. This scalable SeqlockX is faster and more scalable than Dmitry Vyukov's distributed reader-writer mutex, and it beats the classical Seqlock because it avoids livelock when there are many writers.

Please look at the "test.pas" example that I have included: you will notice that the reader side has to be used in a transactional mode with a repeat-until loop, like this:

repeat t.rlock(id1); until t.runlock(id1);

It is like a transactional mode of a Seqlock, and id1 must be of type "long", which must be imported from the SeqlockX unit. The writer side is easy to program; please look at the "test.pas" example and you will understand how to use my scalable distributed sequential lock (a fuller calling sketch follows below). My scalable SeqlockX can be used from across processes and threads.

My new invention called scalable SeqlockX beats the following algorithm of Dmitry Vyukov called Distributed reader-writer mutex, being more scalable and faster: http://www.1024cores.net/home/lock-free-algorithms/reader-writer-problem/distributed-reader-writer-mutex

And my new invention called scalable SeqlockX also beats the classical Seqlock (I have just explained to you why); read about it here: http://en.wikipedia.org/wiki/Seqlock

And this new invention called scalable SeqlockX can replace RCU, because it is a powerful synchronization mechanism.
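[Editor's note] For illustration, here is a minimal calling sketch built from the repeat-until pattern quoted above. The unit name (SeqlockX), the constructor's name parameter, and the parameterless WLock/WUnlock calls are assumptions inferred from this description; the real test.pas may differ, and this only compiles with the downloaded SeqlockX unit:

    program SeqlockXUsageSketch;
    {$mode delphi}
    uses
      SeqlockX;  { the downloaded unit; provides TSeqlockX and the "long" type }

    var
      t: TSeqlockX;
      id1: long;                 { reader transaction id, type from the SeqlockX unit }
      shared, snapshot: Integer; { example data protected by the seqlock }
    begin
      { the first constructor parameter names the object across process
        boundaries, as described in this post; exact signature assumed }
      t := TSeqlockX.Create('MySeqlockXObject');

      { writer side: method names from the description, parameter lists assumed }
      t.WLock;
      shared := 123;
      t.WUnlock;

      { reader side: the transactional repeat-until pattern shown above }
      repeat
        t.RLock(id1);
        snapshot := shared;  { read the protected data inside the transaction }
      until t.RUnlock(id1);

      WriteLn('reader saw: ', snapshot);
      t.Free;
    end.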
I will now explain my new algorithm of a scalable distributed sequential lock, which is very powerful... I have arrived at my new algorithm by reading many PhD papers and by looking first at the weakness of the following algorithm, Dmitry Vyukov's distributed reader-writer mutex: http://www.1024cores.net/home/lock-free-algorithms/reader-writer-problem/distributed-reader-writer-mutex

Look at the writer side of Dmitry Vyukov's algorithm; it acquires the per-core rwlock of every core in turn:

for (i = 0; i != mtx->proc_count; i += 1)
  pthread_rwlock_wrlock(&mtx->cell[i].mtx);

But this renders the algorithm inefficient on the writer side, because every pthread_rwlock_wrlock() call transfers many cache lines between cores, so each writer executing distr_rw_mutex_wrlock() generates too many cache-line transfers between cores, and this gets worse and worse as you add more and more cores: more and more cache lines are transferred between cores. This makes the writer-side throughput slow, and this is the weakness of that algorithm.

So this is why I have used a sequential-lock mechanism on the writer side of my new algorithm, to efficiently minimize the cache-line transfers. If you look at my algorithm, it adds the following to the classical Seqlock: on the reader side, if the result of the TSeqlockX.RUnlock() method is "false", which means the reader transaction has failed, I put 1 into the FCount2^.FCount2 variable, and if it is "true", which means the reader transaction has succeeded, I put 0 into the FCount2^.FCount2 variable. And on the writer side, inside the TSeqlockX.WLock() method, I am doing this:

while FCount2^.FCount2 <> 0 do sleep(0);

So if FCount2^.FCount2 is equal to its initial value, the writer exits the while loop, and if FCount2^.FCount2 is equal to 1, the writer waits for TSeqlockX.RUnlock() to return true, that is, it waits for a reader transaction to succeed. This part of the algorithm is what allows my algorithm to avoid livelocking the readers when there are many writers; the rest of the algorithm looks like a classical Seqlock. That's all. (A skeleton of this mechanism is sketched at the end of this explanation.)

About the sequential consistency of my scalable distributed sequential lock: my algorithm works on the x86 architecture and compiles with the Delphi and FreePascal compilers, which both respect the strong memory model of x86. I think my algorithm is correct. Look at the source code of the WLock() method: since I am using a Ticket spinlock with proportional backoff on the writer side, and the Ticket spinlock uses a "lock add" assembler instruction to increment a counter in its Enter() method, and this "lock add" instruction is a barrier for stores and loads on x86, the loads of FCount2^.FCount2 are not reordered with the preceding atomic lock of the Ticket spinlock, and the stores of FCount5^.FCount5 are not reordered with older stores and older loads, so the WLock() method is sequentially consistent and correct. Now look at WUnlock(): we don't need an "sfence", because stores and loads in WUnlock() are not reordered with older stores or older loads on x86, so the WUnlock() method is sequentially consistent and correct. Now look at the RLock() method: it is also sequentially consistent. So, all in all, I think my algorithm is sequentially consistent and correct on x86 with the Delphi and FreePascal compilers.
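[Editor's note] To make the mechanism concrete, here is a minimal single-process skeleton of a Seqlock with this anti-livelock flag, reconstructed from the description above. It is not the actual SeqlockX source: the ticket spinlock is replaced by a critical section, the cross-process FCount2^ pointer indirection is omitted, and like the post's reasoning it relies on the strong x86 memory model (FPC-oriented):

    program SeqlockSketch;
    {$mode delphi}
    uses
      SysUtils, SyncObjs;

    type
      TSeqlockSketch = class
      private
        FLock: TCriticalSection; { stand-in for the real algorithm's ticket spinlock }
        FVersion: LongInt;       { classical seqlock counter; odd = write in progress }
        FCount2: LongInt;        { 1 = a reader transaction failed, 0 = one succeeded }
        FData: Integer;          { example protected data }
      public
        constructor Create;
        destructor Destroy; override;
        procedure WLock;
        procedure WUnlock;
        procedure RLock(var id: LongInt);
        function RUnlock(var id: LongInt): Boolean;
        property Data: Integer read FData write FData;
      end;

    constructor TSeqlockSketch.Create;
    begin
      inherited Create;
      FLock := TCriticalSection.Create;
      FVersion := 0;
      FCount2 := 0;
    end;

    destructor TSeqlockSketch.Destroy;
    begin
      FLock.Free;
      inherited Destroy;
    end;

    procedure TSeqlockSketch.WLock;
    begin
      FLock.Acquire;                   { serialize the writers }
      while FCount2 <> 0 do            { the anti-livelock wait described above: }
        Sleep(0);                      { wait until some reader transaction succeeds }
      InterLockedIncrement(FVersion);  { odd: a write is now in progress }
    end;

    procedure TSeqlockSketch.WUnlock;
    begin
      InterLockedIncrement(FVersion);  { even again: the write is finished }
      FLock.Release;
    end;

    procedure TSeqlockSketch.RLock(var id: LongInt);
    begin
      id := FVersion;  { snapshot the version (atomic aligned load on x86) }
    end;

    function TSeqlockSketch.RUnlock(var id: LongInt): Boolean;
    begin
      { the transaction succeeded only if no write was in progress at RLock
        and none completed between RLock and RUnlock }
      Result := ((id and 1) = 0) and (FVersion = id);
      if Result then
        FCount2 := 0   { a reader succeeded: release waiting writers }
      else
        FCount2 := 1;  { a reader failed: ask writers to hold off }
    end;

    var
      s: TSeqlockSketch;
      v, id: LongInt;
    begin
      s := TSeqlockSketch.Create;
      s.WLock;  s.Data := 42;  s.WUnlock;  { one write }
      repeat                               { one transactional read }
        s.RLock(id);
        v := s.Data;
      until s.RUnlock(id);
      WriteLn('read: ', v);
      s.Free;
    end.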
So I think you can be confident, because I have reasoned correctly, and I think my scalable SeqlockX algorithm is correct: it is a powerful synchronization mechanism that can replace RCU and that can replace the classical Seqlock, because it beats the classical Seqlock. You have to know also that the first parameter of the constructor is the name that identifies the scalable SeqlockX object across process boundaries.

You can download my scalable SeqlockX from: https://sites.google.com/site/aminer68/scalable-seqlockx

Language: FPC Pascal v2.2.0+ / Delphi 7+: http://www.freepascal.org/

Operating systems: Windows, Mac OSX, Linux on x86.

Required FPC switches: -O3 -Sd -dFPC -dFreePascal (-Sd for Delphi mode...)

Required Delphi switches: -$H+ -DDelphi

{$DEFINE CPU32} and {$DEFINE Windows32} for 32-bit systems

{$DEFINE CPU64} and {$DEFINE Windows64} for 64-bit systems

Thank you, Amine Moulay Ramdane. |
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page. To unsubscribe from this group and stop receiving emails from it send an email to comp.programming.threads+unsubscribe@googlegroups.com. |