Saturday, March 7, 2015

Digest for comp.programming.threads@googlegroups.com - 10 updates in 5 topics

Ramine <ramine@1.1>: Mar 06 09:48PM -0800

Hello...
 
 
We have to be smart, so please follow along with me...
 
As you have noticed, I have invented and implemented a parallel
conjugate gradient linear system solver library...
 
Here it is:
 
https://sites.google.com/site/aminer68/scalable-parallel-implementation-of-conjugate-gradient-linear-system-solver-library-that-is-numa-aware-and-cache-aware
 
My parallel algorithm is scalable on NUMA architecture...
 
But you have to understand my way of designing my NUMA-aware parallel
algorithms. The first way of implementing a NUMA-aware parallel
algorithm is to implement a threadpool that schedules a job on a given
thread by specifying the NUMA node explicitly, depending on which NUMA
node's memory you will be processing; this way will buy you 40% more
throughput on NUMA architecture. But there is another way of doing it:
use the classical threadpool without specifying the NUMA node
explicitly, and instead divide your parallel memory processing between
the NUMA nodes. This is the way I have implemented my parallel
algorithms that are NUMA-aware. My way of doing it is scalable on NUMA
architecture, but you will get 40% less throughput; even so, I think
that my NUMA-aware parallel algorithms are scalable on NUMA
architecture and still good enough. My next parallel sort library will
also be scalable on NUMA architecture.
 
Where did I get this 40% less throughput figure? Please read here:
 
"Performance impact: the cost of NUMA remote memory access
 
For instance, this Dell whitepaper has some test results on the Xeon
5500 processors, showing that local memory access can have 40% higher
bandwidth than remote memory access, and the latency of local memory
access is around 70 nanoseconds whereas remote memory access has a
latency of about 100 nanoseconds."
 
 
Read more here:
 
http://sqlblog.com/blogs/linchi_shea/archive/2012/01/30/performance-impact-the-cost-of-numa-remote-memory-access.aspx
 
 
As you have noticed, in my NUMA-aware algorithms I am using my
classical threadpool and I am not scheduling the jobs by specifying an
explicit NUMA node; instead I am dividing the parallel memory
processing between the NUMA nodes. By doing so you get a scalable
algorithm with 40% less throughput than if you design a more optimized
parallel algorithm and threadpool that schedules the jobs by specifying
a NUMA node explicitly, so as to avoid remote memory accesses on NUMA
nodes as much as possible, which gets you 40% more throughput. But my
NUMA-aware parallel algorithms that use a classical threadpool are also
good, I think, and still good enough; if you need me to optimize my
threadpool further so as to get 40% more throughput, I will do it as a
next project.
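 
To make the idea concrete, here is a minimal FreePascal sketch of this
way of dividing the memory processing between the NUMA nodes using
plain worker threads. It is only an illustration, not the code of my
library: the names NumaNodeCount, TotalElements and TChunkWorker are
illustrative, and a real implementation would submit the chunks to the
threadpool and rely on each node touching only its own chunk.
 
program NumaDivisionSketch;
 
{$mode delphi} // FPC: Delphi-mode syntax (or compile with the -Sd switch)
 
uses
  Classes;
 
const
  NumaNodeCount = 2;        // illustrative: the number of NUMA nodes
  TotalElements = 1000000;  // illustrative: the size of the data to process
 
type
  // One worker per NUMA node; each worker processes only its own
  // contiguous chunk, so most of its memory traffic stays on one node.
  TChunkWorker = class(TThread)
  private
    FLow, FHigh: Integer;
  public
    constructor Create(ALow, AHigh: Integer);
    procedure Execute; override;
  end;
 
constructor TChunkWorker.Create(ALow, AHigh: Integer);
begin
  FLow := ALow;
  FHigh := AHigh;
  inherited Create(False);  // start the thread immediately
end;
 
procedure TChunkWorker.Execute;
var
  i: Integer;
begin
  for i := FLow to FHigh do
  begin
    // process element i of the chunk assigned to this NUMA node
  end;
end;
 
var
  Workers: array[0..NumaNodeCount - 1] of TChunkWorker;
  Node, ChunkSize, LowIdx, HighIdx: Integer;
begin
  // Divide the whole range into one contiguous chunk per NUMA node.
  ChunkSize := TotalElements div NumaNodeCount;
  for Node := 0 to NumaNodeCount - 1 do
  begin
    LowIdx := Node * ChunkSize;
    if Node = NumaNodeCount - 1 then
      HighIdx := TotalElements - 1
    else
      HighIdx := LowIdx + ChunkSize - 1;
    Workers[Node] := TChunkWorker.Create(LowIdx, HighIdx);
  end;
  // Wait for all chunks to be processed.
  for Node := 0 to NumaNodeCount - 1 do
  begin
    Workers[Node].WaitFor;
    Workers[Node].Free;
  end;
end.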
 
 
 
Thank you,
Amine Moulay Ramdane.

Ramine <ramine@1.1>: Mar 06 08:45PM -0800

Hello,
 
 
I think my new scalable SeqlockX algorithm is the right tool to use if
you want to replace scalable RWLocks or Dmitry Vyukov's distributed
reader-writer mutex; it can even replace RCU. It has good
characteristics: it doesn't "livelock" the readers when there are many
writers, it is extremely fast, and it is more scalable than Dmitry
Vyukov's distributed reader-writer mutex as the number of cores grows.
So I think you should port it to C++, Java and C# as well.
 
As you have noticed, I have currently implemented my algorithm for the
Delphi and FreePascal compilers...
 
 
You can download my scalable SeqlockX and read about the algorithm from:
 
https://sites.google.com/site/aminer68/scalable-seqlockx
 
 
 
 
Thank you for your time.
Ramine <ramine@1.1>: Mar 06 08:06PM -0800

Hello,
 
 
 
My new SeqlockX algorithm has eliminated a weakness of the classical
Seqlock algorithm: it has completely eliminated the "livelock" when
there are many writers.
 
 
This was my improvement over the classical Seqlock algorithm.
 
 
I have just benchmarked my new scalable SeqlockX algorithm, and I have
noticed that it is much more scalable and much faster than Dmitry
Vyukov's distributed reader-writer mutex.
 
 
 
Thank you,
Amine Moulay Ramdane.
Ramine <ramine@1.1>: Mar 06 08:19PM -0800

On 3/6/2015 8:06 PM, Ramine wrote:
> Hello,
 
> My new SeqlockX algorithm has eliminated a weakness of the classical
> Seqlock algorithm: it has completely eliminated the "livelock" when there
 
I mean it has completely eliminated the livelock of the readers when
there are many writers...
 
 
 
Ramine <ramine@1.1>: Mar 06 08:02PM -0800

Hello,
 
 
Scalable SeqlockX version 1.0
 
 
Author: Amine Moulay Ramdane.
 
 
Description:
 
This scalable Seqlock variant was invented by Amine Moulay Ramdane. It
is much faster and more powerful than Dmitry Vyukov's distributed
reader-writer mutex, because Dmitry Vyukov's distributed reader-writer
lock becomes slower and slower on the writer side with more and more
cores, since it transfers too many cache lines between cores on the
writer side. This invention has eliminated that weakness of Dmitry
Vyukov's distributed reader-writer mutex, so that the writers'
throughput has become very fast. My scalable SeqlockX also eliminates
the weakness of the classical Seqlock (sequential lock), which is
"livelock": it avoids livelock when there are many writers. So my
scalable SeqlockX is a powerful synchronization mechanism that can
replace RCU, the classical Seqlock, and Dmitry Vyukov's distributed
reader-writer mutex.
 
If you look at my algorithm you will notice that on the reader side it
looks like a classical sequential lock, but I have added a variable
called FCount2^.FCount2 that allows the readers not to livelock when
there are many writers. This scalable SeqlockX is faster and more
scalable than Dmitry Vyukov's distributed reader-writer mutex, and it
beats the classical Seqlock because it avoids livelock when there are
many writers.
 
Please look at the "test.pas" example that i have included and you will
notice that the reader side have to be used in a transactional mode with
a repeat-until loop, like this:
 
repeat
 
t.rlock(id1);
 
until t.runlock(id1);
 
it is like a transactional mode of a Seqlock and id1 must be of type
"long" that must be imported from the SeqlockX unit. The writer side is
easy to program , please look at the "test.pas" example and you will
understand how to use my scalable distributed sequential lock.
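 
Here is a minimal sketch of both sides together, assuming the method
names described in this post (RLock/RUnlock on the reader side,
WLock/WUnlock on the writer side); the shared variables are only
illustrative, and the exact calling conventions are those shown in
"test.pas":
 
uses
  SeqlockX;  // the unit that ships with the library
 
var
  t: TSeqlockX;      // assumed to be constructed as in test.pas (the first
                     // constructor parameter is the cross-process name)
  id1: long;         // the transaction id type imported from the SeqlockX unit
  SharedA, SharedB: Integer;  // illustrative shared data protected by the lock
  A, B: Integer;
 
// Reader side: a transaction that is retried until RUnlock reports success.
repeat
  t.RLock(id1);
  A := SharedA;      // read the shared data inside the transaction
  B := SharedB;
until t.RUnlock(id1);
 
// Writer side (assumed parameterless signatures): a lock/unlock pair
// around the update.
t.WLock;
SharedA := SharedA + 1;
SharedB := SharedB + 1;
t.WUnlock;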
 
My scalable SeqlockX can be used across processes and threads.
 
My new invention called scalable SeqlockX beats the following
algorithm; it is more scalable and faster than the following algorithm
of Dmitry Vyukov, called distributed reader-writer mutex:
 
http://www.1024cores.net/home/lock-free-algorithms/reader-writer-problem/distributed-reader-writer-mutex
 
 
And my new invention called scalable SeqlockX also beats the classical
Seqlock (I have just explained why...); read about it here:
 
http://en.wikipedia.org/wiki/Seqlock
 
And this new invention called scalable SeqlockX can replace RCU because
it is a powerful synchronization mechanism.
 
I will explain my new algorithm, a scalable distributed sequential
lock that is very powerful...
 
I arrived at my new algorithm by also reading many PhD papers and by
first looking at the weakness of the following algorithm, Dmitry
Vyukov's distributed reader-writer mutex:
 
http://www.1024cores.net/home/lock-free-algorithms/reader-writer-problem/distributed-reader-writer-mutex
 
 
Look at the writer side of Dmitry Vyukov's algorithm; for every rwlock
on each corresponding core it is doing:
 
for (i = 0; i != mtx->proc_count; i += 1)
    pthread_rwlock_wrlock(&mtx->cell[i].mtx);
 
But this renders the algorithm inefficient on the writer side, because
every pthread_rwlock_wrlock() call transfers many cache lines between
cores. So this generates too many cache-line transfers between cores
when each writer executes distr_rw_mutex_wrlock(), and it becomes worse
and worse as you add more and more cores: more and more cache lines are
transferred between cores. So the writer-side throughput becomes slow,
and this is the weakness of this algorithm.
 
So this is why I have used a sequential-lock mechanism on the writer
side of my new algorithm, in order to minimize the cache-line transfers
efficiently. If you look at my algorithm, it adds the following to the
classical Seqlock:
 
If the result of the TSeqlockX.RUnlock() method is "false", meaning the
reader transaction has failed, I put 1 into the FCount2^.FCount2
variable; and if it is "true", meaning the reader transaction has
succeeded, I put 0 into the FCount2^.FCount2 variable. And on the
writer side, inside the TSeqlockX.WLock() method, I am doing this:
 
while FCount2^.FCount2<>0 do sleep(0);
 
So if FCount2^.FCount2 is equal to its initial value the writer exits
the while loop, and if FCount2^.FCount2 is equal to 1 the writer waits
for TSeqlockX.RUnlock() to return true, that is, it waits for a reader
transaction to succeed. This part of the algorithm allows my algorithm
to avoid livelock when there are many writers, and the rest of the
algorithm looks like a classical Seqlock. That's all.
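 
In condensed form, and only as a sketch of the mechanism (the
sequence-counter bookkeeping of the classical Seqlock is just outlined,
and the names CurrentSequence and FTicketLock are illustrative, not the
fields of my library), it looks like this:
 
// Reader side: classical Seqlock check, plus the livelock-avoidance flag.
function TSeqlockX.RUnlock(var id: long): boolean;
begin
  // illustrative: the transaction succeeds if the sequence number read
  // by RLock() is still the current one (no writer has intervened)
  Result := (CurrentSequence = id);
  if Result then
    FCount2^.FCount2 := 0   // a reader got through: the writers may proceed
  else
    FCount2^.FCount2 := 1;  // a reader failed: the writers must wait once
end;
 
// Writer side: serialize the writers, then wait for a successful reader.
procedure TSeqlockX.WLock;
begin
  FTicketLock.Enter;  // illustrative ticket spinlock serializing the writers
  // livelock avoidance: do not publish a new version until at least one
  // reader transaction has succeeded since the last failed one
  while FCount2^.FCount2 <> 0 do sleep(0);
  // ... classical Seqlock writer: bump the sequence counter and write ...
end;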
 
About the sequential consistency of my scalable distributed sequential
lock: my algorithm works on the x86 architecture and compiles with the
Delphi and FreePascal compilers, which both respect the strong memory
model of x86. I think my algorithm is correct; look at the source code
of the WLock() method. Since I am using a ticket spinlock with
proportional backoff on the writer side, and the ticket spinlock uses a
"lock add" assembler instruction to increment a counter in its Enter()
method, this "lock add" instruction is a barrier for stores and loads
on x86; the loads of FCount2^.FCount2 are not reordered with the
previous atomic lock of the ticket spinlock, and the stores of
FCount5^.FCount5 are not reordered with older stores and older loads,
so the WLock() method is sequentially consistent and correct. Now look
at WUnlock(): we don't need an "sfence", because stores and loads in
WUnlock() are not reordered with older stores or older loads on x86, so
the WUnlock() method is sequentially consistent and correct. Now look
at the RLock() method: it is also sequentially consistent. So all in
all I think my algorithm is sequentially consistent and correct on x86
with the Delphi and FreePascal compilers. So I think you can be
confident, because I have reasoned correctly, and I think my scalable
SeqlockX algorithm is correct and is a powerful synchronization
mechanism that can replace RCU and the classical Seqlock, because it
beats the Seqlock.
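 
For reference, here is a generic sketch of a ticket spinlock with
proportional backoff of the kind referred to above; it is not my
library's code, the field names are illustrative, and it assumes the
RTL's InterlockedIncrement, which compiles to a lock-prefixed increment
on x86 (the full barrier the reasoning above relies on):
 
type
  TTicketSpinlock = class
  private
    FNextTicket: LongInt;  // next ticket to hand out (starts at 0)
    FNowServing: LongInt;  // ticket currently allowed to enter (starts at 0)
  public
    procedure Enter;
    procedure Leave;
  end;
 
procedure TTicketSpinlock.Enter;
var
  MyTicket: LongInt;
begin
  // lock-prefixed increment: a full barrier for loads and stores on x86
  MyTicket := InterlockedIncrement(FNextTicket) - 1;
  while FNowServing <> MyTicket do
    // proportional backoff: wait longer the further back in the queue we are
    Sleep(MyTicket - FNowServing);
end;
 
procedure TTicketSpinlock.Leave;
begin
  InterlockedIncrement(FNowServing);  // hand the lock to the next ticket holder
end;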
 
You should also know that the first parameter of the constructor is the
name that identifies the scalable SeqlockX object across process
boundaries.
 
 
You can download my scalable SeqlockX from:
 
https://sites.google.com/site/aminer68/scalable-seqlockx
 
 
 
Language: FPC Pascal v2.2.0+ / Delphi 7+: http://www.freepascal.org/
 
Operating Systems: Windows, Mac OSX, Linux on x86.
 
Required FPC switches: -O3 -Sd -dFPC -dFreePascal
 
-Sd for delphi mode....
 
Required Delphi switches: -$H+ -DDelphi
 
{$DEFINE CPU32} and {$DEFINE Windows32} for 32 bit systems
 
{$DEFINE CPU64} and {$DEFINE Windows64} for 64 bit systems
 
 
 
Thank you,
Amine Moulay Ramdane.
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page.
To unsubscribe from this group and stop receiving emails from it send an email to comp.programming.threads+unsubscribe@googlegroups.com.
