Sunday, December 7, 2014

Digest for comp.programming.threads@googlegroups.com - 11 updates in 5 topics

Ramine <ramine@1.1>: Dec 06 05:04PM -0800

Hello,
 
 
I have just read the following about Intel TSX:
 
"At the moment, TSX is quite slow, even if there's no contention and you
don't do anything in the block. There's a lot of overhead. Using TSX
naively may slow down even threaded code. Getting significant
performance gains from it is non-trivial."
 
 
Read more here, a little further down the page:
 
http://cbloomrants.blogspot.ca/search?updated-min=2014-01-01T00:00:00-08:00&updated-max=2015-01-01T00:00:00-08:00&max-results=43
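 
For illustration, here is a minimal sketch in C++ of a naive TSX (RTM)
block, using the RTM intrinsics from immintrin.h; the fallback spinlock
and the counter are hypothetical placeholders. Even this nearly empty
transaction pays the begin/commit overhead the quote describes:
 
#include <immintrin.h>  // RTM intrinsics; compile with -mrtm on a TSX CPU
#include <atomic>
 
std::atomic<bool> fallback_locked{false};  // hypothetical fallback spinlock
long shared_counter = 0;                   // hypothetical shared data
 
void increment_elided() {
    unsigned status = _xbegin();           // try to start a transaction
    if (status == _XBEGIN_STARTED) {
        // Read the fallback lock inside the transaction, so a real
        // lock holder aborts us instead of racing with us.
        if (fallback_locked.load(std::memory_order_relaxed))
            _xabort(0xff);
        ++shared_counter;                  // transactional body
        _xend();                           // commit: this begin/commit pair
                                           // is the overhead being measured
        return;
    }
    // Transaction aborted or never started: take the fallback spinlock.
    while (fallback_locked.exchange(true, std::memory_order_acquire))
        ;                                  // spin
    ++shared_counter;
    fallback_locked.store(false, std::memory_order_release);
}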
 
So I think my newer scalable distributed sequential lock is faster than
Intel TSX in read-mostly scenarios. My newer algorithm is also a kind
of general synchronization mechanism, because it works with both memory
and hard disk.
 
 
 
Thank you,
Amine Moulay Ramdane.
"Chris M. Thomasson" <no@spam.invalid>: Dec 06 02:42PM -0800

> "Ramine" wrote in message news:m5vufu$uk6$2@dont-email.me...
> Hello,
> I have just read the following about Intel TSX:
 
[...]
 
why do you insist on multi-posting?
 
;^(
Melzzzzz <mel@zzzzz.com>: Dec 07 01:21AM +0100

On Sat, 6 Dec 2014 14:42:53 -0800
 
> [...]
 
> why do you insist on multi-posting?
 
> ;^(
 
I believe it is a restriction of his news server.
"Chris M. Thomasson" <no@spam.invalid>: Dec 06 04:31PM -0800


> > why do you insist on multi-posting?
 
> > ;^(
 
> I believe it is a restriction of his news server.
 
That's a shame.
 
;^(
"Chris M. Thomasson" <no@spam.invalid>: Dec 06 01:45PM -0800


> I don't quite understand the "or" part of your "and/or memory barrier"
> statement. If there was a different algorithm which used only memory
> barrier, but not RMW, would that be so much worse than RCU?
 
Yes. Anytime you can reduce the number of memory barriers and/or
atomics, the better. I have tested RCU versus the same RCU with a single
MFENCE instruction on the reader side. The one without the membar
slaughtered it wrt reads-per-second per-thread.
 
 
 
> But then, if that's correct, that means that modern CPUs (whichever they
> are) still don't have features that old Alpha CPUs have. And not the other
> way round.
 
Yup. On a DEC Alpha, RCU simply needs a membar, because the Alpha does
not honor data-dependency ordering. RCU relies on those data
dependencies being there.
"Chris M. Thomasson" <no@spam.invalid>: Dec 06 01:56PM -0800

> news:m5vteu$dp3$1@speranza.aioe.org...
 
> > "Drazen Kacar" wrote in message
> > news:slrnm84di2.je0.dave@fly.srk.fer.hr...
 
[...]
 
> Yup. On a DEC Alpha, RCU simply needs a membar, because the Alpha does
> not honor data-dependency ordering. RCU relies on those data
> dependencies being there.
 
AFAICT, this is why `std::memory_order_consume' exists.
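 
Putting that together, here is a minimal sketch (names made up; this is
not my actual test harness): the writer publishes with a release store;
the first reader's consume load is ordered by the data dependency
alone, so it needs no barrier on x86, SPARC, POWER or ARM, while a DEC
Alpha would need a real membar there; the second reader adds the one
fence the benchmark above measured, which compiles to MFENCE on x86.
(In practice, current compilers promote consume to acquire.)
 
#include <atomic>
 
struct Node { int payload; };
std::atomic<Node*> g_node{nullptr};  // hypothetical published pointer
 
void publisher() {
    Node* n = new Node{42};
    g_node.store(n, std::memory_order_release);  // publish built node
}
 
// RCU-style reader: ordered by the data dependency itself.
int reader_dependent() {
    Node* n = g_node.load(std::memory_order_consume);
    return n ? n->payload : 0;
}
 
// Same reader plus one full fence: roughly the "RCU + MFENCE" variant.
int reader_fenced() {
    Node* n = g_node.load(std::memory_order_consume);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // MFENCE on x86
    return n ? n->payload : 0;
}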
Ramine <ramine@1.1>: Dec 06 03:51PM -0800

Hello,
 
 
I wrote this in my previous post:
 
 
"I have benchmarked the distribute reader-write lock of Dmitry Vyukov
and i have found that it is so expensive that it takes over 16 or 32
cores to make it scale as Seqlock with 4 cores"
 
 
You will say that in computer science we need proof of what I am
saying above.
 
Here is my proof, using Amdahl's law:
 
 
The distributed reader-writer mutex of Dmitry Vyukov, built over a
rwlock, does not eliminate the full memory barrier on the reader side
of that rwlock. This full memory barrier takes about 350 CPU cycles on
my quadcore Q6600, and I think it takes around the same number of
cycles on an Intel i7. I have just run a parallel benchmark on x86
comparing a Seqlock against the Dmitry Vyukov distributed reader-writer
mutex over a rwlock: in a read-mostly scenario it gives 25 microseconds
with the Seqlock and around 165 microseconds with the distributed
reader-writer mutex. It takes 165 microseconds because it makes an
expensive call to GetCurrentProcessorNumber() on Windows, it issues
full memory fences on the reader side, and it runs more code on the
reader side. This GetCurrentProcessorNumber() call and these full
memory fences belong to the parallel part of Amdahl's law, so they do
scale; but they are so expensive compared to the Seqlock that it takes
16 to 32 cores, on small to medium reader section sizes, to match a
Seqlock on 4 cores. Just look at the 165 microseconds against the 25
microseconds: there is a big difference. This is my proof. So the
distributed reader-writer mutex over a rwlock is scalable, but it is
much slower than Seqlock or RCU in read-mostly scenarios. This is why I
have invented my scalable distributed sequential lock, which beats
Seqlock and which is as fast and as scalable as both Seqlock and RCU in
read-mostly scenarios.
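 
To put numbers on that: the read side of the distributed reader-writer
mutex costs about 165/25 = 6.6 times more per operation than the
Seqlock here, and since both costs sit in the parallel part of Amdahl's
law, you need roughly 4 x 6.6, that is about 26 cores, for it to match
the Seqlock's throughput on 4 cores; that is where my 16-to-32-core
estimate comes from.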
 
You can download my scalable distributed sequential lock version 1.1 from:
 
https://sites.google.com/site/aminer68/scalable-distributed-sequential-lock
 
 
Thank you,
Amine Moulay Ramdane.
Ramine <ramine@1.1>: Dec 06 02:51PM -0800

Hello,
 
 
Here is what's important about reader-writer algorithms...
 
I have thought more about this subject, and I think I have mastered,
to a certain level, the subject of synchronization and reader-writer
algorithms...
 
You have to be careful, my dear programmers and hardware engineers:
if you take a look at the PThread reader-writer lock, it atomically
increments a shared variable on the reader side, and that makes it
inefficient. On x86, for example, when you atomically increment a
variable, by writing "lock add" in assembler or by using atomics, that
"lock add" or atomic operation implies a full memory barrier, and that
is costly: I have measured a full memory barrier on x86 and it takes
over 300 CPU cycles, which is too costly. If we add to that the time to
transfer the cache line(s) between the cores, it gets even more costly,
and this "lock add" or atomic operation belongs to the serial part of
Amdahl's law, so it will not scale. That is the weakness of the PThread
reader-writer lock, and it makes it really inefficient; a minimal
sketch of that read-side increment follows after the link. So what has
Dmitry Vyukov done with his distributed reader-writer algorithm here:
 
http://www.1024cores.net/home/lock-free-algorithms/reader-writer-problem/distributed-reader-writer-mutex
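 
Here is a minimal C++ sketch of that kind of read-side entry
(illustrative only, not the actual PThread implementation, and the
writer side is omitted): the reader count is a single shared atomic, so
every reader on every core performs an atomic read-modify-write, which
on x86 is a "lock"-prefixed instruction and therefore a full barrier.
 
#include <atomic>
 
std::atomic<int> reader_count{0};  // one shared counter for all readers
 
void read_lock_naive() {
    // Compiles to LOCK XADD on x86: a full memory barrier, plus a
    // cache-line transfer whenever another core touched the counter.
    reader_count.fetch_add(1, std::memory_order_acquire);
}
 
void read_unlock_naive() {
    reader_count.fetch_sub(1, std::memory_order_release);
}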
 
 
 
Dmitry Vyukov has used a distributed mechanism. This distributed
mechanism arranges, for example, for the "lock add" in assembler, or
the atomics, on the reader side of the RWLock used by the distributed
algorithm to be executed only by the group of threads running on the
same core, and this makes the "lock add" or atomics part of the
parallel part of Amdahl's law; so his distributed algorithm allows the
"lock add" or atomics to scale effectively. But there is still a
weakness in this distributed algorithm of Dmitry Vyukov: his algorithm
is slow, because this "lock add" or atomic operation on the reader side
of the RWLock used by the distributed algorithm issues a full memory
barrier on every call, and this full memory barrier is expensive. I
have benchmarked it, and I have found it so expensive that it takes
over 16 or 32 cores to make it scale as well as a Seqlock with 4 cores,
even though this full memory barrier belongs to the parallel part of
Amdahl's law in the distributed algorithm; so I think that the Dmitry
Vyukov distributed reader-writer lock is inefficient, up to 32 cores,
compared to the Seqlock. A sketch of the distributed idea follows at
the end of this explanation. Now what about Seqlock? Seqlock doesn't
use atomics on the reader side, so it is very fast and it is scalable
in read-mostly scenarios; it is as fast and as scalable as RCU in
read-mostly scenarios. But the weakness of Seqlock is that it can
starve and livelock when a greater percentage of writers is used. So
this is why I have invented my scalable distributed sequential lock: it
is as fast and as scalable as Seqlock, and it avoids the weakness of
Seqlock, because it doesn't starve or livelock when a greater
percentage of writers is used.
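 
For comparison, here is a minimal C++ sketch of the distributed idea,
simplified from the page linked above; current_core() is a stand-in for
GetCurrentProcessorNumber() or sched_getcpu(), and the core count is an
assumption. Each core gets its own reader-writer lock in its own cache
line, a reader locks only its core's slot, and a writer locks every
slot:
 
#include <shared_mutex>
#include <thread>
#include <functional>
 
constexpr int kCores = 8;          // assumption: fixed core count
 
struct alignas(64) Slot {          // one slot per core, one cache line each
    std::shared_mutex mtx;
};
Slot slots[kCores];
 
int current_core() {               // stand-in for GetCurrentProcessorNumber()
    return static_cast<int>(
        std::hash<std::thread::id>{}(std::this_thread::get_id()) % kCores);
}
 
int read_lock() {                  // returns the slot index for unlocking,
    int c = current_core();        // since the thread may migrate meanwhile
    slots[c].mtx.lock_shared();
    return c;
}
void read_unlock(int c) { slots[c].mtx.unlock_shared(); }
 
void write_lock() {                // a writer takes every slot, in order
    for (int i = 0; i < kCores; ++i) slots[i].mtx.lock();
}
void write_unlock() {
    for (int i = kCores - 1; i >= 0; --i) slots[i].mtx.unlock();
}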
 
 
I hope you have understood what I have just explained, because it is
very important.
 
You can download my scalable distributed sequential lock version 1.1 from:
 
https://sites.google.com/site/aminer68/scalable-distributed-sequential-lock
 
 
Thank you,
Amine Moulay Ramdane.
Ramine <ramine@1.1>: Dec 06 02:56PM -0800

On 12/6/2014 2:51 PM, Ramine wrote:
> memory barrier is expensive. I have benchmarked it, and I have found
> it so expensive that it takes over 16 or 32 cores to make it scale as
> well as a Seqlock with 4 cores,
 
I have done this benchmark for small to medium reader sections.
 
 
 
Ramine <ramine@1.1>: Dec 06 01:43PM -0800

Hello,
 
As I have promised, here is version 1.1 of my newer algorithm, a
scalable distributed sequential lock that is competitive with RCU and
that beats Seqlock on some characteristics, because it doesn't starve
and it doesn't livelock when a greater percentage of writers is used.
Please take a look at the source code; the RLock() and RUnlock()
methods have changed. What I am doing in RLock() is that I take a copy
of the "FCount6^.fcount6" counter, which is incremented on the writer
side, and I take this counter modulo the number of cores; when this
modulo is equal to 0, I switch the reader side to a distributed
algorithm. In the RUnlock() method I test whether FCount6^.fcount6 is
still equal to the old copy of FCount6^.fcount6 that had the modulo
equal to 0: if they are equal, I exit by unlocking the distributed
reader-writer lock; if not, I stay in Seqlock mode. My newer algorithm
has allowed me to avoid memory barriers and atomics on the reader side,
so it has become competitive with RCU; so I think it can replace RCU,
and it beats RCU on some characteristics, such as that it doesn't
starve or livelock when there is a greater percentage of writers. A
rough skeleton of this control flow follows below.
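 
Here is a rough C++-style skeleton of the control flow I have just
described; it is a paraphrase only, with guessed names and the locking
details elided, and the real source, in Object Pascal, is at the link
below:
 
#include <atomic>
 
std::atomic<long> fcount6{0};      // writer-side counter, per the text above
const long ncores = 8;             // number of cores (placeholder value)
 
thread_local long saved_count;     // copy taken in RLock()
thread_local bool distributed_mode;
 
void RLock() {
    saved_count = fcount6.load(std::memory_order_acquire);
    distributed_mode = (saved_count % ncores == 0);
    if (distributed_mode) {
        // take the read side of the distributed reader-writer lock
    } else {
        // enter Seqlock-style read mode (validation/retry loop elided)
    }
}
 
void RUnlock() {
    if (distributed_mode &&
        fcount6.load(std::memory_order_acquire) == saved_count) {
        // no writer has passed: release the distributed read lock
    } else {
        // otherwise stay in Seqlock mode (validate and retry if needed)
    }
}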
 
 
About the sequential consistency of my scalable distributed sequential
lock: my algorithm works on the x86 architecture, and I think it is
correct. Look at the source code of the WLock() method: since I am
using a ticket spinlock with a proportional backoff on the writer side,
and the ticket spinlock uses a "lock add" assembler instruction to
increment a counter in its Enter() method, and this "lock add"
instruction is a barrier for both stores and loads on x86, the WLock()
method is sequentially consistent and correct. Now look at WUnlock():
we don't need an "sfence" there, because stores are not reordered with
other stores on x86, so the WUnlock() method is sequentially consistent
and correct. Now look at the RLock() method: the loads inside RLock()
are not reordered with the loads of the reader section, and in
RUnlock() the loads are not reordered with older loads of the critical
section. So, all in all, I think my algorithm is sequentially
consistent and correct on x86. So be confident, because I have reasoned
carefully, and I think my algorithm is correct and is a powerful
synchronization mechanism that can replace RCU, and that can replace
Seqlock because it beats Seqlock. A sketch of such a ticket spinlock
with proportional backoff follows below.
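 
Since the writer side hinges on it, here is a minimal C++ sketch of a
ticket spinlock with proportional backoff; this is a generic textbook
version, not my actual WLock() source. The increment in Enter() is a
"lock"-prefixed read-modify-write, hence a full barrier on x86, and
each waiter backs off in proportion to its distance from the ticket
currently being served:
 
#include <atomic>
#include <thread>
#include <chrono>
 
class TicketLock {
    std::atomic<unsigned> next{0};   // ticket dispenser
    std::atomic<unsigned> owner{0};  // ticket currently served
public:
    void Enter() {
        // Compiles to LOCK XADD on x86: a full memory barrier.
        unsigned my = next.fetch_add(1, std::memory_order_acquire);
        for (;;) {
            unsigned cur = owner.load(std::memory_order_acquire);
            if (cur == my) return;
            // Proportional backoff: wait longer the further back we are.
            std::this_thread::sleep_for(
                std::chrono::nanoseconds(50 * (my - cur)));
        }
    }
    void Leave() {
        // A plain release store suffices: x86 does not reorder a store
        // with older stores, so no SFENCE is needed here.
        owner.store(owner.load(std::memory_order_relaxed) + 1,
                    std::memory_order_release);
    }
};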
 
 
 
You can download my scalable distributed sequential lock version 1.1 from:
 
https://sites.google.com/site/aminer68/scalable-distributed-sequential-lock
 
 
Thank you,
Amine Moulay Ramdane.
Ramine <ramine@1.1>: Dec 06 02:07PM -0800

On 12/6/2014 1:43 PM, Ramine wrote:
> has allowed me to avoid memory barriers and atomics on the reader side,
> so it has become competitive with RCU; so I think it can replace RCU,
> and it beats RCU on some characteristics, such as that it doesn't
> starve or
 
I mean it beats Seqlock.
 