Wednesday, October 9, 2013

comp.programming.threads - 26 new messages in 13 topics - digest

comp.programming.threads
http://groups.google.com/group/comp.programming.threads?hl=en

comp.programming.threads@googlegroups.com

Today's topics:

* About my RWLock ... - 2 messages, 1 author
http://groups.google.com/group/comp.programming.threads/t/161f8b7d91666fb7?hl=en
* About the SpinLock... - 1 message, 1 author
http://groups.google.com/group/comp.programming.threads/t/1253c59b21ab7dd5?hl=en
* There is still a problem with SpinLock... - 3 messages, 2 authors
http://groups.google.com/group/comp.programming.threads/t/d4c9e9277f0bb0ce?hl=en
* Scalable RWLock 1.12 - 2 messages, 1 author
http://groups.google.com/group/comp.programming.threads/t/9a83250ad6ea782b?hl=en
* Ticket spinlock... - 2 messages, 1 author
http://groups.google.com/group/comp.programming.threads/t/1d1de81d08a57ee6?hl=en
* Spinlock and starvation... - 2 messages, 1 author
http://groups.google.com/group/comp.programming.threads/t/aa52d8e0756f5741?hl=en
* MCS queue Lock... - 1 message, 1 author
http://groups.google.com/group/comp.programming.threads/t/e6688b6b26360f95?hl=en
* SpinLock with backoff... - 1 message, 1 author
http://groups.google.com/group/comp.programming.threads/t/ea3587dcc8d52697?hl=en
* A new algorithm for a distributed priority and non priority LOCK... - 5 messages, 1 author
http://groups.google.com/group/comp.programming.threads/t/302ae293275d265e?hl=en
* About lockfree queues - 2 messages, 2 authors
http://groups.google.com/group/comp.programming.threads/t/b80b838a140c8a1d?hl=en
* About my distributed priority and non priority LOCK... - 1 message, 1 author
http://groups.google.com/group/comp.programming.threads/t/f57a04e98166f0da?hl=en
* About Spinlock... - 3 messages, 1 author
http://groups.google.com/group/comp.programming.threads/t/66a839ec03890e53?hl=en
* About lockfree queues ... - 1 message, 1 author
http://groups.google.com/group/comp.programming.threads/t/331de802b7b87fa5?hl=en

==============================================================================
TOPIC: About my RWLock ...
http://groups.google.com/group/comp.programming.threads/t/161f8b7d91666fb7?hl=en
==============================================================================

== 1 of 2 ==
Date: Sun, Sep 22 2013 8:40 pm
From: aminer


On 9/22/2013 11:37 PM, aminer wrote:
> This SpinLock that i am using will not impact the performance of my RWLock.

I mean it will not negatively impact the performance of my RWLock.



Thank you,
Amine Moulay Ramdane.




== 2 of 2 ==
Date: Sun, Sep 22 2013 8:47 pm
From: aminer



I correct:

Hello,

The SpinLock that I am using will not negatively impact the performance of my
RWLock, because to use my RWLock effectively there must be far fewer writers
than readers. But it will negatively impact the performance of my SemaCondVar
and SemaMonitor when there is high contention, so I will design something that
looks like an MCS queue Lock and that is more efficient than this SpinLock
with a backoff mechanism. As I said before, we have to spin on local variables
rather than on a shared variable and use a backoff mechanism; this way our
Lock will be very efficient.


Thank you,
Amine Moulay Ramdane.






==============================================================================
TOPIC: About the SpinLock...
http://groups.google.com/group/comp.programming.threads/t/1253c59b21ab7dd5?hl=en
==============================================================================

== 1 of 1 ==
Date: Sun, Sep 22 2013 9:36 pm
From: aminer




Hello,

About the following SpinLock that I am using inside my SemaMonitor and
SemaCondvar:


http://code.google.com/p/gpdelphiunits/source/browse/trunk/src/SpinLock.pas?r=37



I think it is not so bad and can give decent performance even under high
contention, because it uses a backoff mechanism. Even though it spins on
shared variables rather than on local variables, which causes more
cache-coherence traffic, I don't think it will substantially slow down the
threads of your computer overall. Hence I think this TTAS spinlock with an
exponential backoff that I am using gives decent performance inside my
SemaCondvar and SemaMonitor even under high contention, so there is no need
for an MCS queue Lock.
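
To make the idea concrete, here is a minimal C++ sketch of a TTAS spinlock
with exponential backoff. It is only an illustration of the technique
discussed above, not the code from SpinLock.pas; the class name, the backoff
cap, and the use of yield() in place of the PAUSE instruction are my own
choices.

#include <atomic>
#include <thread>

// Minimal TTAS (test-and-test-and-set) spinlock with exponential backoff.
class TTASLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        int backoff = 1;                           // backoff window (iterations)
        for (;;) {
            // Test: spin on a plain read first, so we mostly hit our own cache.
            while (locked.load(std::memory_order_relaxed)) { /* spin */ }
            // Test-and-set: try to take the lock with one atomic exchange.
            if (!locked.exchange(true, std::memory_order_acquire))
                return;
            // Failed under contention: back off for an exponentially growing time.
            for (int i = 0; i < backoff; ++i)
                std::this_thread::yield();         // stand-in for the PAUSE instruction
            if (backoff < 1024)
                backoff *= 2;                      // cap the window
        }
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};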


So enjoy my SemaCondvar, my SemaMonitor, and my scalable RWLock, because they
are fast.

You can download them from:

http://pages.videotron.com/aminer/


Thank you,
Amine Moulay Ramdane.



















==============================================================================
TOPIC: There is still a problem with SpinLock...
http://groups.google.com/group/comp.programming.threads/t/d4c9e9277f0bb0ce?hl=en
==============================================================================

== 1 of 3 ==
Date: Mon, Sep 23 2013 12:51 am
From: aminer


Hello,

That's not the end of the story: there is still a problem with the SpinLock
that I am using. It is a TTAS with an exponential backoff, but this kind of
spinlock suffers from starvation; I mean that some threads may starve for a
long time, and this is not acceptable. I think that lockfree algorithms that
spin on a CAS suffer from the same starvation problem. The solution is to use
a Ticket spinlock with backoff, which is fair and resolves the starvation
problem, but this kind of Ticket spinlock spins on a shared variable, and this
increases the cache-coherence traffic and slows the system globally, so it is
not so efficient. So I think the MCS queue Lock is the solution.
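
For reference, here is a minimal C++ sketch of the ticket idea under
discussion. It is only an illustration (the class name and memory orders are
my own choices); note that every waiter spins on the same shared now_serving
counter, which is exactly the cache-coherence cost mentioned above.

#include <atomic>
#include <thread>

// Minimal ticket spinlock: FIFO-fair, but all waiters spin on one shared word.
class TicketLock {
    std::atomic<unsigned> next_ticket{0};
    std::atomic<unsigned> now_serving{0};
public:
    void lock() {
        // Take a ticket: fetch_add gives each thread a unique, ordered number.
        unsigned my_ticket = next_ticket.fetch_add(1, std::memory_order_relaxed);
        // Wait until our number is called.
        while (now_serving.load(std::memory_order_acquire) != my_ticket)
            std::this_thread::yield();
    }
    void unlock() {
        // Hand the lock to the next ticket holder.
        now_serving.store(now_serving.load(std::memory_order_relaxed) + 1,
                          std::memory_order_release);
    }
};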

But I still have a question for you:

Does the Windows critical section suffer from this starvation problem, and
does it use a TTAS with backoff, or is it the same as an MCS queue Lock,
or...?



Thank you,
Amine Moulay Ramdane.





== 2 of 3 ==
Date: Mon, Sep 23 2013 1:18 am
From: aminer



Hello,

I asked you about Windows critical sections and the starvation problem, and I
have just read the following:


"The change to unfair locks clearly has the risk of leading to
starvation. But, statistically speaking, timing in concurrent systems
tends to be so volatile that each thread will eventually get its turn to
run, probabilistically speaking. Many more programs would suffer from
the convoy problems resulting from fair locks than would notice
starvation happening in production systems as a result of unfair locks."

read here:

http://joeduffyblog.com/2006/12/14/anticonvoy-locks-in-windows-server-2003-sp1-and-windows-vista/


So as you have noticed, the Windows critical section has become unfair, and
since it is unfair, starvation can now happen with Windows critical sections:
some threads may starve for a long time, and I think that is not acceptable,
and it is not acceptable for me.


Thank you,
Amine Moulay Ramdane.















== 3 of 3 ==
Date: Mon, Sep 23 2013 11:39 am
From: "Chris M. Thomasson"


> "aminer" wrote in message news:l1ortb$5sm$1@news.albasani.net...

[...]

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.6490

BTW, you don't need DWCAS to implement this...

https://groups.google.com/forum/#!original/comp.arch/aLVoxdQdRac/micVZJgOjdUJ






==============================================================================
TOPIC: Scalable RWLock 1.12
http://groups.google.com/group/comp.programming.threads/t/9a83250ad6ea782b?hl=en
==============================================================================

== 1 of 2 ==
Date: Mon, Sep 23 2013 12:39 pm
From: aminer



Hello,

SemaCondvar was updated to version 1.11 and Scalable RWLock was updated to
version 1.12. I have used a Ticket spinlock with an exponential backoff inside
SemaCondvar and SemaMonitor; the Ticket spinlock now avoids the starvation
problem and has decent performance.


You can download SemaCondvar and my scalable RWLock from:

http://pages.videotron.com/aminer/


Thank you,
Amine Moulay Ramdane.




== 2 of 2 ==
Date: Mon, Sep 23 2013 1:52 pm
From: aminer



Hello,

That's not the end of the story. I have benchmarked the ticket spinlock that I
am using, which you can find here:

http://code.google.com/p/gpdelphiunits/source/browse/trunk/src/SpinLock.pas?r=37

and found that it has poor performance: with a spinlock without tickets, when
it is unlocked the first thread that reaches lock() enters the locked section
immediately, but that's not the case with a Ticket spinlock, because if the
first thread that reaches lock() does not hold the next ticket, the threads
that do hold tickets and are waiting for their turn will wait longer. This is
why the Ticket spinlock has poor performance compared to a simple spinlock
with a backoff.


So I advise you to avoid the Ticket spinlock.




Thank you,
Amine Moulay Ramdane.



On 9/23/2013 3:39 PM, aminer wrote:
[...]






==============================================================================
TOPIC: Ticket spinlock...
http://groups.google.com/group/comp.programming.threads/t/1d1de81d08a57ee6?hl=en
==============================================================================

== 1 of 2 ==
Date: Mon, Sep 23 2013 1:52 pm
From: aminer



Hello,

That's not the end of the story. I have benchmarked the ticket spinlock that I
am using, which you can find here:

http://code.google.com/p/gpdelphiunits/source/browse/trunk/src/SpinLock.pas?r=37

and found that it has poor performance: with a spinlock without tickets, when
it is unlocked the first thread that reaches lock() enters the locked section
immediately, but that's not the case with a Ticket spinlock, because if the
first thread that reaches lock() does not hold the next ticket, the threads
that do hold tickets and are waiting for their turn will wait longer. This is
why the Ticket spinlock has poor performance compared to a simple spinlock
with a backoff.


So I advise you to avoid the Ticket spinlock.




Thank you,
Amine Moulay Ramdane.




== 2 of 2 ==
Date: Mon, Sep 23 2013 2:07 pm
From: aminer



I correct some English mistakes, please read again...



Hello,

That's not the end of the story. I have benchmarked the ticket spinlock that I
am using, which you can find here:

http://code.google.com/p/gpdelphiunits/source/browse/trunk/src/SpinLock.pas?r=37

and found that it has poor performance: with a spinlock without tickets, when
it is unlocked the first thread that reaches lock() enters the locked section
immediately, but that's not the case with a Ticket spinlock, because if the
first thread that reaches lock() does not hold the ticket to enter, the other
thread that does hold the ticket and is waiting for its turn will wait longer.
This is why the Ticket spinlock has poor performance compared to a simple
spinlock with a backoff.


So I advise you to avoid the Ticket spinlock.




Thank you,
Amine Moulay Ramdane.


On 9/23/2013 4:52 PM, aminer wrote:
[...]






==============================================================================
TOPIC: Spinlock and starvation...
http://groups.google.com/group/comp.programming.threads/t/aa52d8e0756f5741?hl=en
==============================================================================

== 1 of 2 ==
Date: Mon, Sep 23 2013 2:31 pm
From: aminer



Hello,

Finally, I have read the following, which says:

"Starvation freedom is desirable, but not essential
- practical locks: many permit starvation, although it is unlikely to occur"

Read here:

http://www.cs.rice.edu/~vs3/comp422/lecture-notes/comp422-lec19-s08-v1.pdf

So as you have just read, starvation is unlikely to occur with a spinlock with
backoff; the probability that starvation occurs is very low. So I will go back
to the Spinlock with exponential backoff inside my SemaMonitor and
SemaCondvar, because it has decent performance.



Thank you,
Amine Moulay Ramdane.












== 2 of 2 ==
Date: Mon, Sep 23 2013 2:46 pm
From: aminer





And look at the "Lock comparison" graph: up to 80 cores, test-and-set with an
exponential backoff has good performance, so there is no need for an MCS queue
Lock. So I will go back to the SpinLock with an exponential backoff, because
it has good performance.

The graph is on this PDF page:

http://www.cs.rice.edu/~vs3/comp422/lecture-notes/comp422-lec19-s08-v1.pdf




Thank you,
Amine Moulay Ramdane.

On 9/23/2013 5:31 PM, aminer wrote:
[...]






==============================================================================
TOPIC: MCS queue Lock...
http://groups.google.com/group/comp.programming.threads/t/e6688b6b26360f95?hl=en
==============================================================================

== 1 of 1 ==
Date: Mon, Sep 23 2013 5:23 pm
From: aminer



Hello,

Please take a look at the following MCS queue lock algorithm:

==

mcs_node{
    mcs_node next;
    int is_locked;
}

mcs_lock{
    mcs_node queue;
}

function Lock(mcs_lock lock, mcs_node my_node){
    my_node.next = NULL;
    mcs_node predecessor = fetch_and_store(lock.queue, my_node);
    // if it's NULL, we now own the lock and can leave, else...
    if (predecessor != NULL){
        my_node.is_locked = true;
        // when the predecessor unlocks, it will give us the lock
        predecessor.next = my_node;
        // spin on our own node only (local spinning)
        while(my_node.is_locked){}
    }
}

function UnLock(mcs_lock lock, mcs_node my_node){
    // is there anyone to give the lock to?
    if (my_node.next == NULL){
        // need to make sure there wasn't a race here
        if (compare_and_swap(lock.queue, my_node, NULL)){
            return;
        }
        else{
            // someone has executed fetch_and_store but has not yet set our next field
            while(my_node.next == NULL){}
        }
    }
    // if we made it here, there is someone waiting, so let's wake them up
    my_node.next.is_locked = false;
}

===
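
For readers who prefer compilable code, here is a C++ sketch of the same MCS
idea; it is only an illustration (the type names and memory orders are my own
choices), not code from any particular library.

#include <atomic>

// C++ sketch of the MCS queue lock pseudocode above (illustrative only).
struct MCSNode {
    std::atomic<MCSNode*> next{nullptr};
    std::atomic<bool> locked{false};
};

struct MCSLock {
    std::atomic<MCSNode*> tail{nullptr};

    void lock(MCSNode &me) {
        me.next.store(nullptr, std::memory_order_relaxed);
        // Atomically append ourselves to the queue (the fetch_and_store).
        MCSNode *pred = tail.exchange(&me, std::memory_order_acq_rel);
        if (pred != nullptr) {
            me.locked.store(true, std::memory_order_relaxed);
            pred->next.store(&me, std::memory_order_release);
            // Spin on our own node only (local spinning).
            while (me.locked.load(std::memory_order_acquire)) { }
        }
    }

    void unlock(MCSNode &me) {
        MCSNode *succ = me.next.load(std::memory_order_acquire);
        if (succ == nullptr) {
            // No known successor: try to swing the tail back to empty.
            MCSNode *expected = &me;
            if (tail.compare_exchange_strong(expected, nullptr,
                                             std::memory_order_acq_rel))
                return;
            // A waiter is between exchange() and setting our next pointer.
            while ((succ = me.next.load(std::memory_order_acquire)) == nullptr) { }
        }
        succ->locked.store(false, std::memory_order_release);
    }
};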


First, you will notice that it uses an atomic fetch_and_store, plus a spin
wait, so it is more expensive than the Spinlock with an exponential backoff. I
prefer the SpinLock with an exponential backoff, because the exponential
backoff reduces the cache-coherence traffic and the CPU utilisation when there
is more contention; this is why the Spinlock with an exponential backoff gives
good performance, almost the same performance as the MCS queue Lock. Proof of
that? Look at the "Lock comparison" graph up to 80 cores in the following PDF
file, and you will notice that test-and-set with an exponential backoff has
almost the same performance as the MCS queue Lock:

http://www.cs.rice.edu/~vs3/comp422/lecture-notes/comp422-lec19-s08-v1.pdf


So I think I will stay with the Spinlock with an exponential backoff, because
it has good performance.



Thank you,
Amine Moulay Ramdane.










==============================================================================
TOPIC: SpinLock with backoff...
http://groups.google.com/group/comp.programming.threads/t/ea3587dcc8d52697?hl=en
==============================================================================

== 1 of 1 ==
Date: Mon, Sep 23 2013 7:56 pm
From: aminer


Hello,

That's not the end of the story. I have benchmarked the following Spinlock
with exponential backoff that I am using:

http://code.google.com/p/gpdelphiunits/source/browse/trunk/src/SpinLock.pas?r=37


And it is not good at all, because there is a problem with the exponential
backoff: I have benchmarked it and noticed that the backoff slows down the
critical section a lot even when contention is not high, so this TTAS with an
exponential backoff is a stupid thing, and the ticket spinlock is also a
stupid thing.

I have also benchmarked the ticket spinlock and found that it has poor
performance: with a spinlock without tickets, when it is unlocked the first
thread that reaches lock() enters the locked section immediately, but that's
not the case with a Ticket spinlock, because if the first thread that reaches
lock() does not yet hold the ticket to enter, the other thread that does hold
the ticket and is waiting for its turn will wait longer. This is why the
Ticket spinlock also has poor performance compared to a simple spinlock with a
backoff.


So I advise you to avoid the Spinlock with exponential backoff and the Ticket
spinlock.


Thank you,
Amine Moulay Ramdane.










==============================================================================
TOPIC: A new algorithm for a distributed priority and non priority LOCK...

http://groups.google.com/group/comp.programming.threads/t/302ae293275d265e?hl=en
==============================================================================

== 1 of 5 ==
Date: Tues, Sep 24 2013 7:33 am
From: aminer



Hello,

Here is my new algorithm for a distributed priority and non priority LOCK.
This algorithm reduces the cache-coherency traffic a lot and is good, so
please follow me...


First, you allocate an array with the same number of elements as the number of
cores; each element of the array must be 64-byte cache aligned and must be a
record containing one element, the flag for the first CAS, which will be used
as a critical section.

You first initialise the distributed priority LOCK by setting the flag for the
second CAS (which will also be used as a critical section) to 0, and you set
the flag of the first CAS to 1 (to permit the first thread to enter lock()).

The lock() function will look like this:

Every thread that enters lock() must acquire its processor number by using the
GetProcessorNumber() function, but this function will be optimized to amortize
the number of calls to it, and that's easy to do.

You enter the lock() function by pushing the processor number of the thread
into a queue or priority queue that will be optimized; this queue will use a
CAS on push() and no CAS on pop(). We don't need one on pop() because only one
thread will pop() from the unlock() function. After that you enter a repeat
loop where you test with the first CAS, and also with the second CAS, the flag
of the corresponding element of the array, using the processor number of the
thread as an index. The flag for the second CAS will be set to 1, so the first
thread will enter lock(), while the other threads keep testing the first CAS
and the second CAS in a repeat loop; their flags will have been set to zero,
so they will wait.

After that, the first thread will arrive at the unlock() function; it will
pop() the processor number from the optimized priority queue or non priority
queue and set the flag for the first CAS to 0 for the corresponding processor
core, and this will allow a thread to enter the lock. If there are no elements
in the queue, the thread will set the flag of the second CAS to zero, and this
will allow a thread to enter the lock.


So as you have noticed, my algorithm is also efficient because if there are
threads waiting, the cache-coherence traffic will be reduced a lot, since we
are using local variables in each 64-byte aligned element of the array.


So if you have noticed, in fact I am using 2 CASes, so my algorithm is good.

This was my algorithm for a distributed priority and non priority LOCK that is
efficient, and I will code it as soon as possible.



Thank you,
Amine Moulay Ramdane.















== 2 of 5 ==
Date: Tues, Sep 24 2013 8:44 am
From: aminer


On 9/24/2013 10:33 AM, aminer wrote:
[...]
> After that, the first thread will arrive at the unlock() function; it will
> pop() the processor number from the optimized priority queue or non priority
> queue and set the flag for the first CAS to 0 for the corresponding
> processor core, and this will allow a thread to enter the lock.

I mean: set the flag for the first CAS to 1.

> If there are no elements in the queue, the thread will set the flag of the
> second CAS to zero, and this will allow a thread to enter the lock.

I mean: set the flag for the second CAS to 1.

[...]





== 3 of 5 ==
Date: Tues, Sep 24 2013 8:48 am
From: aminer



I correct some typos, please read again...

Hello,

Here is my new algorithm for a distributed priority and non priority LOCK.
This algorithm reduces the cache-coherence traffic a lot and is good, so
please follow me...


First, you allocate an array with the same number of elements as the number of
cores; each element of the array must be 64-byte cache aligned and must be a
record containing one element, the flag for the first CAS, which will be used
as a critical section.

You first initialise the distributed priority LOCK by setting the flag for the
second CAS (which will also be used as a critical section) to 0, and you set
the flag of the first CAS to 1 (to permit the first thread to enter lock()).

The lock() function will look like this:

Every thread that enters lock() must acquire its processor number by using the
GetProcessorNumber() function, but this function will be optimized to amortize
the number of calls to it, and that's easy to do.

You enter the lock() function by pushing the processor number of the thread
into a queue or priority queue that will be optimized; this queue will use a
CAS on push() and no CAS on pop(). We don't need one on pop() because only one
thread will pop() from the unlock() function. After that you enter a repeat
loop where you test with the first CAS, and also with the second CAS, the flag
of the corresponding element of the array, using the processor number of the
thread as an index. The flag for the second CAS will be set to 1, so the first
thread will enter lock(), while the other threads keep testing the first CAS
and the second CAS in a repeat loop; their flags will have been set to zero,
so they will wait.

After that, the first thread will arrive at the unlock() function; it will
pop() the processor number from the optimized priority queue or non priority
queue and set the flag for the first CAS to 1 for the corresponding processor
core, and this will allow a thread to enter the lock. If there are no elements
in the queue, the thread will set the flag of the second CAS to 1, and this
will allow a thread to enter the lock.


So as you have noticed, my algorithm is also efficient because if there are
threads waiting, the cache-coherence traffic will be reduced a lot, since we
are using local variables in each 64-byte aligned element of the array.


So if you have noticed, in fact I am using 2 CASes, so my algorithm is good.

This was my algorithm for a distributed priority and non priority LOCK that is
efficient, and I will code it as soon as possible.
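
As a concrete illustration of the per-core data layout described above, here
is a small C++ sketch, assuming a 64-byte cache line; the names (CoreSlot,
DistributedLock, MAX_CORES) are hypothetical, and the queue of processor
numbers and the lock()/unlock() logic are deliberately left out.

#include <atomic>
#include <cstddef>

constexpr std::size_t CACHE_LINE = 64;  // assumed cache-line size
constexpr std::size_t MAX_CORES  = 64;  // illustrative upper bound on cores

// One slot per core, aligned so that no two cores share a cache line;
// each waiting thread spins only on the flag of its own core's slot.
struct alignas(CACHE_LINE) CoreSlot {
    std::atomic<int> first_cas_flag{0};  // the flag tested by the "first CAS"
};

struct DistributedLock {
    CoreSlot slots[MAX_CORES];                               // per-core flags
    alignas(CACHE_LINE) std::atomic<int> second_cas_flag{0}; // the "second CAS" flag
    // The flags are initialised as described in the post above; the queue and
    // the lock()/unlock() logic are omitted from this sketch.
};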



Thank you,
Amine Moulay Ramdane.





== 4 of 5 ==
Date: Tues, Sep 24 2013 9:40 am
From: aminer







Hello,

Please be smart: to reduce the cache-coherence traffic a lot, you have to
choose a queue that itself reduces the cache-coherence traffic a lot, and for
that you have to code the queue as a linked list (similar to the MCS queue)
that reduces the cache-coherence traffic a lot, but you have to avoid lockfree
queues, because they will increase the cache-coherence. This is how you have
to code my distributed LOCK, and it will be efficient and good.



Thank you,
Amine Moulay Ramdane.









On 9/24/2013 11:48 AM, aminer wrote:
[...]





== 5 of 5 ==
Date: Tues, Sep 24 2013 9:45 am
From: aminer




I correct:

Hello,

Please be smart: to reduce the cache-coherence traffic a lot, you have to
choose a queue that itself reduces the cache-coherence traffic a lot, and for
that you have to code the queue as a linked list (similar to the MCS queue)
that reduces the cache-coherence traffic a lot, but you have to avoid lockfree
queues, because they will increase the cache-coherence traffic. This is how
you have to code my distributed LOCK, and it will be efficient and good.



Thank you,
Amine Moulay Ramdane.

On 9/24/2013 11:48 AM, aminer wrote:
[...]






==============================================================================
TOPIC: About lockfree queues
http://groups.google.com/group/comp.programming.threads/t/b80b838a140c8a1d?hl=en
==============================================================================

== 1 of 2 ==
Date: Tues, Sep 24 2013 10:09 am
From: aminer



Hello,


I have come to an interesting subject about lockfree queues. If you take a
look at those lockfree queues, like in transactional memory, if the data has
changed the lockfree queue retries and loops again, but this (lockfree)
mechanism generates a lot of cache-coherence traffic, so I would not advise
this lockfree mechanism. So how can we do it? You can take a look at the MCS
queue Lock; I think it does it like a waitfree linked list that doesn't spin
and that reduces the cache-coherence traffic a lot. Other than that, if your
memory manager is not optimal, uses a lockfree mechanism, and generates a lot
of cache-coherence traffic, then you have to use a freelist to lower the
cache-coherence traffic.
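
As a small illustration of the freelist idea mentioned above, here is a C++
sketch of a per-thread freelist; the Node type is hypothetical, and a real
implementation would of course need to bound the cache and handle cleanup.

#include <cstddef>

// Hypothetical queue node; in practice this would be your own node type.
struct Node { Node *next = nullptr; int value = 0; };

// Per-thread freelist: nodes are recycled locally instead of going back
// through a shared allocator, which keeps most memory traffic on one core.
struct FreeList {
    Node *head = nullptr;
    Node *get() {                 // reuse a locally cached node if available
        if (head != nullptr) { Node *n = head; head = n->next; return n; }
        return new Node();        // otherwise fall back to the allocator
    }
    void put(Node *n) {           // return a node to the local cache
        n->next = head;
        head = n;
    }
};

// Typical usage: one freelist per thread, e.g.
// thread_local FreeList tls_freelist;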


So be smart please and think about that.


Thank you,
Amine Moulay Ramdane.




== 2 of 2 ==
Date: Tues, Sep 24 2013 10:32 am
From: "Chris M. Thomasson"


> "aminer" wrote in message news:l1sh13$krl$1@news.albasani.net... I have
> come to an interresting subject about lockfree queues,
> if you take a look at those lockfree queues[...]

This MPMC queue scales very well for a damn anti-pattern:

I think I will post the algorithm that uses eventcounts to get around
spin-wait.

The code needs Relacy to run.


Some context:


https://groups.google.com/forum/#!searchin/comp.arch/random$20isa/comp.arch/aLVoxdQdRac/overview
(read all)

https://groups.google.com/forum/#!original/comp.arch/aLVoxdQdRac/9DtioMswSmUJ
(pseudo-code for the spin-based algorithm)






==============================================================================
TOPIC: About my distributed priority and non priority LOCK...
http://groups.google.com/group/comp.programming.threads/t/f57a04e98166f0da?hl=en
==============================================================================

== 1 of 1 ==
Date: Tues, Sep 24 2013 11:48 am
From: aminer



Hello,

I wrote this about my distributed priority and non priority LOCK:

"So if you have noticed, in fact i am using 2 CASes , so my algorithm is
good."


You will say that I will in fact use 3 atomic operations: 2 CASes and one
inside the waitfree queue. But be smart: when there are threads waiting and
spinning, one of the CASes will be very cheap, because its variable will be in
the local L2 cache, while the other CAS will be expensive, because the threads
must load the variable from the remote caches. So it is as if we had one CAS,
since one of the two is very cheap, which makes a total of 2 CASes if we count
the CAS for the waitfree queue.

This was my algorithm for a distributed priority and non priority LOCK that is
efficient, and I will code it as soon as possible.


Thank you,
Amine Moulay Ramdane.







==============================================================================
TOPIC: About Spinlock...
http://groups.google.com/group/comp.programming.threads/t/66a839ec03890e53?hl=en
==============================================================================

== 1 of 3 ==
Date: Tues, Sep 24 2013 3:27 pm
From: aminer



Hello,

I think I have discovered what went wrong with the following Spinlock with a
dynamic sleep:

http://code.google.com/p/gpdelphiunits/source/browse/trunk/src/SpinLock.pas?r=37



First, you will notice that they are using a dynamic sleep, not an exponential
backoff, and that's not correct, because you have to use an exponential
backoff to lower the contention correctly. Second, if you have noticed, this
dynamic sleep (or the exponential backoff) uses the Sleep() function to lower
the contention, and that is not correct either: when you use an exponential
backoff with Sleep() and the CAS fails under contention, it will sleep for
milliseconds, which is too long and will make your Spinlock very slow. The
correct solution is to use the PAUSE instruction instead of Sleep(), like
this:


for i:=0 to LSleepAmount do asm pause end;

and after that also call Sleep(0), so that you don't freeze your computer.

I have patched the above Spinlock in that way, adding an exponential backoff
that uses the PAUSE instruction instead of Sleep(), and the Spinlock has
worked perfectly even under contention.
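
Here is a small C++ sketch of that backoff step, using _mm_pause() for the
PAUSE instruction and std::this_thread::yield() as a rough equivalent of
Sleep(0); the function name and the cap on the window are my own choices, not
taken from the patched Pascal code.

#include <immintrin.h>   // _mm_pause(): the x86 PAUSE instruction
#include <thread>

// One backoff step: spin with PAUSE for a bounded window, then yield the
// time slice, and grow the window exponentially up to a cap.
inline void backoff_once(int &window, int max_window = 1024) {
    for (int i = 0; i < window; ++i)
        _mm_pause();                 // cheap pause, stays off the memory bus
    std::this_thread::yield();       // rough equivalent of Sleep(0)
    if (window < max_window)
        window *= 2;                 // exponential growth, capped
}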


So I think I will now go back to the Spinlock with exponential backoff,
because it has good performance.


Thank you,
Amine Moulay Ramdane.











== 2 of 3 ==
Date: Tues, Sep 24 2013 3:58 pm
From: aminer



Hello,

Personally, I love the Spinlock with an exponential backoff very much, because
it is a simple, efficient, and bright idea: the exponential backoff amortizes
and lowers a lot of the cache transfers between the cores, and this is why it
has almost the same performance as the MCS queue Lock (look at the graphs in
the following link and you will be convinced). And if you ask about the
starvation problem, look at the following; it says:

"Starvation freedom is desirable, but not essential
- practical locks: many permit starvation, although it is unlikely to occur"

Read here:

http://www.cs.rice.edu/~vs3/comp422/lecture-notes/comp422-lec19-s08-v1.pdf

As you have noticed, it says starvation freedom is "not essential". Other than
that, with fair locks such as the MCS queue Lock we risk the lock-convoy
problem; this is why I love the Spinlock with an exponential backoff very
much, because it is a simple, bright, and efficient idea.




Thank you,
Amine Moulay Ramdane.





== 3 of 3 ==
Date: Tues, Sep 24 2013 4:00 pm
From: aminer



Hello,

Personally, I love the Spinlock with an exponential backoff very much, because
it is a simple, efficient, and bright idea: the exponential backoff amortizes
and lowers a lot of the cache transfers between the cores, and this is why it
has almost the same performance as the MCS queue Lock (look at the graphs in
the following link and you will be convinced). And if you ask about the
starvation problem, look at the following; it says:

"Starvation freedom is desirable, but not essential
- practical locks: many permit starvation, although it is unlikely to occur"

Read here:

http://www.cs.rice.edu/~vs3/comp422/lecture-notes/comp422-lec19-s08-v1.pdf

As you have noticed, it says starvation freedom is "not essential". Other than
that, with fair locks such as the MCS queue Lock we risk the lock-convoy
problem; this is why I love the Spinlock with an exponential backoff very
much, because it is a simple, bright, and efficient idea.




Thank you,
Amine Moulay Ramdane.






==============================================================================
TOPIC: About lockfree queues ...
http://groups.google.com/group/comp.programming.threads/t/331de802b7b87fa5?hl=en
==============================================================================

== 1 of 1 ==
Date: Tues, Sep 24 2013 4:17 pm
From: aminer


Hello,


I wrote this before:

"I have come to an interresting subject about lockfree queues,
if you take a look at those lockfree queues, like in transactional
memory, if the data have changed those lockfree queue retries and loop
again, but this (lockfree) mechanism generates a lot of cache-coherence
traffic, so i will not advice this lockfree mechanism, so how
can we do it ? you can take a look at the MCS queue Lock , i
think they are doing it like a waitfree linklist that doesn't spin and
that reduces a lot the cache-coherence traffic, other than that
if your memory manager is not optimal and uses a lockfree mechanism and
generates a lot cache-coherence traffic, so you have to use a freelist
to lower the cache coherence traffic."


But you have to know that even if you use spin waits inside lockfree
algorithms, which generate a lot of cache-coherence traffic, you can avoid
this problem and make your lockfree queue more efficient by adding an
exponential backoff to your lockfree mechanism, using the PAUSE asm
instruction followed by Sleep(0) so that you don't freeze the computer. This
way, as with the Spinlock, it will reduce the cache-coherence traffic and make
your lockfree queue more efficient.
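
A minimal C++ sketch of that retry-with-backoff pattern is shown below. It
uses a lock-free stack push only because that is the shortest CAS retry loop
to write down; the Node type, the backoff cap, and the use of _mm_pause() plus
yield() (as a rough Sleep(0)) are my own illustrative choices.

#include <atomic>
#include <immintrin.h>   // _mm_pause(): the x86 PAUSE instruction
#include <thread>

// Hypothetical node type for the example.
struct Node { int value = 0; Node *next = nullptr; };

// CAS retry loop with exponential backoff: on every failed CAS we pause for
// a growing window and yield the time slice before trying again.
void push_with_backoff(std::atomic<Node*> &head, Node *node) {
    int window = 1;
    Node *old_head = head.load(std::memory_order_relaxed);
    for (;;) {
        node->next = old_head;
        if (head.compare_exchange_weak(old_head, node,
                                       std::memory_order_release,
                                       std::memory_order_relaxed))
            return;                      // published the node, done
        for (int i = 0; i < window; ++i)
            _mm_pause();                 // back off before the next attempt
        std::this_thread::yield();       // rough equivalent of Sleep(0)
        if (window < 1024)
            window *= 2;
    }
}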




Thank you,
Amine Moulay Ramdane.





==============================================================================

You received this message because you are subscribed to the Google Groups "comp.programming.threads"
group.

To post to this group, visit http://groups.google.com/group/comp.programming.threads?hl=en

To unsubscribe from this group, send email to comp.programming.threads+unsubscribe@googlegroups.com

To change the way you get mail from this group, visit:
http://groups.google.com/group/comp.programming.threads/subscribe?hl=en

To report abuse, send email explaining the problem to abuse@googlegroups.com

==============================================================================
Google Groups: http://groups.google.com/?hl=en
