http://groups.google.com/group/comp.programming.threads?hl=en
comp.programming.threads@googlegroups.com
Today's topics:
* Multi-cores and contention,,, - 3 messages, 3 authors
http://groups.google.com/group/comp.programming.threads/t/b74390b785ea0e26?hl=en
* weird thread behavior on itanium/hp-ux with pthread - 1 messages, 1 author
http://groups.google.com/group/comp.programming.threads/t/b2b5d744b99b760c?hl=en
* Spinlocks - trouble with speed - 13 messages, 4 authors
http://groups.google.com/group/comp.programming.threads/t/9f5e34251ccf45e3?hl=en
* Multi-producer/multi-consumer SEH-based queue - 8 messages, 2 authors
http://groups.google.com/group/comp.programming.threads/t/c43d1197c8168b4e?hl=en
==============================================================================
TOPIC: Multi-cores and contention,,,
http://groups.google.com/group/comp.programming.threads/t/b74390b785ea0e26?hl=en
==============================================================================
== 1 of 3 ==
Date: Wed, Jul 29 2009 7:16 pm
From: David Schwartz
On Jul 29, 12:21 pm, Amine <ami...@colba.net> wrote:
> What is the benefit of a very fast lock-free hash table if the
> in-memory data it accesses is big and it is *randomly* accessed,
> which will lower the speed,
> *AND*
> there is also contention on the bus in an M/M/1 manner, which will
> lower the speed as well?
The benefit is that the lock-free hash table stays out of the way and
doesn't have any significant negative impact on performance, and
that's the best you can expect it to do.
> I mean, what is the benefit of a very FAST Ferrari if the roads are
> not so good and they greatly lower its speed?
The benefit is that you only have one thing to work on rather than
two. The roads being slow is not the problem the fast car is intended
to solve.
DS
== 2 of 3 ==
Date: Thurs, Jul 30 2009 3:39 am
From: Dave Butenhof
David Schwartz wrote:
> On Jul 29, 12:21 pm, Amine <ami...@colba.net> wrote:
>> I mean, what is the benefit of a very FAST Ferrari if the roads are
>> not so good and they greatly lower its speed?
>
> The benefit is that you only have one thing to work on rather than
> two. The roads being slow is not the problem the fast car is intended
> to solve.
And, let's face the facts: the Ferrari is pretty cool even standing
still in your driveway. Sometimes that alone is sufficient benefit. ;-)
But to build on David's final comment, having the Ferrari means that
when you hit the Autobahn you can floor it and really move. The value
isn't diminished by the fact that it inevitably goes slow on congested
residential streets. (Although, to take it even further, if you plan to
drive exclusively on those streets, the value of the Ferrari is reduced
to the fact that it looks cooler than those other cars going the same
speed.)
In other words, even if the way you generally need to use your lock-free
algorithms prevents them from doing their best, you may still be able to
use them in other ways that exploit their abilities. If you don't ever
use them the way they really were intended to be used, you still have
the satisfaction of knowing you're using cool state-of-the-art lock-free
synchronization.
(And maybe you can think about improving the road...)
== 3 of 3 ==
Date: Thurs, Jul 30 2009 10:29 am
From: Amine
Hello again,
I was still *thinking*...
Now, if you have followed along with me, I think there are three
things. You have to look at *scalability* from three perspectives:
ALGORITHM + HARDWARE + DATA
I wrote:
"I mean what is the benefit of a very FAST Ferrari car if the roads
are not so good and they lower a lot its speed ?"
Now, please follow along with me...
I wrote:
[1] "So, let say for example that we are runnning under an Azul system
with more than 700 cores: http://www.azulsystems.com/and using
this lock-free Hash table:http://www.azulsystems.com/events/
javaone_2007/2007_LockFreeHash.pdf
and millions of transactions (let say find() transactions that search
and return some data) are coming with an arrival rate, those
transations
are mostly memory bound and let suppose an instant that those are
mostly
random transactions that *miss* the cash, so that they are serviced
not in an
M/M/m manner but in M/M/1 with the memory as the server and with its
low service
rate lower than the caches, that will be very bad. So in this
situation: how can even
this Azul system with more than 700 cores with let say a lock-free
Hash-table scale
in performance ?"
As you may have noticed, this problem can still occur with my
lock-free Threadpool and lockfree_SPMC.
Now, this problem doesn't come from the ALGORITHM; it can come from
the HARDWARE (one bus servicing the jobs in an M/M/1 manner) and/or
from the essence of the DATA (random accesses to memory that *miss*
the cache).
Dave Butenhof wrote in a post:
"And maybe you can think about improving the road... "
But as you may have noticed, the 'bad road' can be the essence of the
DATA itself.
So let's take the following situation as an example
(and it is not the worst case):
Let's say that access to the big in-memory data is not so random, and
3 MB of data is frequently used, say 80% of the time. What I want is
to *completely* avoid the situation where the 3 MB of cached data
that is used 80% of the time gets evicted from the cache: a situation
that can look like a livelock if we have a very high transaction
arrival rate...
So, what I want in this situation is also explicit control over the
cache; that means *forcing* the cache to lock 3 MB of data in a cache
of, let's say, 4 MB (to be able to parallelize like an M/M/m...).
Hence my question:
Which hardware can do this sort of thing in this sort of situation?
Regards,
Amine Moulay Ramdane.
On Jul 29, 8:34 pm, Amine <ami...@colba.net> wrote:
> Hello all,
>
> I wrote:
>
> "I mean what is the benefit of a very FAST Ferari car if the roads
> are
> not so good and they lower a lot its speed ?"
>
> I was still *thinking* ...
>
> So, let's say for example that we are running on an Azul system
> with more than 700 cores (http://www.azulsystems.com/) and using
> this lock-free hash table: http://www.azulsystems.com/events/javaone_2007/2007_LockFreeHash.pdf
>
> and millions of transactions (say, find() transactions that search
> and return some data) are coming in at some arrival rate. Those
> transactions are mostly memory bound; suppose for a moment that they
> are mostly random accesses that *miss* the cache, so that they are
> serviced not in an M/M/m manner but in M/M/1, with the memory as the
> server and with its service rate much lower than the caches'. That
> will be very bad. So in this situation, how can even this Azul system
> with more than 700 cores, with, let's say, a lock-free hash table,
> scale in performance?
>
> Regards,
> Amine.
>
> On Jul 29, 3:32 pm, Amine <ami...@colba.net> wrote:
>
>
>
> > On Jul 29, 3:21 pm, Amine <ami...@colba.net> wrote:
>
> > > I wrote:
> > > > So, now imagine that we alleviate the problem by using an M/M/m on
> > > > the memory bus side; this will lower contention. But imagine that
> > > > we are using some big in-memory data and accessing it *randomly*
> > > > with a lock-free hash table - what will the performance be then?
> > > > Not so good, I think...
>
> > > I mean what is the benefit of a very FAST Ferrari car if the roads are
> > > not so good
> > > and they lower a lot it's speed ?
>
> > typo:
>
> > I mean what is the benefit of a very FAST Ferrari car if the roads are
> > not so good and they lower a lot its speed ?
>
> > Regards,
> > Amine Moulay Ramdane.
>
> > > What is the benefit of a very fast lock-free hash table if the
> > > in-memory data it accesses is big and it's *randomly* accessed,
> > > which will lower the speed,
> > > *AND*
> > > there is also contention on the bus in an M/M/1 manner, which will
> > > lower the speed as well?
>
> > > Regards,
> > > Amine Moulay Ramdane.
>
> > > On Jul 29, 2:35 pm, Amine <ami...@colba.net> wrote:
>
> > > > Hello all,
>
> > > > I was thinking about my lock-free Threadpool and lock-free_SPMC
>
> > > >http://www.colba.net/~aminer/
>
> > > > I am using a queue for every thread worker (and on the producer side
> > > > you can also push in parallel), and I have come to the conclusion
> > > > that this will scale very well on the software side...
>
> > > > But I was thinking more about *scalability*, and I have come to the
> > > > conclusion that there is still a problem; please follow along with me:
>
> > > > As Amdahl's law states, Speedup = 1 / ((1 - P) + P/N)
> > > > (N: number of cores).
> > > > Now, let's take as an example that we are working with a scalable
> > > > lock-free hashtable. If the transactions (find, insert, ...) are
> > > > coming to the lock-free hashtable and the jobs (transactions) are
> > > > processed in an M/M/m manner, with multiple threads and cores
> > > > servicing the jobs, that's very good...
>
> > > > Now suppose that an application like a lock-free hashtable is
> > > > working frequently with data, and we are accessing big data
> > > > directly in memory, *AND* the data is randomly accessed, *AND*
> > > > there is contention between processors over the memory bus in an
> > > > M/M/1 manner. Then there will be an appreciable serial part
> > > > (1 - P) in the Amdahl equation, hence:
> > > > What would you expect from the scalable lock-free hashtable?
> > > > What would the performance be then? I think it will be very bad.
>
> > > > So, now imagine that we alleviate the problem by using an M/M/m on
> > > > the memory bus side; this will lower contention. But imagine that
> > > > we are using some big in-memory data and accessing it *randomly*
> > > > with a lock-free hash table - what will the performance be then?
> > > > Not so good, I think...
>
> > > > Also, I have another question:
>
> > > > Read this:
>
> > > >http://www.azulsystems.com/events/javaone_2007/2007_LockFreeHash.pdf
>
> > > > How can this lock-free hashtable scale well in performance on an
> > > > Azul system with more than 700 cores, if we are using some very
> > > > big in-memory data and accessing it *RANDOMLY*? And what memory
> > > > bus architecture are they using?...
>
> > > > What do you think about this problem ?
>
> > > > Regards,
> > > > Amine Moulay Ramdane.
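As a rough sanity check on the numbers above, here is a minimal C
sketch of the Amdahl's-law arithmetic (an illustration only, not code
from the thread; the amdahl() helper is made up):

#include <stdio.h>

/* Amdahl's law: speedup for a parallel fraction P on N cores */
static double amdahl(double P, int N)
{
    return 1.0 / ((1.0 - P) + P / N);
}

int main(void)
{
    /* even with 95% parallel work, 700 cores give under 20x */
    printf("P=0.95, N=700 -> %.1fx\n", amdahl(0.95, 700));
    printf("P=0.99, N=700 -> %.1fx\n", amdahl(0.99, 700));
    return 0;
}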
==============================================================================
TOPIC: weird thread behavior on itanium/hp-ux with pthread
http://groups.google.com/group/comp.programming.threads/t/b2b5d744b99b760c?hl=en
==============================================================================
== 1 of 1 ==
Date: Thurs, Jul 30 2009 1:31 am
From: Loïc Domaigné
Hello,
> > I guess, this should be "sleeping = false" in thread 1, right? It's
> > difficult to pinpoint problem using pseudo-code. Can you please post a
> > minimalistic example (which compiles) that shows your particular
> > problem?
>
> Thanks Loic. Yes, that was a typo. The problem only happens at a remote
> site under particular conditions and could not be readily reproduced
> locally. What interests me is whether this sleep/wakeup mechanism is
> faulty on HP/Itanium for whatever reason.
I see. As pointed out by David, the scheme used is unnecessarily
complicated, could have some logic errors, and might be inefficient.
For instance, I don't see where the signaled variable is reset to
false. Second, wouldn't thread1 spin as long as sleeping is false?
Regards,
Loïc
--
My blog: http://www.domaigne.com/blog
"The best thing about a boolean is even if you are wrong, you are only
off by a bit."
==============================================================================
TOPIC: Spinlocks - trouble with speed
http://groups.google.com/group/comp.programming.threads/t/9f5e34251ccf45e3?hl=en
==============================================================================
== 1 of 13 ==
Date: Thurs, Jul 30 2009 9:17 am
From: Thomas Richter
Hi folks,
to protect a shared resource which needs to be locked only very shortly,
I'm using a very simple spinlock technique:
static inline void FutexObtain(volatile int *fu)
{
    /* try to grab the lock by decrementing; <= 0 means it was taken */
    while(__sync_fetch_and_add(fu, -1) <= 0) {
        Locked_Add(fu, 1);       /* undo our decrement */
        while(*fu <= 0) {};      /* spin (read-only) until it looks free */
    }
}

static inline void FutexRelease(volatile int *fu)
{
    __sync_fetch_and_add(fu, 1); /* give the lock back */
}
where __sync_fetch_and_add is the GNU gcc built-in for a locked add
instruction on the x86.
Interestingly, while this works on x86 as intended, it is pretty fast on
some systems, while on other systems its performance is slower than the
corresponding OS mutex implementations. I'm speaking here only about
dual- (or multi-) core x86 implementations, and the (current) target is
Linux.
I wonder whether there are known defects in this simple spinlocking
technique - and remember, the typical number of spins of a core on this
lock has been verified to be very low on average.
Thanks,
Thomas
== 2 of 13 ==
Date: Thurs, Jul 30 2009 2:31 pm
From: Loïc Domaigné
Hi Thomas,
> to protect a shared resource which needs to be locked only very shortly,
> I'm using a very simple spinlock technique:
>
> static inline void FutexObtain(volatile int *fu)
> {
> while(__sync_fetch_and_add(fu,-1) <= 0) {
> Locked_Add(fu,1);
> while(*fu <=0) {};
> }
> }
> static inline void FutexRelease(volatile int *fu)
> {
> __sync_fetch_and_add(fu,1);
> }
>
> where __sync_fetch_and_add is the GNU gcc built-in for a locked add
> instruction on the x86.
>
> Interestingly, while this works on x86 as intended, it is pretty fast on
> some systems, while performance on other systems is slower than the
> corresponding Os mutex implementations. I'm here only speaking about
> dual (or multi-) core x86 implementations, and the (current) target is
> Linux.
>
> I wonder whether there are known defects about this simple spinlocking
> technique - and remember, the typical number of spins of a core on this
> lock has to be verified to be very low on average.
I'm a noob in lock-free stuff, but don't you need a membar for the
while (*fu <= 0) {}; loop?
Second idea: what about using the pthread_spin_lock() family of
synchronization devices?
Cheers,
Loïc
--
My Blog: http://www.domaigne.com/blog
"Should array indices start at 0 or 1? My compromise of 0.5 was
rejected without, I thought, proper consideration." -- Stan Kelly-
Bootle
== 3 of 13 ==
Date: Thurs, Jul 30 2009 3:03 pm
From: David Schwartz
On Jul 30, 9:17 am, Thomas Richter <t...@math.tu-berlin.de> wrote:
> Hi folks,
>
> to protect a shared resource which needs to be locked only very shortly,
> I'm using a very simple spinlock technique:
>
> static inline void FutexObtain(volatile int *fu)
> {
> while(__sync_fetch_and_add(fu,-1) <= 0) {
> Locked_Add(fu,1);
> while(*fu <=0) {};
> }
> }
> static inline void FutexRelease(volatile int *fu)
> {
> __sync_fetch_and_add(fu,1);
> }
Wow, you made every possible mistake!
> I wonder whether there are known defects about this simple spinlocking
> technique - and remember, the typical number of spins of a core on this
> lock has to be verified to be very low on average.
Are there any known defects?! Your naive spinlock has every defect
known to man!
Here's the biggest. Consider an x86 machine with four cores. Core A
holds the spinlock. Cores B, C, and D are blocked on the spinlock. How
is core A supposed to make any forward progress if cores B, C, D are
saturating the FSB fighting over ownership of the spinlock integer?
(NEVER spin on an operation that writes or can write!!)
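To illustrate the alternative: a test-and-test-and-set loop attempts
the atomic write only after a plain read says the lock looks free, so
waiters generate no bus traffic while they wait. A minimal sketch,
assuming GCC's __sync builtins on x86 (an illustration, not the
poster's code):

static inline void ttas_acquire(volatile int *lock)
{
    for (;;) {
        if (!__sync_lock_test_and_set(lock, 1))
            return;                        /* got the lock */
        while (*lock)                      /* read-only spin */
            __asm__ __volatile__("pause"); /* be polite to the sibling */
    }
}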
Here's the second biggest: When a CPU does acquire the spinlock after
spinning on it, it takes the mother of all branch mispredictions,
likely flushing all pipelines and resulting in a huge performance
penalty. So, now you have a CPU that's acquired a contested spinlock
-- it's absolutely vital that it perform at the very optimum to
preserve performance -- and you've slowed it to a crawl.
Here's the third biggest: Consider a hyper-threaded CPU. Now imagine
one virtual core holds the spinlock and the other is blocked on it.
The "blocked" CPU will spin at full speed, depriving the other virtual
core of execution resources. So, again, the CPU that holds the
spinlock will run at a massively reduced speed at precisely the time
it most needs to hurry.
There are more, of course. You've almost literally made every possible
mistake. Spinlocks are phenomenally complex. You have to have a very
deep understanding of the CPU internals to get them right.
DS
== 4 of 13 ==
Date: Thurs, Jul 30 2009 3:09 pm
From: David Schwartz
On Jul 30, 3:03 pm, David Schwartz <dav...@webmaster.com> wrote:
> Are there any known defects?! Your naive spinlock has every defect
> known to man!
Sorry, I misread your code. You don't have this defect:
> Here's the biggest. Consider an x86 machine with four cores. Core A
> holds the spinlock. Cores B, C, and D are blocked on the spinlock. How
> is core A supposed to make any forward progress if cores B, C, D are
> saturating the FSB fighting over ownership of the spinlock integer?
> (NEVER spin on an operation that writes or can write!!)
But you do have the branch misprediction and the execution unit
deprivation.
DS
== 5 of 13 ==
Date: Thurs, Jul 30 2009 3:41 pm
From: Thomas Richter
David Schwartz wrote:
> On Jul 30, 9:17 am, Thomas Richter <t...@math.tu-berlin.de> wrote:
>> Hi folks,
>>
>> to protect a shared resource which needs to be locked only very shortly,
>> I'm using a very simple spinlock technique:
>>
>> static inline void FutexObtain(volatile int *fu)
>> {
>> while(__sync_fetch_and_add(fu,-1) <= 0) {
>> Locked_Add(fu,1);
>> while(*fu <=0) {};
>> }
>> }
>> static inline void FutexRelease(volatile int *fu)
>> {
>> __sync_fetch_and_add(fu,1);
>> }
As the "biggest" problem is apparently not present, let's go to the other:
> Here's the second biggest: When a CPU does acquire the spinlock after
> spinning on it, it takes the mother of all branch mispredictions,
> likely flushing all pipelines and resulting in a huge performance
> penalty. So, now you have a CPU that's acquired a contested spinlock
> -- it's absolutely vital that it perform at the very optimum to
> preserve performance -- and you've slowed it to a crawl.
So how does one avoid this? That is, instead of a negative critique,
provide an improvement. If the CPU leaves the spin, of course the
branch cache is no longer up to date - but how could it be, given that
it leaves the spin? Why is the pipeline flushed, and how can one avoid
the pipeline stall?
> Here's the third biggest: Consider a hyper-threaded CPU. Now imagine
> one virtual core holds the spinlock and the other is blocked on it.
> The "blocked" CPU will spin at full speed, depriving the other virtual
> core of execution resources. So, again, the CPU that holds the
> spinlock will run at a massively reduced speed at precisely the time
> it most needs to hurry.
I don't care much about hyperthreaded CPUs right now, but I can throw
a couple of "nops" into the loop, or a bogus computation. Rest assured
that I tried this and it made no difference - but then, this isn't a
hyperthreaded CPU in the first place.
> There are more, of course. You've almost literally made every possible
> mistake. Spinlocks are phenomenally complex. You have to have a very
> deep understanding of the CPU internals to get them right.
Ok, so get them right, please.
Thanks,
Thomas
== 6 of 13 ==
Date: Thurs, Jul 30 2009 3:44 pm
From: Thomas Richter
Loïc Domaigné wrote:
>> I wonder whether there are known defects about this simple spinlocking
>> technique - and remember, the typical number of spins of a core on this
>> lock has to be verified to be very low on average.
>
> I'm a noob in lock-free stuff, but don't you need a membar for the
> while (*fu<=0) {};
Actually, one probably should. Unlike the "locked add", the plain read
in the spin loop at least keeps the bus available for the working
processor.
> Second idea: what about using pthread_spin_lock() family
> synchronization device.
Yes, of course - if I had them available, which I don't.
So long,
Thomas
== 7 of 13 ==
Date: Thurs, Jul 30 2009 3:46 pm
From: "Chris M. Thomasson"
"Loïc Domaigné" <loic.domaigne@googlemail.com> wrote in message
news:df4ef88b-41e0-4115-8b16-0eefab033592@r2g2000yqm.googlegroups.com...
Hi Thomas,
> to protect a shared resource which needs to be locked only very shortly,
> I'm using a very simple spinlock technique:
>
> static inline void FutexObtain(volatile int *fu)
> {
> while(__sync_fetch_and_add(fu,-1) <= 0) {
> Locked_Add(fu,1);
> while(*fu <=0) {};
> }
> }
> static inline void FutexRelease(volatile int *fu)
> {
> __sync_fetch_and_add(fu,1);
> }
[...]
> I'm a noob in lock-free stuff, but don't you need a membar for the
> while (*fu<=0) {};
No. When that while loop exits it loops back around to the
`__sync_fetch_and_add()' which is a membar.
> Second idea: what about using pthread_spin_lock() family
> synchronization device.
Good idea. :^)
== 8 of 13 ==
Date: Thurs, Jul 30 2009 3:56 pm
From: "Chris M. Thomasson"
"Thomas Richter" <thor@math.tu-berlin.de> wrote in message
news:h4t7eq$bf0$1@infosun2.rus.uni-stuttgart.de...
[...]
> I would currently not care enough about hyperthreading CPUs right now, but
> I can throw in a couple of "nops" in the loop, or a bogus computation.
> Feel ensured that I tried this, made no difference, but this isn't a
> hyperthreading CPU in first place.
Use the PAUSE instruction. Here is a simple spinlock (in MSVC++) you can try
out:
________________________________________________________________________
typedef __int32 atomicword_ia32;
typedef atomicword_ia32 spinlock_ia32;

#define SPINLOCK_IA32_INIT 0

#define spinlock_ia32_init(mp_self) ( \
    *(mp_self) = 0 \
)

__declspec(naked) void
spinlock_ia32_acquire(
    spinlock_ia32* const self
) {
    _asm {
        MOV ECX, [ESP + 4]
        MOV EAX, 1

    spinlock_ia32_acquire_retry:
        XCHG [ECX], EAX
        TEST EAX, EAX
        JNZ spinlock_ia32_acquire_failed
        RET

    spinlock_ia32_acquire_failed:
        PAUSE
        JMP spinlock_ia32_acquire_retry
    }
}

__declspec(naked) void
spinlock_ia32_release(
    spinlock_ia32* const self
) {
    _asm {
        MOV EAX, [ESP + 4]
        MOV DWORD PTR [EAX], 0
        RET
    }
}
________________________________________________________________________
You can easily convert this to AT&T syntax and use GAS to assemble it.
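For instance, a rough GCC inline-asm rendering of the same
acquire/release logic (an untested sketch, assuming GCC on IA-32, not
a verbatim translation):

static inline void spinlock_ia32_acquire(volatile int *self)
{
    int old;
    for (;;) {
        old = 1;
        /* XCHG with a memory operand is implicitly locked */
        __asm__ __volatile__("xchgl %0, %1"
                             : "+r"(old), "+m"(*self)
                             :
                             : "memory");
        if (old == 0)
            return;                        /* acquired */
        __asm__ __volatile__("pause");     /* back off */
    }
}

static inline void spinlock_ia32_release(volatile int *self)
{
    __asm__ __volatile__("" ::: "memory"); /* compiler barrier */
    *self = 0;                             /* plain store suffices on x86 */
}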
== 9 of 13 ==
Date: Thurs, Jul 30 2009 4:00 pm
From: "Chris M. Thomasson"
"Thomas Richter" <thor@math.tu-berlin.de> wrote in message
news:h4sgvg$kd$1@infosun2.rus.uni-stuttgart.de...
> Hi folks,
>
> to protect a shared resource which needs to be locked only very shortly,
> I'm using a very simple spinlock technique:
>
>
> static inline void FutexObtain(volatile int *fu)
> {
> while(__sync_fetch_and_add(fu,-1) <= 0) {
> Locked_Add(fu,1);
> while(*fu <=0) {};
> }
> }
> static inline void FutexRelease(volatile int *fu)
> {
> __sync_fetch_and_add(fu,1);
> }
[...]
Why are you using an atomic RMW instruction to release the lock on an x86?
That is unnecessary overhead as you only need a plain MOV instruction:
http://www.intel.com/Assets/PDF/manual/253668.pdf
(read section 8.2)
Also, why are you using FAA? It's clearly not needed. BTW, what's the point
of `Locked_Add()'?
== 10 of 13 ==
Date: Thurs, Jul 30 2009 4:07 pm
From: "Chris M. Thomasson"
"Chris M. Thomasson" <no@spam.invalid> wrote in message
news:h4t8m2$10f9$1@news.ett.com.ua...
>
> "Thomas Richter" <thor@math.tu-berlin.de> wrote in message
> news:h4sgvg$kd$1@infosun2.rus.uni-stuttgart.de...
>> Hi folks,
>>
>> to protect a shared resource which needs to be locked only very shortly,
>> I'm using a very simple spinlock technique:
[...]
> Why are you using an atomic RMW instruction to release the lock on an x86?
> That is unnecessary overhead as you only need a plain MOV instruction:
>
>
> http://www.intel.com/Assets/PDF/manual/253668.pdf
> (read section 8.2)
>
>
> Also, why are you using FAA? Its clearly not needed. BTW, what's the point
> of `Locked_Add()'?
One other really important question... Are you sure that you have
eliminated any chance of false sharing between the lock word and any
state within the critical section? Or between the lock word and any
other lock word? You need to ensure that the lock word is padded up to
and aligned on an L2 cache line.
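One way to get that padding with GCC, as a sketch (assuming a 64-byte
cache line; the type name is made up):

#define CACHE_LINE 64

typedef struct {
    volatile int lock;                  /* the lock word itself */
    char pad[CACHE_LINE - sizeof(int)]; /* keep neighbors off this line */
} __attribute__((aligned(CACHE_LINE))) padded_lock_t;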
== 11 of 13 ==
Date: Thurs, Jul 30 2009 4:24 pm
From: "Chris M. Thomasson"
"Thomas Richter" <thor@math.tu-berlin.de> wrote in message
news:h4sgvg$kd$1@infosun2.rus.uni-stuttgart.de...
> Hi folks,
>
> to protect a shared resource which needs to be locked only very shortly,
> I'm using a very simple spinlock technique:
[...]
I think I notice a potential problem... Think about this scenario:
static inline void FutexObtain(volatile int *fu)
{
O1:     while(__sync_fetch_and_add(fu,-1) <= 0) {
O2:         Locked_Add(fu,1);
O3:         while(*fu <=0) {};
        }
}

static inline void FutexRelease(volatile int *fu)
{
R1:     __sync_fetch_and_add(fu,1);
}
These operations will be assigned to threads A-F...
assuming the lock word is initialized to 1
A1: O1 // lock word == 0 ** OWNER **
B1: O1 // lock word == -1
C1: O1 // lock word == -2
D1: O1 // lock word == -3
E1: O1 // lock word == -4
A2: R1 // lock word == -3 ** LOCK IS FREE **
F1: O1 // lock word == -2 ** OOPS!!! **
Notice how Thread A released the lock at A2, which means the next thread
that comes along should be able to acquire the lock (e.g., thread F).
However, it cannot acquire it, because the lock word was -3 and it
decremented it to -2, which forces a slow path. That is absolutely
horrible. It seems as if you're trying to hand off ownership, but even
then it doesn't seem like it would work. Let's continue the sequence to
see just how long it would take for another thread to acquire the lock:
B1: O2 // lock word -1
B1: O3 // spin on -1
C1: O2 // lock word 0
C1: O3 // spin on 0
D1: O2 // lock word 1
C1: O1 // lock word 0 ** OWNER **
Wow! That's a hardcore fuc%ed up nightmare. Also, I think I might be able to
come up with a sequence in which no thread can get the lock, and you
livelock on the acquire portion! ;^(...
Why not just do something simple like this first:
// init lock word to 0
static inline void FutexObtain(volatile int *fu)
{
    while (__sync_lock_test_and_set(fu, 1)) {
        asm("pause"); // backoff
    }
}

static inline void FutexRelease(volatile int *fu)
{
    __sync_lock_release(fu);
}
Were you looking for lock fairness? If so, just use a standard bakery
algorithm and be done with it.
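For reference, a minimal ticket-lock sketch of that fairness idea,
assuming GCC's __sync builtins (an illustration, not a drop-in
implementation):

typedef struct {
    volatile unsigned next;  /* next ticket to hand out */
    volatile unsigned owner; /* ticket currently being served */
} ticket_lock_t;             /* init both fields to 0 */

static inline void ticket_acquire(ticket_lock_t *l)
{
    unsigned my = __sync_fetch_and_add(&l->next, 1); /* take a ticket */
    while (l->owner != my)                           /* wait for our turn */
        __asm__ __volatile__("pause");
}

static inline void ticket_release(ticket_lock_t *l)
{
    __asm__ __volatile__("" ::: "memory"); /* compiler barrier */
    l->owner = l->owner + 1;               /* serve the next ticket */
}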
== 12 of 13 ==
Date: Thurs, Jul 30 2009 5:52 pm
From: "Chris M. Thomasson"
First of all, sorry for all the posts.
"Chris M. Thomasson" <no@spam.invalid> wrote in message
news:h4ta33$115h$1@news.ett.com.ua...
[...]
> B1: O2 // lock word -1
> B1: O3 // spin on -1
> C1: O2 // lock word 0
> C1: O3 // spin on 0
> D1: O2 // lock word 1
> C1: O1 // lock word 0 ** OWNER **
[...]
I forgot to increment the threads' sequence numbers in the latter part;
here is the entire corrected sequence:
A1: O1 // lock word == 0 ** OWNER **
B1: O1 // lock word == -1
C1: O1 // lock word == -2
D1: O1 // lock word == -3
E1: O1 // lock word == -4
A2: R1 // lock word == -3 ** LOCK IS FREE **
F1: O1 // lock word == -2 ** OOPS!!! **
B2: O2 // lock word == -1
B3: O3 // spin on -1
C2: O2 // lock word == 0
C3: O3 // spin on 0
D2: O2 // lock word == 1
C4: O1 // lock word == 0 ** OWNER... **
Sorry for any confusion!
;^(...
Anyway, this is a very bad performance problem indeed. I suggest that you
look into it and apply proper corrections to the algorithm.
== 13 of 13 ==
Date: Thurs, Jul 30 2009 7:27 pm
From: Loïc Domaigné
Good Morning,
> > Second idea: what about using pthread_spin_lock() family
> > synchronization device.
>
> Yes, of course, if I'd had them available - which I don't.
I guess you mean that the following grep returns nothing, right?
$ grep "pthread_spin" /usr/include/pthread.h
Otherwise, you may perhaps just need to tell the compiler to turn the
SUSv3 feature on (by setting _XOPEN_SOURCE=600 or
_POSIX_C_SOURCE=200112L).
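For example (a hypothetical compile line; the file name is made up):

$ gcc -D_XOPEN_SOURCE=600 -pthread spin_test.c -o spin_test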
Cheers,
LD
--
My Blog: http://www.domaigne.com/blog
"If debugging is the process of removing bugs, then programming must
be the process of putting them in." -- Edsger W. Dijkstra.
==============================================================================
TOPIC: Multi-producer/multi-consumer SEH-based queue
http://groups.google.com/group/comp.programming.threads/t/c43d1197c8168b4e?hl=en
==============================================================================
== 1 of 8 ==
Date: Thurs, Jul 30 2009 4:57 pm
From: "Dmitriy V'jukov"
Here is my multi-producer/single-consumer queue:
http://groups.google.ru/group/lock-free/browse_frm/thread/55df71b87acb8201
The interesting part of the algorithm is an XCHG-based producer part.
As Chris Thomasson correctly noted, the XCHG-based producer part can
be combined with the well-known CAS-based consumer part in order to
get a multi-producer/multi-consumer (MPMC) queue:
http://groups.google.ru/group/comp.programming.threads/browse_frm/thread/053e322ea90e4ad5
In general, one may combine different producer and consumer parts into
a single queue, provided that the queue structure stays the same. For
example, it's possible to combine the XCHG-based producer part with the
consumer part from the 2-LOCK queue algorithm. The resulting
1-LOCK/1-XCHG queue will have quite appealing characteristics
(wait-free producers, 1 spin-lock acquisition for consumers, and no
need for ABA prevention nor safe memory reclamation).
But what I'm going to show is a bit more interesting: a novel
consumer part for an MPMC queue.
The problem with the classical CAS-based MPMC queue is that both
producers and consumers may touch potentially freed/reused memory;
moreover, producers may write to that memory. That's why it requires
safe memory reclamation (SMR), and in general SMR is quite problematic
in a non-managed, non-kernel environment (C/C++).
The XCHG-based producer part gracefully avoids touching freed/reused
memory. So now the problem is with the consumer part only, but note
that consumers may only read from freed/reused memory (no writes to
that memory). The key point of the proposed algorithm is handling
reads from reused memory with a failing CAS, and handling reads from
freed memory with an SEH/signal handler.
Main characteristics of the algorithm:
- intrusive
- producers: 1 XCHG, wait-free
- consumers: 1 CAS on common path, mostly lock-free (***)
- producers and consumers do not contend with each other (unless the
queue is empty)
- no need for safe memory reclamation
(***) requires additional comment. There is a small (1 machine
instruction in length) window of inconsistency for producers. If a
producer is preempted there, it may (or may not) cause blocking of
consumers (other producers are still wait-free). If a producer is
terminated there, it will cause a system-wide stall. Taking the length
of the window into account, the probability of these things may be
considered negligible in most situations.
The algorithm requires a double-word CAS (for pointer + ABA counter).
On 64-bit systems it may be reduced to a single-word (64-bit) CAS with
a pointer-packing technique. For example, on Intel64/Windows any
aligned pointer may be packed into 39 bits, which leaves room for a
25-bit ABA counter.
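An illustrative packing sketch (not part of the queue code below; the
exact bit budget depends on the OS and node alignment - this assumes
16-byte-aligned nodes and user-space addresses below 2^43, which is
what yields 39 pointer bits):

#include <stdint.h>

/* pack a 16-byte-aligned pointer (addr < 2^43) and a 25-bit counter
   into one 64-bit word: [ counter:25 | pointer>>4 :39 ] */
static inline uint64_t anchor_pack(void* p, uint32_t cnt)
{
    return ((uint64_t)(uintptr_t)p >> 4) | ((uint64_t)cnt << 39);
}

static inline void* anchor_ptr(uint64_t a)
{
    return (void*)(uintptr_t)((a & ((1ull << 39) - 1)) << 4);
}

static inline uint32_t anchor_cnt(uint64_t a)
{
    return (uint32_t)(a >> 39); /* counter wraps at 2^25 */
}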
OK, here we go:
/* Multi-producer/multi-consumer queue
* 2009, Dmitriy V'yukov
* Distributed under the terms of the GNU General Public License
* as published by the Free Software Foundation,
* either version 3 of the License,
* or (at your option) any later version.
* See: http://www.gnu.org/licenses
*/
// 32-bit, Windows, MSVC
#include <windows.h>
#include <intrin.h>
class mpmc_queue
{
public:
    struct node_t
    {
        node_t* volatile next_;
    };

    mpmc_queue()
    {
        head_.ptr_ = 0;
        head_.cnt_ = 0;
        tail_ = &head_.ptr_;
    }

    ~mpmc_queue()
    {
        ASSERT(head_.ptr_ == 0);
        ASSERT(tail_ == &head_.ptr_);
    }

    void enqueue(node_t* node)
    {
        ASSERT(node);
        node->next_ = 0;
        node_t** prev = (node_t**)
            _InterlockedExchange((long*)&tail_, (long)node);
        ASSERT(prev);
        // <--- the window of inconsistency is HERE (***)
        prev[0] = node;
    }

    node_t* dequeue()
    {
        unsigned retry_count = 0;
    retry:
        __try
        {
            head_t h;
            h.ptr_ = head_.ptr_;
            h.cnt_ = head_.cnt_;
            for (;;)
            {
                node_t* n = h.ptr_;
                if (n == 0)
                    return 0;
                if (n->next_)
                {
                    head_t xchg = {n->next_, h.cnt_ + 1};
                    __int64 prev_raw =
                        _InterlockedCompareExchange64
                            (&head_.whole_, xchg.whole_, h.whole_);
                    head_t prev = *(head_t*)&prev_raw;
                    if (*(__int64*)&prev == *(__int64*)&h)
                        return n;
                    h.ptr_ = prev.ptr_;
                    h.cnt_ = prev.cnt_;
                }
                else
                {
                    node_t* t = (node_t*)tail_;
                    if (n != t)
                    {
                        // spinning here may only be caused
                        // by a producer preempted in (***)
                        SwitchToThread();
                        h.ptr_ = head_.ptr_;
                        h.cnt_ = head_.cnt_;
                        continue;
                    }
                    head_t xchg = {0, h.cnt_ + 1};
                    head_t prev;
                    prev.whole_ = _InterlockedCompareExchange64
                        (&head_.whole_, xchg.whole_, h.whole_);
                    if (prev.whole_ == h.whole_)
                    {
                        node_t* prev_tail = (node_t*)
                            _InterlockedCompareExchange
                                ((long*)&tail_, (long)&head_.ptr_, (long)n);
                        if (prev_tail == n)
                            return n;
                        // spinning here may only be caused
                        // by a producer preempted in (***)
                        while (n->next_ == 0)
                            SwitchToThread();
                        head_.ptr_ = n->next_;
                        return n;
                    }
                    h.ptr_ = prev.ptr_;
                    h.cnt_ = prev.cnt_;
                }
            }
        }
        __except ((GetExceptionCode() == EXCEPTION_ACCESS_VIOLATION
                   && ++retry_count < 64*1024) ?
                  EXCEPTION_EXECUTE_HANDLER : EXCEPTION_CONTINUE_SEARCH)
        {
            goto retry;
        }
    }

private:
    union head_t
    {
        struct
        {
            node_t* ptr_;
            unsigned cnt_;
        };
        __int64 whole_;
    };

    head_t volatile head_;
    char pad_ [64];
    node_t* volatile* volatile tail_;

    mpmc_queue(mpmc_queue const&);
    mpmc_queue& operator = (mpmc_queue const&);
};
And here is a small test:
/* Multi-producer/multi-consumer queue
* 2009, Dmitriy V'yukov
* Distributed under the terms of the GNU General Public License
* as published by the Free Software Foundation,
* either version 3 of the License,
* or (at your option) any later version.
* See: http://www.gnu.org/licenses
*/
// extra headers so the test builds standalone
#include <windows.h>
#include <intrin.h>
#include <process.h>
#include <cstdlib>
#include <ctime>
#include <iostream>

size_t const thread_count = 8;
size_t const batch_size = 32;
size_t const iter_count = 400000;

bool volatile g_start = 0;

struct my_node : mpmc_queue::node_t
{
    int data;
    char pad [64];
};

unsigned __stdcall thread_func(void* ctx)
{
    mpmc_queue& queue = *(mpmc_queue*)ctx;
    srand((unsigned)time(0) + GetCurrentThreadId());
    size_t pause = rand() % 1000;
    my_node* node_cache [batch_size];
    for (size_t i = 0; i != batch_size; i += 1)
    {
        node_cache[i] = new my_node;
        node_cache[i]->data = (int)i;
    }
    while (g_start == 0)
        SwitchToThread();
    for (size_t i = 0; i != pause; i += 1)
        _mm_pause();
    for (size_t iter = 0; iter != iter_count; ++iter)
    {
        for (size_t i = 0; i != batch_size; i += 1)
        {
            queue.enqueue(node_cache[i]);
        }
        for (size_t i = 0; i != batch_size; i += 1)
        {
            for (;;)
            {
                my_node* node = (my_node*)queue.dequeue();
                if (node)
                {
                    node_cache[i] = node;
                    break;
                }
                SwitchToThread();
            }
        }
    }
    return 0;
}

int main()
{
    mpmc_queue queue;
    HANDLE threads [thread_count];
    for (size_t i = 0; i != thread_count; ++i)
    {
        threads[i] = (HANDLE)_beginthreadex
            (0, 0, thread_func, &queue, 0, 0);
    }
    Sleep(1);
    unsigned __int64 start = __rdtsc();
    g_start = 1;
    WaitForMultipleObjects((DWORD)thread_count, threads, 1, INFINITE);
    unsigned __int64 end = __rdtsc();
    unsigned __int64 time = end - start;
    std::cout << "cycles/op=" << time /
        (batch_size * iter_count * 2 * thread_count)
        << std::endl;
}
--
Dmitriy V'yukov
== 2 of 8 ==
Date: Thurs, Jul 30 2009 5:39 pm
From: "Chris M. Thomasson"
"Dmitriy V'jukov" <dvyukov@gmail.com> wrote in message
news:8487febc-e1bd-4b3b-9de0-895f825aaf77@w41g2000yqb.googlegroups.com...
> Here is my multi-producer/single-consumer queue:
> http://groups.google.ru/group/lock-free/browse_frm/thread/55df71b87acb8201
> The interesting part of the algorithm is an XCHG-based producer part.
>
> As Chris Thomasson correctly noted, the XCHG-based producer part can
> be combined with the well-known CAS-based consumer part in order to
> get multi-producer/multi-consumer (MPMC) queue:
> http://groups.google.ru/group/comp.programming.threads/browse_frm/thread/053e322ea90e4ad5
[...]
> node_t* dequeue()
> {
> unsigned retry_count = 0;
> retry:
> __try
> {
> head_t h;
> h.ptr_= head_.ptr_;
> h.cnt_ = head_.cnt_;
[...]
I am not sure about this yet, but I think that you might need to load
the ABA counter _before_ the head pointer. I know that this is a
requirement for a lock-free stack when one uses a normal CAS for the
producer side of the algorithm. Here is how such a lock-free stack can
bite the dust if the ABA counter is loaded _after_ the head pointer:
___________________________________________________________________
SC0: loop
SC1: head = lf->top
SC2: oc = lf->ocount
SC3: if head == NULL
SC4: return NULL
SC5: next = head->next
SC6: if DWCAS(&lf->top, <head, oc>, <next, oc + 1>)
SC7: break
SC8: endloop
SC9: return head
Think of the current state being top = pointerA and oc = 1. ThreadA
loads pointerA, then gets preempted by threadB, which pops pointerA,
thus incrementing oc from 1 to 2. ThreadA resumes, loads version 2, and
gets to line SC5, thus reading NULL as the next pointer; it then gets
preempted by threadB, which pushes pointerB and then pointerA onto the
list, mutating the state to top = pointerA and oc = 2, with pointerA
having a next node equal to pointerB. ThreadA resumes and performs a
successful DWCAS operation when it should have failed, thus swapping
the top of the list with NULL - and all the nodes get lost!
___________________________________________________________________
Not sure if this applies to the DWCAS portion of the MPMC queue.
== 3 of 8 ==
Date: Thurs, Jul 30 2009 5:45 pm
From: "Chris M. Thomasson"
"Dmitriy V'jukov" <dvyukov@gmail.com> wrote in message
news:8487febc-e1bd-4b3b-9de0-895f825aaf77@w41g2000yqb.googlegroups.com...
> Here is my multi-producer/single-consumer queue:
> http://groups.google.ru/group/lock-free/browse_frm/thread/55df71b87acb8201
> The interesting part of the algorithm is an XCHG-based producer part.
>
> As Chris Thomasson correctly noted, the XCHG-based producer part can
> be combined with the well-known CAS-based consumer part in order to
> get multi-producer/multi-consumer (MPMC) queue:
> http://groups.google.ru/group/comp.programming.threads/browse_frm/thread/053e322ea90e4ad5
>
> In general, one may combine different producer and consumer parts into
> a single queue, provided that the queue structure stays the same. For
> example, it's possible to combine the XCHG-based producer part with the
> consumer part from the 2-LOCK queue algorithm. The resulting
> 1-LOCK/1-XCHG queue will have quite appealing characteristics
> (wait-free producers, 1 spin-lock acquisition for consumers, and no
> need for ABA prevention nor safe memory reclamation).
>
> But what I'm going to show is a bit more interesting: a novel
> consumer part for an MPMC queue.
> The problem with the classical CAS-based MPMC queue is that both
> producers and consumers may touch potentially freed/reused memory;
> moreover, producers may write to that memory. That's why it requires
> safe memory reclamation (SMR), and in general SMR is quite problematic
> in a non-managed, non-kernel environment (C/C++).
> The XCHG-based producer part gracefully avoids touching freed/reused
> memory. So now the problem is with the consumer part only, but note
> that consumers may only read from freed/reused memory (no writes to
> that memory). The key point of the proposed algorithm is handling
> reads from reused memory with a failing CAS, and handling reads from
> freed memory with an SEH/signal handler.
> Main characteristics of the algorithm:
> - intrusive
> - producers: 1 XCHG, wait-free
> - consumers: 1 CAS on common path, mostly lock-free (***)
> - producers and consumers do not contend with each other (unless the
> queue is empty)
> - no need for safe memory reclamation
[...]
I don't worry about the (***) part, as it's an EXTREMELY small window.
Also, if a program's threads can just up and die at (***) then, IMHO,
the program has a lot more "issues" to worry about than this one...
;^)
So, we now have intrusive/non-intrusive versions of a high-performance
MPMC queue that beat the shi% out of the Michael and Scott queue:
Very cool!
== 4 of 8 ==
Date: Thurs, Jul 30 2009 5:48 pm
From: "Dmitriy V'jukov"
On 31 Jul, 04:39, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "Dmitriy V'jukov" <dvyu...@gmail.com> wrote in message
>
> news:8487febc-e1bd-4b3b-9de0-895f825aaf77@w41g2000yqb.googlegroups.com...
>
> > Here is my multi-producer/single-consumer queue:
> >http://groups.google.ru/group/lock-free/browse_frm/thread/55df71b87ac...
> > The interesting part of the algorithm is an XCHG-based producer part.
>
> > As Chris Thomasson correctly noted, the XCHG-based producer part can
> > be combined with the well-known CAS-based consumer part in order to
> > get multi-producer/multi-consumer (MPMC) queue:
> >http://groups.google.ru/group/comp.programming.threads/browse_frm/thr...
> [...]
> > node_t* dequeue()
> > {
> > unsigned retry_count = 0;
> > retry:
> > __try
> > {
> > head_t h;
> > h.ptr_= head_.ptr_;
> > h.cnt_ = head_.cnt_;
>
> [...]
>
> I am not sure about this yet, but I think that you might need to load
> the ABA counter _before_ the head pointer. I know that this is a
> requirement for a lock-free stack when one uses a normal CAS for the
> producer side of the algorithm. Here is how such a lock-free stack can
> bite the dust if the ABA counter is loaded _after_ the head pointer:
> ___________________________________________________________________
> SC0: loop
> SC1: head = lf->top
> SC2: oc = lf->ocount
> SC3: if head == NULL
> SC4: return NULL
> SC5: next = head->next
> SC6: if DWCAS(&lf->top, <head, oc>, <next, oc + 1>)
> SC7: break
> SC8: endloop
> SC9: return head
>
> Think of the current state being top = pointerA and oc = 1. ThreadA
> loads pointerA, then gets preempted by threadB, which pops pointerA,
> thus incrementing oc from 1 to 2. ThreadA resumes, loads version 2, and
> gets to line SC5, thus reading NULL as the next pointer; it then gets
> preempted by threadB, which pushes pointerB and then pointerA onto the
> list, mutating the state to top = pointerA and oc = 2, with pointerA
> having a next node equal to pointerB. ThreadA resumes and performs a
> successful DWCAS operation when it should have failed, thus swapping
> the top of the list with NULL - and all the nodes get lost!
> ___________________________________________________________________
>
> Not sure if this applies to the DWCAS portion of the MPMC queue.
I don't see right now how this could break my algorithm. I think that
Relacy will show the truth.
--
Dmitriy V'jukov
== 5 of 8 ==
Date: Thurs, Jul 30 2009 6:07 pm
From: "Chris M. Thomasson"
"Dmitriy V'jukov" <dvyukov@gmail.com> wrote in message
news:c872ad32-6a06-4696-af81-3d907acac607@r2g2000yqm.googlegroups.com...
On 31 Jul, 04:39, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> > "Dmitriy V'jukov" <dvyu...@gmail.com> wrote in message
[...]
> > I am not sure about this yet, but I think that you might need to load
> > the ABA counter _before_ the head pointer. I know that this is a
> > requirement for a lock-free stack when one uses a normal CAS for the
> > producer side of the algorithm. Here is how such a lock-free stack can
> > bite the dust if the ABA counter is loaded _after_ the head pointer:
> > ___________________________________________________________________
[...]
> > ___________________________________________________________________
> >
> > Not sure if this applies to the DWCAS portion of the MPMC queue.
> I don't see now how this may break my algorithm.
I can't come up with anything yet. It's probably a complete false-alarm.
> I think that Relacy will show the truth.
I bet it will. I know that this is a problem for the lock-free stack.
Humm, perhaps I can see if Relacy can detect the error in the stack
algorithm. It's a very subtle race condition, but it will destroy the
stack and lose nodes. Relacy should detect this as a memory leak. One
problem: I am not sure how to model this moment in Relacy. I need to
create a special struct for a DWCAS anchor in order to be compatible
with rl::atomic. However, once I do that, rl::atomic will load the
entire anchor atomically, with no chance of a preemption between the
load of the anchor's head pointer and the subsequent load of the
anchor's ABA counter. Am I making any sense to you?
== 6 of 8 ==
Date: Thurs, Jul 30 2009 6:13 pm
From: "Dmitriy V'jukov"
On 31 Jul, 05:07, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> > I think that Relacy will show the truth.
>
> > I bet it will. I know that this is a problem for the lock-free stack.
> > Humm, perhaps I can see if Relacy can detect the error in the stack
> > algorithm. It's a very subtle race condition, but it will destroy the
> > stack and lose nodes. Relacy should detect this as a memory leak. One
> > problem: I am not sure how to model this moment in Relacy. I need to
> > create a special struct for a DWCAS anchor in order to be compatible
> > with rl::atomic. However, once I do that, rl::atomic will load the
> > entire anchor atomically, with no chance of a preemption between the
> > load of the anchor's head pointer and the subsequent load of the
> > anchor's ABA counter. Am I making any sense to you?
C++0x atomics do not support non-atomic accesses to atomic variables.
Hmmm... Well, I think you may just load the anchor twice :)
Load the whole anchor and get the pointer from it. Load the whole
anchor one more time and get the counter from it. That must give
Relacy enough room to squeeze a preemption into.
--
Dmitriy V'jukov
== 7 of 8 ==
Date: Thurs, Jul 30 2009 6:19 pm
From: "Dmitriy V'jukov"
On 31 Jul, 04:45, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> I don't worry about the (***) part, as it's an EXTREMELY small window.
> Also, if a program's threads can just up and die at (***) then, IMHO,
> the program has a lot more "issues" to worry about than this one...
Yes, spuriously dying threads are indeed problematic :)
That was just a comment on the *formal* semantics. Theoretically, it's
not a lock-free queue, because a lock-free algorithm must be safe
w.r.t. thread termination.
But from a practical POV, yes, I do not care about the (***) part
either.
>
> ;^)
>
> So, we now have intrusive/non-intrusive versions of a high-performance
> MPMC queue that beat the shi% out of the Michael and Scott queue:
>
> Very cool!
Ah, I know what Michael and Scott will say, they will say "your queue
is not lock-free". Damn!
--
Dmitriy V'jukov
== 8 of 8 ==
Date: Thurs, Jul 30 2009 8:07 pm
From: "Chris M. Thomasson"
"Dmitriy V'jukov" <dvyukov@gmail.com> wrote in message
news:413440f3-a557-468d-a623-d7adb9353148@n11g2000yqb.googlegroups.com...
On 31 Jul, 05:07, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> > > I think that Relacy will show the truth.
[...]
> > However, once I do that, rl::atomic will load the entire anchor
> > atomically, with no chance of a preemption between the load of the
> > anchor's head pointer and the subsequent load of the anchor's ABA
> > counter. Am I making any sense to you?
> C++0x atomics do not support non-atomic accesses to atomic variables.
This raises another question... Let's say that a particular C++0x
implementation on a 32-bit system has lock-free access to a 64-bit
variable. Will it have to use a 64-bit atomic load instruction on the
32-bit system to load the 64-bit anchor? For instance, on IA-32, would
it use a MOVQ instruction?
> Hmmm... Well, I think you may just load the anchor twice :)
> Load the whole anchor and get the pointer from it. Load the whole
> anchor one more time and get the counter from it. That must give
> Relacy enough room to squeeze a preemption into.
Load the anchor twice; something like:

void pop() {
    for (;;) {
        stack_anchor real;
        stack_anchor a = m_anchor($).load(rl::memory_order_relaxed);
        real.m_head = a.m_head;
        // perhaps add an explicit yield?
        stack_anchor b = m_anchor($).load(rl::memory_order_relaxed);
        real.m_aba = b.m_aba;
        // blah; perform CAS with `real' variable...
    }
}
well, that should do it.
:^)
Thanks Dmitriy.
==============================================================================
You received this message because you are subscribed to the Google Groups "comp.programming.threads"
group.
To post to this group, visit http://groups.google.com/group/comp.programming.threads?hl=en
To unsubscribe from this group, send email to comp.programming.threads+unsubscribe@googlegroups.com
To change the way you get mail from this group, visit:
http://groups.google.com/group/comp.programming.threads/subscribe?hl=en
To report abuse, send email explaining the problem to abuse@googlegroups.com
==============================================================================
Google Groups: http://groups.google.com/?hl=en