http://groups.google.com/group/comp.programming.threads?hl=en
comp.programming.threads@googlegroups.com
Today's topics:
* amortizing read-access in read/write mutex... - 4 messages, 3 authors
http://groups.google.com/group/comp.programming.threads/t/7b92af0a72a360ba?hl=en
* strong atomic reference counting... - 11 messages, 3 authors
http://groups.google.com/group/comp.programming.threads/t/1b35d9cff58efb83?hl=en
* using proxy garbage collector in C# / Java... - 1 message, 1 author
http://groups.google.com/group/comp.programming.threads/t/8c5014a275907b0e?hl=en
==============================================================================
TOPIC: amortizing read-access in read/write mutex...
http://groups.google.com/group/comp.programming.threads/t/7b92af0a72a360ba?hl=en
==============================================================================
== 1 of 4 ==
Date: Thu, Feb 4 2010 9:14 pm
From: David Schwartz
I added an implementation of my Linux-specific futex-based reader/
writer lock. Fast paths are fully optimized. Reports of any problems
you notice would be helpful.
Here's the implementation. We use a few non-standard primitives:
InterlockedGet(&x) is an atomic read, roughly *(volatile int *)&x.
InterlockedSet(&x, y) is an atomic write, roughly *(volatile int *)&x = y
(it appears later in TryWriteLock and WriteUnlock).
InterlockedInc and InterlockedDec are atomic increments and decrements
that don't return anything.
InterlockedIncrement and InterlockedDecrement are atomic increments
and decrements that return zero if and only if the resulting value is
zero.
InterlockedExchange(&x, y) is an atomic *x=y that returns the original
value of *x.
Also, if you're not familiar with the futex operations, we use only
two: FUTEX_WAIT and FUTEX_WAKE. For example:
sys_futex(&writers, FUTEX_WAIT, j, NULL);
This says to atomically check that writers still equals j and, if so,
block on the writers futex; if writers no longer equals j, the call
returns immediately.
sys_futex(&areaders, FUTEX_WAKE, 1, NULL);
This says to wake 1 thread that is blocked on the areaders futex.
Any integer can become a futex simply by a thread blocking on it.
Here it is:
// Copyright David J. Schwartz <davids@webmaster.com>, 2006-2010
// All rights reserved
// Offered without any warranty or claim of fitness for any purpose
class RWFutex
{
protected:
    int areaders; // number of active readers (active writer blocks for zero)
    int pad1;     // pads give each hot variable its own cache line
    int wreaders; // number of waiting readers
    int pad2;
    int writers;  // count of active/waiting writers (readers block for zero)
    int pad3;
    int writer;   // lock the active writer holds (other writers block for zero)
    int pad4;
private:
    RWFutex(const RWFutex &); // no implementation
    RWFutex &operator=(const RWFutex &); // no implementation
public:
    RWFutex(void) : areaders(0), wreaders(0), writers(0), writer(0) { ; }

    void ReadLock(void)
    {
        int j;
        while(1)
        {
            if(InterlockedGet(&writers)!=0)
            { // wait for no writers
                InterlockedInc(&wreaders);
                while((j=InterlockedGet(&writers))!=0)
                    sys_futex(&writers, FUTEX_WAIT, j, NULL);
                InterlockedDec(&wreaders);
            }
            InterlockedInc(&areaders); // add us as a reader
            if(InterlockedGet(&writer)==0) // did we race with a writer?
                return; // no
            if(InterlockedDecrement(&areaders)==0) // we were the only attempting reader
                if(InterlockedGet(&writers)!=0) // there is at least one waiting writer
                    sys_futex(&areaders, FUTEX_WAKE, 1, NULL);
        }
    }

    bool TryReadLock(void)
    {
        if(InterlockedGet(&writers)!=0) return false; // active/waiting writers
        InterlockedInc(&areaders);
        if(InterlockedGet(&writer)==0) // did we race?
            return true;
        if(InterlockedDecrement(&areaders)==0) // we were the only attempting reader
            sys_futex(&areaders, FUTEX_WAKE, 1, NULL);
        return false;
    }

    void ReadUnlock(void)
    {
        if(InterlockedDecrement(&areaders)!=0) // one less reader
            return; // still other readers
        if(InterlockedGet(&writers)!=0) // a writer awaits
            sys_futex(&areaders, FUTEX_WAKE, 1, NULL); // no readers now
    }

    bool CondReadUnlock(void)
    { // unlock read if there's a waiting writer; return when the writer is done
        if(InterlockedGet(&writers)==0) return false;
        ReadUnlock();
        ReadLock();
        return true;
    }

    void WriteLock(void)
    {
        int j;
        InterlockedInc(&writers);
        while(InterlockedExchange(&writer, 1)!=0) // wait out other writers
            sys_futex(&writer, FUTEX_WAIT, 1, NULL); // optimizable slow path
        // we are the only writer
        while((j=InterlockedGet(&areaders))!=0) // wait for no readers
            sys_futex(&areaders, FUTEX_WAIT, j, NULL);
    }

    bool TryWriteLock(void)
    {
        if( (InterlockedGet(&writers)!=0) || (InterlockedGet(&areaders)!=0) )
            return false;
        InterlockedInc(&writers); // add us as a writer
        if(InterlockedExchange(&writer, 1)!=0) // can we grab the lock?
        { // no, did we block anyone?
            if(InterlockedDecrement(&writers)!=0) // waiting writer
                sys_futex(&writer, FUTEX_WAKE, 1, NULL); // wake him
            else if(InterlockedGet(&wreaders)!=0) // waiting reader(s)
                sys_futex(&writers, FUTEX_WAKE, INT_MAX, NULL); // wake them
            return false;
        }
        if(InterlockedGet(&areaders)==0) // if no readers, we're done
            return true;
        InterlockedSet(&writer, 0); // we raced with a reader, release write lock
        if(InterlockedDecrement(&writers)!=0) // waiting writer
            sys_futex(&writer, FUTEX_WAKE, 1, NULL); // wake him
        else if(InterlockedGet(&wreaders)!=0) // waiting reader(s)
            sys_futex(&writers, FUTEX_WAKE, INT_MAX, NULL); // wake them
        return false;
    }

    void WriteUnlock(void)
    {
        __asm__ __volatile__("" : : : "memory");
        InterlockedSet(&writer, 0); // no active writer, no active readers
        if(InterlockedDecrement(&writers)!=0) // waiting writer
            sys_futex(&writer, FUTEX_WAKE, 1, NULL); // wake writer
        else if(InterlockedGet(&wreaders)!=0) // waiting reader(s)
            sys_futex(&writers, FUTEX_WAKE, INT_MAX, NULL); // wake readers
    }
};
Comments appreciated. I've had this code percolating for many years
but never actually used it in anything deployed.
DS
== 2 of 4 ==
Date: Fri, Feb 5 2010 7:41 am
From: "Chris M. Thomasson"
"David Schwartz" <davids@webmaster.com> wrote in message
news:2799851d-8673-4345-b2dd-78182f4aea84@f17g2000prh.googlegroups.com...
>I added an implementation of my Linux-specific futex-based reader/
> writer lock. Fast paths are fully optimized. Reports of any problems
> you notice would be helpful.
>
> Here's the implementation. We use a few non-standard primitives:
[...]
> Comments appreciated.
If I understand this correctly, writers can starve readers right? I cannot
say if this is a problem or not. Well, if you have enough write activity to
starve readers then you should probably not be using a read/write mutex
anyway.
;^)
> I've had this code percolating for many years
> but never actually used it in anything deployed.
I cannot seem to find anything wrong with it right now. Humm, It would be
neat if you could directly model futex based algorithms in Relacy.
== 3 of 4 ==
Date: Fri, Feb 5 2010 9:06 am
From: David Schwartz
On Feb 5, 7:41 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> If I understand this correctly, writers can starve readers right?
I don't think that's a meaningful statement. It does have absolute
writer priority, but:
1) If writers aren't rare, what's the point in using a reader/writer
lock? Writers must be much less common than readers or you're using
the wrong primitive.
2) So what? The only guarantee is that you get forward progress. If
the writer isn't going to make useful forward progress, why did it try
to acquire the lock anyway?
> I cannot
> say if this is a problem or not.
If it is, the problem is the choice of a reader/writer lock, IMO.
> Well, if you have enough write activity to
> starve readers then you should probably not be using a read/write mutex
> anyway.
Bingo.
> > I've had this code percolating for many years
> > but never actually used it in anything deployed.
> I cannot seem to find anything wrong with it right now. Humm, It would be
> neat if you could directly model futex based algorithms in Relacy.
The Linux futex abilities get rather complex. But I just use the
simplest one, a combination of an atomic 'test and block' and a
'wake'.
DS
== 4 of 4 ==
Date: Fri, Feb 5 2010 9:14 am
From: Dmitriy Vyukov
On Feb 5, 8:06 pm, David Schwartz <dav...@webmaster.com> wrote:
> > > I've had this code percolating for many years
> > > but never actually used it in anything deployed.
> > I cannot seem to find anything wrong with it right now. Humm, It would be
> > neat if you could directly model futex based algorithms in Relacy.
>
> The Linux futex abilities get rather complex. But I just use the
> simplest one, a combination of an atomic 'test and block' and a
> 'wake'.
I added futex emulation to the todo list:
http://groups.google.com/group/relacy/web/todo-feature-list
But keep in mind that it's not that hard to add something like a futex
to Relacy manually. Ironically, you do not even need to be experienced
with multithreading to do that, because Relacy uses cooperative fibers
for simulation. There is already a waitset and a primitive to
synchronize memory between threads, so the implementation would be
quite straightforward.
--
Dmitriy V'jukov
==============================================================================
TOPIC: strong atomic reference counting...
http://groups.google.com/group/comp.programming.threads/t/1b35d9cff58efb83?hl=en
==============================================================================
== 1 of 11 ==
Date: Fri, Feb 5 2010 4:41 am
From: Dmitriy Vyukov
On Feb 5, 2:04 am, Marcel Müller <news.5.ma...@spamgourmet.com> wrote:
> > BTW, have you ever played around with Dmitriy Vyukov's Relacy Race
> > Detector?
>
> I had a look at it when I developed the above algo. But I stopped at the
> point when I realized that porting to my platform is required. On
> eComStation I am restricted to gcc 3.3 (no std::atomic) and have no
> posix compatibility.
Do I understand you correctly that you do not have access to even a
single Windows/Linux box?
--
Dmitriy V'jukov
== 2 of 11 ==
Date: Fri, Feb 5 2010 7:00 am
From: "Chris M. Thomasson"
"Marcel Müller" <news.5.maazl@spamgourmet.com> wrote in message
news:4b6b529b$0$6562$9b4e6d93@newsspool4.arcor-online.net...
> Chris M. Thomasson wrote:
>> "Marcel Müller" <news.5.maazl@spamgourmet.com> wrote in message
>> news:4b69ad47$0$7626$9b4e6d93@newsspool1.arcor-online.net...
>>> Chris M. Thomasson wrote:
>>>> IMVHO, it's nice to be able to remove the CAS-loop from in the function
>>>> that acquires strong references. Anyway, what do you think of the
>>>> technique?
>>>
>>> Once I have finished moving house, I will have a look at this.
>>>
>>> But I currently already have an implementation without a CAS loop and
>>> with only one machine size word memory footprint. (The stolen bits
>>> trick, but without the restriction in the number of active references.)
>>> Unfortunately not very well tested so far.
>>
>> Now that's pretty interesting because the "stolen bits" are typically
>> used to actually represent the reference count itself. Therefore, I would
>> very much like to examine your algorithm Marcel.
>
> The primary reference counter is intrusive in my case. (Implemented by
> deriving from a base class.) However, that does not make any difference.
> But the additional counter for strong thread safety works with stolen
> bits. The basic idea is that this counter value is moved to the primary
> reference counter immediately after the local reference is acquired from
> the shared pointer. This creates a frequency distribution of values that
> drops exponentially with increasing value. In fact I never was able to
> increment the counter with the stolen bits above 2 in test applications
> with some dozen threads.
Could you please provide just some very brief pseudo-code? It sounds like
bad things could happen if the counter on the shared pointer overflows;
right? Is this what you were getting at by saying that you could not
increment the counter with the stolen bits above 2?
Humm... Well, it kind of sounds like you are doing something like:
_________________________________________________
// increment outer count.
uword32 outer = ATOMIC_FAA(&g_outer, 1) + 1;
uword32 count = GET_COUNT(outer);
object* obj = GET_OBJECT(outer);
// transfer outer to inner count.
ATOMIC_FAA(&obj->count, count);
// try and reset the outer count.
ATOMIC_CAS(&g_outer, outer, (uword32)obj);
_________________________________________________
I am most likely totally misunderstanding your algorithm and the pseudo-code
above is just way off target. What am I missing here Marcel? If I am close,
then I can definitely see the shi% hitting the fan if the outer count just
happens to overflow.
> Even this is non-trivial. Since most platforms provide at least 32 bit
> alignment, the two superfluous bits are usually sufficient and no special
> allocator is required.
>> BTW, have you ever played around with Dmitriy Vyukov's Relacy Race
>> Detector?
>
> I had a look at it when I developed the above algo. But I stopped at the
> point when I realized that porting to my platform is required. On
> eComStation I am restricted to gcc 3.3 (no std::atomic) and have no posix
> compatibility.
If I understand you correctly, then in this case you would need to try and
port your algorithm over to Relacy. IMVHO, it would be a worthwhile use of
your time. However, keep in mind that Relacy will blast the shi% out of the
algorithm and probably get it into a scenario in which the counter
overflows.
== 3 of 11 ==
Date: Fri, Feb 5 2010 8:07 am
From: Dmitriy Vyukov
On Feb 5, 6:00 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> Could you please provide just some very brief pseudo-code? It sounds like
> bad things could happen if the counter on the shared pointer overflows;
> right? Is this what you were getting at by saying that you could not
> increment the counter with the stolen bits above 2?
I suspect it's this:
http://groups.google.com/group/comp.programming.threads/msg/ce852a32d7994a06
--
Dmitriy V'jukov
== 4 of 11 ==
Date: Fri, Feb 5 2010 9:06 am
From: Marcel Müller
Dmitriy Vyukov wrote:
>> I had a look at it when I developed the above algo. But I stopped at the
>> point when I realized that porting to my platform is required. On
>> eComStation I am restricted to gcc 3.3 (no std::atomic) and have no
>> posix compatibility.
>
> Do I understand you correctly that you do not have access to even a
> single Windows/Linux box?
Yes. A Windows notebook without any development environment, and my
Linux server: Debian Lenny, but only 300 MHz, ssh only, and no X.
Neither is a serious option.
However, I am planning a new Linux VirtualBox server late this year.
There I can install VMs which match the requirements.
Marcel
== 5 of 11 ==
Date: Fri, Feb 5 2010 9:23 am
From: Marcel Müller
Chris M. Thomasson wrote:
> Could you please provide just some very brief pseudo-code?
Not immediately. The eCS machine is still down.
> It sounds
> like bad things could happen if the counter on the shared pointer
> overflows; right?
Yes. Very bad. The application will terminate.
> Is this what you were getting at by saying that you
> could not increment the counter with the stolen bits above 2?
No. The workaround for the above problem worked. Although in theory it
only decreases the demand from the number of acquired local copies to
the number of threads, in practice the probability for a crash is
/extremely/ low. I was definitely not able to force a crash with some
dozen threads.
> Humm... Well, it kind of sounds like you are doing something like:
> _________________________________________________
> // increment outer count.
> uword32 outer = ATOMIC_FAA(&g_outer, 1) + 1;
>
> uword32 count = GET_COUNT(outer);
>
> object* obj = GET_OBJECT(outer);
>
>
> // transfer outer to inner count.
> ATOMIC_FAA(&obj->count, count);
>
> // try and reset the outer count.
> ATOMIC_CAS(&g_outer, outer, (uword32)obj);
> _________________________________________________
I do not exactly remember the code, but basically this is true. The only
difference I remember is that I have to undo the FAA on obj->count if
the final CAS fails, because in that case another thread has already
transferred the outer count.
> I am most likely totally misunderstanding your algorithm and the
> pseudo-code above is just way off target. What am I missing here Marcel?
> If I am close, then I can definitely see the shi% hitting the fan if the
> outer count just happens to overflow.
Absolutely. But the CAS on the outer count is only a few assembler
instructions away from the FAA. It is nearly impossible to get more than
two threads to exactly this location at the same time as long as there
is no blocking instruction in between. In fact, I was not even able to
reach a counter value of 2 without a synchronized debug output in
between.
>> Even this is non-trivial. Since most platforms provide at least 32 bit
>> alignment, the two superfluous bits are usually sufficient and no
>> special allocator is required.
>
>>> BTW, have you ever played around with Dmitriy Vyukov's Relacy Race
>>> Detector?
>>
>> I had a look at it when I developed the above algo. But I stopped at
>> the point when I realized that porting to my platform is required. On
>> eComStation I am restricted to gcc 3.3 (no std::atomic) and have no
>> posix compatibility.
>
> If I understand you correctly, then in this case you would need to try
> and port your algorithm over to Relacy. IMVHO, it would be a worthwhile
> use of your time. However, keep in mind that Relacy will blast the shi%
> out of the algorithm and probably get it into a scenario in which the
> counter overflows.
Yes. If n threads try to acquire a pointer from the same shared location,
in theory the counter could reach n. In practice, the probability of
reaching n is approximately the probability of one thread being in these
code lines, raised to the power of n. This decays extremely fast.
Of course, if you run several hundred threads in parallel the
probability increases. But this is no typical application case, and most
operating systems (or administrators) will block this kind of behavior.
In fact, on 32 bit platforms the counter can safely grow to 3. On a 64
bit platform it can typically grow to 7. The latter is really very
hard to reach. I think most real world applications have much more
critical errors.
In fact, I prefer to know that the code has this flaw and that the risk
is admissible, rather than not to know about the quality at all, which
is quite often the case.
And if you don't trust that, larger numbers are also supported (a
compile-time option). But this causes additional overhead for memory
alignment, especially for small objects.
The base class that implements the inner count automatically overloads
the new and delete operators in this case. Of course, you must not
allocate arrays of your type. But this applies to reference counted
objects anyway.
Amongst others I use the above pointer in a thread safe string class
with shared storage. Volatile instances provide strong thread safety.
The string has a custom allocator that stores the inner count and the
length at negative offsets, so the conversion of local (non-volatile)
strings to const char* is a no-op. Even the conversion from string[] to
const char*[] is a no-op. This makes wrappers to C libraries very
efficient. The direct access to shared instances other than copy
construction or assignment (lhs & rhs) is rejected by the compiler.
In this string implementation with stolen bits it is very essential that
the maximum outer count does not scale with the number of local copies
of the shared object. This can grow very large, since a single thread
may create almost any number of them over time. It only scales with the
number of threads. And even this only in the worst case. Typically it is
significantly less than the number of threads.
Back to the topic.
I have no CAS loop in my implementation, but I needed some workarounds
for things that can happen to the inner count in certain cases. At this
point your increment-by-two trick may help. Actually, the implementation
may currently be buggy with respect to the inner count.
Marcel
== 6 of 11 ==
Date: Wed, Feb 3 2010 5:52 am
From: "Chris M. Thomasson"
"Chris M. Thomasson" <no@spam.invalid> wrote in message
news:L0g9n.1815$5n.101@newsfe23.iad...
> As you may already know, you can implement strongly thread-safe atomic
> reference counting using a PDR algorithm. The basic overall pattern is as
> follows:
> ____________________________________________________________________
[...]
> ____________________________________________________________________
>
>
>
>
> I wanted to try and eliminate that nasty CAS-loop in `acquire_strong()'
> which implements the "increment if not zero" logic. So, here is what I
> came up with:
> ____________________________________________________________________
[...]
> ____________________________________________________________________
>
> Instead of forcing the `acquire_strong()' function to never even try to
> increment a reference count that is zero, I simply inc/dec the count by 2,
> and when that goes to zero I _attempt_ to set it to 1. Therefore, the
> `acquire_strong()' function can "easily" detect reference counted objects
> that have already been deferred by checking to see if the count has the
> 0x1 bit set. I think I will model this in Relacy 2.2 using the following
> PDR implementation:
>
>
> http://cpt.pastebin.com/f71480694
>
>
> So far, I think that algorithm should work out fine.
Here is a model of my experimental reference counting algorithm in Relacy:
http://cpt.pastebin.com/f63893cd9
IMVHO, it's nice to be able to remove the CAS-loop from in the function that
acquires strong references. Anyway, what do you think of the technique?
Thanks!
:^)
==============================================================================
TOPIC: using proxy garbage collector in C# / Java...
http://groups.google.com/group/comp.programming.threads/t/8c5014a275907b0e?hl=en
==============================================================================
== 1 of 1 ==
Date: Fri, Feb 5 2010 9:55 am
From: Dmitriy Vyukov
On Feb 3, 7:15 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> I know this sounds a bit odd, but I think there could be some fairly
> beneficial usage cases for a proxy garbage collector algorithm in garbage
> collected languages such as C# and Java. For instance, imagine trying to
> implement a lock-free stack algorithm in C#. Okay, it's pretty easy because
> the "native" GC solves the memory reclamation issue and ABA problem.
> However, why put unneeded pressure on the native GC? Who knows how long the
> GC is going to take to actually let the program reuse nodes, as it's a
> completely non-deterministic process. Unfortunately, this behavior can end up
> creating a sh%tload of nodes under periods of load and completely swamp the
> GC which simply cannot be a good thing in any way, shape or form. I am
> thinking that sending the nodes through a proxy garbage collector possibly
> could significantly help things out. One reason is that the overall
> algorithm I have in mind:
>
> http://groups.google.com/group/comp.programming.threads/browse_frm/th...
>
> is deterministic in the sense that objects can be reclaimed as soon as the
> last thread releases a reference to it's previously acquired proxy collector
> object. This means that nodes can most likely be reused much sooner than if
> you let the "native" GC handle everything. That has to be a good thing
> because it can help reduce the total number of nodes created thus taking
> pressure off of the native GC.
>
> Is this making sense to everybody? I would really enjoy hearing your
> thoughts on the issue!
As far as I remember, Joe Seigh mentioned here some time ago that he
was developing a web back-end server in Java and that he used his own
PDR. However, I suspect that he did it partially just because he
could :)
I think that in some situations a proxy collector can be beneficial in
Java. For example, assume that prompt proxy collection cuts the working
set of a program so that it now fits into cache.
Other PDR techniques can be of help too. Consider a very light-weight
batching PDR, with which one can reclaim N objects by just aggregating
per-thread epoch counters (N can be set to any value).
Why is it not widespread practice?... well, probably just because some
people work with Java but are unable to develop their own PDR, and
other people are able to develop their own PDR but do not want to work
with Java... just a joke :)
Seriously, I think that it's an interesting research topic: apply
various PDR techniques to various Java/C# programs and see the effect
on performance and memory consumption.
--
Dmitriy V'jukov
==============================================================================
You received this message because you are subscribed to the Google Groups "comp.programming.threads"
group.
To post to this group, visit http://groups.google.com/group/comp.programming.threads?hl=en
To unsubscribe from this group, send email to comp.programming.threads+unsubscribe@googlegroups.com
To change the way you get mail from this group, visit:
http://groups.google.com/group/comp.programming.threads/subscribe?hl=en
To report abuse, send email explaining the problem to abuse@googlegroups.com
==============================================================================
Google Groups: http://groups.google.com/?hl=en