http://groups.google.com/group/comp.programming.threads?hl=en
comp.programming.threads@googlegroups.com
Today's topics:
* AMD Advanced Synchronization Facility: HTM - 24 messages, 5 authors
http://groups.google.com/group/comp.programming.threads/t/c1c6c6327aed79b6?hl=en
* Deitel & Deitel's How to Program C++ 6th Edition chapters 1-22 full code
solutions for $50 - 1 message, 1 author
http://groups.google.com/group/comp.programming.threads/t/4621a963e96dd42b?hl=en
==============================================================================
TOPIC: AMD Advanced Synchronization Facility: HTM
http://groups.google.com/group/comp.programming.threads/t/c1c6c6327aed79b6?hl=en
==============================================================================
== 1 of 24 ==
Date: Thurs, Apr 30 2009 7:38 pm
From: Mayan Moudgill
MitchAlsup wrote:
> On Apr 30, 8:28 am, Mayan Moudgill <ma...@bestweb.net> wrote:
>
>>- while a cache-line is locked, any requests to that cache line get deferred
>
>
> This leads to unsolvable deadlock/livelock issues--depending on the
> protocol on the interconnect(s). And created massive headaches for
> interrupts, and more than a few issues for exceptions.
1) The abort-on-probe in ASF can lead to live-lock.
2) If the interconnect does NOT allow deferred access to cache-lines,
then the ASF proposal does not work either; the ASF has to defer access
whilst committing the locked regions to the cache.
3) The mechanism described specifically allows interrupts and exceptions
to be taken (and reservations to fail)
4) The mechanism described allows the h/w to implement any failure
policy, including abort-on-probe.
>
> This requires a search (directed or not) through the cache to modify
> the states after successful conclusion. See below.
Not really; what it requires is easy to do _IF_ you've implemented your
cache state using flops instead of SRAMs. In that case, there will be
an extra three control lines to each cell (toL, LtoI, LtoE) and probably
4-6 gates. The area and speed cost is insignificant (the high fanout of
the LtoI and LtoE controls is compensated for by the fact that there is
no address decode; it goes to ALL cells).
If you've implemented the states as SRAMs, then depending on whether
they are x2 SRAMs (just MESI), xN SRAMs (MESI+LRU state, N dependent on
the LRU algorithm), or some other variant (e.g. M for sub-cache-line),
there may be a larger area penalty, but it is still insignificant.
>
>> Another negative is that to guarantee 4 lockable cache-lines,
>>one must have at least 5-way associative caches, or a regular cache
>>paired with a smaller but higher associativity cache (e.g. victim cache).
>
>
> ASF does this whole thing in the memory access (miss) buffers** and
> can do the whole visibility thing in one clock. Nor does ASF have to do
> anything other than make it all visible (flushes cache state
> changes,...). This makes it a lot easier on the HW. And is not (sic)
> associated with the associativity of the cache in any way shape or
> form. In fact, ASF works even if there is no cache in the whole
> system.
>
> (**) this is where the limitation on the number of protected lines
> comes from. Even a processor without a cache will have miss buffers.
Sorry, there is no requirement for miss-buffers at the L1. Consider a
write-through L1 with an allocate-on-load policy (i.e. store misses do
not cause L1 line fetches), backed by an on-chip L2. Every load miss is
sent to the L2 along with the necessary control information needed to
complete the load (e.g. identity of the load causing the miss or the LQ
entry). Every cache refill is sent to the L1 by the L2 along with the
associated control information. Any load miss combining is done at the L2.
BTW: you still have two copies of the state: old and
updated-but-not-committed. Unless your miss-buffer is different than how
I understand it, it only holds the identity of outstanding misses. So,
you have to have some structure to hold the copies. Is this the
write-buffers, or some other buffers? And on a successful commit or
invalidate, the cache and this buffer have to be reconciled.
> ----------------------------------------------
>
> Back when I was working on this, we were intending to support 8
> protected lines (with 7 as a back up). There are useful atomic
> primitives that use 5 protected lines (moving a concurrent data
> structure (CDS) element from one place in a CDS to another in a single
> atomic event.)
Did you talk to any processor designers when you did this?
== 2 of 24 ==
Date: Fri, May 1 2009 12:13 am
From: Terje Mathisen <"terje.mathisen at tmsw.no">
Mayan Moudgill wrote:
> MitchAlsup wrote:
>> Back when I was working on this, we were intending to support 8
>> protected lines (with 7 as a back up). There are useful atomic
>> primitives that use 5 protected lines (moving a concurrent data
>> structure (CDS) element from one place in a CDS to another in a single
>> atomic event.)
>
> Did you talk to any processor designers when you did this?
Are you serious?
From LinkedIn:
Mitch Alsup's Experience
*
Architect
AMD
(Public Company; Mechanical or Industrial Engineering industry)
1999 — 2007 (8 years)
*
Chief Architect
Ross Technology
(Public Company; Mechanical or Industrial Engineering industry)
1991 — 1998 (7 years)
*
Architect
Motorola SPS
(Mechanical or Industrial Engineering industry)
1983 — 1992 (9 years)
*
Processor Architect
Motorola
(Public Company; Mechanical or Industrial Engineering industry)
1983 — 1991 (8 years)
I sort of have the feeling Mitch has spent way more time than most, even
here at c.arch, talking to processor designers. :-)
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
== 3 of 24 ==
Date: Fri, May 1 2009 4:20 am
From: Mayan Moudgill
Terje Mathisen wrote:
> Mayan Moudgill wrote:
>
>> MitchAlsup wrote:
>>
>>> Back when I was working on this, we were intending to support 8
>>> protected lines (with 7 as a back up). There are useful atomic
>>> primitives that use 5 protected lines (moving a concurrent data
>>> structure (CDS) element from one place in a CDS to another in a single
>>> atomic event.)
>>
>>
>> Did you talk to any processor designers when you did this?
>
>
> Are you serious?
>
> From LinkedIn: [Mitch Alsup's experience listing snipped]
>
> I sort of have the feeling Mitch has spent way more time than most, even
> here at c.arch, talking to processor designers. :-)
>
> Terje
>
My *sincere* apologies for doubting his abilities. Sorry.
In my defence, I will say that a lot of the objections looked very
academic/tunnel-visioned. In particular, the implication that one needed
a state machine to update the cache tag bits or that a miss buffer is
absolutely required pressed some buttons.
Again, I apologize for doubting Mitch Alsup's credentials. However, I
still think that he is suffering from a little bit of tunnel-vision.
== 4 of 24 ==
Date: Fri, May 1 2009 7:02 am
From: EricP
MitchAlsup wrote:
>
> So, it's nice to have it see the light of day. It is shameful to have
> lost so much power and expressivity along the way. It is truly sad
> that ASF lost the ability to attack the BigO(n**2) problem along the
> way.
Hmmmm... actually now I'm thinking they may have
screwed the pooch with their changes to the ASF design.
1) Because the new design resolves contention by aborting the owner,
it means even an update to a single cache line can livelock.
So even a simple DCAS has to deal with the possibility.
The NAK could fix that but is insufficient for multiple lines.
2) There is no guaranteed forward progress, so every usage
must deal with livelock, and ultimately they make
a system call to block the thread to guarantee progress
(after trying various tricks to avoid doing so).
This makes every usage more complex and more expensive.
3) The new design does not collect a set of protected lines
and verify them first, but rather does it laissez-faire,
so there is no formal point at which a go/no-go decision is made,
other than the final commit.
So to upgrade the design and ensure there is forward progress
requires making Commit do what Acquire did before.
(and there can be conflicting transactions with cpu 1 owning
line A and wanting B, and cpu 2 owning B and wanting A,
except with 4 or more cache lines and/or cpus.
To guarantee a winner you must know the whole update set).
But moving update-set dispute resolution to COMMIT means you must
change LOCK MOVx (store) instructions to not actually transfer line
ownership until the commit, but that breaks A-B-A detection, which
relies on ownership transfer to do the detecting.
So it may be that the livelock issue is unfixable,
or the fix vastly more complex.
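(For concreteness, here is roughly what a DCAS looks like when built on
ASF-style primitives. asf_speculate / asf_lock_load / asf_lock_store /
asf_commit are made-up intrinsic names, not the spec's syntax; the point
is only that any attempt can abort, so the caller always has to bring its
own retry/backoff policy:)

#include <stdbool.h>
#include <stdint.h>

/* Made-up intrinsics standing in for SPECULATE / LOCK MOV / COMMIT.
 * asf_speculate() returns 0 when entering the speculative region and
 * nonzero when execution resumes there after a hardware abort. */
extern int      asf_speculate(void);
extern uint64_t asf_lock_load(volatile uint64_t *p);
extern void     asf_lock_store(volatile uint64_t *p, uint64_t v);
extern void     asf_commit(void);

/* One attempt at a double-compare-and-swap.  Returns true on success,
 * false if either value mismatched or the hardware aborted the attempt. */
static bool dcas_try(volatile uint64_t *a, uint64_t cmp_a, uint64_t new_a,
                     volatile uint64_t *b, uint64_t cmp_b, uint64_t new_b)
{
    if (asf_speculate() != 0)
        return false;                  /* aborted: caller decides what next */
    if (asf_lock_load(a) != cmp_a || asf_lock_load(b) != cmp_b) {
        asf_commit();                  /* values changed: a benign failure  */
        return false;
    }
    asf_lock_store(a, new_a);
    asf_lock_store(b, new_b);
    asf_commit();                      /* both stores become visible at once */
    return true;
}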
Eric
== 5 of 24 ==
Date: Fri, May 1 2009 7:54 am
From: Dmitriy Vyukov
On May 1, 7:02 am, EricP <ThatWouldBeTell...@thevillage.com> wrote:
> So it may be that livelock issue is unfixable,
> or vastly more complex.
I also think that 100% live-lock prevention, even if it is possible, is
not feasible to do in hardware (I'm not a hardware guy!). And there is
another point: the ultimate goal must be not only the absence of
live-locks but also reasonable performance (a low abort rate). What use
is a hardware implementation which guarantees the absence of live-locks
but still delivers poor performance? It has to be fixed by software
anyway (the same back-offs, etc). So personally I am OK with the
specification being somewhat amenable to live-locks; in any case I don't
believe that a heavily update-contended data structure can have
reasonable performance and scalability.
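(For example, the usual software-side answer is a randomized exponential
back-off around each attempt; try_transaction() below is just a
placeholder for one SPECULATE...COMMIT attempt, not a real API:)

#include <sched.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder for one hardware transaction attempt (SPECULATE...COMMIT). */
extern bool try_transaction(void *ctx);

/* Randomized exponential back-off: colliding threads spread out in time
 * instead of re-colliding in lockstep, which is the live-lock pattern. */
bool run_with_backoff(void *ctx, int max_attempts)
{
    unsigned seed  = (unsigned)(uintptr_t)ctx | 1u;  /* cheap PRNG seed */
    unsigned limit = 1;
    for (int i = 0; i < max_attempts; i++) {
        if (try_transaction(ctx))
            return true;
        seed = seed * 1103515245u + 12345u;
        for (unsigned s = seed % limit; s > 0; s--)
            sched_yield();                           /* or a PAUSE-style spin */
        if (limit < 1024)
            limit *= 2;
    }
    return false;   /* give up: fall back to a lock or an OS wait */
}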
However, as for the formal specification, they should probably at least
allow NACKs. Future implementations of ASF will include more and more
intelligent logic for live-lock prevention (and prevention of
unnecessary aborts in general).
--
Dmitriy V'jukov
== 6 of 24 ==
Date: Fri, May 1 2009 8:04 am
From: Dmitriy Vyukov
On Apr 29, 1:05 pm, MitchAlsup <MitchAl...@aol.com> wrote:
> Well, hooray, my invention has finally seen the light of day.
I wanted to ask some questions on the AMD forum since they are asking
for feedback, but since Mitch is here I will ask here.
Table 6.2.1 postulates that if CPU B holds a cache line in the Protected
Owned state, and CPU A makes a plain (non-transactional) read, then CPU
B aborts.
Why is it necessary to abort CPU B? I can imagine an implementation
which would satisfy the read from CPU A with the old value (the one
current before CPU B started its transaction) and not abort CPU B. I.e.,
to software it would look like the read from CPU A simply
happened-before the transaction on CPU B. At first glance this does not
conflict with transactional semantics, while allowing more parallelism
(fewer aborts). Such an implementation would use a separate buffer for
transactional writes (not the cache).
--
Dmitriy V'jukov
== 7 of 24 ==
Date: Fri, May 1 2009 8:13 am
From: Dmitriy Vyukov
On Apr 29, 1:05 pm, MitchAlsup <MitchAl...@aol.com> wrote:
> Well, hooray, my invention has finally seen the light of day.
Another question:
Why is it necessary to use LOCK MOV? Why can't all reads and writes
inside the transactional region be treated as transactional? This is the
approach taken by Sun's HTM in Rock. On one hand AMD's approach is
more flexible (I may choose not to include some reads/writes in the
transaction); on the other hand, with Sun's approach I may do the
following thing.
Assume I have a large mutex-based code base:
LOCK(x);
some_c_or_cpp_code_here;
UNLOCK(x);
I may replace it with:
TRANSACTIONAL_LOCK(x);
some_c_or_cpp_code_here;
TRANSACTIONAL_UNLOCK(x);
where TRANSACTIONAL_LOCK/TRANSACTIONAL_UNLOCK first tries to use a
transaction to execute the code, then after several attempts falls back
to the mutex (so-called transactional lock elision).
Sun's work shows that transactional lock elision is extremely easy to
apply to a large code base and does give some substantial performance
improvements.
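(For illustration, a minimal sketch of such a wrapper in C.
asf_speculate/asf_commit are hypothetical intrinsic names, and the sketch
assumes Rock-style implicit semantics where everything inside the
speculative region is transactional, which is exactly what ASF's
explicit LOCK MOV does not give you:)

#include <pthread.h>
#include <stdbool.h>

/* Hypothetical intrinsics: asf_speculate() returns 0 on entry to the
 * speculative region, nonzero when control resumes there after an abort. */
extern int  asf_speculate(void);
extern void asf_commit(void);

#define MAX_TX_ATTEMPTS 3

typedef struct {
    pthread_mutex_t mtx;
    volatile int    held;      /* set only on the fallback (mutex) path */
} elided_lock_t;

/* Returns true if the region runs as a transaction, false if we fell
 * back to the real mutex.  Pass the result to transactional_unlock(). */
static bool transactional_lock(elided_lock_t *l)
{
    for (int i = 0; i < MAX_TX_ATTEMPTS; i++) {
        if (asf_speculate() == 0) {
            if (!l->held)          /* reading 'held' puts it in the read set: */
                return true;       /* a real lock holder aborts us on conflict */
            asf_commit();          /* lock is busy: end the empty transaction */
        }
        while (l->held) ;          /* wait until it looks free, then retry */
    }
    pthread_mutex_lock(&l->mtx);   /* too many aborts: take the real lock */
    l->held = 1;
    return false;
}

static void transactional_unlock(elided_lock_t *l, bool elided)
{
    if (elided) {
        asf_commit();              /* publish all speculative stores atomically */
    } else {
        l->held = 0;
        pthread_mutex_unlock(&l->mtx);
    }
}

TRANSACTIONAL_LOCK/TRANSACTIONAL_UNLOCK would just be these two
functions; the C or C++ body in between stays untouched, which is the
whole attraction.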
With AMD's approach (LOCK MOV) I will have to rewrite all the code in
asm... not very cool...
As for the additional flexibility, well, I don't see many use cases for
non-transactional reads/writes inside a transaction; anyway we want
to make the transaction as slim as possible...
So what is the rationale behind explicit LOCK MOVs?
--
Dmitriy V'jukov
== 8 of 24 ==
Date: Fri, May 1 2009 8:39 am
From: EricP
Dmitriy Vyukov wrote:
> On May 1, 7:02 am, EricP <ThatWouldBeTell...@thevillage.com> wrote:
>
>> So it may be that livelock issue is unfixable,
>> or vastly more complex.
>
>
> [snip]
> However, as for the formal specification, they should probably at least
> allow NACKs. Future implementations of ASF will include more and more
> intelligent logic for live-lock prevention (and prevention of
> unnecessary aborts in general).
But as a minimum they must at least ensure that single cache line
operations that don't take "too long" are guaranteed to succeed.
I.e. the same rules that Alpha has for LL/SC.
This doesn't necessarily mean NAKs, just holding onto
a locked cache line for some small clock window to give
the commit a chance to occur.
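(In C11-atomic terms, as an analogy rather than ASF syntax, that
guarantee is what makes the classic single-word retry loop below
terminate instead of livelocking:)

#include <stdatomic.h>
#include <stddef.h>

struct node { struct node *next; };

/* Lock-free stack push: the smallest "single cache line" operation.
 * With an Alpha-LL/SC-style hold on the line, a loop this short is
 * assured of eventually succeeding; without it, two cores can keep
 * stealing the line from each other forever. */
void push(_Atomic(struct node *) *top, struct node *n)
{
    struct node *old = atomic_load_explicit(top, memory_order_relaxed);
    do {
        n->next = old;   /* 'old' is refreshed by the CAS on failure */
    } while (!atomic_compare_exchange_weak_explicit(
                 top, &old, n,
                 memory_order_release, memory_order_relaxed));
}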
Eric
== 9 of 24 ==
Date: Fri, May 1 2009 9:13 am
From: Dmitriy Vyukov
On May 1, 8:39 am, EricP <ThatWouldBeTell...@thevillage.com> wrote:
> [snip]
>
> But as a minimum they must at least ensure that single cache line
> operations that don't take "too long" are guaranteed to succeed.
> I.e. the same rules that Alpha has for LL/SC.
> This doesn't necessarily mean NAKs, just holding onto
> a locked cache line for some small clock window to give
> the commit a chance to occur.
Agreed. The CPU may just delay the ACK.
It would be beneficial for the formal specification to permit different
implementations. [Very] hopefully ASF will be supported by Intel too;
who knows what approach they will take.
--
Dmitriy V'jukov
== 10 of 24 ==
Date: Fri, May 1 2009 9:30 am
From: MitchAlsup
On May 1, 10:13 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> Why is it necessary to use LOCK MOV? Why can't all reads and writes
> inside the transactional region be treated as transactional?
A good question.
But consider that one must end up dereferencing memory references to
set up the cache lines for the critical section. And notice that there
is a strict limit on the number of protected cache lines. Thus, if
there is any additional level of indirection, you run out of
protectable resources.
Secondly, in order to debug these critical regions, one must build a
thread-safe memory buffer and dump critical-region state into same so
that you can print it out (or otherwise examine it) after success or
failure. You really (REALLY) do not want these to be protected and
rolled back on failure.
Thus, the limited number of protectable lines, and the absolute need
to be able to figure out what went on inside the critical region,
prevent treating everything as protected.
In my opinion, this is a good thing, a very good thing.
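(A concrete illustration of the debugging point, again with hypothetical
intrinsic names: the trace buffer below is written with ordinary stores,
so whatever was logged survives an abort even though the LOCKed stores
are rolled back:)

#include <stdint.h>

extern int      asf_speculate(void);                              /* SPECULATE   */
extern uint64_t asf_lock_load(volatile uint64_t *p);              /* LOCK MOV ld */
extern void     asf_lock_store(volatile uint64_t *p, uint64_t v); /* LOCK MOV st */
extern void     asf_commit(void);                                 /* COMMIT      */

/* Per-thread trace buffer, written with plain stores: it is NOT part of
 * the protected set, so it is NOT rolled back on abort -- which is
 * exactly why you can read it afterwards to see how far the region got. */
static _Thread_local uint64_t trace[64];
static _Thread_local unsigned trace_n;

static void trace_log(uint64_t what) { trace[trace_n++ & 63u] = what; }

int move_element(volatile uint64_t *from, volatile uint64_t *to)
{
    if (asf_speculate() != 0)
        return -1;          /* aborted: protected stores vanished, trace[] did not */
    uint64_t v = asf_lock_load(from);
    trace_log(v);           /* unprotected store: survives an abort */
    asf_lock_store(from, 0);
    asf_lock_store(to, v);
    asf_commit();
    return 0;
}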
Mitch
== 11 of 24 ==
Date: Fri, May 1 2009 9:43 am
From: MitchAlsup
On May 1, 6:20 am, Mayan Moudgill <ma...@bestweb.net> wrote:
> Again, I apologize for doubting Mitch Alsup's credentials. However, I
> still think that he is suffering from a little bit of tunnel-vision.
For the record, I thank Terje for his defense and would like to
indicate that I take no offense at Mayan's comment.
This whole synchronization philosophy was developed from the hardware
level up (make the HW work first) and then we figured out how to program
it (present a reasonable abstraction to the SW levels). It is 'different'
because we ultimately felt that the current directions {DCAS, xxTM}
were lacking at various points in their (sub)architectures.
I would (happily) submit that the field of synchronization has so many
mind-numbing details that all paths to any potential solutions have a
certain amount of tunnel vision.
However, the earlier ASF proposal was the first (that I know of)
that showed a way to attack the BigO(n**2) problem. This problem has
to do with the cache coherence protocols and how memory systems are
architected with performance aimed at the 'normal' access patterns and
a near complete blind eye to synchronization access patterns (repeated
access to hot lines). DCAS (and variants) and Transactional Memory
have these exact same problems. The problem is not DCAS, nor xxTM, but
inside the cache hierarchy and memory systems. And we went into
significant detail in the thread posted earlier.
ASF evolved over 2-2.5 years (while I was there) with insight from the
best architects within AMD and from selected AMD partners who gave
useful and illuminating input into what they might have liked and what
they did not. I'm pretty sure that this evolution did not stop after I
left.
Mitch
== 12 of 24 ==
Date: Fri, May 1 2009 9:47 am
From: MitchAlsup
On Apr 30, 4:38 pm, EricP <ThatWouldBeTell...@thevillage.com> wrote:
> The current ASF would also seem to preclude the hardware
> accelerators as it never forms an update "set" until the commit.
> I rather liked that idea, given the wealth of transistors available,
> and I think it would have given a distinct competitive advantage
> in the multi-core arena.
I saw nothing in the specification that would have precluded an ASF
implementation from modifying the cache at the normal pipeline
(LOCKed) store timing with the caveat that the premodified data be
moved to the miss data buffers where it can be recovered upon failure.
Thus success can be made fast, penalizing failures; instead of the
other way around.
Mitch
== 13 of 24 ==
Date: Fri, May 1 2009 9:58 am
From: MitchAlsup
On Apr 30, 4:38 pm, EricP <ThatWouldBeTell...@thevillage.com> wrote:
> I assume you mean the N**2 cost of collision-retry loops.
> Yeah... and I don't like their unbounded queue delays.
N**2 is the amount of memory traffic to deal with N processor threads
accessing (for/with write permission) a single cache line and all N of
these processors agreeing on who "got" that line. Its simple cache
coherence at play/work.
Software programming practices and SW synchronization architectures can make it
worse than this.
Mitch
== 14 of 24 ==
Date: Fri, May 1 2009 10:19 am
From: MitchAlsup
On Apr 30, 9:38 pm, Mayan Moudgill <ma...@bestweb.net> wrote:
> MitchAlsup wrote:
> > On Apr 30, 8:28 am, Mayan Moudgill <ma...@bestweb.net> wrote:
>
> >>- while a cache-line is locked, any requests to that cache line get deferred
>
> > This leads to unsolvable deadlock/livelock issues--depending on the
> > protocol on the interconnect(s). And created massive headaches for
> > interrupts, and more than a few issues for exceptions.
>
> 1) The abort-on-probe in ASF can lead to live-lock.
Agreed. This is why earlyASF {the version I know about} had a HW
mechanism that HW could use to prevent live-lock.
> 2) If the interconnect does NOT allow deferred access to cache-lines,
> then the ASF proposal does not work either; the ASF has to defer access
> whilst committing the locked regions to the cache.
This is why I was lobbying for a NAK to be added to the HyperTransport
interconnect fabric and only for ASF events.
> 3) The mechanism described specifically allows interrupts and exceptions
> to be taken (and reservations to fail)
My point was that unconstrained NAKing makes interrupts really tricky,
whereas tightly controlled NAKing avoids these pitfalls without
losing the value of the NAKing {i.e. preventing livelock} and does
not make interrupts any more (usefully) difficult than they already
are.
The earlyASF that I was associated with set a timeout at the point when
the critical section "became" critical. During this short window of
time, interrupts would be deferred. This time was supposed to be about
2x the worst-case DRAM access. Thus, the critical-region program should
have most of its protected data in its cache, and if it does, then the
timeout would not become visible. If the critical region had lost
several of its lines, there is a good chance that interference would
cause a failure in the critical region anyway.
Thus, one NEEDS an instruction to MARK the transition from collecting
the lines to be processed to manipulating those lines in the
ATOMIC event. This is what was lost. In addition, the instruction that
used to do this delivered an integer (+0-), and embedded in this
integer was the success indicator (0) or a failure indicator. Negative
failure numbers were assigned to spurious errors (buffer overflows,
table overflows,...) to allow HW simplification and allow SW to ignore
these and simply try again. Positive numbers were assigned by the
order of interference on a given set of protected lines. Thus one
could attempt to ACQUIRE the lines needed to perform a critical
section, and receive back an indication that 6 other threads have
touched (at least part of) those same lines to be protected. SW was
supposed to use this indicator to do something other than just try
again. Just trying again leads to BigO(N**2) contention. Knowing that
you are 6th in line allows that routine to march down the linked list
and then attempt an access that has a low probability of others
accessing it. This is what allows SW to attack the BigO(N**2) problem.
Consider: a timer goes off and 15 threads jump at the head of the run
queue. With normal synchronization, this (let's just say) takes a long
time. However, in earlyASF everybody wakes up, jumps at the head of
the queue, is given an interference number, uses this number to access
independent elements on the subsequent attempt, and BigO(N**2) becomes
BigO(3). Since timing cannot be guaranteed, BigO(3) degrades to
"about" BigO(log(N)).
However, it appears that this has been lost.
Mitch
== 15 of 24 ==
Date: Fri, May 1 2009 10:26 am
From: MitchAlsup
On Apr 30, 9:38 pm, Mayan Moudgill <ma...@bestweb.net> wrote:
> MitchAlsup wrote:
> > This requires a search (directed or not) through the cache to modify
> > the states after successful conclusion. See below.
>
> Not really; what it requires is easy to do _IF_ you've implemented your
> cache state using flops instead of SRAMs.
Look, NOBODY is going to build their cache(s) in register technology,
it's just not dense enough. However, everybody is going to have a number
of miss buffers that ARE built in register technology. Thus, if you can
accept a reasonable limit on the number of protected lines, then the
miss buffers are actually easy to use for this purpose.
Now, you might be able to afford to build all or part of your TAG
array in register technology, and some of this is (and has been) done,
especially for invalidations. But you will never be able to afford the
data portion of the cache to be implemented in register technology.
In any event, the miss buffers already have all the desired
characteristics. They already deal with inbound and outbound data, the
demand address and the victim address, and they are snooped
continuously. So they are the perfect place to add a "little"
functionality. In addition, for all intents and purposes, they ARE a
cache; a cache of memory things in progress that need constant
tending.
Mitch
== 16 of 24 ==
Date: Fri, May 1 2009 10:30 am
From: MitchAlsup
On May 1, 10:04 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> On Apr 29, 1:05 pm, MitchAlsup <MitchAl...@aol.com> wrote:
>
> > > Well, hooray, my invention has finally seen the light of day.
>
> I wanted to ask some question on AMD forum since they are asking for
> feedback. But since Mitch is here I will ask here.
>
> Table 6.2.1. postulates that if CPU B holds cache-line in Protected
> Owned state, and CPU A make plain (non-transactional) read, then CPU B
> aborts.
> Why it's necessary to abort CPU B?
You have lost the illusion of atomicity on that cache line. Thus, it's
time to abort and retry (potentially later).
ASF is not guaranteeing atomicity, it is guaranteeing the ILLUSION of
atomicity.
Mitch
== 17 of 24 ==
Date: Fri, May 1 2009 10:38 am
From: MitchAlsup
On May 1, 10:13 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> On Apr 29, 1:05 pm, MitchAlsup <MitchAl...@aol.com> wrote:
> With AMD's approach (LOCK MOV) I will have to rewrite all code in
> asm... not very cool...
AMD has worked with SUN and we jointly came to the conclusion that ASF
would be a powerful tool to make transactional memory work better/
faster.
I think it was Mark Moir that said "Transactional memory is for people
who just want synchronization to work", at which point I said "ASF is
for people who want synchronization to work fast", and we all had a
good chuckle. There is a completely different philosophy here. ASF is
for coders who understand atomicity, synchronization, nonBlocking,
WaitFree, and the intricate techniques involved therein, and are
willing to work for their performance. TM maybe/is for everybody else.
If you like TM write in TM, only if TM is not giving you the
performance you are looking for, then selectively recode those time
critical section in ASF to beat down the costs.
I also think that over 5-ish years SW people will come up with better
abstractions than did we so as to present SW coders with powerful more
easy to use primatives. This is why I spend so much effort building
primatives than solutions (DCAS,TCAS,QCAS).
Mitch
== 18 of 24 ==
Date: Fri, May 1 2009 10:43 am
From: Dmitriy Vyukov
On May 1, 9:30 am, MitchAlsup <MitchAl...@aol.com> wrote:
> On May 1, 10:13 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
>
> > Why is it necessary to use LOCK MOV? Why can't all reads and writes
> > inside the transactional region be treated as transactional?
>
> A good question.
>
> But consider that one must end up dereferencing memory references to
> set up the cache lines for the critical section. And notice that there
> is a strict limit to the number of protected cache lines. Thus, if
> there is any additional level of indirection, you run out of
> protectable resources.
>
> Secondly, in order to debug these critical regions, one must build a
> thread-safe memory buffer and dump critical region state into same so
> that you can print it out (or otherwise examine it) after success or
> failure. You really (REALLY) do not want these to be protected and
> rolled back on failure.
>
> Thus, the limited number of protectable lines, and the absolute need
> to be able to figure out what went on inside the critical region,
> prevents treating everything as protected.
>
> In my opinion, this is a good thing, a very good thing.
Personally I like precise fine-grained control over the read and write
sets (anyway, it's a thing for crazy low-level people), and the
possibility to RELEASE a line from the read set is awesome (especially
for linked lists).
But then maybe just choose different defaults:
LOCK MOV -> MOV
MOV -> UNLOCKED MOV
Consider: non-transactional code inside a transaction has to deal with
something special: debugging, contention control, wicked
optimizations, etc. I.e., something which is deliberately introduced
and has received the appropriate attention, probably something emitted
by an aware compiler, something which is intended "to appear inside of
a transaction". Basically, something for which it's OK to explicitly
add 'UNLOCKED'. Plain synchronization code, on the other hand, is just
"normal" code surrounded with enter/leave_critical_region
(transaction- or mutex-based).
So IMVHO it makes perfect sense to allow normal synchronization code
to use normal MOVs, and at the same time allow system code too (which
will use UNLOCKED MOVs).
--
Dmitriy V'jukov
== 19 of 24 ==
Date: Fri, May 1 2009 10:51 am
From: Dmitriy Vyukov
On May 1, 10:30 am, MitchAlsup <MitchAl...@aol.com> wrote:
> On May 1, 10:04 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
>
> > On Apr 29, 1:05 pm, MitchAlsup <MitchAl...@aol.com> wrote:
>
> > > Well, hooray, my invention has finally seen the light of day.
>
> > I wanted to ask some question on AMD forum since they are asking for
> > feedback. But since Mitch is here I will ask here.
>
> > Table 6.2.1. postulates that if CPU B holds cache-line in Protected
> > Owned state, and CPU A make plain (non-transactional) read, then CPU B
> > aborts.
> > Why it's necessary to abort CPU B?
>
> You have lost the illusion of atomicity on that cache line. Thus, it's
> time to abort and retry (potentially later).
For whom exactly have I lost the illusion of atomicity?
CPU A is doing non-transactional reads, so it's OK for it to see
inconsistent data (if it reads more than one variable); it's only
important that it not see speculative stores.
CPU B has not committed yet and its speculative stores are still in the
special buffer, so it is not in the game yet: all its reads are still
valid, and its speculative stores are not yet observed by anybody.
> ASF is not guaranteeing atomicity, it is guaranteeing the ILLUSION of
> atomicity.
Isn't illusion of atomicity IS-A atomicity? :)
--
Dmitriy V'jukov
== 20 of 24 ==
Date: Fri, May 1 2009 10:53 am
From: Mayan Moudgill
MitchAlsup wrote:
> On Apr 30, 9:38 pm, Mayan Moudgill <ma...@bestweb.net> wrote:
>
>>MitchAlsup wrote:
>>
>>>This requires a search (directed or not) through the cache to modify
>>>the states after successful conclusion. See below.
>>
>>Not really; what it requires is easy to do _IF_ you've implemented your
>>cache state using flops instead of SRAMs.
>
>
> Look, NOBODY is going to build their cache(s) in register technology,
> its just not dense enough. However everybody is going to have a number
> of miss buffers that ARE built in register technology. Thus if you can
> accept a reasonable limit on the number of protected lines, then the
> miss buffers are actually easy to use for this purpose.
>
> Now, you might be able to afford to build all or part of your TAG
> array in register technology, and some of this is (and has been done)
> (especially invalidations. But you will never be able to afford the
> data portion of the cache to be implemented in register technology.
Absolutely. I was only talking about the cache-state (i.e. MESI) bits,
not even the whole tag (MESI+LRU+sub-cache occupancy+index etc.), which
might be 512x2 bits. If you're going to support two loads per cycle
against each set, there is a lot of state that you might want
implemented as flops rather than arrays.
Given the requirement that the cache start up with all entries invalid,
you may already be implementing the MESI bits (partly) using flops.
(Alternative approaches include using a state machine to clear each
state, which also has area implications).
> In any event, the miss buffers already have all the desired
> characteristics. They already deal with inbound and outbound data, the
> demand address and the victim address, and they are snooped
> continuously. So they are the perfect place to add a "little"
> functionality. In addition, for all intents and purposes, they ARE a
> cache; a cache of memory things in progress that need constant
> tending.
>
> Mitch
I dislike using MAF/miss-buffers. I would much rather send the control
information with the miss-request, and have it echoed back along with
the miss-return-data, deferring all the miss-combining etc. to the L2.
{By control information, I mean stuff like
- register to be written back
- insn to be retired
}
This costs extra pipeline flops, but it removes an extra queuing
resource from the core, and takes one? stage out of the cache miss
completion path.
== 21 of 24 ==
Date: Fri, May 1 2009 11:11 am
From: Dmitriy Vyukov
On May 1, 10:38 am, MitchAlsup <MitchAl...@aol.com> wrote:
> On May 1, 10:13 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
>
> > On Apr 29, 1:05 pm, MitchAlsup <MitchAl...@aol.com> wrote:
> > With AMD's approach (LOCK MOV) I will have to rewrite all code in
> > asm... not very cool...
>
> AMD has worked with SUN and we jointly came to the conclusion that ASF
> would be a powerful tool to make transactional memory work better/
> faster.
>
> I think it was Mark Moir that said "Transactional memory is for people
> who just want synchronization to work", at which point I said "ASF is
> for people who want synchronization to work fast"
:)
Indeed!
We need synchronization primitives not only to work, but to work fast. I
am even OK with the name SPECULATE :)
> , and we all had a
> good chuckle. There is a completely different philosophy here. ASF is
> for coders who understand atomicity, synchronization, nonBlocking,
> WaitFree, and the intricate techniques involved therein, and are
> willing to work for their performance. TM maybe/is for everybody else.
>
> If you like TM write in TM, only if TM is not giving you the
> performance you are looking for, then selectively recode those time
> critical section in ASF to beat down the costs.
>
> I also think that over 5-ish years SW people will come up with better
> abstractions than did we so as to present SW coders with powerful more
> easy to use primitives. This is why I spend so much effort building
> primitives than solutions (DCAS,TCAS,QCAS).
If some design decision is a trade-off which affects performance, then
I vote with 3 hands for the solution which provides more
performance/scalability. Well, but why not just provide UNLOCKED MOV
instead? This does not sacrifice performance (at first glance).
I have to thank you for such a great job. I am especially happy that
it happened in the "x86 company" :)
The RELEASE feature is super-cool! Something which even today's STMs
lack. I was talking with the developers of Intel's STM about "partial
commits" or "partly overlapping transactions"; not sure whether they
are going to support them...
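(To make the linked-list point concrete, a sketch with made-up intrinsic
names, where asf_release() drops an already-read line from the protected
set so a long traversal never holds more than a node or two at a time;
the key reads are left as plain loads to keep the sketch short:)

struct node { struct node *next; int key; };

extern int   asf_speculate(void);                            /* SPECULATE           */
extern void *asf_lock_load_ptr(void *volatile *p);           /* LOCK MOV load       */
extern void  asf_lock_store_ptr(void *volatile *p, void *v); /* LOCK MOV store      */
extern void  asf_release(void *line);                        /* RELEASE a read line */
extern void  asf_commit(void);                               /* COMMIT              */

/* Hand-over-hand search-and-unlink: only the last node or two are ever in
 * the protected set, no matter how long the list is.  Returns 0 if removed,
 * 1 if not found, -1 if the hardware aborted (caller retries or falls back). */
int remove_key(struct node *volatile *head, int key)
{
    if (asf_speculate() != 0)
        return -1;
    struct node *volatile *prev = head;
    struct node *cur = asf_lock_load_ptr((void *volatile *)prev);
    while (cur && cur->key != key) {
        struct node *next = asf_lock_load_ptr((void *volatile *)&cur->next);
        asf_release((void *)prev);   /* the line we walked past leaves the read set */
        prev = &cur->next;
        cur  = next;
    }
    if (cur)                         /* found: unlink with a protected store */
        asf_lock_store_ptr((void *volatile *)prev, cur->next);
    asf_commit();
    return cur ? 0 : 1;
}

Without RELEASE, every node walked stays protected and the traversal
runs out of protectable lines after a handful of hops.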
Btw, here is one question I have to ask. AMD was working with SUN, but
did AMD have some discussions with Intel on this (about support in
Intel processors)? IMVHO such initiatives must be a joint effort of
both x86 giants. Not sure whether Intel had discussions with AMD about
AVX...
--
Dmitriy V'jukov
== 22 of 24 ==
Date: Fri, May 1 2009 12:51 pm
From: MitchAlsup
On May 1, 1:11 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> On May 1, 10:38 am, MitchAlsup <MitchAl...@aol.com> wrote:
> > I also think that over 5-ish years SW people will come up with better
> > abstractions than did we so as to present SW coders with powerful more
> > easy to use primitives. This is why I spend so much effort building
> > primitives than solutions (DCAS,TCAS,QCAS).
>
> If some design decision is a trade-off decision which affects
> performance, then I vote with 3 hands for solution which provides more
> performance/scalability.
What would you do if the intellectual burden was such that, after you
developed/exposed those primitives, only 2 dozen people in the world
understood the subtleties to the point needed to code reliable
synchronizations? We stumbled over more than a few such possibilities
uncovered along the way.
> Well. but why not just provide UNLOCKED MOV
> instead, this does not sacrifice performance (at my first glance).
We talked about this in the previous thread. Placing synchronization/
atomicity on top of the cache coherence protocol makes complexity go
through the roof. Placing it beside the CC protocols only makes it
really hard to get right. You are playing HW games with the cache
coherence protocol. The cache IS actually incoherent for a short period
of time, on exactly those protected lines, and in addition there is the
potential that the memory request protocol is different/enhanced/more-
dangerous/more-complicated during part of that period of time. This
leaves vast windows-in-time where you can screw up royally and not
have the ability to recover (the architecture is fundamentally broken).
What we are talking about here, however, is whether all these
subtleties (at the 99.9%-ish level) have been discovered, talked about,
dealt with, applied in (demonstration-like) practice, and written down.
> I have to thank you for such a great job. Especially I am happy that
> it was in the "x86 company" :)
Thanks.
> RELEASE feature is super-cool! Something which even todays STM lack
> of. I was talking with developers of Intel STM about "partial commits"
> or "partly overlapping transactions", not sure whether they are going
> to support them...
I have to defer credit to Dave Christie, here.
> Btw, here is one question I have to ask. AMD was working with SUN, but
> did AMD have some discussions with Intel on this (about support in
> Intel processors)? IMVHO such initiatives must be a join effort of
> both x86 giants. Not sure whether Intel had discussions with AMD about
> AVX...
The necessity of non-contamination of Intellectual Property means
that AMD/Intel cannot work together until something is at the level of
the just-issued specification. However, I am 99.44% sure that a few
people within Intel were aware of the general nature of this effort,
as we were aware of several TM-like investigations inside Intel. This
is how NonDisclosure agreements work in practice. IP stuff is not
directly disclosed, but the questions from NDA signers are phrased in
such a way that the NDA-holding party does end up understanding that
somebody else is looking at <blah>. And generally one can google up
enough intermediate data to infer what the other party may be thinking
about.
Mitch
== 23 of 24 ==
Date: Fri, May 1 2009 12:54 pm
From: MitchAlsup
On May 1, 12:51 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> On May 1, 10:30 am, MitchAlsup <MitchAl...@aol.com> wrote:
> > You have lost the illusion of atomicity on that cache line. Thus, it's
> > time to abort and retry (potentially later).
>
> For whom exactly I lost illusion of atomicity?
Interested third parties (like a DMA device, graphics, or other threads)
may have seen the cache line. So, you must revert back to the
pre-atomic-event state. At this point, it's simply easier, and less
perilous, to abandon the current critical section and take a fresh
look at the concurrent data structure, and then figure out what is the
best course of action from here onwards.
Mitch
== 24 of 24 ==
Date: Fri, May 1 2009 12:57 pm
From: MitchAlsup
On May 1, 12:51 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> On May 1, 10:30 am, MitchAlsup <MitchAl...@aol.com> wrote:
> > ASF is not guaranteeing atomicity, it is guaranteeing the ILLUSION of
> > atomicity.
>
> Isn't illusion of atomicity IS-A atomicity? :)
It sort of depends on whether you are thinking at the instruction set
level or at the clock-by-clock level.
At the clock-by-clock level, once you attempt to provide an atomic
event that is wider (in number of stores) than one can perform stores
on a per-clock basis, the illusion of atomicity is all that can be
provided.
Mitch
==============================================================================
TOPIC: Deitel & Deitel's How to Program C++ 6th Edition chapters 1-22 full
code solutions for $50
http://groups.google.com/group/comp.programming.threads/t/4621a963e96dd42b?hl=en
==============================================================================
== 1 of 1 ==
Date: Fri, May 1 2009 6:32 am
From: instructors.team@gmail.com
If you need help with your learning from Deitel & Deitel's How to
Program C++ 6th Edition and need these solutions send us an email at
instructors.team[at]gmail.com.
Code solutions for $50. All code solutions for ch 1-22.
==============================================================================
You received this message because you are subscribed to the Google Groups "comp.programming.threads"
group.
To post to this group, visit http://groups.google.com/group/comp.programming.threads?hl=en
To unsubscribe from this group, send email to comp.programming.threads+unsubscribe@googlegroups.com
To change the way you get mail from this group, visit:
http://groups.google.com/group/comp.programming.threads/subscribe?hl=en
To report abuse, send email explaining the problem to abuse@googlegroups.com
==============================================================================
Google Groups: http://groups.google.com/?hl=en