Saturday, February 23, 2019

Digest for comp.lang.c++@googlegroups.com - 21 updates in 4 topics

mvorbrodt@gmail.com: Feb 23 12:44PM -0800

Is it possible for this simple program to print false, given that the compiler or CPU can reorder around the relaxed atomic? Or is my understanding off?
 
#include <atomic>
#include <iostream>
#include <iomanip>
#include <thread>
using namespace std;

int main(int argc, char** argv)
{
    atomic_bool flag{false};
    bool data{false};

    thread t1([&]() {
        data = true;
        flag.store(true, memory_order_relaxed);
    });

    thread t2([&]() {
        while (flag.load(memory_order_relaxed) == false);
        cout << "data = " << boolalpha << data << endl;
    });

    t1.join();
    t2.join();

    return 1;
}
Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Feb 23 10:09PM

On Sat, 23 Feb 2019 12:44:33 -0800 (PST)
mvorbrodt@gmail.com wrote
> t2.join();
 
> return 1;
> }
 
Yes it is possible, and because of that technically you also have
undefined behaviour as regards the read and write of 'data', which is
not an atomic type.
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Feb 23 02:22PM -0800

On 2/23/2019 2:09 PM, Chris Vine wrote:
 
> Yes it is possible, and because of that technically you also have
> undefined behaviour as regards the read and write of 'data', which is
> not an atomic type.
 
Big time. This is a massive data race on 'bool data'.
mvorbrodt@gmail.com: Feb 23 03:09PM -0800

> t2.join();
 
> return 1;
> }
 
The data race goes away if I change it to store(release) and load(acquire), or SC on both, correct?
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Feb 23 03:20PM -0800


>> return 1;
>> }
 
> the data race goes away if i change it to store(release) and load(acquire) or SC on both, correct?
 
Correct. :^)
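 
For reference, a minimal sketch of the corrected program: the release
store to 'flag' synchronizes-with the acquire load, so t2's read of
'data' is ordered after t1's write and the race is gone.
__________________________________
#include <atomic>
#include <iomanip>
#include <iostream>
#include <thread>

int main()
{
    std::atomic<bool> flag{false};
    bool data{false};

    std::thread t1([&]() {
        data = true;                                 // happens-before the store
        flag.store(true, std::memory_order_release); // release store
    });

    std::thread t2([&]() {
        while (!flag.load(std::memory_order_acquire)); // acquire load
        std::cout << "data = " << std::boolalpha << data << '\n'; // prints true
    });

    t1.join();
    t2.join();
}
__________________________________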
Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Feb 23 11:46AM

On Fri, 22 Feb 2019 14:35:46 -0800 (PST)
> > It needs seq_cst. Keep in mind that an x86 can reorder a store followed
> > by a load to another location. ;^)
 
> thank you. yes i realized the mistake and rewrote it with seq_cst.
 
Dekker's algorithm is OK to learn for pedagogical purposes but don't
use it in real life. Even though sequential consistency is the default
in C++ for atomics, to avoid unnecessary fence instructions you should
where possible prefer acquire/release semantics, particularly on x86.
 
std::memory_order_seq_cst is needed where a thread must observe the same
modification order of two or more atomic variables as another thread
does, including variables it does not itself modify. Two threads
co-operating on a single atomic variable do not need it (nor do threads
co-operating on more than one atomic variable where program behaviour
does not depend on their modification ordering). Most algorithms can
be (re)written to avoid the need for sequential consistency.
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Feb 23 12:35PM -0800

On 2/23/2019 3:46 AM, Chris Vine wrote:
> co-operating with more than one atomic variable where program action
> does not depend on their modification ordering). Most algorithms can
> be (re)written to void cases of sequential consistency.
 
Dekker's algorithm basically needs seq_cst wrt its store followed by a
load to another location. It needs an explicit MFENCE or a dummy
LOCK'ed atomic operation, even on x86.
 
Well, there is the more exotic asymmetric version, but that uses another
type of barrier that is not provided by C++11.
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Feb 23 02:47PM -0800


> int main(int argc, char** argv)
> {
> atomic_bool f1{false}, f2{false};
[...]
 
You are using a different algorithm than Dekker. Afaict, you are
omitting the turn variable:
 
https://en.wikipedia.org/wiki/Dekker%27s_algorithm
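 
For reference, a sketch of classic two-thread Dekker with the turn
variable, following the Wikipedia pseudocode; the memory orders are the
conservative all-seq_cst choice, not a tuned implementation:
__________________________________
#include <atomic>

struct dekker
{
    std::atomic<bool> wants_to_enter[2];
    std::atomic<int> turn;

    dekker() : turn(0)
    {
        wants_to_enter[0].store(false);
        wants_to_enter[1].store(false);
    }

    void lock(int i) // i is 0 or 1
    {
        const int j = 1 - i;
        wants_to_enter[i].store(true, std::memory_order_seq_cst);
        while (wants_to_enter[j].load(std::memory_order_seq_cst))
        {
            if (turn.load(std::memory_order_seq_cst) != i)
            {
                wants_to_enter[i].store(false, std::memory_order_seq_cst);
                while (turn.load(std::memory_order_seq_cst) != i)
                    ; // busy-wait until it is our turn
                wants_to_enter[i].store(true, std::memory_order_seq_cst);
            }
        }
    }

    void unlock(int i)
    {
        turn.store(1 - i, std::memory_order_seq_cst);
        wants_to_enter[i].store(false, std::memory_order_seq_cst);
    }
};
__________________________________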
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Feb 23 03:15PM -0800

> Can you please review my implementation of Dekker's algorithm and tell me if my memory ordering on atomic operations is correct. I think it is but I need to double check...
[...]
 
Fwiw, I can only get Dekker's algorithm to work with the following
membars. This is a Relacy test unit; take a deep look at ct_dekker:
__________________________________
// Dekker Test
// Relacy Test By Chris M. Thomasson
//_______________________________________________


//#define RL_DEBUGBREAK_ON_ASSERT
//#define RL_MSVC_OUTPUT
//#define RL_FORCE_SEQ_CST
//#define RL_GC


#include <relacy/relacy_std.hpp>
#include <iostream>


#define THREADS (2) // Only 2 threads will work here...


struct ct_dekker
{
    std::atomic<bool> f1;
    std::atomic<bool> f2;

    ct_dekker() : f1(false), f2(false) {}


    void p0_lock()
    {
        for (;;)
        {
            f1.store(true, std::memory_order_seq_cst);

            if (f2.load(std::memory_order_seq_cst) == false)
            {
                break;
            }

            f1.store(false, std::memory_order_relaxed);
            rl::yield(1, $);
        }
    }

    void p0_unlock()
    {
        f1.store(false, std::memory_order_release);
    }


    void p1_lock()
    {
        for (;;)
        {
            f2.store(true, std::memory_order_seq_cst);

            if (f1.load(std::memory_order_seq_cst) == false)
            {
                break;
            }

            f2.store(false, std::memory_order_relaxed);
            rl::yield(1, $);
        }
    }

    void p1_unlock()
    {
        f2.store(false, std::memory_order_release);
    }
};


// Relacy Dekker
struct ct_dekker_test
    : rl::test_suite<ct_dekker_test, THREADS>
{
    VAR_T(long) g_shared_state;
    ct_dekker g_dekker;

    void before()
    {
        VAR(g_shared_state) = 0;
    }

    void after()
    {
    }


    void thread(unsigned int tidx)
    {
        if (tidx < 1)
        {
            g_dekker.p0_lock();
            ++VAR(g_shared_state);
            g_dekker.p0_unlock();
        }
        else
        {
            g_dekker.p1_lock();
            ++VAR(g_shared_state);
            g_dekker.p1_unlock();
        }
    }
};



// Test away... Or fly? Humm...
int main()
{
    {
        rl::test_params p;

        p.iteration_count = 10000000;
        //p.execution_depth_limit = 33333;
        //p.search_type = rl::sched_bound;
        //p.search_type = rl::fair_full_search_scheduler_type;
        //p.search_type = rl::fair_context_bound_scheduler_type;

        rl::simulate<ct_dekker_test>(p);
    }

    return 0;
}
__________________________________
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Feb 22 05:11PM -0800

On 2/22/2019 12:03 PM, Geoff wrote:
 
> Fin!
 
> There seems to be quite a performance hit on x64 code vs x86.
> Both tests were run from Windows PowerShell.
 
Yeah. I get similar results on my end as well, and still not sure why.
Humm...
Geoff <geoff@invalid.invalid>: Feb 22 12:03PM -0800

On Tue, 19 Feb 2019 22:49:13 -0800, "Chris M. Thomasson"
>be performed in this test at 60 seconds. The number of threads is
>variable and is determined by std::hardware_concurrency * THREADS, set
>at 8 in the test.
 
On Win32 binary release build from VS2017 Community:
Testing Version 0.1: std::shared_mutex
 
___________________________________
cpu_threads_n = 6
threads_n = 48
writers = 24
readers = 24
test duration = 60 seconds
___________________________________
 
 
___________________________________
Raw Reads: 16488
Raw Writes: 2672
reads_per_tick = 274
writes_per_tick = 44
Ticks = 60.1008
___________________________________
 
 
 
Fin!
 
Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write
Mutex
 
___________________________________
cpu_threads_n = 6
threads_n = 48
writers = 24
readers = 24
test duration = 60 seconds
___________________________________
 
 
___________________________________
Raw Reads: 41335
Raw Writes: 2814
reads_per_tick = 688
writes_per_tick = 46
Ticks = 60.0609
___________________________________
 
 
 
Fin!
 
On x86_64 Release Build VS2017:
 
Testing Version 0.1: std::shared_mutex
 
___________________________________
cpu_threads_n = 6
threads_n = 48
writers = 24
readers = 24
test duration = 60 seconds
___________________________________
 
 
___________________________________
Raw Reads: 5983
Raw Writes: 3715
reads_per_tick = 99
writes_per_tick = 61
Ticks = 60.2067
___________________________________
 
 
 
Fin!
 
Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write
Mutex
 
___________________________________
cpu_threads_n = 6
threads_n = 48
writers = 24
readers = 24
test duration = 60 seconds
___________________________________
 
 
___________________________________
Raw Reads: 18364
Raw Writes: 2123
reads_per_tick = 305
writes_per_tick = 35
Ticks = 60.1187
___________________________________
 
 
 
Fin!
 
There seems to be quite a performance hit on x64 code vs x86.
Both tests were run from Windows PowerShell.
Bonita Montero <Bonita.Montero@gmail.com>: Feb 23 07:27AM +0100

> Ahhhh. Try comparing CMPXCHG vs XADD. To do this, we can "emulate" XADD
> with CMPXCHG. The loopless XADD performs much better than the CMPXCHG
> emulation.
 
You didn't understand what I said: in my CAS-benchmark, with as many
threads doing tight CAS-loops as there are cores, I measured an average
of about 2.5 successful CASes before a thread relinquishes the
cacheline holding the flags. That's because once a cacheline has been
fetched from another core it stays for a while in the core that fetched
it, because the MO(E)SI-signalling between the cores is slow.
This means that with the CAS-loop in my mutex the CAS should succeed on
the first attempt in almost every case. So there should be no need
for XADD here.
What's more, XADD could only be used for getting ownership as a reader
and not as a writer, because in the latter case there is not only an
add but I also set the sign-flag.
And with XADD I couldn't do the overflow-assert when NDEBUG is not
defined.
 
>> be very likely that we have kernel-waits and -signalling.
 
> When your CAS loop spins due to contention on your counter, it is in
> user space.
 
Boy, you're so slow on the uptake! When there's contention on this
flag it's very likely that there will be kernel-waits and signalling
as well; and if that happens the performance of the atomic operations
doesn't count.
 
> However, there is no bound on how many times your CAS loop can spin
> under high contention scenarios. ...
 
As I already said, and according to what I measured with my benchmark,
the CMPXCHG should succeed on the first attempt almost every time.
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Feb 23 12:31PM -0800

On 2/22/2019 10:27 PM, Bonita Montero wrote:
>> Ahhhh. Try comparing CMPXCHG vs XADD. To do this, we can "emulate"
>> XADD with CMPXCHG. The loopless XADD performs much better than the
>> CMPXCHG emulation.
[...]
> As I already said and according to what I measured with my benchmark
> the CMPXCHG should succeed in almost any time at the first attempt.
 
Should? Humm... Think of the following simple scenario:
 
Writer A reads your count.
 
Reader B gains read access.
 
Writer A spins on CAS because it failed!
 
Reader B releases read access.
 
Writer A spins on CAS because it failed!
 
Reader C gains read access.
 
Writer A spins on CAS because it failed!
 
Reader C releases read access.
 
Writer A spins on CAS because it failed!
 
[...]
 
 
Sustained read access can starve your CAS loop before any kernel waits
even occur, and make it loop many, many times in userspace without doing
any real work at all. This scenario can never happen in my algorithm
because it has no loops. What is so hard to understand? This can happen
in your work, not in mine. You cannot prove that this scenario will
never happen in your algorithm. Period.
 
Notice how the writer is starting to livelock? I don't think you
understand how a CAS can livelock in these scenarios. This is why I
avoided CAS loops when I designed my algorithm.
 
You cannot give me a fixed bound on your CAS loops, no matter how hard
you try. These are general purpose read write locks. So, you cannot
predict how they are going to be used.
 
You need to understand that a CAS loop can break down. This is nothing new.
 
https://groups.google.com/d/msg/lock-free/zzZX4fvtG04/ebVWl0BCBAAJ
 
CAS loops degrade under high contention. End of story.
 
Btw, do you want me to tell you about compare_exchange_weak?
 
 
Also, you really should test the difference between XADD and its CMPXCHG
emulation. That is the key.
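 
For what it's worth, a sketch of that comparison (the names are
illustrative): fetch_add maps to a single wait-free LOCK XADD on x86,
while the emulation spins on compare_exchange_weak and can retry an
unbounded number of times under contention.
__________________________________
#include <atomic>

// Native: one LOCK XADD, no loop, wait-free.
long xadd_native(std::atomic<long>& counter, long addend)
{
    return counter.fetch_add(addend, std::memory_order_acq_rel);
}

// Emulated: a CAS loop that retries whenever another thread modifies
// 'counter' between the load and the exchange.
long xadd_emulated(std::atomic<long>& counter, long addend)
{
    long expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(expected, expected + addend,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed))
    {
        // on failure, 'expected' has been reloaded; try again
    }
    return expected;
}
__________________________________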
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Feb 23 12:43PM -0800

On 2/22/2019 10:27 PM, Bonita Montero wrote:
 
[...]
 
Also, Bonita: can you please properly quote context? Why do you
always snip all authorship?
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Feb 22 05:22PM -0800

On 2/22/2019 2:05 AM, Rosario19 wrote:
> type is it 32 bit? is it 64? than even if it is fixed to 64 bit for
> example: is the operation + defined from standard to all value in the
> range over is defined?
 
Take a look at the fetch_add function:
 
https://en.cppreference.com/w/cpp/atomic/atomic/fetch_add
 
It does not care about wrapping: for atomic signed integral types the
standard defines the arithmetic in terms of two's complement, so
fetch_add simply wraps and there are no undefined results. Plain signed
arithmetic is another story: adding 1 to LONG_MAX in an ordinary long
is UB.
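 
A small sketch of the difference:
__________________________________
#include <atomic>
#include <climits>
#include <iostream>

int main()
{
    std::atomic<long> a{LONG_MAX};
    long prev = a.fetch_add(1); // well-defined: wraps to LONG_MIN
    std::cout << prev << " -> " << a.load() << '\n';

    // By contrast, overflowing a plain long is undefined behaviour:
    // long x = LONG_MAX;
    // ++x; // UB
}
__________________________________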
David Brown <david.brown@hesbynett.no>: Feb 23 04:18PM +0100

On 22/02/2019 16:55, Öö Tiib wrote:
> the software has been tested and used so far. So how to minimize
> damage when only thing that we know is that every software contains
> defects and damage from wrong output can be large?
 
I have no good answer to that - I don't think there is a complete good
answer to be found. We have to improve in lots of areas. Better
testing is one of them - a great deal of software is poorly tested.
Splitting code into testable units, and attempting to verify that these
units are entirely correct, is another. Contracts (which are found in
some languages, and are coming to C++ Real Soon Now™) can help too. Of
course, the biggest step is to persuade developers - and the people
managing them - to take testing and correctness seriously.
 
>> until all bugs were eliminated.
 
> You seem to forget that complex software with several potential undefined
> behaviors and other defects in it does in practice misbehave very rarely.
 
Actually no, I am not forgetting that - it is /exactly/ my point. Lots
of software has bugs, or at least potential bugs, and yet manages to do
a useful job despite that. Turning these potential bugs into crashes
and error messages simply guarantees that it will /not/ do a useful job
(though it may mean the bugs get found and fixed faster).
 
Many errors, such as signed integer overflow, pointer mistakes, or
buffer overruns, are clearly undefined behaviour. But they won't
necessarily cause adverse effects in the program. If a buffer overrun
means you accidentally clear beyond the end of an array, but the space
you clear is not used (or is written before it is next read), no harm is
done. If your integer arithmetic overflows but the result is not used,
then it could be harmless. These sorts of things are /potential/
problems - a different compiler, different options, unrelated changes in
the source code could all turn them into /real/ problems. But they
won't be spotted using testing. And options like -ftrapv can make them
/real/ problems.
 
 
> Technically it was relatively buggy, unstable and under-documented GUI
> toolkit for MS DOS. Developers did hate it but end users liked it and so
> I used it.
 
I never had the "pleasure" of Windows 2.0. Between MSDOS with GEM and
Windows 3.0 I was at university, using SunOS and Solaris.
 
 
> Sure, how should program decide that the defect manifests only as
> incoherence in warning light data? There are several better strategies
> to achieve stability of critical systems than to ignore sanity checks.
 
I did not say you should ignore them - I said you should not necessarily
stop with an error message.
 
I think we are saying the same thing in different ways. It appears that
I am saying "Don't always halt on all errors - consider carefully how to
deal with them." And you are saying "Don't always ignore errors -
consider carefully how to deal with them".
 
 
>> If your data comes from a source where it could be bad, then you check it.
 
> Exactly. Large amount of defects are that some corner case check wasn't
> written by programmer and so it is not handled and no one did notice.
 
Make your checks positive, not negative - check that the data fits the
patterns you want, rather than checking for patterns that you know are bad.
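 
A sketch of the idea; the accepted format here is a hypothetical
example:
__________________________________
#include <cctype>
#include <string>

// Whitelist validation: accept only what we know is good -- here, a
// non-empty string of at most 16 decimal digits -- rather than trying
// to enumerate every bad pattern.
bool is_valid_id(const std::string& s)
{
    if (s.empty() || s.size() > 16)
        return false;
    for (unsigned char c : s)
        if (!std::isdigit(c))
            return false;
    return true;
}
__________________________________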
 
Reinhardt Behm <rbehm@hushmail.com>: Feb 23 11:53PM +0800

AT Saturday 23 February 2019 23:18, David Brown wrote:
 
> Make your checks positive, not negative - check that the data fits the
> patterns you want, rather than checking for patterns that you know are
> bad.
 
+1
Simply because you (hopefully) know what is correct but you never know all
incorrect patterns.

--
Reinhardt
Bonita Montero <Bonita.Montero@gmail.com>: Feb 23 05:52PM +0100

> There are no modern systems that don't use 2's complement arithmetic -
> that half is correct. But modern compilers can, and do, assume that
> your signed arithmetic never overflows.
 
What is this assumption good for? It looks rather like a
nerd-decision by a compiler-writer to me. I mean, integer-arithmetic
is not like fp-arithmetic on any platform; the ISAs the compilers
target always wrap on overflow.
David Brown <david.brown@hesbynett.no>: Feb 23 06:04PM +0100

On 23/02/2019 17:52, Bonita Montero wrote:
> -decision of a compiler-writer for me. I mean integer-arithmetic is
> not like fp-arithmetic on any platform; the ISAs on which the com-
> pilers are based on always overflow.
 
They make that assumption for optimisation purposes. C++ is dependent
on optimising compilers - without optimisation, a lot of C++ code would
be very big and slow. Inlining and template expansion, followed by
strength reduction, constant propagation, and various optimisation
passes leads to all sorts of opportunities for making more efficient
object code. But a lot of that depends on the assumption that undefined
behaviour does not occur. And in the case of signed integer arithmetic,
assuming that there is no overflow allows all sorts of simplifications.
 
So the compiler knows that "x + 1 > x" is true, and "2 * x / 2" is "x",
and so on.
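 
For example, a sketch of what the assumption buys; the codegen noted in
the comments is what gcc and clang typically produce at -O2, not a
guarantee:
__________________________________
// With signed overflow assumed impossible, both functions fold away.
bool always_true(int x)
{
    return x + 1 > x; // overflow would be UB, so this becomes 'return true'
}

int scale_identity(int x)
{
    return 2 * x / 2; // 2*x assumed not to overflow, so this becomes 'return x'
}
__________________________________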
Bonita Montero <Bonita.Montero@gmail.com>: Feb 23 06:17PM +0100

> They make that assumption for optimisation purposes.
 
Cool, if you had not said so then I would not have come up with it!
 
> C++ is dependent on optimising compilers - without optimisation,
> a lot of C++ code would be very big and slow.
 
But in this case I see no optimization that would be worth this
interpretation.
 
 
> So the compiler knows that "x + 1 > x" ...
> is true, and "2 * x / 2" is "x", and so on.
 
I think it's better for a compiler not to assume this.
All the more so because no programmer is so stupid as to write such
code that makes such optimizations necessary. That is, there are
few advantages to such optimizations, but the disadvantage is
that the compiler does many things that you would not intuitively
expect from it.
For example, it can be very handy to have a variable with a
set bit that you shift from the lowest position to the highest
and beyond, until the value is empty. In such cases, and many
others, the compiler would get in the way.
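 
For what it's worth, that shifting idiom is only fully defined on
unsigned types; a sketch:
__________________________________
#include <cstdint>

// Walk a single set bit from the lowest position to the highest and
// beyond. With an unsigned type the bit falls off the top and the
// value becomes 0, ending the loop; shifting a signed value past the
// top bit is undefined behaviour.
void walk_bits()
{
    for (std::uint32_t bit = 1; bit != 0; bit <<= 1)
    {
        // ... use 'bit' ...
    }
}
__________________________________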
Christian Gollwitzer <auriocus@gmx.de>: Feb 23 09:18PM +0100

Am 23.02.19 um 18:17 schrieb Bonita Montero:
 
> I think it's better for a compiler not to assume this.
> All the more so because no programmer is so stupid to write such
> code that makes such optimizations necessary.
 
You're getting it wrong. The user may not explicitly write code such as
2*x/2, but due to inlining and template expansion an expression like
this may come up.
 
Think of index arithmetic with vectors and such things. For example:

std::vector<long> bla(30);
size_t sizeofvector = bla.size() * sizeof(long);

Since bla.size() is internally computed as the pointer difference
(end - begin) divided by sizeof(long), this second line could result
in code that does the equivalent of:

size_t bytesinvector = /* (char*)end - (char*)begin */;
size_t sizeofvector = bytesinvector / sizeof(long) * sizeof(long);
 
 
Christian
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page.
To unsubscribe from this group and stop receiving emails from it send an email to comp.lang.c+++unsubscribe@googlegroups.com.
