- Question about relaxed memory ordering - 5 Updates
- Dekker's Algorithm on atomics with memory ordering - 4 Updates
- new benchmark for my read/write algorithm... - 5 Updates
- is there any UB in this integer math... - 7 Updates
mvorbrodt@gmail.com: Feb 23 12:44PM -0800 Is it possible for this simple program to print false? Given that the compiler or CPU can reorder around the relaxed atomic. Or is my understanding off.

#include <iostream>
#include <iomanip>
#include <atomic>
#include <thread>

using namespace std;

int main(int argc, char** argv)
{
    atomic_bool flag{false};
    bool data{false};

    thread t1([&]() {
        data = true;
        flag.store(true, memory_order_relaxed);
    });

    thread t2([&]() {
        while(flag.load(memory_order_relaxed) == false);
        cout << "data = " << boolalpha << data << endl;
    });

    t1.join();
    t2.join();

    return 1;
} |
Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Feb 23 10:09PM On Sat, 23 Feb 2019 12:44:33 -0800 (PST) mvorbrodt@gmail.com wrote > t2.join(); > return 1; > } Yes it is possible, and because of that technically you also have undefined behaviour as regards the read and write of 'data', which is not an atomic type. |
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Feb 23 02:22PM -0800 On 2/23/2019 2:09 PM, Chris Vine wrote: > Yes it is possible, and because of that technically you also have > undefined behaviour as regards the read and write of 'data', which is > not an atomic type. Big time. This is a massive data race on 'bool data'. |
mvorbrodt@gmail.com: Feb 23 03:09PM -0800 > t2.join(); > return 1; > } the data race goes away if i change it to store(release) and load(acquire) or SC on both, correct? |
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Feb 23 03:20PM -0800 >> return 1; >> } > the data race goes away if i change it to store(release) and load(acquire) or SC on both, correct? Correct. :^) |
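For reference, a minimal sketch of the corrected program (illustrative only; the printout is replaced by a return value so the result can be checked). The release store on 'flag' synchronizes-with the acquire load, so the plain write to 'data' happens-before the read and the race is gone:

```cpp
#include <atomic>
#include <thread>

// Sketch of the fixed version: release/acquire pairing on 'flag'
// publishes the plain write to 'data'. The reader must observe true.
bool run_once() {
    std::atomic<bool> flag{false};
    bool data{false};
    bool seen{false};

    std::thread t1([&] {
        data = true;                                      // plain write
        flag.store(true, std::memory_order_release);      // publish
    });
    std::thread t2([&] {
        while (!flag.load(std::memory_order_acquire)) {}  // spin until published
        seen = data;                                      // guaranteed to see true
    });

    t1.join();
    t2.join();
    return seen;
}
```

With memory_order_seq_cst on both operations the guarantee is the same here; acquire/release is simply the weakest ordering that still removes the race.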
Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Feb 23 11:46AM On Fri, 22 Feb 2019 14:35:46 -0800 (PST) > > It needs seq_cst. Keep in mind that an x86 can reorder a store followed > > by a load to another location. ;^) > thank you. yes i realized the mistake and rewrote it with seq_cst. Dekker's algorithm is OK to learn for pedagogical purposes but don't use it in real life. Even though sequential consistency is the default in C++ for atomics, to avoid unnecessary fence instructions you should where possible prefer acquire/release semantics, particularly on x86. std::memory_order_seq_cst is needed where a thread must see the same order of modification of two or more atomic variables, some of which it might not modify, as does another thread. Two threads co-operating with respect to a single atomic variable do not need it (nor where co-operating with more than one atomic variable where program action does not depend on their modification ordering). Most algorithms can be (re)written to avoid cases of sequential consistency. |
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Feb 23 12:35PM -0800 On 2/23/2019 3:46 AM, Chris Vine wrote: > co-operating with more than one atomic variable where program action > does not depend on their modification ordering). Most algorithms can > be (re)written to avoid cases of sequential consistency. Dekker's algorithm basically needs seq_cst wrt its store followed by a load to another location. It needs an explicit MFENCE or a dummy LOCK'ed atomic, even on x86. Well, there is the more exotic asymmetric version, but that uses another type of barrier that is not provided by C++11. |
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Feb 23 02:47PM -0800 > int main(int argc, char** argv) > { > atomic_bool f1{false}, f2{false}; [...] You are using a different algorithm than Dekker's. Afaict, you are omitting the turn variable: https://en.wikipedia.org/wiki/Dekker%27s_algorithm |
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Feb 23 03:15PM -0800 > Can you please review my implementation of Dekker's algorithm and tell me if my memory ordering on atomic operations is correct. I think it is but I need to double check... [...] Fwiw, I can only get Dekker's algorithm to work with the following membars. This is a Relacy test unit; take a deep look at ct_dekker:
__________________________________
// Dekker Test
// Relacy Test By Chris M. Thomasson
//_______________________________________________

//#define RL_DEBUGBREAK_ON_ASSERT
//#define RL_MSVC_OUTPUT
//#define RL_FORCE_SEQ_CST
//#define RL_GC

#include <relacy/relacy_std.hpp>
#include <iostream>

#define THREADS (2) // Only 2 threads will work here...

struct ct_dekker
{
    std::atomic<bool> f1;
    std::atomic<bool> f2;

    ct_dekker() : f1(false), f2(false) {}

    void p0_lock()
    {
        for (;;)
        {
            f1.store(true, std::memory_order_seq_cst);
            if (f2.load(std::memory_order_seq_cst) == false) break;
            f1.store(false, std::memory_order_relaxed);
            rl::yield(1, $);
        }
    }

    void p0_unlock()
    {
        f1.store(false, std::memory_order_release);
    }

    void p1_lock()
    {
        for (;;)
        {
            f2.store(true, std::memory_order_seq_cst);
            if (f1.load(std::memory_order_seq_cst) == false) break;
            f2.store(false, std::memory_order_relaxed);
            rl::yield(1, $);
        }
    }

    void p1_unlock()
    {
        f2.store(false, std::memory_order_release);
    }
};

// Relacy Dekker
struct ct_dekker_test : rl::test_suite<ct_dekker_test, THREADS>
{
    VAR_T(long) g_shared_state;
    ct_dekker g_dekker;

    void before()
    {
        VAR(g_shared_state) = 0;
    }

    void after() { }

    void thread(unsigned int tidx)
    {
        if (tidx < 1)
        {
            g_dekker.p0_lock();
            ++VAR(g_shared_state);
            g_dekker.p0_unlock();
        }
        else
        {
            g_dekker.p1_lock();
            ++VAR(g_shared_state);
            g_dekker.p1_unlock();
        }
    }
};

// Test away... Or fly? Humm...

int main()
{
    {
        rl::test_params p;
        p.iteration_count = 10000000;
        //p.execution_depth_limit = 33333;
        //p.search_type = rl::sched_bound;
        //p.search_type = rl::fair_full_search_scheduler_type;
        //p.search_type = rl::fair_context_bound_scheduler_type;
        rl::simulate<ct_dekker_test>(p);
    }
    return 0;
}
__________________________________ |
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Feb 22 05:11PM -0800 On 2/22/2019 12:03 PM, Geoff wrote: > Fin! > There seems to be quite a performance hit on x64 code vs x86. > Both tests were run from Windows Power Shell. Yeah. I get similar results on my end as well, and still not sure why. Humm... |
Geoff <geoff@invalid.invalid>: Feb 22 12:03PM -0800 On Tue, 19 Feb 2019 22:49:13 -0800, "Chris M. Thomasson" >be performed in this test at 60 seconds. The number of threads is >variable and is determined by std::hardware_concurrency * THREADS, set >at 8 in the test.

On Win32 binary release build from VS2017 Community:

Testing Version 0.1: std::shared_mutex
___________________________________
cpu_threads_n = 6
threads_n = 48
writers = 24
readers = 24
test duration = 60 seconds
___________________________________
___________________________________
Raw Reads: 16488
Raw Writes: 2672
reads_per_tick = 274
writes_per_tick = 44
Ticks = 60.1008
___________________________________
Fin!

Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex
___________________________________
cpu_threads_n = 6
threads_n = 48
writers = 24
readers = 24
test duration = 60 seconds
___________________________________
___________________________________
Raw Reads: 41335
Raw Writes: 2814
reads_per_tick = 688
writes_per_tick = 46
Ticks = 60.0609
___________________________________
Fin!

On x86_64 Release Build VS2017:

Testing Version 0.1: std::shared_mutex
___________________________________
cpu_threads_n = 6
threads_n = 48
writers = 24
readers = 24
test duration = 60 seconds
___________________________________
___________________________________
Raw Reads: 5983
Raw Writes: 3715
reads_per_tick = 99
writes_per_tick = 61
Ticks = 60.2067
___________________________________
Fin!

Testing Version 0.1: Chris M. Thomasson's Experimental Read/Write Mutex
___________________________________
cpu_threads_n = 6
threads_n = 48
writers = 24
readers = 24
test duration = 60 seconds
___________________________________
___________________________________
Raw Reads: 18364
Raw Writes: 2123
reads_per_tick = 305
writes_per_tick = 35
Ticks = 60.1187
___________________________________
Fin!

There seems to be quite a performance hit on x64 code vs x86. Both tests were run from Windows PowerShell. |
Bonita Montero <Bonita.Montero@gmail.com>: Feb 23 07:27AM +0100 > Ahhhh. Try comparing CMPXCHG vs XADD. To do this, we can "emulate" XADD > with CMPXCHG. The loopless XADD performs much better than the CMPXCHG > emulation. You didn't understand what I said: with my CAS-benchmark, with as many threads doing tight CAS-loops as there are cores, I had an average of about 2.5 successful CASes until a thread relinquishes the cacheline holding the flags. That's because when the cacheline has been fetched from another core it remains for a while in the fetching core, because the MO(E)SI-signalling between the cores is slow. This means that with the CAS-loop in my mutex the CAS should succeed on the first attempt in almost every case. So there should be no need for XADD here. And even more, XADD could only be used for getting ownership as a reader and not as a writer, because in the latter case there is not only an add but I also set the sign-flag. And with XADD I couldn't do the overflow-assert when NDEBUG is not defined. >> be very likely that we have kernel-waits and -signalling. > When your CAS loop spins due to contention on your counter, it is in > user space. Boy, you're so slow on the uptake! When there's contention on this flag it's very likely that there will be kernel-waits and signalling as well; and if this happens the performance of the atomic operations doesn't count. > However, there is no bound on how many times your CAS loop can spin > under high contention scenarios. ... As I already said, and according to what I measured with my benchmark, the CMPXCHG should succeed almost every time on the first attempt. |
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Feb 23 12:31PM -0800 On 2/22/2019 10:27 PM, Bonita Montero wrote: >> Ahhhh. Try comparing CMPXCHG vs XADD. To do this, we can "emulate" >> XADD with CMPXCHG. The loopless XADD performs much better than the >> CMPXCHG emulation. [...] > As I already said, and according to what I measured with my benchmark, > the CMPXCHG should succeed almost every time on the first attempt. Should? Humm... Think of the following simple scenario:

Writer A reads your count.
Reader B gains read access.
Writer A spins on CAS because it failed!
Reader B releases read access.
Writer A spins on CAS because it failed!
Reader C gains read access.
Writer A spins on CAS because it failed!
Reader C releases read access.
Writer A spins on CAS because it failed!
[...]

Sustained read access can starve your CAS loop before any kernel waits even occur, and make it loop many, many times in userspace without doing any real work at all. This scenario can never happen in my algorithm because it has no loops. What is so hard to understand? This can happen in your work, not in mine. You cannot prove that this scenario will never happen in your algorithm. Period. Notice how the writer is starting to livelock? I don't think you understand how a CAS can livelock in these scenarios. This is why I avoided CAS loops when I designed my algorithm. You cannot give me a fixed bound on your CAS loops, no matter how hard you try. These are general purpose read/write locks, so you cannot predict how they are going to be used. You need to understand that a CAS loop can break down. This is nothing new. https://groups.google.com/d/msg/lock-free/zzZX4fvtG04/ebVWl0BCBAAJ CAS loops degrade under high contention. End of story. Btw, do you want me to tell you about compare_exchange_weak? Also, you really should test the difference between XADD and its CMPXCHG emulation. That is the key. |
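For illustration, the CMPXCHG emulation of XADD being discussed can be sketched like this (the helper name is invented). fetch_add compiles to a single wait-free XADD on x86, while the equivalent CAS loop can retry an unbounded number of times under contention, which is exactly the failure mode described above:

```cpp
#include <atomic>

// Sketch: emulating fetch_add (XADD) with a compare-exchange (CMPXCHG)
// loop. Functionally identical, but the retry loop is unbounded: every
// concurrent modification of 'a' forces another iteration.
long cas_fetch_add(std::atomic<long>& a, long addend) {
    long expected = a.load(std::memory_order_relaxed);
    // compare_exchange_weak may fail spuriously or because another
    // thread changed 'a'; on failure 'expected' is reloaded and we retry.
    while (!a.compare_exchange_weak(expected, expected + addend,
                                    std::memory_order_acq_rel,
                                    std::memory_order_relaxed)) {
        // unbounded retry: this is where a writer can be starved
    }
    return expected;  // previous value, matching fetch_add's contract
}
```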
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Feb 23 12:43PM -0800 On 2/22/2019 10:27 PM, Bonita Montero wrote: [...] Also, Bonita: Can you please properly quote context? Why you do seem to always snip all authorship? |
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Feb 22 05:22PM -0800 On 2/22/2019 2:05 AM, Rosario19 wrote: > type is it 32 bit? is it 64? than even if it is fixed to 64 bit for > example: is the operation + defined from standard to all value in the > range over is defined? Take a look at the fetch_add function: https://en.cppreference.com/w/cpp/atomic/atomic/fetch_add It does not care about wrapping. It just blindly adds the addend. Note that unlike plain signed arithmetic, fetch_add on a signed atomic is specified to behave as if computed on the corresponding unsigned type, so adding 1 to LONG_MAX wraps to LONG_MIN rather than being undefined. |
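A small sketch of that wrapping behaviour (the helper name is made up): the standard defines atomic integer arithmetic as if performed on the corresponding unsigned type, so there are no undefined results even at the type's limits.

```cpp
#include <atomic>
#include <climits>

// std::atomic's fetch_add is specified to compute as if on the
// corresponding unsigned type, so signed overflow wraps (two's
// complement) instead of being undefined, unlike plain 'long' math.
long wrap_add_one(long start) {
    std::atomic<long> a{start};
    a.fetch_add(1);   // defined even at LONG_MAX: wraps to LONG_MIN
    return a.load();
}
```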
David Brown <david.brown@hesbynett.no>: Feb 23 04:18PM +0100 On 22/02/2019 16:55, Öö Tiib wrote: > the software has been tested and used so far. So how to minimize > damage when only thing that we know is that every software contains > defects and damage from wrong output can be large? I have no good answer to that - I don't think there is a complete good answer to be found. We have to improve in lots of areas. Better testing is one of them - a great deal of software is poorly tested. Splitting code into testable units, and attempting to verify that these units are entirely correct, is another. Contracts (which are found in some languages, and are coming to C++ Real Soon Now™) can help too. Of course, the biggest step is to persuade developers - and the people managing them - to take testing and correctness seriously. >> until all bugs were eliminated. > You seem to forget that complex software with several potential undefined > behaviors and other defects in it does in practice misbehave very rarely. Actually no, I am not forgetting that - it is /exactly/ my point. Lots of software has bugs, or at least potential bugs, and yet manages to do a useful job despite that. Turning these potential bugs into crashes and error messages simply guarantees that it will /not/ do a useful job (though it may mean the bugs get found and fixed faster). Many errors, such as signed integer overflow, pointer mistakes, or buffer overruns, are clearly undefined behaviour. But they won't necessarily cause adverse effects in the program. If a buffer overrun means you accidentally clear beyond the end of an array, but the space you clear is not used (or is written before it is next read), no harm is done. If your integer arithmetic overflows but the result is not used, then it could be harmless. These sorts of things are /potential/ problems - a different compiler, different options, unrelated changes in the source code could all turn them into /real/ problems. 
But they won't be spotted using testing. And options like -ftrapv can make them /real/ problems. > Technically it was relatively buggy, unstable and under-documented GUI > toolkit for MS DOS. Developers did hate it but end users liked it and so > I used it. I never had the "pleasure" of Windows 2.0. Between MSDOS with GEM and Windows 3.0 I was at university, using SunOS and Solaris. > Sure, how should program decide that the defect manifests only as > incoherence in warning light data? There are several better strategies > to achieve stability of critical systems than to ignore sanity checks. I did not say you should ignore them - I said you should not necessarily stop with an error message. I think we are saying the same thing in different ways. It appears that I am saying "Don't always halt on all errors - consider carefully how to deal with them." And you are saying "Don't always ignore errors - consider carefully how to deal with them". >> If your data comes from a source where it could be bad, then you check it. > Exactly. Large amount of defects are that some corner case check wasn't > written by programmer and so it is not handled and no one did notice. Make your checks positive, not negative - check that the data fits the patterns you want, rather than checking for patterns that you know are bad. |
Reinhardt Behm <rbehm@hushmail.com>: Feb 23 11:53PM +0800 AT Saturday 23 February 2019 23:18, David Brown wrote: > Make your checks positive, not negative - check that the data fits the > patterns you want, rather than checking for patterns that you know are > bad. +1 Simply because you (hopefully) know what is correct but you never know all incorrect patterns. -- Reinhardt |
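The "positive checks" advice above can be sketched as follows (the function, pattern, and limits are invented for illustration): accept a value only if it matches the known-good pattern, rather than trying to enumerate every bad one.

```cpp
#include <string>
#include <cctype>

// Illustrative "positive" validation: an ID is valid only if it is
// non-empty, at most 8 characters, and consists solely of digits.
// We never enumerate bad inputs; anything outside the good pattern
// is rejected by default.
bool is_valid_id(const std::string& s) {
    if (s.empty() || s.size() > 8) return false;   // known-good length
    for (unsigned char c : s) {
        if (!std::isdigit(c)) return false;        // known-good alphabet
    }
    return true;
}
```

The default outcome is rejection; only inputs that positively match the specification pass, which is exactly why you don't need to know every incorrect pattern in advance.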
Bonita Montero <Bonita.Montero@gmail.com>: Feb 23 05:52PM +0100 > There are no modern systems that don't use 2's complement arithmetic - > that half is correct. But modern compilers can, and do, assume that > your signed arithmetic never overflows. What is this assumption good for? This looks rather like a nerd-decision of a compiler-writer to me. I mean, integer arithmetic is not like fp-arithmetic on any platform; the ISAs on which the compilers are based always wrap on overflow. |
David Brown <david.brown@hesbynett.no>: Feb 23 06:04PM +0100 On 23/02/2019 17:52, Bonita Montero wrote: > -decision of a compiler-writer for me. I mean integer-arithmetic is > not like fp-arithmetic on any platform; the ISAs on which the com- > pilers are based on always overflow. They make that assumption for optimisation purposes. C++ is dependent on optimising compilers - without optimisation, a lot of C++ code would be very big and slow. Inlining and template expansion, followed by strength reduction, constant propagation, and various optimisation passes leads to all sorts of opportunities for making more efficient object code. But a lot of that depends on the assumption that undefined behaviour does not occur. And in the case of signed integer arithmetic, assuming that there is no overflow allows all sorts of simplifications. So the compiler knows that "x + 1 > x" is true, and "2 * x / 2" is "x", and so on. |
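A tiny sketch of the simplifications described above (function names invented). Under the no-signed-overflow assumption the compiler may fold the first function to `return true;` and the second to `return x;`; with `-fwrapv` (or unsigned types) it could not:

```cpp
// Both expressions are only foldable because the compiler assumes
// signed overflow never happens: for x == INT_MAX, x + 1 would be UB,
// so the compiler need not consider that case.
bool always_true(int x) { return x + 1 > x; }  // may fold to 'return true'
int  identity(int x)    { return 2 * x / 2; }  // may fold to 'return x'
```

Note that calling these with values near INT_MAX would itself be undefined behaviour; the folds are valid precisely because the compiler may assume that never occurs.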
Bonita Montero <Bonita.Montero@gmail.com>: Feb 23 06:17PM +0100 > They make that assumption for optimisation purposes. Cool, if you had not said so then I would not have come up with it! > C++ is dependent on optimising compilers - without optimisation, > a lot of C++ code would be very big and slow. But in this case I see no optimization that would be worth this interpretation. > So the compiler knows that "x + 1 > x" ... > is true, and "2 * x / 2" is "x", and so on. I think it's better for a compiler not to assume this. All the more so because no programmer is so stupid as to write code that makes such optimizations necessary. That is, there are few advantages to such optimizations, but the disadvantage is that the compiler does many things that you would not intuitively expect from it. For example, it can be very handy if you have a variable with a set bit that you shift from the lowest position to the highest and beyond until it is empty. In such and many other cases, such a compiler would get in the way. |
Christian Gollwitzer <auriocus@gmx.de>: Feb 23 09:18PM +0100 On 23.02.19 at 18:17, Bonita Montero wrote: > I think it's better for a compiler not to assume this. > All the more so because no programmer is so stupid to write such > code that makes such optimizations necessary. You're getting it wrong. The user may not explicitly write code such as 2*x/2, but due to inlining and template expansion such an expression may come up. Think of index arithmetic with vectors and such things. For example:

std::vector<long> bla(30);
size_t sizeofvector = bla.size()*sizeof(long);

This second line could result in code that does the equivalent of

size_t bytesinvector = /* byte distance between end and begin pointers */;
sizeofvector = bytesinvector/sizeof(long)*sizeof(long);

Christian |
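A sketch of the arithmetic Christian describes (assuming the common implementation of `size()` as a pointer difference; the helper name is invented): `size()` divides a byte distance by `sizeof(long)`, so multiplying it back yields the `bytes / sizeof(long) * sizeof(long)` pattern that the optimizer can collapse to just the byte distance.

```cpp
#include <vector>
#include <cstddef>

// vector<long>::size() is typically (end - begin), i.e. a byte distance
// divided by sizeof(long). Multiplying back by sizeof(long) produces
// bytes / sizeof(long) * sizeof(long), which the compiler may reduce to
// a single subtraction of the two pointers.
std::size_t bytes_in(const std::vector<long>& v) {
    return v.size() * sizeof(long);
}
```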
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page. To unsubscribe from this group and stop receiving emails from it send an email to comp.lang.c+++unsubscribe@googlegroups.com. |