- Thread-safe initialization of static objects - 25 Updates
Michael S <already5chosen@yahoo.com>: Sep 21 02:59AM -0700

On Thursday, September 21, 2023 at 12:03:01 PM UTC+3, Bonita Montero wrote:
> > Actually, considering that contention is very rare in practice, ...
> The initialization might last an arbitrary time, so spinning
> is unacceptable.

Why is it unacceptable? The impact of needlessly waking the waiting thread once per tick is negligible in terms of system performance and not too horrible in terms of power consumption. The impact of a needless sleep until the next tick is also insignificant, under the assumption that the initialization took a long time. I'd call it non-ideal when better solutions are available, but perfectly acceptable otherwise.

> > So even use of Futex | Critical Section | SRW lock, although it make
> > a lot of sense in practice, is not necessary from theoretical P.O.V.
> Why should a SRW-lock make sense here ?

Because an SRW lock in exclusive mode is a better critical section than a critical section. On newer versions of Windows, critical sections exist purely for backward compatibility. That's what I was told unofficially by a knowledgeable Microsoft guy.

> If the object isn't initialized
> it can't be read meanwhile. And a futex isn't a replacement for a mutex
> but just a faster slow path for a mutex.

Are you sure? My impression was that the fast path (uncontended case) is also faster due to the absence of syscall overhead.
David Brown <david.brown@hesbynett.no>: Sep 21 01:19PM +0200

On 21/09/2023 11:59, Michael S wrote:
> horrible in terms power consumption. The impact of needless sleep till the
> next tick is also insignificant under assumption that initialization took long time.
> I'd call it non-ideal when better solutions available, but perfectly acceptable otherwise.

I don't know why Bonita dislikes spinning and sleeping, but I know why /I/ don't like it as a solution. Sleeping does not affect thread priority in the way mutexes do. That means if one low priority thread starts the initialisation and takes the lock, then gets pre-empted by a higher priority thread that tries to take the lock, your whole system can deadlock. (The high priority thread may be sleeping, but there may be a mid-priority thread that can run and blocks the low-priority thread from running.)

This is going to be a more likely failure situation if you have fewer (or only one) cores. But the same compilers and toolchains are used for a wide variety of systems - you can't pick a solution that might fail on some systems. And note that these kinds of failures never turn up in testing - they are rare, and only happen at the worst possible moment after deployment.

This is particularly a risk for real-time systems and embedded systems, where you are more likely to have fewer cores (perhaps only one), or maybe some of your cores are dedicated to real-time tasks and unavailable for other threads. You will also likely have high priority and real-time priority threads, which will block the low-priority threads. And on such systems there might be very serious costs or safety risks involved in system hangs - you want to be sure that they are not possible, to the best of practical abilities, rather than designing in a solution that has a definite non-zero chance of failure.
Using mutexes avoids this problem, because if a high priority thread is waiting on a mutex held by a low priority thread, the holder's priority gets boosted to that of the high priority thread until it releases the lock. Spinning should only be used when you have full control of the priorities of threads that may take the spin lock, and can be sure it is safe.
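
The boost described above is what POSIX calls priority inheritance, and for pthread mutexes it is a protocol that has to be requested rather than the default. A minimal sketch of requesting it, assuming a POSIX system that offers PTHREAD_PRIO_INHERIT; both function names are made up for illustration:

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>

// Hypothetical helper: initialise *mutex with the priority-inheritance
// protocol. Returns 0 on success, otherwise a pthread error code.
int init_priority_inheriting_mutex(pthread_mutex_t* mutex) {
    assert(mutex != nullptr);
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc == 0) rc = pthread_mutex_init(mutex, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

// Hypothetical demo: create the mutex, then take and release it once.
int demo_priority_inheritance() {
    pthread_mutex_t m;
    int rc = init_priority_inheriting_mutex(&m);
    if (rc != 0) return rc;  // e.g. ENOTSUP if the protocol is unavailable
    pthread_mutex_lock(&m);
    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return 0;
}
```

Whether std::mutex or a compiler's static-initialization guard uses this protocol is up to the implementation, so the argument above holds only where such a mutex is actually in use.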
Richard Damon <Richard@Damon-Family.org>: Sep 21 07:36AM -0400

On 9/20/23 11:42 PM, Bonita Montero wrote:
>> other threads get a chance to run, and then check the initialization
>> status again.
> A yield would be too inefficient.

Why? You can't do anything until the object you are waiting for has been initialized, so your "efficiency" isn't important.

Remember, the ONLY reason you need to wait is that you came to an object that is actively in the process of being initialized, which is a fairly rare condition. If it hasn't been initialized yet, you don't wait, you just claim the initialization and go; if the object's initialization is complete, you just use it.

Also, Yield can't be less efficient than blocking on a Mutex, as the implementation of the mutex blocking needs to yield too.

As has been mentioned elsewhere, you typically wait a little bit in a spin wait to see whether it will be quick or slow, so "Yield" is only done on the "Slow" path, and it must be OK to be slow on the slow path, as it is the slow path.
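
The claim-or-wait logic described above can be sketched roughly as follows. This is an illustration only, not what any particular compiler runtime does: the class name, the three states and the spin limit are all assumptions, and the sketch ignores exceptions thrown by the initializer and recursive re-entry.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical three-state guard; all names here are illustrative.
enum class InitState : int { NotStarted, InProgress, Done };

class SpinThenYieldOnce {
public:
    template <class Init>
    void call(Init&& init) {
        // Fast path: initialization already complete, just use the object.
        if (state_.load(std::memory_order_acquire) == InitState::Done) return;

        // Not started yet: try to claim the initialization.
        InitState expected = InitState::NotStarted;
        if (state_.compare_exchange_strong(expected, InitState::InProgress,
                                           std::memory_order_acquire)) {
            init();  // assumed not to throw and not to re-enter call()
            state_.store(InitState::Done, std::memory_order_release);
            return;
        }

        // Slow path: someone else is initializing. Spin briefly, then yield.
        int spins = 0;
        while (state_.load(std::memory_order_acquire) != InitState::Done) {
            if (spins < kSpinLimit) {
                ++spins;
            } else {
                std::this_thread::yield();
            }
        }
    }

private:
    static constexpr int kSpinLimit = 100;  // arbitrary, untuned
    std::atomic<InitState> state_{InitState::NotStarted};
};

// Demo: several threads race to initialize; returns how often init ran.
int count_initializations(int thread_count) {
    assert(thread_count > 0);
    SpinThenYieldOnce once;
    int runs = 0;  // written only inside init, read after all joins
    std::vector<std::thread> threads;
    for (int i = 0; i < thread_count; ++i)
        threads.emplace_back([&] { once.call([&] { ++runs; }); });
    for (auto& t : threads) t.join();
    return runs;
}
```

Only threads that arrive while the state is InProgress ever reach the loop, which is the rare case the post refers to.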
Richard Damon <Richard@Damon-Family.org>: Sep 21 07:42AM -0400

On 9/21/23 4:03 AM, Bonita Montero wrote:
> return c; } ) == end( mem );
> cout << (zero ? "all zero" : "has non-zero") << endl;
> }

As it must by the standard.

Note, depending on the OS and implementation, that zeroing will either be done by having the loaded program image just contain a great big block of zeros, or (to save image size at the cost of CPU time) the system might just put "static" objects that need to be zero initialized into one common segment and zero-fill as an extension of the loading process, or in some cases the OS can just promise that new segments for the program are just always zero-filled.
Michael S <already5chosen@yahoo.com>: Sep 21 05:01AM -0700

On Thursday, September 21, 2023 at 2:19:59 PM UTC+3, David Brown wrote:
> Spinning should only be used when you have full control of the
> priorities of threads that may take the spin lock, and can be sure it is
> safe.

The problem you mention does not apply to general-purpose systems like Windows, Linux, BSD-lineage Unixes, Solaris, etc. They all have built-in avoidance of deadlocks caused by priority inversion. Most commonly it's done by applying random priority boosts.

Also, I don't see how mutex semantics can possibly help. The common problem scenario is that the low-priority thread doing the initialization is preempted when it *does not* hold the mutex. The mutex is held for two very short periods while the flags are updated. The preemption is far more likely to happen during the middle phase, the constructor itself.

On a real-time system with 1 or 2 processors you probably want a completely different design in which a mutex is held for the whole duration of the initialization, but you can't expect such a solution to be included in the g++/clang++/MSVC compiler support libraries.

Which again brings us to the point made by myself and a few others: don't do it! Don't use function-local static objects with constructors. Or, at least, don't use them in multitasking scenarios. The compiler's one-time initialization infrastructure can be super-robust, but being generic it's unlikely to be the optimal answer in any concrete scenario.
Bonita Montero <Bonita.Montero@gmail.com>: Sep 21 02:34PM +0200

On 21.09.2023 at 11:59, Michael S wrote:
> Why is it unacceptable? ...

Because the initializing thread might be scheduled away, thereby keeping other threads spinning.

> I'd call it non-ideal when better solutions available, but perfectly acceptable otherwise.

It's unacceptable.

> Because SRW lock in exclusive mode is better critical section than critical
> section. ...

No, if you only lock exclusively there's no difference.

> Are you sure?

The fast path of mutexes is fast anyway; there's no need for further speed-ups. The futex accelerates the slow path and is a replacement for the binary semaphore attached to a mutex.
Bonita Montero <Bonita.Montero@gmail.com>: Sep 21 02:37PM +0200

On 21.09.2023 at 13:19, David Brown wrote:
> I don't know why Bonita dislikes spinning and sleeping, ...

I only dislike spinning in userspace, except when there's a spin limit after which the code takes the slow path. The thread holding a spinlock can be scheduled away for an arbitrary time; that's unacceptable.
Bonita Montero <Bonita.Montero@gmail.com>: Sep 21 02:40PM +0200

On 21.09.2023 at 13:36, Richard Damon wrote:
> Why?

Because the initialization might only take a microsecond or less and a yield is a full timeslice, on Linux and Windows usually one millisecond.
David Brown <david.brown@hesbynett.no>: Sep 21 03:03PM +0200

On 21/09/2023 14:01, Michael S wrote:
> like Windows, Linux, BSD-lineage Unixes, Solaris etc. They all have
> built-in avoidance of deadlocks caused by priority-inversion. Most
> commonly it's done by applying random priority boosts.

As far as I know, such random priority boosts will not boost a non-realtime priority thread to real-time priority on Linux (and presumably not on other systems). That would negate the whole concept of realtime priority threads.

> completely different design in which a mutex is held during all duration
> of initialization, but you can't expect such solution to be included
> in g++/clang++/MSVC compiler support libraries.

Yes, you would need the mutex to be held during critical stages of the initialisation.

I can understand that this would be expensive compared to spinlocks that would work fine for "normal" systems (especially with the random boosts you described). And I understand that toolchains have to be optimised for normal situations, not things that are extremely rare even in particular niche situations.

But getting threading, locking and synchronisation details right is very difficult - most programmers don't get it right. And you can't tell that you have a problem by testing the code, as failures typically require extreme bad luck in timing. So it always worries me when I see something that is hidden, which most programmers will assume "just works, by some compiler magic", but which can be a problem in some circumstances.

(As another example, gcc's "libatomic" uses spinlocks - if you use std::atomic types on a microcontroller in situations where atomics would be useful, they will hang your system any time there is a coincidental access.)

> Or, at least don't use them in multitasking scenarios. Compiler's one
> time initialization infrastructure can be super-robust, but being generic
> it's unlikely to be optimal answer in any concrete scenario.
Yes, that is one option. You can also use "constinit" statics safely. And gcc has a "-fno-threadsafe-statics" option (also useable as a pragma, but unfortunately not as a neater __attribute__) that puts the responsibility of ensuring things are threadsafe onto the user.
David Brown <david.brown@hesbynett.no>: Sep 21 03:05PM +0200

On 21/09/2023 14:40, Bonita Montero wrote:
> Because the initialization might only take a microsecond or less
> and a yield is a full timeslice, on Linux and Windows usually one
> millisecond.

"Yield" won't take a timeslice unless there is something else of the same priority, waiting to run on the same core. And if that's the case, then fine - that thread is clearly equally important.
Bonita Montero <Bonita.Montero@gmail.com>: Sep 21 04:17PM +0200

On 21.09.2023 at 15:05, David Brown wrote:
> "Yield" won't take a timeslice unless there is something else of the
> same priority, waiting to run on the same core. ...

OK, you're right, but the code would still be unacceptable under load.
Michael S <already5chosen@yahoo.com>: Sep 21 07:30AM -0700

On Thursday, September 21, 2023 at 4:03:46 PM UTC+3, David Brown wrote:
> non-realtime priority thread to real-time priority on Linux (and
> presumably not on other systems). That would negate the whole concept
> of realtime priority threads.

If you have enough ready real-time threads to occupy all CPUs for an unacceptably long time then, indeed, you have a problem. But to me it looks unlikely that deadlock due to priority inversion is your main problem in such a scenario.

> I can understand that this would be expensive compared to spinlocks that
> would work fine for "normal" systems (especially with the random boosts
> you described).

What is advocated here by most posters, and what seems to be implemented by major toolchains, is not a choice between spinlocks and one or a pair of mutexes or equivalents. Spinlocks are merely something we mention to argue against Bonita's suggestion that the C++ Standard in its current form may be impossible to implement.

Bonita aside, the real choice is between 4 options:

(A) Mutex per object. Held for the whole duration of the constructor.
(B) Serializing all initialization of function-local static objects with one global mutex. Again, the mutex is held for the whole duration of the constructor.
(C) A few flags and possibly a little more auxiliary state (like the thread id in my proposed solution) per object. Access to the flags is guarded by a global mutex, but the initialization itself is not guarded. There are several possible solutions for waking up contending threads that are waiting for the initialization to finish. Among them, spinlocks, either classical or amended by a sleep() call, are possible, but least likely on full-featured OSes. The most natural solution [on a full-featured OS] is to wait on a per-object condition variable with the global mutex.
(D) Creative combinations of A and B, e.g. a small pool of mutexes distributed dynamically between objects.
Like A and B, the mutex is held for the whole duration of the constructor.

All major toolchains appear to use variants of (C).

> > time initialization infrastructure can be super-robust, but being generic
> > it's unlikely to be optimal answer in any concrete scenario.
> Yes, that is one option. You can also use "constinit" statics safely.

I am not aware of constinit. Is it something new, like C++14 or later?
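
Option (C) from the list above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code of any real toolchain: the type and function names are invented, every call takes the global mutex (a real runtime would presumably check a flag without locking first), and exceptions and recursive re-entry are ignored.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical per-object guard state for option (C); names are illustrative.
enum class GuardState : unsigned char { NotStarted, Running, Done };

struct ObjectGuard {
    GuardState state = GuardState::NotStarted;
    std::condition_variable done_cv;  // per-object wake-up channel
};

std::mutex g_guard_mutex;  // one global mutex protecting every guard's flags

template <class Init>
void initialize_once(ObjectGuard& guard, Init&& init) {
    std::unique_lock<std::mutex> lock(g_guard_mutex);
    if (guard.state == GuardState::Done) return;

    if (guard.state == GuardState::NotStarted) {
        guard.state = GuardState::Running;  // first short critical section
        lock.unlock();
        init();                             // constructor runs unguarded
        lock.lock();
        guard.state = GuardState::Done;     // second short critical section
        lock.unlock();
        guard.done_cv.notify_all();
        return;
    }

    // Another thread is running the constructor: sleep until it finishes.
    guard.done_cv.wait(lock, [&] { return guard.state == GuardState::Done; });
}

// Demo: several threads race on one guard; returns how often init ran.
int count_constructor_runs(int thread_count) {
    assert(thread_count > 0);
    ObjectGuard guard;
    int runs = 0;  // written only inside init, read after all joins
    std::vector<std::thread> threads;
    for (int i = 0; i < thread_count; ++i)
        threads.emplace_back(
            [&] { initialize_once(guard, [&] { ++runs; }); });
    for (auto& t : threads) t.join();
    return runs;
}
```

The global mutex is held only while the state flag is read or written, so a slow constructor for one object does not block the initialization of another.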
Michael S <already5chosen@yahoo.com>: Sep 21 07:48AM -0700

On Thursday, September 21, 2023 at 3:40:37 PM UTC+3, Bonita Montero wrote:
> Because the initialization might only take a microsecond or less
> and a yield is a full timeslice, on Linux and Windows usually one
> millisecond.

If the initialization is short then contention is very unlikely. Wasting a full timeslice once in a blue moon is OK. If the initialization is long then contention is likely; however, wasting a full timeslice after a long initialization is OK. Not an excellent solution, but acceptable.
scott@slp53.sl.home (Scott Lurndal): Sep 21 02:54PM

>>>> They are only set to zero if they are not initialized explicitly.
>>> I'm not talking about what the standard says but how it's implemented
>>> with all operating systems that support virtual memory.

Nonsense. As Pavel pointed out in the part of this thread you snipped, both pthread_once and pthread_mutex can be statically initialized (and the value may or may not be zero).

>> This shows "all zero" on all modern operating systems:

You're changing the topic. BSS has been guaranteed to be initialized to zero since long before Windows existed.

>into one common segment and zero-fill as an extension of the loading
>process, or in some cases the OS can just promise that new segments for
>the program are just always zero-filled.

If the bss section is page-aligned, demand paging will ensure that the contents are zeroed before being accessed.
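
For reference, static initialization of the two pthread objects mentioned above looks roughly like this. The initializer macros are standard POSIX; the variable and function names are invented for the sketch, and the values the macros expand to are chosen by the implementation.

```cpp
#include <cassert>
#include <pthread.h>

// Both objects are initialized statically by constant initializers,
// so no code has to run before their first use.
static pthread_once_t g_once = PTHREAD_ONCE_INIT;
static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static int g_setup_calls = 0;

// Hypothetical one-time setup routine, run at most once via pthread_once.
extern "C" void sketch_setup(void) { ++g_setup_calls; }

// Returns how many times the setup routine has run so far.
int use_shared_resource() {
    int rc = pthread_once(&g_once, sketch_setup);
    assert(rc == 0);
    (void)rc;  // silence the unused-variable warning when NDEBUG is set
    pthread_mutex_lock(&g_lock);
    int calls = g_setup_calls;
    pthread_mutex_unlock(&g_lock);
    return calls;
}
```

However often use_shared_resource() is called, the setup routine runs a single time.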
scott@slp53.sl.home (Scott Lurndal): Sep 21 02:56PM

>> Actually, considering that contention is very rare in practice, ...
>The initialization might last an arbitrary time, so spinning
>is unacceptable.

It's rare in practice because good programmers very seldom[*] use non-global static variables in threaded applications.

[*] Effectively never.
Bonita Montero <Bonita.Montero@gmail.com>: Sep 21 05:03PM +0200

On 21.09.2023 at 16:48, Michael S wrote:
> If initialization is short then contention is very unlikely.

Maybe, but no one would choose this solution if mutexes are available. Yielding in this situation is crap.
Bonita Montero <Bonita.Montero@gmail.com>: Sep 21 05:04PM +0200

On 21.09.2023 at 16:56, Scott Lurndal wrote:
> It's rare in practice because good programmers very seldom[*]
> use non-global static variables in threaded applications.

It's just your taste that this should be avoided.
David Brown <david.brown@hesbynett.no>: Sep 21 08:10PM +0200

On 21/09/2023 16:30, Michael S wrote:
> unacceptably long time then, indeed, you have a problem.
> But to me it looks unlikely that deadlock due to priority inversion is your
> main problem in such scenario.

I always want to think through possible issues when it comes to multi-threading - no matter how unlikely. It is certainly not uncommon to have more ready real-time threads than cpus in real-time systems - though perhaps not on real-time Linux systems. In microcontroller systems, your OS (if you have one) is usually an RTOS, and most of the threads are real-time (of different priorities). And you usually only have one, occasionally two, cpus.

But I think for that kind of system, your recommended solution - don't have function-local statics that need such initialisation synchronisation - is far and away the most common solution. (It's certainly the one I use myself.) Function-local statics with dynamic initialisation have quite a bit of overhead, both for initialisation and in use.

> dynamically between objects. Like A and B, mutex held throughout all duration
> of the constructor.
> All major toolchains appear to use variants of (C)

I must have a look at the language support library provided with the embedded toolchain (32-bit arm-none-eabi-gcc) I use. (D) seems the natural choice there, with less risk of conflict than (B) while avoiding the space costs of (A). But my guess is that it is actually done like (D) but with simple spinlocks, since the compiler support library is independent of the RTOS and therefore does not have access to mutexes.

Looking at generated code (using godbolt.org) for Linux, it seems there is just a single byte boolean guard per object - there is no thread id or anything else stored per object. I don't know what is going on inside "__cxa_guard_acquire" and other such functions, however.

>>> it's unlikely to be optimal answer in any concrete scenario.
>> Yes, that is one option.
>> You can also use "constinit" statics safely.
> I am not aware of constinit. Is it something new, like C++14 or later?

C++20 : <https://en.cppreference.com/w/cpp/language/constinit>

If a variable is declared with the "constinit" specifier, it must have static initialisation (so in practice it is initialised like a C static variable, though it can be done using something like a constexpr function call). Such statics will all be initialised before main() starts, so there are no synchronisation issues.
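
A small sketch of a function-local constinit static, assuming a C++20 compiler. The macro fallback, the function names and the starting value are inventions of this example; the fallback exists only so that the snippet also builds in pre-C++20 modes, where the compile-time check is simply absent.

```cpp
#include <atomic>
#include <cassert>

// constinit is C++20; on older compilers this sketch drops the keyword
// (and with it the compile-time check), purely so that it still builds.
#if defined(__cpp_constinit)
#define SKETCH_CONSTINIT constinit
#else
#define SKETCH_CONSTINIT
#endif

// Hypothetical constexpr helper that supplies the starting value.
constexpr int first_id(int base) { return base * base; }

// Hypothetical ID generator with a function-local static.
int next_id() {
    // Constant-initialized: the value 16 is fixed at compile time, so the
    // compiler has no need for a guard variable on this static.
    SKETCH_CONSTINIT static std::atomic<int> counter{first_id(4)};
    return counter.fetch_add(1, std::memory_order_relaxed);
}

// Demo: two consecutive calls on one thread yield consecutive IDs.
void demo_ids() {
    int a = next_id();
    int b = next_id();
    assert(b == a + 1);
    (void)a;  // silence unused-variable warnings when NDEBUG is set
    (void)b;
}
```

constinit only constrains the initialization; later accesses still need their own synchronization, which is why the counter here is a std::atomic.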
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Sep 21 11:13AM -0700 On 9/21/2023 1:04 AM, Michael S wrote: > It's only disadvantage is higher memory footage - 6 bytes per guarded object instead of 1. > Can be reduced to 5 bytes, but in practice compiler will allocate 8 bytes anyway, so let it > be 6. Here is Relacy, a pretty nice way to model sync algorithms: https://www.1024cores.net/home/relacy-race-detector |
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Sep 21 11:19AM -0700 On 9/21/2023 12:47 AM, Bonita Montero wrote: >> Have you heard about an adaptive backoff, different than a >> "traditional" adaptive mutex? ... > We discussed yield with periodic polling. An adaptive backoff tries to automatically set its spin limit based on some statistics it gathers dynamically during runtime, aka before it waits in the kernel. One way to do it, a very interesting way imvho, is instead of spinning and yielding, it tries to do some other unrelated work instead. Pretty nice, actually. Something akin to: while (! try_lock_or_whatever()) { if (! try_to_do_some_other_unrelated_work()) { // wait in the kernel, really slow path... } } { // We are locked! :^D } unlock(); If your program can handle something like this, it actually works pretty good... |
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Sep 21 11:21AM -0700 On 9/21/2023 5:40 AM, Bonita Montero wrote: > Because the initialization might only take a microsecond or less > and a yield is a full timeslice, on Linux and Windows usually one > millisecond. An adaptive backoff can be a mixture of the PAUSE instruction aka x86, and yield wrt the OS. |
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Sep 21 11:25AM -0700 On 9/21/2023 7:17 AM, Bonita Montero wrote: >> "Yield" won't take a timeslice unless there is something else of the >> same priority, waiting to run on the same core. ... > Ok, you're right, but the code would be still unacceptable under load. Wait, are you talking about a _pure_ spin lock in user space here? For some reason you seem to be struggling with adaptive backoffs, that can choose to wait in the kernel for a "really slow path", so to speak. Keep in mind that spinning is already a slow path by default. Now, there is a way to organize a program where a yield can be replaced by a "try_to_do_something_else()" function. Its been a while sense I implemented one. Man, how time flies! |
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Sep 21 11:27AM -0700 On 9/21/2023 8:03 AM, Bonita Montero wrote: >> If initialization is short then contention is very unlikely. > Maybe, but no one would cose this solutions if mutexes are available. > Yielding in this situation is crap. You tell em Bonita. You should start protesting about this in front of Microsoft head quarters in Redmond. |
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Sep 21 11:29AM -0700 On 9/19/2023 1:50 AM, Bonita Montero wrote: >> that user logic that waits for infinity while its trying to initialize >> itself should be handled in the std? Humm... Not sure about you. > You often seem confused. Really? Actually, my comments here are trying to help you. But, you like to flush them down the toilet and call me an idiot. Well, shit happens. |
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Sep 21 11:34AM -0700 On 9/21/2023 12:51 AM, Bonita Montero wrote: > A recursive lock wouldn't make sense here since the object might have > beein partitially created when it re-enters the mutex which guards the > initialization. Huh? I must be misunderstanding you here. A Recursive mutex means it allows the same thread to acquire it more than once. It can be detected rather easily. |