- An argument *against* (the liberal use of) references - 10 Updates
- Why is that usually faster than a normal string_view == comparison? - 2 Updates
- vector 🤔 - 10 Updates
- Pause event handler, go enter another one, then come back - 1 Update
Juha Nieminen <nospam@thanks.invalid>: Dec 01 06:45AM

> refcounter got incremented (and later decremented). Alas, this was
> accidentally done from parallel threads at the same time, without any
> synchronization, so eventually the refcounter got messed up.

I think that if the reference count is declared atomic, it can safely be incremented directly. When decrementing, you would need to use the fetch_sub() function to see whether the object needs to be destroyed. While modifying an atomic may not be quite as fast as modifying a non-atomic, it shouldn't be all that much slower either, at least if the target architecture supports atomic operations.
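A minimal sketch of the pattern Juha describes (the class and member names are hypothetical): the increment can be a plain relaxed fetch_add, while the decrement must inspect the old value returned by fetch_sub so that exactly one thread performs the destruction.

    #include <atomic>

    struct RefCounted {
        std::atomic<int> refcount{1};

        void addRef() noexcept {
            // A plain increment is safe; no ordering against other data needed.
            refcount.fetch_add(1, std::memory_order_relaxed);
        }

        void release() noexcept {
            // fetch_sub() returns the previous value, so exactly one thread
            // observes the transition 1 -> 0 and deletes the object.
            if (refcount.fetch_sub(1, std::memory_order_acq_rel) == 1)
                delete this;
        }

    protected:
        virtual ~RefCounted() = default;
    };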
Paavo Helde <eesnimi@osa.pri.ee>: Dec 01 09:19AM +0200

01.12.2022 08:45 Juha Nieminen wrote:
> While modifying an atomic may not be quite as fast as modifying a
> non-atomic, it shouldn't be all that much slower either, at least if
> the target architecture supports atomic operations.

I have pondered this myself. Maybe I should measure the actual slowdown after temporarily making the refcounters atomic. But this seems like overkill, because these smart pointers would still point to single-threaded objects which are meant to be used primarily in a single-threaded regime, so in most cases making the smart pointers atomic does not buy anything.

When tracking down this bug, I monitored all refcounter changes for one particular smart pointer during a program run (ca 10 min). There were 591848 increments and decrements, of which 1526 came from the problematic (parallelized) part. It looks like a pessimization to slow down 99.74% of accesses when only 0.26% would actually benefit from this.
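A throwaway way to get a first number for that measurement (a crude sketch of mine, not Paavo's code; note that a serious benchmark must also stop the optimizer from folding the plain loop into a constant):

    #include <atomic>
    #include <chrono>
    #include <cstdio>

    // Crude single-threaded comparison of plain vs atomic increments.
    template<class Counter>
    static long long time_increments(Counter& c, int n) {
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < n; ++i)
            ++c;    // ++ on std::atomic<int> is a sequentially consistent RMW
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    }

    int main() {
        const int n = 100'000'000;
        int plain = 0;
        std::atomic<int> atomic{0};
        std::printf("plain:  %lld ns\n", time_increments(plain, n));
        std::printf("atomic: %lld ns\n", time_increments(atomic, n));
        std::printf("plain=%d atomic=%d\n", plain, atomic.load());
    }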
Stuart Redmann <DerTopper@web.de>: Dec 01 02:06PM +0100

> were 591848 increments and decrements, of which 1526 came from the
> problematic (parallelized) part. It looks like a pessimization to slow
> down 99.74% of accesses when only 0.26% would actually benefit from this.

600k changes in reference counts look suspicious to me. When you pass a ref-counted object to a worker thread, there should be only a single change in the refcount, namely when some object inside the worker thread takes (shared) ownership of the shared object. If the shared object needs to be passed to subroutines, you should pass it as a reference or a plain pointer (if the subroutine must be able to cope with non-existing objects). It should be a rare occurrence that another object in the worker thread needs to take ownership of the shared object.

Another thought: if thread-safety is too costly, you could use two smart pointer classes: thread-safe pointers, and forwarding non-thread-safe smart pointers. The forwarding smart pointers have their own thread-UNsafe refcount and hold the thread-safe smart pointer as a member. A sketch of this idea follows below.

Regards, Stuart
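A minimal sketch of Stuart's two-level scheme (all names are hypothetical, and std::shared_ptr stands in for the "thread-safe pointer"): copies within one thread bump only a local non-atomic count, so the atomic count is touched exactly once per control block.

    #include <atomic>
    #include <memory>

    // Thread-safe handle: this is what crosses thread boundaries.
    template<class T>
    using SafePtr = std::shared_ptr<T>;

    // Forwarding pointer: cheap to copy within a single thread.
    template<class T>
    class LocalPtr {
        struct Block {
            SafePtr<T> strong;   // one atomic increment, on creation
            int localCount = 1;  // thread-UNsafe refcount
        };
        Block* b_;
    public:
        explicit LocalPtr(SafePtr<T> p) : b_(new Block{std::move(p)}) {}
        LocalPtr(const LocalPtr& o) : b_(o.b_) { ++b_->localCount; }
        ~LocalPtr() {
            if (--b_->localCount == 0)
                delete b_;       // one atomic decrement, via ~SafePtr
        }
        LocalPtr& operator=(const LocalPtr&) = delete; // kept minimal
        T* operator->() const { return b_->strong.get(); }
        T& operator*()  const { return *b_->strong; }
    };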
Paavo Helde <eesnimi@osa.pri.ee>: Dec 01 04:45PM +0200

01.12.2022 15:06 Stuart Redmann wrote:
> a plain pointer (if the subroutine must be able to cope with non-existing
> objects). It should be a rare occurrence that another object in the worker
> thread needs to take ownership of the shared object.

You are right, that's how I fixed the bug (by using a reference). There are now 1526 fewer changes in refcounts ;-)

As for the other ~600k changes, these seem legitimate. This is a scripting language engine (think something like Python) with complex data structures built up via refcounted smart pointers. This particular object is apparently used as some default column in data tables; I think it was used for 2000 columns of some 4000-column table, and there were many tables like that. If you insert the same refcounted vector as a new column into a table 2000 times, via a member function taking a smart pointer parameter, you already get something like at least 6000 refcount changes.

> pointer classes: thread-safe pointers, and forwarding non-thread-safe smart
> pointers. The forwarding smart pointers have their own thread-UNsafe
> refcount and hold the thread-safe smart pointer as a member.

I tried to measure the impact of std::atomic<int> refcounters, and in first tests it seems the overhead on x86_64 is zero (with no contention). So it seems I could use them without drawbacks, but this would not save much because the pointed-to objects would still not be thread-safe. I guess it might work out if I could ensure that all objects are physically immutable after some initialization. Hmm, time for some thought.
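For illustration, this is the shape of that fix (Column and countRows are hypothetical stand-ins, with std::shared_ptr standing in for the engine's smart pointer):

    #include <cstddef>
    #include <memory>
    #include <vector>

    struct Column { std::vector<double> rows; };

    // Taking the smart pointer by value: each call performs one refcount
    // increment on entry and one decrement on exit, even though the
    // function never stores the pointer anywhere.
    std::size_t countRows(std::shared_ptr<Column> c) {
        return c->rows.size();
    }

    // Taking a reference to the pointee instead: no refcount traffic at
    // all; the caller's own smart pointer keeps the object alive for the
    // duration of the call.
    std::size_t countRows(const Column& c) {
        return c.rows.size();
    }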
scott@slp53.sl.home (Scott Lurndal): Dec 01 04:02PM

>01.12.2022 15:06 Stuart Redmann wrote:
<snip>
>I tried to measure the impact of std::atomic<int> refcounters, and in
>first tests it seems the overhead on x86_64 is zero (with no
>contention).

Which follows naturally from the fact that the core doing the atomic access has exclusive access to the cache line containing the refcounter. No overhead at all, unless the refcounter is misaligned and crosses a cache-line boundary (or the access is to an uncacheable memory range, or caching is disabled), in which case the processor will take a system-wide lock to perform the operation, which is catastrophic on systems with large processor counts. (Note that both Intel and AMD processors will fall back to the system-wide lock if a cache line is highly contended after some time period has elapsed, in order to make forward progress.)

If the atomic access is to memory on a CXL.memory device, the operation will not benefit from local cache-line latencies, and the atomicity will be guaranteed by the CXL.memory device exporting the memory to the host in some implementation-defined manner.
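A defensive check for the alignment hazard Scott mentions (a sketch; the availability of std::hardware_destructive_interference_size is an assumption about the toolchain): a naturally aligned std::atomic<int> can never straddle a cache line, which can be asserted at compile time, and padding the counter to a full line additionally avoids false sharing under contention.

    #include <atomic>
    #include <new>  // std::hardware_destructive_interference_size

    struct RefCounted {
        // Naturally aligned 4-byte atomics never cross a 64-byte line,
        // so no split-lock / system-wide-lock path can be taken.
        std::atomic<int> refcount{1};
    };
    static_assert(alignof(std::atomic<int>) >= sizeof(std::atomic<int>),
                  "counter could straddle a cache line");

    // Under real contention, giving the counter its own cache line
    // avoids false sharing with neighbouring hot data.
    struct alignas(std::hardware_destructive_interference_size) PaddedCount {
        std::atomic<int> value{1};
    };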
Paavo Helde <eesnimi@osa.pri.ee>: Dec 01 08:21PM +0200

01.12.2022 18:02 Scott Lurndal wrote:
> Which follows naturally from the fact that the core doing
> the atomic access has exclusive access to the cache line
> containing the refcounter. No overhead at all, unless

Thanks for the clarifications!

> will take a system-wide
> lock to perform the operation, which is catastrophic
> on systems with large processor counts.

It is clear that having a misaligned atomic straddling a cache-line boundary would be very bad. But what about normal uncached memory ranges, wouldn't these just be loaded into the cache, without disturbing other processors and without any "catastrophic" consequences?
scott@slp53.sl.home (Scott Lurndal): Dec 01 07:09PM

>bad. But what about normal uncached memory ranges, wouldn't these
>just be loaded into the cache, without disturbing other processors
>and without any "catastrophic" consequences?

Generally, uncached means that the processor fetches directly from memory, bypassing the cache and never allocating any lines. This is an important characteristic for MMIO space, where a read access can have a side effect (e.g. reading a UART Data Register).

It also depends on how the processor's atomic instructions are implemented. In legacy Intel/AMD systems, where the LOCK prefix is used, the system-wide lock is the only possibility [*]. For ARM64 with the Large System Extensions (LSE) atomic instructions, the processor can send the atomic operation to the point of coherency (either the cache subsystem or, if caching is disabled, the DRAM controller or a PCI-Express device; PCIe supports atomics), and the synchronization happens at the "endpoint". Without support all the way to the memory controller or endpoint, there is no other way to synchronize all agents accessing the controller or endpoint without acquiring a global mutex of some sort.

[*] It's been a decade since I worked directly with those processors, and they may have added support for atomic operations to the internal ring (or now mesh) structures used to communicate between the processing elements and the memory controllers and PCI root port bridges, in which case, like ARM64, they can push the atomic op all the way out to the endpoint/controller.
Michael S <already5chosen@yahoo.com>: Dec 01 11:59AM -0800

On Thursday, December 1, 2022 at 8:22:12 PM UTC+2, Paavo Helde wrote:
> bad. But what about normal uncached memory ranges, wouldn't these
> just be loaded into the cache, without disturbing other processors
> and without any "catastrophic" consequences?

"Uncached memory range" is a misnomer. The proper name is uncacheable range (region). Unfortunately, "uncached" in the meaning of "uncacheable" is used quite often. Even Intel's official manuals suffer from this inconsistent vocabulary.
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Dec 01 12:21PM -0800 On 12/1/2022 5:06 AM, Stuart Redmann wrote: > pointer classes: thread-safe pointers and forwarding non-thread-safe smart > pointers. The forwarding smart pointers have their own thread-UNsafe > refcount and the thread-safe smart pointer as member. Check this out: https://github.com/jseigh/atomic-ptr-plus/blob/master/atomic-ptr/atomic_ptr.h It is a truly atomic reference counted pointer. A thread can take a reference without owning a prior reference. Here is a patent: https://patents.justia.com/patent/5295262 |
Paavo Helde <eesnimi@osa.pri.ee>: Dec 01 11:53PM +0200

01.12.2022 21:59 Michael S wrote:
> Unfortunately, "uncached" in the meaning of "uncacheable" is used
> quite often. Even Intel's official manuals suffer from this
> inconsistent vocabulary.

Thanks, I had to look up what "uncacheable memory" is. I guess the 80186 processor on which I learned my basics did not have such a thing.
Bonita Montero <Bonita.Montero@gmail.com>: Dec 01 07:23PM +0100

In the end I only managed to get a faster comparison of a char array against a string_view on Windows:

    template<typename CharType, typename TraitsType>
    #if defined(__cpp_concepts)
        requires std::same_as<std::make_unsigned_t<CharType>, unsigned char>
            || std::same_as<CharType, wchar_t>
            || std::same_as<CharType, char8_t>
            || std::same_as<CharType, char16_t>
            || std::same_as<CharType, char32_t>
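The post is cut off here. As a sketch of the general idea (my reconstruction, not Bonita's actual code): when the right-hand side is an array whose length is a compile-time constant, the length check and the memcmp can be fully inlined for a fixed size, which is what can beat the generic runtime-length std::string_view operator==.

    #include <cstddef>
    #include <cstring>
    #include <string_view>

    // Compare a string_view against a string literal / char array whose
    // length N is known at compile time. The compiler can fold the size
    // check and expand the fixed-size memcmp inline, instead of calling
    // the generic variable-length comparison.
    template<std::size_t N>
    bool equals(std::string_view sv, const char (&lit)[N]) {
        return sv.size() == N - 1              // N includes the '\0'
            && std::memcmp(sv.data(), lit, N - 1) == 0;
    }

    // Usage: equals(name, "content-length")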