Thursday, December 1, 2022

Digest for comp.lang.c++@googlegroups.com - 23 updates in 4 topics

Juha Nieminen <nospam@thanks.invalid>: Dec 01 06:45AM

> refcounter got incremented (and later decremented). Alas, this was
> accidentally done from parallel threads at the same time, without any
> synchronization, so eventually the refcounter got messed up.
 
I think that if the reference count is declared atomic, it can be
safely incremented directly. When decrementing, you would need to use
the fetch_sub() function to see whether the object needs to be destroyed.
 
While modifying an atomic might not be as fast as modifying a
non-atomic, it shouldn't be all that much slower either, at least if
the target architecture supports atomic operations.
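
For instance, a minimal sketch of that scheme (names invented here; the
memory orderings are the usual choice for intrusive refcounts):

#include <atomic>

struct RefCounted {
    std::atomic<int> refs{1};
};

void add_ref(RefCounted* p) {
    // The increment only needs atomicity, not ordering, so it can
    // be relaxed.
    p->refs.fetch_add(1, std::memory_order_relaxed);
}

void release(RefCounted* p) {
    // fetch_sub() returns the previous value; if it was 1, this was
    // the last reference. acq_rel makes writes done through other
    // references visible to the thread that destroys the object.
    if (p->refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
        delete p;
}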
Paavo Helde <eesnimi@osa.pri.ee>: Dec 01 09:19AM +0200

01.12.2022 08:45 Juha Nieminen wrote:
 
> While modifying an atomic might not be as fast as modifying a
> non-atomic, it shouldn't be all that much slower either, at least if
> the target architecture supports atomic operations.
 
I have pondered this myself. Maybe I should measure the actual slowdown
after temporarily making the refcounters atomic. But this seems overkill,
because these smartpointers would still point to single-threaded objects
which are meant to be used primarily in a single-threaded regime, so in
most cases making the smartpointers atomic would not buy anything.
 
When tracking down this bug, I monitored all refcounter changes for one
particular smartpointer during a program run (ca 10 min). There
were 591848 increments and decrements, of which 1526 came from the
problematic (parallelized) part. It looks like a pessimization to slow
down 99.75% of accesses when only 0.25% would actually benefit.
Stuart Redmann <DerTopper@web.de>: Dec 01 02:06PM +0100

> were 591848 increments and decrements, of which 1526 came from the
> problematic (parallelized) part. It looks like a pessimization to slow
> down 99.75% of accesses when only 0.25% would actually benefit.
 
600k changes in reference counts look suspicious to me. When you pass a
ref-counted object to a worker thread, there should only be a single
change in the refcount: inside the worker thread, some object takes
(shared) ownership of the shared object. If the shared object needs to
be passed to sub-routines, you should pass it as a reference or a plain
pointer (if the subroutine must be able to cope with a non-existing
object). It should be a rare occurrence that another object in the
worker thread needs to take ownership of the shared object.
 
Another thought: if thread-safety is too costly, you could use two smart
pointer classes: thread-safe pointers and forwarding non-thread-safe smart
pointers. The forwarding smart pointers have their own thread-UNsafe
refcount and hold the thread-safe smart pointer as a member.
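
A minimal sketch of that idea (all names invented here, with
std::shared_ptr standing in for the thread-safe pointer; copy
assignment is omitted for brevity):

#include <memory>

template<typename T>
class forwarding_ptr {
    struct node {
        std::shared_ptr<T> inner; // thread-safe smart pointer as member
        int refs = 1;             // thread-UNsafe local refcount
    };
    node* n_;
public:
    explicit forwarding_ptr(std::shared_ptr<T> p)
        : n_(new node{std::move(p)}) {}
    // Copies within one thread only touch the cheap non-atomic count.
    forwarding_ptr(const forwarding_ptr& o) noexcept : n_(o.n_) { ++n_->refs; }
    forwarding_ptr& operator=(const forwarding_ptr&) = delete;
    ~forwarding_ptr() { if (--n_->refs == 0) delete n_; }
    T* operator->() const { return n_->inner.get(); }
    T& operator*()  const { return *n_->inner; }
};

To hand the object to another thread, one would copy the inner
shared_ptr (a single atomic bump) and wrap it in a fresh forwarding_ptr
on the receiving side.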
 
Regards,
Stuart
Paavo Helde <eesnimi@osa.pri.ee>: Dec 01 04:45PM +0200

01.12.2022 15:06 Stuart Redmann wrote:
> pointer (if the subroutine must be able to cope with a non-existing
> object). It should be a rare occurrence that another object in the
> worker thread needs to take ownership of the shared object.
 
You are right, that's how I fixed the bug (by using a reference). There
are now 1526 fewer changes in refcounts ;-)
 
As for the other ~600k changes, these seem legitimate. This is a scripting
language engine (think something like Python) with complex data
structures built up via refcounted smartpointers. This particular object
is apparently used as some default column in data tables. I think it was
used for 2000 columns in some 4000-column table. And there were many
tables like that. If you insert the same refcounted vector as a new
column into a table 2000 times, via a member function taking a
smartpointer parameter by value, then you already get something like at
least 6000 refcount changes.
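
For illustration (with std::shared_ptr standing in for the in-house
smartpointer, and invented names):

#include <memory>
#include <vector>

struct Column { /* payload */ };

struct Table {
    std::vector<std::shared_ptr<Column>> columns;

    // By value: copy into the parameter (+1), copy into the table (+1),
    // destroy the parameter (-1) -- three refcount changes per call.
    void add_column_by_value(std::shared_ptr<Column> col) {
        columns.push_back(col);
    }

    // By const reference: a single refcount change, when the table
    // stores its own copy.
    void add_column_by_ref(const std::shared_ptr<Column>& col) {
        columns.push_back(col);
    }
};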
 
> pointer classes: thread-safe pointers and forwarding non-thread-safe smart
> pointers. The forwarding smart pointers have their own thread-UNsafe
> refcount and the thread-safe smart pointer as member.
 
I tried to measure the impact of std::atomic<int> refcounters and in
first tests it seems the overhead on x86_64 is zero (with no
contention). So it seems I could use them without drawbacks, but this
would not save much because the pointed-to objects would still not be
thread-safe. I guess it might work out if I could ensure that all
objects are physically immutable after some initialization. Hmm, time
for some thought.
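
A toy measurement along these lines (uncontended, single thread; the
plain loop is kept alive with volatile, and the numbers should be taken
as rough bounds only):

#include <atomic>
#include <chrono>
#include <cstdio>

int main() {
    constexpr long N = 100000000;
    std::atomic<int> atomic_count{0};
    volatile int plain_count = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < N; ++i)
        atomic_count.fetch_add(1, std::memory_order_relaxed);
    auto t1 = std::chrono::steady_clock::now();
    for (long i = 0; i < N; ++i)
        plain_count = plain_count + 1;
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::printf("atomic: %lld ms, plain: %lld ms\n",
        (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
        (long long)std::chrono::duration_cast<ms>(t2 - t1).count());
}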
scott@slp53.sl.home (Scott Lurndal): Dec 01 04:02PM

>01.12.2022 15:06 Stuart Redmann wrote:
 
<snip>
 
 
>I tried to measure the impact of std::atomic<int> refcounters and in
>first tests it seems the overhead on x86_64 is zero (with no
>contention).
 
Which follows naturally from the fact that the core doing
the atomic access has exclusive access to the cache line
containing the refcounter. No overhead at all, unless
the ref counter isn't aligned and crosses a cache-line
boundary (or the access is to an uncached memory
range or caching is disabled), in which case the processor
will take a system-wide
lock to perform the operation, which is catastrophic
on systems with large processor counts.
 
(Note that both Intel and AMD processors will fall back to
the system-wide lock if a cache line is highly contended
after some time period has elapsed in order to make forward
progress.)
 
If the atomic access is to memory on a CXL.memory device, the
operation will not benefit from local cache line latencies
and the atomicity will be guaranteed by the CXL.memory device
exporting the memory to the host in some implementation-defined
manner.
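
For what it's worth, the cache-line-crossing case can be ruled out by
construction. A naturally aligned std::atomic<int> can never straddle a
line by itself; padding the counter out to a full cache line
additionally avoids false sharing with neighboring data:

#include <atomic>
#include <new>

struct ControlBlock {
    // std::hardware_destructive_interference_size is C++17 (<new>),
    // typically 64 on x86_64; alignas(64) is the common hand-written
    // fallback.
    alignas(std::hardware_destructive_interference_size)
    std::atomic<int> refs{1};
};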
Paavo Helde <eesnimi@osa.pri.ee>: Dec 01 08:21PM +0200

01.12.2022 18:02 Scott Lurndal wrote:
 
> Which follows naturally from the fact that the core doing
> the atomic access has exclusive access to the cache line
> containing the refcounter. No overhead at all, unless
 
Thanks for the clarifications!
 
> will take a system-wide
> lock to perform the operation, which is catastrophic
> on systems with large processor counts.
 
It is clear that having a misaligned atomic crossing a cache-line
boundary would be very bad. But what about normal uncached memory
ranges, wouldn't these be just loaded into the cache, without
disturbing other processors, and without any "catastrophic"
consequences?
scott@slp53.sl.home (Scott Lurndal): Dec 01 07:09PM

>bad. But what about normal uncached memory ranges, wouldn't these be
>just loaded into the cache, without disturbing other processors, and
>without any "catastrophic" consequences?
 
Generally uncached means that the processor fetches
directly from memory bypassing the cache and never evicting any
lines. This is an important characteristic for MMIO space
where a read access has a side effect (e.g. reading a UART
Data Register).
 
It also depends on how the processor atomic instructions are implemented.
 
In legacy Intel/AMD systems, where the LOCK prefix is used, the
system-wide lock is the only possibility [*].
 
For ARM64 with the Large System Extensions (LSE) atomic instructions,
the processor can send the atomic operation to the point of coherency
(either the cache subsystem or, if caching is disabled, the DRAM
controller or a PCI-Express device; PCIe supports atomics), and the
synchronization happens at the "endpoint".
 
Without support all the way to the memory controller or endpoint,
there is no other way to synchronize all agents accessing the controller
or endpoint without acquiring a global mutex of some sort.
 
[*] It's been a decade since I worked directly with those processors
and they may have added support for atomic operations to the
internal ring or now mesh structures used to communicate between
the processing elements and the memory controllers and PCI root port
bridges, in which case, like ARM64, they can push the atomic op
all the way out to the endpoint/controller.
Michael S <already5chosen@yahoo.com>: Dec 01 11:59AM -0800

On Thursday, December 1, 2022 at 8:22:12 PM UTC+2, Paavo Helde wrote:
> bad. But what about normal uncached memory ranges, wouldn't these be
> just loaded into the cache, without disturbing other processors, and
> without any "catastrophic" consequences?
 
"Unchached memory range" is a misnormer.
A proper name is uncacheable range (region).
Unfortunately "uncached" in the meaning of "uncacheable" is used
quite often. Even Intel's official manuals suffer from such
inconsistent vocabulary.
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Dec 01 12:21PM -0800

On 12/1/2022 5:06 AM, Stuart Redmann wrote:
> pointer classes: thread-safe pointers and forwarding non-thread-safe smart
> pointers. The forwarding smart pointers have their own thread-UNsafe
> refcount and hold the thread-safe smart pointer as a member.
 
Check this out:
 
https://github.com/jseigh/atomic-ptr-plus/blob/master/atomic-ptr/atomic_ptr.h
 
It is a truly atomic reference-counted pointer. A thread can take a
reference without owning a prior reference.
 
Here is a patent:
 
https://patents.justia.com/patent/5295262
Paavo Helde <eesnimi@osa.pri.ee>: Dec 01 11:53PM +0200

01.12.2022 21:59 Michael S wrote:
 
> Unfortunately "uncached" in the meaning of "uncacheable" is used
> quite often. Even Intel's official manuals suffer from such
> inconsistent vocabulary.
 
Thanks, I had to look up what "uncacheable memory" is. I guess the
80186 processor where I learned my basics did not have such a thing.
Bonita Montero <Bonita.Montero@gmail.com>: Dec 01 07:23PM +0100

In the end I only managed to get a faster comparison
of a char array against a string_view on Windows:
 
template<typename CharType, typename TraitsType>
#if defined(__cpp_concepts)
requires std::same_as<std::make_unsigned_t<CharType>, unsigned char>
    || std::same_as<CharType, wchar_t>
    || std::same_as<CharType, char8_t>
    || std::same_as<CharType, char16_t>
    || std::same_as<CharType, char32_t>

No comments: