Friday, December 2, 2022

Digest for comp.lang.c++@googlegroups.com - 25 updates in 5 topics

Michael S <already5chosen@yahoo.com>: Dec 02 04:11AM -0800

On Thursday, December 1, 2022 at 11:54:02 PM UTC+2, Paavo Helde wrote:
> > inconsistent vocabulary.
> Thanks, I had to look up what is "uncacheable memory". I guess the 80186
> processor where I learned my basics did not have such a thing.
 
The 80186 was an "embedded" microprocessor, similar at its core to the 8086.
It was typically used with no cache, so it didn't need a concept of uncacheable
regions.
The 80286 and especially the i386 were quite often used with an (external)
cache, but as I understand it, those caches were what we today call
"memory-side caches", associated with main memory.
From the system's perspective (both the CPU's and other bus masters') such
caches are totally transparent (except for entering/leaving deep sleep states,
but back then they didn't do that), so there was still no need for uncacheable
regions.
 
System-side caches and their associated problems first appeared in the x86
world with the i486. Still, in the original i486 the system cache had a strict
write-through policy, so the problems were minor.
Then came the Pentium with 8 KB of write-back data cache, and a little later
new models of the i486 with even bigger write-back caches, and the problems
became quite real, especially because at about the same time PCI took over the
I/O bus role, and suddenly multiple bus masters, a high-end curiosity before
then, became common in consumer PC hardware.
But even then the x86 architecture lacked an adequate answer to the new
challenge.
 
The first reasonable answer (MTRR registers) came only with the PPro, but it
still had a scalability problem: too few regions.
Later (P-III) they invented PAT, which from a theoretical point of view is
inferior to MTRRs, because in the PAT scheme cacheability is an attribute of
the virtual address rather than of the physical address. But PAT is ultimately
scalable and, as long as the OS does the proper plumbing, is the one solution
that can rule over all aspects of cacheability. So it won.
Michael S <already5chosen@yahoo.com>: Dec 02 05:53AM -0800

On Thursday, December 1, 2022 at 9:09:39 PM UTC+2, Scott Lurndal wrote:
 
> It also depends on how the processor atomic instructions are implemented.
 
> In legacy Intel/AMD systems, where the LOCK prefix is being used, the
> systemwide lock is the only possibility [*].
 
Huh?
There is absolutely no relationship between which prefix is used (an
instruction-encoding issue) and the implementation.
 
> the processing elements and the memory controllers and PCI root port
> bridges, in which case, like ARM64, they can push the atomic op
> all the way out to the endpoint/controller.
 
x86 atomic operations have global order, which is somewhat stronger than the
"total order" of the rest of normal* x86 stores. The difference is that,
unlike "total order", global order makes no exception for store-to-load
forwarding from the core's local store queue.
 
BTW, this means that your claim in the post above, "no overhead at all unless
...", is incorrect in the absolute sense. There is an overhead even without the
"unless". But the overhead in the uncontended case is small, on the order of a
dozen or two CPU clocks. So, undetectable in Paavo's case of only 1000 updates
per second. For 1M updates per second the impact would be detectable with
precise time measurements, and for 100M per second there would be a big slowdown.
 
Last September Intel published this manual:
https://cdrdv2-public.intel.com/671368/architecture-instruction-set-extensions-programming-reference.pdf
The manual contains new atomic instructions AADD/AAND/AOR/AXOR that provide
weaker (WC) ordering in WB memory regions. The manual says neither when these
instructions are going to be implemented nor whether they will be implemented
at all. It also does not explain in which situations they are expected to be
useful. However, one thing is clear: they will *not* be useful in typical
user-mode code that deals with reference counting.
Maybe they are usable in userland in extreme fire-and-forget situations, like
counting events where the counter is updated not too often, but often enough to
matter, typically not from the same core as the last update, and read
approximately never.
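To see why weakly ordered atomics can't serve reference counting, here is a minimal intrusive-refcount sketch (the `RefCounted` type and its members are illustrative, not from any particular library): the increment can be fully relaxed, but the decrement must carry acquire-release ordering so that the last owner observes every write the other owners made before it destroys the object. An AADD-style weakly ordered add has no place to supply that ordering.

```cpp
#include <atomic>

// Hypothetical minimal intrusive refcount, sketching the memory orders
// reference counting actually needs.
struct RefCounted {
    std::atomic<int> refs{1};

    void add_ref() {
        // The increment needs no ordering: the caller already owns a
        // reference, so the operation only has to be atomic.
        refs.fetch_add(1, std::memory_order_relaxed);
    }

    bool release() {
        // The decrement must be acquire-release: the thread that drops
        // the count to zero must see all prior writes by other owners
        // before destroying the object.
        if (refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
            return true;  // last owner; caller may delete
        return false;
    }
};
```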
 
My guess is that these instructions were invented to help Optane DIMMs.
So today, with Optane DIMMs officially dead, it would be logical for Intel to
never implement these strange instructions, which could easily lead to
programmer mistakes.
 
[*] normal in this case means WB or UC. For WC stores it's more relaxed.
scott@slp53.sl.home (Scott Lurndal): Dec 02 02:36PM


>Huh?
>There is absolute no relationship between what prefix is used (instruction
>encoding issue) and implementation.
 
The only way to specify an atomic access in those chips was
to use the LOCK prefix (e.g. LOCK ADD generates an atomic
add, et alia).
Paavo Helde <eesnimi@osa.pri.ee>: Dec 02 06:33PM +0200

02.12.2022 15:53 Michael S wrote:
> clocks. So, undetectable in Paavo's case of only 1000 updates per second.
> For 1M updates per second impact would me detectable with precise time
> measurements and for 100M per second there would be big slowdown.
 
Just a minor note: earlier I posted numbers for only a single smart pointer,
whereas in the full program there are probably tens or hundreds of thousands
of them. I just measured it, and the total rate of refcount changes is
something like 5M per second.
 
With this 5M/s rate I cannot see any slowdown (from using std::atomic<int>
instead of int) in my measurements (with no contention); variations caused by
other uncontrollable factors seem to be much larger. Maybe I should rerun
this on a Linux box where things are more stable.
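A measurement of this sort can be sketched as below; the helper names (`time_ns`, `bench`) are made up for illustration, and the numbers are entirely machine-dependent. The point is only that an uncontended relaxed `fetch_add` costs at most a couple dozen cycles more than a plain increment, which easily disappears in run-to-run noise at 5M updates/s.

```cpp
#include <atomic>
#include <chrono>
#include <utility>

// Crude timing helper: run f and return elapsed nanoseconds.
template <typename F>
long long time_ns(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

// Compare n plain increments against n uncontended std::atomic<int>
// increments. Returns {plain_ns, atomic_ns}.
std::pair<long long, long long> bench(int n) {
    volatile int plain = 0;  // volatile keeps the loop from being elided
    std::atomic<int> shared{0};
    long long tp = time_ns([&] { for (int i = 0; i < n; ++i) plain = plain + 1; });
    long long ta = time_ns([&] {
        for (int i = 0; i < n; ++i)
            shared.fetch_add(1, std::memory_order_relaxed);
    });
    return {tp, ta};
}
```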
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Dec 02 12:29PM -0800

On 12/2/2022 6:36 AM, Scott Lurndal wrote:
 
> The only way to specify an atomic access in those chips was
> to use the LOCK prefix (e.g. LOCK ADD generates an atomic
> add, et alia).
 
Iirc, XCHG has an implicit LOCK prefix?
scott@slp53.sl.home (Scott Lurndal): Dec 02 08:55PM

>> to use the LOCK prefix (e.g. LOCK ADD generates an atomic
>> add, et alia).
 
>Iirc, XCHG has an implicit LOCK prefix?
 
Not sure about XCHG, but CMPXCHG requires the prefix
when used in a multiprocessor system; on a uniprocessor
it will be atomic because interrupts are always taken
between instructions (unlike the VAX, for instance,
where certain instructions (MOVC3/5) can be interrupted
and restarted). I suspect that XCHG has similar
characteristics.
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Dec 02 12:58PM -0800

On 12/2/2022 12:55 PM, Scott Lurndal wrote:
> where certain instructions (MOVC3/5) can be interrupted
> and restarted). I suspect that XCHG has similar
> characteristics.
 
Iirc, XCHG is the _only_ atomic RMW instruction that has an _implicit_
LOCK prefix. CMPXCHG _needs_ the programmer to put in a LOCK prefix.
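From C++ this distinction is visible in what mainstream compilers typically emit on x86-64 for the std::atomic operations (the instruction mappings in the comments are the usual GCC/Clang lowering, not something the standard guarantees):

```cpp
#include <atomic>

int take(std::atomic<int>& a) {
    return a.exchange(0);                 // usually: xchg   (LOCK is implicit)
}

int bump(std::atomic<int>& a) {
    return a.fetch_add(1);                // usually: lock xadd
}

bool claim(std::atomic<int>& a, int expected) {
    return a.compare_exchange_strong(expected, -1);  // usually: lock cmpxchg
}
```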
Branimir Maksimovic <branimir.maksimovic@icloud.com>: Dec 02 09:09PM

> where certain instructions (MOVC3/5) can be interrupted
> and restarted). I suspect that XCHG has similar
> characteristics.
 
"lock" prefix causes the processor's bus-lock signal to be asserted during
execution of the accompanying instruction. In a multiprocessor environment,
the bus-lock signal insures that the processor has exclusive use of any shared
memory while the signal is asserted. The "lock" prefix can be prepended only
to the following instructions and only to those forms of the instructions
where the destination operand is a memory operand: "add", "adc", "and", "btc",
"btr", "bts", "cmpxchg", "cmpxchg8b", "dec", "inc", "neg", "not", "or", "sbb",
"sub", "xor", "xadd" and "xchg". If the "lock" prefix is used with one of
these instructions and the source operand is a memory operand, an undefined
opcode exception may be generated. An undefined opcode exception will also be
generated if the "lock" prefix is used with any instruction not in the above
list. The "xchg" instruction always asserts the bus-lock signal regardless of
the presence or absence of the "lock" prefix.
 
 
--
 
7-77-777
Evil Sinner!
with software, you repeat same experiment, expecting different results...
scott@slp53.sl.home (Scott Lurndal): Dec 02 09:18PM

>> characteristics.
 
>Iirc, XCHG is the _only_ atomic RMW instruction that has an _implicit_
>LOCK prefix. CMPXCHG _needs_ the programmer to put in a LOCK prefix.
 
Yes, that is the case.
 
XCHG:
"If a memory operand is referenced, the processor's locking protocol is automatically
implemented for the duration of the exchange operation, regardless of the presence
or absence of the LOCK prefix or of the value of the IOPL. (See the LOCK prefix
description in this chapter for more information on the locking protocol.)"
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Dec 02 01:22PM -0800

On 12/2/2022 1:09 PM, Branimir Maksimovic wrote:
> list.
 
 
> The "xchg" instruction always asserts the bus-lock signal regardless of
> the presence or absence of the "lock" prefix.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
BINGO! I had a strong feeling I was right.
 
https://youtu.be/TnZrWWUFl8I
Branimir Maksimovic <branimir.maksimovic@icloud.com>: Dec 02 09:36PM

> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
> BINGO! I had a strong feeling I was right.
 
> https://youtu.be/TnZrWWUFl8I
Nice music :p
 
--
 
7-77-777
Evil Sinner!
with software, you repeat same experiment, expecting different results...
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Dec 02 02:13PM -0800

On 12/2/2022 1:36 PM, Branimir Maksimovic wrote:
 
>> BINGO! I had a strong feeling I was right.
 
>> https://youtu.be/TnZrWWUFl8I
> Nice music :p
 
Thanks again, Branimir, for the quote of the docs.
 
https://youtu.be/UZ2-FfXZlAU
(super mario music in a live big band format? Nice... :^)
 
My hat is off to you.
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Dec 02 02:16PM -0800

On 12/2/2022 1:18 PM, Scott Lurndal wrote:
> implemented for the duration of the exchange operation, regardless of the presence
> or absence of the LOCK prefix or of the value of the IOPL. (See the LOCK prefix
> description in this chapter for more information on the locking protocol.)
 
Indeed. When I remembered this I kind of doubted myself for a moment.
Then I said, well, I know it's true, and if I'm wrong then I must have
fried my brain a bit.
Mr Flibble <flibble@reddwarf.jmc.corp>: Dec 01 11:36PM

On Thu, 01 Dec 2022 21:54:55 +0100, Bonita Montero wrote:
 
> #include <atomic>
 
> template<typename CharType, typename TraitsType>
> requires std::same_as<std::make_unsigned_t<CharType>, unsigned char> ||
> #if defined(_WIN32)
> __try {
> return !str[sv.length()] && memcmp( str, sv.data(), sv.length() *
> {
> using sv_cit = std::string_view::const_iterator;
> for( sv_cit itSv = sv.cbegin(), itSvEnd = sv.cend(); itSv != itSvEnd;
> cStr( N, '*' ),
> svStr( cStr );
> auto cmpStrcmp = []( char const *str, string_view const &sv ) -> bool {
> return strcmp( str, sv.data() ) == 0; };
> auto cmpSvCmp = []( char const *str, string_view const &sv ) -> bool {
> return svCmp( str, sv ); };
> using cmp_sig_t = bool (*)( char const *str, string_view const &sv );
> I'm using a volatile pointer to the comparison function to prevent any
> optimizations on my own function to have a fair comparison against
> memcmp().
 
Your code is terrible, egregiously terrible: buffer overflows are NEVER a
good idea even in example code; and "using namespace std;" is bad too.
 
/Flibble
Bonita Montero <Bonita.Montero@gmail.com>: Dec 02 05:30AM +0100

Am 02.12.2022 um 00:36 schrieb Mr Flibble:
 
> Your code is terrible, egregiously terrible: buffer overflows are NEVER a
> good idea even in example code; and "using namespace std;" is bad too.
 
There's nothing wrong with my code, and I guess you
don't even understand it; you're just a bad programmer.
Muttley@dastardlyhq.com: Dec 02 10:09AM

On Fri, 2 Dec 2022 05:30:45 +0100
>> good idea even in example code; and "using namespace std;" is bad too.
 
>There's nothing wrong with my code and I gues you
>don't even understand it; you're just a bad programmer.
 
You think everyone is a bad programmer if they don't recognise your coding
genius. I don't agree with Flibble about much, but he's right about this.
Your code is almost always an overcomplicated mess. I don't know whether it's
because you're showing off or you simply can't think clearly, but I'm glad
I'll never have to maintain it.
Bonita Montero <Bonita.Montero@gmail.com>: Dec 02 03:22PM +0100


> You think everyone is a bad programmer if they don't recognise your coding
> genuise. ...
 
The code is reliable and 15 times faster than strcmp on my machine.
 
> Your code is almost always an overcomplicated mess. ...
 
It's rather simple for this kind of improvement.
Mr Flibble <flibble@reddwarf.jmc.corp>: Dec 02 02:48PM

On Fri, 02 Dec 2022 05:30:45 +0100, Bonita Montero wrote:
 
>> too.
 
> There's nothing wrong with my code and I gues you don't even understand
> it; you're just a bad programmer.
 
Of course I understand the mess that you think is good code: you are
relying on undefined behaviour (buffer overruns and page faults) in the
fucktarded belief that you can create a faster string comparison function; if
we ignore the undefined behaviour there is still an issue: most strings
are shorter than a typical page size, which is a problem if the length of
"str" is less than the length of "sv"; if you cannot see what that problem
is, then it is YOU who is "just a bad programmer".
 
/Flibble
Bonita Montero <Bonita.Montero@gmail.com>: Dec 02 06:39PM +0100

Am 02.12.2022 um 15:48 schrieb Mr Flibble:
 
> Of course I understand the mess that you think is good code: you are
> relying on undefined behaviour (buffer overruns and page faults) in the
> fucktarded belief you can create a faster string comparison function; ...
 
On Windows this style of programming is safe.
The speedup I measured is a factor of 15.
 
> most strings are shorter than a typical page size ...
 
Of course, but I would still get a speedup by comparing eight chars
in each chunk.
 
> which is a problem if the length of "str" is less than the length of "sv";
 
The mistake is the opposite, i.e. if str is longer than sv.
But I've fixed that by returning !*str.
Mr Flibble <flibble@reddwarf.jmc.corp>: Dec 02 06:17PM

On Fri, 02 Dec 2022 18:39:51 +0100, Bonita Montero wrote:
 
>> "sv";
 
> The mistake is the opposite, i.e. if str is longer than sv.
> But I've fixed that by returning !*str.
 
No it isn't and you haven't fixed anything: you are a bad programmer
indeed.
 
/Flibble
Bonita Montero <Bonita.Montero@gmail.com>: Dec 02 07:25PM +0100

Am 02.12.2022 um 19:17 schrieb Mr Flibble:
 
> No it isn't and you haven't fixed anything: you are a bad programmer
> indeed.
 
You don't understand your own bug assumptions. If the length of str is less
than that of sv, this is detected inside the loop. If it is larger, this is
detected by checking whether the character inside str at the end index of sv
is zero. Who's the bad programmer here?
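For reference, the two boundary cases being argued about (str shorter than sv, str longer than sv) can both be handled without any out-of-bounds reads. This is a plain-semantics sketch of the comparison, not Bonita's chunked version; the function name is made up:

```cpp
#include <string_view>

// True iff the NUL-terminated str equals sv exactly, never reading
// past either buffer.
bool equals(char const* str, std::string_view sv) {
    for (char c : sv) {
        if (*str == '\0' || *str != c)  // str shorter than sv, or mismatch
            return false;
        ++str;
    }
    return *str == '\0';                // false if str is longer than sv
}
```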
Mr Flibble <flibble@reddwarf.jmc.corp>: Dec 02 06:50PM

On Fri, 02 Dec 2022 19:25:50 +0100, Bonita Montero wrote:
 
> than that of sv this is detected inside the loop. If it is larger this
> is detected by looking if the charater inside str at the end-index of sv
> is zero. Who's the bad programmer here ?
 
No it isn't. You are the bad programmer here.
 
/Flibble
wij <wyniijj5@gmail.com>: Dec 02 10:10AM -0800

Using C-like notation:
 
ℙ ::= the set of decision problems which can be computed by a TM in P-time.
 
ℕℙ ::= the set of decision problems q for which a positive answer can be
verified by a certificate c, i.e. a decision function v(q,c) yields true in
P-time.
 
Spec: The function "bool S(Prog,UInt)" decides whether or not the argument
Prog (a program equal in power to a TM) is satisfiable in the range specified
by UInt, and its algorithmic steps are bounded by a given polynomial formula
P. IOW: S(f,n)==true iff ∃v,x, x<=n, such that v(f,x) proves f(x)==true (e.g.
v simulates f(x)) and Time(v(f,x))<=P(|f|+|x|) provides a P-time certificate.
 
From the specification, Problem(S)∈ℕℙ (precisely, ℕℙℂ).
S can be implemented by a DTM. And a Prog u whose function is defined as
follows will also exist (the implementation of u is very different; the
implementations of S and u, as existence proofs, are omitted, because they
are supposedly understood immediately):
 
bool u(UInt x) {
    return !S(u,x);
}
 
From the Liar's paradox and the HP proof, u is a never-terminating
(undecidable) instance. If S(u,n) computes in greater than P time, the answer
S(u,n)==false is fine. But if S is in P time, u(x) can be in P time, and then
the only condition for S to return false in this case is u(x)==false, which
will not happen. A P-time S is unimplementable. Therefore, Problem(S)∉ℙ, ℙ≠ℕℙ.
QED.
Mr Flibble <flibble@reddwarf.jmc.corp>: Dec 02 04:41PM

https://www.tiktok.com/@looneytony4/video/7166638178647035137?is_from_webapp=v1&item_id=7166638178647035137
Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Dec 02 12:43AM

On Wed, 30 Nov 2022 05:46:54 -0800 (PST)
> raised -- but there are major restrictions on what can happen inside a
> signal handler.
 
> Does anyone know how this works?
 
I don't know about windows, but wxWidgets when running on unix-like
systems uses glib/GTK+ as the underlying toolkit, and for modal dialogs
of the kind you have described that generally means using recursive main
loop invocations (in other words, recursive blocking calls to
gtk_main()) which has the effect of multiplexing the single-threaded
event loop: see https://docs.gtk.org/glib/main-loop.html.
 
All GTK+ events (including I believe all GDK rendering) execute in the
main loop thread: GTK+ itself is single threaded (GTK+2 did make some
limited provision for it to run under multiple threads but this was
deprecated in GTK+3 and removed in GTK4). However glib/gio provides
mechanisms for launching tasks on worker threads, which when complete
will post an event to the main loop so that its continuation runs on
the main loop thread. Some asynchronous gio i/o operates in this way.
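The recursive-main-loop idea can be sketched in toolkit-independent C++ (all names here are hypothetical, not GTK+ or wxWidgets API): a modal dialog blocks its caller on a single thread simply by re-entering the same dispatch loop, which keeps delivering events until the dialog signals completion.

```cpp
#include <functional>
#include <queue>
#include <utility>

// Hypothetical single-threaded event loop. A "modal" dialog is just a
// nested call to run_until(): the caller is blocked, yet events keep
// being dispatched.
struct EventLoop {
    std::queue<std::function<void()>> events;

    void post(std::function<void()> e) { events.push(std::move(e)); }

    // Dispatch events until `done` becomes true. A handler may call
    // run_until() again, recursively -- that nesting is what GTK's
    // recursive gtk_main() invocations amount to.
    void run_until(bool const& done) {
        while (!done && !events.empty()) {
            auto e = std::move(events.front());
            events.pop();
            e();
        }
    }
};
```

Usage, roughly: `loop.post(...)` some handler that eventually sets a `dialog_done` flag (e.g. the user clicking OK), then `loop.run_until(dialog_done)` blocks the calling code while the loop keeps running.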
 
I presume (but do not know) that wxWidgets operates in a similar way
when using Windows as its backend.
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page.
To unsubscribe from this group and stop receiving emails from it send an email to comp.lang.c+++unsubscribe@googlegroups.com.
