Tuesday, December 7, 2021

Digest for comp.lang.c++@googlegroups.com - 20 updates in 2 topics

Juha Nieminen <nospam@thanks.invalid>: Dec 07 11:37AM

Recently I stumbled across a problem where I had wide string literals
with non-ascii characters UTF-8 encoded. In other words, I had code like
this (I'm using non-ascii in the code below, I hope it doesn't get
mangled up, but even if it does, it should nevertheless be clear what
I'm trying to express):
 
std::wstring str = L"non-ascii chars: ???";
 
The C++ source file itself uses UTF-8 encoding, meaning that that line
of code is likewise UTF-8 encoded. If it were a narrow string literal
(being assigned to a std::string) it would work just fine (primarily
because the compiler doesn't need to do anything to it, it can simply
take those bytes from the source file as is).
 
However, since it's a wide string literal (being assigned to a std::wstring)
it's not as clear-cut anymore. What does the standard say about this
situation?
 
The thing is that it works just fine in Linux using gcc. The compiler will
re-encode the UTF-8 encoded characters in the source file inside the
parentheses into whatever encoding wide char strings use, so the correct
content will end up in the executable binary (and thus in the wstring).
 
Apparently it does not work correctly in (some recent version of)
Visual Studio, where apparently it just takes the byte values from the
source file within the parentheses as-is, and just assigns those values
as-is to the wide chars that end up in the binary. (Or something like that.)
 
Does the standard specify what the compiler should do in this situation?
If not, then what is the proper way of specifying wide string literals
that contain non-ascii characters?
"Alf P. Steinbach" <alf.p.steinbach@gmail.com>: Dec 07 04:44PM +0100

On 7 Dec 2021 12:37, Juha Nieminen wrote:
> Visual Studio, where apparently it just takes the byte values from the
> source file within the parentheses as-is, and just assigns those values
> as-is to the wide chars that end up in the binary. (Or something like that.)
 
The Visual C++ compiler assumes that source code is Windows ANSI encoded
unless
 
* you use an encoding option such as `/utf-8`, or
* the source is UTF-8 with BOM, or
* the source is UTF-16.
 
Independently of that, Visual C++ assumes that the execution character
set (the byte-based encoding that should be used for text data in the
executable) is Windows ANSI, unless it's specified as something else.
The `/utf-8` option also covers that: it's a combined option that
specifies both the source encoding and the execution character set as UTF-8.
 
Unfortunately, as of VS 2022 `/utf-8` is not set by default in a VS
project, and unfortunately there's nothing you can just click to set it.
You have to type it in: right-click the project, then Properties -> C/C++
-> Command Line. I usually set "/utf-8 /Zc:__cplusplus".
 
 
> Does the standard specify what the compiler should do in this situation?
> If not, then what is the proper way of specifying wide string literals
> that contain non-ascii characters?
 
I'll let others discuss that, but (1) it does, and (2) just so you're
aware: the main problem is that the C and C++ standards do not conform
to reality in their requirement that a `wchar_t` value should suffice to
encode all possible code points in the wide character set.
 
In Windows wide text is UTF-16, with 16-bit `wchar_t`. This means that
some emojis etc. that appear as a single character and constitute one
21-bit code point can become two `wchar_t` values, a UTF-16
"surrogate pair".
 
That's probably not your problem here, but it is a/the problem.
 
 
- Alf
James Kuyper <jameskuyper@alumni.caltech.edu>: Dec 07 11:38AM -0500

On 12/7/21 6:37 AM, Juha Nieminen wrote:
> Visual Studio, where apparently it just takes the byte values from the
> source file within the parentheses as-is, and just assigns those values
> as-is to the wide chars that end up in the binary. (Or something like that.)
> Does the standard specify what the compiler should do in this situation?
 
The standard says a great many things about it, but the most important
things it says are that the relevant character sets and encodings are
implementation-defined. If an implementation uses UTF-8 for its native
character encoding, your code should work fine. The most likely
explanation for why it doesn't work is that your UTF-8 encoded source code
file is being interpreted using some other encoding, probably ASCII or
one of its many variants.
 
I have relatively little experience programming for Windows, and
essentially none with internationalization. Therefore, the following
comments about Windows all convey second or third-hand information, and
should be treated accordingly. Many people posting on this newsgroup
know more than I do about such things - hopefully someone will correct
any errors I make:
 
* When Unicode first came out, Windows chose to use UCS-2 to support
it, and made that its default character encoding.
* When Unicode expanded beyond the capacity of UCS-2, Windows decided to
transition over to using UTF-16. There was an annoyingly long transition
period during which some parts of Windows used UTF-16, while other parts
still used UCS-2. I cannot confirm whether or not that transition period
has completed yet.
* I remember hearing rumors that modern versions of Windows do provide
some support for UTF-8, but that support is neither complete nor the
default. You have to know what you need to do to enable such support - I don't.
 
> If not, then what is the proper way of specifying wide string literals
> that contain non-ascii characters?
 
The most portable way of doing it is to use what the standard calls
Universal Character Names, or UCNs for short. "\u" followed by four
hexadecimal digits represents the character whose code point is
identified by those digits. "\U" followed by eight hexadecimal digits
represents the character whose Unicode code point is identified by those
digits.
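For illustration, here is Juha's literal written with UCNs; the characters
U+00E4, U+00F6, U+00E5 (ä ö å) are only stand-ins, since the originals were
mangled in transit. Written this way, the literal no longer depends on how
the compiler decodes the source file:
 
#include <string>
 
int main() {
    // UCNs instead of raw non-ASCII bytes in the source file.
    std::wstring str = L"non-ascii chars: \u00e4\u00f6\u00e5";
    (void)str;
}
 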
Here are some key things to keep in mind when using UCNs:
 
5.2p1: during translation phase 1, the implementation is required to
convert any source file character that is not in the basic source
character set into the corresponding UCN.
5.2p2: Interrupting a UCN with an escaped new-line has undefined behavior.
5.2p4: Creating something that looks like a UCN by using the ## operator
has undefined behavior.
5.2p5: During translation phase 5, UCNs are converted to the execution
character set.
5.3p2: A UCN whose hexadecimal digits don't represent a code point or
which represents a surrogate code point renders the program ill-formed.
A UCN that represents a control character or a member of the basic
character set renders the program ill-formed unless it occurs in a
character literal or string literal.
5.4p3: The conversion to UCNs is reverted in raw string literals.
5.10p1: UCNs are allowed in identifiers, but only if they fall into one
of the ranges listed in Table 2 of the standard.
5.13.3p8: Any UCN for which there is no corresponding member of the
execution character set is translated to an implementation-defined encoding.
5.13.5p13: A UCN occurring in a UTF-16 string literal may yield a
surrogate pair. A UCN occurring in a narrow string literal may map to
one or more char or char8_t elements.
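 
A small compile-time sketch of the last point (5.13.5p13), showing that one
UCN can expand to more than one code unit in a u8 literal:
 
// U+00E4 is two bytes in UTF-8, so the u8 literal holds two code units
// plus the terminating null character (char8_t in C++20, char in C++17).
static_assert(sizeof(u8"\u00e4") == 3, "one UCN -> two UTF-8 code units + NUL");
 
int main() {}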
 
Here's a more detailed explanation of what the standard says about this
situation:
The standard talks about three different implementation-defined
character sets:
* The physical source character set which is used in your source code file.
* The source character set which is used internally by the compiler
while processing your code.
* The execution character set used by your program when it is executed.
 
The standard talks about five different character encodings:
* The implementation-defined narrow and wide native encodings, used by
character constants and string literals with no prefix or with the "L"
prefix, respectively. These are stored in arrays of char and wchar_t,
respectively.
* The UTF-8, UTF-16, and UTF-32 encodings, used by character constants and
string literals with the u8, u, and U prefixes, respectively. These are
stored in arrays of char8_t, char16_t, and char32_t, respectively.
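 
A compile-time sketch of that mapping (needs C++20 for char8_t; in C++17 a
u8 literal's element type is plain char):
 
#include <type_traits>
 
// No prefix and L use the implementation-defined narrow and wide native
// encodings; u8, u and U use UTF-8, UTF-16 and UTF-32.
static_assert(std::is_same_v<decltype("x"),  const char(&)[2]>);
static_assert(std::is_same_v<decltype(L"x"), const wchar_t(&)[2]>);
static_assert(std::is_same_v<decltype(u8"x"), const char8_t(&)[2]>);
static_assert(std::is_same_v<decltype(u"x"),  const char16_t(&)[2]>);
static_assert(std::is_same_v<decltype(U"x"),  const char32_t(&)[2]>);
 
int main() {}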
 
Virtually every standard library template that handles characters is
required to support specializations for wchar_t, char8_t, char16_t, and
char32_t.
 
The standard mandates support for std::codecvt facets enabling
conversion between the narrow and wide native encodings, and facets for
converting between UTF-8 and either UTF-16 or UTF-32.
The standard specifies the <cuchar> header which incorporates routines
from the C standard library header <uchar.h> for converting between the
narrow native encoding and either UTF-16 or UTF-32.
Therefore, conversion between wchar_t and either char16_t or char32_t
requires three conversion steps.
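 
A sketch of the <cuchar> route from the narrow encoding to UTF-16. It
assumes the narrow locale is UTF-8 (the locale name "en_US.UTF-8" is a
guess and varies by platform), and the input bytes are spelled out
explicitly so the example does not depend on the source or execution
character set:
 
#include <cuchar>
#include <clocale>
#include <cstdio>
#include <cstring>
#include <vector>
 
int main() {
    std::setlocale(LC_ALL, "en_US.UTF-8");
    const char narrow[] = "non-ascii: \xC3\xA4";   // "ä" as explicit UTF-8 bytes
    std::vector<char16_t> utf16;
    std::mbstate_t state{};
    const char* p = narrow;
    const char* end = narrow + std::strlen(narrow);
    while (p < end) {
        char16_t c16;
        std::size_t n = std::mbrtoc16(&c16, p, end - p, &state);
        if (n == (std::size_t)-1 || n == (std::size_t)-2)
            break;                  // encoding error / incomplete sequence
        utf16.push_back(c16);
        if (n == (std::size_t)-3)
            continue;               // second half of a surrogate pair; no input consumed
        p += (n == 0 ? 1 : n);      // n == 0 means the null character was converted
    }
    std::printf("%zu UTF-16 code units\n", utf16.size());   // expect 12 here
}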
James Kuyper <jameskuyper@alumni.caltech.edu>: Dec 07 11:48AM -0500

On 12/7/21 10:44 AM, Alf P. Steinbach wrote:
...
> aware: the main problem is that the C and C++ standards do not conform
> to reality in their requirement that a `wchar_t` value should suffice to
> encode all possible code points in the wide character set.
 
The purpose of the C and C++ standards is prescriptive, not descriptive.
It's therefore missing the point to criticize them for not conforming to
reality. Rather, you should say that some popular implementations fail
to conform to the standards.
 
> some emojis etc. that appear as a single character and constitute one
> 21-bit code point, can become a pair of two `wchar_t` values, an UTF-16
> "surrogate pair".
 
The C++ standard explicitly addresses that point, though the C standard
does not.
Keith Thompson <Keith.S.Thompson+u@gmail.com>: Dec 07 09:41AM -0800


> * you use an encoding option such as `/utf-8`, or
> * the source is UTF-8 with BOM, or
> * the source is UTF-16.
 
What exactly do you mean by "Windows ANSI"? Windows-1252 or something
else? (Microsoft doesn't call it "ANSI", because it isn't.)
 
[...]
 
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */
"Alf P. Steinbach" <alf.p.steinbach@gmail.com>: Dec 07 06:59PM +0100

On 7 Dec 2021 17:48, James Kuyper wrote:
> It's therefore missing the point to criticize them for not conforming to
> reality. Rather, you should say that some popular implementations fail
> to conform to the standards.
 
No, in this case it's the standard's fault. They failed to standardize
existing practice and instead standardized a completely unreasonable
requirement, given that 16-bit `wchar_t` was established as the API
foundation in the most widely used desktop OS, something that
could not easily be changed. In particular this was the C standard
committee: their choice here was as reasonable and practical as their
choice of not supporting pointers outside of the original (sub-)array.
 
It was idiotic. These were simple blunders. But in both cases, as I recall,
they tried to cover up the blunder by writing a rationale; they took the
blunders to heart and made them into great obstacles, so as not to lose face.
 
 
>> "surrogate pair".
 
> The C++ standard explicitly addresses that point, though the C standard
> does not.
 
Happy to hear that but some more specific information would be welcome.
 
 
- Alf
"Alf P. Steinbach" <alf.p.steinbach@gmail.com>: Dec 07 07:07PM +0100

On 7 Dec 2021 18:41, Keith Thompson wrote:
 
> What exactly do you mean by "Windows ANSI"? Windows-1252 or something
> else? (Microsoft doesn't call it "ANSI", because it isn't.)
 
> [...]
 
"Windows ANSI" is the encoding specified by the `GetACP` API function,
which, but as I recall that's more or less undocumented, just serves up
the codepage number specified by registry value
 
Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage@ACP
 
This means that "Windows ANSI" is a pretty dynamic thing. Not just
system-dependent, but at-the-moment-configuration dependent. Though in
English-speaking countries it's Windows 1252 by default.
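 
A tiny Windows-only sketch that simply reports what that code page
currently is:
 
#include <windows.h>
#include <cstdio>
 
int main() {
    // On a default US/Western European system GetACP() typically returns 1252.
    std::printf("ANSI code page: %u\n", GetACP());
    std::printf("OEM  code page: %u\n", GetOEMCP());
}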
 
And that in turn means that using the defaults with Visual C++, you can
end up with pretty much any encoding whatsoever of narrow literals.
 
Which means that it's a good idea to take charge.
 
Option `/utf-8` is one way to take charge.
 
 
- Alf
Manfred <noname@add.invalid>: Dec 07 07:07PM +0100

On 12/7/2021 5:38 PM, James Kuyper wrote:
> * I remember hearing rumors that modern versions of Windows do provide
> some support for UTF-8, but that support is neither complete, nor the
> default. You have know what you need to do to enable such support - I don't.
 
One relevant and relatively recent addition is support for converting
to/from UTF-8 in the WideCharToMultiByte and MultiByteToWideChar APIs.
These allow handling UTF-8 programmatically in code.
Windows itself still uses UTF-16 internally.
I don't know how filenames are stored on disk.
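 
A Windows-only sketch of that usage, converting UTF-8 to UTF-16 with
MultiByteToWideChar and CP_UTF8 (error handling kept minimal):
 
#include <windows.h>
#include <string>
 
std::wstring Utf8ToWide(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    // First call asks for the required length, second call does the conversion.
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  utf8.data(), (int)utf8.size(), nullptr, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), (int)utf8.size(), wide.data(), len);
    return wide;
}
 
int main() {
    std::wstring w = Utf8ToWide("non-ascii: \xC3\xA4");   // "ä" as UTF-8 bytes
}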
Paavo Helde <eesnimi@osa.pri.ee>: Dec 07 08:32PM +0200

On 07.12.2021 19:41, Keith Thompson wrote:
 
> What exactly do you mean by "Windows ANSI"? Windows-1252 or something
> else? (Microsoft doesn't call it "ANSI", because it isn't.)
 
It does. From
https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp
 
"Retrieves the current Windows ANSI code page identifier for the
operating system."
 
This is in contrast to the GetOEMCP() function which is said to return
"OEM code page", not "ANSI code page". Both terms are misnomers from the
previous century.
 
Both these codepage settings traditionally refer to some narrow char
codepage identifiers, which will vary depending on the user's regional
settings and are thus unpredictable and unusable for basically anything
related to internationalization.
 
The only meaningful strategy is to set these both to UTF-8 which now
finally has some (beta stage?) support in Windows 10, and to upgrade all
affected software to properly support this setting.
James Kuyper <jameskuyper@alumni.caltech.edu>: Dec 07 01:55PM -0500

On 12/7/21 12:59 PM, Alf P. Steinbach wrote:
> On 7 Dec 2021 17:48, James Kuyper wrote:
>> On 12/7/21 10:44 AM, Alf P. Steinbach wrote:
...
> could not easily be changed. In particular this was the C standard
> committee: their choice here was as reasonable and practical as their
> choice of not supporting pointers outside of original (sub-) array.
 
It was existing practice. From the very beginning, wchar_t was supposed
to be "an integral type whose range of values can represent distinct
codes for all members of the largest extended character set specified
among the supported locales". When char32_t was added to the language,
moving that specification to char32_t might have been a reasonable thing
to do, but continuing to apply that specification to wchar_t was NOT an
innovation. The same version of the standard that added char32_t also
added char16_t, which is what should now be used for UTF-16 encoding,
not wchar_t.
 
It's an abuse of what wchar_t was intended for, to use it for a
variable-length encoding. None of the functions in the C or C++ standard
library for dealing with wchar_t values has ever had the right kind of
interface to allow it to be used as a variable-length encoding. To see
what I'm talking about, look at the mbrto*() and *tomb() functions from
the C standard library, that have been incorporated by reference into
the C++ standard library. Those functions do have interfaces designed to
handle a variable-length encoding.
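 
To make that interface contrast concrete, a sketch assuming a UTF-8 narrow
locale is available (the name "en_US.UTF-8" is a guess): mbrtowc consumes a
variable number of bytes per call but always produces exactly one wchar_t,
and its return value reports how many bytes were used.
 
#include <cwchar>
#include <clocale>
#include <cstdio>
 
int main() {
    std::setlocale(LC_ALL, "en_US.UTF-8");
    std::mbstate_t state{};
    wchar_t wc;
    // "ä" spelled out as its two UTF-8 bytes:
    std::size_t used = std::mbrtowc(&wc, "\xC3\xA4", 2, &state);
    std::printf("bytes consumed: %zu\n", used);   // 2: one character, two bytes
    // There is no analogous function that consumes "one or two wchar_t"
    // to produce a character, which is the point about wchar_t above.
}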
 
...
 
>> The C++ standard explicitly addresses that point, though the C standard
>> does not.
 
> Happy to hear that but some more specific information would be welcome.
 
5.3p2:
"A universal-character-name designates the character in ISO/IEC 10646
(if any) whose code point is the hexadecimal number represented by the
sequence of hexadecimal-digits in the universal-character-name. The
program is ill-formed if that number ... is a surrogate code point. ...
A surrogate code point is a value in the range [D800, DFFF] (hexadecimal)."
 
5.13.5p8: "[Note: A single c-char may produce more than one char16_t
character in the form of surrogate pairs. A surrogate pair is a
representation for a single code point as a sequence of two 16-bit code
units. — end note]"
 
5.13.5p13: "a universal-character-name in a UTF-16 string literal may
yield a surrogate pair. ... The size of a UTF-16 string literal is the
total number of escape sequences, universal-character-names, and other
characters, plus one for each character requiring a surrogate pair, plus
one for the terminating u'\0'."
 
Note that it's UTF-16, which should be encoded using char16_t, for which
this issue is acknowledged. wchar_t is not, and never was, supposed to
be a variable-length encoding like UTF-8 and UTF-16.
James Kuyper <jameskuyper@alumni.caltech.edu>: Dec 07 01:55PM -0500

On 12/7/21 1:32 PM, Paavo Helde wrote:
 
> The only meaningful strategy is to set these both to UTF-8 which now
> finally has some (beta stage?) support in Windows 10, and to upgrade all
> affected software to properly support this setting.
 
Note that it was referred to as "ANSI" because Microsoft proposed it for
ANSI standardization, but that proposal was never approved. Continuing
to refer to it as "ANSI" decades later is a rather sad failure to
acknowledge that rejection.
Keith Thompson <Keith.S.Thompson+u@gmail.com>: Dec 07 12:21PM -0800

> literals.
 
> Which means that it's a good idea to take charge.
 
> Option `/utf-8` is one way to take charge.
 
It appears my previous statement was incorrect. At least some Microsoft
documentation does still (incorrectly) refer to "Windows ANSI".
 
https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp
 
The history, as I recall, is that Microsoft proposed one or more 8-bit
extensions of the 7-bit ASCII character set as ANSI standards.
Windows-1252, which has various accented letters and other symbols in
the range 128-255, is the best known variant. But Microsoft's proposal
was never adopted by ANSI, leaving us with a bunch of incorrect
documentation. Instead, ISO created the 8859-* 8-bit character sets,
including 8859-1, or Latin-1. Latin-1 differs from Windows-1252 in that
Latin-1 has control characters in the range 128-159, while
Windows-1252 has printable characters.
 
https://en.wikipedia.org/wiki/Windows-1252
 
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */
Keith Thompson <Keith.S.Thompson+u@gmail.com>: Dec 07 12:26PM -0800

> https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp
 
> "Retrieves the current Windows ANSI code page identifier for the
> operating system."
 
Yes, I had missed that.
 
But Microsoft has also said:
 
The term ANSI as used to signify Windows code pages is a historical
reference, but is nowadays a misnomer that continues to persist in
the Windows community.
 
https://en.wikipedia.org/wiki/Windows-1252
https://web.archive.org/web/20150204175931/http://download.microsoft.com/download/5/6/8/56803da0-e4a0-4796-a62c-ca920b73bb17/21-Unicode_WinXP.pdf
 
Microsoft's mistake was to start using the term "ANSI" before it
actually became an ANSI standard. Once that mistake was in place,
cleaning it up was very difficult.
 
 
> The only meaningful strategy is to set these both to UTF-8 which now
> finally has some (beta stage?) support in Windows 10, and to upgrade
> all affected software to properly support this setting.
 
Yes, I advocate using UTF-8 whenever practical.
 
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */
Bonita Montero <Bonita.Montero@gmail.com>: Dec 07 06:27AM +0100

On 06.12.2021 at 19:02, Scott Lurndal wrote:
 
> cache to ensure exclusive access for the add, while
> others will pass the entire operation to the last-level cache
> where it is executed atomically.
 
There's for sure no architecture that does atomic operations in
the last level cache because this would be silly.
 
> In any case, a couple dozen line assembler program would be a
> far better test than your overly complicated C++.
 
No, it wouldn't give better results, and the code would be
orders of magnitude longer if it did the same thing.
scott@slp53.sl.home (Scott Lurndal): Dec 07 06:06PM

>> where it is executed atomically.
 
>There's for sure no architecture that does atomic operations in
>the last level cache because this would be silly.
 
Well, are you sure? Why do you think it would be silly?
 
https://genzconsortium.org/wp-content/uploads/2019/04/Gen-Z-Atomics-2019.pdf
 
Given that at least three high-end processor chips have taped out just
this year with the capability of executing "far" atomic operations in
the LLC (or to a PCI Express Root complex host bridge), I think you really don't have
a clue what you are talking about.
Bonita Montero <Bonita.Montero@gmail.com>: Dec 07 07:14PM +0100

On 07.12.2021 at 19:06, Scott Lurndal wrote:
> this year with the capability of executing "far" atomic operations in
> the LLC (or to a PCI Express Root complex host bridge), I think you
> really don't have a clue what you are talking about.
 
And which CPUs currently support this Gen-Z interconnect?
And which CPUs currently use these far atomics for thread
synchronization? None.
Did you really read the paper and note what Gen-Z is?
No.
scott@slp53.sl.home (Scott Lurndal): Dec 07 06:49PM

>> the LLC (or to a PCI Express Root complex host bridge), I think you
>> really don't have a clue what you are talking about.
 
>And which CPUs currently support this Gen-Z interconnect ?
 
I'd tell you, but various NDAs forbid it.
 
>And which CPUs currently use these far atomics for thread
>synchronization? None.
 
How do you know? I'm aware of three. Two sampling to
customers, with core counts from 8 to 64. A handful of others
are in development by several processor vendors as I
write this.
 
>Did you really read the paper and noted what Gen-Z is ?
 
I know exactly what it is, and I know what CXL is as well,
both being part of my day job. And if you don't think Intel
is designing all of their server CPUs to be CXL [*] compatible,
you're not thinking.
 
[*] "In November 2021 the CXL Consortium and the GenZ Consortium
signed a letter of intent for Gen-Z to transfer its specifications
and assets to CXL, leaving CXL as the sole industry standard moving
forward"
Bonita Montero <Bonita.Montero@gmail.com>: Dec 07 07:54PM +0100

On 07.12.2021 at 19:49, Scott Lurndal wrote:
>>> really don't have a clue what you are talking about.
 
>> And which CPUs currently support this Gen-Z interconnect ?
 
> I'd tell you, but various NDA's forbid.
 
LOOOOOOOL.
 
>> And which CPUs currently use these far atomics for thread
>> synchronization? None.
 
> How do you know?
 
Because this would be slower, since the lock modifications
wouldn't be done in the L1 caches but in far memory. That's
just a silly idea.
 
> I'm aware of three. Two sampling to customers, with core
> counts from 8 to 64.
 
And you can't tell it because of NDAs. Hrhr.
scott@slp53.sl.home (Scott Lurndal): Dec 07 07:14PM


>Because this would be slower, since the lock modifications
>wouldn't be done in the L1 caches but in far memory. That's
>just a silly idea.
 
Hello, it's a cache-coherent multiprocessor. You need to
fetch it exclusively into the L1 first, so instead of sending the fetch
(or an invalidate, if converting a shared line to owned),
you send the atomic op and it gets handled atomically at
the far end (e.g. LLC, PCI Express device, SoC coprocessor),
saving interconnect (mesh, ring, whatever) bandwidth and
the round-trip time between L1 and LLC, and reducing contention
for the line.
 
If it's already in the L1 cache, then the processor will
automatically treat it as a near-atomic; this is expected
to be a rare case with correctly designed atomic usage.
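 
From the C++ side none of this is visible: the same std::atomic code can be
lowered to either a near or a far atomic, depending on the hardware (e.g.
Arm LSE STADD-style atomics). A minimal sketch:
 
#include <atomic>
#include <thread>
#include <vector>
#include <cstdio>
 
int main() {
    // Whether fetch_add is performed in the local L1 line ("near") or shipped
    // to the LLC / interconnect ("far") is a hardware implementation detail.
    std::atomic<unsigned long> counter{0};
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back([&counter] {
            for (int j = 0; j < 100000; ++j)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& t : threads) t.join();
    std::printf("%lu\n", counter.load());   // 400000
}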
scott@slp53.sl.home (Scott Lurndal): Dec 07 07:25PM


>If it's already in the L1 cache, then the processor will
>automatically treat it as a near-atomic; this is expected
>to be a rare case with correctly designed atomic usage.
 
In case you need a public reference for a shipping processor:
 
https://developer.arm.com/documentation/102099/0000/L1-data-memory-system/Instruction-implementation-in-the-L1-data-memory-system