| Juha Nieminen <nospam@thanks.invalid>: Dec 07 11:37AM

Recently I stumbled across a problem where I had wide string literals with non-ascii characters UTF-8 encoded. In other words, I had code like this (I'm using non-ascii in the code below, I hope it doesn't get mangled up, but even if it does, it should nevertheless be clear what I'm trying to express):

std::wstring str = L"non-ascii chars: ???";

The C++ source file itself uses UTF-8 encoding, meaning that that line of code is likewise UTF-8 encoded. If it were a narrow string literal (being assigned to a std::string) then it works just fine (primarily because the compiler doesn't need to do anything to it, it can simply take those bytes from the source file as is). However, since it's a wide string literal (being assigned to a std::wstring) it's not as clear-cut anymore. What does the standard say about this situation?

The thing is that it works just fine in Linux using gcc. The compiler will re-encode the UTF-8 encoded characters in the source file inside the parentheses into whatever encoding wide char strings use, so the correct content will end up in the executable binary (and thus in the wstring).

Apparently it does not work correctly in (some recent version of) Visual Studio, where apparently it just takes the byte values from the source file within the parentheses as-is, and just assigns those values as-is to the wide chars that end up in the binary. (Or something like that.)

Does the standard specify what the compiler should do in this situation? If not, then what is the proper way of specifying wide string literals that contain non-ascii characters? |
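For concreteness, a minimal sketch of the setup being described, assuming the source file is saved as UTF-8 without a BOM; 'ä'/'ö' are stand-ins for the characters that got mangled above:

// Minimal sketch (hypothetical file, saved as UTF-8 without BOM).
// The narrow literal's bytes can be copied through unchanged, but for the
// wide literal the compiler must first decode the source bytes, so the
// result depends on which source encoding the compiler assumes.
#include <iostream>
#include <string>

int main() {
    std::string  narrow = "non-ascii chars: \xC3\xA4\xC3\xB6"; // the UTF-8 bytes of "äö", spelled out
    std::wstring wide   = L"non-ascii chars: äö";              // compiler must re-encode to wchar_t

    std::cout << "narrow size in bytes: " << narrow.size() << '\n';
    std::cout << "wide size in wchar_t units: " << wide.size() << '\n';
}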
| "Alf P. Steinbach" <alf.p.steinbach@gmail.com>: Dec 07 04:44PM +0100 On 7 Dec 2021 12:37, Juha Nieminen wrote: > Visual Studio, where apparently it just takes the byte values from the > source file within the parentheses as-is, and just assigns those values > as-is to the wide chars that end up in the binary. (Or something like that.) The Visual C++ compiler assumes that source code is Windows ANSI encoded unless * you use an encoding option such as `/utf8`, or * the source is UTF-8 with BOM, or * the source is UTF-16. Independently of that Visual C++ assumes that the execution character set (the byte-based encoding that should be used for text data in the executable) is Windows ANSI, unless it's specified as something else. The `/utf8` option specifies also that. It's a combo option that specifies both source encoding and execution character set as UTF-8. Unfortunately as of VS 2022 `/utf8` is not set by default in a VS project, and unfortunately there's nothing you can just click to set it. You have to to type it in (right click project then) Properties -> C/C++ -> Command Line. I usually set "/utf-8 /Zc:__cplusplus". > Does the standard specify what the compiler should do in this situation? > If not, then what is the proper way of specifying wide string literals > that contain non-ascii characters? I'll let others discuss that, but (1) it does, and (2) just so you're aware: the main problem is that the C and C++ standards do not conform to reality in their requirement that a `wchar_t` value should suffice to encode all possible code points in the wide character set. In Windows wide text is UTF-16, with 16-bit `wchar_t`. Which means that some emojis etc. that appear as a single character and constitute one 21-bit code point, can become a pair of two `wchar_t` values, an UTF-16 "surrogate pair". That's probably not your problem though, but it is a/the problem. - ALf |
| James Kuyper <jameskuyper@alumni.caltech.edu>: Dec 07 11:38AM -0500

On 12/7/21 6:37 AM, Juha Nieminen wrote:
> Visual Studio, where apparently it just takes the byte values from the
> source file within the parentheses as-is, and just assigns those values
> as-is to the wide chars that end up in the binary. (Or something like that.)
> Does the standard specify what the compiler should do in this situation?

The standard says a great many things about it, but the most important things it says are that the relevant character sets and encodings are implementation-defined. If an implementation uses UTF-8 for its native character encoding, your code should work fine. The most likely explanation why it doesn't work is that your UTF-8 encoded source code file is being interpreted using some other encoding, probably ASCII or one of its many variants.

I have relatively little experience programming for Windows, and essentially none with internationalization. Therefore, the following comments about Windows all convey second- or third-hand information, and should be treated accordingly. Many people posting on this newsgroup know more than I do about such things - hopefully someone will correct any errors I make:

* When Unicode first came out, Windows chose to use UCS-2 to support it, and made that its default character encoding.
* When Unicode expanded beyond the capacity of UCS-2, Windows decided to transition over to using UTF-16. There was an annoyingly long transition period during which some parts of Windows used UTF-16, while other parts still used UCS-2. I cannot confirm whether or not that transition period has completed yet.
* I remember hearing rumors that modern versions of Windows do provide some support for UTF-8, but that support is neither complete nor the default. You have to know what you need to do to enable such support - I don't.

> If not, then what is the proper way of specifying wide string literals
> that contain non-ascii characters?

The most portable way of doing it is to use what the standard calls Universal Character Names, or UCNs for short. "\u" followed by 4 hexadecimal digits represents the character whose code point is identified by those digits. "\U" followed by eight hexadecimal digits represents the character whose Unicode code point is identified by those digits.

Here are some key things to keep in mind when using UCNs:

* 5.2p1: During translation phase 1, the implementation is required to convert any source file character that is not in the basic source character set into the corresponding UCN.
* 5.2p2: Interrupting a UCN with an escaped new-line has undefined behavior.
* 5.2p4: Creating something that looks like a UCN by using the ## operator has undefined behavior.
* 5.2p5: During translation phase 5, UCNs are converted to the execution character set.
* 5.3p2: A UCN whose hexadecimal digits don't represent a code point, or which represents a surrogate code point, renders the program ill-formed. A UCN that represents a control character or a member of the basic character set renders the program ill-formed unless it occurs in a character literal or string literal.
* 5.4p3: The conversion to UCNs is reverted in raw string literals.
* 5.10p1: UCNs are allowed in identifiers, but only if they fall into one of the ranges listed in Table 2 of the standard.
* 5.13.3p8: Any UCN for which there is no corresponding member of the execution character set is translated to an implementation-defined encoding.
* 5.13.5p13: A UCN occurring in a UTF-16 string literal may yield a surrogate pair. A UCN occurring in a narrow string literal may map to one or more char or char8_t elements.

Here's a more detailed explanation of what the standard says about this situation. The standard talks about three different implementation-defined character sets:

* The physical source character set, which is used in your source code file.
* The source character set, which is used internally by the compiler while processing your code.
* The execution character set, used by your program when it is executed.

The standard talks about 5 different character encodings:

* The implementation-defined narrow and wide native encodings used by character constants and string literals with no prefix, or with the "L" prefix, respectively. These are stored in arrays of char and wchar_t, respectively.
* The UTF-8, UTF-16, and UTF-32 encodings used by character constants with u8, u, and U prefixes, respectively. These are stored in arrays of char8_t, char16_t, and char32_t, respectively.

Virtually every standard library template that handles characters is required to support specializations for wchar_t, char8_t, char16_t, and char32_t. The standard mandates support for std::codecvt facets enabling conversion between the narrow and wide native encodings, and facets for converting between UTF-8 and either UTF-16 or UTF-32. The standard specifies the <cuchar> header, which incorporates routines from the C standard library header <uchar.h> for converting between the narrow native encoding and either UTF-16 or UTF-32. Therefore, conversion between wchar_t and either char16_t or char32_t requires three conversion steps. |
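A minimal sketch of the UCN approach for the original wide-literal question; the particular code points (U+00E4, U+00F6, U+1F642) are illustrative stand-ins:

#include <string>

int main() {
    // Spelling the characters as universal-character-names keeps the source
    // file pure ASCII, so the result no longer depends on how the compiler
    // guesses the source file's encoding.
    std::wstring str  = L"non-ascii chars: \u00E4\u00F6";   // ä ö
    std::wstring wide = L"outside the BMP: \U0001F642";     // may become a surrogate pair where wchar_t is 16 bits
    (void)str;
    (void)wide;
}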
| James Kuyper <jameskuyper@alumni.caltech.edu>: Dec 07 11:48AM -0500 On 12/7/21 10:44 AM, Alf P. Steinbach wrote: ... > aware: the main problem is that the C and C++ standards do not conform > to reality in their requirement that a `wchar_t` value should suffice to > encode all possible code points in the wide character set. The purpose of the C and C++ standards is prescriptive, not descriptive. It's therefore missing the point to criticize them for not conforming to reality. Rather, you should say that some popular implementations fail to conform to the standards. > some emojis etc. that appear as a single character and constitute one > 21-bit code point, can become a pair of two `wchar_t` values, an UTF-16 > "surrogate pair". The C++ standard explicitly addresses that point, though the C standard does not. |
| Keith Thompson <Keith.S.Thompson+u@gmail.com>: Dec 07 09:41AM -0800 > * you use an encoding option such as `/utf8`, or > * the source is UTF-8 with BOM, or > * the source is UTF-16. What exactly do you mean by "Windows ANSI"? Windows-1252 or something else? (Microsoft doesn't call it "ANSI", because it isn't.) [...] -- Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com Working, but not speaking, for Philips void Void(void) { Void(); } /* The recursive call of the void */ |
| "Alf P. Steinbach" <alf.p.steinbach@gmail.com>: Dec 07 06:59PM +0100 On 7 Dec 2021 17:48, James Kuyper wrote: > It's therefore missing the point to criticize them for not conforming to > reality. Rather, you should say that some popular implementations fail > to conform to the standards. No, in this case it's the standard's fault. They failed to standardize existing practice and instead standardized a completely unreasonable requirement, given that 16-bit `wchar_t` was established as the API foundation in the most widely used OS on the platform, something that could not easily be changed. In particular this was the C standard committee: their choice here was as reasonable and practical as their choice of not supporting pointers outside of original (sub-) array. It was idiotic. It was simple blunders. But inn both cases, as I recall, they tried to cover up the blunder by writing a rationale; they took the blunders to heart and made them into great obstacles, to not lose face. >> "surrogate pair". > The C++ standard explicitly addresses that point, though the C standard > does not. Happy to hear that but some more specific information would be welcome. - Alf |
| "Alf P. Steinbach" <alf.p.steinbach@gmail.com>: Dec 07 07:07PM +0100 On 7 Dec 2021 18:41, Keith Thompson wrote: > What exactly do you mean by "Windows ANSI"? Windows-1252 or something > else? (Microsoft doesn't call it "ANSI", because it isn't.) > [...] "Windows ANSI" is the encoding specified by the `GetACP` API function, which, but as I recall that's more or less undocumented, just serves up the codepage number specified by registry value Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage@ACP This means that "Windows ANSI" is a pretty dynamic thing. Not just system-dependent, but at-the-moment-configuration dependent. Though in English-speaking countries it's Windows 1252 by default. And that in turn means that using the defaults with Visual C++, you can end up with pretty much any encoding whatsoever of narrow literals. Which means that it's a good idea to take charge. Option `/utf8` is one way to take charge. - Alf |
| Manfred <noname@add.invalid>: Dec 07 07:07PM +0100

On 12/7/2021 5:38 PM, James Kuyper wrote:
> * I remember hearing rumors that modern versions of Windows do provide
> some support for UTF-8, but that support is neither complete, nor the
> default. You have know what you need to do to enable such support - I don't.

One relevant addition that is relatively recent is support for conversion to/from UTF-8 in the APIs WideCharToMultiByte and MultiByteToWideChar. These allow handling UTF-8 programmatically in code. Windows itself still uses UTF-16 internally. I don't know how filenames are stored on disk. |
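A hedged sketch of the MultiByteToWideChar direction (UTF-8 to UTF-16); error handling is trimmed, and the helper name is just for illustration:

// Windows-only sketch: convert a UTF-8 std::string to a UTF-16 std::wstring.
#include <windows.h>
#include <string>

std::wstring utf8_to_wide(const std::string& utf8) {
    if (utf8.empty()) return {};
    // First call: ask how many wchar_t units the converted text needs.
    int units = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                    utf8.data(), (int)utf8.size(), nullptr, 0);
    if (units <= 0) return {};  // invalid UTF-8 (or other failure)
    std::wstring wide(units, L'\0');
    // Second call: perform the conversion into the prepared buffer.
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), (int)utf8.size(), &wide[0], units);
    return wide;
}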
| Paavo Helde <eesnimi@osa.pri.ee>: Dec 07 08:32PM +0200 07.12.2021 19:41 Keith Thompson kirjutas: > What exactly do you mean by "Windows ANSI"? Windows-1252 or something > else? (Microsoft doesn't call it "ANSI", because it isn't.) It does. From https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp "Retrieves the current Windows ANSI code page identifier for the operating system." This is in contrast to the GetOEMCP() function which is said to return "OEM code page", not "ANSI code page". Both terms are misnomers from the previous century. Both these codepage settings traditionally refer to some narrow char codepage identifiers, which will vary depending on the user regional settings and are thus unpredictable and unusable for basically anything related to internationalization. The only meaningful strategy is to set these both to UTF-8 which now finally has some (beta stage?) support in Windows 10, and to upgrade all affected software to properly support this setting. |
| James Kuyper <jameskuyper@alumni.caltech.edu>: Dec 07 01:55PM -0500

On 12/7/21 12:59 PM, Alf P. Steinbach wrote:
> On 7 Dec 2021 17:48, James Kuyper wrote:
>> On 12/7/21 10:44 AM, Alf P. Steinbach wrote:
...
> could not easily be changed. In particular this was the C standard
> committee: their choice here was as reasonable and practical as their
> choice of not supporting pointers outside of original (sub-) array.

It was existing practice. From the very beginning, wchar_t was supposed to be "an integral type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales". When char32_t was added to the language, moving that specification to char32_t might have been a reasonable thing to do, but continuing to apply that specification to wchar_t was NOT an innovation.

The same version of the standard that added char32_t also added char16_t, which is what should now be used for UTF-16 encoding, not wchar_t. It's an abuse of what wchar_t was intended for, to use it for a variable-length encoding. None of the functions in the C or C++ standard library for dealing with wchar_t values has ever had the right kind of interface to allow it to be used as a variable-length encoding. To see what I'm talking about, look at the mbrto*() and *tomb() functions from the C standard library, that have been incorporated by reference into the C++ standard library. Those functions do have interfaces designed to handle a variable-length encoding.

...
>> The C++ standard explicitly addresses that point, though the C standard
>> does not.
> Happy to hear that but some more specific information would be welcome.

5.3p2: "A universal-character-name designates the character in ISO/IEC 10646 (if any) whose code point is the hexadecimal number represented by the sequence of hexadecimal-digits in the universal-character-name. The program is ill-formed if that number ... is a surrogate code point. ... A surrogate code point is a value in the range [D800, DFFF] (hexadecimal)."

5.13.5p8: "[Note: A single c-char may produce more than one char16_t character in the form of surrogate pairs. A surrogate pair is a representation for a single code point as a sequence of two 16-bit code units. — end note]"

5.13.5p13: "a universal-character-name in a UTF-16 string literal may yield a surrogate pair. ... The size of a UTF-16 string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u'\0'."

Note that it's UTF-16, which should be encoded using char16_t, for which this issue is acknowledged. wchar_t is not, and never was, supposed to be a variable-length encoding like UTF-8 and UTF-16. |
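To make the "right kind of interface" point concrete, here is a sketch using mbrtoc16 from <cuchar>; it assumes a locale whose narrow encoding is UTF-8, and the helper name is illustrative only:

#include <clocale>
#include <cstring>
#include <cuchar>
#include <string>

// Converts a narrow (multibyte) string to UTF-16, one variable-length
// character at a time, via mbrtoc16's stateful interface.
std::u16string narrow_to_utf16(const char* s) {
    std::u16string out;
    std::mbstate_t state{};
    const char* p = s;
    std::size_t remaining = std::strlen(s) + 1;  // include the final '\0'
    while (remaining > 0) {
        char16_t c16;
        std::size_t rc = std::mbrtoc16(&c16, p, remaining, &state);
        if (rc == static_cast<std::size_t>(-3)) { out += c16; continue; } // 2nd unit of a surrogate pair, no input consumed
        if (rc == static_cast<std::size_t>(-2) ||
            rc == static_cast<std::size_t>(-1) || rc == 0) break;         // incomplete, invalid, or '\0' reached
        out += c16;
        p += rc;
        remaining -= rc;
    }
    return out;
}

int main() {
    std::setlocale(LC_ALL, "en_US.UTF-8");  // assumption: this locale name is available
    std::u16string u = narrow_to_utf16("example text");
    (void)u;
}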
| James Kuyper <jameskuyper@alumni.caltech.edu>: Dec 07 01:55PM -0500 On 12/7/21 1:32 PM, Paavo Helde wrote: > The only meaningful strategy is to set these both to UTF-8 which now > finally has some (beta stage?) support in Windows 10, and to upgrade all > affected software to properly support this setting. Note that it was referred to as "ANSI" because Microsoft proposed it for ANSI standardization, but that proposal was never approved. Continuing to refer to it as "ANSI" decades later is a rather sad failure to acknowledge that rejection. |
| Keith Thompson <Keith.S.Thompson+u@gmail.com>: Dec 07 12:21PM -0800

> literals.
> Which means that it's a good idea to take charge.
> Option `/utf8` is one way to take charge.

It appears my previous statement was incorrect. At least some Microsoft documentation does still (incorrectly) refer to "Windows ANSI".
https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp

The history, as I recall, is that Microsoft proposed one or more 8-bit extensions of the 7-bit ASCII character set as ANSI standards. Windows-1252, which has various accented letters and other symbols in the range 128-255, is the best known variant. But Microsoft's proposal was never adopted by ANSI, leaving us with a bunch of incorrect documentation. Instead, ISO created the 8859-* 8-bit character sets, including 8859-1, or Latin-1. Latin-1 differs from Windows-1252 in that Latin-1 has control characters in the range 128-159, while Windows-1252 has printable characters.

https://en.wikipedia.org/wiki/Windows-1252

-- 
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */ |
| Keith Thompson <Keith.S.Thompson+u@gmail.com>: Dec 07 12:26PM -0800

> https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp
> "Retrieves the current Windows ANSI code page identifier for the
> operating system."

Yes, I had missed that. But Microsoft has also said:

    The term ANSI as used to signify Windows code pages is a historical
    reference, but is nowadays a misnomer that continues to persist in the
    Windows community.

https://en.wikipedia.org/wiki/Windows-1252
https://web.archive.org/web/20150204175931/http://download.microsoft.com/download/5/6/8/56803da0-e4a0-4796-a62c-ca920b73bb17/21-Unicode_WinXP.pdf

Microsoft's mistake was to start using the term "ANSI" before it actually became an ANSI standard. Once that mistake was in place, cleaning it up was very difficult.

> The only meaningful strategy is to set these both to UTF-8 which now
> finally has some (beta stage?) support in Windows 10, and to upgrade
> all affected software to properly support this setting.

Yes, I advocate using UTF-8 whenever practical.

-- 
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */ |
| Bonita Montero <Bonita.Montero@gmail.com>: Dec 07 06:27AM +0100 On 06.12.2021 at 19:02, Scott Lurndal wrote: > cache to ensure exclusive access for the add, while > others will pass the entire operation to the last-level cache > where it is executed atomically. There's for sure no architecture that does atomic operations in the last level cache because this would be silly. > In any case, a couple dozen line assembler program would be a > far better test than your overly complicated C++. No, it wouldn't give better results, and the code would be orders of magnitude longer if it did the same. |
| scott@slp53.sl.home (Scott Lurndal): Dec 07 06:06PM >> where it is executed atomically. >There's for sure no architecture that does atomic operations in >the last level cache because this would be silly. Well, are you sure? Why do you think it would be silly? https://genzconsortium.org/wp-content/uploads/2019/04/Gen-Z-Atomics-2019.pdf Given that at least three high-end processor chips have taped out just this year with the capability of executing "far" atomic operations in the LLC (or to a PCI Express Root complex host bridge), I think you really don't have a clue what you are talking about. |
| Bonita Montero <Bonita.Montero@gmail.com>: Dec 07 07:14PM +0100 On 07.12.2021 at 19:06, Scott Lurndal wrote: > this year with the capability of executing "far" atomic operations in > the LLC (or to a PCI Express Root complex host bridge), I think you > really don't have a clue what you are talking about. And which CPUs currently support this Gen-Z interconnect? And which CPUs currently use these far atomics for thread synchronization? None. Did you really read the paper and note what Gen-Z is? No. |
| scott@slp53.sl.home (Scott Lurndal): Dec 07 06:49PM >> the LLC (or to a PCI Express Root complex host bridge), I think you >> really don't have a clue what you are talking about. >And which CPUs currently support this Gen-Z interconnect ? I'd tell you, but various NDA's forbid. >And which CPUs currently use this far atomics for thread >-synchronitation - none. How do you know? I'm aware of three. Two sampling to customers, with core counts from 8 to 64. A handful of others are in development by several processor vendors as I write this. >Did you really read the paper and noted what Gen-Z is ? I know exactly what it is, and I know what CXL is as well, both being part of my day job. And if you don't think Intel is designing all of their server CPUs to be CXL [*] compatible, you're not thinking. [*] "In November 2021 the CXL Consortium and the GenZ Consortium signed a letter of intent for Gen-Z to transfer its specifications and assets to CXL, leaving CXL as the sole industry standard moving forward" |
| Bonita Montero <Bonita.Montero@gmail.com>: Dec 07 07:54PM +0100 On 07.12.2021 at 19:49, Scott Lurndal wrote: >>> really don't have a clue what you are talking about. >> And which CPUs currently support this Gen-Z interconnect ? > I'd tell you, but various NDA's forbid. LOOOOOOOL. >> And which CPUs currently use this far atomics for thread >> -synchronitation - none. > How do you know? Because this would be slower since the lock-modifications wouldn't be done in the L1-caches but in far memory. That's just a silly idea. > I'm aware of three. Two sampling to customers, with core > counts from 8 to 64. And you can't tell it because of NDAs. Hrhr. |
| scott@slp53.sl.home (Scott Lurndal): Dec 07 07:14PM >Because this would be slower since the lock-modifications >woudln't be done in the L1-caches but in far memory. That's >just a silly idea. Hello, it's a cache-coherent multiprocessor. You need to fetch it exclusively into the L1 first, so instead of sending the fetch (or invalidate if converting a shared line to owned), you send the atomic op and it gets handled atomically at the far end (e.g. LLC, PCI express device, SoC coprocessor) saving the interconnect (mesh, ring, whatever) bandwidth and the round-trip time between L1 and LLC and reducing contention for the line. If it's already in the L1 cache, then the processor will automatically treat it as a near-atomic, this is expected to be a rare case with correctly designed atomic usage. |
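From C++ this is all behind std::atomic; a hedged sketch of the kind of operation that lends itself to being executed "far" (the codegen remarks are assumptions about ARMv8.1+ LSE, not guarantees):

#include <atomic>

std::atomic<unsigned long> counter{0};

void record_hit() {
    // Relaxed fetch_add with the result discarded: the core does not need the
    // old value back, so on ARMv8.1+ with LSE this typically compiles to a
    // single STADD, which an implementation may forward to the interconnect
    // (LLC or device) as a "far" atomic instead of pulling the line into L1.
    counter.fetch_add(1, std::memory_order_relaxed);
}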
| scott@slp53.sl.home (Scott Lurndal): Dec 07 07:25PM >If it's already in the L1 cache, then the processor will >automatically treat it as a near-atomic, this is expected >to be a rare case with correctly designed atomic usage. In case you need a public reference for a shipping processor: https://developer.arm.com/documentation/102099/0000/L1-data-memory-system/Instruction-implementation-in-the-L1-data-memory-system |