Friday, July 2, 2021

Digest for comp.lang.c++@googlegroups.com - 16 updates in 3 topics

Juha Nieminen <nospam@thanks.invalid>: Jul 02 05:26AM

> the strings as arrays of char, char16_t, or char32_t, respectively. Such
> literals are guaranteed to be in UTF-8, UTF-16, or UTF-32 encoding,
> respectively.
 
Not my choice in this case.
James Kuyper <jameskuyper@alumni.caltech.edu>: Jul 02 01:44AM -0400

On 7/2/21 1:26 AM, Juha Nieminen wrote:
>> literals are guaranteed to be in UTF-8, UTF-16, or UTF-32 encoding,
>> respectively.
 
> Not my choice in this case.
 
Recent messages reminded me of Windows' strong incentives to use
wchar_t, something I've thankfully had no experience with. As a result,
what I'm about to say may be incorrect - but it seems to me that a
conforming implementation of C++ targeting Windows should have wchar_t
be the same as char16_t, so you should be able to freely use u-prefixed
string literals and char16_t with code that needs to be portable to
Windows, but can also be used on other platforms. If your code didn't
need to be portable to other platforms, you could, by definition, rely
upon Windows' own guarantees about wchar_t.
Bo Persson <bo@bo-persson.se>: Jul 02 08:33AM +0200

On 2021-07-02 at 07:44, James Kuyper wrote:
> Windows, but can also be used on other platforms. If your code didn't
> need to be portable to other platforms, you could, by definition, rely
> upon Window's own guarantees about wchar_t.
 
The problem is that the language explicitly requires wchar_t to be a
distinct type.
 
Once upon a time it was implemented as a typedef, but that option was
removed as early as C++98. Overload resolution and things...
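
A minimal sketch of what "overload resolution and things" means in
practice: because wchar_t is a distinct type rather than a typedef,
these two overloads can coexist and are selected independently, even on
a platform where both types are 16 bits wide.

#include <iostream>

void which(wchar_t)  { std::cout << "wchar_t overload\n"; }
void which(char16_t) { std::cout << "char16_t overload\n"; }

int main() {
    which(L'x');  // picks the wchar_t overload
    which(u'x');  // picks the char16_t overload
}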
Kli-Kla-Klawitter <kliklaklawitter69@gmail.com>: Jul 02 10:03AM +0200

On 01.07.2021 at 20:21, Keith Thompson wrote:
 
>> v1.01 does honor the BOM.
 
> At the risk of giving the impression I'm taking you seriously, the
> oldest version of gcc available from gnu.org is 1.42, released in 1992.
 
No, the oldest gcc version is v0.9, from March 22, 1987.
Keith Thompson <Keith.S.Thompson+u@gmail.com>: Jul 02 02:31AM -0700

Ralf Goertz <me@myprovider.invalid> writes:
[...]
> b.cc:1:5: warning: null character(s) ignored
 
> etc.
 
> How does that qualify as "gcc honoring the BOM"?
 
On my system, gcc doesn't handle UTF-16 at all, with or without a BOM.
(I don't know whether there's a way to configure it to do so.)
 
It does handle UTF-8 with or without a BOM.
 
$ file b.cpp
b.cpp: C source, UTF-8 Unicode (with BOM) text
$ cat b.cpp
int main() { }
$ hd b.cpp
00000000 ef bb bf 69 6e 74 20 6d 61 69 6e 28 29 20 7b 20 |...int main() { |
00000010 7d 0a |}.|
00000012
$ gcc -c b.cpp
$
 
gcc 9.3.0 on Ubuntu 20.04. (There is, of course, no point in going back
to ancient versions of gcc.)
 
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */
MrSpud_oyCn@92wvlb1hltq4dhc.gov.uk: Jul 02 09:38AM

On Fri, 02 Jul 2021 02:31:24 -0700
>[...]
 
>On my system, gcc doesn't handle UTF-16 at all, with or without a BOM.
>(I don't know whether there's a way to configure it to do so.)
 
Just out of interest, what byte order is the BOM in? Catch 22?
Paavo Helde <myfirstname@osa.pri.ee>: Jul 02 01:52PM +0300


>> On my system, gcc doesn't handle UTF-16 at all, with or without a BOM.
>> (I don't know whether there's a way to configure it to do so.)
 
> Just out of interest, what byte order is the BOM in? Catch 22?
 
This is probably a troll question, but answering anyway: the BOM marker
U+FEFF is in the correct byte order, in both little-endian and
big-endian UTF-16. That's how you tell them apart.
 
The trick is that the byte-swapped value U+FFFE is not a valid Unicode
character, so there is no possibility of a mixup. Neither byte sequence
is valid UTF-8 either, which avoids a mixup with that encoding.
Juha Nieminen <nospam@thanks.invalid>: Jul 02 11:52AM

> so you should be able to freely use u prefixed
> string literals and char16_t with code that needs to be portable to
> Windows, but can also be used on other platforms.
 
But doesn't that have the exact same problem as in my original post?
In other words, in code like this:
 
const char16_t *str = u"something";
 
the stuff between the quotation marks in the source code is in whatever
encoding the source file happens to use (ostensibly, but not assuredly,
UTF-8), which the compiler needs to convert to UTF-16 at compile time
when writing it to the output binary.
 
Or is it guaranteed that the characters between u" and " will
always be interpreted as UTF-8?
MrSpud_zg8yg@nx8574z6ey2rc0y563k5i2.net: Jul 02 12:53PM

On Fri, 2 Jul 2021 13:52:53 +0300
>>> (I don't know whether there's a way to configure it to do so.)
 
>> Just out of interest, what byte order is the BOM in? Catch 22?
 
>This is probably a troll question, but answering anyway: the BOM marker
 
No, not a troll.
 
>U+FEFF is in the correct byte order, in both little-endian and
>big-endian UTF-16. That's how you tell them apart.
 
So it's just 2 bytes in sequence, not a 16-bit value?
"Alf P. Steinbach" <alf.p.steinbach@gmail.com>: Jul 02 03:22PM +0200


>> U+FEFF is in the correct byte order, in both little-endian and
>> big-endian UTF-16. That's how you tell them apart.
 
> So its just 2 bytes in sequence, not a 16 bit value?
 
The BOM is a Unicode code point, U+FEFF as Paavo mentioned, originally
standing for an invisible zero-width no-break space. It's encoded either
as little-endian UTF-16 (two bytes), as big-endian UTF-16 (two bytes),
or as endianness-agnostic UTF-8 (three bytes). The encoded BOM yields a
reliable encoding indicator, though pedantic people might argue that
it's just statistical -- after all, one just might happen to have a
Windows 1252 encoded file that starts with three characters whose byte
values are the same as the UTF-8 BOM's.
 
In the same vein, one just might happen to have a `.txt` file in Windows
with the letters "MZ" at the very start, like
 
MZ, Mishtara Zva'it, is the Military Police Corps of Israel. blah
 
and if you then try to open the file in your default text editor by just
typing the file name in old Cmd, those letters will be misinterpreted as
the initials of Mark Zbikowski, marking the file as an executable...
 
Since the chance of that happening isn't absolutely 0, one should never
use text file names as commands, or the UTF-8 BOM as an encoding marker.
 
 
- Alf
"Alf P. Steinbach" <alf.p.steinbach@gmail.com>: Jul 02 03:37PM +0200

On 2 Jul 2021 07:44, James Kuyper wrote:
> Windows, but can also be used on other platforms. If your code didn't
> need to be portable to other platforms, you could, by definition, rely
> upon Window's own guarantees about wchar_t.
 
I agree, but unfortunately both the C and C++ standards require
`wchar_t` to be able to represent all code points in the largest
supported extended character set, and while that requirement worked
nicely with original Unicode, already in 1992 or thereabouts (not quite
sure) it was in conflict with extremely firmly established practice in
Windows.
 
Bringing the standards into agreement with actual practice should be a
goal, but for C++ it's seldom done.
 
It /was/ done, in C++11, for the hopelessly idealistic C++98 goal of
clean C++-ish `<c...>` headers that didn't pollute the global namespace.
But C++17 added stuff to only some `<c...>` headers and not to the
corresponding `<... .h>` headers, and that fact was then used quite
recently in a proposal to un-deprecate the .h headers but add all new
stuff only to `<c...>` headers. Which satisfies the politics, but not
the practical need to use code that relies on C libraries designed to be
C++-compatible without running afoul of qualification issues.
 
And it /was/ done, in C++11, for both `throw` specifications and for the
`export` keyword, not to mention the C++11 optional conversion between
function pointers and `void*`, in order to support the reality of Posix.
 
But it seems to me that it's very, very hard to get such changes through
the committee, just as individual people generally don't like to admit
that they've been wrong. Instead, C++20 started on a path of
/introducing/ more conflicts with reality. In particular for
`std::filesystem::path`, where they threw the baby out with the
bathwater for purely academic idealism reasons.
 
- Alf
Paavo Helde <myfirstname@osa.pri.ee>: Jul 02 05:32PM +0300


>> U+FEFF is in the correct byte order, in both little-endian and
>> big-endian UTF-16. That's how you tell them apart.
 
> So its just 2 bytes in sequence, not a 16 bit value?
 
The file contains bytes; it's up to the reading code how to interpret
them. It can interpret the bytes as uint16_t, i.e. cast the file
buffer to 'const uint16_t*' and read the first 2-byte value. If it is
0xFEFF, then it knows this is a UTF-16 file in the matching byte order.
If it is 0xFFFE, then it knows it's a UTF-16 file in the opposite byte
order, and the rest of the file needs to be byte-swapped.
 
It can also interpret the buffer as containing uint8_t bytes, but then
the logic is a bit more complex: the code must know whether it is itself
running on a big-endian or little-endian machine, and behave accordingly.
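
A minimal sketch of the byte-wise variant: reading the first bytes as
uint8_t identifies the file's byte order directly, and whether the host
then needs to byte-swap is a separate check against its own endianness.

#include <cstddef>
#include <cstdint>
#include <string>

// Returns a description of the BOM found at the start of 'data', if any.
std::string detect_bom(const std::uint8_t* data, std::size_t size) {
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return "UTF-8";
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return "UTF-16 big-endian";
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return "UTF-16 little-endian";
    return "no BOM detected";
}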
James Kuyper <jameskuyper@alumni.caltech.edu>: Jul 02 04:15PM -0400

On 7/2/21 7:52 AM, Juha Nieminen wrote:
>> string literals and char16_t with code that needs to be portable to
>> Windows, but can also be used on other platforms.
 
> But doesn't that have the exact same problem as in my original post?
No, because the original post used wchar_t and the L prefix, for which
the relevant encoding is implementation-defined. That's not the case for
char and the u8 prefix, char16_t and the u prefix, or for char32_t and
the U prefix.
 
I have a feeling that there's a misunderstanding somewhere in this
conversation, but I'm not sure yet what it is.
 
 
> const char16_t *str = u"something";
 
> the stuff between the quotation marks in the source code will use
> whatever encoding (ostensibly but not assuredly UTF-8), which the
 
I don't understand why you think that the source code encoding matters.
The only thing that matters is what characters are encoded. As long as
those encodings are for the characters {'u', '"', '\\', 'x', 'C', '2',
'\\', 'x', 'A', '9', '"'}, any fully conforming implementation must give
you the standard-defined behavior for u"\xC2\xA9".
 
> compiler needs to convert to UTF-16 at compile time for writing
> it to the output binary.
 
 
When using the u8 prefix, UTF-8 encoding is guaranteed, for which every
codepoint from U+0000 to U+007F is represented by a single character
with a numerical value matching the code point.
 
When using the u prefix, UTF-16 encoding is guaranteed, for which every
codepoint from U+0000 to U+D7FF, and from U+E000 to U+FFFF, is
represented by a single character with a numerical value matching the
codepoint.
 
When using the U prefix, UTF-32 encoding is guaranteed, for which every
codepoint from U+0000 to U+D7FF, and from U+E000 to U+10FFFF, is
represented by a single character with a numerical value matching the
codepoint.
 
Since the meanings of the octal and hexadecimal escape sequences are
defined in terms of the numerical values of the corresponding
characters, if you use any of those prefixes, specifying the value of a
character within the specified range by using an octal or hexadecimal
escape sequence is precisely as portable as using the UCN with the same
numerical value. Using UCNs would be better because they are less
restricted, working just as well with the L prefix and with no prefix.
However, within those ranges, octal and hexadecimal escapes will work
just as well.
 
Do you know of any implementation of C++ that claims to be fully
conforming, for which that is not the case? If so, how do they justify
that claim?
 
'\xC2' and '\xA9' are in the ranges for the u and U prefixes. They would
not be in the range for the u8 prefix, but since the context was wide
characters, the u8 prefix is not relevant.
 
 
> Or is it guaranteed that the characters between u" and " will
> always be interpreted as UTF-8?
 
Source code characters between the u" and the " will be interpreted
according to an implementation-defined character encoding. But so long
as they encode {'\\', 'x', 'C', '2', '\\', 'x', 'A', '9'}, you should
get the standard-defined behavior for u"\xC2\xA9".
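
A minimal sketch illustrating that guarantee: with the u prefix the code
units are UTF-16, so within these ranges hex escapes and UCNs spell
exactly the same values, and the checks below hold on any fully
conforming implementation.

constexpr char16_t hex_escapes[] = u"\xC2\xA9";      // hex escapes
constexpr char16_t ucns[]        = u"\u00C2\u00A9";  // universal-character-names

static_assert(hex_escapes[0] == 0x00C2 && hex_escapes[1] == 0x00A9,
              "u-prefixed literals are UTF-16: code units match the values");
static_assert(ucns[0] == hex_escapes[0] && ucns[1] == hex_escapes[1],
              "within these ranges the two spellings are interchangeable");

int main() {}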
Lynn McGuire <lynnmcguire5@gmail.com>: Jul 01 10:27PM -0500

On 6/10/2021 10:08 PM, Chris M. Thomasson wrote:
>> Has anyone read this and found it to be a good education on threads ?
 
>> https://www.amazon.com/C-Concurrency-Action-Anthony-Williams/dp/1617294691/
 
> I used to converse with Anthony way back. He knows his threads! :^)
 
I bought the book. It looks intense. Not a very big font either.
 
Lynn
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Jul 01 09:28PM -0700

On 7/1/2021 8:27 PM, Lynn McGuire wrote:
 
>>> https://www.amazon.com/C-Concurrency-Action-Anthony-Williams/dp/1617294691/
 
>> I used to converse with Anthony way back. He knows his threads! :^)
 
> I bought the book.  It looks intense.  Not a very big font either.
 
I have not seen his new book, but I am wondering if he mentions how to
use DWCAS with pure C++ in it. I know he knows about it:
 
Actually, read the whole thread where this message resides:
 
https://groups.google.com/g/lock-free/c/X3fuuXknQF0/m/Ho0H1iJgmrQJ
 
He calls it DCAS in that thread, but DCAS and DWCAS are different
things. So, I am wondering if he mentions it in his 2nd edition.
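
For reference, a minimal sketch (my own, not from either edition of the
book) of one way to express DWCAS in portable C++: std::atomic over a
trivially copyable two-word struct, which on x86-64 may compile to LOCK
CMPXCHG16B (e.g. with -mcx16) and otherwise falls back to a lock, which
is_lock_free() will report.

#include <atomic>
#include <cstdint>

// The classic DWCAS use case: a pointer paired with a version tag to
// sidestep the ABA problem in lock-free structures.
struct TaggedPtr {
    void*          ptr;
    std::uintptr_t tag;
};

int main() {
    std::atomic<TaggedPtr> head{TaggedPtr{nullptr, 0}};

    TaggedPtr expected = head.load();
    TaggedPtr desired{expected.ptr, expected.tag + 1};

    // Atomically replaces both words only if neither has changed.
    bool swapped = head.compare_exchange_strong(expected, desired);

    return swapped && head.is_lock_free() ? 0 : 1;
}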
red floyd <no.spam.here@its.invalid>: Jul 01 05:48PM -0700

On 7/1/2021 1:14 PM, Vir Campestris wrote:
> with is: This pointer from my database points to an interface. It might
> be a type 1 object, in which case do ONE(). It might be a type 2, in
> which case do TWO().
 
Why can't you use a virtual member function for this? Or is it to cast
the pointer from the database to one of two different hierarchies?
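
A minimal sketch of that suggestion, with hypothetical type names since
the actual database types weren't shown: virtual dispatch picks ONE()
or TWO() without any test on the dynamic type.

#include <iostream>

struct Interface {
    virtual ~Interface() = default;
    virtual void act() = 0;   // each concrete type supplies its own action
};

struct Type1 : Interface { void act() override { std::cout << "ONE()\n"; } };
struct Type2 : Interface { void act() override { std::cout << "TWO()\n"; } };

// The code holding the pointer from the database doesn't need to know
// which concrete type it has.
void handle(Interface& obj) {
    obj.act();
}

int main() {
    Type1 a;
    Type2 b;
    handle(a);
    handle(b);
}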
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page.
To unsubscribe from this group and stop receiving emails from it send an email to comp.lang.c+++unsubscribe@googlegroups.com.
