Sunday, August 18, 2019

Digest for comp.lang.c++@googlegroups.com - 25 updates in 5 topics

Keith Thompson <kst-u@mib.org>: Aug 17 04:33PM -0700


> Does your ignorance know no bounds?
 
> Aren't you even capable of the simplest of web searches or references
> before talking rubbish in public?
 
I think you have incorrectly assumed that Bonita is asserting that there
are no more than 65536 Unicode codepoints.
 
UTF-16 can represent all Unicode codepoints. It cannot represent each
Unicode codepoint in 16 bits; some of them require two 16-bit values.
 
I presume Bonita meant that UTF-16 can represent all of Unicode (which
is true). But a 16-bit wchar_t cannot, because the standard requires
wchar_t to be "able to represent all members of the execution
wide-character set" (that's the point Bonita missed or ignored).
 
To use wchar_t to represent all Unicode code points, you either have to
make wchar_t at least 21 bits (more likely 32 bits) *or* you have to use
it in a way that doesn't satisfy the requirements of the standard, such
as using UTF-16 to encode some characters in more than one wchar_t.
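For concreteness, a minimal sketch of the first option, assuming a
platform such as Linux/glibc where wchar_t is 32 bits wide (on a
16-bit-wchar_t implementation the literal below would not fit in a
single object):

    #include <climits>

    // With a 32-bit wchar_t, a single wchar_t object can hold any
    // Unicode code point, including those above U+FFFF.
    static_assert(sizeof(wchar_t) * CHAR_BIT >= 21,
                  "wchar_t too narrow for all of Unicode");
    wchar_t w = L'\U0001F600';   // one code point, one wchar_t object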
 
[...]
 
--
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */
Keith Thompson <kst-u@mib.org>: Aug 17 04:46PM -0700

> wchar_t, or a 121-bit wchar_t if it liked, but not a 16-bit one. It is
> just like saying the size of "int" is implementation dependent, but the
> standards require a minimum of 16-bit int.
 
**OR** it could have a 16-bit wchar_t and not claim to support Unicode
as its wide character set. An implementation with 16-bit wchar_t that
supports only the BMP as its wide character set could be conforming
(though not as useful as an implementation that supports full Unicode).
 
But Microsoft's own documentation says:
The wchar_t type is an implementation-defined wide character
type. In the Microsoft compiler, it represents a 16-bit wide
character used to store Unicode encoded as UTF-16LE, the native
character type on Windows operating systems. The wide character
versions of the Universal C Runtime (UCRT) library functions
use wchar_t and its pointer and array types as parameters and
return values, as do the wide character versions of the native
Windows API.
https://docs.microsoft.com/en-us/cpp/cpp/char-wchar-t-char16-t-char32-t?view=vs-2019
 
which I believe is non-conforming.
 
(But I'm not sure how Microsoft could have fixed this without breaking
too much existing code.)
 
[...]
 
--
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */
Robert Wessel <robertwessel2@yahoo.com>: Aug 17 10:57PM -0500

On Sat, 17 Aug 2019 20:41:06 +0200, David Brown
>handling some billion+ code points, just like UTF-8 and UTF-32 can,
>while not understanding that you can't fit all Unicode characters into a
>single 16-bit wchar_t.
 
 
Actually UTF-16 can't encode billions of characters: it specifically
uses a pair of extension (surrogate) characters, which encode 10 bits
each, giving the 16 extra planes.
 
The format for UTF-8 was designed (and could trivially be extended) to
support 31-bit code points (with a six-byte sequence). It was limited
to the current 17-plane scheme by the adoption of UTF-16, which could
only address 17 planes (the BMP plus the 16 extra planes implied by
the 20-bit number encoded in the surrogate characters).
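For concreteness, a rough sketch of that mechanism (error handling
omitted; a real encoder would also reject code points in the surrogate
range):

    #include <cstdint>

    // Encode one code point as UTF-16.  A code point above U+FFFF is
    // reduced by 0x10000 and the remaining 20 bits are split into two
    // 10-bit halves carried by the high and low surrogates.
    int encode_utf16(char32_t cp, char16_t out[2]) {
        if (cp < 0x10000) {
            out[0] = static_cast<char16_t>(cp);
            return 1;
        }
        std::uint32_t v = cp - 0x10000;                        // 20 bits
        out[0] = static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate
        out[1] = static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate
        return 2;
    }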
Tim Rentsch <tr.17687@z991.linuxsc.com>: Aug 17 10:25PM -0700


> So if you only support the locales where 16 bits is enough, you
> haven't broken any rules. And you cannot help is some people use
> parts of "unsupported" locales as well...
 
Perhaps not as much of a loophole as you think. The setlocale()
function is defined, even for C++, by the C standard. A call to
setlocale() must not succeed for a locale that has more characters
than wchar_t can accommodate, because of how 'wide character' is
defined. So the "unsupported" locales you are talking about
cannot be used in a conforming implementation. Of course, if the
implementation is not conforming, it can do whatever it wants.
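As a hedged illustration of that rule (the locale name here is only an
example; names and availability are platform-specific):

    #include <clocale>
    #include <cstdio>

    int main() {
        // On a conforming implementation with a 16-bit wchar_t, a
        // request for a locale whose character set does not fit in
        // wchar_t would have to fail, i.e. return a null pointer.
        if (std::setlocale(LC_ALL, "en_US.UTF-8") == nullptr)
            std::puts("locale not supported by this implementation");
        return 0;
    }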
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 08:48AM +0200

> Does your ignorance know no bounds?
 
Unicode is defined to have code points of at most 21 bits.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 08:49AM +0200

> an encoding that combines all the disadvantages of UTF-8 with all the
> disadvantages of UTF-32 with none of their advantages.  But you cannot
> store a arbitrary Unicode /character/ in a single 16-bit object.
 
Therefore you have UTF-16.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 08:54AM +0200

> return the stored characters, nor that insert/erase/substring operations
> will cut the string properly - which completely defies the purpose of
> wchar_t.
 
That's how UTF-16 works, and the Unicode standard recommends UTF-16 for
certain circumstances.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 08:57AM +0200

> wchar_t, or a 121-bit wchar_t if it liked, but not a 16-bit one.  It is
> just like saying the size of "int" is implementation dependent, but the
> standards require a minimum of 16-bit int.
 
There's no mandatory relationship between Unicode and wchar_t.
 
>> range of UTF-16 can cover Windows isn't broken here.
 
> Look it up.  UTF-16 can cover everything, though it is a silly choice. A
> 16-bit wchar_t cannot store all UTF-16 characters.
 
But UTF-16 can. And Windows works with UTF-16.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 08:58AM +0200

> Get back to use when you figure out how to fit 21 bits of Unicode code
> points into a 16-bit wchar_t.
 
wchar_t has no mandatory relationship to Unicode.
And Windows uses UTF-16. And UTF-16 works with 16-bit characters.
Bart <bc@freeuk.com>: Aug 18 10:16AM +0100

On 18/08/2019 07:58, Bonita Montero wrote:
>> points into a 16-bit wchar_t.
 
> wchar_t has no mandantory relationship to Unicode.
> And Windows uses UTF-16. And UTF-16 works with 16 bit characters.
 
This is what you might often do with ASCII:
 
unsigned char ascii[128];
for (i=0; i<128; ++i) ascii[i]=i;
 
So that all ASCII code points are represented in sequence, one per
element of ascii.
 
What people are saying is that you can't do the equivalent for Unicode
using Windows' wchar_t when the latter is 16 bits:
 
wchar_t unicode[1114112];
for (i=0; i<1114112; ++i) unicode[i]=i;
 
because the stored values will wrap back to 0 as soon as i gets to 65536
(and again at 131072 and so on).
 
If this unicode[] array were to represent a UTF-16 /string/, then its
size would be greater than 1114112 elements, as many values require
multi-element sequences (surrogate pairs), you couldn't write the loop
in such a simple way, and you wouldn't have the Nth code point stored
as the single value in unicode[N].
 
This is why a 32-bit wchar_t would have been a far better choice.
Obviously, Windows can display and work with any Unicode characters,
but a 16-bit wchar_t makes that more complicated than it needed to be.
 
(There's nothing to stop a Windows programmer writing the code like
this:
 
uint32_t unicode[1114112];
for (i=0; i<1114112; ++i) unicode[i]=i;
 
but then such 32-bit characters, or strings using such arrays, are not
directly supported by the OS.)
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 11:27AM +0200

> escape sequences with multiple elements, you couldn't write the loop
> in such a simple way, and you wouldn't have the Nth code point stored
> as the single value in unicode[N].
 
Of course a wchar_t with 32 bits would be more convenient.
But with 16 bits it fits with Win32.
And the Unicode standard recommends UTF-16 for certain circumstances.
 
>    uint32_t unicode[1114112];
>    for (i=0; i<1114112; ++i) unicode[i]=i;
 
No one uses char arrays of that length in C++; they use u16string
or u32string.
David Brown <david.brown@hesbynett.no>: Aug 18 11:34AM +0200

On 18/08/2019 01:33, Keith Thompson wrote:
>> before talking rubbish in public?
 
> I think you have incorrectly assumed that Bonita is asserting that there
> are no more than 65536 Unicode codepoints.
 
Possibly. If I have misinterpreted her, then I will be glad to be
corrected.
 
 
> UTF-16 can represent all Unicode codepoints. It cannot represent each
> Unicode codepoint in 16 bits; some of them require two 16-bit values.
 
Correct - and that is something I have said several times.
 
> is true). But a 16-bit wchar_t cannot, because the standard requires
> wchar_t to be "able to represent all members of the execution
> wide-character set" (that's the point Bonita missed or ignored).
 
Agreed - and again, it is something I have said several times.
 
It is fine (both in the sense of working practically and in being
compliant with the standards) to use char16_t and u16string to handle
all Unicode strings and characters. You can't store all Unicode
characters in a /single/ char16_t object - but a char16_t is for storing
Unicode code /units/, not code /points/, so it is fine for the job.
 
But a wchar_t has to be able to store any /character/ - for Unicode,
that means 21 bits of code point.
 
> make wchar_t at least 21 bits (more likely 32 bits) *or* you have to use
> it in a way that doesn't satisfy the requirements of the standard, such
> as using UTF-16 to encode some characters in more than one wchar_t.
 
In practice, I think people on Windows use wchar_t strings (or arrays)
for holding UTF-16 encoded strings. That will work fine. But it
encourages mistaken assumptions - such as that ws[9] holds the tenth
Unicode character in the string, or that wcslen returns the number of
characters in the string. These assumptions hold for a proper wchar_t,
such as the 32-bit wchar_t on Unix systems (or a 16-bit wchar_t on
Windows while it used UCS-2 rather than UTF-16).
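To make the pitfall concrete, here is a sketch (not Windows API code,
just an illustration that assumes the wchar_t string holds well-formed
UTF-16, as it does on current Windows):

    #include <cstddef>
    #include <cwchar>

    // Counts Unicode code points rather than 16-bit units, by not
    // counting low (trailing) surrogates.
    std::size_t count_code_points(const wchar_t *s) {
        std::size_t n = 0;
        for (; *s; ++s)
            if (*s < 0xDC00 || *s > 0xDFFF)
                ++n;
        return n;
    }

    // With a 16-bit wchar_t, for L"\U0001F600" (a single emoji)
    // wcslen() reports 2 units, while count_code_points() reports 1.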
 
I think the sensible practice would be to deprecate the use of wchar_t
as much as possible, using instead char8_t for UTF-8 strings when
dealing with string data (and especially for data interchange), and
char32_t for UTF-32 encoding internally if you need
character-by-character access. These are unambiguous and function
identically across platforms (except perhaps for the endianness of
char32_t). For interaction with legacy code and APIs on Windows,
char16_t is a better choice than wchar_t.
 
Going forward, C++ could drop wchar_t and support for wide character
execution sets other than Unicode, just as it is dropping support for
signed integer representations other than two's complement. This kind
of thing limits flexibility in theory, but not in practice, and it would
simplify things a bit.
David Brown <david.brown@hesbynett.no>: Aug 18 11:34AM +0200

On 18/08/2019 08:48, Bonita Montero wrote:
>> Does your ignorance know no bounds?
 
> Unicode is defined to have a maximum of 21 bit codepoints.
 
Yes. And how do you fit those 21 bits into a 16-bit wchar_t?
David Brown <david.brown@hesbynett.no>: Aug 18 11:46AM +0200

On 18/08/2019 08:49, Bonita Montero wrote:
>> disadvantages of UTF-32 with none of their advantages.  But you cannot
>> store a arbitrary Unicode /character/ in a single 16-bit object.
 
> Therefore you have UTF-16.
 
wchar_t is not for UTF-16 - each single wchar_t object should store a
complete character. That is what the type means - look it up.
 
Strings or arrays of wchar_t can hold UTF-16 encoded data, which will
work in practice but is an abuse of wchar_t that goes against the
standards. ("char16_t" is the type you want here, which is distinct
from wchar_t despite being the same size on Windows.)
 
Until you can find a way to store the /single/ character "𓃀" (the
hieroglyph for "B") in a /single/ 16-bit wchar_t, not a string or
array, a 16-bit wchar_t is too small.
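A quick way to see the unit counts (a sketch; it assumes a C++11 or
later compiler and a UTF-8 source file so the literal survives):

    // Two UTF-16 code units plus the terminator...
    static_assert(sizeof(u"𓃀") / sizeof(char16_t) == 3,
                  "this character needs two UTF-16 code units");
    // ...but only one UTF-32 code unit plus the terminator.
    static_assert(sizeof(U"𓃀") / sizeof(char32_t) == 2,
                  "and just one UTF-32 code unit");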
David Brown <david.brown@hesbynett.no>: Aug 18 12:07PM +0200

On 18/08/2019 08:54, Bonita Montero wrote:
>> completely defies the purpose of wchar_t.
 
> That's how UTF-16 works and the Unicode-standard recommends UTF-16 for
> certain circumstances.
 
I know how UTF-16 works. And UTF-16 is only recommended for documents
written primarily in BMP characters between 0x0800 and 0xffff (i.e.,
non-European scripts, not including CJK), as it is more efficient there
than UTF-8 or UTF-32. This recommendation is roundly ignored, for good
reasons. While UTF-16 is still used internally on Windows, and
with some languages (like Java) and libraries (like Qt), it is almost
completely negligible as an encoding in documents or data interchange.
If you see any reference recommending its usage, check the date - the
page is probably from last century.
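For a concrete data point on the efficiency claim above (a sketch;
U+0915, DEVANAGARI LETTER KA, is chosen only as an example character in
that range):

    static_assert(sizeof(u8"\u0915") - 1 == 3, "3 bytes in UTF-8");
    static_assert(sizeof(u"\u0915") - sizeof(char16_t) == 2,
                  "2 bytes in UTF-16");
    static_assert(sizeof(U"\u0915") - sizeof(char32_t) == 4,
                  "4 bytes in UTF-32");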
David Brown <david.brown@hesbynett.no>: Aug 18 12:22PM +0200

On 18/08/2019 01:46, Keith Thompson wrote:
> as its wide character set. An implementation with 16-bit wchar_t that
> supports only the BMP as its wide character set could be conforming
> (though not as useful as an implementation that supports full Unicode).
 
Absolutely true. And that was the case originally with Windows NT,
which used UCS-2 (essentially the subset of UTF-16 that can be encoded
in a single unit). But Windows has moved steadily more of its APIs,
libraries, GUI widgets, and software to full UTF-16. It has done so in
uncoordinated jumps, with versions of Windows that let you use
multi-unit characters in some programs and APIs but not others, but I
believe it is fairly complete now. (And with Windows 10, I have heard
that UTF-8 support is officially in place.)
 
So 16-bit wchar_t was appropriate when Windows NT started with UCS-2.
But it should have been changed to 32-bit for later Windows.
 
 
> which I believe is non-conforming.
 
> (But I'm not sure how Microsoft could have fixed this without breaking
> too much existing code.)
 
Aye, there's the rub.
 
People programming for Windows have a long history of making unwarranted
assumptions about types and sizes. They have been used to programming
on a single platform, and thinking it will remain the same forever.
They have not been helped by Microsoft's unforgivable reticence over C99
and headers like <stdint.h>. Windows programmers regularly assume that
wchar_t is 16-bit - changing it would break code written by these
programmers. The same thing happened with the move to 64-bit Windows -
because Windows programmers had assumed that "long" is exactly 32-bit
(having had no "int32_t" available), changing the size of "long" to
match every other 64-bit platform would have broken lots of Windows code.
 
It is easy to say /now/ that Microsoft should have changed to 32-bit
wchar_t and UTF-32 and/or UTF-8 as soon as it was clear that Unicode
would not fit in 16 bits. But it would have been hard to do at the time.
 
The resulting situation today, however, is clear. Windows is
non-compliant and nearly unique with its 16-bit wchar_t, and it is the
only major OS to make heavy use of UTF-16 instead of UTF-8.
David Brown <david.brown@hesbynett.no>: Aug 18 12:24PM +0200

On 18/08/2019 08:57, Bonita Montero wrote:
>> one.  It is just like saying the size of "int" is implementation
>> dependent, but the standards require a minimum of 16-bit int.
 
> There's no mandantory relationship between Unicode and wchar_t
 
Correct.
 
But there is a mandatory relationship between the character set
supported by the target system, and wchar_t. Windows supports Unicode
(with UTF-16 encoding, and possibly UTF-8 with Windows 10). Therefore,
wchar_t must support Unicode characters.
 
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 12:57PM +0200

>> Unicode is defined to have a maximum of 21 bit codepoints.
 
> Yes.  And how do you fit those 21 bits into a 16-bit wchar_t?
 
That's not necessary if you use UTF-16.
David Brown <david.brown@hesbynett.no>: Aug 18 12:32PM +0200

On 17/08/2019 21:43, Geoff wrote:
 
> He does mention endianess and he declined to discuss it, saying it
> needs a separate discussion. You are the second poster here to
> overlook that.
 
I said he glossed over it, which he did. I did not say he ignored it
completely - he said it was an issue but did not discuss it further.
 
>> define a format for your particular requirements.
 
> The UNIX epoch-based 64-bit time is good only if you believe the sun
> will only last another 2.147 billion years.
 
292 billion years, by my counting. By that time, any issues with
rollover will be an SEP¹.
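(Roughly: a signed 64-bit count of seconds overflows after 2^63 s,
about 9.2 x 10^18 s, and at roughly 3.16 x 10^7 s per year that is on
the order of 2.9 x 10^11, i.e. ~292 billion, years.)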
 
> The epoch time rolls over
> in the year 2,147,485,547 on MacOS. The sun is calculated to have
> another 5.4 billion years of hydrogen fuel left.
 
That sounds like a signed 32-bit integer storing the number of years
(not an unreasonable choice for a split time/date format).
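(If so, the arithmetic fits: with a year counter based at 1900, as in
struct tm's tm_year, 1900 + (2^31 - 1) = 1900 + 2147483647 =
2147485547.)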
 
 
> Proton decay, so far, is only theoretical. It has never been observed.
 
True. And perhaps humankind will have moved to a different, newer
universe by that time. We /could/ standardise on 128-bit second
timestamps, but I think it is important to leave a few challenges to
keep future generations motivated :-)
 
 
> [snip]
 
¹ <https://en.wikipedia.org/wiki/Somebody_else%27s_problem>
Tim Rentsch <tr.17687@z991.linuxsc.com>: Aug 17 10:07PM -0700


>> Does it make sense to code:
 
>> MyClass *myObj = MyClass();
 
> No. It is usually syntax error,
 
In most cases a diagnostic is required, but assuming MyClass
is a class (or struct), and there is no funny preprocessor
stuff going on, it is never a syntax error.
"Öö Tiib" <ootiib@hot.ee>: Aug 17 11:18PM -0700

On Sunday, 18 August 2019 08:07:31 UTC+3, Tim Rentsch wrote:
 
> In most cases a diagnostic is required, but assuming MyClass
> is a class (or struct), and there is no funny preprocessor
> stuff going on, it is never a syntax error.
 
I can demo that it is ill-formed in the most trivial case:
http://coliru.stacked-crooked.com/a/aa95b9b7035de3c3
"Öö Tiib" <ootiib@hot.ee>: Aug 18 12:05AM -0700

On Sunday, 18 August 2019 09:18:33 UTC+3, Öö Tiib wrote:
> > stuff going on, it is never a syntax error.
 
> I can demo that it is ill-formed on most trivial case:
> http://coliru.stacked-crooked.com/a/aa95b9b7035de3c3
 
Ok. I think I got it ... you meant it is not a syntax error but a
semantic error.
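Roughly what the linked demo boils down to (a minimal sketch; it does
not compile, but not for syntactic reasons):

    struct MyClass {};

    int main() {
        // The initializer is syntactically fine, but MyClass() is a
        // prvalue of type MyClass, and there is no implicit conversion
        // from MyClass to MyClass*, so the declaration is ill-formed -
        // a constraint violation rather than a syntax error.
        MyClass *myObj = MyClass();
    }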
Tim Rentsch <tr.17687@z991.linuxsc.com>: Aug 17 10:02PM -0700

> of all minds. So the only progress from there possible seems to
> be towards (often vague, often incorrect,) understandings of some
> aspects of few things. Did you mean some other "progress"?
 
I'm sorry about how my statement was phrased. It made sense in
my head when I wrote it but now I think it might be not so good.
If you don't mind maybe we can just drop it and leave it at that.
 
> to provide some. I meant that your lack of attempt to express
> any thoughts started to feel strange. Have I offended you
> somehow?
 
I have the sense that your communication style is more
argumentative than constructive. I don't like arguing, and I
don't like trying to have a productive conversation with people
who are more interested in argument than communication. So in
exchanges with you I've learned not to say very much, because
trying to say more would for me be counter-productive.
"Öö Tiib" <ootiib@hot.ee>: Aug 17 11:53PM -0700

On Sunday, 18 August 2019 08:02:30 UTC+3, Tim Rentsch wrote:
 
> I'm sorry about how my statement was phrased. It made sense in
> my head when I wrote it but now I think it might be not so good.
> If you don't mind maybe we can just drop it and leave it at that.
 
Fine, I have also not much to add to it.
 
> who are more interested in argument than communication. So in
> exchanges with you I've learned not to say very much, because
> trying to say more would for me be counter-productive.
 
I have dropped viewing discussions as some kind of combat long
ago. Some things remain controversial and subjects of viewpoint,
style or taste. Participating in discussions still has a point for
me when I have some arguments to add, to question, or to
clarify. How else can a discussion be productive?
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Aug 17 10:13PM -0700

On 8/7/2019 11:06 PM, Bonita Montero wrote:
> I don't think FPWB is suitable for lock-free-programming because
> it a kernel-call and thereby very slow.
 
Not true. Think of RCU, or how about fast SMR? Btw, it's not only slow
because it is a kernel call. It also does something to other processors
running threads within the calling process's affinity mask:
 
https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-flushprocesswritebuffers
 
It's used on the slow side of an asymmetric sync algorithm. Think of
optimizing hot paths at the expense of hitting slow paths with extra
work.
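A very rough sketch of that shape (hypothetical and untested; it only
shows where FlushProcessWriteBuffers() would sit in an asymmetric
scheme, not a complete or proven algorithm):

    #include <atomic>
    #include <windows.h>

    std::atomic<int> fast_flag{0};   // written on the hot path
    std::atomic<int> slow_flag{0};   // written on the cold path

    void fast_path() {               // runs very often: no heavy fence here
        fast_flag.store(1, std::memory_order_relaxed);
        if (slow_flag.load(std::memory_order_relaxed)) {
            // fall back to a conventional, fully fenced protocol
        }
        // ... hot-path work ...
        fast_flag.store(0, std::memory_order_release);
    }

    void slow_path() {               // runs rarely: pays for the barrier
        slow_flag.store(1, std::memory_order_seq_cst);
        FlushProcessWriteBuffers();  // per the docs: every processor running
                                     // a thread of this process performs a
                                     // full memory barrier
        while (fast_flag.load(std::memory_order_acquire)) {
            // spin or yield until the fast side drains
        }
        // ... cold-path work ...
        slow_flag.store(0, std::memory_order_release);
    }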
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page.
To unsubscribe from this group and stop receiving emails from it send an email to comp.lang.c+++unsubscribe@googlegroups.com.
