
Digest for comp.lang.c++@googlegroups.com - 18 updates in 2 topics

Geoff <geoff@invalid.invalid>: Aug 17 12:43PM -0700

On Sat, 17 Aug 2019 15:08:52 +0200, David Brown
<david.brown@hesbynett.no> wrote:
 
[snip]
 
>need to define and document the format, making it independent of the
>platform. The author here glosses over the important bits - endianness
>and alignment (and consequently padding), which he doesn't mention at all.
 
He does mention endianness, but he declined to discuss it, saying it
needs a separate discussion. You are the second poster here to
overlook that.
 
[snip]
 
>seconds will last longer than protons, and give you accurate fractions
>of a second for more normal timeframes. If these are not suitable,
>define a format for your particular requirements.
 
The UNIX epoch-based 64-bit time is good only if you believe the sun
will last no more than another 2.147 billion years. The epoch time
rolls over in the year 2,147,485,547 on macOS. The sun is calculated
to have another 5.4 billion years of hydrogen fuel left.
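
For what it's worth, that year figure matches a signed 32-bit year
counter (years since 1900, as struct tm uses) overflowing, rather than
64-bit time_t itself; a minimal sketch of the arithmetic:

#include <cstdint>
#include <iostream>

int main() {
    // struct tm counts years from 1900 in an int; with a 32-bit int the
    // last representable year is INT32_MAX + 1900.
    std::int64_t last_year = std::int64_t{INT32_MAX} + 1900;
    std::cout << last_year << '\n';   // 2147485547
    // A signed 64-bit time_t itself only wraps after roughly 292 billion
    // years, long after the sun's 5.4 billion.
}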
 
Proton decay, so far, is only theoretical. It has never been observed.
 
[snip]
Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 06:05PM +0200

> You can store any character encoding supported by the compiler in a
> wchar_t, unless you are using Windows, which is broken and can't store
> Unicode characters in a wchar_t because it is too small.
 
The size of wchar_t is implementation-dependent, so Windows isn't broken
here. UTF-16 has fewer codepoints available than UTF-32, but even UTF-16
covers such a huge number of codepoints that it will never be exhausted.
And as there will never be more codepoints in Unicode than the "limited"
range of UTF-16 can cover, Windows isn't broken here.
 
> On most modern systems, wchar_t matches char32_t.
> But it is not a requirement.
 
That was where David is wrong.
 
> designs, not ancient ones.  They often don't have a full C++ library,
> but they are freestanding systems and don't need the full library to
> be compliant.  (They might be non-compliant in other ways.)
 
I wasn't talking about the library but basic datatypes.
 
> Windows compilers are invariably non-compliant regarding wchar_t because
> it is only 16-bit on that platform, when it is required to be 32-bit.
 
UTF-16 is suitable for all codepoints that will ever be populated.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 06:06PM +0200

> execution set.  Since Windows supports Unicode (regardless of the
> encodings used), a Windows compiler must be able to hold any Unicode
> code point in a wchar_t - i.e., wchar_t must be a minimum of 32 bit.
 
Win32 uses UTF-16 and there will never be more populated codepoints
than UTF-16 can cover; so where's the problem?
Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 06:26PM +0200

> Exactly, yes.  16-bit wchar_t can't do that on a system that
> supports Unicode (regardless of the encoding).
 
There will be no more Unicode codepoints populated than can be
addressed by UTF-16. At least not until we start supporting alien
languages.
Paavo Helde <myfirstname@osa.pri.ee>: Aug 17 07:55PM +0300

On 17.08.2019 18:54, Bonita Montero wrote:
 
> The APIs accepting UTF-16 strings aren't for persistence.
 
You mean, like CreateFileW() or RegSetKeyValueW()?
 
> And there aren't any function for string-manipulation on
> the Win32-API.
 
Like ExpandEnvironmentStringsW() or PathUnExpandEnvStringsW()?
 
Not that it would matter in the slightest. Microsoft's use of wchar_t
still does not match the C++ standard and UTF-16 still remains the most
useless Unicode representation.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 07:15PM +0200

>> The APIs accepting UTF-16 strings aren't for persistence.
 
> You mean, like CreateFileW() or RegSetKeyValueW()?
 
More precisely: for persistent content.
And for both functions UTF-16 is sufficient.
 
> Not that it would matter in the slightest. Microsoft's use of wchar_t
> still does not match the C++ standard and UTF-16 still remains the
> most useless Unicode representation.
 
wchar_t is not required to be 32 bit wide.
Paavo Helde <myfirstname@osa.pri.ee>: Aug 17 08:39PM +0300

On 17.08.2019 20:15, Bonita Montero wrote:
 
>> You mean, like CreateFileW() or RegSetKeyValueW()?
 
> More precisely: for persistent content.
> And for both functions UTF-16 is sufficient.
 
Sure, and UTF-8 would be as well.
 
>> still does not match the C++ standard and UTF-16 still remains the
>> most useless Unicode representation.
 
> wchar_t is not required to be 32 bit wide.
 
Sure, wchar_t can be 64 bit or whatever as long as it can hold any
character supported by the implementation [1]. Windows happens to
support Unicode, so a 16-bit type does not cut it.
 
[1] 3.9.1/5: "Type wchar_t is a distinct type whose values can represent
distinct codes for all members of the largest extended character set
specified among the supported locales."
Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 07:44PM +0200

>> More precisely: for persistent content.
>> And for both functions UTF-16 is sufficient.
 
> Sure, and UTF-8 would be as well.
 
UTF-8 and UTF-16 can address 21 bits.
All codepoints that Unicode addresses.
 
> [1] 3.9.1/5: "Type wchar_t is a distinct type whose values can represent
> distinct codes for all members of the largest extended character set
> specified among the supported locales."
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Hergen Lehmann <hlehmann.expires.5-11@snafu.de>: Aug 17 07:47PM +0200

On 17.08.19 at 18:05, Bonita Montero wrote:
>> Unicode characters in a wchar_t because it is too small.
 
> The size of wchar_t is implementation-dependent, so Windows isn't broken
> here.
 
Yes, the size of wchar_t is implementation-dependent, so it's not
formally broken.
 
But the Windows implementation is technically broken, as the programmer
can neither rely on the assumption that any given character can be
stored in a wchar_t, nor that iterating over a string will correctly
return the stored characters, nor that insert/erase/substring operations
will cut the string properly - which completely defeats the purpose of
wchar_t.
 
If I have to code multi-byte awareness into each and every string
operation anyway, there is no point in using wstring instead of string
with the much more common UTF-8 encoding.
 
And it is getting even worse, as many Windows libraries and Windows
applications are not aware of the fact that a Windows wstring is
supposed to be UTF-16. Such code assumes one codepoint per string
position and fails in unpredictable ways if the input data actually
contains codepoints from the upper planes.
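
A minimal sketch of that failure mode, assuming Windows' 16-bit wchar_t
(U+20000 is a plane-2 CJK ideograph chosen arbitrarily):

#include <iostream>
#include <string>

int main() {
    std::wstring s = L"\U00020000";  // one character
    std::cout << s.size() << '\n';   // prints 2 on Windows: code units, not characters
    // s[0] alone is only the high surrogate, and s.substr(0, 1) would
    // split the pair - exactly the traps described above.
}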
 
>> On most modern systems, wchar_t matches char32_t.
>> But it is not a  requirement.
 
> That was where David is wrong.
 
He's not.
David Brown <david.brown@hesbynett.no>: Aug 17 08:10PM +0200

On 17/08/2019 19:44, Bonita Montero wrote:
 
>> Sure, and UTF-8 would be as well.
 
> UTF-8 and UTF-16 can address 21 bits.
> All codepoints that Unicode addresses.
 
Get back to us when you figure out how to fit 21 bits of Unicode code
points into a 16-bit wchar_t.
 
You can happily store any Unicode characters in a UTF-16 string, but not
in a single 16-bit character object.
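
A small illustration of that difference, using the portable char16_t and
char32_t types (U+1F600 is an arbitrary non-BMP code point):

int main() {
    char32_t single = U'\U0001F600';  // fits in one 32-bit object
    char16_t pair[] = u"\U0001F600";  // two 16-bit code units plus the terminating NUL
    static_assert(sizeof(pair) / sizeof(pair[0]) == 3, "surrogate pair + NUL");
    // char16_t bad = u'\U0001F600'; // ill-formed: does not fit in 16 bits
    (void)single; (void)pair;
}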
 
David Brown <david.brown@hesbynett.no>: Aug 17 08:20PM +0200

On 17/08/2019 18:26, Bonita Montero wrote:
>> supports  Unicode (regardless of the encoding).
 
> There will be no more Unicode codepoints populated than can be
> addressed by UTF-16. At least not until we start supporting alien languages.
 
Does your ignorance know no bounds?
 
Aren't you even capable of the simplest of web searches or references
before talking rubbish in public?
 
<https://en.wikipedia.org/wiki/Unicode>
 
<https://home.unicode.org/>
 
Currently, Unicode has 137,929 characters. These are organised into 17
code planes of 64K code points each, of which only 3 code planes are
significantly used. But 3 planes is a great deal more than 1 plane
- 16-bit has been insufficient for Unicode since 1996.
Tim Rentsch <tr.17687@z991.linuxsc.com>: Aug 17 11:22AM -0700

> wchar_t as 16-bit since it makes no sense to implement it differently.
> But it's easier to conform to any hypothetical implementation by using
> char16_t.
 
(and in another posting quotes the C++ standard)
 
> "Type wchar_t is a distinct type whose values can represent distinct
> codes for all members of the largest extended character set specified
> among the supported locales (22.3.1)."
 
Like you say, the C and C++ standards require wchar_t to be large
enough so one wchar_t object can hold distinct values for every
character in the largest supported character set.
 
Most widely used operating systems today (e.g., Microsoft Windows,
Linux) support character sets with (at least) hundreds of
thousands of characters.
 
Can you explain how hundreds of thousands of distinct values can be
represented in a single 16-bit wide wchar_t object?
David Brown <david.brown@hesbynett.no>: Aug 17 08:24PM +0200

On 17/08/2019 18:06, Bonita Montero wrote:
>> minimum of 32 bit.
 
> Win32 uses UTF-16 and there will never be more populated codepoints
> than UTF-16 can cover; so where's the problem?
 
There is no problem with using UTF-16 for strings - except that it is an
encoding that combines the disadvantages of UTF-8 with the disadvantages
of UTF-32, and has none of their advantages. But you cannot store an
arbitrary Unicode /character/ in a single 16-bit object. Critically, you
cannot store a plane-2 CJK (Chinese, Japanese, Korean) ideograph in a
single 16-bit wchar_t, despite these characters being supported by
Windows. You /can/ store them in a char32_t, or a 32-bit wchar_t.
David Brown <david.brown@hesbynett.no>: Aug 17 08:41PM +0200

On 17/08/2019 18:05, Bonita Montero wrote:
>> Unicode characters in a wchar_t because it is too small.
 
> The size of wchar_t is implementation-dependent, so Windows isn't broken
> here.
 
Yes, Windows is broken here. The standards give minimum requirements
for many implementation-dependent features. Windows could have a 21-bit
wchar_t, or a 121-bit wchar_t if it liked, but not a 16-bit one. It is
just like the size of "int": implementation-dependent, but with a
standard-imposed minimum of 16 bits.
 
> UTF-16 has fewer codepoints available than UTF-32, but even UTF-16
> covers such a huge number of codepoints that it will never be
> exhausted.
 
Unicode started with 16-bit code points, but within five years or so
they figured out that 16-bit was not enough, and it was extended.
 
Perhaps you don't understand how multi-unit encoding works: UTF-16 can
represent all of Unicode's million-plus code points, just as UTF-8 and
UTF-32 can, yet you still cannot fit every Unicode character into a
single 16-bit wchar_t.
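
To make the multi-unit point concrete, here is a minimal sketch of how
UTF-16 turns a code point above U+FFFF into two 16-bit units (U+20000
chosen as an example):

#include <cstdio>

int main() {
    unsigned cp = 0x20000;              // any code point in U+10000..U+10FFFF
    unsigned v  = cp - 0x10000;         // 20-bit value
    unsigned hi = 0xD800 + (v >> 10);   // high (lead) surrogate
    unsigned lo = 0xDC00 + (v & 0x3FF); // low (trail) surrogate
    std::printf("U+%X -> 0x%04X 0x%04X\n", cp, hi, lo); // U+20000 -> 0xD840 0xDC00
}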
 
 
> And as there will never be more codepoints in Unicode than the "limited"
> range of UTF-16 can cover, Windows isn't broken here.
 
Look it up. UTF-16 can cover everything, though it is a silly choice.
A single 16-bit wchar_t cannot store every character that UTF-16 can
encode.
 
 
>> On most modern systems, wchar_t matches char32_t.
>> But it is not a  requirement.
 
> That was where David is wrong.
 
The minimum requirement would be 21-bit for a system supporting Unicode.
But wchar_t must also match an existing integral type. It would be
possible for a compiler to have an extended integer type that is 24-bit,
but assuming a "normal" compiler, 32-bit is the only sane minimum
requirement for wchar_t. And while the standards allow bigger sizes
- a 64-bit wchar_t would be perfectly compliant - sanity dictates 32-bit.
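
For those who want to check the point at compile time, a one-liner that
fails on Windows and passes on typical 32-bit-wchar_t platforms:

#include <cwchar>

static_assert(WCHAR_MAX >= 0x10FFFF,
              "wchar_t cannot hold every Unicode scalar value");

int main() {}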
 
>> but they are freestanding systems and don't need the full library to
>> be compliant.  (They might be non-compliant in other ways.)
 
> I wasn't talking about the library but basic datatypes.
 
I don't think /you/ know what you are talking about - it is hard for
other people to guess.
 
>> because it is only 16-bit on that platform, when it is required to be
>> 32-bit.
 
> UTF-16 is suitable for all codepoints that will ever be populated.
 
When you are in a hole, stop digging.
David Brown <david.brown@hesbynett.no>: Aug 17 08:46PM +0200

On 17/08/2019 19:47, Hergen Lehmann wrote:
>> here.
 
> Yes, the size of wchar_t is implementation-dependent, so it's not
> formally broken.
 
It is for Windows, because Windows supports Unicode as an execution
character set (the encoding is irrelevant), and 16-bit wchar_t is not
big enough. Anything 21 bits or higher, that fulfils the other
requirements in the standard, would do.
 
> return the stored characters, nor that insert/erase/substring operations
> will cut the string properly - which completely defeats the purpose of
> wchar_t.
 
Yes.
 
16-bit wchar_t made sense when Windows supported UCS-2, which is a
16-bit subset of Unicode, and is where Unicode started. But Unicode
moved on in 1996, and Windows did not.
 
> If I have to code multi-byte awareness into each and every string
> operation anyway, there is no point in using wstring instead of string
> with the much more common UTF-8 encoding.
 
Yes.
 
Bo Persson <bo@bo-persson.se>: Aug 17 09:07PM +0200

On 2019-08-17 at 19:39, Paavo Helde wrote:
 
> [1] 3.9.1/5: "Type wchar_t is a distinct type whose values can represent
> distinct codes for all members of the largest extended character set
> specified among the supported locales."
 
And, of course, MS follows the letter of the law here by limiting the
number of supported locales. The language standard says nothing about that.
 
 
Bo Persson
Bo Persson <bo@bo-persson.se>: Aug 17 09:21PM +0200

On 2019-08-17 at 20:22, Tim Rentsch wrote:
> thousands of characters.
 
> Can you explain how hundreds of thousands of distinct values can be
> represented in a single 16-bit wide wchar_t object?
 
No, they would naturally not fit.
 
But there is a loop-hole here, as the language standard says "the
supported locales" and not "every existing locale".
 
So if you only support the locales where 16 bits is enough, you haven't
broken any rules. And you cannot help it if some people use parts of
"unsupported" locales as well...
David Brown <david.brown@hesbynett.no>: Aug 17 09:41PM +0200

On 17/08/2019 21:07, Bo Persson wrote:
>> character set specified among the supported locales."
 
> And, of  course, MS follows the letter of the law here by limiting the
> number of supported locales. The language standard says nothing about that.
 
I can understand if MS Windows does not support Cuneiform or Old Persian
locales. But are you telling me they don't support Chinese, Japanese or
Korean using plane 2 characters?
 
My understanding is that the standard Windows APIs for things like file
names now support full UTF-16 encodings, and that includes characters
that won't fit in a 16-bit wchar_t. When the decision to have a 16-bit
wchar_t was made, these APIs were UCS-2, so 16-bit wchar_t was a
suitable and compliant choice at the time. But it is not suitable any
more (and hasn't been for a long time).
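
As a rough sketch of that (the file name here is arbitrary): a plane-2
character handed to a wide Win32 call simply travels as a surrogate pair
inside the 16-bit wchar_t string:

#include <windows.h>

int main() {
    // L"\U00020000" compiles to a surrogate pair with Windows' 16-bit wchar_t.
    HANDLE h = CreateFileW(L"\U00020000.txt", GENERIC_WRITE, 0, nullptr,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h != INVALID_HANDLE_VALUE)
        CloseHandle(h);
}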
