- "C++: Size Matters in Platform Compatibility" - 1 Update
- How to get standard paths?!? - 17 Updates
Geoff <geoff@invalid.invalid>: Aug 17 12:43PM -0700

On Sat, 17 Aug 2019 15:08:52 +0200, David Brown
<david.brown@hesbynett.no> wrote:

[snip]

>need to define and document the format, making it independent of the
>platform. The author here glosses over the important bits - endianness
>and alignment (and consequently padding), which he doesn't mention at all.

He does mention endianness and he declined to discuss it, saying it
needs a separate discussion. You are the second poster here to overlook
that.

[snip]

>seconds will last longer than protons, and give you accurate fractions
>of a second for more normal timeframes. If these are not suitable,
>define a format for your particular requirements.

The UNIX epoch-based 64-bit time is good only if you believe the sun
will only last another 2.147 billion years. The epoch time rolls over
in the year 2,147,485,547 on MacOS, because the year count is held in a
32-bit int counting years since 1900. The sun is calculated to have
another 5.4 billion years of hydrogen fuel left. Proton decay, so far,
is only theoretical. It has never been observed.

[snip]
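As an illustrative aside (not part of the original exchange): the
2,147,485,547 figure comes from the calendar conversion rather than from
the 64-bit seconds counter itself. struct tm carries the year as an int
counting years since 1900, so on a platform with a 32-bit int the largest
representable calendar year is INT_MAX + 1900, even though 64-bit seconds
reach far further. A minimal sketch, assuming a 32-bit int:

// Sketch only: contrasts the range of a 64-bit seconds counter with the
// ceiling imposed by tm_year (an int holding years since 1900).
#include <climits>
#include <cstdint>
#include <iostream>

int main() {
    // A 64-bit signed seconds counter covers roughly 292 billion years.
    long double years_of_seconds =
        static_cast<long double>(INT64_MAX) / (60.0L * 60 * 24 * 365.25);

    // But converting to struct tm caps out where tm_year overflows.
    long long max_tm_year = static_cast<long long>(INT_MAX) + 1900;

    std::cout << "64-bit seconds cover about " << years_of_seconds
              << " years\n";
    std::cout << "largest year representable in tm_year: " << max_tm_year
              << "\n";   // prints 2147485547
}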
Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 06:05PM +0200

> You can store any character encoding supported by the compiler in a
> wchar_t, unless you are using Windows, which is broken and can't store
> Unicode characters in a wchar_t because it is too small.

The size of wchar_t is implementation-dependent, so Windows isn't broken
here. UTF-16 covers fewer codepoints than UTF-32, but even UTF-16 covers
such a huge number of codepoints that they will never be exhausted. And
as there will never be more codepoints in Unicode than the "limited"
range of UTF-16 can cover, Windows isn't broken here.

> On most modern systems, wchar_t matches char32_t.
> But it is not a requirement.

That was where David is wrong.

> designs, not ancient ones. They often don't have a full C++ library,
> but they are freestanding systems and don't need the full library to
> be compliant. (They might be non-compliant in other ways.)

I wasn't talking about the library but about basic datatypes.

> Windows compilers are invariably non-compliant regarding wchar_t because
> it is only 16-bit on that platform, when it is required to be 32-bit.

UTF-16 is suitable for all codepoints that will ever be populated.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 06:06PM +0200

> execution set. Since Windows supports Unicode (regardless of the
> encodings used), a Windows compiler must be able to hold any Unicode
> code point in a wchar_t - i.e., wchar_t must be a minimum of 32 bit.

Win32 uses UTF-16 and there will never be more populated codepoints
than UTF-16 can cover; so where's the problem?
Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 06:26PM +0200

> Exactly, yes. 16-bit wchar_t can't do that on a system that
> supports Unicode (regardless of the encoding).

There will be no more Unicode codepoints populated than could be
addressed by UTF-16. At least not until we start supporting alien
languages.
Paavo Helde <myfirstname@osa.pri.ee>: Aug 17 07:55PM +0300

On 17.08.2019 18:54, Bonita Montero wrote:

> The APIs accepting UTF-16 strings aren't for persistence.

You mean, like CreateFileW() or RegSetKeyValueW()?

> And there aren't any function for string-manipulation on
> the Win32-API.

Like ExpandEnvironmentStringsW() or PathUnExpandEnvStringsW()?

Not that it would matter in the slightest. Microsoft's use of wchar_t
still does not match the C++ standard and UTF-16 still remains the most
useless Unicode representation.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 07:15PM +0200

>> The APIs accepting UTF-16 strings aren't for persistence.
> You mean, like CreateFileW() or RegSetKeyValueW()?

To be more precise: they aren't for persistent content. And for both of
those functions UTF-16 is sufficient.

> Not that it would matter in the slightest. Microsoft's use of wchar_t
> still does not match the C++ standard and UTF-16 still remains the
> most useless Unicode representation.

wchar_t is not required to be 32 bits wide.
Paavo Helde <myfirstname@osa.pri.ee>: Aug 17 08:39PM +0300

On 17.08.2019 20:15, Bonita Montero wrote:

>> You mean, like CreateFileW() or RegSetKeyValueW()?
> To be more precise: they aren't for persistent content. And for both of
> those functions UTF-16 is sufficient.

Sure, and UTF-8 would be as well.

>> still does not match the C++ standard and UTF-16 still remains the
>> most useless Unicode representation.
> wchar_t is not required to be 32 bits wide.

Sure, wchar_t can be 64 bit or whatever, as long as it can hold any
character supported by the implementation [1]. Windows happens to
support Unicode, so a 16-bit type does not cut it.

[1] 3.9.1/5: "Type wchar_t is a distinct type whose values can represent
distinct codes for all members of the largest extended character set
specified among the supported locales."
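As an illustrative aside (not part of the original exchange), a quick way
to see what a given toolchain actually provides is to print WCHAR_MAX; on
typical Windows compilers it is 0xFFFF, on typical Linux/macOS toolchains
it is 0x7FFFFFFF. A minimal sketch:

// Sketch only: reports the width of wchar_t and whether it could hold the
// highest Unicode code point, U+10FFFF.
#include <cstdint>   // WCHAR_MAX
#include <iostream>

int main() {
    std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << " bytes, "
              << "WCHAR_MAX = " << static_cast<unsigned long long>(WCHAR_MAX)
              << "\n";
    std::cout << "can hold U+10FFFF: "
              << (static_cast<unsigned long long>(WCHAR_MAX) >= 0x10FFFF)
              << "\n";   // 0 on 16-bit wchar_t platforms, 1 on 32-bit ones
}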
Bonita Montero <Bonita.Montero@gmail.com>: Aug 17 07:44PM +0200

>> To be more precise: they aren't for persistent content. And for both of
>> those functions UTF-16 is sufficient.
> Sure, and UTF-8 would be as well.

UTF-8 and UTF-16 can address 21 bits - all codepoints that Unicode
addresses.

> [1] 3.9.1/5: "Type wchar_t is a distinct type whose values can represent
> distinct codes for all members of the largest extended character set
> specified among the supported locales."
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
Hergen Lehmann <hlehmann.expires.5-11@snafu.de>: Aug 17 07:47PM +0200

On 17.08.19 at 18:05, Bonita Montero wrote:

>> Unicode characters in a wchar_t because it is too small.
> The size of wchar_t is implementation-dependent, so Windows isn't broken
> here.

Yes, the size of wchar_t is implementation-dependent, so it's not
formally broken. But the Windows implementation is technically broken,
as the programmer can neither rely on the assumption that any given
character can be stored in a wchar_t, nor that iterating over a string
will correctly return the stored characters, nor that
insert/erase/substring operations will cut the string properly - which
completely defeats the purpose of wchar_t.

If I have to code multi-byte awareness into each and every string
operation anyway, there is no point in using wstring instead of string
and the much more common UTF-8 encoding.

And it is getting even worse, as many Windows libraries and Windows
applications are not aware of the fact that a Windows wstring is
supposed to be UTF-16. The code assumes one codepoint per string
position and fails in unpredictable ways if the input data actually
contains codepoints from the upper planes.

>> On most modern systems, wchar_t matches char32_t.
>> But it is not a requirement.
> That was where David is wrong.

He's not.
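As an illustrative aside (not part of the original exchange), the problem
Hergen describes is easy to show with char16_t, which has the same width
as wchar_t on Windows. A minimal sketch:

// Sketch only: with 16-bit code units, one string position is not one
// code point, so naive size/substring operations can split a character.
#include <iostream>
#include <string>

int main() {
    // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so UTF-16
    // stores it as the surrogate pair 0xD834 0xDD1E.
    std::u16string clef = u"\U0001D11E";

    std::cout << "code points stored: 1\n";
    std::cout << "u16string::size():  " << clef.size() << "\n";   // prints 2

    // "The first character" turns out to be half of a surrogate pair,
    // which is not a valid code point on its own.
    std::u16string broken = clef.substr(0, 1);
    std::cout << "substr(0, 1) size:  " << broken.size()
              << " (a lone high surrogate)\n";
}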
David Brown <david.brown@hesbynett.no>: Aug 17 08:10PM +0200

On 17/08/2019 19:44, Bonita Montero wrote:

>> Sure, and UTF-8 would be as well.
> UTF-8 and UTF-16 can address 21 bits - all codepoints that Unicode
> addresses.

Get back to us when you figure out how to fit 21 bits of Unicode code
points into a 16-bit wchar_t. You can happily store any Unicode
character in a UTF-16 string, but not in a single 16-bit character
object.
David Brown <david.brown@hesbynett.no>: Aug 17 08:20PM +0200

On 17/08/2019 18:26, Bonita Montero wrote:

>> supports Unicode (regardless of the encoding).
> There will be no more Unicode codepoints populated than could be
> addressed by UTF-16. At least not until we start supporting alien
> languages.

Does your ignorance know no bounds? Aren't you even capable of the
simplest of web searches or references before talking rubbish in public?

<https://en.wikipedia.org/wiki/Unicode>
<https://home.unicode.org/>

Currently, Unicode has 137,929 characters. These are organised into 17
code planes of 64K code points each, of which only 3 planes are
significantly used. But 3 planes is a great deal more than 1 plane - a
16-bit type has been insufficient for Unicode since 1996.
Tim Rentsch <tr.17687@z991.linuxsc.com>: Aug 17 11:22AM -0700

> wchar_t as 16 bit since it makes no sense to implement it differently.
> But it's easier to conform to any hypothetical implementation by using
> char16_t.

(and in another posting quotes the C++ standard)

> "Type wchar_t is a distinct type whose values can represent distinct
> codes for all members of the largest extended character set specified
> among the supported locales (22.3.1)."

Like you say, the C and C++ standards require wchar_t to be large enough
so one wchar_t object can hold distinct values for every character in
the largest supported character set. Most widely used operating systems
today (eg, Microsoft Windows, Linux) support character sets with (at
least) hundreds of thousands of characters.

Can you explain how hundreds of thousands of distinct values can be
represented in a single 16-bit wide wchar_t object?
David Brown <david.brown@hesbynett.no>: Aug 17 08:24PM +0200

On 17/08/2019 18:06, Bonita Montero wrote:

>> minimum of 32 bit.
> Win32 uses UTF-16 and there will never be more populated codepoints
> than UTF-16 can cover; so where's the problem?

There is no problem with using UTF-16 for strings - except that it is an
encoding that combines all the disadvantages of UTF-8 with all the
disadvantages of UTF-32 and none of their advantages. But you cannot
store an arbitrary Unicode /character/ in a single 16-bit object.
Critically, you cannot store a Unicode CJK (Chinese, Japanese, Korean)
ideograph from outside the BMP in a single 16-bit wchar_t, despite these
characters being supported by Windows. You /can/ store them in a
char32_t, or a 32-bit wchar_t.
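As an illustrative aside (not part of the original exchange), the
distinction David draws - encodable in a UTF-16 string, but not storable
in one 16-bit object - looks like this with the plane-2 ideograph U+20BB7:

// Sketch only: U+20BB7 fits in a single char32_t object, but needs two
// UTF-16 code units (a surrogate pair) in a 16-bit string.
#include <iostream>
#include <string>

int main() {
    char32_t one_object = U'\U00020BB7';        // one 32-bit object holds it
    std::u16string two_units = u"\U00020BB7";   // encoded as a surrogate pair

    std::cout << std::hex
              << "char32_t value:    "
              << static_cast<unsigned long>(one_object) << "\n"   // 20bb7
              << "UTF-16 code units: " << two_units.size() << "\n";   // 2

    // char16_t c = u'\U00020BB7';   // ill-formed: does not fit in 16 bits
}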
David Brown <david.brown@hesbynett.no>: Aug 17 08:41PM +0200

On 17/08/2019 18:05, Bonita Montero wrote:

>> Unicode characters in a wchar_t because it is too small.
> The size of wchar_t is implementation-dependent, so Windows isn't broken
> here.

Yes, Windows is broken here. The standards give minimum requirements for
many implementation-dependent features. Windows could have a 21-bit
wchar_t, or a 121-bit wchar_t if it liked, but not a 16-bit one. It is
just like saying the size of "int" is implementation-dependent, but the
standards require a minimum of 16 bits for int.

> UTF-16 covers fewer codepoints than UTF-32, but even UTF-16 covers
> such a huge number of codepoints that they will never be exhausted.

Unicode started with 16-bit code points, but within five years or so
they figured out that 16 bits were not enough, and it was extended.
Perhaps you don't understand how multi-unit encoding works, and are
confused by the fact that UTF-16 encoding is fine for handling a
million-plus code points, just like UTF-8 and UTF-32 can, while a single
16-bit wchar_t cannot hold all Unicode characters.

> And as there will never be more codepoints in Unicode than the "limited"
> range of UTF-16 can cover, Windows isn't broken here.

Look it up. UTF-16 can cover everything, though it is a silly choice.
A 16-bit wchar_t cannot store all UTF-16 characters.

>> On most modern systems, wchar_t matches char32_t.
>> But it is not a requirement.
> That was where David is wrong.

The minimum requirement would be 21 bits for a system supporting
Unicode. But wchar_t must also match an existing integral type. It would
be possible for a compiler to have an extended integer type that is
24-bit, but assuming a "normal" compiler, 32-bit is the only sane
minimum requirement for wchar_t. And while the standards allow bigger
sizes - a 64-bit wchar_t would be perfectly compliant - sanity dictates
32-bit.

>> but they are freestanding systems and don't need the full library to
>> be compliant. (They might be non-compliant in other ways.)
> I wasn't talking about the library but about basic datatypes.

I don't think /you/ know what you are talking about - it is hard for
other people to guess.

>> because it is only 16-bit on that platform, when it is required to be
>> 32-bit.
> UTF-16 is suitable for all codepoints that will ever be populated.

When you are in a hole, stop digging.
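As an illustrative aside (not part of the original exchange), the
multi-unit encoding David refers to works by splitting code points above
U+FFFF into a high and a low surrogate. A sketch of the mechanism:

// Sketch only: encodes a code point (assumed <= U+10FFFF) as UTF-16 code
// units. A string can therefore carry any Unicode character even though
// a single 16-bit unit cannot.
#include <cstdio>
#include <vector>

std::vector<char16_t> encode_utf16(char32_t cp) {
    if (cp < 0x10000)                          // BMP: one code unit
        return { static_cast<char16_t>(cp) };
    cp -= 0x10000;                             // 20 bits remain
    char16_t high = static_cast<char16_t>(0xD800 + (cp >> 10));
    char16_t low  = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
    return { high, low };
}

int main() {
    for (char32_t cp : { char32_t(0x0041), char32_t(0x20BB7) }) {
        auto units = encode_utf16(cp);
        std::printf("U+%04X ->", static_cast<unsigned>(cp));
        for (char16_t u : units)
            std::printf(" 0x%04X", static_cast<unsigned>(u));
        std::printf("\n");                     // U+0041  -> 0x0041
    }                                          // U+20BB7 -> 0xD842 0xDFB7
}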
David Brown <david.brown@hesbynett.no>: Aug 17 08:46PM +0200

On 17/08/2019 19:47, Hergen Lehmann wrote:

>> here.
> Yes, the size of wchar_t is implementation-dependent, so it's not
> formally broken.

It is for Windows, because Windows supports Unicode as an execution
character set (the encoding is irrelevant), and a 16-bit wchar_t is not
big enough. Anything 21 bits or higher that fulfils the other
requirements in the standard would do.

> return the stored characters, nor that insert/erase/substring operations
> will cut the string properly - which completely defeats the purpose of
> wchar_t.

Yes. A 16-bit wchar_t made sense when Windows supported UCS-2, which is
a 16-bit subset of Unicode, and is where Unicode started. But Unicode
moved on in 1996, and Windows did not.

> If I have to code multi-byte awareness into each and every string
> operation anyway, there is no point in using wstring instead of string
> and the much more common UTF-8 encoding.

Yes.
Bo Persson <bo@bo-persson.se>: Aug 17 09:07PM +0200

On 2019-08-17 at 19:39, Paavo Helde wrote:

> [1] 3.9.1/5: "Type wchar_t is a distinct type whose values can represent
> distinct codes for all members of the largest extended character set
> specified among the supported locales."

And, of course, MS follows the letter of the law here by limiting the
number of supported locales. The language standard says nothing about
that.

Bo Persson
Bo Persson <bo@bo-persson.se>: Aug 17 09:21PM +0200

On 2019-08-17 at 20:22, Tim Rentsch wrote:

> thousands of characters.
> Can you explain how hundreds of thousands of distinct values can be
> represented in a single 16-bit wide wchar_t object?

No, they would naturally not fit. But there is a loop-hole here, as the
language standard says "the supported locales" and not "every existing
locale". So if you only support the locales where 16 bits is enough, you
haven't broken any rules.

And you cannot help it if some people use parts of "unsupported" locales
as well...
David Brown <david.brown@hesbynett.no>: Aug 17 09:41PM +0200

On 17/08/2019 21:07, Bo Persson wrote:

>> character set specified among the supported locales."
> And, of course, MS follows the letter of the law here by limiting the
> number of supported locales. The language standard says nothing about
> that.

I can understand it if MS Windows does not support Cuneiform or Old
Persian locales. But are you telling me they don't support Chinese,
Japanese or Korean using plane 2 characters?

My understanding is that the standard Windows APIs for things like file
names now support full UTF-16 encodings, and that includes characters
that won't fit in a 16-bit wchar_t. When the decision to have a 16-bit
wchar_t was made, these APIs were UCS-2, so a 16-bit wchar_t was a
suitable and compliant choice at that time. But it is not suitable any
more (and hasn't been for a long time).
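As an illustrative aside (not part of the original exchange, and assuming
a Windows build environment), passing a file name containing the plane-2
character U+20BB7 - the surrogate pair 0xD842 0xDFB7 - to the wide file
API would look roughly like this:

// Sketch only: the wide Win32 file functions take UTF-16 strings, so the
// name may contain characters that do not fit in a single 16-bit wchar_t.
#ifdef _WIN32
#include <windows.h>

int main() {
    const wchar_t name[] = L"\xD842\xDFB7.txt";   // U+20BB7 then ".txt"
    HANDLE h = CreateFileW(name, GENERIC_WRITE, 0, nullptr,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (h != INVALID_HANDLE_VALUE)
        CloseHandle(h);
}
#else
int main() {}   // nothing to demonstrate off Windows
#endif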