- How to get standard paths?!? - 18 Updates
- "C++: Size Matters in Platform Compatibility" - 1 Update
- newbie question: exceptions - 3 Updates
- Tricky! - 2 Updates
- Performance of unaligned memory-accesses - 1 Update
Keith Thompson <kst-u@mib.org>: Aug 17 04:33PM -0700 > Does your ignorance know no bounds? > Aren't you even capable of the simplest of web searches or references > before talking rubbish in public? I think you have incorrectly assumed that Bonita is asserting that there are no more than 65536 Unicode codepoints. UTF-16 can represent all Unicode codepoints. It cannot represent each Unicode codepoint in 16 bits; some of them require two 16-bit values. I presume Bonita meant that UTF-16 can represent all of Unicode (which is true). But a 16-bit wchar_t cannot, because the standard requires wchar_t to be "able to represent all members of the execution wide-character set" (that's the point Bonita missed or ignored). To use wchar_t to represent all Unicode code points, you either have to make wchar_t at least 21 bits (more likely 32 bits) *or* you have to use it in a way that doesn't satisfy the requirements of the standard, such as using UTF-16 to encode some characters in more than one wchar_t. [...] -- Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst> Will write code for food. void Void(void) { Void(); } /* The recursive call of the void */ |
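As an aside, the distinction between code points and 16-bit code units is easy to see with the standard u16string/u32string types (a minimal sketch; the emoji is just an arbitrary non-BMP example):

    #include <string>
    #include <cstdio>

    int main() {
        // U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
        std::u16string s = u"\U0001F600";
        std::u32string t = U"\U0001F600";
        std::printf("UTF-16 code units: %zu, UTF-32 code units: %zu\n",
                    s.size(), t.size());   // prints 2 and 1
    }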
Keith Thompson <kst-u@mib.org>: Aug 17 04:46PM -0700 > wchar_t, or a 121-bit wchar_t if it liked, but not a 16-bit one. It is > just like saying the size of "int" is implementation dependent, but the > standards require a minimum of 16-bit int. **OR** it could have a 16-bit wchar_t and not claim to support Unicode as its wide character set. An implementation with 16-bit wchar_t that supports only the BMP as its wide character set could be conforming (though not as useful as an implementation that supports full Unicode). But Microsoft's own documentation says: The wchar_t type is an implementation-defined wide character type. In the Microsoft compiler, it represents a 16-bit wide character used to store Unicode encoded as UTF-16LE, the native character type on Windows operating systems. The wide character versions of the Universal C Runtime (UCRT) library functions use wchar_t and its pointer and array types as parameters and return values, as do the wide character versions of the native Windows API. https://docs.microsoft.com/en-us/cpp/cpp/char-wchar-t-char16-t-char32-t?view=vs-2019 which I believe is non-conforming. (But I'm not sure how Microsoft could have fixed this without breaking too much existing code.) [...] -- Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst> Will write code for food. void Void(void) { Void(); } /* The recursive call of the void */ |
Robert Wessel <robertwessel2@yahoo.com>: Aug 17 10:57PM -0500 On Sat, 17 Aug 2019 20:41:06 +0200, David Brown >handling some billion+ code points, just like UTF-8 and UTF-32 can, >while not understanding that you can't fit all Unicode characters into a >single 16-bit wchar_t. Actually UTF-16 can't encode billions of characters, as it specifically uses a pair of extension/surrogate characters, which encode 10 bits each, leading to the extra 16 planes. The format for UTF-8 was designed (and could trivially be extended) to support 31-bit code points (with a six-byte sequence). It was limited to the current 17-plane scheme by the adoption of UTF-16, which could only address 17 planes (the BMP plus the 16 extra planes implied by the 20-bit number encoded in the surrogate characters). |
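The surrogate arithmetic described above can be made concrete with a small encoder (a minimal sketch, not a complete UTF-16 implementation; it does no validation of the input):

    #include <cstdio>

    // Encode one code point (<= 0x10FFFF) as UTF-16 code units.
    // Returns how many 16-bit units were written (1 or 2).
    int encode_utf16(char32_t cp, char16_t out[2]) {
        if (cp <= 0xFFFF) {                  // BMP: a single unit
            out[0] = static_cast<char16_t>(cp);
            return 1;
        }
        cp -= 0x10000;                       // 20 bits left: the 16 extra planes
        out[0] = static_cast<char16_t>(0xD800 | (cp >> 10));    // high surrogate, top 10 bits
        out[1] = static_cast<char16_t>(0xDC00 | (cp & 0x3FF));  // low surrogate, bottom 10 bits
        return 2;
    }

    int main() {
        char16_t u[2] = {};
        int n = encode_utf16(0x13080, u);    // Egyptian hieroglyph "B", U+13080
        std::printf("%d unit(s): %04X %04X\n", n, (unsigned)u[0], (unsigned)u[1]);
        // prints: 2 unit(s): D80C DC80
    }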
Tim Rentsch <tr.17687@z991.linuxsc.com>: Aug 17 10:25PM -0700 > So if you only support the locales where 16 bits is enough, you > haven't broken any rules. And you cannot help it if some people use > parts of "unsupported" locales as well... Perhaps not as much of a loophole as you think. The setlocale() function is defined, even for C++, by the C standard. A call to setlocale() must not accept a locale that has more characters than wchar_t can accommodate, because of how 'wide character' is defined. So the "unsupported" locales you are talking about cannot be used in a conforming implementation. Of course, if the implementation is not conforming, it can do whatever it wants. |
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 08:48AM +0200 > Does your ignorance know no bounds? Unicode is defined to have a maximum of 21 bit codepoints. |
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 08:49AM +0200 > an encoding that combines all the disadvantages of UTF-8 with all the > disadvantages of UTF-32 with none of their advantages. But you cannot > store a arbitrary Unicode /character/ in a single 16-bit object. Therefore you have UTF-16. |
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 08:54AM +0200 > return the stored characters, nor that insert/erase/substring operations > will cut the string properly - which completely defies the purpose of > wchar_t. That's how UTF-16 works and the Unicode-standard recommends UTF-16 for certain circumstances. |
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 08:57AM +0200 > wchar_t, or a 121-bit wchar_t if it liked, but not a 16-bit one. It is > just like saying the size of "int" is implementation dependent, but the > standards require a minimum of 16-bit int. There's no mandatory relationship between Unicode and wchar_t >> range of UTF-16 can cover Windows isn't broken here. > Look it up. UTF-16 can cover everything, though it is a silly choice. A > 16-bit wchar_t cannot store all UTF-16 characters. But UTF-16 can. And Windows works with UTF-16. |
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 08:58AM +0200 > Get back to us when you figure out how to fit 21 bits of Unicode code > points into a 16-bit wchar_t. wchar_t has no mandatory relationship to Unicode. And Windows uses UTF-16. And UTF-16 works with 16-bit characters. |
Bart <bc@freeuk.com>: Aug 18 10:16AM +0100 On 18/08/2019 07:58, Bonita Montero wrote: >> points into a 16-bit wchar_t. > wchar_t has no mandatory relationship to Unicode. > And Windows uses UTF-16. And UTF-16 works with 16-bit characters. This is what you might often do with ASCII: unsigned char ascii[128]; for (i=0; i<128; ++i) ascii[i]=i; So that all ASCII code points are represented in sequence by each element of ascii. What people are saying is that you can't do the equivalent for Unicode using Windows' wchar_t when the latter is 16 bits: wchar_t unicode[1114112]; for (i=0; i<1114112; ++i) unicode[i]=i; because the stored values will wrap back to 0 as soon as i gets to 65536 (and again at 131072 and so on). If this unicode[] array were to represent a UTF-16 /string/, then its size would be greater than 1114112 elements, as many values require escape sequences with multiple elements, you couldn't write the loop in such a simple way, and you wouldn't have the Nth code point stored as the single value in unicode[N]. This is why a 32-bit wchar_t would have been a far better choice. Obviously, Windows can display and work with any Unicode characters, but it makes it more complicated than it would have been. (There's nothing to stop a Windows program choosing to write the program like this: uint32_t unicode[1114112]; for (i=0; i<1114112; ++i) unicode[i]=i; but then such 32-bit characters, or strings using such arrays, are not directly supported by the OS.) |
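The truncation described above can also be shown directly with the distinct character types, where char16_t stands in for a 16-bit wchar_t (a minimal sketch):

    #include <cassert>

    int main() {
        // A code point above U+FFFF silently truncates in a 16-bit element.
        char16_t c = static_cast<char16_t>(0x13080);  // hieroglyph "B", U+13080
        assert(c == 0x3080);      // upper bits lost - a different character entirely
        char32_t d = 0x13080;     // a 32-bit wide type holds the code point intact
        assert(d == 0x13080);
    }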
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 11:27AM +0200 > escape sequences with multiple elements, you couldn't write the loop > in such a simple way, and you wouldn't have the Nth code point stored > as the single value in unicode[N]. Of course a wchar_t with 32 bits would be more convenient. But with 16 bits it fits with Win32. And the Unicode standard recommends UTF-16 for certain circumstances. > uint32_t unicode[1114112]; > for (i=0; i<1114112; ++i) unicode[i]=i; No one uses char arrays of that length in C++; they use u16string or u32string. |
David Brown <david.brown@hesbynett.no>: Aug 18 11:34AM +0200 On 18/08/2019 01:33, Keith Thompson wrote: >> before talking rubbish in public? > I think you have incorrectly assumed that Bonita is asserting that there > are no more than 65536 Unicode codepoints. Possibly. If I have misinterpreted her, then I will be glad to be corrected. > UTF-16 can represent all Unicode codepoints. It cannot represent each > Unicode codepoint in 16 bits; some of them require two 16-bit values. Correct - and that is something I have said several times. > is true). But a 16-bit wchar_t cannot, because the standard requires > wchar_t to be "able to represent all members of the execution > wide-character set" (that's the point Bonita missed or ignored). Agreed - and again, it is something I have said several times. It is fine (both in the sense of working practically and in being compliant with the standards) to use char16_t and u16string to handle all Unicode strings and characters. You can't store all Unicode characters in a /single/ char16_t object - but a char16_t is for storing Unicode code /units/, not code /points/, so it is fine for the job. But a wchar_t has to be able to store any /character/ - for Unicode, that means 21 bits of code point. > make wchar_t at least 21 bits (more likely 32 bits) *or* you have to use > it in a way that doesn't satisfy the requirements of the standard, such > as using UTF-16 to encode some characters in more than one wchar_t. In practice, I think people on Windows use wchar_t strings (or arrays) for holding UTF-16 encoded strings. That will work fine. But it encourages mistaken assumptions - such as that ws[9] holds the tenth Unicode character in the string, or that wcslen returns the number of characters in the string. These assumptions hold for a proper wchar_t, such as the 32-bit wchar_t on Unix systems (or a 16-bit wchar_t on Windows while it used UCS-2 rather than UTF-16). I think the sensible practice would be to deprecate the use of wchar_t as much as possible, using instead char8_t for UTF-8 strings when dealing with string data (and especially for data interchange), and char32_t for UTF-32 encoding internally if you need character-by-character access. These are unambiguous and function identically across platforms (except perhaps for the endianness of char32_t). For interaction with legacy code and APIs on Windows, char16_t is a better choice than wchar_t. Going forward, C++ could drop wchar_t and support for wide character execution sets other than Unicode, just as it is dropping support for signed integer representations other than two's complement. This kind of thing limits flexibility in theory, but not in practice, and it would simplify things a bit. |
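As a side note, the three types suggested above are easy to compare side by side (a sketch; the char8_t/u8string part assumes a C++20 compiler, and the sample text is arbitrary):

    #include <string>
    #include <cstdio>

    int main() {
        // The same text ("Øl 𓃀") in the three encodings, each suited to a different job:
        std::u8string  utf8  = u8"\u00D8l \U00013080";  // compact, good for data interchange
        std::u16string utf16 = u"\u00D8l \U00013080";   // matches UTF-16 APIs such as Win32
        std::u32string utf32 = U"\u00D8l \U00013080";   // one element per code point

        std::printf("code units: %zu %zu %zu\n",        // prints 8, 5 and 4
                    utf8.size(), utf16.size(), utf32.size());
        // Only utf32[3] is the hieroglyph itself; in the other two strings
        // that character spans several elements.
    }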
David Brown <david.brown@hesbynett.no>: Aug 18 11:34AM +0200 On 18/08/2019 08:48, Bonita Montero wrote: >> Does your ignorance know no bounds? > Unicode is defined to have a maximum of 21 bit codepoints. Yes. And how do you fit those 21 bits into a 16-bit wchar_t? |
David Brown <david.brown@hesbynett.no>: Aug 18 11:46AM +0200 On 18/08/2019 08:49, Bonita Montero wrote: >> disadvantages of UTF-32 with none of their advantages. But you cannot >> store a arbitrary Unicode /character/ in a single 16-bit object. > Therefore you have UTF-16. wchar_t is not for UTF-16 - each single wchar_t object should store a complete character. That is what the type means - look it up. Strings or arrays of wchar_t can hold UTF-16 encoded data, which will work in practice but is an abuse of wchar_t that goes against the standards. ("char16_t" is the type you want here, which is distinct from wchar_t despite being the same size on Windows.) Until you can find a way to store the /single/ character "𓃀" (the hieroglyph for "B") in a /single/ 16-bit wchar_t, not a string or array, a 16-bit wchar_t is too small. |
David Brown <david.brown@hesbynett.no>: Aug 18 12:07PM +0200 On 18/08/2019 08:54, Bonita Montero wrote: >> completely defies the purpose of wchar_t. > That's how UTF-16 works and the Unicode-standard recommends UTF-16 for > certain circumstances. I know how UTF-16 works. And UTF-16 is only recommended for documents written primarily in BMP characters between 0x0800 and 0xFFFF (i.e., non-Latin scripts, including CJK) as it is more efficient there than UTF-8 or UTF-32. This recommendation is roundly ignored, for good reasons. While UTF-16 is still used internally on Windows, and with some languages (like Java) and libraries (like Qt), it is almost completely negligible as an encoding in documents or data interchange. If you see any reference recommending its usage, check the date - the page is probably from last century. |
David Brown <david.brown@hesbynett.no>: Aug 18 12:22PM +0200 On 18/08/2019 01:46, Keith Thompson wrote: > as its wide character set. An implementation with 16-bit wchar_t that > supports only the BMP as its wide character set could be conforming > (though not as useful as an implementation that supports full Unicode). Absolutely true. And that was the case originally with Windows NT, which used UCS-2 (essentially the subset of UTF-16 that can be encoded in a single unit). But Windows has moved steadily more of its APIs, libraries, GUI widgets, and software to full UTF-16. It has done so in uncoordinated jumps, with versions of Windows that let you use multi-unit characters in some programs and APIs but not others, but I believe it is fairly complete now. (And with Windows 10, I have heard that UTF-8 support is officially in place.) So 16-bit wchar_t was appropriate when Windows NT started with UCS-2. But it should have been changed to 32-bit for later Windows. > which I believe is non-conforming. > (But I'm not sure how Microsoft could have fixed this without breaking > too much existing code.) Aye, there's the rub. People programming for Windows have a long history of making unwarranted assumptions about types and sizes. They have been used to programming on a single platform, and thinking it will remain the same forever. They have not been helped by Microsoft's unforgivable reticence over C99 and headers like <stdint.h>. Windows programmers regularly assume that wchar_t is 16-bit - changing it would break code written by these programmers. The same thing happened with the move to 64-bit Windows - because Windows programmers had assumed that "long" is exactly 32-bit (having had no "int32_t" available), changing the size of "long" to match every other 64-bit platform would have broken lots of Windows code. It is easy to say /now/ that Microsoft should have changed to 32-bit wchar_t and UTF-32 and/or UTF-8 as soon as it was clear that Unicode would not fit in 16 bits. But it would have been hard to do at the time. The resulting situation today, however, is clear. Windows is non-compliant and nearly unique with its 16-bit wchar_t, and it is the only major OS to make heavy use of UTF-16 instead of UTF-8. |
David Brown <david.brown@hesbynett.no>: Aug 18 12:24PM +0200 On 18/08/2019 08:57, Bonita Montero wrote: >> one. It is just like saying the size of "int" is implementation >> dependent, but the standards require a minimum of 16-bit int. > There's no mandatory relationship between Unicode and wchar_t Correct. But there is a mandatory relationship between the character set supported by the target system, and wchar_t. Windows supports Unicode (with UTF-16 encoding, and possibly UTF-8 with Windows 10). Therefore, wchar_t must support Unicode characters. |
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 12:57PM +0200 >> Unicode is defined to have a maximum of 21 bit codepoints. > Yes. And how to you fit those 21 bits into a 16-bit wchar_t ? That's not necessary if you use UTF-16. |
David Brown <david.brown@hesbynett.no>: Aug 18 12:32PM +0200 On 17/08/2019 21:43, Geoff wrote: > He does mention endianness and he declined to discuss it, saying it > needs a separate discussion. You are the second poster here to > overlook that. I said he glossed over it, which he did. I did not say he ignored it completely - he said it was an issue but did not discuss it further. >> define a format for your particular requirements. > The UNIX epoch-based 64-bit time is good only if you believe the sun > will only last another 2.147 billion years. 292 billion years, by my counting. By that time, any issues with rollover will be an SEP¹. > The epoch time rolls over > in the year 2,147,485,547 on MacOS. The sun is calculated to have > another 5.4 billion years of hydrogen fuel left. That sounds like a signed 32-bit integer storing the number of years (not an unreasonable choice for a split time/date format). > Proton decay, so far, is only theoretical. It has never been observed. True. And perhaps humankind will have moved to a different, newer universe by that time. We /could/ standardise on 128-bit second timestamps, but I think it is important to leave a few challenges to keep future generations motivated :-) > [snip] ¹ <https://en.wikipedia.org/wiki/Somebody_else%27s_problem> |
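A back-of-the-envelope check of that figure (a sketch, ignoring leap seconds and calendar details):

    #include <cstdint>
    #include <cstdio>

    int main() {
        // How long a signed 64-bit count of seconds since 1970 lasts.
        const double seconds_per_year = 365.25 * 24 * 3600;
        std::printf("%.1f billion years\n",
                    INT64_MAX / seconds_per_year / 1e9);   // about 292.3
    }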
Tim Rentsch <tr.17687@z991.linuxsc.com>: Aug 17 10:07PM -0700 >> Does it make sense to code: >> MyClass *myObj = MyClass(); > No. It is usually syntax error, In most cases a diagnostic is required, but assuming MyClass is a class (or struct), and there is no funny preprocessor stuff going on, it is never a syntax error. |
"Öö Tiib" <ootiib@hot.ee>: Aug 17 11:18PM -0700 On Sunday, 18 August 2019 08:07:31 UTC+3, Tim Rentsch wrote: > In most cases a diagnostic is required, but assuming MyClass > is a class (or struct), and there is no funny preprocessor > stuff going on, it is never a syntax error. I can demo that it is ill-formed on most trivial case: http://coliru.stacked-crooked.com/a/aa95b9b7035de3c3 |
"Öö Tiib" <ootiib@hot.ee>: Aug 18 12:05AM -0700 On Sunday, 18 August 2019 09:18:33 UTC+3, Öö Tiib wrote: > > stuff going on, it is never a syntax error. > I can demo that it is ill-formed on most trivial case: > http://coliru.stacked-crooked.com/a/aa95b9b7035de3c3 Ok. I think I got it ... you meant it is not syntax error but semantic error. |
Tim Rentsch <tr.17687@z991.linuxsc.com>: Aug 17 10:02PM -0700 > of all minds. So the only progress from there possible seems to > be towards (often vague, often incorrect,) understandings of some > aspects of few things. Did you mean some other "progress"? I'm sorry about how my statement was phrased. It made sense in my head when I wrote it but now I think it might be not so good. If you don't mind maybe we can just drop it and leave it at that. > to provide some. I meant that your lack of attempt to express > any thoughts started to feel strange. Have I offended you > somehow? I have the sense that your communication style is more argumentative than constructive. I don't like arguing, and I don't like trying to have a productive conversation with people who are more interested in argument than communication. So in exchanges with you I've learned not to say very much, because trying to say more would for me be counter-productive. |
"Öö Tiib" <ootiib@hot.ee>: Aug 17 11:53PM -0700 On Sunday, 18 August 2019 08:02:30 UTC+3, Tim Rentsch wrote: > I'm sorry about how my statement was phrased. It made sense in > my head when I wrote it but now I think it might be not so good. > If you don't mind maybe we can just drop it and leave it at that. Fine, I have also not much to add to it. > who are more interested in argument than communication. So in > exchanges with you I've learned not to say very much, because > trying to say more would for me be counter-productive. I have dropped viewing discussions as some kind of combat long ago. Some things remain controversial and subjects of viewpoint, style or taste. Participating in discussions has still point for me when I have some arguments to add, to question or to clarify. How else can a discussion be productive? |
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Aug 17 10:13PM -0700 On 8/7/2019 11:06 PM, Bonita Montero wrote: > I don't think FPWB is suitable for lock-free-programming because > it a kernel-call and thereby very slow. Not true. Think of RCU, how about fast SMR? Btw its not only slow wrt calling into the Kernel. It does something to other processors running threads within the calling processes affinity mask: https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-flushprocesswritebuffers Its used on the slow side of an asymmetric sync algorihtm. Think of optimizing hot paths, at the expense of hitting slow paths with extra work. |