Tim Rentsch <txr@alumni.caltech.edu>: Sep 10 09:31AM -0700

> [...] For conformance a `wchar_t` must be able to represent
> all code points of Unicode, i.e., in practice it must be 32 bits,
> since it can't be 21.

Can you give citations to passages that establish this statement?
AFAICT the various standards do not require wchar_t to support all
of Unicode, even in later versions that mandate <uchar.h>.  Of
course I may have missed something, especially in the C++ documents.
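
[Whether a given implementation's wchar_t can hold every Unicode code
point in a single unit is easy to probe at compile time.  A minimal
C++11 sketch, not from the thread, assuming only <limits>; it will
fail to compile on implementations such as MSVC where wchar_t is 16
bits, which is exactly the case under discussion.]

    #include <limits>

    // Unicode code points run from U+0000 to U+10FFFF.  If wchar_t's
    // maximum value is below 0x10FFFF, one wchar_t cannot name every
    // code point, and the implementation must fall back on multi-unit
    // encodings such as surrogate pairs.
    static_assert(std::numeric_limits<wchar_t>::max() >= 0x10FFFF,
                  "wchar_t on this implementation cannot represent "
                  "every Unicode code point in one unit");

    int main() {}
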
Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Sep 10 07:44PM +0100

On Sat, 10 Sep 2016 09:31:02 -0700
> all of Unicode, even in later versions that mandate <uchar.h>.
> Of course I may have missed something, especially in the C++
> documents.

I was wondering about that.  §3.9.1/5 says:

"Type wchar_t is a distinct type whose values can represent distinct
codes for all members of the largest extended character set specified
among the supported locales (22.3.1).  Type wchar_t shall have the
same size, signedness, and alignment requirements (3.11) as one of the
other integral types, called its underlying type.  Types char16_t and
char32_t denote distinct types with the same size, signedness, and
alignment as uint_least16_t and uint_least32_t, respectively, in
<stdint.h>, called the underlying types."

§3.9.1/7 says:

"Types bool, char, char16_t, char32_t, wchar_t, and the signed and
unsigned integer types are collectively called integral types."

References are to the C++11 standard.

This doesn't, in my view, preclude the use of 16-bit code units for
wchar_t with surrogate pairs for Unicode coverage (nor, for that
matter, 8-bit wchar_t with UTF-8), but there may be something else
in the standard about that, either in C++11 or C++14.

Chris
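
[To make the surrogate-pair reading concrete, here is a small C++11
sketch, not part of the original post.  It counts the code units a
non-BMP character occupies in wchar_t, char16_t and char32_t string
literals.  How a wide literal encodes a character outside the
execution wide-character set is implementation-defined, so the wchar_t
figures differ between, say, MSVC (16-bit wchar_t, two surrogate
units) and typical glibc-based systems (32-bit wchar_t, one unit).]

    #include <cstdio>
    #include <string>   // std::char_traits

    int main() {
        // U+1F600 lies outside the Basic Multilingual Plane, so it
        // cannot fit in a single 16-bit code unit.
        const wchar_t  *w   = L"\U0001F600";
        const char16_t *u16 = u"\U0001F600";
        const char32_t *u32 = U"\U0001F600";

        std::printf("wchar_t : %zu bytes per unit, %zu units\n",
                    sizeof(wchar_t),
                    std::char_traits<wchar_t>::length(w));
        std::printf("char16_t: %zu units (a surrogate pair)\n",
                    std::char_traits<char16_t>::length(u16));
        std::printf("char32_t: %zu unit\n",
                    std::char_traits<char32_t>::length(u32));
    }
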
Vir Campestris <vir.campestris@invalid.invalid>: Sep 10 10:22PM +0100

On 09/09/2016 16:13, David Brown wrote:
<snip>
> source code with a BOM. It is less understandable, and at least as bad
> for MSVC to require a BOM. gcc fixed their code in version 4.4 - has
> MSVC fixed their compiler yet?

You've obviously come from a Linux world.  My experience from years
back is that a file won't have a BOM, because we know what character
set it is.  It's US ASCII, or some other 7-bit national variant - or
it might even be EBCDIC.  Only the truly obscure have a 6-bit byte,
and are limited to UPPER CASE ONLY.

Linux assumes you are going to run UTF-8, and that's just as invalid
as assuming Windows-1251 - which used to be a perfectly sane
assumption in some parts of the world.  The BOM gets you around a few
of these problems.

If you want to compile your code on the world's most popular
operating system then you have to follow its rules.  Inserting a BOM
is far less painful than swapping all your slashes around - or even
turning them into Yen symbols.

<snip>
> Maybe MS and Windows simply suffer from being the brave pioneer here,
> and the *nix world watched them then saw how to do it better.

I think you're right.  The 16-bit Unicode charset made perfect sense
at the time.

<snip>
> The advantages of UTF-8 over UTF-16 are clear and technical, not just
> "feelings" or propaganda.

The advantages are a lot less clear in countries that habitually use
more than 128 characters.  Little unimportant countries like Japan,
China, India, Russia, Korea...

Andy
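
[The byte-count argument behind that last paragraph is easy to check.
A minimal C++11/C++14 sketch, not from the post (in C++20 the u8
literal would have type char8_t instead): each of the three CJK
characters below takes 3 bytes in UTF-8 but only 2 bytes in UTF-16,
so UTF-8's size advantage disappears, and indeed reverses, for such
text.]

    #include <cstdio>
    #include <string>   // std::char_traits

    int main() {
        // "Japanese" (nihongo) written with universal-character-names
        // so the source file's own encoding does not matter.
        const char     *utf8  = u8"\u65E5\u672C\u8A9E";  // UTF-8
        const char16_t *utf16 = u"\u65E5\u672C\u8A9E";   // UTF-16

        std::printf("UTF-8 : %zu bytes\n",
                    std::char_traits<char>::length(utf8));           // 9
        std::printf("UTF-16: %zu bytes\n",
                    2 * std::char_traits<char16_t>::length(utf16));  // 6
    }
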