Saturday, September 10, 2016

Digest for comp.lang.c++@googlegroups.com - 3 updates in 1 topic

Tim Rentsch <txr@alumni.caltech.edu>: Sep 10 09:31AM -0700


> [...] For conformance a `wchar_t` must be able to represent
> all code points of Unicode, i.e., in practice it must be 32 bits,
> since it can't be 21.
 
Can you give citations to passages that establish this statement?
AFAICT the various standards do not require wchar_t to support
all of Unicode, even in later versions that mandate <uchar.h>.
Of course I may have missed something, especially in the C++
documents.
Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Sep 10 07:44PM +0100

On Sat, 10 Sep 2016 09:31:02 -0700
> all of Unicode, even in later versions that mandate <uchar.h>.
> Of course I may have missed something, especially in the C++
> documents.
 
I was wondering about that.
 
§3.9.1/5 says "Type wchar_t is a distinct type whose values can
represent distinct codes for all members of the largest extended
character set specified among the supported locales (22.3.1). Type
wchar_t shall have the same size, signedness, and alignment
requirements (3.11) as one of the other integral types, called its
underlying type. Types char16_t and char32_t denote distinct types with
the same size, signedness, and alignment as uint_least16_t and
uint_least32_t, respectively, in <stdint.h>, called the underlying
types."
 
§3.9.1/7 says: "Types bool, char, char16_t, char32_t, wchar_t, and the
signed and unsigned integer types are collectively called integral
types."
 
References are to the C++11 standard.
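[The guarantees quoted above can be checked mechanically. A minimal sketch, my own rather than anything from the thread, using static_assert to confirm the char16_t/char32_t underlying-type requirement; note that nothing comparable pins down the width of wchar_t itself:]

```cpp
#include <cstdint>
#include <climits>

// Per C++11 3.9.1/5, char16_t and char32_t must have the same size,
// signedness, and alignment as uint_least16_t and uint_least32_t.
static_assert(sizeof(char16_t) == sizeof(std::uint_least16_t),
              "char16_t must match uint_least16_t");
static_assert(sizeof(char32_t) == sizeof(std::uint_least32_t),
              "char32_t must match uint_least32_t");

// wchar_t, by contrast, only has to cover the largest extended character
// set among the supported locales; its width is implementation-defined.
// The weakest portable claim is that it is at least one byte wide.
static_assert(sizeof(wchar_t) * CHAR_BIT >= 8,
              "wchar_t is at least one byte");
```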
 
This doesn't, in my view, preclude the use of 16-bit code units for
wchar_t, with surrogate pairs for Unicode coverage (nor, for that matter,
an 8-bit wchar_t with UTF-8), but there may be something else in the
standard about that, either in C++11 or C++14.
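[For what the 16-bit-code-unit scheme looks like in practice, here is a minimal sketch, not from the thread, of how a code point above U+FFFF is split into a UTF-16 surrogate pair; this is the mechanism by which 16-bit units cover all of Unicode:]

```cpp
#include <cassert>
#include <cstdint>

// Encode a supplementary-plane code point (U+10000..U+10FFFF) as a
// UTF-16 surrogate pair: high surrogate in D800..DBFF, low in DC00..DFFF.
void to_surrogates(std::uint32_t cp, char16_t& high, char16_t& low)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;                                        // 20 bits remain
    high = static_cast<char16_t>(0xD800 + (cp >> 10));    // top 10 bits
    low  = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // bottom 10 bits
}
```

For example, U+1F600 comes out as the pair 0xD83D, 0xDE00.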
 
Chris
Vir Campestris <vir.campestris@invalid.invalid>: Sep 10 10:22PM +0100

On 09/09/2016 16:13, David Brown wrote:
<snip>
> source code with a BOM. It is less understandable, and at least as bad
> for MSVC to require a BOM. gcc fixed their code in version 4.4 - has
> MSVC fixed their compiler yet?
 
You've obviously come from a Linux world.
My experience from years back is that a file won't have a BOM, because
we know what character set it is. It's US-ASCII, or some other 7-bit
national variant - or it might even be EBCDIC. Only the truly obscure
have a 6-bit byte, and are limited to UPPER CASE ONLY.
 
Linux assumes you are going to run UTF-8, and that's just as invalid as
assuming Windows-1251 - which used to be a perfectly sane assumption in
some parts of the world.
 
The BOM gets you around a few of these problems.
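[The check itself is trivial; a minimal sketch, my own rather than anything MSVC or gcc documents, of testing a buffer for the UTF-8 BOM (the three bytes EF BB BF):]

```cpp
#include <cstddef>

// Return true if the buffer begins with the UTF-8 byte order mark,
// i.e. U+FEFF encoded in UTF-8 as the bytes EF BB BF.
bool has_utf8_bom(const unsigned char* buf, std::size_t n)
{
    return n >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF;
}
```

A tool that sees those three bytes can safely assume UTF-8; without them it is back to guessing the encoding, which is the problem being discussed.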
 
If you want to compile your code on the world's most popular operating
system then you have to follow its rules. Inserting a BOM is far less
painful than swapping all your slashes around - or even turning them
into Yen symbols.
 
<snip>
 
> Maybe MS and Windows simply suffer from being the brave pioneer here,
> and the *nix world watched them then saw how to do it better.
 
I think you're right. The 16-bit Unicode charset made perfect sense at
the time.
 
<snip>
 
> The advantages of UTF-8 over UTF-16 are clear and technical, not just
> "feelings" or propaganda.
 
The advantages are a lot less clear in countries that habitually use
more than 128 characters. Little unimportant countries like Japan,
China, India, Russia, Korea...
 
Andy
