Sunday, September 11, 2016

Digest for comp.lang.c++@googlegroups.com - 5 updates in 2 topics

woodbrian77@gmail.com: Sep 11 01:24PM -0700

On Monday, October 12, 2015 at 12:12:07 PM UTC-5, Scott Lurndal wrote:
> or SOAP?
 
> What guarantee is there that your solution will be available a decade
> from now?
 
There's no guarantee, but the service has been on line since
2002. The company is stronger than ever and what we've been
anticipating is happening:
 
http://www.reseller.co.nz/article/590705/middleware-as-a-service-turns-enterprise-integration-its-head/
 
 
Brian
Ebenezer Enterprises - In G-d we trust.
http://webEbenezer.net
David Brown <david.brown@hesbynett.no>: Sep 11 01:51PM +0200

On 10/09/16 23:22, Vir Campestris wrote:
>> for MSVC to require a BOM. gcc fixed their code in version 4.4 - has
>> MSVC fixed their compiler yet?
 
> You've obviously come from a Linux world.
 
Not really - my background is very mixed. But I haven't had much use
for Unicode on Windows, where Latin-1 has always been sufficient for
the non-ASCII characters I needed (mainly the Norwegian letters ÅØÆ).
 
> system then you have to follow its rules. Inserting a BOM is far less
> painful than swapping all your slashes around - or even turning them
> into Yen symbols.
 
Just using / seems to work fine for slashes. Windows APIs, AFAIK,
accept them without question.
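
For example (a trivial sketch - the path is made up), the same call
works whichever slash you use:

    #include <fstream>
    #include <iostream>

    int main()
    {
        // The forward slashes are passed straight through to the
        // Win32 file APIs, so this opens C:\temp\log.txt just as
        // the escaped form "C:\\temp\\log.txt" would.
        std::ifstream in("C:/temp/log.txt");
        std::cout << (in ? "opened\n" : "not found\n");
    }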
 
But I don't do much PC programming - most of my work is on small
embedded systems, where neither Unicode nor filenames are relevant.
 
 
> The advantages are a lot less clear in countries that habitually use
> more than 128 characters. Little unimportant countries like Japan,
> China, India, Russia, Korea...
 
Actually, it is still clear in those countries. CJK text often
includes characters outside the BMP, and those take four bytes in
both UTF-8 and UTF-16; characters inside the BMP take three bytes in
UTF-8 against two in UTF-16. And since most text documents of any
size will be full of ASCII markup (html, xml, word processor stuff,
etc.), the proportion of multi-byte code points is not nearly as high
as you might guess, so UTF-8 usually ends up at least as compact.
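
A quick check of those byte counts (just an illustrative sketch; the
two code points are arbitrary examples, and it assumes C++11..17,
where u8 literals are still plain char):

    #include <iostream>
    #include <string>

    int main()
    {
        // U+6F22 is inside the BMP: 3 bytes in UTF-8, 2 in UTF-16.
        std::string    bmp8  = u8"\u6F22";
        std::u16string bmp16 = u"\u6F22";

        // U+20BB7 is outside the BMP: 4 bytes in both encodings.
        std::string    sup8  = u8"\U00020BB7";
        std::u16string sup16 = u"\U00020BB7";

        std::cout << bmp8.size()                     << ' '   // 3
                  << bmp16.size() * sizeof(char16_t) << ' '   // 2
                  << sup8.size()                     << ' '   // 4
                  << sup16.size() * sizeof(char16_t) << '\n'; // 4
    }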
 
There are good reasons for UTF-8 being the most common encoding on the
net, even in those countries.
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: Sep 11 04:52PM +0200

On 10.09.2016 20:44, Chris Vine wrote:
> wchar_t with surrogate pairs for unicode coverage (nor for that matter
> 8-bit wchar_t with UTF-8), but there may be something else in the
> standard about that, either in C++11 or C++14.
 
"Distinct codes" is clear (otherwise it would be meaningless, and the
standards committees are not into that).
 
But this requirement comes from earlier in C, as I recall.
 
 
Cheers & hth.,
 
- Alf
Tim Rentsch <txr@alumni.caltech.edu>: Sep 11 10:55AM -0700

> wchar_t, and the signed and unsigned integer types are collectively
> called integral types."
 
> References are to the C++11 standard.
 
In the C standard the type wchar_t is defined in 7.19p2, and
char{16,32}_t in 7.28p2, but other than that I think the two
are pretty much the same.
 
> wchar_t with surrogate pairs for unicode coverage (nor for that matter
> 8-bit wchar_t with UTF-8), but there may be something else in the
> standard about that, either in C++11 or C++14.
 
I think this is sort of right and sort of wrong. The question is
what characters must be in "the largest extended character set
specified among the supported locales". AFAICT that set does not
have to be any larger than what is in the "C" locale, which is
"the minimal environment for C translation" (which for the sake
of discussion let's say is the same as 7-bit ASCII). If that set
is small enough, then wchar_t could be 16 bits or 8 bits, as you
say. If however "the largest extended character set specified
among the supported locales" has 100,000 characters, then I don't
see how wchar_t can be 16 bits or smaller, because there is no
way for each of those 100,000 characters to have a distinct code.
 
Now, if the largest extended character set has only the 127 ASCII
characters (plus 0 for null), then wchar_t can be 8 bits and use
a UTF-8 encoding. Or, if the largest extended character set has
only the 16-bit Unicode characters that do not need surrogate
pairs for their encoding, then wchar_t can be 16 bits and use a
UTF-16 encoding. AFAICT both of these scenarios are allowed by
the C and C++ standards. But it depends on what characters are
in the largest extended character set specified among the
supported locales.
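
A quick way to see which of these scenarios a given implementation
chose (just a sketch, nothing more):

    #include <cstdio>
    #include <cstdint>
    #include <cwchar>

    int main()
    {
        // How many distinct codes can wchar_t actually hold here?
        std::printf("sizeof(wchar_t) = %zu, WCHAR_MAX = %ju\n",
                    sizeof(wchar_t), (std::uintmax_t)WCHAR_MAX);
    }

Typically MSVC reports 2 and 65535 (UTF-16 code units), while glibc
on Linux reports 4 and 2147483647.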
 
Let me emphasize that the above represents my best understanding,
and only that; it may in fact be wrong if there is something I have
missed.
Tim Rentsch <txr@alumni.caltech.edu>: Sep 11 11:24AM -0700


> "Distinct codes" is clear (otherwise it would be meaningless, and
> the standards committees are not into that).
 
> But this requirement comes from earlier in C, as I recall.
 
More or less this same text (for wchar_t) appears in pre-C99 drafts.
The type wchar_t itself goes back to C90, and the wide-character
library was filled out in Amendment 1, aka "C95", but I don't have a
copy of that handy so I don't know what wording it uses.
 
I agree that the "distinct codes" provision means there must be a
distinct numeric value in wchar_t for each character in the
largest extended character set, but how big does that set have
to be? AFAICT it can be pretty small and still be part of a
conforming implementation. Let me draw your attention to two
passages in the C++14 standard (which I believe are the same
as the corresponding passages in the relevant C standards).
 
2.13.3 paragraph 1:
 
A character literal that begins with the letter L, such as
L'z', is a wide-character literal. A wide-character literal
has type wchar_t. The value of a wide-character literal
containing a single c-char has value equal to the numerical
value of the encoding of the c-char in the execution
wide-character set, unless the c-char has no representation
in the execution wide-character set, in which case the value
is implementation-defined.
 
16.8 paragraph 2, entry for the symbol __STDC_ISO_10646__:
 
An integer literal of the form yyyymmL (for example,
199712L). If this symbol is defined, then every character
in the Unicode required set, when stored in an object of
type wchar_t, has the same value as the short identifier of
that character. The Unicode required set consists of all
the characters that are defined by ISO/IEC 10646, along with
all amendments and technical corrigenda as of the specified
year and month.
 
Both of these passages appear to acknowledge the possibility that
some "characters" might not be representable in the execution
wide-character set. Even if __STDC_ISO_10646__ is defined, it
might be defined with a date value that necessitates only 16-bit
characters for wchar_t. (I don't have the old Unicode documents
available, but I'm pretty sure there was such a time in the last
20 or 25 years.) Is there something in the newer standards that
requires __STDC_ISO_10646__ be defined at all, or that constrains
the range of dates that may be used? AFAICT there isn't, but of
course I may have missed something.
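
For what it's worth, a program can probe what its own implementation
promises (the 200103L comparison below is my own addition - ISO 10646
first assigned code points above U+FFFF in its 2001 edition, so any
date from then on implies wchar_t must be wider than 16 bits):

    #include <cstdio>

    int main()
    {
    #ifdef __STDC_ISO_10646__
        std::printf("__STDC_ISO_10646__ = %ldL, sizeof(wchar_t) = %zu\n",
                    (long)__STDC_ISO_10646__, sizeof(wchar_t));
    #if __STDC_ISO_10646__ >= 200103L
        // This edition already has assigned characters above U+FFFF,
        // so a 16-bit wchar_t could not give them distinct codes.
        std::printf("wchar_t must cover code points above U+FFFF\n");
    #endif
    #else
        // A conforming implementation is free not to define it at all.
        std::printf("__STDC_ISO_10646__ is not defined\n");
    #endif
    }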
 
So, am I on track here, or is there something you think I may
have missed?