Sunday, March 29, 2015

Digest for comp.lang.c++@googlegroups.com - 14 updates in 4 topics

Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Mar 28 11:49PM

On Sat, 28 Mar 2015 22:07:05 +0000
> On 28/03/2015 08:57, Jorgen Grahn wrote:
> > Surely utf8/is/ unicode?
 
> Oh no, not at all.
 
Oh yes.
 
> Unicode is a wide character set
 
That is wrong. Unicode is a defined set of code points. There are a
number of encodings which unicode recognizes for the purposes of
encoding those code points, namely UTF-7, UTF-8, UTF-16 and UTF-32.
Because the last of these is in fact a one-to-one encoding of code
points, it is identical to UCS-4.
 
> that is efficient for us westerners. It isn't bad for China either, as
> a lot of the characters can be compressed down to 2 bytes - which
> means UTF-8 is no bigger than 16-bit Unicode (UCS-2?).
 
UTF-16 is the 16 bit encoding for unicode. UCS-2 only supports the
basic multilingual plane, whereas UTF-16 can represent all unicode code
points (and accordingly is a variable length encoding as it uses
surrogate pairs of 16 bit code units). UTF-16 occupies more space than
UTF-8 for the average european script. It is reputed to occupy slightly
less for the average Japanese script.
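 
For concreteness, here is how a code point beyond the basic
multilingual plane becomes a surrogate pair (a sketch of mine; the
algorithm is the one given in RFC 2781):
 
    #include <cstdint>
    #include <cstdio>
 
    int main() {
        std::uint32_t cp = 0x1F600;              // a code point outside the BMP
        std::uint32_t v  = cp - 0x10000;         // 20-bit value to split
        std::uint16_t hi = 0xD800 + (v >> 10);   // high (lead) surrogate
        std::uint16_t lo = 0xDC00 + (v & 0x3FF); // low (trail) surrogate
        std::printf("U+%04X -> %04X %04X\n",     // prints: U+1F600 -> D83D DE00
                    (unsigned)cp, (unsigned)hi, (unsigned)lo);
    }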
 
> but takes up to 4 bytes per character. and ++ on a char string
> doesn't work any more - you have to know you're using UTF-8 and take
> special measures to get whole characters.
 
Many "characters" (given the meaning most people think it has) require
more than two unicode code points in normalized non-precomposed form.
Some "characters" are not representable in precomposed form. Such
representations require more than one UTF-32 code unit, more than two
UTF-16 code units and can require more than four UTF-8 code units.
 
Because UTF-16 is a variable length encoding, your '++' does not work
(for your meaning of "work") with UTF-16 either. Because of combining
characters, nor does UTF-32 if by "character" you mean a grapheme,
which is what most people think it means (namely, what they see as a
"character" in their terminal).
 
> AIUI Unix has used UTF-8 since the year dot, and hence Linux since
> birth. DOS had national variants :( - which is why the Japanese (used
> to?) use the yen currency character for backslash in path separators.
 
No. For narrow encodings, unix used to be as incoherent as microsoft
code pages for its narrow codesets. ISO-8859 was common for
non-cyrillic european scripts, KOI8-R for cyrillic, and EUC ("Extended
Unix Code") for CJK scripts. JIS and Shift-JIS were also in use for
Japanese scripts and GB 2312 for Chinese scripts.
 
Chris
Richard Damon <Richard@Damon-Family.org>: Mar 28 08:55PM -0400

On 3/28/15 7:07 PM, Nobody wrote:
> (which are less concerned about legacy compatibility) and b) the
> internet (which typically wants some degree of internationalisation
> at minimal cost).
 
One of the keys to Unicode (via UTF-8 encoding) working in *nix
environments is that the designers of Unicode made special effort to
make it fairly transparent to most programs. The first 128 characters
exactly match the ASCII definitions, so an ASCII file and a Unicode
UTF-8 file of the same content are identical. They were also careful
that no code point's encoding looks like a piece of another's, which
makes string searching generally "work". This means that, in general,
a program which just manipulates strings at points found by searching
for characters/strings will tend to "just work" with UTF-8 text. This
describes most of the operations in the *nix kernel.
 
Programs that need to break down a string into "characters" (like an
editor) need to be much more aware of things like Unicode.
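 
The byte classes he refers to are disjoint, which is why byte-wise
searching is safe; a small illustration (my own, not taken from the
Unicode standard's text):
 
    // No ASCII byte can occur inside a multi-byte sequence, and no
    // continuation byte can be mistaken for a lead byte.
    bool is_ascii(unsigned char c)        { return c < 0x80; }           // 0xxxxxxx
    bool is_continuation(unsigned char c) { return (c & 0xC0) == 0x80; } // 10xxxxxx
    bool is_lead(unsigned char c)         { return (c & 0xC0) == 0xC0; } // 11xxxxxx
 
So a search for '/' (0x2F) in a UTF-8 path can never match inside some
other character's encoding.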
Nobody <nobody@nowhere.invalid>: Mar 29 01:14PM +0100

On Sat, 28 Mar 2015 23:49:10 +0000, Chris Vine wrote:
 
>> Unicode is a wide character set
 
> That is wrong. Unicode is a defined set of code points.
 
Which is roughly the correct meaning of "character set" (as opposed to
"encoding", which is what some people mean when they say "character set").
 
It's "wide" insofar as their are more than 256 code points.
 
> No. For narrow encodings, unix used to be as incoherent as microsoft
> code pages for its narrow codesets.
 
I wouldn't go quite that far.
 
And to the extent that it's true, it hasn't really changed. UTF-8 is
something a lot of people *wish* was standard, but isn't. UTF-8 itself is
*a* standard, but from POSIX' perspective, it's just one of the many
encodings which may or may not be supported by a given platform.
 
In short, for all its advantages, UTF-8 isn't magically immune to
 
http://xkcd.com/927/
"Öö Tiib" <ootiib@hot.ee>: Mar 29 09:29AM -0700

On Sunday, 29 March 2015 15:14:06 UTC+3, Nobody wrote:
> encodings which may or may not be supported by a given platform.
 
> In short, for all its advantages, UTF-8 isn't magically immune to
 
> http://xkcd.com/927/
 
Yes, but that is not the case with the C, C++ or POSIX standards. Those
keep string contents totally implementation-defined because of legacy
systems that may have 9-bit bytes or EBCDIC encoding or the like.
However, since even such (more imaginary than real) systems have to
deal with data formats consisting of 8-bit octets and UTF-8 text (the
vast majority of the data we have on our planet is that), it is
clearly a futile trend.
Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Mar 29 05:50PM +0100

On Sun, 29 Mar 2015 13:14:04 +0100
> "encoding", which is what some people mean when they say "character
> set").
 
> It's "wide" insofar as their are more than 256 code points.
 
First, to say that unicode comprises UTF-32 and that other encodings,
including UTF-8, are not "unicode" (which was the suggestion to which I
was responding) is out-and-out wrong. You do not help anyone reading
this newsgroup to suggest otherwise.
 
Secondly, unicode is universal (as the name suggests). It is a type
error to say that unicode is "wide", and the fact that there are
0x10FFFF usable code points within the range unicode employs is
irrelevant to this. There are two narrow and two wide encodings for
unicode, if by "wide" you mean greater than 8 bits and by "narrow" you
mean 8 bits or less.
 
(As an aside, if by "wide" you are trying to bring in some association
with wchar_t, then you would be wrong to do so: on unix-like systems
wchar_t is normally a 32 bit type, therefore leaving the 16-bit unicode
encoding on this measure as "narrow" or unclassifiable on unix and
"wide" on windows.)
 
There are enough misconceptions about unicode floating around without
creating more.
 
Chris
Vir Campestris <vir.campestris@invalid.invalid>: Mar 29 09:09PM +0100

On 29/03/2015 17:50, Chris Vine wrote:
> wchar_t is normally a 32 bit type, therefore leaving the 16-bit unicode
> encoding on this measure as "narrow" or unclassifiable on unix and
> "wide" on windows.)
 
Correct me if I'm wrong, but I think Windows just represents the first
64k code points in its 16-bit characters. With no way of representing
the rest.
 
> There are enough misconceptions about unicode floating around without
> creating more.
 
The point at which I pricked up my ears was "utf-8 == unicode". Which
it isn't - it's a representation.
 
And back to the question I asked a few days ago - if you want to open a
file whose name is not US-ASCII is there a way to do it without using a
compression system of some sort in current STL?
 
Andy
Paavo Helde <myfirstname@osa.pri.ee>: Mar 29 04:03PM -0500

Vir Campestris <vir.campestris@invalid.invalid> wrote in
>> (As an aside, if by "wide" you are trying to bring in some association
>> with wchar_t, then you would be wrong to do so: on unix-like systems
>> wchar_t is normally a 32 bit type, therefore leaving the 16-bit
>> unicode
 
> Correct me if I'm wrong, but I think Windows just represents the first
> 64k code points in its 16-bit characters. With no way of representing
> the rest.
 
No, at some point (more than 15 years ago, I believe) the Windows
encoding was redefined to be UTF-16 instead of UCS-2.
 
>> creating more.
 
> The point at which I pricked up my ears was "utf-8 == unicode". Which it isn't -
> it's a representation.
 
Yes, UTF-8 is a representation of Unicode. Which means it is Unicode
(i.e. it has the property of being able to represent all Unicode code
points).
 
 
> And back to the question I asked a few days ago - if you want to open a
> file whose name is not US-ASCII is there a way to do it without using a
> compression system of some sort in current STL?
 
Not sure what a "compression system" is, but of course on every
platform the C++ implementations generally take care that it is
possible to open files with all valid filenames for that platform.
Unfortunately this is still very platform-specific. On Windows, for
example, you might need to use functions like _wfopen_s() or
CreateFileW().
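 
A rough sketch of the split (the filename is my example, and the
wide-name constructor is a Microsoft extension rather than standard
C++):
 
    #include <fstream>
 
    int main() {
        // "r\u00E9sum\u00E9.txt" spells "résumé.txt" without depending
        // on the encoding of the source file itself.
    #ifdef _WIN32
        std::ifstream in(L"r\u00E9sum\u00E9.txt");  // MSVC extension: UTF-16 name
    #else
        std::ifstream in(u8"r\u00E9sum\u00E9.txt"); // UTF-8 bytes pass through
    #endif
        return in.is_open() ? 0 : 1;
    }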
 
Cheers
Paavo
Richard Damon <Richard@Damon-Family.org>: Mar 29 05:10PM -0400

On 3/29/15 4:09 PM, Vir Campestris wrote:
 
> Correct me if I'm wrong, but I think Windows just represents the first
> 64k code points in its 16-bit characters. With no way of representing
> the rest.
 
The current Microsoft documentation describes using UTF-16, so I think
they mean to allow surrogate pairs to get to the full range of Unicode
codepoints. This doesn't say how much of the system will have trouble
with them.
 
>> creating more.
 
> The point at which I pricked up my ears was "utf-8 == unicode". Which it isn't -
> it's a representation.
 
As are utf-16, ucs-2, utf-32, and ucs-4. You have to use some form of
"representation" to store ANY data.
 
> file whose name is not US-ASCII is there a way to do it without using a
> compression system of some sort in current STL?
 
> Andy
 
It will inherently be implementation (or other standard) dependent. For
*nix, it will depend on the system's language setting. If it is using
utf-8, then just send the file name as UTF-8. If it is configured to use
a national code page, you send a string encoded in the national code page.
 
Windows, I believe, will always store the filename as UTF-16 (so you
don't have language-interpretation issues with filenames), but the
"narrow" open function will interpret the character string according
to the defined locale, as the function will widen it.
"Chris M. Thomasson" <no@spam.invalid>: Mar 29 01:31PM -0700

> triggering cache reads.
 
 
> Can someone tell me more about how C++ 11's new memory order is
> implemented at a low level?
 
Basically, they are implemented using whatever it takes to achieve the
memory ordering constraints you specify. Just beware of
memory_order_consume... It can be tricky to get right. AFAICT, one of
the main reasons it exists is to support Read-Copy-Update (RCU)...
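 
For example, the most common pairing is release/acquire; a minimal
sketch (mine, not tied to any particular hardware):
 
    #include <atomic>
 
    int data = 0;
    std::atomic<bool> ready{false};
 
    void producer() {
        data = 42;                                    // plain store...
        ready.store(true, std::memory_order_release); // ...published here
    }
 
    void consumer() {
        while (!ready.load(std::memory_order_acquire)) // spin until published
            ;
        // data is guaranteed to be 42 here. On x86 both operations
        // compile to plain loads/stores; on ARM/POWER the compiler
        // emits barrier or load-acquire/store-release instructions.
    }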
 
 
> And can someone tell me how to achieve that goal of cache control?
 
The cache control that you want is beyond your control at this level;
actually, forget about cache, and think about visibility. However, you
can design your data-structures to be "cache friendly", so to speak.
Basically, try really hard to get around false-sharing.
 
Something along the lines of:
 
https://groups.google.com/d/msg/comp.programming.threads/kR2OxyF5IAg/rRgHOIrcNoAJ
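 
The usual trick, as a sketch (64 is my assumption for the line size;
it is hardware-specific):
 
    #include <atomic>
 
    // Give each thread's counter its own cache line so that writes by
    // one thread do not keep invalidating the line holding the others.
    struct alignas(64) PaddedCounter {
        std::atomic<long> value{0};
    };
 
    PaddedCounter counters[8]; // e.g. one slot per thread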
 
 
I can go into more detail if you want. But I am a bit time constrained right
now.
 
Sorry!
 
;^o
"Öö Tiib" <ootiib@hot.ee>: Mar 28 07:51PM -0700

On Saturday, 28 March 2015 20:37:35 UTC+2, David Brown wrote:
> nothing to put in it. That way you can make use of the compiler's
> static error checking to warn you if you accidentally use it later
> without having set it.
 
Compilers do not typically warn if you pass an uninitialized variable
to a function by non-const reference or pointer.
 
If you really want to get an advantage from not initializing variables
then you need something that instruments the code for runtime checking,
for example clang's MemorySanitizer. Such tools seem to only warn if
the program actually reads from an uninitialized variable when running.
That may be some rare corner case.
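 
For instance (my example), a compiler will usually stay quiet about
this, because x is passed by pointer before the read, but building
with clang++ -fsanitize=memory reports the read at runtime:
 
    #include <cstdio>
 
    void maybe_set(int* p, bool cond) {
        if (cond) *p = 1;   // on the false path *p is never written
    }
 
    int main() {
        int x;              // deliberately uninitialized
        maybe_set(&x, false);
        if (x > 0)          // MSan: use-of-uninitialized-value here
            std::puts("positive");
    }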
 
Lots of types have some "unavailable" state for when you don't have
anything to put into them. Floating point has NaN, pointers have
nullptr and so on. I prefer to initialize with such. The rest I
initialize with some likely-bad value (minimum for a signed value,
maximum for an unsigned value, "#FEIL#" for a string, etc.). Works
fine.
JiiPee <no@notvalid.com>: Mar 29 12:21PM +0100

On 28/03/2015 17:25, Marcel Mueller wrote:
 
> There are use cases where it is no problem to declare uninitialized PODs.
> E.g. if a value is initialized by switch/case/default. Assigning an
> initial value is useless in this case.
 
Although even then I guess it's not 100% useless, as initializing might
help in a situation where a human forgets, for example, to insert the
default case. So it's for debugging purposes and for protecting against
human error. But how useful this is, I am not sure...
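 
Something like this (my example): the sentinel is redundant while
every case assigns, but a later forgotten case fails visibly instead
of leaving an indeterminate value.
 
    enum State { A, B, C };
 
    int classify(State s) {
        int code = -1;          // sentinel, redundant today
        switch (s) {
        case A: code = 1; break;
        case B: code = 2; break;
        case C: code = 3; break;
        // a forgotten case would leave code == -1, not garbage
        }
        return code;
    }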
JiiPee <no@notvalid.com>: Mar 29 12:29PM +0100

On 29/03/2015 03:51, Öö Tiib wrote:
> That may be some rare corner case.
 
> Lot of types have some "unavailable" state when you don't have anything
> to put into those. Floating point has NaN,
 
"Floating point has NaN". oh thats a good idea, never thought of that
:). I think am gonna use that.
Did you invent it yourself? I normally put something like -1.0, but NaN
is better? Sometimes I also think putting MAX_FLOAT, but that is
actually a valid value...so maybe not so good.
 
Vir Campestris <vir.campestris@invalid.invalid>: Mar 29 09:11PM +0100

On 29/03/2015 12:29, JiiPee wrote:
> Did you invent it yourself? I normally put something like -1.0, but NaN
> is better? Sometimes I also think putting MAX_FLOAT, but that is
> actually a valid value...so maybe not so good.
 
The nice thing about NaN as opposed to -1 is that -1 + 1 is 0; a
perfectly valid number. But NaN + 1 is still NaN.
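 
A quick demonstration (my example):
 
    #include <cmath>
    #include <cstdio>
    #include <limits>
 
    int main() {
        double nan = std::numeric_limits<double>::quiet_NaN();
        std::printf("%d\n", (int)std::isnan(nan + 1.0)); // 1: still NaN
        std::printf("%g\n", -1.0 + 1.0);                 // 0: looks valid
    }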
 
Andy
Udo Steinbach <trashcan@udoline.de>: Mar 29 11:14AM +0200

> What would be the reason for NOT initalizing the variable?
 
To make clear that there is neither a default nor a not-set value.
--
Bicycle traffic in Germany: http://radwege.udoline.de/
GPG: A245 F153 0636 6E34 E2F3 E1EB 817A B14D 3E7E 482E
