Sunday, August 18, 2019

Digest for comp.lang.c++@googlegroups.com - 25 updates in 5 topics

Anton Shepelev <anton.txt@gmail.com>: Aug 18 03:20PM +0300

Lynn McGuire:
 
> "C++: Size Matters in Platform Compatibility"
> https://www.codeproject.com/Tips/5164768/Cplusplus-Size-Matters-in-Platform-Compatibility
> Interesting. Especially on storing Unicode as UTF-8.
 
My reaction to this article is, in fir's manner, highly
skeptical. The author says obvious things that are much
better discussed in the literature on network data transfer,
and socket programming in particular. He adds nothing new
or original except wrong conclusions, e.g.:
 
time_t is not guaranteed to be interoperable between
platforms, so it is best to store time as text and
convert to time_t accordingly.
 
whereas the natural solution is any fixed-width binary
datetime format. Loath to do calendar arithmetic, I should
pass the fields separately, e.g.:
 
    Date: 24 bits        Time: 24 bits
 +------------------+---------------------+
 | BC? Year Mon Day | Hours Min Sec 1/128 |
 |  1   14   4   5  |   5    6   6    7   |
 +------------------+---------------------+
 
The date and time are separable three-byte blocks. Each
block is a packed structure with the fields interpreted as
big-endian, according to the network byte order.
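 
A minimal sketch of the packing, assuming the field widths in the
table above (the names and the choice of uint8_t are mine):
 
    #include <array>
    #include <cstdint>
 
    // Date block: BC?(1) Year(14) Mon(4) Day(5) = 24 bits,
    // emitted big-endian per the network byte order.
    std::array<uint8_t, 3> pack_date(bool bc, unsigned year,
                                     unsigned mon, unsigned day)
    {
        uint32_t v = (uint32_t(bc)     << 23)  // bit 23
                   | ((year & 0x3FFFu) <<  9)  // bits 22..9
                   | ((mon  & 0xFu)    <<  5)  // bits 8..5
                   |  (day  & 0x1Fu);          // bits 4..0
        return { uint8_t(v >> 16), uint8_t(v >> 8), uint8_t(v) };
    }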
 
--
() ascii ribbon campaign -- against html e-mail
/\ http://preview.tinyurl.com/qcy6mjc [archived]
Anton Shepelev <anton.txt@gmail.com>: Aug 18 03:59PM +0300

David Brown:
 
> However, it doesn't make sense to have pointer types at
> all in a stored file or data interchange format.
 
He is talking about the use of pointer types as integer
indices, as (he says) WinAPI does with DWORD_PTR.
 
--
() ascii ribbon campaign -- against html e-mail
/\ http://preview.tinyurl.com/qcy6mjc [archived]
Ben Bacarisse <ben.usenet@bsb.me.uk>: Aug 18 02:50PM +0100


> The date and time are separable three-byte blocks. Each
> block is a packed structure with the fields interpreted as
> big-endian, according to the network byte order.
 
For the price of two bytes more you can make the ordering explicit
and avoid any shifting and masking. Make everything an octet:
 
century, year in century, month, day, hour, minute, second, 1/128ths
 
(Dates BC require the first byte to be signed: 6BC is year 94 in century
-1.)
 
You could, of course, use units of 512 years instead, at the expense
of a slightly less human-debuggable format. The gain is just a shift
rather than a multiply by 100 when reconstructing the year.
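 
A sketch of that layout, assuming one octet per field (the struct
and function names are mine):
 
    #include <cstdint>
 
    // All members are single octets, so there is no padding and
    // no byte-order question; the field order is the order above.
    struct OctetStamp {
        int8_t  century;   // signed, so 6BC = { -1, 94, ... }
        uint8_t year;      // year in century, 0..99
        uint8_t month, day, hour, minute, second, fract128;
    };
 
    // Reconstructing the year costs the multiply by 100 noted above.
    int full_year(const OctetStamp& s)
    {
        return int(s.century) * 100 + s.year;
    }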
 
--
Ben.
David Brown <david.brown@hesbynett.no>: Aug 18 04:32PM +0200

On 18/08/2019 14:59, Anton Shepelev wrote:
>> all in a stored file or data interchange format.
 
> He is talking about the use of pointer types as integer
> indices, as (he says) WinAPI does with DWORD_PTR.
 
It makes sense to use integer indices in a file format - use a suitable
integer size for the job (using fixed size types, like uint32_t or
int64_t). It doesn't make sense to use pointers, because they depend on
the memory address your data happens to lie at, which will not be the
same when the file or data is read again. So using some sort of
"integer pointer type" (like a pre-C99 uintptr_t alternative) makes
little sense.
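 
A minimal sketch of the difference, with hypothetical node types
(nothing here is from the article):
 
    #include <cstdint>
    #include <vector>
 
    struct Node     { Node*    next;       int32_t payload; }; // in memory
    struct DiskNode { uint32_t next_index; int32_t payload; }; // in the file
 
    // Replace each pointer with the index of its target in the node
    // table; indices survive a round trip through a file, addresses
    // do not. (A linear search keeps the sketch short.)
    std::vector<DiskNode> to_disk(const std::vector<Node*>& nodes)
    {
        std::vector<DiskNode> out;
        for (const Node* n : nodes) {
            uint32_t idx = UINT32_MAX;              // "null" marker
            for (uint32_t i = 0; i < nodes.size(); ++i)
                if (nodes[i] == n->next) { idx = i; break; }
            out.push_back({ idx, n->payload });
        }
        return out;
    }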
Geoff <geoff@invalid.invalid>: Aug 18 01:10PM -0700

On Sun, 18 Aug 2019 12:32:18 +0200, David Brown
>> will only last another 2.147 billion years.
 
>292 billion years, by my counting. By that time, any issues with
>rollover will be an SEP¹.
 
Where are you getting that? time_t is the number of seconds since
January 1, 1970, 0:00 UTC and as a 64-bit integer that's not 292
billion years.
Bart <bc@freeuk.com>: Aug 18 09:15PM +0100

On 18/08/2019 21:10, Geoff wrote:
 
> Where are you getting that? time_t is the number of seconds since
> January 1, 1970, 0:00 UTC and as a 64-bit integer that's not 292
> billion years.
 
No, it's 584 billion years. Presumably 292 billion was due to using it
as a signed value (2**63 rather than 2**64 seconds).
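 
The arithmetic, as a quick check (a 365.25-day year assumed):
 
    #include <cmath>
    #include <cstdio>
 
    int main()
    {
        const double secs_per_year = 365.25 * 86400;
        printf("2**63 s = %.1f billion years\n",
               std::ldexp(1.0, 63) / secs_per_year / 1e9);  // ~292.3
        printf("2**64 s = %.1f billion years\n",
               std::ldexp(1.0, 64) / secs_per_year / 1e9);  // ~584.6
    }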
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 12:59PM +0200

> work in practice but is an abuse of wchar_t that goes against the
> standards.  ("char16_t" is the type you want here, which is distinct
> from wchar_t despite being the same size on Windows.)
 
There's no violation of the standard. wchar_t is implementation-defined.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 01:05PM +0200

> written primarily in the BMP code plane characters between 0x0800 and
> 0xffff (i.e., non-European scripts, not including CJK) as it is more
> efficient than UTF-8 or UTF-32.
 
There's nothing about that in the Unicode standard. The Unicode
standard simply says that UTF-16 can cover every Unicode codepoint via
surrogate pairs and that UTF-16 is _optimized_ for the BMP, but not
suitable only for it. It says that it is preferred if you need a
balance of storage size and access speed.
 
> While UTF-16 is still used internally on Windows, and
> with some languages (like Java) and libraries (like QT), it is almost
> completely negligible as an encoding in documents or data interchange.
 
Those might be practical reasons about which one can argue at length,
but they don't change the fact that UTF-16 is simply Unicode-compliant.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 01:07PM +0200

> But there is a mandatory relationship between the character set
> supported by the target system, and wchar_t.
 
No, the implementor may implement whatever he likes, since wchar_t
doesn't have to conform to the target system, only to the locales
supported by the implementation.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 01:13PM +0200

On 18.08.2019 13:07, Bonita Montero wrote:
 
> No, the implementor may implement whatever he likes, since wchar_t
> doesn't have to conform to the target system, only to the locales
> supported by the implementation.
 
The standard says:
"Type wchar_t is a distinct type whose values can represent
distinct codes for all members of the largest extended character
set specified among the supported locales."
                        ^^^^^^^^^^^^^^^^^
So where's the nonconformance?
Bart <bc@freeuk.com>: Aug 18 01:07PM +0100

On 18/08/2019 11:57, Bonita Montero wrote:
>>> Unicode is defined to have a maximum of 21 bit codepoints.
 
>> Yes.  And how to you fit those 21 bits into a 16-bit wchar_t ?
 
> That's not necessary if you use UTF-16.
 
Show me how to make use of UTF-16 here, because it doesn't seem to work
on Windows:
 
#include <cstdio>   // for printf; there is no <wchar_t> header

int main()
{
    wchar_t c;
    c = 1000000;    // character code 1000000 (0xF4240, needs 21 bits)
    printf("c = %d 0x%X\n", (int)c, (unsigned)c);
}
 
Displays:
 
c = 16960 0x4240
 
Not this:
 
c = 1000000 0xF4240
David Brown <david.brown@hesbynett.no>: Aug 18 02:48PM +0200

On 18/08/2019 13:05, Bonita Montero wrote:
> surrogate pairs and that UTF-16 is _optimized_ for the BMP, but not
> suitable only for it. It says that it is preferred if you need a
> balance of storage size and access speed.
 
UTF-16 works fine, but is a poor solution in almost every case. As I
said previously, it combines the disadvantages of UTF-8 with the
disadvantages of UTF-32, giving the benefits of neither. People know
this - that is why it is not used to any significant extent except in places
where it is hard to avoid (Windows programming), or for dealing with
legacy code.
 
There was a time, long ago, when Unicode was young, when 16-bit was
enough for almost everything, when saving a few bytes was important,
when HTML and XML were not common, and when transparent compression was
rare. In those days, UTF-16 made sense for some uses.
 
Given a free choice without worrying about compatibility with old code
or interfaces, no one would recommend or choose UTF-16 now.
 
>> negligible as an encoding in documents or data interchange.
 
> That might be practical reasons about which one can argue excellently
> but that doesn't affect that UTF-16 is simply Unicode-compliant.
 
No one has doubted that UTF-16 is a valid Unicode encoding. That has
never been in question.
 
As seems to happen so often, you have totally misunderstood the issue
when you have been shown wrong. I wonder if it is intentional.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 03:49PM +0200


>     c = 16960 0x4240
 
> Not this:
 
>     c = 1000000 0xF4240
 
wchar_t need not be Unicode-compliant.
But you can store UTF-16 strings in std::wstring.
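 
For example, a codepoint above 0xFFFF goes into a UTF-16 wstring as
a surrogate pair; a sketch assuming 16-bit wchar_t as on Windows
(the function is mine, the arithmetic is the Unicode standard's):
 
    #include <cstdint>
    #include <string>
 
    // Encode one codepoint as UTF-16 code units; Bart's 1000000
    // (U+F4240) becomes the pair 0xDB90 0xDE40. Validation omitted.
    std::wstring to_utf16(uint32_t cp)
    {
        if (cp < 0x10000)
            return std::wstring(1, wchar_t(cp));
        cp -= 0x10000;                              // 20 bits remain
        return { wchar_t(0xD800 + (cp >> 10)),      // high surrogate
                 wchar_t(0xDC00 + (cp & 0x3FF)) };  // low surrogate
    }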
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 03:51PM +0200

> UTF-16 works fine, but is a poor solution in almost every case.
> As I said previously, it combines the disadvantages of UTF-8
> with the disadvantages of UTF-32, giving the benefits of neither.
 
That's just your taste. But someone else may argue, like the Unicode
standard, that UTF-16 is a good balance between size and performance.
David Brown <david.brown@hesbynett.no>: Aug 18 04:43PM +0200

On 18/08/2019 15:51, Bonita Montero wrote:
>> with the  disadvantages of UTF-32, giving the benefits of neither.
 
> That's just your taste. But someone else may argue, like the Unicode
> standard, that UTF-16 is a good balance between size and performance.
 
Sorry, you /do/ know that UTF-8 and UTF-32 are also encodings for
Unicode, just like UTF-16? It is not clear, but it looks a little like
you think UTF-16 /is/ Unicode and that I am suggesting something other
than Unicode. I hope you can clarify if you are misunderstanding this
or not.
 
As for taste, something like 94% of documents on the internet are
encoded with UTF-8. The rest is mostly 8-bit encodings like ISO-8859-1.
UTF-16 documents are almost invariably bigger than UTF-8 documents
(since most text-based formats involve keys, tags, etc., that are
single-byte characters in UTF-8). They have the same multi-unit
encoding issues as you get with UTF-8. If you want "one unit is one
code", you need UTF-32, which is fine for internal program usage.
UTF-16 has endianness issues. There really is no benefit in UTF-16 over
UTF-8. The world at large knows this - that is why UTF-8 is
overwhelmingly dominant and UTF-16 is used mostly for legacy reasons.
 
But you can call it "taste" if you like.
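 
The size point is easy to check; a sketch comparing one markup-heavy
string in both encodings (C++17 literals, UTF-8 source file assumed):
 
    #include <cstdio>
 
    int main()
    {
        // ASCII tags dominate, so UTF-8 is about half the size here.
        const auto& s8  = u8"<name>Bjørn</name>";   // 19 bytes of text
        const auto& s16 =  u"<name>Bjørn</name>";   // 18 16-bit units
        printf("UTF-8:  %zu bytes\n", sizeof s8  - 1);
        printf("UTF-16: %zu bytes\n", sizeof s16 - sizeof(char16_t));
    }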
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 05:02PM +0200

> you think UTF-16 /is/ Unicode and that I am suggesting something other
> than Unicode.  I hope you can clarify if you are misunderstanding this
> or not.
 
You can't derive that from what I said above.
And it isn't even related to that.
You seem a bit confused.
scott@slp53.sl.home (Scott Lurndal): Aug 18 03:38PM

>decision. For internal processing the representation doesn't matter
>but for compatibility reasons it should match the most common Unicode
>representation.
 
wchar_t in Unix (198x) predates Windows 3.1 (1992) by several years.
 
Windows is the outlier here, not unix.
scott@slp53.sl.home (Scott Lurndal): Aug 18 03:40PM

>> system might support ASCII and Big5, which is a 16-bit encoding - it
>> could therefore have a 16-bit wchar_t.
 
>Unrealistic that it will be used for anything different than Unicode.
 
It (wchar_t) predates unicode by a few years.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 05:44PM +0200

>> Unrealistic that it will be used for anything different than Unicode.
 
> It (wchar_t) predates unicode by a few years.
 
That doesn't change the above.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 05:46PM +0200

>> but for compatibility reasons it should match the most common Unicode
>> representation.
 
> wchar_t in Unix (198x) predates Windows 3.1 (1992) by several years.
 
That doesn't affect the above because in the beginning, UCS-2 was the
standard.
Bo Persson <bo@bo-persson.se>: Aug 18 08:33PM +0200

On 2019-08-18 at 14:07, Bart wrote:
 
>     c = 16960 0x4240
 
> Not this:
 
>     c = 1000000 0xF4240
 
So we can only guess that this character code is not part of the
"supported locale" used in your program.
 
Nowhere does the language standard require C++ to support all available
OS locales. It could be a very small subset.
 
 
 
Bo Persson
Bart <bc@freeuk.com>: Aug 18 08:51PM +0100

On 18/08/2019 19:33, Bo Persson wrote:
> "supported locale" used in your program.
 
> Nowhere does the language standard require C++ to support all available
> OS locales. It could be a very small subset.
 
A small subset where none of the locale's alphabets has character codes
above 65535.
 
Anyway, the locale need have nothing to do with the ability to read,
write or process arbitrary Unicode characters and strings.
 
Otherwise countries such as the UK and US could get by with just ASCII.
 
But if they are going beyond a 7-bit character, then why not beyond
16-bit too? Actually, the first few alphabets beyond codepoint 65535
appear to be ancient scripts with no meaningful locale anyway.
ram@zedat.fu-berlin.de (Stefan Ram): Aug 18 06:03PM

>std::wstring gLocalDataDir();
 
Some years ago I thought that it would be a helpful first
step to define what types of standard directories could
exist. So I wrote
 
http://www.purl.org/stefan_ram/pub/the_portadir_specification
 
. It can be used as a vocabulary for documentation to refer to.
Tim Rentsch <tr.17687@z991.linuxsc.com>: Aug 18 05:22AM -0700


> On Sunday, 18 August 2019 08:02:30 UTC+3, Tim Rentsch wrote:
 
>> Tiib <ootiib@hot.ee> writes:
[...]
>> trying to say more would for me be counter-productive.
 
> I have dropped viewing discussions as some kind of combat long
> ago. [...]
 
I didn't say combative, I said argumentative. They aren't the
same thing.
Tim Rentsch <tr.17687@z991.linuxsc.com>: Aug 18 05:18AM -0700

>> http://coliru.stacked-crooked.com/a/aa95b9b7035de3c3
 
> Ok. I think I got it ... you meant it is not syntax error but
> semantic error.
 
Right.
