- "C++: Size Matters in Platform Compatibility" - 6 Updates
- How to get standard paths?!? - 16 Updates
- How to get standard paths?!? - 1 Update
- Tricky! - 1 Update
- newbie question: exceptions - 1 Update
Anton Shepelev <anton.txt@gmail.com>: Aug 18 03:20PM +0300

Lynn McGuire:

> "C++: Size Matters in Platform Compatibility"
> https://www.codeproject.com/Tips/5164768/Cplusplus-Size-Matters-in-Platform-Compatibility
> Interesting. Especially on storing Unicode as UTF-8.

My reaction to this article is in fir's manner: highly skeptical. The
author says obvious things that are much better discussed in material
on network data transfer, and socket programming in particular. He
adds nothing new and original except wrong conclusions, e.g.:

    time_t is not guaranteed to be interoperable between
    platforms, so it is best to store time as text and
    convert to time_t accordingly.

whereas the natural solution is any fixed-width binary datetime
format. Loth to do calendar arithmetic, I would pass the fields
separately, e.g.:

         Date: 24 bits          Time: 24 bits
    +------------------+----------------------+
    | BC? Year Mon Day | Hours Min Sec 1/128s |
    |  1   14   4   5  |   5    6   6    7    |
    +------------------+----------------------+

The date and time are separable three-byte blocks. Each block is a
packed structure with the fields interpreted as big-endian, according
to the network byte order.

--
()  ascii ribbon campaign -- against html e-mail
/\  http://preview.tinyurl.com/qcy6mjc [archived]
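To make the packing concrete, here is a minimal sketch (mine, not from
the post; the struct and function names are hypothetical) of how the
24-bit date block might be packed into and unpacked from three
network-order bytes:

    #include <cstdint>

    // Layout, most significant bit first: BC? (1 bit), year (14 bits),
    // month (4 bits), day (5 bits) -- 24 bits in three bytes.
    struct Date {
        bool     bc;
        uint16_t year;   // 0..16383
        uint8_t  month;  // 1..12
        uint8_t  day;    // 1..31
    };

    inline void pack_date(const Date& d, uint8_t out[3]) {
        uint32_t bits = (uint32_t(d.bc)            << 23)
                      | (uint32_t(d.year & 0x3FFF) <<  9)
                      | (uint32_t(d.month & 0x0F)  <<  5)
                      |  uint32_t(d.day & 0x1F);
        out[0] = uint8_t(bits >> 16);  // most significant byte first
        out[1] = uint8_t(bits >>  8);
        out[2] = uint8_t(bits);
    }

    inline Date unpack_date(const uint8_t in[3]) {
        uint32_t bits = (uint32_t(in[0]) << 16)
                      | (uint32_t(in[1]) <<  8)
                      |  uint32_t(in[2]);
        return { (bits >> 23) != 0,
                 uint16_t((bits >> 9) & 0x3FFF),
                 uint8_t((bits >> 5) & 0x0F),
                 uint8_t(bits & 0x1F) };
    }

The time block would pack the same way with widths 5, 6, 6 and 7.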
Anton Shepelev <anton.txt@gmail.com>: Aug 18 03:59PM +0300

David Brown:

> However, it doesn't make sense to have pointer types at
> all in a stored file or data interchange format.

He is talking about the use of pointer types as integer indices, as
(he says) WinAPI does with DWORD_PTR.

--
()  ascii ribbon campaign -- against html e-mail
/\  http://preview.tinyurl.com/qcy6mjc [archived]
Ben Bacarisse <ben.usenet@bsb.me.uk>: Aug 18 02:50PM +0100

> The date and time are separable three-byte blocks. Each
> block is a packed structure with the fields interpreted as
> big-endian, according to the network byte order.

For the price of two bytes more you can make the ordering explicit
and avoid any shifting and masking. Make everything an octet:

    century, year in century, month, day,
    hour, minute, second, 1/128ths

(Dates BC require the first byte to be signed: 6 BC is year 94 in
century -1.)

You could, of course, use units of 512 years instead, at the expense
of having a slightly less human-debuggable format. You gain just
having to do a shift rather than a multiply by 100.

--
Ben.
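A sketch of my own (names hypothetical) of Ben's one-octet-per-field
layout; because every field is a single byte there is no padding, no
endianness, and no shifting or masking:

    #include <cstdint>

    struct PackedDateTime {
        int8_t  century;  // signed, so 6 BC = year 94 in century -1
        uint8_t year;     // 0..99, year within the century
        uint8_t month;    // 1..12
        uint8_t day;      // 1..31
        uint8_t hour;     // 0..23
        uint8_t minute;   // 0..59
        uint8_t second;   // 0..60 (leaves room for a leap second)
        uint8_t frac128;  // fractional second in 1/128ths
    };
    static_assert(sizeof(PackedDateTime) == 8, "eight octets, no padding");

The full year is then century * 100 + year; Ben's 512-year variant
would replace that multiply with a shift.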
David Brown <david.brown@hesbynett.no>: Aug 18 04:32PM +0200

On 18/08/2019 14:59, Anton Shepelev wrote:

>> all in a stored file or data interchange format.
> He is talking about the use of pointer types as integer
> indices, as (he says) WinAPI does with DWORD_PTR.

It makes sense to use integer indices in a file format - use a
suitable integer size for the job (using fixed-size types, like
uint32_t or int64_t). It doesn't make sense to use pointers, because
they depend on the memory address your data happens to lie at, which
will not be the same when the file or data is read again. So using
some sort of "integer pointer type" (like a pre-C99 uintptr_t
alternative) makes little sense.
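As an illustration (my own sketch, with hypothetical names), the usual
way to serialise a pointer-linked structure is to replace each pointer
with a fixed-width index into the file's record table before writing:

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct Node { Node* next; int64_t payload; };   // in-memory form

    struct NodeRecord {                             // on-disk form
        uint32_t next_index;                        // index, not address
        int64_t  payload;
    };
    constexpr uint32_t kNoNode = UINT32_MAX;        // "null" index

    // Replace each pointer with the index of its target node.
    std::vector<NodeRecord> flatten(const std::vector<Node*>& nodes) {
        std::unordered_map<const Node*, uint32_t> index_of;
        for (uint32_t i = 0; i < nodes.size(); ++i)
            index_of[nodes[i]] = i;
        std::vector<NodeRecord> records;
        records.reserve(nodes.size());
        for (const Node* n : nodes)
            records.push_back({ n->next ? index_of.at(n->next) : kNoNode,
                                n->payload });
        return records;
    }

The indices survive being written to disk and read back into a process
with a completely different address space; the pointers would not.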
Geoff <geoff@invalid.invalid>: Aug 18 01:10PM -0700

On Sun, 18 Aug 2019 12:32:18 +0200, David Brown wrote:

>> will only last another 2.147 billion years.
> 292 billion years, by my counting. By that time, any issues with
> rollover will be an SEP¹.

Where are you getting that? time_t is the number of seconds since
January 1, 1970, 0:00 UTC, and as a 64-bit integer that's not 292
billion years.
Bart <bc@freeuk.com>: Aug 18 09:15PM +0100

On 18/08/2019 21:10, Geoff wrote:

> Where are you getting that? time_t is the number of seconds since
> January 1, 1970, 0:00 UTC, and as a 64-bit integer that's not 292
> billion years.

No, it's 584 billion years. Presumably 292 billion was due to using
it as a signed value (2**63 rather than 2**64 seconds).
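The arithmetic checks out; a quick sketch (mine, not from the thread):

    #include <cstdio>

    int main() {
        const double seconds_per_year = 365.2425 * 24 * 60 * 60;  // ~31.6 million
        const double range_signed     = 9223372036854775807.0;    // 2**63 - 1
        const double range_unsigned   = 18446744073709551615.0;   // 2**64 - 1
        std::printf("signed:   %.0f billion years\n",
                    range_signed / seconds_per_year / 1e9);       // ~292
        std::printf("unsigned: %.0f billion years\n",
                    range_unsigned / seconds_per_year / 1e9);     // ~584
    }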
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 12:59PM +0200

> work in practice but is an abuse of wchar_t that goes against the
> standards. ("char16_t" is the type you want here, which is distinct
> from wchar_t despite being the same size on Windows.)

There's no violation of the standard. wchar_t is
implementation-defined.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 01:05PM +0200

> written primarily in the BMP code plane characters between 0x0800 and
> 0xffff (i.e., non-European scripts, not including CJK) as it is more
> efficient than UTF-8 or UTF-32.

There's nothing about that in the Unicode standard. The Unicode
standard simply says that UTF-16 can cover every Unicode code point
via surrogate pairs and that UTF-16 is _optimized_ for the BMP, but
not only suitable for it. It says that it is preferred if you need a
balance of storage size and access speed.

> While UTF-16 is still used internally on Windows, and
> with some languages (like Java) and libraries (like QT), it is almost
> completely negligible as an encoding in documents or data interchange.

Those might be practical reasons about which one can argue at length,
but that doesn't change the fact that UTF-16 is simply
Unicode-compliant.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 01:07PM +0200

> But there is a mandatory relationship between the character set
> supported by the target system, and wchar_t.

No, the implementor may implement whatever he likes, since wchar_t
need not conform to the target system, only to all locales supported
by the implementation.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 01:13PM +0200

Am 18.08.2019 um 13:07 schrieb Bonita Montero:

> No, the implementor may implement whatever he likes, since wchar_t
> need not conform to the target system, only to all locales supported
> by the implementation.

The standard says:

    "Type wchar_t is a distinct type whose values can represent
    distinct codes for all members of the largest extended character
    set specified among the supported locales."
                          ^^^^^^^^^^^^^^^^^^^^^

So where's the non-conformance?
Bart <bc@freeuk.com>: Aug 18 01:07PM +0100

On 18/08/2019 11:57, Bonita Montero wrote:

>>> Unicode is defined to have a maximum of 21 bit codepoints.
>> Yes. And how do you fit those 21 bits into a 16-bit wchar_t ?
> That's not necessary if you use UTF-16.

Show me how to make use of UTF-16 here, because it doesn't seem to
work on Windows:

    #include <cstdio>

    int main(void) {
        wchar_t c;
        c = 1000000;   // character code 1000000 (0xF4240)
        printf("c = %d 0x%X", c, c);
    }

Displays:

    c = 16960 0x4240

Not this:

    c = 1000000 0xF4240
David Brown <david.brown@hesbynett.no>: Aug 18 02:48PM +0200

On 18/08/2019 13:05, Bonita Montero wrote:

> surrogate pairs and that UTF-16 is _optimized_ for the BMP, but not
> only suitable for it. It says that it is preferred if you need a
> balance of storage size and access speed.

UTF-16 works fine, but is a poor solution in almost every case. As I
said previously, it combines the disadvantages of UTF-8 with the
disadvantages of UTF-32, giving the benefits of neither. People know
this - that is why it is not used to any significant degree except in
places where it is hard to avoid (Windows programming), or for
dealing with legacy code.

There was a time, long ago, when Unicode was young, when 16 bits were
enough for almost everything, when saving a few bytes was important,
when HTML and XML were not common, and when transparent compression
was rare. In those days, UTF-16 made sense for some uses. Given a
free choice without worrying about compatibility with old code or
interfaces, no one would recommend or choose UTF-16 now.

>> negligible as an encoding in documents or data interchange.
> Those might be practical reasons about which one can argue at
> length, but that doesn't change the fact that UTF-16 is simply
> Unicode-compliant.

No one has doubted that UTF-16 is a valid Unicode encoding. That has
never been in question. As seems to happen so often, you have totally
misunderstood the issue when you have been shown wrong. I wonder if
it is intentional.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 03:49PM +0200

> c = 16960 0x4240
> Not this:
> c = 1000000 0xF4240

wchar_t need not be Unicode-compliant. But you can store UTF-16
strings in std::wstring.
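A sketch of mine (not from the post) of what that means in practice:
with a 16-bit wchar_t as on Windows, Bart's code point 1000000
(U+F4240) cannot live in a single wchar_t, but it can be stored in a
std::wstring as a UTF-16 surrogate pair:

    #include <cstdio>
    #include <string>

    int main() {
        char32_t cp = 0xF4240;                       // 1000000 decimal
        char32_t v  = cp - 0x10000;                  // 20-bit value
        wchar_t hi  = wchar_t(0xD800 + (v >> 10));   // high surrogate
        wchar_t lo  = wchar_t(0xDC00 + (v & 0x3FF)); // low surrogate
        std::wstring s{ hi, lo };  // two code units, one character
        std::printf("0x%X 0x%X\n", unsigned(s[0]), unsigned(s[1]));
        // prints: 0xDB90 0xDE40
    }

The cost, of course, is that s.size() is 2 for one character, so code
indexing the string must know about surrogates - which is Bart's and
David's complaint.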
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 03:51PM +0200

> UTF-16 works fine, but is a poor solution in almost every case.
> As I said previously, it combines the disadvantages of UTF-8
> with the disadvantages of UTF-32, giving the benefits of neither.

That's just your taste. But someone else may argue, like the Unicode
standard, that UTF-16 is a good balance between size and performance.
David Brown <david.brown@hesbynett.no>: Aug 18 04:43PM +0200

On 18/08/2019 15:51, Bonita Montero wrote:

>> with the disadvantages of UTF-32, giving the benefits of neither.
> That's just your taste. But someone else may argue, like the Unicode
> standard, that UTF-16 is a good balance between size and performance.

Sorry, you /do/ know that UTF-8 and UTF-32 are also encodings for
Unicode, just like UTF-16? It is not clear, but it looks a little
like you think UTF-16 /is/ Unicode and that I am suggesting something
other than Unicode. I hope you can clarify whether you are
misunderstanding this or not.

As for taste, something like 94% of documents on the internet are
encoded with UTF-8. The rest is mostly 8-bit encodings like
ISO-8859-1. UTF-16 documents are almost invariably bigger than UTF-8
documents (since most text-based formats involve keys, tags, etc.,
that are single-byte characters in UTF-8). They have the same
multi-unit encoding issues as you get with UTF-8. If you want "one
unit is one code point", you need UTF-32, which is fine for internal
program usage. UTF-16 has endianness issues.

There really is no benefit in UTF-16 over UTF-8. The world at large
knows this - that is why UTF-8 is overwhelmingly dominant and UTF-16
is used mostly for legacy reasons. But you can call it "taste" if you
like.
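The size point is easy to demonstrate; a small sketch of mine, using
an ASCII-only tag where the UTF-8 bytes are just the ASCII bytes:

    #include <cstdio>
    #include <string>

    int main() {
        std::string    utf8  = "<title>Example</title>";   // ASCII == UTF-8 here
        std::u16string utf16 = u"<title>Example</title>";
        std::printf("UTF-8:  %zu bytes\n", utf8.size());                     // 22
        std::printf("UTF-16: %zu bytes\n", utf16.size() * sizeof(char16_t)); // 44
    }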
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 05:02PM +0200

> you think UTF-16 /is/ Unicode and that I am suggesting something
> other than Unicode. I hope you can clarify whether you are
> misunderstanding this or not.

You can't derive that from what I said above. And it isn't even
related to that. You seem a bit confused.
scott@slp53.sl.home (Scott Lurndal): Aug 18 03:38PM

> decision. For internal processing the representation doesn't matter,
> but for compatibility reasons it should match the most common
> Unicode representation.

wchar_t in Unix (198x) predates Windows 3.1 (1992) by several years.
Windows is the outlier here, not Unix.
scott@slp53.sl.home (Scott Lurndal): Aug 18 03:40PM

>> system might support ASCII and Big5, which is a 16-bit encoding - it
>> could therefore have a 16-bit wchar_t.
> Unrealistic that it will be used for anything different than Unicode.

It (wchar_t) predates Unicode by a few years.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 05:44PM +0200

>> Unrealistic that it will be used for anything different than Unicode.
> It (wchar_t) predates Unicode by a few years.

That doesn't change the above.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 18 05:46PM +0200

>> but for compatibility reasons it should match the most common
>> Unicode representation.
> wchar_t in Unix (198x) predates Windows 3.1 (1992) by several years.

That doesn't affect the above, because in the beginning UCS-2 was the
standard.
Bo Persson <bo@bo-persson.se>: Aug 18 08:33PM +0200

On 2019-08-18 at 14:07, Bart wrote:

> c = 16960 0x4240
> Not this:
> c = 1000000 0xF4240

So we can only guess that this character code is not part of the
"supported locale" used in your program.

Nowhere does the language standard require C++ to support all
available OS locales. It could be a very small subset.

Bo Persson
Bart <bc@freeuk.com>: Aug 18 08:51PM +0100

On 18/08/2019 19:33, Bo Persson wrote:

> "supported locale" used in your program.
> Nowhere does the language standard require C++ to support all
> available OS locales. It could be a very small subset.

A small subset where none of the locales' alphabets has character
codes above 65535.

Anyway, the locale need have nothing to do with the ability to read,
write or process arbitrary Unicode characters and strings. Otherwise
countries such as the UK and US could get by with just ASCII. But if
they are going beyond a 7-bit character set, then why not beyond
16-bit too?

Actually, the first few alphabets beyond code point 65535 appear to
be ancient scripts with no meaningful locale anyway.
ram@zedat.fu-berlin.de (Stefan Ram): Aug 18 06:03PM

> std::wstring gLocalDataDir();

Some years ago I thought that a helpful first step would be to define
what types of standard directories there could be. So I wrote

    http://www.purl.org/stefan_ram/pub/the_portadir_specification

It can be used as a vocabulary for documentation to refer to.
Tim Rentsch <tr.17687@z991.linuxsc.com>: Aug 18 05:22AM -0700

> On Sunday, 18 August 2019 08:02:30 UTC+3, Tim Rentsch wrote:
>> Tiib <ootiib@hot.ee> writes: [...]
>> trying to say more would for me be counter-productive.
> I have dropped viewing discussions as some kind of combat long
> ago. [...]

I didn't say combative, I said argumentative. They aren't the same
thing.
Tim Rentsch <tr.17687@z991.linuxsc.com>: Aug 18 05:18AM -0700

>> http://coliru.stacked-crooked.com/a/aa95b9b7035de3c3
> Ok. I think I got it ... you meant it is not a syntax error but a
> semantic error.

Right.