- How to write wide char string literals? - 13 Updates
- good reference on threads - 2 Updates
- dynamic_cast - 1 Update
| Juha Nieminen <nospam@thanks.invalid>: Jul 02 05:26AM > the strings as arrays of char, char16_t, or char32_t, respectively. Such > literals are guaranteed to be in UTF-8, UTF-16, or UTF-32 encoding, > respectively. Not my choice in this case. |
| James Kuyper <jameskuyper@alumni.caltech.edu>: Jul 02 01:44AM -0400 On 7/2/21 1:26 AM, Juha Nieminen wrote: >> literals are guaranteed to be in UTF-8, UTF-16, or UTF-32 encoding, >> respectively. > Not my choice in this case. Recent messages reminded me of Windows' strong incentives to use wchar_t, something I've thankfully had no experience with. As a result, what I'm about to say may be incorrect - but it seems to me that a conforming implementation of C++ targeting Windows should have wchar_t be the same as char16_t, so you should be able to freely use u prefixed string literals and char16_t with code that needs to be portable to Windows, but can also be used on other platforms. If your code didn't need to be portable to other platforms, you could, by definition, rely upon Windows' own guarantees about wchar_t. |
| Bo Persson <bo@bo-persson.se>: Jul 02 08:33AM +0200 On 2021-07-02 at 07:44, James Kuyper wrote: > Windows, but can also be used on other platforms. If your code didn't > need to be portable to other platforms, you could, by definition, rely > upon Windows' own guarantees about wchar_t. The problem is that the language explicitly requires wchar_t to be a distinct type. Once upon a time it was implemented as a typedef, but that option was already removed in C++98. Overload resolution and things... |
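A minimal sketch of the overload-resolution point (the function names here are illustrative, not from the thread): because wchar_t is a distinct type rather than a typedef, a wchar_t overload and a char16_t overload can coexist and are selected by the literal's prefix, even on a platform where both types happen to be 16 bits wide.

    #include <cstdio>

    // wchar_t must be a distinct type, not a typedef for char16_t,
    // so these two overloads do not collide even when both types
    // happen to be 16 bits wide (as on Windows).
    void f(const wchar_t*)  { std::puts("wchar_t overload"); }
    void f(const char16_t*) { std::puts("char16_t overload"); }

    int main() {
        f(L"wide");    // L prefix  -> const wchar_t*  -> first overload
        f(u"utf-16");  // u prefix  -> const char16_t* -> second overload
    }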
| Kli-Kla-Klawitter <kliklaklawitter69@gmail.com>: Jul 02 10:03AM +0200 On 01.07.2021 at 20:21, Keith Thompson wrote: >> v1.01 does honor the BOM. > At the risk of giving the impression I'm taking you seriously, the > oldest version of gcc available from gnu.org is 1.42, released in 1992. No, the oldest gcc version is v0.9, from March 22, 1987. |
| Keith Thompson <Keith.S.Thompson+u@gmail.com>: Jul 02 02:31AM -0700 Ralf Goertz <me@myprovider.invalid> writes: [...] > b.cc:1:5: warning: null character(s) ignored > etc. > How does that qualify as "gcc honoring the BOM"? On my system, gcc doesn't handle UTF-16 at all, with or without a BOM. (I don't know whether there's a way to configure it to do so.) It does handle UTF-8 with or without a BOM. $ file b.cpp b.cpp: C source, UTF-8 Unicode (with BOM) text $ cat b.cpp int main() { } $ hd b.cpp 00000000 ef bb bf 69 6e 74 20 6d 61 69 6e 28 29 20 7b 20 |...int main() { | 00000010 7d 0a |}.| 00000012 $ gcc -c b.cpp $ gcc 9.3.0 on Ubuntu 20.04. (There is, of course, no point in going back to ancient versions of gcc.) -- Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com Working, but not speaking, for Philips void Void(void) { Void(); } /* The recursive call of the void */ |
| MrSpud_oyCn@92wvlb1hltq4dhc.gov.uk: Jul 02 09:38AM On Fri, 02 Jul 2021 02:31:24 -0700 >[...] >On my system, gcc doesn't handle UTF-16 at all, with or without a BOM. >(I don't know whether there's a way to configure it to do so.) Just out of interest, what byte order is the BOM in? Catch 22? |
| Paavo Helde <myfirstname@osa.pri.ee>: Jul 02 01:52PM +0300 >> On my system, gcc doesn't handle UTF-16 at all, with or without a BOM. >> (I don't know whether there's a way to configure it to do so.) > Just out of interest, what byte order is the BOM in? Catch 22? This is probably a troll question, but answering anyway: the BOM marker U+FEFF is in the correct byte order, in both little-endian and big-endian UTF-16. That's how you tell them apart. The trick is that the reversed value U+FFFE is not a valid Unicode character, so there is no possibility of a mixup. Neither byte sequence is valid UTF-8 either, which avoids a mixup with that encoding. |
| Juha Nieminen <nospam@thanks.invalid>: Jul 02 11:52AM > so you should be able to freely use u prefixed > string literals and char16_t with code that needs to be portable to > Windows, but can also be used on other platforms. But doesn't that have the exact same problem as in my original post? In other words, in code like this: const char16_t *str = u"something"; the stuff between the quotation marks in the source code will use whatever encoding (ostensibly but not assuredly UTF-8), which the compiler needs to convert to UTF-16 at compile time for writing it to the output binary. Or is it guaranteed that the characters between u" and " will always be interpreted as UTF-8? |
| MrSpud_zg8yg@nx8574z6ey2rc0y563k5i2.net: Jul 02 12:53PM On Fri, 2 Jul 2021 13:52:53 +0300 >>> (I don't know whether there's a way to configure it to do so.) >> Just out of interest, what byte order is the BOM in? Catch 22? >This is probably a troll question, but answering anyway: the BOM marker No, not a troll. >U+FEFF is in the correct byte order, in both little-endian and >big-endian UTF-16. That's how you tell them apart. So it's just 2 bytes in sequence, not a 16 bit value? |
| "Alf P. Steinbach" <alf.p.steinbach@gmail.com>: Jul 02 03:22PM +0200 >> U+FEFF is in the correct byte order, in both little-endian and >> big-endian UTF-16. That's how you tell them apart. > So its just 2 bytes in sequence, not a 16 bit value? The BOM is a Unicode code point, U+FEFF as Paavo mentioned, originally standing for an invisible zero-width hard space. It's encoded with either little endian UTF-16 (then as two bytes), or as big endian UTF-16 (then as two bytes), or as endianness agnostic UTF-8 (then as three bytes). The encoded BOM yields a reliable encoding indicator, though pedantic people might argue that it's just statistical -- after all, one just might happen to have a Windows 1252 encoded file with three characters at the start with the same byte values as the UTF-8 BOM. In the same vein, one just might happen to have a `.txt` file in Windows with the letters "MZ" at the very start, like MZ, Mishtara Zva'it, is the Military Police Corps of Israel. blah and if you then try to open the file in your default text editor by just typing the file name in old Cmd, those letters will be misinterpreted as the initials of Mark Zbikowski, marking the file as an executable... Since the chance of that happening isn't absolutely 0 one should never use text file names as commands, or the UTF-8 BOM as an encoding marker. - Alf |
| "Alf P. Steinbach" <alf.p.steinbach@gmail.com>: Jul 02 03:37PM +0200 On 2 Jul 2021 07:44, James Kuyper wrote: > Windows, but can also be used on other platforms. If your code didn't > need to be portable to other platforms, you could, by definition, rely > upon Window's own guarantees about wchar_t. I agree, but unfortunately both the C and C++ standards require `wchar_t` to be able represent all code points in the largest supported extended character set, and while that requirement worked nicely with original Unicode, already in 1992 or thereabouts (not quite sure) it was in conflict with extremely firmly established practice in Windows. Bringing the standards in agreement with actual practice should be a goal, but for C++ it's seldom done. It /was/ done, in C++11, for the hopeless idealistic C++98 goal of clean C++-ish `<c...>` headers that didn't pollute the global namespace. But C++17 added stuff to only some `<c...>` headers and not to the corresponding `<... .h>` headers, and that fact was then used quite recently in a proposal to un-deprecate the .h headers but add all new stuff only to `<c...>` headers. Which satisfies the politics but not the in-practice of using code that uses C libraries designed as C++-compatible, without running afoul of qualification issues. And it /was/ done, in C++11, for both `throw` specifications and for the `export` keyword, not to mention the C++11 optional conversion between function pointers and `void*`, in order to support the reality of Posix. But it seems to me, it's very very hard to get such changes through the committee, just as individual people generally don't like to admit that they've been wrong. Instead, C++20 started on a path of /introducing/ more conflicts with reality. In particular for `std::filesystem::path`, where they threw the baby out with the bathwater for purely academic idealism reasons. - Alf |
| Paavo Helde <myfirstname@osa.pri.ee>: Jul 02 05:32PM +0300 >> U+FEFF is in the correct byte order, in both little-endian and >> big-endian UTF-16. That's how you tell them apart. > So it's just 2 bytes in sequence, not a 16 bit value? The file contains bytes; it's up to the reading code how to interpret the bytes. It can interpret the bytes as uint16_t, i.e. cast the file buffer to 'const uint16_t*' and read the first 2-byte value. If it is 0xFEFF, then it knows this is a UTF-16 file in a matching byte order. If it is 0xFFFE, then it knows it's a UTF-16 file in the opposite byte order, and the rest of the file needs to be byte-swapped. It can also interpret the buffer as containing uint8_t bytes, but then the logic is a bit more complex: it must then know if it itself is running on a big-endian or little-endian machine, and behave accordingly. |
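A sketch of the detection logic described above, assuming the file has already been read into a byte buffer; the function and enum names are made up for illustration. It compares raw byte patterns rather than casting to uint16_t, which sidesteps the question of the host machine's own endianness.

    #include <cstddef>

    enum class Encoding { Unknown, Utf8, Utf16LE, Utf16BE };

    // Identify the encoding from the first bytes of the buffer.
    // U+FEFF stored little-endian is FF FE, big-endian is FE FF,
    // and the UTF-8 encoding of U+FEFF is EF BB BF.
    Encoding detect_bom(const unsigned char* p, std::size_t n) {
        if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) return Encoding::Utf8;
        if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return Encoding::Utf16LE;
        if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return Encoding::Utf16BE;
        return Encoding::Unknown;
    }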
| James Kuyper <jameskuyper@alumni.caltech.edu>: Jul 02 04:15PM -0400 On 7/2/21 7:52 AM, Juha Nieminen wrote: >> string literals and char16_t with code that needs to be portable to >> Windows, but can also be used on other platforms. > But doesn't that have the exact same problem as in my original post? No, because the original post used wchar_t and the L prefix, for which the relevant encoding is implementation-defined. That's not the case for char and the u8 prefix, char16_t and the u prefix, or for char32_t and the U prefix. I have a feeling that there's a misunderstanding somewhere in this conversation, but I'm not sure yet what it is. > const char16_t *str = u"something"; > the stuff between the quotation marks in the source code will use > whatever encoding (ostensibly but not assuredly UTF-8), which the I don't understand why you think that the source code encoding matters. The only thing that matters is what characters are encoded. As long as those encodings are for the characters {'u', '"', '\\', 'x', 'C', '2', '\\', 'x', 'A', '9', '"'}, any fully conforming implementation must give you the standard-defined behavior for u"\xC2\xA9". > compiler needs to convert to UTF-16 at compile time for writing > it to the output binary. When using the u8 prefix, UTF-8 encoding is guaranteed, for which every codepoint from U+0000 to U+007F is represented by a single character with a numerical value matching the code point. When using the u prefix, UTF-16 encoding is guaranteed, for which every codepoint from U+0000 to U+D7FF, and from U+E000 to U+FFFF, is represented by a single character with a numerical value matching the codepoint. When using the U prefix, UTF-32 encoding is guaranteed, for which every codepoint from U+0000 to U+D7FF, and from U+E000 to U+10FFFF, is represented by a single character with a numerical value matching the codepoint. Since the meanings of the octal and hexadecimal escape sequences are defined in terms of the numerical values of the corresponding characters, if you use any of those prefixes, specifying the value of a character within the specified range by using an octal or hexadecimal escape sequence is precisely as portable as using the UCN with the same numerical value. Using UCNs would be better because they are less restricted, working just as well with the L prefix and with no prefix. However, within those ranges, octal and hexadecimal escapes will work just as well. Do you know of any implementation of C++ that claims to be fully conforming, for which that is not the case? If so, how do they justify that claim? '\xC2' and '\xA9' are in the ranges for the u and U prefixes. They would not be in the range for the u8 prefix, but since the context was wide characters, the u8 prefix is not relevant. > Or is it guaranteed that the characters between u" and " will > always be interpreted as UTF-8? Source code characters between the u" and the " will be interpreted according to an implementation-defined character encoding. But so long as they encode {'\\', 'x', 'C', '2', '\\', 'x', 'A', '9'}, you should get the standard-defined behavior for u"\xC2\xA9". |
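A small sketch of the guarantee being described: within a u"..." literal, a hexadecimal escape and a UCN with the same numerical value produce the same char16_t code units on any conforming implementation, regardless of the source file's encoding. The \xC2 and \xA9 values are the ones discussed above.

    // Hex escapes specify char16_t code-unit values directly, just like
    // UCNs with the same numerical values.
    constexpr char16_t hex[] = u"\xC2\xA9";
    constexpr char16_t ucn[] = u"\u00C2\u00A9";

    static_assert(sizeof(hex) / sizeof(hex[0]) == 3, "two code units plus the terminator");
    static_assert(hex[0] == 0x00C2 && hex[1] == 0x00A9, "values come straight from the escapes");
    static_assert(hex[0] == ucn[0] && hex[1] == ucn[1], "same result as the equivalent UCNs");

    // Note: u"\xC2\xA9" is two code units (U+00C2, U+00A9), not the single
    // character U+00A9; that one would be written u"\u00A9" or u"\xA9".

    int main() {}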
| Lynn McGuire <lynnmcguire5@gmail.com>: Jul 01 10:27PM -0500 On 6/10/2021 10:08 PM, Chris M. Thomasson wrote: >> Has anyone read this and found it to be a good education on threads ? >> https://www.amazon.com/C-Concurrency-Action-Anthony-Williams/dp/1617294691/ > I used to converse with Anthony way back. He knows his threads! :^) I bought the book. It looks intense. Not a very big font either. Lynn |
| "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Jul 01 09:28PM -0700 On 7/1/2021 8:27 PM, Lynn McGuire wrote: >>> https://www.amazon.com/C-Concurrency-Action-Anthony-Williams/dp/1617294691/ >> I used to converse with Anthony way back. He knows his threads! :^) > I bought the book. It looks intense. Not a very big font either. I have not seen his new book, but I am wondering if he mentions how to use DWCAS in it with pure C++. I know he knows about it: actually, read the whole thread where this message resides: https://groups.google.com/g/lock-free/c/X3fuuXknQF0/m/Ho0H1iJgmrQJ He calls it DCAS in that thread, buts DCAS and DWCAS are different things. So, I am wondering if he mentions it in his 2nd edition. |
| red floyd <no.spam.here@its.invalid>: Jul 01 05:48PM -0700 On 7/1/2021 1:14 PM, Vir Campestris wrote: > with is: This pointer from my database points to an interface. It might > be a type 1 object, in which case do ONE(). It might be a type 2, in > which case do TWO(). Why can't you use a virtual member function for this? Or is it to cast the pointer from the database to one of two different hierarchies? |
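A sketch of the alternative being suggested; the type names and the ONE()/TWO() placeholders are invented to match the description of the problem, not taken from the actual database code.

    struct Interface {
        virtual ~Interface() = default;
        virtual void act() = 0;          // one virtual call replaces the casts
    };

    struct Type1 : Interface {
        void act() override { /* ONE() */ }
    };

    struct Type2 : Interface {
        void act() override { /* TWO() */ }
    };

    void handle(Interface* p) {
        p->act();                        // dispatches to Type1::act or Type2::act
        // The dynamic_cast version being questioned would look like:
        //   if (auto* t = dynamic_cast<Type1*>(p)) { /* ONE() */ }
        //   else if (auto* t = dynamic_cast<Type2*>(p)) { /* TWO() */ }
    }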