- Config of a library with modules? - 10 Updates
- OT: Github - 1 Update
- Config of a library with modules? - 1 Update
- [Announcement] neos -- Coming Soon! - 1 Update
- Compile time virtual machine/stack machine - 1 Update
Ralf Goertz <me@myprovider.invalid>: Feb 08 05:28PM +0100

Am Fri, 8 Feb 2019 13:45:39 +0100

> It's not unnecessary if the intent is to further edit the file,
> because then it says what encoding should better be used with this
> file.

But there is still no need for that BOM if you save it as UTF-8. That's
the whole point.

> Otherwise I'd just save as pure ASCII.
> Done.

Which is a UTF-8 file that doesn't contain non-ASCII characters.

> > too.
> Evidently vim doesn't have to relate to many Windows ANSI encoded
> files, where all byte sequences are valid.

And somehow it still manages to correctly detect both types of encoding
in most cases. If I save a file containing the line "My name is not
spelled Görtz but Goertz" after having ":set fileencoding=latin1" and
reopen it, vim tells me '"file" [converted]' because it detected the
encoding and converted it to its native encoding (but it saves the file
using the original encoding if I don't interfere, emitting a warning if
that is impossible).

> for tools such as compilers.
> For an editor that loads the whole file anyway, and also has an
> interactive user in front that can guide it, maybe it doesn't matter.

As I said, I usually don't have to specify anything, vim does it
automagically.

> I guess the argument, that you've picked up from somebody else, is
> that it's plain impossible to make a corresponding text concatenation
> tool.

I think I made that argument before in a discussion with you. But still
I never said it was originally mine. So even if I picked it up from
somebody else that doesn't make it invalid. And by the way, "binary"
`cat` /is/ `cat`. One of the other nice things you don't have to care
about under *nix.

> > above mentioned case of an ASCII only UTF-8 file).
> Not having the BOMs for files intended to be used with Windows tools,
> causes problems of correctness.

Yeah, but that's the fault of (the) Windows (tools) IMHO.
Robert Wessel <robertwessel2@yahoo.com>: Feb 08 10:31AM -0600

>the huge legacy they have, both with existing code base all around the
>world, and existing executables that are tied to their existing
>standard, so they're stuck with UTF-16 (or UCS-2, whatever it is).

In MS's defense, they decided on UCS-2 for Windows (shipped 1993) at a
time when Unicode was explicitly* defined as a 16-bit code. At the time
the variable-length encodings weren't really a thing yet (although Plan
9's development of UTF-8 would have overlapped at least the end of the
initial NT development). And arguably a "simple" 16-bit code was a
reasonable choice; sure, it carried a bit of a size penalty, but just
for text.

Unicode was extended in 1996 to be a (sort-of) 32-bit code, but that
didn't get all that much traction until about 2006, when the Chinese
government started requiring some levels of character set support for
software sold there. At that point MS switched from UCS-2 to UTF-16, as
the least disruptive change. MS was already supporting two versions of
(mostly) all APIs (8-bit "ANSI" and 16-bit UCS-2); adding a third (for
UTF-8 - the ANSI encoding did not have room for the surrogate
encodings) would have been a considerable undertaking, although
arguably one they should have taken. And this was at a time when it's
likely that the majority of machines in the world supporting Unicode
were ones running Windows.

*The standard made that statement explicitly
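For illustration (a sketch, not from the original post): code points beyond
U+FFFF cannot be represented in UCS-2 at all, which is exactly what the
switch to UTF-16 addressed; UTF-16 expresses them as a surrogate pair, e.g.:

---------------------------------------------------------------------
#include <cstdint>
#include <cstdio>

// Encode a Unicode code point above U+FFFF as a UTF-16 surrogate pair.
// UCS-2 has no representation for such code points at all.
void to_utf16_surrogates( char32_t code_point, char16_t& high, char16_t& low )
{
    const std::uint32_t v = static_cast<std::uint32_t>( code_point ) - 0x10000;
    high = static_cast<char16_t>( 0xD800 + (v >> 10) );        // upper 10 bits
    low  = static_cast<char16_t>( 0xDC00 + (v & 0x3FF) );      // lower 10 bits
}

int main()
{
    char16_t hi, lo;
    to_utf16_surrogates( U'\U0001F600', hi, lo );               // U+1F600
    std::printf( "U+1F600 -> 0x%04X 0x%04X\n",
        static_cast<unsigned>( hi ), static_cast<unsigned>( lo ) );  // D83D DE00
}
---------------------------------------------------------------------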
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: Feb 08 05:41PM +0100 On 08.02.2019 16:06, Daniel wrote: > On the contrary, it's utterly redundant. Given the first four bytes of > UTF-8, UTF-16(LE), UTF-16(BE), UTF-32(LE), or UTF-32(BE) encoded text, the > encoding can be detected with equal reliability. Apparently you remember that with a binary inversion, the exact opposite of what someone wrote. That happens to me sometimes, but mostly about things I can't reason about, things that are arbitrary facts. I get them mixed up & inverted. Anyway, with a BOM the first four bytes give a good, generally reliable indication. But without a BOM... Well, consider the aforementioned (in my code) bullet point, "•". A consultant accustomed to Powerpoint presentations, might well start a text file with a bullet point. Or two. Or more. It's Unicode code point u+2022. And as UTF-16(BE) it's 0x20 followed by 0x22. Interpreted as ASCII that's a space followed by a double quote. Now you're looking at the first four bytes of the file. They're 0x20, 0x22, 0x20, 0x22. Is it a space, quote, space, quote, in ASCII or some ASCII extension such as UTF-8, Latin-1 or the original IBM PC encoding (Windows cp 437)? Or is it perhaps two bullet points in UTF-16(BE)? Or, just maybe it's twice the mathematical left angle symbol "∠" expressed in UTF-16(LE)? As you can see these byte values leave the specter of possible encodings wide open, except that UTF-32 would have guaranteed a nullbyte. > whether the json data interchange specification should provide a statement > of that algorithm (there are some subtleties), but it was dropped when it > was decided to restrict data interchange to UTF8 only. Good idea. ;-) Cheers! - Alf |
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: Feb 08 05:51PM +0100 On 08.02.2019 17:28, Ralf Goertz wrote: > somebody else that doesn't make it invalid. And by the way "binary" > `cat` /is/ `cat`. One of the other nice things you don't have to care > about under *nix. My point was not that it was invalid because you picked it up from somewhere, but that it's an invalid argument that you didn't originate, i.e. no fault of yours except not thinking deeply about it. Text concatenation and binary concatenation was never the same even for ASCII. Consider a C++ source code text, and C++11 §2.2/1 requoted from an SO posting I hastily googled up: "A source file that is not empty and that does not end in a new-line character, or that ends in a new-line character immediately preceded by a backslash character before any such splicing takes place, shall be processed as if an additional new-line character were appended to the file." As far as the C++11 or later compiler is concerned that file that ends in in a line without a final newline, acts as if there was a newline. Now you use binary `cat` to concatenate this text and some more source code. Oh dang, that placed a preprocessor directive in the middle of a line (namely the last line from the first file). A nice textual concatenation tool would ensure by default that every non-empty file's text was terminated with a newline, or the appropriate system specific end of line specification, so that you could pass the result to a C++ compiler... And by default it would also strip out those pesky zero width spaces. Even in the middle of text. >> Not having the BOMs for files intended to be used with Windows tools, >> causes problems of correctness. > Yeah but that's the fault of (the) Windows (tools) IMHO. :) Cheers! - Alf |
Manfred <noname@add.invalid>: Feb 08 05:51PM +0100

On 2/8/2019 5:31 PM, Robert Wessel wrote:

>> standard, so they're stuck with UTF-16 (or UCS-2, whatever it is).
> In MS's defense, they decided on UCS-2 for Windows (shipped 1993) at a
> time when Unicode was explicitly* defined as a 16-bit code.

[snip further valid points]

> And this was at a time when it's likely that the majority of machines
> in the world supporting Unicode were ones running Windows.

True, and in fact UTF-8 /is/ at least supported by recent versions of
MultiByteToWideChar() and WideCharToMultiByte(), so it is usable in
Windows. The inconvenience is having to code the conversion when
interfacing with nowadays' world out there.
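For reference (a sketch, not part of the post above), the UTF-8 to UTF-16
direction of that conversion looks roughly like this; error handling is
reduced to the bare minimum:

---------------------------------------------------------------------
#include <windows.h>
#include <string>

// Convert UTF-8 to the UTF-16 (wchar_t) encoding that Windows APIs
// expect, via MultiByteToWideChar with the CP_UTF8 code page.
std::wstring widened_from_utf8( const std::string& utf8 )
{
    if( utf8.empty() ) { return L""; }
    const int n = ::MultiByteToWideChar(
        CP_UTF8, 0, utf8.data(), static_cast<int>( utf8.size() ), nullptr, 0 );
    std::wstring result( n, L'\0' );
    ::MultiByteToWideChar(
        CP_UTF8, 0, utf8.data(), static_cast<int>( utf8.size() ), &result[0], n );
    return result;
}
---------------------------------------------------------------------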
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: Feb 08 01:45PM +0100 On 08.02.2019 12:27, Ralf Goertz wrote: > to erase those fancy quotation marks I just used and replace them with > ordinary ones like in "UTF-8". Suddenly, the file is pure ASCII but has > an unnecessary BOM. It's not unnecessary if the intent is to further edit the file, because then it says what encoding should better be used with this file. Otherwise I'd just save as pure ASCII. Done. > If the file contains non-ASCII characters you'll > notice that soon enough. My favourite editor (vim) is very good at > detecting that without the aid of BOMs and I guess others are, too. Evidently vim doesn't have to relate to many Windows ANSI encoded files, where all byte sequences are valid. It's possible to apply statistical measures over large stretches of text, but these are necessarily grossly inefficient compared to just checking three bytes, and that efficiency versus inefficiency counts for tools such as compilers. For an editor that loads the whole file anyway, and also has an interactive user in front that can guide it, maybe it doesn't matter. > concatenate two files with "cat file1 file2 >ouftile". Then you end up > with a BOM in the middle of a file which doesn't conform to the > standard AFAIK. Binary `cat` is a nice tool when it's not misapplied. I guess the argument, that you've picked up from somebody else, is that it's plain impossible to make a corresponding text concatenation tool. >> intent. > But they increase the file size which can cause problems (in the above > mentioned case of an ASCII only UTF-8 file). Not having the BOMs for files intended to be used with Windows tools, causes problems of correctness. In the above mentioned case the "problem" of /not forgetting the encoding/ sounds to me like turning black to white and vice versa. I'd rather /not/ throw away the encoding information, and would see the throwing-away, if that were enforced, as a serious problem. > I really don't understand > why UTF-8 has not become standard on Windows even after so many years of > it's existence. As I see it, a war between Microsoft and other platforms, where they try their best to subtly and not-so-subtly sabotage each other. Microsoft does things like not supporting UTF-8 in Windows consoles (input doesn't work at all for non-ASCII characters), and not supporting UTF-8 locales in Windows, hiding the UTF-8 sans BOM encoding far down in a very long list of useless encodings in the VS editor's GUI for encoding choice, letting it save with system-dependent Windows ANSI encoding by default, and even (Odin save us!) using that as the default basic execution character set in Visual C++ -- a /system dependent/ encoding as basic execution character set. *nix-world folks do things such as restricting the JSON format, in newer version of its RFC, to UTF without BOM, permitting a BOM to be treated as an error. Very political, as I see it. Not engineering. Cheers!, - Alf |
jameskuyper@alumni.caltech.edu: Feb 08 09:16AM -0800

On Friday, February 8, 2019 at 11:51:42 AM UTC-5, Alf P. Steinbach wrote:
...
> character, or that ends in a new-line character immediately preceded by
> a backslash character before any such splicing takes place, shall be
> processed as if an additional new-line character were appended to the file."

That's 5.2p2. It comes as rather a shock to me, since I'm used to the C
standard's specification: "A source file that is not empty shall end in
a new-line character, which shall not be immediately preceded by a
backslash character before any such splicing takes place." (5.1.1.2p2).

Do you know in what version of C++ it first specified this processing?
Robert Wessel <robertwessel2@yahoo.com>: Feb 08 11:29AM -0600

On Fri, 8 Feb 2019 09:16:17 -0800 (PST),

>a new-line character, which shall not be immediately preceded by a
>backslash character before any such splicing takes place." (5.1.1.2p2).
>Do you know in what version of C++ it first specified this processing?

It was "If a source file that is not empty does not end in a newline
character, or ends in a newline character immediately preceded by a
backslash character, the behavior is undefined." in the original
version of C++98. So either C++11, or (unlikely, IMO) in one of the TCs
to C++98.
Daniel <danielaparker@gmail.com>: Feb 08 09:59AM -0800

On Friday, February 8, 2019 at 11:41:33 AM UTC-5, Alf P. Steinbach wrote:

> of what someone wrote.
> That happens to me sometimes, but mostly about things I can't reason
> about, things that are arbitrary facts. I get them mixed up & inverted.

Alf, you're remarkably restrained :-)

I was thinking only of json, where a detection mechanism is possible
because the first character is always US ASCII, and then it's just a
matter of detecting zeros in the first octets to disambiguate. But you
are of course correct in the general case.

Daniel
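The scheme being referred to (spelled out in the old RFC 4627, not quoted
in the post) relies on the first two characters of a JSON text being ASCII,
so the pattern of zero octets among the first four bytes identifies the UTF
flavour. A sketch, assuming at least four octets are available:

---------------------------------------------------------------------
// Guess the Unicode encoding of a JSON text from the zero-octet
// pattern in its first four bytes (RFC 4627 style detection; newer
// RFCs simply require UTF-8).
const char* guessed_json_encoding( const unsigned char* p )
{
    const bool z0 = (p[0] == 0), z1 = (p[1] == 0), z2 = (p[2] == 0), z3 = (p[3] == 0);
    if(  z0 &&  z1 &&  z2 && !z3 )  return "UTF-32BE";      // 00 00 00 xx
    if(  z0 && !z1 &&  z2 && !z3 )  return "UTF-16BE";      // 00 xx 00 xx
    if( !z0 &&  z1 &&  z2 &&  z3 )  return "UTF-32LE";      // xx 00 00 00
    if( !z0 &&  z1 && !z2 &&  z3 )  return "UTF-16LE";      // xx 00 xx 00
    return "UTF-8";                                         // xx xx xx xx
}
---------------------------------------------------------------------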
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: Feb 08 09:38PM +0100 On 08.02.2019 08:01, Alf P. Steinbach wrote: > function to client code (or provide a weakly linked default), but could > indicate a direction for a language supported standard solution, maybe? > Anyway it's a Better Way™. :) I landed on a kind of compromise. Compared to the original simple conditional code inclusion this feels like very much overkill. But now it's prepared for modules, like at one time receivers were prepared for stereo broadcasts: only a teeny tiny little adjustment would be necessary when the time came to do it for real. Yes, yes. --------------------------------------------------------------------- #pragma once // Source encoding: UTF-8 with BOM (π is a lowercase Greek "pi"). #include <cppx-core/config.hpp> // cppx::use_ascii_substitutes #include <cppx-core/meta-type/Type_choice_.hpp> // cppx::Type_choice_ namespace cppx { struct Symbol_strings_utf8 { static constexpr auto& left_quote_str = """; static constexpr auto& right_quote_str = """; static constexpr auto& bullet_str = "•"; static constexpr auto& left_arrow_str = "←"; static constexpr auto& right_arrow_str = "→"; }; struct Symbol_strings_ascii { static constexpr auto& left_quote_str = "\""; static constexpr auto& right_quote_str = "\""; static constexpr auto& bullet_str = "*"; static constexpr auto& left_arrow_str = "<-"; static constexpr auto& right_arrow_str = "->"; }; namespace best_effort { using Symbol_strings = Type_choice_< use_ascii_substitutes, Symbol_strings_ascii, Symbol_strings_utf8 >; constexpr auto& left_quote_str = Symbol_strings::left_quote_str; constexpr auto& right_quote_str = Symbol_strings::right_quote_str; constexpr auto& bullet_str = Symbol_strings::bullet_str; constexpr auto& left_arrow_str = Symbol_strings::left_arrow_str; constexpr auto& right_arrow_str = Symbol_strings::right_arrow_str; } // namespace best_effort } // namespace cppx --------------------------------------------------------------------- Cheers!, - Alf |
woodbrian77@gmail.com: Feb 08 11:52AM -0800

On Friday, February 8, 2019 at 3:50:06 AM UTC-6, Öö Tiib wrote:

> > going to keep both.
> Viable products that compile with less than half an hour
> on single core are getting rare.

1. I'm working on a service and try to minimize the amount of code
that has to be downloaded/built/maintained.

2. The closed source part of my work is much larger than the open
source part.

3. If you aren't working on some closed source code, now would be a
good time to start.

> stuff is such and one won't use ninja for those. The unit
> tests typically take somewhat longer to run than compiling.
> Do you have unit tests?

Yes.

> only platforms that will suck belong to Apple and that is
> as expected ... who targets those should have double
> bigger budgets too.

I've only used Cmake and Meson a little, but so far I like Meson more.
I don't know yet if it can produce project files on Windows like Cmake
does.

Brian
Ebenezer Enterprises
http://webEbenezer.net
ram@zedat.fu-berlin.de (Stefan Ram): Feb 08 05:48PM

>Do you know in what version of C++ it first specified this processing?

It seems to be C++11. N3035 of 2010-02-16 still has:

|If a source file that is not empty does not end in a new-line
|character, or ends in a new-line character immediately
|preceded by a backslash character before any such splicing
|takes place, the behavior is undefined.

N3092 of 2010-03-26 then has:

|A source file that is not empty and that does not end in a
|new-line character, or that ends in a new-line character
|immediately preceded by a backslash character before any such
|splicing takes place, shall be processed as if an additional
|new-line character were appended to the file.

.
Mr Flibble <flibbleREMOVETHISBIT@i42.co.uk>: Feb 08 05:40PM

Hi!

Announcing "neos", a cross-platform language agnostic scripting engine
with a WASM JIT!

* Language agnostic: a schema describes the scripting language to use
  (theoretically allowing any language to be used).
* RJSON (Relaxed JSON) language schema file format.
* Invent your own scripting language to use with "neos" by writing a
  new language schema!

Available on GitHub: https://github.com/i42output/neos

Coming soon!

/Flibble

--
"You won't burn in hell. But be nice anyway." – Ricky Gervais

"I see Atheists are fighting and killing each other again, over who
doesn't believe in any God the most. Oh, no..wait.. that never
happens." – Ricky Gervais

"Suppose it's all true, and you walk up to the pearly gates, and are
confronted by God," Bryne asked on his show The Meaning of Life. "What
will Stephen Fry say to him, her, or it?"

"I'd say, bone cancer in children? What's that about? How dare you? How
dare you create a world to which there is such misery that is not our
fault. It's not right, it's utterly, utterly evil." Fry replied.

"Why should I respect a capricious, mean-minded, stupid God who creates
a world that is so full of injustice and pain. That's what I would say."
bitrex <user@example.net>: Feb 08 11:26AM -0500

On 02/08/2019 04:19 AM, Öö Tiib wrote:

> https://wg21.cmeerw.net/cwg/msg5556
> With gcc and clang it doesn't work anymore and with msvc it never did.
> Research something else. ;)

I see. Yeah, you would need metaprogramming state to at least construct
the equivalent of a program counter.