Friday, February 8, 2019

Digest for comp.lang.c++@googlegroups.com - 14 updates in 5 topics

Ralf Goertz <me@myprovider.invalid>: Feb 08 05:28PM +0100

Am Fri, 8 Feb 2019 13:45:39 +0100
 
> It's not unnecessary if the intent is to further edit the file,
> because then it says what encoding should better be used with this
> file.
 
But there is still no need for that BOM if you save it as UTF-8. That's
the whole point.
 
> Otherwise I'd just save as pure ASCII.
 
> Done.
 
Which is a UTF-8 file that doesn't contain non-ASCII characters.
 
> > too.
 
> Evidently vim doesn't have to relate to many Windows ANSI encoded
> files, where all byte sequences are valid.
 
And somehow it still manages to correctly detect both types of encoding
in most cases. If I save a file containing the line "My name is not
spelled Görtz but Goertz" after having ":set fileencoding=latin1" and
reopen it vim tells me '"file" [converted]' because it detected the
encoding and converted it to its native encoding (but it saves the file
using the original encoding if I don't interfere, emitting a warning if
that is impossible).
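
A rough sketch of why this kind of detection tends to work without a BOM: Latin-1 text with non-ASCII characters is almost never well-formed UTF-8, so a tool can try UTF-8 first and fall back. This is not vim's actual implementation; `is_valid_utf8`, `Guess` and `guess_encoding` are hypothetical names, and C++17 is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Returns true if `bytes` is well-formed UTF-8: no overlong forms, no
// surrogates, nothing above U+10FFFF, no truncated sequences.
inline bool is_valid_utf8(std::string_view bytes)
{
    const auto byte_at = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = byte_at(i);
        std::size_t n_trailing = 0;
        unsigned char lo = 0x80, hi = 0xBF;     // allowed range of the 2nd byte
        if (lead <= 0x7F)                      { n_trailing = 0; }
        else if (lead >= 0xC2 && lead <= 0xDF) { n_trailing = 1; }
        else if (lead == 0xE0)                 { n_trailing = 2; lo = 0xA0; }
        else if (lead == 0xED)                 { n_trailing = 2; hi = 0x9F; }
        else if (lead >= 0xE1 && lead <= 0xEF) { n_trailing = 2; }
        else if (lead == 0xF0)                 { n_trailing = 3; lo = 0x90; }
        else if (lead >= 0xF1 && lead <= 0xF3) { n_trailing = 3; }
        else if (lead == 0xF4)                 { n_trailing = 3; hi = 0x8F; }
        else                                   { return false; }  // invalid lead byte

        if (bytes.size() - i <= n_trailing) { return false; }     // truncated sequence
        for (std::size_t k = 1; k <= n_trailing; ++k) {
            const unsigned char b = byte_at(i + k);
            const unsigned char lowest  = (k == 1 ? lo : 0x80);
            const unsigned char highest = (k == 1 ? hi : 0xBF);
            if (b < lowest || b > highest) { return false; }
        }
        i += n_trailing + 1;
    }
    return true;
}

enum class Guess { utf8, latin1 };

// Assumption of this sketch: anything that isn't valid UTF-8 is Latin-1.
inline Guess guess_encoding(std::string_view bytes)
{
    return is_valid_utf8(bytes) ? Guess::utf8 : Guess::latin1;
}
```

With the name from the example sentence, the Latin-1 bytes `G F6 r t z` fail the check because 0xF6 can never start a UTF-8 sequence, while the UTF-8 bytes `G C3 B6 r t z` pass.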
 
> for tools such as compilers.
 
> For an editor that loads the whole file anyway, and also has an
> interactive user in front that can guide it, maybe it doesn't matter.
 
As I said I usually don't have to specify anything, vim does it
automagically.
 
 
> I guess the argument, that you've picked up from somebody else, is
> that it's plain impossible to make a corresponding text concatenation
> tool.
 
I think I made that argument before in a discussion with you. But still
I never said it was originally mine. So even if I picked it up from
somebody else that doesn't make it invalid. And by the way "binary"
`cat` /is/ `cat`. One of the other nice things you don't have to care
about under *nix.

> > above mentioned case of an ASCII only UTF-8 file).
 
> Not having the BOMs for files intended to be used with Windows tools,
> causes problems of correctness.
 
Yeah but that's the fault of (the) Windows (tools) IMHO.
Robert Wessel <robertwessel2@yahoo.com>: Feb 08 10:31AM -0600

>the huge legacy they have, both with existing code base all around the
>world, and existing executables that are tied to their existing
>standard, so they're stuck with UTF-16 (or UCS-2, whatever it is).
 
 
In MS's defense, they decided on UCS-2 for Windows NT (shipped 1993) at a
time when Unicode was explicitly* defined as a 16-bit code.
 
At the time the variable length encodings weren't really a thing yet
(although Plan 9's development of UTF-8 would have overlapped at least
the end of the initial NT development). And arguably a "simple"
16-bit code was a reasonable choice: sure, it carried a bit of a size
penalty, but just for text.
 
Unicode was extended in 1996 to be a (sort-of) 32-bit code, but that
didn't get all that much traction until about 2006, when the Chinese
government started requiring some levels of character set support for
software sold there.
 
At that point MS switched from UCS-2 to UTF-16, as the least
disruptive change.
 
MS was already supporting two versions of (mostly) all APIs (8-bit
"ANSI" and 16-bit UCS-2). Adding a third (for UTF-8, since the ANSI
encodings did not have room for the surrogate encodings) would have
been a considerable undertaking, although arguably one they should
have taken.
 
And this was at a time when it's likely that the majority of machines
in the world supporting Unicode were ones running Windows.
 
 
*The standard made that statement explicitly
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: Feb 08 05:41PM +0100

On 08.02.2019 16:06, Daniel wrote:
 
> On the contrary, it's utterly redundant. Given the first four bytes of
> UTF-8, UTF-16(LE), UTF-16(BE), UTF-32(LE), or UTF-32(BE) encoded text, the
> encoding can be detected with equal reliability.
 
Apparently you remember that with a binary inversion, the exact opposite
of what someone wrote.
 
That happens to me sometimes, but mostly about things I can't reason
about, things that are arbitrary facts. I get them mixed up & inverted.
 
Anyway, with a BOM the first four bytes give a good, generally reliable
indication. But without a BOM... Well, consider the aforementioned (in
my code) bullet point, "•".
 
A consultant accustomed to Powerpoint presentations, might well start a
text file with a bullet point. Or two. Or more.
 
It's Unicode code point U+2022. And as UTF-16(BE) it's 0x20 followed by
0x22. Interpreted as ASCII that's a space followed by a double quote.
 
Now you're looking at the first four bytes of the file. They're 0x20,
0x22, 0x20, 0x22.
 
Is it a space, quote, space, quote, in ASCII or some ASCII extension
such as UTF-8, Latin-1 or the original IBM PC encoding (Windows cp 437)?
Or is it perhaps two bullet points in UTF-16(BE)? Or, just maybe it's
twice the mathematical left angle symbol "∠" expressed in UTF-16(LE)?
 
As you can see, these byte values leave the spectrum of possible encodings
wide open, except that UTF-32 would have guaranteed a null byte.
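
For comparison, checking for a BOM is a handful of prefix comparisons. A minimal sketch, assuming C++17; `Bom` and `sniff_bom` are hypothetical names, not from any library mentioned in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

enum class Bom { none, utf8, utf16_le, utf16_be, utf32_le, utf32_be };

// Classifies the byte order mark, if any, at the start of `bytes`.
// The UTF-32 (LE) check must come before the UTF-16 (LE) check because
// FF FE 00 00 begins with the UTF-16 (LE) mark FF FE.
inline Bom sniff_bom(std::string_view bytes)
{
    const auto starts_with = [&](std::string_view mark) {
        return bytes.substr(0, mark.size()) == mark;
    };
    using namespace std::string_view_literals;  // `sv` keeps embedded null bytes
    if (starts_with("\xEF\xBB\xBF"sv))     { return Bom::utf8; }
    if (starts_with("\xFF\xFE\x00\x00"sv)) { return Bom::utf32_le; }
    if (starts_with("\x00\x00\xFE\xFF"sv)) { return Bom::utf32_be; }
    if (starts_with("\xFF\xFE"sv))         { return Bom::utf16_le; }
    if (starts_with("\xFE\xFF"sv))         { return Bom::utf16_be; }
    return Bom::none;
}
```

For the four bytes 0x20, 0x22, 0x20, 0x22 discussed above this returns `Bom::none`, which is exactly where the guessing would have to start.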
 
 
> whether the json data interchange specification should provide a statement
> of that algorithm (there are some subtleties), but it was dropped when it
> was decided to restrict data interchange to UTF8 only.
 
Good idea. ;-)
 
 
Cheers!
 
- Alf
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: Feb 08 05:51PM +0100

On 08.02.2019 17:28, Ralf Goertz wrote:
> somebody else that doesn't make it invalid. And by the way "binary"
> `cat` /is/ `cat`. One of the other nice things you don't have to care
> about under *nix.
 
My point was not that it was invalid because you picked it up from
somewhere, but that it's an invalid argument that you didn't originate,
i.e. no fault of yours except not thinking deeply about it.
 
Text concatenation and binary concatenation were never the same, even for
ASCII.
 
Consider a C++ source code text, and C++11 §2.2/1 requoted from an SO
posting I hastily googled up:
 
"A source file that is not empty and that does not end in a new-line
character, or that ends in a new-line character immediately preceded by
a backslash character before any such splicing takes place, shall be
processed as if an additional new-line character were appended to the file."
 
As far as a C++11 or later compiler is concerned, a file that ends
in a line without a final newline acts as if there were a newline.
Now you use binary `cat` to concatenate this text and some more source
code. Oh dang, that placed a preprocessor directive in the middle of a
line (namely the last line from the first file).
 
A nice textual concatenation tool would ensure by default that every
non-empty file's text was terminated with a newline, or the appropriate
system specific end of line specification, so that you could pass the
result to a C++ compiler... And by default it would also strip out those
pesky zero width spaces. Even in the middle of text.
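
A minimal sketch of such a text concatenation, assuming C++17 and UTF-8 input. `text_concat` is a hypothetical name; the sketch only handles `'\n'` line ends and only strips a BOM at the start of the second and later parts, not zero width spaces in the middle of text.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <vector>

// Concatenates texts so that every non-empty part ends in '\n'. A UTF-8
// BOM (EF BB BF) is kept only if it starts the first part; it is dropped
// from the start of every later part.
inline std::string text_concat(const std::vector<std::string_view>& parts)
{
    constexpr std::string_view utf8_bom = "\xEF\xBB\xBF";
    std::string result;
    bool is_first = true;
    for (std::string_view part : parts) {
        if (!is_first && part.substr(0, utf8_bom.size()) == utf8_bom) {
            part.remove_prefix(utf8_bom.size());
        }
        is_first = false;
        result.append(part.data(), part.size());
        if (!part.empty() && part.back() != '\n') {
            result += '\n';     // the newline the compiler would have assumed
        }
    }
    return result;
}
```

With this, a first file ending in `int x;` followed by a file starting with `#include <vector>` yields two lines, so the directive no longer lands in the middle of a line.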
 
 
 
>> Not having the BOMs for files intended to be used with Windows tools,
>> causes problems of correctness.
 
> Yeah but that's the fault of (the) Windows (tools) IMHO.
 
:)
 
 
Cheers!
 
- Alf
Manfred <noname@add.invalid>: Feb 08 05:51PM +0100

On 2/8/2019 5:31 PM, Robert Wessel wrote:
>> standard, so they're stuck with UTF-16 (or UCS-2, whatever it is).
 
> In MS's defense, they decided on UCS-2 for Windows (shipped 1993) at a
> time when Unicode was explicitly* defined as a 16-bit code.
 
[snip further valid points]
 
> And this was at a time when it's likely that the majority of machines
> in the world supporting Unicode were ones running Windows.
 
True, and in fact UTF-8 /is/ at least supported by recent versions of
MultiByteToWideChar() and WideCharToMultiByte(), so it is usable in
Windows. The inconvenience is having to code the conversion when
interfacing with today's world out there.
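
To give an idea of what that conversion involves, here is a portable, hand-rolled sketch of UTF-8 to UTF-16, assuming C++17. It is a stand-in for illustration only, not the Windows API: on Windows one would call MultiByteToWideChar() with the CP_UTF8 code page instead. `utf16_from_utf8` is a hypothetical name, and the sketch assumes well-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Converts UTF-8 to UTF-16. Assumes `utf8` is well-formed; apart from
// stopping at a truncated final sequence this sketch does no validation,
// which a production conversion must do.
inline std::u16string utf16_from_utf8(std::string_view utf8)
{
    std::u16string result;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const auto lead = static_cast<unsigned char>(utf8[i]);
        // Sequence length implied by the lead byte.
        const std::size_t length =
            lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        if (utf8.size() - i < length) { break; }    // truncated input

        // Payload bits of the lead byte, then 6 bits per continuation byte.
        char32_t code = (length == 1 ? lead : lead & (0xFF >> (length + 1)));
        for (std::size_t k = 1; k < length; ++k) {
            code = (code << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        i += length;

        if (code < 0x10000) {
            result += static_cast<char16_t>(code);
        } else {
            // Beyond the 16-bit range: encode as a surrogate pair.
            const char32_t offset = code - 0x10000;
            result += static_cast<char16_t>(0xD800 + (offset >> 10));
            result += static_cast<char16_t>(0xDC00 + (offset & 0x3FF));
        }
    }
    return result;
}
```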
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: Feb 08 01:45PM +0100

On 08.02.2019 12:27, Ralf Goertz wrote:
> to erase those fancy quotation marks I just used and replace them with
> ordinary ones like in "UTF-8". Suddenly, the file is pure ASCII but has
> an unnecessary BOM.
 
It's not unnecessary if the intent is to further edit the file, because
then it says what encoding should better be used with this file.
 
Otherwise I'd just save as pure ASCII.
 
Done.
 
 
> If the file contains non-ASCII characters you'll
> notice that soon enough. My favourite editor (vim) is very good at
> detecting that without the aid of BOMs and I guess others are, too.
 
Evidently vim doesn't have to relate to many Windows ANSI encoded files,
where all byte sequences are valid.
 
It's possible to apply statistical measures over large stretches of
text, but these are necessarily grossly inefficient compared to just
checking three bytes, and that efficiency versus inefficiency counts for
tools such as compilers.
 
For an editor that loads the whole file anyway, and also has an
interactive user in front that can guide it, maybe it doesn't matter.
 

> concatenate two files with "cat file1 file2 >outfile". Then you end up
> with a BOM in the middle of a file which doesn't conform to the
> standard AFAIK.
 
Binary `cat` is a nice tool when it's not misapplied.
 
I guess the argument, that you've picked up from somebody else, is that
it's plain impossible to make a corresponding text concatenation tool.
 
 
>> intent.
 
> But they increase the file size which can cause problems (in the above
> mentioned case of an ASCII only UTF-8 file).
 
Not having the BOMs for files intended to be used with Windows tools,
causes problems of correctness.
 
In the above mentioned case the "problem" of /not forgetting the
encoding/ sounds to me like turning black to white and vice versa.
 
I'd rather /not/ throw away the encoding information, and would see the
throwing-away, if that were enforced, as a serious problem.
 

> I really don't understand
> why UTF-8 has not become standard on Windows even after so many years of
> its existence.
 
As I see it, a war between Microsoft and other platforms, where they try
their best to subtly and not-so-subtly sabotage each other.
 
Microsoft does things like not supporting UTF-8 in Windows consoles
(input doesn't work at all for non-ASCII characters), and not supporting
UTF-8 locales in Windows, hiding the UTF-8 sans BOM encoding far down in
a very long list of useless encodings in the VS editor's GUI for
encoding choice, letting it save with system-dependent Windows ANSI
encoding by default, and even (Odin save us!) using that as the default
basic execution character set in Visual C++ -- a /system dependent/
encoding as basic execution character set.
 
*nix-world folks do things such as restricting the JSON format, in the newer
version of its RFC, to UTF-8 without BOM, permitting a BOM to be treated
as an error.
 
Very political, as I see it.
 
Not engineering.
 
 
Cheers!,
 
- Alf
jameskuyper@alumni.caltech.edu: Feb 08 09:16AM -0800

On Friday, February 8, 2019 at 11:51:42 AM UTC-5, Alf P. Steinbach wrote:
...
> character, or that ends in a new-line character immediately preceded by
> a backslash character before any such splicing takes place, shall be
> processed as if an additional new-line character were appended to the file."
 
That's 5.2p2. It comes as rather a shock to me, since I'm used to the C
standard's specification: "A source file that is not empty shall end in
a new-line character, which shall not be immediately preceded by a
backslash character before any such splicing takes place." (5.1.1.2p2).
Do you know in what version of C++ it first specified this processing?
Robert Wessel <robertwessel2@yahoo.com>: Feb 08 11:29AM -0600

On Fri, 8 Feb 2019 09:16:17 -0800 (PST),
>a new-line character, which shall not be immediately preceded by a
>backslash character before any such splicing takes place." (5.1.1.2p2).
>Do you know in what version of C++ it first specified this processing?
 
 
It was "If a source file that is not empty does not end in a newline
character, or ends in a newline character immediately preceded by a
backslash character, the behavior is undefined." in the original
version of C++98. So either C++11, or (unlikely, IMO) in one of the
TC to C++98.
Daniel <danielaparker@gmail.com>: Feb 08 09:59AM -0800

On Friday, February 8, 2019 at 11:41:33 AM UTC-5, Alf P. Steinbach wrote:
> of what someone wrote.
 
> That happens to me sometimes, but mostly about things I can't reason
> about, things that are arbitrary facts. I get them mixed up & inverted.
 
Alf, you're remarkably restrained :-) I was thinking only of json,
where a detection mechanism is possible because the first character is
always US ASCII, and then it's just a matter of detecting zeros in the
first octets to disambiguate. But you are of course correct in the general
case.
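
A minimal sketch of that zero-octet test, following the pattern table in RFC 4627 section 3, which relied on the first two characters of a JSON text being ASCII. C++17 is assumed; `Json_encoding` and `detect_json_encoding` are hypothetical names, and treating fewer than four octets as UTF-8 is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

enum class Json_encoding { utf8, utf16_le, utf16_be, utf32_le, utf32_be };

// Looks at which of the first four octets are zero:
//   00 00 00 xx  UTF-32BE      xx 00 00 00  UTF-32LE
//   00 xx 00 xx  UTF-16BE      xx 00 xx 00  UTF-16LE
//   anything else is taken to be UTF-8.
inline Json_encoding detect_json_encoding(std::string_view bytes)
{
    if (bytes.size() < 4) { return Json_encoding::utf8; }
    const bool z0 = (bytes[0] == '\0');
    const bool z1 = (bytes[1] == '\0');
    const bool z2 = (bytes[2] == '\0');
    const bool z3 = (bytes[3] == '\0');
    if ( z0 &&  z1 &&  z2 && !z3) { return Json_encoding::utf32_be; }
    if ( z0 && !z1 &&  z2 && !z3) { return Json_encoding::utf16_be; }
    if (!z0 &&  z1 &&  z2 &&  z3) { return Json_encoding::utf32_le; }
    if (!z0 &&  z1 && !z2 &&  z3) { return Json_encoding::utf16_le; }
    return Json_encoding::utf8;
}
```

This only works because the leading characters are known to be ASCII; for general text, such as a file that starts with bullet points, the same four octets stay ambiguous.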
 
Daniel
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: Feb 08 09:38PM +0100

On 08.02.2019 08:01, Alf P. Steinbach wrote:
> function to client code (or provide a weakly linked default), but could
> indicate a direction for a language supported standard solution, maybe?
 
> Anyway it's a Better Way™. :)
 
I landed on a kind of compromise.
 
Compared to the original simple conditional code inclusion this feels
like very much overkill.
 
But now it's prepared for modules, like at one time receivers were
prepared for stereo broadcasts: only a teeny tiny little adjustment
would be necessary when the time came to do it for real. Yes, yes.
 
 
---------------------------------------------------------------------
#pragma once    // Source encoding: UTF-8 with BOM (π is a lowercase Greek "pi").

#include <cppx-core/config.hpp>                     // cppx::use_ascii_substitutes
#include <cppx-core/meta-type/Type_choice_.hpp>     // cppx::Type_choice_

namespace cppx
{
    // Symbol strings for output that can display UTF-8.
    struct Symbol_strings_utf8
    {
        static constexpr auto& left_quote_str   = "“";
        static constexpr auto& right_quote_str  = "”";
        static constexpr auto& bullet_str       = "•";
        static constexpr auto& left_arrow_str   = "←";
        static constexpr auto& right_arrow_str  = "→";
    };

    // Pure ASCII substitutes for the same symbols.
    struct Symbol_strings_ascii
    {
        static constexpr auto& left_quote_str   = "\"";
        static constexpr auto& right_quote_str  = "\"";
        static constexpr auto& bullet_str       = "*";
        static constexpr auto& left_arrow_str   = "<-";
        static constexpr auto& right_arrow_str  = "->";
    };

    namespace best_effort
    {
        using Symbol_strings = Type_choice_<
            use_ascii_substitutes, Symbol_strings_ascii, Symbol_strings_utf8
            >;

        constexpr auto& left_quote_str  = Symbol_strings::left_quote_str;
        constexpr auto& right_quote_str = Symbol_strings::right_quote_str;
        constexpr auto& bullet_str      = Symbol_strings::bullet_str;
        constexpr auto& left_arrow_str  = Symbol_strings::left_arrow_str;
        constexpr auto& right_arrow_str = Symbol_strings::right_arrow_str;
    }  // namespace best_effort

}  // namespace cppx
 
---------------------------------------------------------------------
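
The two cppx headers aren't shown in this post. Assuming `Type_choice_<condition, A, B>` selects `A` when the condition is true, much like `std::conditional_t`, and that `use_ascii_substitutes` is a compile-time `bool`, the selection can be tried in isolation with this C++17 sketch. Everything in namespace `sketch` is a hypothetical stand-in, not the cppx code.

```cpp
#include <cassert>
#include <string_view>
#include <type_traits>

namespace sketch
{
    // Assumption: stands in for cppx::Type_choice_ from the header not shown.
    template< bool condition, class A, class B >
    using Type_choice_ = std::conditional_t<condition, A, B>;

    // Assumption: stands in for cppx::use_ascii_substitutes; a real build
    // would presumably derive it from configuration.
    constexpr bool use_ascii_substitutes = true;

    struct Symbol_strings_utf8
    {
        static constexpr auto& bullet_str = "\xE2\x80\xA2";     // U+2022 as UTF-8
    };

    struct Symbol_strings_ascii
    {
        static constexpr auto& bullet_str = "*";
    };

    using Symbol_strings = Type_choice_<
        use_ascii_substitutes, Symbol_strings_ascii, Symbol_strings_utf8
        >;

    constexpr auto& bullet_str = Symbol_strings::bullet_str;
}  // namespace sketch
```

With `use_ascii_substitutes` set to `true`, `sketch::bullet_str` refers to the ASCII substitute `"*"`; flipping the constant selects the UTF-8 string without touching client code.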
 
 
Cheers!,
 
- Alf
woodbrian77@gmail.com: Feb 08 11:52AM -0800

On Friday, February 8, 2019 at 3:50:06 AM UTC-6, Öö Tiib wrote:
> > going to keep both.
 
> Viable products that compile with less than half an hour
> on single core are getting rare.
 
1. I'm working on a service and try to minimize the amount
of code that has to be downloaded/built/maintained.
 
2. The closed source part of my work is much larger than the
open source part.
 
3. If you aren't working on some closed source code, now
would be a good time to start.
 
> stuff is such and one won't use ninja for those. The unit
> tests typically take somewhat longer to run than compiling.
> Do you have unit tests?
 
Yes.
 
> only platforms that will suck belong to Apple and that is
> as expected ... who targets those should have double
> bigger budgets too.
 
I've only used CMake and Meson a little, but so far I like
Meson more. I don't know yet if it can produce project files
on Windows like CMake does.
 
 
Brian
Ebenezer Enterprises
http://webEbenezer.net
ram@zedat.fu-berlin.de (Stefan Ram): Feb 08 05:48PM

>Do you know in what version of C++ it first specified this processing?
 
It seems to be C++11.
 
N3035 of 2010-02-16 still has:
 
|If a source file that is not empty does not end in a new-line
|character, or ends in a new-line character immediately
|preceded by a backslash character before any such splicing
|takes place, the behavior is undefined.
 
N3092 of 2010-03-26 then has:
 
|A source file that is not empty and that does not end in a
|new-line character, or that ends in a new-line character
|immediately preceded by a backslash character before any such
|splicing takes place, shall be processed as if an additional
|new-line character were appended to the file.
 
.
Mr Flibble <flibbleREMOVETHISBIT@i42.co.uk>: Feb 08 05:40PM

Hi!
 
Announcing "neos" a cross-platform language agnostic scripting engine with
a WASM JIT!
 
* Language agnostic: a schema describes the scripting language to use
(theoretically allowing any language to be used).
* RJSON (Relaxed JSON) language schema file format.
* Invent your own scripting language to use with "neos" by writing a new
language schema!
 
Available on GitHub: https://github.com/i42output/neos
 
Coming soon!
 
/Flibble
 
--
"You won't burn in hell. But be nice anyway." – Ricky Gervais
 
"I see Atheists are fighting and killing each other again, over who
doesn't believe in any God the most. Oh, no..wait.. that never happens." –
Ricky Gervais
 
"Suppose it's all true, and you walk up to the pearly gates, and are
confronted by God," Byrne asked on his show The Meaning of Life. "What
will Stephen Fry say to him, her, or it?"
"I'd say, bone cancer in children? What's that about?" Fry replied.
"How dare you? How dare you create a world to which there is such misery
that is not our fault. It's not right, it's utterly, utterly evil."
"Why should I respect a capricious, mean-minded, stupid God who creates a
world that is so full of injustice and pain. That's what I would say."
bitrex <user@example.net>: Feb 08 11:26AM -0500

On 02/08/2019 04:19 AM, Öö Tiib wrote:
> https://wg21.cmeerw.net/cwg/msg5556
> With gcc and clang it doesn't work anymore and with msvc it never did.
> Research something else. ;)
 
I see. Yeah you would need metaprogramming state to at least construct
the equivalent of a program counter.