Tuesday, December 22, 2020

Digest for comp.lang.c++@googlegroups.com - 24 updates in 5 topics

"Öö Tiib" <ootiib@hot.ee>: Dec 22 05:19AM -0800

On Monday, 21 December 2020 at 21:14:55 UTC+2, Bonita Montero wrote:
> What are the means to convert UTF-8-strings to u16string-s
> with C++20. wstring_convert is deprecated.
 
There are only platform- or library-specific means. The Unicode standard is
supported differently (in a strict sense, incorrectly in various ways) by
platforms and libraries. In such a world it is impossible to support it
"correctly", so the C++ standard does not want to pretend that C++ somehow
does that. There is unicode.org for information about Unicode.
Richard Damon <Richard@Damon-Family.org>: Dec 22 09:05AM -0500

On 12/21/20 2:14 PM, Bonita Montero wrote:
> What are the means to convert UTF-8-strings to u16string-s
> with C++20. wstring_convert is deprecated.
 
Part of the issue is that wstring is not necessarily UTF-16 encoded, so
it might not be the right choice. wstring might not even be 16 bits wide.
 
Actually converting UTF-8 into UTF-16 isn't that hard to do, as it is a
simple matter to extract the next UCS-4 code-point out of a UTF-8 string
(a bit more complicated if you want to do all the suggested error checks
for malformed UTF-8, but still not that hard), and converting the UCS-4
code-point into UTF-16 is even simpler (just check if it is BMP or not
and write the value(s) out).
 
 
Note that technically, you may want to use char16_t based string instead
of wstring, as technically wstring should be based on char32_t, but for
historical reasons it may still be 16 bits on Windows.
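 
A rough sketch of the two steps described above (pull the next code point
out of the UTF-8, then write it out as one or two UTF-16 code units). The
function names are made up for illustration, and the decoder ASSUMES the
input is already well-formed UTF-8; it does none of the error checks
mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Append one code point to a UTF-16 string.
// Precondition: cp is a Unicode scalar value (<= 0x10FFFF, not a surrogate).
inline void append_utf16(std::u16string& out, char32_t cp)
{
    assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));
    if (cp < 0x10000) {
        // BMP: the code point is its own single code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Outside the BMP: split the 20 remaining bits into a surrogate pair.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}

// Hypothetical helper: convert UTF-8 that is assumed to be well formed.
// Malformed input is NOT detected here.
inline std::u16string utf8_to_utf16_unchecked(std::string_view in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte gives the sequence length and the top payload bits.
        if (b0 < 0x80)      { cp = b0;        len = 1; }
        else if (b0 < 0xE0) { cp = b0 & 0x1F; len = 2; }
        else if (b0 < 0xF0) { cp = b0 & 0x0F; len = 3; }
        else                { cp = b0 & 0x07; len = 4; }
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        append_utf16(out, cp);
        i += len;
    }
    return out;
}
```

For example, utf8_to_utf16_unchecked("\xE2\x82\xAC") gives a one-unit string
holding U+20AC, while a four-byte sequence gives a surrogate pair.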
Bonita Montero <Bonita.Montero@gmail.com>: Dec 22 03:12PM +0100

> platforms and libraries. In such a world it is impossible to support it
> "correctly", so the C++ standard does not want to pretend that C++ somehow
> does that. There is unicode.org for information about Unicode.
 
I'm not talking about Unicode but UTF-8.
UTF-8 isn't Unicode but just an encoding.
Bonita Montero <Bonita.Montero@gmail.com>: Dec 22 03:14PM +0100

> Part of the issue is that wstring is not necessarily UTF-16 encoded, so
> it might not be the right choice. wstring might not even be 16 bits wide.
 
There's u16string and u32string which have UTF-16 or UTF-32 encoding.
And I'm not talking about a charset but an encoding, which is independent
of a charset.
 
> for malformed UTF-8, but still not that hard), and converting the UCS-4
> code-point into UTF-16 is even simpler (just check if it is BMP or not
> and write the value(s) out).
 
Nevertheless it would be nice to have this in the standard-library.
Bonita Montero <Bonita.Montero@gmail.com>: Dec 22 03:16PM +0100

On 21.12.2020 at 20:14, Bonita Montero wrote:
Richard Damon <Richard@Damon-Family.org>: Dec 22 09:28AM -0500

On 12/22/20 9:14 AM, Bonita Montero wrote:
>> code-point into UTF-16 is even simpler (just check if it is BMP or not
>> and write the value(s) out).
 
> Nevertheless it would be nice to have this in the standard-library.
 
I think part of the issue is that despite the names, u16string is NOT
required to be UTF-16 encoded, as it is just basic_string<char16_t>, and
that is not required to use UTF-16. (In C there is a define
__STDC_UTF_16__ to indicate that it is, which I don't see in my C++
standard, but C++ still uses words like if the native encoding is UTF-16)
 
It looks like codecvt can be used to make the conversion IF the
implementation uses UTF-16/UTF-32 for char16_t and char32_t.
James Kuyper <jameskuyper@alumni.caltech.edu>: Dec 22 02:19PM -0500

On 12/22/20 9:28 AM, Richard Damon wrote:
...
> I think part of the issue is that despite the names, u16string is NOT
> required to be UTF-16 encoded, as it is just basic_string<char16_t>, and
> that is not required to use UTF-16.
 
Citation, please? A search of every occurrence of "UTF-16" in the C++
standard leaves me with the impression that every function in the C++
standard library that has a specialization for char16_t that interprets
objects of that type is required to interpret them as parts of a UTF-16
string. What did I miss?
 
> (In C there is a define
> __STDC_UTF_16__ to indicate that it is, which I don't see in my C++
> standard, but C++ still uses words like if the native encoding is UTF-16)
 
A search for "native encoding is" doesn't get any hits. The phrase
"native encoding" occurs in only three places in the standard:
 
29.11.7.2.2p1: refers to the native encoding of "ordinary character
strings" and "wide character strings", or char and wchar_t respectively.
It says nothing about char16_t.
D.23p4 talks about u8path(), which converts from utf8 encodings to the
native encoding for filenames.
 
I'm using n4860.pdf, 2020-03-31 as my reference.
Richard Damon <Richard@Damon-Family.org>: Dec 22 02:41PM -0500

On 12/22/20 2:19 PM, James Kuyper wrote:
> D.23p4 talks about u8path(), which converts from utf8 encodings to the
> native encoding for filenames.
 
> I'm using n4860.pdf, 2020-03-31 as my reference.
 
Looking at the change log for C++20, one of the changes is:
 
> guarantee that char16_t and char32_t literals are encoded as UTF-16 and UTF-32 respectively
 
so this is a new requirement of the Standard.
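 
A compile-time check along these lines shows what that guarantee means in
practice. This is only an illustration: the array names are made up, and it
assumes a compiler in C++17 mode or later (for static_assert without a
message) that encodes these literals as UTF-16 and UTF-32.

```cpp
// U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
constexpr char16_t grin16[] = u"\U0001F600";
constexpr char32_t grin32[] = U"\U0001F600";

// Two UTF-16 code units plus the terminating null.
static_assert(sizeof(grin16) / sizeof(grin16[0]) == 3);
static_assert(grin16[0] == 0xD83D && grin16[1] == 0xDE00);

// UTF-32 stores the code point directly in one code unit.
static_assert(grin32[0] == 0x1F600);
```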
"daniel...@gmail.com" <danielaparker@gmail.com>: Dec 22 01:04PM -0800

On Tuesday, December 22, 2020 at 9:28:27 AM UTC-5, Richard Damon wrote:
 
> It looks like codecvt can be used to make the conversion IF the
> implementation uses UTF-16/UTF-32 for char16_t and char32_t.
 
The entire header <codecvt> has been deprecated as of C++17. The
std::codecvt template from <locale> hasn't been deprecated, but
all the standard conversion facets have been.

Daniel
"Öö Tiib" <ootiib@hot.ee>: Dec 22 01:56PM -0800

On Tuesday, 22 December 2020 at 16:06:08 UTC+2, Richard Damon wrote:
> for malformed UTF-8, but still not that hard), and converting the UCS-4
> code-point into UTF-16 is even simpler (just check if it is BMP or not
> and write the value(s) out).
 
Converting UTF-8 into UTF-16 is simple only if it is correct (in some
manner of "correct") UTF-8. What to do when it is incorrect (in some sense
of "incorrect")? Close the application? But it was "only" text, shame on you.
Bonita Montero <Bonita.Montero@gmail.com>: Dec 22 10:58PM +0100

> Converting UTF-8 into UTF-16 is simple only if it is correct (in some
> manner of "correct") UTF-8. What to do when it is incorrect (in some sense
> of "incorrect")? Close the application? But it was "only" text, shame on you.
 
What do you mean by "incorrect"?
Richard Damon <Richard@Damon-Family.org>: Dec 22 05:30PM -0500

On 12/22/20 4:58 PM, Bonita Montero wrote:
>> of "incorrect")? Close the application? But it was "only" text, shame
>> on you.
 
> What do you mean by "incorrect"?
 
A couple of quick things that can make a byte sequence not valid UTF-8:
 
1) The number of bytes the first byte says the code-point will have
doesn't match the number of bytes that it does have.
 
2) The first byte of a sequence isn't a valid first byte of a UTF-8
sequence (for example, it has a value between 0x80 and 0xBF, the values
for subsequent bytes).
 
3) The byte sequence is NOT the minimal length for that value. Some
variants of UTF-8 allow NUL to be encoded as 0xC0 0x80, but allowing
other overlong forms opens up some possible exploits, and the standard
says they should not be allowed.
 
4) A UTF-8 sequence that decodes to a value greater than 0x0010FFFF
should be marked as invalid (and can't be converted to UTF-16).
 
 
There is a code point U+FFFD (Replacement Character) reserved for this
sort of error.
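 
The four checks listed above, with U+FFFD substituted on failure, could be
sketched like this. The function name and the recovery policy (one U+FFFD
per rejected sequence, resuming at the first byte that did not fit) are
assumptions made for the example, not something any standard library
provides. It also rejects encoded surrogates (U+D800..U+DFFF), which are not
in the list above but are likewise not valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

constexpr char32_t kReplacement = 0xFFFD;  // U+FFFD REPLACEMENT CHARACTER

// Hypothetical helper: decode one code point starting at in[pos] and
// advance pos past the bytes consumed. Returns U+FFFD for malformed input.
// Precondition: pos < in.size().
inline char32_t decode_utf8(std::string_view in, std::size_t& pos)
{
    assert(pos < in.size());
    const unsigned char b0 = static_cast<unsigned char>(in[pos++]);
    if (b0 < 0x80) return b0;  // single-byte (ASCII) sequence

    std::size_t extra;  // number of continuation bytes the lead byte promises
    char32_t cp;        // payload bits collected so far
    char32_t min;       // smallest value that needs this sequence length
    if      ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; min = 0x10000; }
    else return kReplacement;  // check 2: not a valid first byte

    for (; extra > 0; --extra) {
        // check 1: fewer continuation bytes than the first byte announced
        if (pos >= in.size()) return kReplacement;
        const unsigned char b = static_cast<unsigned char>(in[pos]);
        if ((b & 0xC0) != 0x80) return kReplacement;
        cp = (cp << 6) | (b & 0x3F);
        ++pos;
    }
    if (cp < min) return kReplacement;       // check 3: overlong encoding
    if (cp > 0x10FFFF) return kReplacement;  // check 4: beyond Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return kReplacement;  // surrogate
    return cp;
}
```

Calling it in a loop while pos < in.size() walks a whole string. Whether
to substitute U+FFFD, stop, or report an error is a policy choice for the
caller.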
"Öö Tiib" <ootiib@hot.ee>: Dec 22 02:33PM -0800

On Tuesday, 22 December 2020 at 23:58:30 UTC+2, Bonita Montero wrote:
> > manner of "correct") UTF-8. What to do when it is incorrect (in some sense
> > of "incorrect")? Close the application? But it was "only" text, shame on you.
> What do you mean by "incorrect"?
 
Only a subset of byte sequences is valid UTF-8 or valid UTF-16. The rest
are invalid. By "incorrect" I meant invalid Unicode that is treated as valid
by one or another library or platform. The details are not hard to find; even
Wikipedia mentions a couple of such cases.
"Öö Tiib" <ootiib@hot.ee>: Dec 22 02:43PM -0800

On Wednesday, 23 December 2020 at 00:30:50 UTC+2, Richard Damon wrote:
 
> There is a code point U+FFFD (Replacement Character) reserved for this
> sort of error.
 
That character is used to make the "incorrect Unicode" produced by your
product look ugly in a competitor's product that technically validates
it correctly.
"Christian Hanné" <the.hanne@gmail.com>: Dec 22 04:45PM +0100

You can get Volume 1 here: https://easyupload.io/s8tvow
The archive-password is "fuck u".
Brian Wood <woodbrian77@gmail.com>: Dec 22 11:30AM -0800

On Tuesday, December 22, 2020 at 9:46:16 AM UTC-6, Christian Hanné wrote:
 
The author gave a talk about the book that is a legal
and ethical way to get more info:
 
https://duckduckgo.com/?q=cppcon+2020+lakos&page=1&adx=artexpa&sexp=%7B%22cdrexp%22%3A%22b%22%2C%22artexp%22%3A%22a%22%2C%22prodexp%22%3A%22b%22%2C%22prdsdexp%22%3A%22c%22%2C%22biaexp%22%3A%22b%22%2C%22msvrtexp%22%3A%22b%22%2C%22bltexp%22%3A%22b%22%7D&iax=videos&ia=videos&iai=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dd3zMfMC8l5U
Keith Thompson <Keith.S.Thompson+u@gmail.com>: Dec 22 12:22PM -0800

> on writing very large C++ programs would be gladly accepted.
> If the project I am trying to sell goes ahead, it could easily
> become very large.
 
On amazon.com, I see:
 
Large-Scale C++ Volume I: Process and Architecture Dec 17, 2019
Large-Scale C++ Volume II: Design and Implementation Mar 14, 2021
 
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Philips Healthcare
void Void(void) { Void(); } /* The recursive call of the void */
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Dec 21 08:57PM -0800

On 12/21/2020 9:04 AM, Jorgen Grahn wrote:
>> [Jesus Loves You] ?
 
> He did. But Rick doesn't decide policy here. When you're posting in
> those threads, you're just as rude and disruptive as he is.
 
;^(
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Dec 21 08:58PM -0800

On 12/19/2020 3:37 PM, Mr Flibble wrote:
>> consider what real loss is:
 
> [snip - tl;dr]
 
> And Satan invented fossils, yes?
 
certain fossils, lol.
 
Finding fossils at the poles, under what was once a tropical-like
climate?
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Dec 21 10:34PM -0800

On 12/18/2020 4:43 PM, Rick C. Hodgin wrote:
> is running out for all of us -- saved or otherwise.
 
> Peace, my friend.  Tell your family and friends about Jesus.  Spread the
> word.  Let His call be heard from your mouth unto others.
 
https://youtu.be/SIxi2uqFVmc
seeplus <boardmounrt@gmail.com>: Dec 22 03:20AM -0800

On Saturday, December 19, 2020 at 11:43:25 AM UTC+11, Rick C. Hodgin wrote:

> My friends, the time of the rapture is upon us. This week, next week,
> this month, next month, this year, next year ... it's here.

Damn. This looks like you will be gone by March, maybe sooner.
 
Could you please leave me your stuff ASAP!
 
If you really do not think that this is going to happen, then just don't bother handing it over.
Will take it that you are not a true believer.
 
Just quote some nonsense bible message to give you an out.
Mr Flibble <flibble@i42.REMOVETHISBIT.co.uk>: Dec 22 07:54PM

On 22/12/2020 11:20, seeplus wrote:
 
> If you really do not think that this is going to happen, then just don't bother handing it over.
> Will take it that you are not a true believer.
 
> Just quote some nonsense bible message to give you an out.
 
Yes, Hodgin, donate all your money to charity now as you won't be needing it because the rapture is definitely going to happen!
 
A good charity to donate all your money to would be your local natural history museum: I hear they have a good collection of fossils that definitely aren't an invention of Satan!
 
Dweeb.
 
/Flibble
 
--
😎
"Öö Tiib" <ootiib@hot.ee>: Dec 22 05:06AM -0800

> OK, but I don't think I claimed some big benefit. I'm not
> enthralled with 2020 C++. I mention this in the hopes
> it could be included in the next standard.
 
My point is that there is no reason to. There are cases where the likes of
string_view are more efficient than char*, but this is not one of them.
The same goes for cases where it is safer. The char* version
can't be removed even if string_view is better. So we would
have two virtual functions side by side that do the same thing.
 
> > not allow compilers to remove families of functions from virtual
> > tables as rest of the calls remain virtual in meaningful program.
> Are you saying LTO is not relevant to meaningful programs?
 
No. Your attempt to mix static exceptions and LTO into the discussion is
just a red herring. LTO and static exceptions can find a local optimisation
opportunity here or there. In what percentage of meaningful programs
can LTO prove that what() of the *whole* std::exception family is *never*
called virtually? About half of the code of every meaningful program is
error handling, so it is close to zero.
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Dec 21 10:08PM -0800

For no kernel guys:
 
https://youtu.be/XrW5yerbAog