Thursday, February 23, 2023

Digest for comp.lang.c++@googlegroups.com - 25 updates in 2 topics

"james...@alumni.caltech.edu" <jameskuyper@alumni.caltech.edu>: Feb 22 09:21PM -0800

> "james...@alumni.caltech.edu" <james...@alumni.caltech.edu> wrote:
> >On Tuesday, February 21, 2023 at 9:53:23 AM UTC-5, Mut...@dastardlyhq.com
> >wrote:
...
> Absolute rubbish. Why would UTF8 have anything to do with "safety"? Safety
> means will the program crash or have hidden bugs, not whether a string gets
> translated into uppercase properly or not which would be immediately obvious.
 
Not getting the expected results can make other parts of the program malfunction.
There are certainly more dangerous problems a program can have, but there are more
reasons to worry about that issue than about the one that was your actual concern.
"james...@alumni.caltech.edu" <jameskuyper@alumni.caltech.edu>: Feb 22 09:30PM -0800

> On Tue, 21 Feb 2023 10:58:31 -0800 (PST)
> "james...@alumni.caltech.edu" <james...@alumni.caltech.edu> wrote:
...
> >parsed as a structured binding declaration, and in that context & would be
> No it wouldn't. Whitespace is not significant in C++ (ok, apart from the > >
> vs >>) template syntax hack up until 2011.
 
If white-space between tokens were significant after translation phase 5, the way in
which people used it wouldn't qualify as a "convention", but as a necessity for
correct code. Such conventions are chosen to make code easier for humans to
understand, not because they are needed to ensure that compilers handle the code
correctly.
 
Note: white-space within tokens is always significant, and white-space between
tokens can be significant in translation phase 4 and earlier.
Paavo Helde <eesnimi@osa.pri.ee>: Feb 23 09:39AM +0200


> Absolute rubbish. Why would UTF8 have anything to do with "safety"? Safety
> means will the program crash or have hidden bugs, not whether a string gets
> translated into uppercase properly or not which would be immediately obvious.
 
 
Because you asked: I have seen isdigit() crashing hard on negative
values, would not be surprised if toupper() would behave the same in
some implementation.
 
In case of a multi-byte UTF-8 character encoded in a std::string all its
bytes would have a negative value if 'char' is signed on that platform.
So there.
Muttley@dastardlyhq.com: Feb 23 09:25AM

On Wed, 22 Feb 2023 22:33:09 +0100
>> means will the program crash or have hidden bugs, not whether a string gets
>> translated into uppercase properly or not which would be immediately obvious.
 
>The UB includes that the program can crash.
 
Feel free to explain how toupper could crash.
Muttley@dastardlyhq.com: Feb 23 09:29AM

On Wed, 22 Feb 2023 22:34:37 +0100
>> vs >>) template syntax hack up until 2011.
 
>Either you missed the point, or you understood and deliberately snipped
>what you quoted to create a misleading impression.
 
I didn't miss the point at all. You implied that the positioning of the
whitespace in a declaration makes a difference. It doesn't.
Muttley@dastardlyhq.com: Feb 23 09:31AM

On Wed, 22 Feb 2023 21:30:04 -0800 (PST)
>correctly.
 
>Note: white-space within tokens is always significant, and white-space between
>tokens can be significant in translation phase 4 and earlier.
 
Quite obviously a program with no whitespace won't compile. That doesn't mean
the whitespace is significant in the programming sense.
Muttley@dastardlyhq.com: Feb 23 09:35AM

On Thu, 23 Feb 2023 09:39:04 +0200
>> means will the program crash or have hidden bugs, not whether a string gets
>> translated into uppercase properly or not which would be immediately obvious.
 
>Because you asked: I have seen isdigit() crashing hard on negative
 
That's clearly a library bug. All bets are off when they exist.
 
>In case of a multi-byte UTF-8 character encoded in a std::string all its
>bytes would have a negative value if 'char' is signed on that platform.
>So there.
 
One would assume unsigned would be used internally.
Paavo Helde <eesnimi@osa.pri.ee>: Feb 23 12:30PM +0200

>>> translated into uppercase properly or not which would be immediately obvious.
 
>> Because you asked: I have seen isdigit() crashing hard on negative
 
> That's clearly a library bug. All bets are off when they exist.
 
What makes you think so? The C standard clearly says in 7.4 (Character
handling <ctype.h>):
 
"In all cases the argument is an int, the value of which shall be
representable as an unsigned char or shall equal the value of the macro
EOF. If the argument has any other value, the behavior is undefined."
 
Undefined behavior may or may not involve a program crash.
 
>> bytes would have a negative value if 'char' is signed on that platform.
>> So there.
 
> One would assume unsigned would be used internally.
 
Alas, std::string is standardized to use plain char, whose signedness is
implementation dependent.
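
A minimal sketch of the usual workaround, assuming nothing beyond the standard library: convert each plain char to unsigned char before passing it to a <cctype> function, so the argument always lies in the range quoted from 7.4 above. The helper names count_digits and to_upper_bytes are hypothetical, chosen only for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Counts the bytes of s that std::isdigit classifies as digits.
// Each char is converted to unsigned char first, so the argument is
// representable as unsigned char even when plain char is signed and
// the byte is part of a multi-byte UTF-8 sequence.
std::size_t count_digits(const std::string& s) {
    std::size_t n = 0;
    for (char ch : s) {
        if (std::isdigit(static_cast<unsigned char>(ch))) {
            ++n;
        }
    }
    return n;
}

// Upper-cases byte by byte with std::toupper, using the same conversion.
// This works on bytes, not on Unicode code points: in the default "C"
// locale the bytes of a multi-byte UTF-8 character pass through unchanged.
std::string to_upper_bytes(std::string s) {
    for (char& ch : s) {
        ch = static_cast<char>(std::toupper(static_cast<unsigned char>(ch)));
    }
    return s;
}
```

The cast only makes the call well defined. It does not make toupper understand UTF-8, so a correct upper-casing of non-ASCII text still needs a Unicode-aware library.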
Muttley@dastardlyhq.com: Feb 23 10:34AM

On Thu, 23 Feb 2023 12:30:39 +0200
>representable as an unsigned char or shall equal the value of the macro
>EOF. If the argument has any other value, the behavior is undefined."
 
>Undefined behavior may or may not involve a program crash.
 
I would still consider a crash to be a bug. Undefined would just be returning
rubbish. I can't even figure out HOW it would crash since all it's doing is
 
return (c >= '0' && c <= '9')
 
unless there's some obscure way of doing that test even faster.
"Öö Tiib" <ootiib@hot.ee>: Feb 23 02:47AM -0800

> >> translated into uppercase properly or not which would be immediately obvious.
 
> >Because you asked: I have seen isdigit() crashing hard on negative
 
> That's clearly a library bug. All bets are off when they exist.
 
Nope, the standard matters only as a specification. Read the licence
agreements of the compilers you use or something. Those contain all the
warranties that you actually get, and that is where the interesting part of
our work only starts. No bets are off ... in practice we may need to use
clearly and provably defective implementations for developing
financial applications that people use daily and trust blindly without
thinking. We get paid well for that.
 
About 15 years ago one of my teams helped program a particular
point-of-sale credit card terminal using a gcc that produced a binary
which rebooted the terminal in the situation that Paavo described.
The POS could speak the native language of the card owner (which might
contain no Latin characters), got certified by EMV (Europay-Mastercard-Visa),
and I saw it in use only a few years ago.
Paavo Helde <eesnimi@osa.pri.ee>: Feb 23 01:20PM +0200

>> EOF. If the argument has any other value, the behavior is undefined."
 
>> Undefined behavior may or may not involve a program crash.
 
> I would still consider a crash to be a bug.
 
Sure, but the bug would be in your code.
 
> Undefined would just be returning
> rubbish. I can't even figure out HOW it would crash since all it's doing is
 
> return (c >= '0' && c <= '9')
 
Nope, because it's not known at the compile time which characters should
be considered digits. It might do something like
 
if (c==(EOF)) {
    return (EOF);
} else {
    lock_current_locale();
    int result = get_current_locale()->isdigit_map[c];
    unlock_current_locale();
    return result;
}
 
where isdigit_map is a 256-element array provided by the locale.
Richard Damon <Richard@Damon-Family.org>: Feb 23 07:18AM -0500

> rubbish. I can't even figure out HOW it would crash since all it's doing is
 
> return (c >= '0' && c <= '9')
 
> unless there's some obscure way of doing that test even faster.
 
The issue is that to support locales, things like isdigit might be
implemented as
 
int isdigit(int c) {
    return _prop_table[c+1] & DIGIT_PROPERTY;
}
 
where _prop_table gets set to a table based on the current locale, which
might define additional characters that are digits.
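
A rough sketch of why such a table goes wrong for negative arguments, assuming EOF is -1 (common, but not guaranteed by the standard) and an 8-bit unsigned char. The names kDigitBit, make_c_locale_table and lookup_is_digit are hypothetical and do not correspond to any real library. The table has 257 slots so that index c+1 is valid for EOF and for 0..255. The checked lookup below reports the cases where an unchecked _prop_table[c+1] would read outside the array.

```cpp
#include <array>
#include <cstddef>
#include <cstdio>    // EOF
#include <optional>

// Assumption made explicit: this sketch only fits platforms where EOF is -1.
static_assert(EOF == -1, "sketch assumes EOF == -1");

// Hypothetical property bit; real libraries define their own masks.
constexpr unsigned char kDigitBit = 0x1;

// Builds a table for the "C" locale: slot 0 is for EOF, slots 1..256
// are for the character values 0..255. Only '0'..'9' get the digit bit.
std::array<unsigned char, 257> make_c_locale_table() {
    std::array<unsigned char, 257> table{};  // value-initialized to zero
    for (int c = '0'; c <= '9'; ++c) {
        table[static_cast<std::size_t>(c + 1)] = kDigitBit;
    }
    return table;
}

// Checked lookup: returns std::nullopt for any c for which table[c + 1]
// would index outside the array, for example a negative UTF-8 byte.
std::optional<bool> lookup_is_digit(const std::array<unsigned char, 257>& table,
                                    int c) {
    if (c < -1 || c > 255) {
        return std::nullopt;
    }
    return (table[static_cast<std::size_t>(c + 1)] & kDigitBit) != 0;
}
```

With a signed char, the UTF-8 lead byte 0xC3 has the value -61, so an unchecked lookup would read 60 elements before the start of the table. Whatever lives there decides whether the result is garbage or a fault.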
David Brown <david.brown@hesbynett.no>: Feb 23 03:02PM +0100


>> Undefined behavior may or may not involve a program crash.
 
> I would still consider a crash to be a bug. Undefined would just be returning
> rubbish.
 
Undefined behaviour means there is no defined behaviour - crashing is
entirely plausible. It doesn't matter what /you/ think about it. The C
standard is quite clear about this - if you pass a valid argument to
isdigit(), as specified in the standard, you'll get a valid result. If
you pass something invalid, all bets are off and whatever happens is
/your/ problem.
 
This is so fundamental to the whole concept of programming that I am
regularly surprised by people who call themselves programmers, yet fail
to comprehend it. A function has a specified input domain, and a
specified result or behaviour for inputs in that domain. Move outside
that input domain, and you are in the realm of nonsense. You don't
expect particular behaviour from 1/0 - maybe you'll get a random value,
maybe you'll get a crash. Why you think calling isdigit() with an
invalid input should have some guarantees is beyond me. "Garbage in,
garbage out" applies to behaviour, not just values, and has been
understood since Babbage designed the first programmable mechanical
computer.
 
So yes, there is a bug - it's in /your/ code if you pass an invalid
value to the function.
 
 
> I can't even figure out HOW it would crash since all it's doing is
 
> return (c >= '0' && c <= '9')
 
> unless there's some obscure way of doing that test even faster.
 
There are other ways that can be faster (depending on details of
processor, cache uses, and other aspects). The traditional
implementation of the <ctype.h> classification functions involves lookup
tables, and it is quite reasonable for a negative value to lead to
things going horribly wrong.
Ben Bacarisse <ben.usenet@bsb.me.uk>: Feb 23 02:49PM

> to comprehend it. A function has a specified input domain, and a specified
> result or behaviour for inputs in that domain. Move outside that input
> domain, and you are in the realm of nonsense.
 
This is formally true, but I think we can also legitimately ask to what
extent a function's domain (and the corresponding returned values) are
reasonable and helpful.
 
> some guarantees is beyond me. "Garbage in, garbage out" applies to
> behaviour, not just values, and has been understood since Babbage designed
> the first programmable mechanical computer.
 
Those of us with a background in languages like C are not going to be
confused, but I bet almost everyone who comes to C from a more modern
language will be astonished by what you have to do to get isdigit to
work safely. To have a character testing function that does not work
for all the values of the language's char type is, well, bonkers.
 
In Haskell, a program won't even compile unless isDigit is called with
an argument of type Char, and the result is defined to be exactly one
of True or False for all values of that type.
 
And that brings up another trap for the unwary: C's isdigit returns
something that is only vaguely Boolean. For example, you can't test if
char c1, c2; are both digits or neither are digits with
 
isdigit(c1) == isdigit(c2)
 
because the value indicating "yes" is not guaranteed to be anything
other than "not zero". Instead you'd write
 
!isdigit((unsigned char)c1) == !isdigit((unsigned char)c2)
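
One possible way to package that idiom in C++, as a sketch only: a small helper that does the unsigned char conversion and collapses the "not zero" result to a real bool, so the results can be compared directly. The names is_digit_char and same_digit_class are hypothetical.

```cpp
#include <cctype>

// Wraps std::isdigit so that the argument is always in range and the
// result is exactly true or false rather than "zero or not zero".
bool is_digit_char(char c) {
    return std::isdigit(static_cast<unsigned char>(c)) != 0;
}

// True when c1 and c2 are both digits or both non-digits.
bool same_digit_class(char c1, char c2) {
    return is_digit_char(c1) == is_digit_char(c2);
}
```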
 
I think an occasional nod to how we have got used to such nonsense is
merited!
 
--
Ben.
Andrey Tarasevich <andreytarasevich@hotmail.com>: Feb 23 07:05AM -0800


> Feel free to explain how toupper could crash.
 
There's no such concept as "explaining" undefined behavior.
 
Yet, it could be very simple. The implementation treats `toupper` as an
intrinsic, and the compiler explicitly generates a "CRASH NOW!!11"
instruction for invalid arguments. Let's say that in their
implementation of `toupper` it would result in a negligible performance
penalty or no penalty at all.
 
GCC is well-known to do such things, for one example.
 
--
Best regards,
Andrey
Andrey Tarasevich <andreytarasevich@hotmail.com>: Feb 23 07:16AM -0800


>> Undefined behavior may or may not involve a program crash.
 
> I would still consider a crash to be a bug. Undefined would just be returning
> rubbish.
 
Nope. "Returning rubbish" would be an example of _unspecified behavior_.
Undefined is a wholly different thing.
 
--
Best regards,
Andrey
David Brown <david.brown@hesbynett.no>: Feb 23 04:25PM +0100

On 23/02/2023 15:49, Ben Bacarisse wrote:
 
> This is formally true, but I think we can also legitimately ask to what
> extent a function's domain (and the corresponding returned values) are
> reasonable and helpful.
 
Sure. You could, for example, argue that "isdigit" would be better
designed if it were to return "false" on any int value outside of the
current valid range. But it might be less efficient if it had such an
extended domain - do you optimise for maximum efficiency for programmers
who are able to read and follow specifications and write correct code,
or do you optimise for minimal surprise for programmers who can't or
won't follow the specifications? I'd say that for C, it's the former -
let those who don't understand the importance of following
specifications use a different language more suited to their needs,
wants and skills. There is a time and a place for making functions with
maximal input domains and controlled handling of nonsensical inputs -
low-level functions like "isdigit" are not such cases.
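
For illustration, the extended-domain design could be sketched as a wrapper like the one below. This is not how any particular library does it, and the name is_digit_total is hypothetical. It returns false for every int outside the standard's domain, at the price of one extra range check per call.

```cpp
#include <cctype>
#include <climits>   // UCHAR_MAX
#include <cstdio>    // EOF

// Defined for every int: values that are neither EOF nor representable
// as unsigned char are reported as "not a digit" instead of being
// passed on to std::isdigit.
bool is_digit_total(int c) {
    if (c != EOF && (c < 0 || c > UCHAR_MAX)) {
        return false;
    }
    return std::isdigit(c) != 0;
}
```

Note that this rejects a negative char value rather than reinterpreting it as a byte, so it answers a slightly different question than casting to unsigned char first.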
 
> language will be astonished by what you have to do to get isdigit to
> work safely. To have a character testing function that does not work
> for all the values of the language's char type is, well, bonkers.
 
IMHO the concept of "character" in C and C++ is a mess these days. It
was perhaps inevitable, given the history of the languages, the
development of characters, and the overriding requirement for backwards
compatibility. The notion of "signed characters" and "unsigned
characters" is insane. The jumble of "wide characters" and various
varieties of UTF formats is confusing at best. There are various
character sets - source, execution, basic, extended, whatever (the terms
seem to change regularly, especially in C++). Sometimes these are the
same, sometimes different. Some of the different character types are
the same size as others, but have different interpretations. Sometimes
they have the same interpretations, but are still distinct. Some of the
standard C library functions work only on 7-bit ASCII, some will work
with UTF-8 as well. Some of them work with "int" parameters instead of
more logical "char" types, and support non-character values (like EOF)
in functions that appear to take character parameters. And some
functions treat EOF as a normal character.
 
And if you are coming to C from pretty much any other language, you'll
be shocked at the rudimentary "string" support.
 
So while I agree that it might be surprising to find that "isdigit()" is
not defined for all possible values of all character types, I think it
would be /way/ down the list.
 
This is not a criticism of C - different languages are better and worse
for different things. But if you are working in C, you have to learn C
- you can't just assume it is like whatever other languages you have
used. And if C and C++ were to try to be like other languages, such as
by specifying values for every int value in "isdigit" calls, or having
"isdigit" throw a C++ exception on bad values, you'd lose some of the
aspects that make C and C++ important and useful languages.
 
 
> !isdigit((unsigned char)c1) == !isdigit((unsigned char)c2)
 
> I think an occasional nod to how we have got used to such nonsense is
> merited!
 
Yes, absolutely.
"Alf P. Steinbach" <alf.p.steinbach@gmail.com>: Feb 23 04:34PM +0100

> rubbish. I can't even figure out HOW it would crash since all it's doing is
 
> return (c >= '0' && c <= '9')
 
> unless there's some obscure way of doing that test even faster.
 
An MS runtime library example, but apparently this is old code, not for
the current version:
 
<url:
https://github.com/ojdkbuild/tools_toolchain_sdk10_1607/blob/master/Source/10.0.14393.0/ucrt/convert/isctype.cpp#L34-L39>
 
// The _chvalidator function is called by the character classification functions
// in the debug CRT. This function tests the character argument to ensure that
// it is not out of range. For performance reasons, this function is not used
// in the retail CRT.
#if defined _DEBUG

extern "C" int __cdecl _chvalidator(int const c, int const mask)
{
    _ASSERTE(c >= -1 && c <= 255);
    return _chvalidator_l(nullptr, c, mask);
}

extern "C" int __cdecl _chvalidator_l(_locale_t const locale, int const c, int const mask)
{
    _ASSERTE(c >= -1 && c <= 255);

    _LocaleUpdate locale_update(locale);

    int const index = (c >= -1 && c <= 255) ? c : -1;

    return locale_update.GetLocaleT()->locinfo->_public._locale_pctype[index] & mask;
}
 
