- Is this safe? - 22 Updates
- Order of addresses of string-literals inside a translation unit - 3 Updates
"james...@alumni.caltech.edu" <jameskuyper@alumni.caltech.edu>: Feb 22 09:21PM -0800 > "james...@alumni.caltech.edu" <james...@alumni.caltech.edu> wrote: > >On Tuesday, February 21, 2023 at 9:53:23 AM UTC-5, Mut...@dastardlyhq.com > >wrote: ... > Absolute rubbish. Why would UTF8 have anything to do with "safety"? Safety > means will the program crash or have hidden bugs, not whether a string gets > translated into uppercase properly or not which would be immediately obvious. Not getting the expected results can make other parts of the program malfunction. There are certainly more dangerous problems a program can have, but there's more reasons to worry about that issue than about the one that was your actual concern. |
"james...@alumni.caltech.edu" <jameskuyper@alumni.caltech.edu>: Feb 22 09:30PM -0800 > On Tue, 21 Feb 2023 10:58:31 -0800 (PST) > "james...@alumni.caltech.edu" <james...@alumni.caltech.edu> wrote: ... > >parsed as a structured binding declaration, and in that context & would be > No it wouldn't. Whitespace is not significant in C++ (ok, apart from the > > > vs >>) template syntax hack up until 2011. If white-space between tokens were significant after translation phase 5, the way in which people used it wouldn't qualify as a "convention", but as a necessity for correct code. Such conventions are chosen to make code easier for humans to understand, not because they are needed to ensure that compilers handle the code correctly. Note: white-space within tokens is always significant, and white-space between tokens can be significant in translation phase 4 and earlier. |
Paavo Helde <eesnimi@osa.pri.ee>: Feb 23 09:39AM +0200 > Absolute rubbish. Why would UTF8 have anything to do with "safety"? Safety > means will the program crash or have hidden bugs, not whether a string gets > translated into uppercase properly or not which would be immediately obvious. Because you asked: I have seen isdigit() crashing hard on negative values, would not be surprised if toupper() would behave the same in some implementation. In case of a multi-byte UTF-8 character encoded in a std::string all its bytes would have a negative value if 'char' is signed on that platform. So there. |
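To illustrate the point about signed char, here is a minimal sketch (the helper names `count_negative_bytes` and `count_high_bit_bytes` are made up for this example). It counts how many bytes of a string would reach a `<cctype>` function as a negative int if passed without a cast, and how many bytes have the high bit set regardless of the signedness of plain char.

```cpp
#include <climits>
#include <string>

// Hypothetical helper: counts bytes that become a negative int when
// converted the way an uncast call such as std::toupper(ch) converts them.
// Where plain char is unsigned this is always 0.
int count_negative_bytes(const std::string& s) {
    int n = 0;
    for (char ch : s) {
        if (static_cast<int>(ch) < 0) ++n;
    }
    return n;
}

// Hypothetical helper: counts bytes with the high bit set, which is what
// every byte of a multi-byte UTF-8 sequence looks like. The result does
// not depend on whether plain char is signed.
int count_high_bit_bytes(const std::string& s) {
    int n = 0;
    for (char ch : s) {
        if (static_cast<unsigned char>(ch) >= 0x80) ++n;
    }
    return n;
}
```

For a string holding the two UTF-8 bytes 0xC3 0xA4 followed by an ASCII letter, the second function reports 2 everywhere, while the first reports 2 only where plain char is signed.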
Muttley@dastardlyhq.com: Feb 23 09:25AM On Wed, 22 Feb 2023 22:33:09 +0100 >> means will the program crash or have hidden bugs, not whether a string gets >> translated into uppercase properly or not which would be immediately obvious. >The UB includes that the program can crash. Feel free to explain how toupper could crash. |
Muttley@dastardlyhq.com: Feb 23 09:29AM On Wed, 22 Feb 2023 22:34:37 +0100 >> vs >>) template syntax hack up until 2011. >Either you missed the point, or you understood and deliberately snipped >what you quoted to create a misleading impression. I didn't miss the point at all. You implied that the positioning of the whitespace in a declaration makes a difference. It doesn't. |
Muttley@dastardlyhq.com: Feb 23 09:31AM On Wed, 22 Feb 2023 21:30:04 -0800 (PST) >correctly. >Note: white-space within tokens is always significant, and white-space between >tokens can be significant in translation phase 4 and earlier. Quite obviously a program with no whitespace won't compile. That doesn't mean the whitespace is significant in the programming sense. |
Muttley@dastardlyhq.com: Feb 23 09:35AM On Thu, 23 Feb 2023 09:39:04 +0200 >> means will the program crash or have hidden bugs, not whether a string gets >> translated into uppercase properly or not which would be immediately obvious. >Because you asked: I have seen isdigit() crashing hard on negative Thats clearly a library bug. All bets are off when they exist. >In case of a multi-byte UTF-8 character encoded in a std::string all its >bytes would have a negative value if 'char' is signed on that platform. >So there. One would assume unsigned would be used internally. |
Paavo Helde <eesnimi@osa.pri.ee>: Feb 23 12:30PM +0200 >>> translated into uppercase properly or not which would be immediately obvious. >> Because you asked: I have seen isdigit() crashing hard on negative > Thats clearly a library bug. All bets are off when they exist. What makes you think so? The C standard clearly says in 7.4 (Character handling <ctype.h>): "In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined." Undefined behavior may or may not involve a program crash. >> bytes would have a negative value if 'char' is signed on that platform. >> So there. > One would assume unsigned would be used internally. Alas, std::string is standardized to use plain char, whose signedness is implementation dependent. |
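Given the requirement quoted from 7.4, the usual way to stay inside the defined domain is to convert the char to unsigned char before the call. A minimal sketch follows; the wrapper names `is_digit_safe` and `to_upper_safe` are invented here and are not part of any library.

```cpp
#include <cctype>

// Hypothetical wrapper: the cast guarantees the argument is representable
// as unsigned char, as the C standard requires for <ctype.h> functions.
inline bool is_digit_safe(char ch) {
    return std::isdigit(static_cast<unsigned char>(ch)) != 0;
}

// Hypothetical wrapper: same cast on the way in, then back to char.
// Bytes of multi-byte UTF-8 sequences are passed validly, but they are
// still treated as single characters of the current locale.
inline char to_upper_safe(char ch) {
    return static_cast<char>(std::toupper(static_cast<unsigned char>(ch)));
}
```

This only removes the undefined behavior. It does not make `toupper` understand UTF-8, which is the separate correctness issue raised earlier in the thread.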
Muttley@dastardlyhq.com: Feb 23 10:34AM On Thu, 23 Feb 2023 12:30:39 +0200 >representable as an unsigned char or shall equal the value of the macro >EOF. If the argument has any other value, the behavior is undefined." >Undefined behavior may or may not involve a program crash. I would still consider a crash to be a bug. Undefined would just be returning rubbish. I can't even figure out HOW it would crash since all its doing is return (c >= '0' && c <= '9') unless there's some obscure way of doing that test even faster. |
"Öö Tiib" <ootiib@hot.ee>: Feb 23 02:47AM -0800 > >> translated into uppercase properly or not which would be immediately obvious. > >Because you asked: I have seen isdigit() crashing hard on negative > Thats clearly a library bug. All bets are off when they exist. Nope, standard does matter only as specification. Read the licence agreements of compilers you use or something. There are all the warranties that you actually get and there the interesting part of our work only starts. No bets are off ... in practice we may need to use clearly and provably defective implementations for developing financial applications that people use daily and trust blindly without thinking. We get paid well for that. About 15 years ago one of my teams helped programming particular point-of-sale credit card terminal using gcc that produced a binary that rebooted that terminal on case of situation that Paavo described. POS could talk native language of card owner (that might contain none of Latin characters) got certified by EMV (eurocard-mastercard-visa) and I saw it in use only few years ago. |
Paavo Helde <eesnimi@osa.pri.ee>: Feb 23 01:20PM +0200 >> EOF. If the argument has any other value, the behavior is undefined." >> Undefined behavior may or may not involve a program crash. > I would still consider a crash to be a bug. Sure, but the bug would be in your code. > Undefined would just be returning > rubbish. I can't even figure out HOW it would crash since all its doing is > return (c >= '0' && c <= '9') Nope, because it's not known at the compile time which characters should be considered digits. It might do something like

    if (c==(EOF)) {
        return (EOF);
    } else {
        lock_current_locale();
        int result = get_current_locale()->isdigit_map[c];
        unlock_current_locale();
        return result;
    }

where isdigit_map is a 256-element array provided by the locale. |
Richard Damon <Richard@Damon-Family.org>: Feb 23 07:18AM -0500 > rubbish. I can't even figure out HOW it would crash since all its doing is > return (c >= '0' && c <= '9') > unless there's some obscure way of doing that test even faster. The issue is that to support locales, things like isdigit might be implemented as

    int isdigit(int c) {
        return _prop_table[c+1] & DIGIT_PROPERTY;
    }

where _prop_table gets set to a table based on the current locale, which might define additional characters that are digits. |
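The table-driven approach described in the last two messages can be sketched as follows. Everything here is hypothetical (`prop_table`, `kDigitProperty`, `table_index`, `checked_isdigit`); it is not any real library's implementation. The sketch only shows which index an unchecked lookup would compute, and adds a bounds check so that it never reads outside the table itself.

```cpp
#include <array>
#include <cstddef>

// Hypothetical property bit for "is a decimal digit".
constexpr int kDigitProperty = 0x04;

// Hypothetical table with 257 slots: slot 0 is for EOF (-1),
// slots 1..256 are for the unsigned char values 0..255.
inline const std::array<unsigned char, 257>& prop_table() {
    static const std::array<unsigned char, 257> table = [] {
        std::array<unsigned char, 257> t{};
        for (int c = '0'; c <= '9'; ++c) {
            t[static_cast<std::size_t>(c + 1)] = kDigitProperty;
        }
        return t;
    }();
    return table;
}

// The index an unchecked `table[c + 1]` lookup would use.
constexpr long table_index(int c) { return static_cast<long>(c) + 1; }

// True only for c in -1..255, the domain the C standard allows.
constexpr bool index_in_table(int c) {
    return table_index(c) >= 0 && table_index(c) < 257;
}

// Checked variant: out-of-domain arguments yield false instead of an
// out-of-bounds read. A real implementation is not required to do this.
inline bool checked_isdigit(int c) {
    if (!index_in_table(c)) return false;
    const auto i = static_cast<std::size_t>(table_index(c));
    return (prop_table()[i] & kDigitProperty) != 0;
}
```

With a signed char holding the UTF-8 lead byte 0xC3, the argument is -61 and the computed index is -60, which is where an unchecked lookup would read memory in front of the table.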
David Brown <david.brown@hesbynett.no>: Feb 23 03:02PM +0100 >> Undefined behavior may or may not involve a program crash. > I would still consider a crash to be a bug. Undefined would just be returning > rubbish. Undefined behaviour means there is no defined behaviour - crashing is entirely plausible. It doesn't matter what /you/ think about it. The C standard is quite clear about this - if you pass a valid argument to isdigit(), as specified in the standard, you'll get a valid result. If you pass something invalid, all bets are off and whatever happens is /your/ problem. This is so fundamental to the whole concept of programming that I am regularly surprised by people who call themselves programmers, yet fail to comprehend it. A function has a specified input domain, and a specified result or behaviour for inputs in that domain. Move outside that input domain, and you are in the realm of nonsense. You don't expect particular behaviour from 1/0 - maybe you'll get a random value, maybe you'll get a crash. Why you think calling isdigit() with an invalid input should have some guarantees is beyond me. "Garbage in, garbage out" applies to behaviour, not just values, and has been understood since Babbage designed the first programmable mechanical computer. So yes, there is a bug - it's in /your/ code if you pass an invalid value to the function. > I can't even figure out HOW it would crash since all its doing is > return (c >= '0' && c <= '9') > unless there's some obscure way of doing that test even faster. There are other ways that can be faster (depending on details of processor, cache uses, and other aspects). The traditional implementation of the <ctype.h> classification functions involves lookup tables, and it is quite reasonable for a negative value to lead to things going horribly wrong. |
Ben Bacarisse <ben.usenet@bsb.me.uk>: Feb 23 02:49PM > to comprehend it. A function has a specified input domain, and a specified > result or behaviour for inputs in that domain. Move outside that input > domain, and you are in the realm of nonsense. This is formally true, but I think we can also legitimately ask to what extent a function's domain (and the corresponding returned values) are reasonable and helpful. > some guarantees is beyond me. "Garbage in, garbage out" applies to > behaviour, not just values, and has been understood since Babbage designed > the first programmable mechanical computer. Those of us with a background in languages like C are not going to be confused, but I bet almost everyone who comes to C from a more modern language will be astonished by what you have to do to get isdigit to work safely. To have a character testing function that does not work for all the values of the language's char type is, well, bonkers. In Haskell, a program won't even compile unless isDigit is called with an argument of type Char, and the result is defined to be exactly one of True or False for all values of that type. And that brings up another trap for the unwary: C's isdigit returns something that is only vaguely Boolean. For example, you can't test if

    char c1, c2;

are both digits or neither are digits with

    isdigit(c1) == isdigit(c2)

because the value indicating "yes" is not guaranteed to be anything other than "not zero". Instead you'd write

    !isdigit((unsigned char)c1) == !isdigit((unsigned char)c2)

I think an occasional nod to how we have got used to such nonsense is merited! -- Ben. |
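The comparison trap described in the message above can be wrapped up in a small function. This is only a sketch, and the name `same_digit_status` is invented for the example. It relies on `!` mapping every nonzero "yes" value to the same result, so two digits compare equal even if `isdigit` returns different nonzero values for them.

```cpp
#include <cctype>

// Hypothetical helper: true when c1 and c2 are both digits or both
// non-digits. The casts keep the arguments inside the domain required
// by <ctype.h>; the `!` normalises "any nonzero value" to one result.
inline bool same_digit_status(char c1, char c2) {
    return !std::isdigit(static_cast<unsigned char>(c1)) ==
           !std::isdigit(static_cast<unsigned char>(c2));
}
```

Comparing the raw return values instead would be allowed to report two digits as different, since the standard only promises zero versus nonzero.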
Andrey Tarasevich <andreytarasevich@hotmail.com>: Feb 23 07:05AM -0800 > Feel free to explain how toupper could crash. There's no such concept as "explaining" undefined behavior. Yet, it could be very simple. The implementation treats `toupper` as an intrinsic, and the compiler explicitly generates a "CRASH NOW!!11" instruction for invalid arguments. Let's say that in their implementation of `toupper` it would result in a negligible performance penalty or no penalty at all. GCC is well-known to do such things, for one example. -- Best regards, Andrey |
Andrey Tarasevich <andreytarasevich@hotmail.com>: Feb 23 07:16AM -0800 >> Undefined behavior may or may not involve a program crash. > I would still consider a crash to be a bug. Undefined would just be returning > rubbish. Nope. "Returning rubbish" would be an example of _unspecified behavior_. Undefined is a wholly different thing. -- Best regards, Andrey |
David Brown <david.brown@hesbynett.no>: Feb 23 04:25PM +0100 On 23/02/2023 15:49, Ben Bacarisse wrote: > This is formally true, but I think we can also legitimately ask to what > extent a function's domain (and the corresponding returned values) are > reasonable and helpful. Sure. You could, for example, argue that "isdigit" would be better designed if it were to return "false" on any int value outside of the current valid range. But it might be less efficient if it had such an extended domain - do you optimise for maximum efficiency for programmers who are able to read and follow specifications and write correct code, or do you optimise for minimal surprise for programmers who can't or won't follow the specifications? I'd say that for C, it's the former - let those who don't understand the importance of following specifications use a different language more suited to their needs, wants and skills. There is a time and a place for making functions with maximal input domains and controlled handling of nonsensical inputs - low-level functions like "isdigit" are not such cases. > language will be astonished by what you have to do to get isdigit to > work safely. To have a character testing function that does not work > for all the values of the language's char type is, well, bonkers. IMHO the concept of "character" in C and C++ is a mess these days. It was perhaps inevitable, given the history of the languages, the development of characters, and the overriding requirement for backwards compatibility. The notion of "signed characters" and "unsigned characters" is insane. The jumble of "wide characters" and various varieties of UTF formats is confusing at best. There are various character sets - source, execution, basic, extended, whatever (the terms seem to change regularly, especially in C++). Sometimes these are the same, sometimes different. Some of the different character types are the same size as others, but have different interpretations. 
Sometimes they have the same interpretations, but are still distinct. Some of the standard C library functions work only on 7-bit ASCII, some will work with UTF-8 as well. Some of them work with "int" parameters instead of more logical "char" types, and support non-character values (like EOF) in functions that appear to take character parameters. And some functions treat EOF as a normal character. And if you are coming to C from pretty much any other language, you'll be shocked at the rudimentary "string" support. So while I agree that it might be surprising to find that "isdigit()" is not defined for all possible values of all character types, I think it would be /way/ down the list. This is not a criticism of C - different languages are better and worse for different things. But if you are working in C, you have to learn C - you can't just assume it is like whatever other languages you have used. And if C and C++ were to try to be like other languages, such as by specifying values for every int value in "isdigit" calls, or having "isdigit" throw a C++ exception on bad values, you'd lose some of the aspects that make C and C++ important and useful languages. > !isdigit((unsigned char)c1) == !isdigit((unsigned char)c2) > I think an occasional nod to how we have got used to such nonsense is > merited! Yes, absolutely. |
"Alf P. Steinbach" <alf.p.steinbach@gmail.com>: Feb 23 04:34PM +0100 > rubbish. I can't even figure out HOW it would crash since all its doing is > return (c >= '0' && c <= '9') > unless there's some obscure way of doing that test even faster. An MS runtime library example, but apparently this is old code, not for the current version: <url: https://github.com/ojdkbuild/tools_toolchain_sdk10_1607/blob/master/Source/10.0.14393.0/ucrt/convert/isctype.cpp#L34-L39> // The _chvalidator function is called by the character classification functions // in the debug CRT. This function tests the character argument to ensure that // it is not out of range. For performance reasons, this function is not used // in the retail CRT. #if defined _DEBUG extern "C" int __cdecl _chvalidator(int const c, int const mask) { _ASSERTE(c >= -1 && c <= 255); return _chvalidator_l(nullptr, c, mask); } extern "C" int __cdecl _chvalidator_l(_locale_t const locale, int const c, int const mask) { _ASSERTE(c >= -1 && c <= 255); _LocaleUpdate locale_update(locale); int const index = (c >= -1 && c <= 255) ? c : -1; return locale_update.GetLocaleT()->locinfo->_public._locale_pctype[index] & mask; }