- two's complement idea - 10 Updates
- alloca-alternative - 1 Update
Bonita Montero <Bonita.Montero@gmail.com>: Nov 06 08:15PM +0100

> All standard library writers are rather good programmers.

Another aspect is: if millions of developers reuse code instead of
writing it on their own, errors will be noticed for sure; but when you
do everything yourself, a lot of errors won't be noticed.
David Brown <david.brown@hesbynett.no>: Nov 06 08:36PM +0100

On 06/11/2019 15:38, Manfred wrote:

>>> return 0;
>> That is just silly, in all sorts of ways.
> You realize that this comes from the glibc maintainers, don't you?

No. But that doesn't change my opinion - at least not much. It makes a
bit more sense if it has to be portable to DOS, for example, with a
16-bit size_t.

> Moreover, I took this as an example of detection of integer overflow.
> The fact that it happens to be about pathname strings is irrelevant to
> this discussion.

It is, IMHO, a fine example of when overflow should not be an issue -
and that defined overflow behaviour is a lot less useful than many
people think. (I still agree that occasionally it is useful.)

> Obviously this applies to string /contents/; the pointers themselves
> can only be internal to the program (can't they?), so no need to check
> for null pointers.

I don't know the circumstances for this function, so I can't answer
that. But if it is an externally accessible function, I think checks
for null pointers would make sense.

> On the other hand, the contents of the strings are checked by ensuring
> that the result of strlen and their combination is valid. This is
> ensured /exactly/ by making use of unsigned wrapping behavior.

The code is going out of its way to check something that is almost
impossible to occur, even with malicious intentions - while ignoring
problems that could easily occur even by accident.

>> in a reasonable limit.
> This code handles C strings, so there is no way to check for their
> length other than running strlen.

    size_t limited_strlen(const char* s, size_t n)
    {
        const char* p = memchr(s, 0, n);
        if (!p) return (size_t) -1;   // Indicate bad string
        return p - s;
    }

> The fact that you seem to miss is that it is exactly thanks to the
> check that you call "silly" that it is ensured that they "actually end
> with a 0 in a reasonable limit".

The checks don't do that. If the three parameters pointed to strings
adding up to 4 GB - 1 characters in total, by what stretch of the
imagination is that "in a reasonable limit"?

> 0 terminated, since the function is going to be called by some other
> part of the program that can take care that there is a 0 at the end of
> the buffer.

Failing to 0-terminate a character sequence that is intended to be a
string is not an uncommon bug.

> What is not guaranteed is that the strings actually contain pathnames,
> and don't contain very long malicious text instead (e.g. they could
> come from stdin).

You need text sequences totalling more than 4 GB to cause a problem
(and that's on older 32-bit size_t systems). Since the user program
can't have as much as 4 GB of data space, you'd need not only huge
malicious texts, but you'd also need at least two of the three
parameters to overlap.

> In fact cryptography is another example where unsigned wrap is useful,
> but it would be much more complex (and off topic) to draw an example
> of this (not that I claim to be an expert in this area)

Yes, I agree. As I said, sometimes unsigned wrap is useful. But not in
this example.

> And no, just assuming that "adding up realistic numbers will not
> overflow a 64-bit type" is not what safe code is about.

Safe coding is never about assuming anything. But safe coding involves
looking at what values can occur (intentionally and unintentionally),
and making checks /only/ when you need checks. It involves /thinking/,
rather than making checks blindly "just in case".
And when you have 64-bit types, there are few non-specialist occasions
when your values can overflow with simple addition. If there is a risk
of there being an overflow, then you probably have not sanity-checked
the values coming in.

>> if (big_size_t > MAX_SANE_PATHLENGTH) return 0;
> You realize that this code is less efficient than the original one,
> don't you?

You realise that it is /more/ efficient on any 64-bit system? And that
on 32-bit systems it is at most marginally less efficient, but might be
more efficient (by avoiding branches)? And that this code does
something /useful/ by checking for a sensible maximum pathlength? And
that any few instruction cycles lost here are negligible compared to
the hundreds that are needed for copying even a small path string?

> And what would be the correct value for MAX_SANE_PATHLENGTH? Are you
> aware of the trouble that has been caused by Windows MAX_PATH?

You'd pick a big value for it - but a lot less than 4G. Most
filesystems limit file names to 255 bytes. What would you say is a
maximum depth of directories before it is clear that you have
thoroughly screwed up somewhere - maybe 1000? So pick
MAX_SANE_PATHLENGTH at 256K on 32-bit systems, and 16M on 64-bit
systems. People with more knowledge of the target system in question
can pick better values. POSIX used to have a limit of 4096, which was
okay in the old days but got a bit limiting. Windows is limited to 32K
now, which seems to be fine - the old limit of 260 characters was
clearly too small.

> resources, and allowing for the maximum string length that the system
> can /safely/ handle (don't miss the check after malloc). What more do
> you want?

See what I wrote above.

>> There are times when unsigned wrapping overflow is useful. This is
>> not one of them.
> I suggest you read the code again (and its source - it is instructive)

I did.
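For readers joining the thread here: the code being debated is a
make_pathname() routine along the lines below. This is a reconstruction
from the details quoted in the thread, not the verbatim glibc source -
the per-addition wrap checks and the sprintf call in particular are
assumptions.

    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>

    char *make_pathname(const char *dir, const char *fname, const char *ext)
    {
        size_t dirlen  = strlen(dir);
        size_t filelen = strlen(fname);
        size_t extlen  = strlen(ext);

        /* Detect wrap-around using defined unsigned behaviour: for
           size_t a + b, the sum wrapped iff the result is smaller
           than either operand. */
        size_t pathlen = dirlen + filelen;
        if (pathlen < dirlen) return 0;
        pathlen += extlen;
        if (pathlen < extlen) return 0;
        pathlen += 3;                /* '/', '.' and the terminating 0 */
        if (pathlen < 3) return 0;

        char *path = (char *)malloc(pathlen);
        if (!path) return 0;         /* the check after malloc */

        sprintf(path, "%s/%s.%s", dir, fname, ext);
        return path;
    }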
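And a sketch of the alternative David argues for above: widen to a
64-bit type before adding, then reject anything beyond a sanity limit.
make_pathname, MAX_SANE_PATHLENGTH, uint_least64_t and "return 0"
appear in the thread; the surrounding function body is assumed.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>

    #define MAX_SANE_PATHLENGTH (256u * 1024u)   /* 32-bit example value */

    char *make_pathname(const char *dir, const char *fname, const char *ext)
    {
        /* Widen before adding: three 32-bit size_t values cannot wrap
           a 64-bit sum, so no wrap checks are needed here. */
        uint_least64_t pathlen = (uint_least64_t)strlen(dir)
                               + strlen(fname) + strlen(ext) + 3;
        if (pathlen > MAX_SANE_PATHLENGTH) return 0;

        char *path = (char *)malloc((size_t)pathlen);
        if (!path) return 0;

        sprintf(path, "%s/%s.%s", dir, fname, ext);
        return path;
    }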
David Brown <david.brown@hesbynett.no>: Nov 06 08:44PM +0100

On 06/11/2019 17:47, Manfred wrote:

> even if this is true for the physical memory address space, if I
> remember correctly the 386 has a way larger virtual memory addressing
> space: it does have segmenting capability, even if most OSs never used
> it.

The 32-bit x86 could have a larger virtual memory space than 4 GB, and
processors with PAE could have more physical memory too (I think 64 GB,
but I am not sure). However, individual processes on *nix or Windows
could only access 4 GB (split either 2G/2G or 1G/3G between kernel
space and user space). The pointers here would all be within the same
4 GB limit.

> More practically, the example was about code safety, and so the
> possibility of malicious usage has to be assumed, hence the need for
> the check (at least for the +3 part).

Checking parameters for sanity is vital for externally-facing
functions. This one fails to check for null pointers, fails to check
for sane sizes of the strings (strlen is not a good idea on unknown
strings, if you consider excessive run time to be a problem), and fails
to check for any kind of appropriate size on the data. It checks that
the curtains are closed while leaving the front door open.
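A sketch of the "front door" checks being described, reusing the
limited_strlen helper from the earlier post; the limit value and the
wrapper function are assumptions for illustration:

    #include <stddef.h>
    #include <string.h>

    #define MAX_SANE_PATHLENGTH 4096u      /* assumed for illustration */

    /* From the earlier post: strlen with an upper bound. */
    static size_t limited_strlen(const char *s, size_t n)
    {
        const char *p = (const char *)memchr(s, 0, n);
        if (!p) return (size_t) -1;        /* indicate bad string */
        return (size_t)(p - s);
    }

    /* Nonzero iff all three components are non-null and 0-terminated
       within the sanity limit. */
    static int pathname_args_ok(const char *dir, const char *fname,
                                const char *ext)
    {
        if (!dir || !fname || !ext) return 0;
        return limited_strlen(dir,   MAX_SANE_PATHLENGTH) != (size_t)-1
            && limited_strlen(fname, MAX_SANE_PATHLENGTH) != (size_t)-1
            && limited_strlen(ext,   MAX_SANE_PATHLENGTH) != (size_t)-1;
    }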
Manfred <noname@add.invalid>: Nov 06 08:53PM +0100

On 11/6/2019 7:55 PM, Öö Tiib wrote:

> if (max_size() - size() < size_to_add)
>     throw std::length_error(text_to_throw);
> Why don't you look into any of the implementations in your computer?

My reply was about the comment "No need to check for any overflows." -
obviously if you do C++ programming you need to know how to check for
integer overflow (and we all know how to use '+' with std::string, but
we may also need to calculate e.g. the length of a std::vector to
resize, and check for overflow in the process).

The fragment you posted obviously works well with no need for unsigned
wrapping (I didn't say you can't do without, I said it's more efficient
and less verbose if you have it), but it also has different
requirements than the example I posted: two strings only, with the
precondition that the 'this' string is less than max_size() long.

Just to stick with this example (although my point was about integer
overflow rather than string concatenation), you see that if you lift
these preconditions and need to concat more than 2 strings (as in the
example), then either you precalculate the total length of the final
string with a series of additions (and you see that the example is more
efficient, clearer and less verbose than a series of if/else), or you
do string+string+string and you end up with multiple allocations and
copies (and yes, as often as this is totally irrelevant, sometimes this
does matter). This is to say that if you want to handle integer
overflow efficiently, then unsigned wrapping /is/ useful; it is not a
"major design mistake" - which is what triggered all this (for me at
least).
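To make the contrast concrete, a sketch of the "series of additions
with if/else" variant Manfred refers to - every addition guarded by an
explicit precondition, never relying on wrap-around (the function is
illustrative, not from the thread):

    #include <string>
    #include <stdexcept>
    #include <limits>

    // Precalculate the final length of three concatenated strings with
    // a precondition test before each addition, in the same style as
    // the std::string::append fragment quoted above.
    std::size_t total_length(const std::string& a, const std::string& b,
                             const std::string& c)
    {
        constexpr std::size_t mx = std::numeric_limits<std::size_t>::max();
        std::size_t t = a.length();
        if (mx - t < b.length()) throw std::length_error("too long");
        t += b.length();
        if (mx - t < c.length()) throw std::length_error("too long");
        return t + c.length();
    }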
Paavo Helde <myfirstname@osa.pri.ee>: Nov 06 10:12PM +0200

On 6.11.2019 19:15, Manfred wrote:

>> }
>> No need to check for any overflows.
> How do you think that overflow check is done inside std::string?

I'm pretty confident it's done properly. Note that std::string would
even be allowed to use implementation-specific features which are UB by
the standard, should it be necessary.

> {
> return 0;
> }

Yes, that's what I meant by "more explicit". And there is
std::numeric_limits<T>::max(), no need for 0xFFFFFFFF magic numbers.

> Or you can cast to a wider type, but then you are not solving the
> problem, you are only moving it forward, and still I wouldn't see the
> benefit.

To be honest, the check I would write would be something along the
lines of:

    if (strlen(filename) > 10000) {
        throw std::runtime_error("You must be kidding");
    }

and it would happen way before any make_pathname() function.

> Besides, this is about /integer/ overflow check, so it could apply to
> more computations other than memory size.

No, the variables in your example were size_t.
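A sketch of the numeric_limits point - the same explicit overflow test
written once with a width-specific magic number and once generically
(both functions are illustrative):

    #include <cstdint>
    #include <limits>

    // With a magic number: silently wrong if the type ever changes.
    bool would_overflow_magic(std::uint32_t a, std::uint32_t b)
    {
        return 0xFFFFFFFFu - a < b;
    }

    // With numeric_limits: adapts to whatever unsigned type T is.
    template <typename T>
    bool would_overflow(T a, T b)
    {
        return std::numeric_limits<T>::max() - a < b;
    }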
Paavo Helde <myfirstname@osa.pri.ee>: Nov 06 10:25PM +0200

On 6.11.2019 21:53, Manfred wrote:

> do string+string+string and you end up with multiple allocations and
> copies (and yes, as often as this is totally irrelevant, sometimes
> this does matter).

If the profiler shows the multiple allocations in this function are
creating a bottleneck, you can avoid this easily:

    std::string make_pathname(const std::string& dir,
                              const std::string& fname,
                              const std::string& ext)
    {
        std::string result;
        result.reserve(dir.length() + fname.length() + ext.length() + 2);
        result += dir;
        result += '/';
        result += fname;
        result += '.';
        result += ext;
        return result;
    }

There is still no need to check for overflow. Note that if there is an
overflow, reserve() will reserve less memory than needed, but that's OK
because the later += operations will check for it anyway.

And if a future C++ implementation makes unsigned wrapping undefined
and the above addition overflows, triggers an assertion and terminates
the program, then this would be a better result than trying to compose
a filename which won't fit in memory.
David Brown <david.brown@hesbynett.no>: Nov 06 09:50PM +0100

On 06/11/2019 14:26, Öö Tiib wrote:

> precisely. The major purpose is to get errors when the program is
> actually storing a value into a type where it does not fit (IOW really
> overflows).

It is possible to consider intermediary results as "infinite
precision", which would give you that optimisation. But how would you
treat:

    x = x + 1;
    x = x - 1;

Would that cause a trap if x started at 0x7fffffff (assuming 32-bit
int)? What if "x" is held in a register rather than stored in memory?
There are lots of possible inconsistencies here.

(If it is any guide, gcc with "-ftrapv" optimises the two statements
into "x". With "-fsanitize=signed-integer-overflow", the single
expression "x + 1 - 1" is simplified to "x", but the two statements
lead to checks.)

Actually, it is worth looking at some gcc (or clang) generated code
with -ftrapv and -fsanitize=signed-integer-overflow, perhaps with
<https://godbolt.org>, to see just how costly trapping semantics can
be. They are very far from negligible.

> these have different potential overflows and/or losses of accuracy.
> Until something helps to reduce that issue it is all about exactly
> that formula and period.

Yes. It is up to the programmer to ensure the expression is written in
a way that does not overflow or lose accuracy (at least, not more than
you are willing to lose). It is up to the compiler to pick the most
efficient way to get the results requested from that expression,
according to the rules of the language (and compiler flags) that
determine the semantics of the expression.

> nonsensically unimportant size. Somehow in C++ I have learned to
> separate different concerns and to abstract details away but not in
> Python.

That's fine. Different languages have their advantages and
disadvantages - and programmer preference is very subjective.

> to define but it is business as usual and not some show-stopper issue.
> In some languages (like Swift) it is done and it seems to work fine.

Ah, okay. I think there is a big risk of confusion here, and for people
to forget which operators do what. (There is also the risk of
accidentally writing smilies in your code!) I am happier with the
type-based solution, and simply refusing to compile combinations of
types that don't make sense (like adding a wrapping type and a
saturating type). Undefined or unspecified behaviour types can be
silently promoted to other, more defined behaviour as needed. That also
has the advantage that it can be done today, with current C++.
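For anyone who wants to reproduce the comparison on godbolt.org, a
minimal test case; the flags are the ones named above, the function
names are illustrative:

    /* Compile with: gcc -O2 -ftrapv
       or:           gcc -O2 -fsanitize=signed-integer-overflow
       and compare the code generated for the two functions. */

    int combined(int x)
    {
        return x + 1 - 1;    /* single expression: may fold to just x */
    }

    int separate(int x)
    {
        x = x + 1;           /* each statement is a potential trap point */
        x = x - 1;
        return x;
    }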
David Brown <david.brown@hesbynett.no>: Nov 06 10:20PM +0100

On 06/11/2019 20:53, Manfred wrote:

> This is to say that if you want to handle integer overflow
> efficiently, then unsigned wrapping /is/ useful, it is not a "major
> design mistake" - which is what triggered all this (for me at least).

I partly agree on this (despite disagreeing about the example you
gave). Unsigned wrapping is one way to handle integer overflow
efficiently. It is not the only way, and not necessarily the best way,
but it can sometimes be useful.

One thing to note is that when the checks are handled within the
standard library (such as for C++ string concatenation), the
implementation does not need to stick to portable, standard C++. A gcc
standard library could use '#pragma GCC optimize("fwrapv")' and
wrapping signed integer arithmetic, if it preferred. Or it could use
__int128 types, or the overflow __builtins - whatever gives the best
results.

But if you are writing portable, standard C or C++ and need to check
for overflows, unsigned wrapping is sometimes the best solution.

However, I would say that the main reason for having unsigned integers
wrap is simply that sometimes you want wrapping behaviour - overflow
detection is only one of a number of possible applications for
wrapping.
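A sketch of the overflow __builtins mentioned here; __builtin_add_overflow
is a real gcc/clang builtin, while the helper wrapped around it is
illustrative:

    #include <cstddef>

    // __builtin_add_overflow(a, b, &res) stores a + b into res and
    // returns true iff the mathematical sum did not fit - no UB and
    // no wrap-around tricks required.
    bool sum3(std::size_t a, std::size_t b, std::size_t c, std::size_t* out)
    {
        std::size_t t;
        if (__builtin_add_overflow(a, b, &t)) return false;
        if (__builtin_add_overflow(t, c, &t)) return false;
        *out = t;
        return true;
    }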
Manfred <noname@add.invalid>: Nov 06 10:54PM +0100

On 11/6/2019 8:36 PM, David Brown wrote:

>> You can say they wrote silly code for this example (I don't), but I
>> doubt there are many more knowledgeable people about this kind of
>> matter than them.

[...]

> I don't know the circumstances for this function, so I can't answer
> that. But if it is an externally accessible function, I think checks
> for null pointers would make sense.

I think there is some misunderstanding about what we mean by "external"
here, and below. This is an example of application code, in the context
of code safety, that may be written as part of the final program or as
part of a library. Either way, the pointers are under the control of
the programmer who writes the program, and he does not want to pass
null pointers here. Yes, this may happen if there is a mistake
somewhere in the program, but this is not what code safety is about; it
is about code correctness, which is a somewhat different thing
(although it is true that incorrect code is almost certainly unsafe
code as well). It is for this reason that standard library functions do
not check for null on all pointers passed in, and I am sure you don't
check for null within every function you write.

Code safety as I understand it is about ensuring the program works as
expected with every kind of data input that is supplied to the
/program/ - like through an I/O stream (kernel safety is a different
matter of course). In other words, when you read from std::cin, a file
or a socket, you allocate a buffer and you know that its pointer is
valid. When you reach EOF you also know that the buffer is 0-terminated
- you place a 0 yourself if the runtime does not do this for you. What
you don't know (you can't control it) is whether the /contents/ between
the pointer address and the terminating 0 have a valid format, whether
they are too large or whatever, and that's what you need to ensure your
program is robust against in order to be safe.

> The code is going out of its way to check something that is almost
> impossible to occur, even with malicious intentions - while ignoring
> problems that could easily occur even by accident.

See above.

> if (!p) return (size_t) -1; // Indicate bad string
> return p - s;
> }

Incidentally, this takes a (pointer, length) pair, which is not a C
string. But that's irrelevant; I explained above that the presence of
the terminating 0 can and needs to be ensured at the calling level.

> The checks don't do that. If the three parameters pointed to strings
> adding up to 4 GB - 1 characters in total, by what stretch of the
> imagination is that "in a reasonable limit"?

Define reasonable. The checks ensure that the code will concat
correctly whatever length can be /safely/ handled by the system, and
handle excessive lengths correctly as well, thus covering the entire
input domain. That's enough for safety, with no need to introduce any
dependency on an additional and arbitrary parameter.

>> end of the buffer.
> Failing to 0 terminate a character sequence that is intended to be a
> string is not an uncommon bug.

See above.

>>> You only need to check for overflow if it is possible for the
>>> calculations to overflow. If the operands are too small to cause an
>>> overflow, there will not be an overflow.

Yes, keep "if the operands are too small" in mind, and see below.

[...]

>> You realize that this code is less efficient than the original one,
>> don't you?
> You realise that it is /more/ efficient on any 64-bit system?
Watch out: you wrote this example for 32-bit, but on 64-bit it can fail
for large enough inputs:

// the following /can/ now silently overflow
>>> uint_least64_t pathlen = dirlen + filelen + extlen;
>>> if (big_size_t > MAX_SANE_PATHLENGTH) return 0;

That is, on 64-bit it becomes unsafe.

> And that this code does something /useful/ by checking for a sensible
> maximum pathlength? And that any few instruction cycles lost here are
> negligible compared to the hundreds that are needed for copying even a
> small path string?

As I wrote elsethread, if you are on 32-bit and cast to 64-bit, you
don't really solve the problem, you move it ahead - in other words, you
can't think of handling integer overflow by assuming you always have a
wider type available. Remember, the topic here is about unsigned
wrapping, i.e. integer overflow.

However, without avoiding your point, your alternative works under some
conditions, but it is more wasteful of resources (and as a programmer
involved with microcontrollers and embedded software you are sensitive
to this matter). I don't see the benefit of an arbitrary
MAX_SANE_PATHLENGTH - in fact I'd rather not bother with any, and I
don't see how it would be any better than the original.
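To illustrate the failure mode Manfred points at: on an LP64 system
uint_least64_t is typically the same width as size_t, so the "widening"
no longer guards the sum. The values below are deliberately
pathological - but guarding against pathological input is the whole
point of the check (a sketch, not code from the thread):

    #include <cstdint>
    #include <cstddef>

    void demo()
    {
        std::size_t dirlen  = SIZE_MAX - 1;   // pathological by design
        std::size_t filelen = 4;
        std::size_t extlen  = 0;

        // With a 64-bit size_t this wraps to 2, so a subsequent
        // comparison against MAX_SANE_PATHLENGTH would wrongly pass.
        std::uint_least64_t pathlen =
            static_cast<std::uint_least64_t>(dirlen) + filelen + extlen;
        (void)pathlen;
    }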
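And returning to the point earlier in this post about ensuring
0-termination at the calling level, a minimal sketch of that contract -
the reader owns the buffer and places the terminating 0 itself, so
callees may assume a valid C string while still treating the contents
as untrusted (names and buffer handling are assumptions):

    #include <cstdio>
    #include <cstdlib>
    #include <cstddef>

    // Read up to cap - 1 bytes and guarantee a terminating 0. The
    // pointer and the 0 are under our control; the contents between
    // them remain untrusted input.
    char* read_all(std::FILE* in, std::size_t cap)
    {
        char* buf = static_cast<char*>(std::malloc(cap));
        if (!buf) return nullptr;

        std::size_t n = std::fread(buf, 1, cap - 1, in);
        buf[n] = '\0';   // place the 0 ourselves
        return buf;
    }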
Manfred <noname@add.invalid>: Nov 06 11:05PM +0100

On 11/6/2019 9:25 PM, Paavo Helde wrote:

> Note that if there is an overflow, reserve() will reserve less memory
> than needed, but that's OK because the later += operations will check
> for it anyway.

Note that here you are still assuming wrapping.

> And if a future C++ implementation makes unsigned wrapping undefined

Undefined? Then I believe it wouldn't work.

> and the above addition overflows, triggers an assertion and terminates
> the program, then this would be a better result than trying to compose
> a filename which won't fit in memory.

And here you are still assuming defined behavior (implying that every
single integer operation is checked at runtime). Moreover, program
termination is /not/ safe coding - you know what a DoS attack is.
Marcel Mueller <news.5.maazl@spamgourmet.org>: Nov 06 09:00PM +0100

On 06.11.19 at 11:43, Bonita Montero wrote:

> I just found that alloca() with MSVC isn't as fast as it could be.
> alloca() calls a function called __chkstk which touches the pages down
> the stack to trigger Windows' overcommitting of stacks,

Guard pages are not a Windows-only feature.

> Windows is only able to allocate new pages to a stack when the pages
> are touched down the stack, i.e. you'll get an exception if you skip a
> page.

Exactly. This prevents accidental out-of-bounds access. But there
should be functions that commit more than one stack page at once. (I
don't remember exactly on WinXX.)

On the other side, it is bad practice to allocate large data structures
on the stack. The maximum stack size is always limited and cannot be
enlarged once a thread has been started, because the virtual address
space of the stack must be contiguous.

> The allocation will have more overhead than an alloca(), but my idea
> is that if there are a larger number of entries the processing time on
> the entries will outweigh the allocation.

So basically you wrote a collection with limited inline storage.

> I'm asking myself if there's a class in boost or another well-known
> classlib that implements the same pattern.

Not that I know of. But some string implementations use this pattern.
Keyword: "short string optimization".

Marcel
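A minimal sketch of the pattern being named here - limited inline
storage with a heap fallback, in the spirit of the short string
optimization. The class, its name and the inline capacity are all
illustrative; boost::container::small_vector is a library version of
the same idea:

    #include <cstddef>
    #include <memory>

    // Fixed inline storage for small sizes, one heap allocation for
    // large ones - no allocation at all in the common small case.
    template <typename T, std::size_t InlineN = 64>
    class small_buffer {
    public:
        explicit small_buffer(std::size_t n)
            : size_(n),
              heap_(n > InlineN ? std::make_unique<T[]>(n) : nullptr) {}

        T* data() { return heap_ ? heap_.get() : inline_; }
        std::size_t size() const { return size_; }

    private:
        std::size_t size_;
        std::unique_ptr<T[]> heap_;   // engaged only when n > InlineN
        T inline_[InlineN];           // scratch space for small n
    };

    // Usage sketch:
    //   small_buffer<int> buf(n);    // no heap allocation if n <= 64
    //   process(buf.data(), buf.size());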