- What's the correct fmod()-algorithm - 25 Updates
| James Kuyper <jameskuyper@alumni.caltech.edu>: Apr 05 10:44PM -0400 On 4/5/21 12:06 AM, Chris M. Thomasson wrote: ... > I do not think that standard C/C++ gives any hardcore guarantees on the > level of precision it's going to give for fmod. You should read existing > code in compilers. C: "The accuracy of the floating-point operations (+, -, *, /) and of the library functions in <math.h> and <complex.h> that return floating-point results is implementation-defined, as is the accuracy of the conversion between floating-point internal representations and string representations performed by the library functions in <stdio.h>, <stdlib.h>, and <wchar.h>. The implementation may state that the accuracy is unknown." (5.2.4.2.2p6) This means that, should an implementation's documentation of the accuracy of those operations so specify, the subtractions in the expression LDBL_MAX - LDBL_MIN < LDBL_MIN - LDBL_MAX could be evaluated with sufficient inaccuracy to give a result of "true". Despite that fact, C is not in fact useless for portable floating point math. "The following macro names are conditionally defined by the implementation: ... __STDC_IEC_559__ The integer constant 1, intended to indicate conformance to the specifications in annex F (IEC 60559 floating-point arithmetic)." (6.10.8.3p1) While annex F never directly addresses the accuracy requirements, it does say: "An implementation that defines __STDC_IEC_559__ shall conform to the specifications in this annex. 356) Where a binding between the C language and IEC 60559 is indicated, the IEC 60559-specified behavior is adopted by reference, unless stated otherwise." (F.1p1) IEC 60559 (== IEC/IEEE 754) specifies the behavior of those operations far more tightly than the C standard itself does. In fact, its accuracy requirements are pretty much the tightest that it's feasible to mandate. If you're pushing the limits of floating point accuracy, make sure to check whether __STDC_IEC_559__ is defined. 
C++: "Note: This document imposes no requirements on the accuracy of floating-point operations; see also 17.3. — end note" (6.8.1p12). Re: std::numeric_limits "static constexpr bool is_iec559; true if and only if the type adheres to ISO/IEC/IEEE 60559." (17.3.5.1p57) So the C++ standard says pretty much the same thing as the C standard, though more succinctly and in less detail, and it doesn't require documentation of the accuracy. If you're pushing the limits of floating point accuracy in C++, you should always check the value of is_iec559. |
| James Kuyper <jameskuyper@alumni.caltech.edu>: Apr 05 10:44PM -0400 On 4/5/21 7:34 AM, Alf P. Steinbach wrote: >>> if (std::signbit(result)) result += y; >>> return std::copysign(result, x); >>> } That is essentially the same as the code from the C standard, F.10.7.1p3, except for the addition of "std::". As such, it only applies when __STDC_IEC_559__ is predefined by the implementation. >> Sorry, that doesn't represent the difficulty in implementing fmod(). > Maybe not, but as I understand it that's the spec. Note that (as indicated by my response on another thread), unless __STDC_IEC_559__ is predefined by the implementation, the accuracy of fmod() can be arbitrarily bad, so long as the implementation documents that it is that bad. However, since remainder()'s behavior is defined by a cross-reference to IEC 60559, its accuracy is required to match the much tighter specifications of that standard, whether or not __STDC_IEC_559__ is predefined. The C++ standard says nothing of its own about this issue, merely cross-referencing the C standard. |
| Juha Nieminen <nospam@thanks.invalid>: Apr 06 05:21AM > Sorry, you're naive; I don't understand why you feel the need to do this. Are you really such a petty person that you can't just respond normally to people, and instead you just have to belittle people? |
| Juha Nieminen <nospam@thanks.invalid>: Apr 06 05:30AM > This won't work for: > x = 5290912749423342.0 and y = 10.0 A double-precision floating-point number has a 53-bit mantissa, which is enough to represent accurately 15 significant decimal digits. If you try a value with 16 significant decimal digits, the last digit will not be accurate (because there aren't enough bits to represent it completely). This means that a double cannot store 5290912749423342.0 accurately. The last digit will be wrong. |
| Bonita Montero <Bonita.Montero@gmail.com>: Apr 06 08:50AM +0200 > to represent accurately 15 significant decimal digits. If you try a value > with 16 significant decimal digits, the last digit will not be accurate > (because there aren't enough bits to represent completely). That was clear from the start and there is no need to emphasize it again. I just asked for a solution. |
| MrSpook_jzAsla44@u3_eza.net: Apr 06 08:05AM On Tue, 6 Apr 2021 08:50:36 +0200 >> (because there aren't enough bits to represent completely). >That was clear from the start and there is no need >to emphasize it again. I just asked for a solution. If you don't like fmod() then implement it yourself using some big number library. The maths is trivial: ie: printf("%f = %f\n",fmod(f1,f2),f1 - f2 * (int)(f1 / f2)); |
| Bonita Montero <Bonita.Montero@gmail.com>: Apr 06 10:21AM +0200 > If you don't like fmod() then implement it yourself using some big number > library. ... That would be too slow. The solution given in the glibc source isn't fast either, but it's surely an order of magnitude faster than using a bignum lib. |
| MrSpook_k_2jtdp@59hx8krs2bo3s8a.com: Apr 06 08:59AM On Tue, 6 Apr 2021 10:21:56 +0200 >That would be too slow. >The solution given from the glibc-source isn't fast as well, >but it's for sure a magnitude faster than using a bignum-lib. You can have speed or you can have unlimited floating point accuracy, pick one. Fast floating point is limited by the abilities of the processor's FPU. |
| Ian Collins <ian-news@hotmail.com>: Apr 06 09:14PM +1200 On 05/04/2021 10:38, Ian Collins wrote: >> the converts to int64 then I get the same result as you. > https://code.woboq.org/userspace/glibc/sysdeps/ieee754/dbl-64/s_trunc.c.html > It's usually implemented with a builtin (__builtin_trunc) in gcc. For

double f( double d ) { return trunc(d); }

gcc generates

f:
.LFB0:
        .cfi_startproc
        endbr64
        movsd   .LC1(%rip), %xmm2
        movsd   .LC0(%rip), %xmm4
        movapd  %xmm0, %xmm3
        movapd  %xmm0, %xmm1
        andpd   %xmm2, %xmm3
        ucomisd %xmm3, %xmm4
        jbe     .L2
        cvttsd2siq      %xmm0, %rax
        pxor    %xmm0, %xmm0
        andnpd  %xmm1, %xmm2
        cvtsi2sdq       %rax, %xmm0
        orpd    %xmm2, %xmm0
.L2:
        ret

-- Ian. |
| Bonita Montero <Bonita.Montero@gmail.com>: Apr 06 12:37PM +0200 >> The solution given from the glibc-source isn't fast as well, >> but it's for sure a magnitude faster than using a bignum-lib. > You can have speed or you can have unlimited floating point accuracy, pick one. No one needs unlimited floating-point accuracy. Bignum libraries are only suitable for cryptographic purposes, and these don't need floating-point but simple integer arithmetic. |
| Bonita Montero <Bonita.Montero@gmail.com>: Apr 06 12:42PM +0200 > orpd %xmm2, %xmm0 > .L2: > ret If I may translate this: the code compares the magnitude of the value against a threshold (2^52). Values at least that large have an integral-only mantissa, so they are returned unchanged. Smaller values fit in a 64-bit integer; they are converted to one, which chops off the fraction, and the truncated value is loaded back as a floating-point value. |
| MrSpook_v310_69@j1w2mu8nr6agvg.biz: Apr 06 10:45AM On Tue, 6 Apr 2021 12:37:34 +0200 >> You can have speed or you can have unlimited floating point accuracy, pick >one. >No one needs unlimtited floating-point accuracy. I meant in the sense that you can dictate how much accuracy you need. >Bignum-libaries are only suitable for cryptographic purposes, >and these don't need floating-point- but simple integer-arithmetics. That's all well and good, but if the accuracy you want exceeds the maximum mantissa length the CPU can provide then you're going to need some sort of software algorithm to do your calculations, even if it's invisibly built in to one of the standard libraries. |
| Juha Nieminen <nospam@thanks.invalid>: Apr 06 10:57AM >> (because there aren't enough bits to represent completely). > That was clear from the start and there is no need > to emphasize it again. I just asked for a solution. If the last digit of your number isn't being stored accurately in the variable, then there is no solution. The original digit has been lost. You can't perform an operation on something that doesn't exist. If you want to use digits of that magnitude, try long double instead. |
| Bonita Montero <Bonita.Montero@gmail.com>: Apr 06 01:10PM +0200 > Thats all well and good, but if the accuracy you want ... Where did I write that I have these precision concerns? I simply talked about a solution for valid values. |
| Bonita Montero <Bonita.Montero@gmail.com>: Apr 06 01:12PM +0200 > If the last digit if your number isn't being stored accurately in the > variable, then there is no solution. The original digit has been lost. > You can't perform an operation on something that doesn't exist. Look at the glibc code I already posted. It is slow, but it gives sufficiently accurate results with double precision. > If you want to use digits of that magnitude, try long double instead. long double will have the same issue for values where x and y have a larger distance between their exponents. |
| Richard Damon <Richard@Damon-Family.org>: Apr 06 07:45AM -0400 On 4/6/21 1:30 AM, Juha Nieminen wrote: > (because there aren't enough bits to represent completely). > This means that a double cannot store 5290912749423342.0 accurately. > The last digit will be wrong. Actually, if I am doing my math right, 53 bits can represent integers up to 9,007,199,254,740,991 (2**53 - 1) exactly with the units bit expressed, so that number is expressible exactly. 50 bits is about 15 digits (10 bits being about 3 digits), so the 3 extra bits give us most of an extra digit. The issue is that after the divide, the result can't be expressed precisely enough and the round-off errors cause the issues. If you think of floating point numbers as representing ranges, then 5290912749423342.0 represents the interval 5290912749423341.5 to 5290912749423342.5, and thus the answer to fmod should be the interval 1.5 to 2.5 (but a float can't express that level of imprecision itself). If you think of them as precise numbers that are an approximation to reality, then fmod would be precisely 2.000, but you might get wrong values from the math due to round off. In this case the formula x/y - trunc(x/y) gave 0.1875 instead of 0.2, because 0.1875 is 3/16 and x/y will be expressed to the nearest 1/16th due to precision round off. |
| Bonita Montero <Bonita.Montero@gmail.com>: Apr 06 02:29PM +0200 > In this case the formula x/y - trunc(x/y) gave 0.1875 instead of 0.2 > because is 3/16 and x/y will be expressed to the nearest 1/16th due to > precision round off. I'm not concerned about imprecise results, but about invalid results with the naive implementations x - trunc( x / y ) * y and x / y * trunc( x / y ) If x is very high and y is very small, you might get results which are outside the modulo range. |
| Bonita Montero <Bonita.Montero@gmail.com>: Apr 06 02:49PM +0200 Because this algorithm is so similar to the usual FP divide algorithms in FPUs, I'm asking myself why the latter don't integrate a path for fmod() operations. It wouldn't be a big deal to integrate an option for fmod() into these circuits for a particular FP instruction. |
| Richard Damon <Richard@Damon-Family.org>: Apr 06 09:01AM -0400 On 4/6/21 8:29 AM, Bonita Montero wrote: > x / y * trunc( x / y ) > If x is very high and y is very small, you might get results wich > are outside the modulo-range. Floating point should ALWAYS be treated as 'imprecise', and sometimes imprecision can create errors big enough to cause other issues. You need to remember that when you compute x / y, that value will have a precision; taking the trunc implies an assumption that the precision is well better than counting units, which it might not be (and isn't in your cases). The answer to this is to add a bit of complexity to the method, to make sure that you adjust the numbers to bring things into a better range accurately. In general, this is not simple, especially if both x and y have lots of significant bits and are of very different magnitudes. Note that if the LSB of x is bigger than the MSB of y, then the question actually doesn't really make sense; you are getting close to that point, and that is why the system is very sensitive to errors. |
| Bonita Montero <Bonita.Montero@gmail.com>: Apr 06 03:14PM +0200 > Floating Point should ALWAYS be treated is 'imprecise', ... You're telling things which are common knowledge. |
| Richard Damon <Richard@Damon-Family.org>: Apr 06 10:06AM -0400 On 4/6/21 9:14 AM, Bonita Montero wrote: >> Floating Point should ALWAYS be treated is 'imprecise', ... > You're telling things which are common knowledge. The problem is that the cases you're complaining about are the cases where that imprecision becomes large compared to the numbers you are dealing with. trunc(z) when z is on the order of 2**53 has no precision left to express a fraction. x/y will have a rounding error in it. If it rounds up and that round-up causes a 'bump' in trunc, then trunc might have amplified the error to a full integer count. Take a number just a count or two below 1000 and divide it by 10: you will get 100, but if done to infinite precision it would be 99.99999.. and trunc would round down. Thus trunc(x/y) always has a chance, if the division is close enough to an integer value, of rounding up and causing that range error. All the problems you are complaining about are ultimately rooted in the fact that floating point math is imprecise, and it is hard, and slow, to do the math so you get the absolute best answer you can. |
| James Kuyper <jameskuyper@alumni.caltech.edu>: Apr 06 10:11AM -0400 > If you don't like fmod() then implement it yourself using some big number > library. The maths is trivial: > ie: printf("%f = %f\n",fmod(f1,f2),f1 - f2 * (int)(f1 / f2)); You don't need bignums to do that. There may be more clever ways to do it, but as proof of principle, an algorithm has already been sketched out in this thread. It depends upon the facts that multiplication by powers of FLT_RADIX can be done exactly, that finding the difference between two numbers of the same sign and the same exponent is also exact, and that fmod(difference, f2) is equal to either fmod(f1,f2) or f2-fmod(f1,f2), depending upon what you did to give them the same sign. fmod() is required to give the exact result #ifdef __STDC_IEC_559__ |
| Bonita Montero <Bonita.Montero@gmail.com>: Apr 06 04:19PM +0200 > The problem is that the cases your complaining about are the cases where > that impression becomes large compared to the numbers you are dealing with. I'm not complaining about that, I'm just saying that the problems resulting from inaccuracy aren't as obvious as in many other cases. Rest of your text not read. |
| Bart <bc@freeuk.com>: Apr 06 03:34PM +0100 On 06/04/2021 15:19, Bonita Montero wrote: > I'm not complaining about that, I'm just telling that the problems > resulting from inacurrarcy aren't as obvious as in many other cases. > Rest of your text not read. Your OP gave the impression that you were looking for the correct formula for fmod, and gave two variations. The correct one was the second (x-y*trunc(x/y)); the first gives the wrong results. You suggested that neither was correct because they gave the wrong results on certain inputs. That is due to the limitations of 64-bit floats. The actual implementation of fmod may use a /better/ method which preserves more, which is not to say that the trunc approach is wrong. fmod will start giving invalid values too if you add a few more significant digits. I've used x-y*trunc(x/y) in a bigfloat library and it will always give the correct results (subject to whatever precision cap you apply), so it is not wrong. But it is just not tuned to the ieee754 64-bit format. |
| Bonita Montero <Bonita.Montero@gmail.com>: Apr 06 04:42PM +0200 > fmod will start giving invalid values too if you add a few more > significant digits. That's wrong. The results become more inaccurate the larger the exponent differences are, but the modulo results are always in the correct range. |