soft and program: Digest for comp.lang.c++@googlegroups.com

comp.lang.c++@googlegroups.com

Google Groups

What's the correct fmod()-algorithm - 18 Updates

Keith Thompson <Keith.S.Thompson+u@gmail.com>: Apr 04 04:23PM -0700

Bart <bc@freeuk.com> writes:
[...]
> I don't have trunc() on Windows. If I emulate it with a simple version
> the converts to int64 then I get the same result as you.
[...]

How do you not have trunc()? It's been standard C since C99, and
standard C++ since, what, C++11?

#include <math.h>
double trunc(double x);
float truncf(float x);
long double truncl(long double x);

Description
The trunc functions round their argument to the integer value, in
floating format, nearest to but no larger in magnitude than the
argument.

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Philips Healthcare
void Void(void) { Void(); } /* The recursive call of the void */

Richard Damon <Richard@Damon-Family.org>: Apr 04 07:32PM -0400

On 4/4/21 10:32 AM, Bonita Montero wrote:
> ... this doesn't always give correct values, f.e. if x is very large
> without a fraction and y is very samll. So what's the correct alogithm
> for fmod ?

The basic issue is that you need to realize that computing the value of
x / y when done in floating point is going to be, in general inexact,
and that to use this sort of formula, you are going to need to take this
into account.

In particular, if the rounding of x / y cause it to round up to an
integer value, the value of trunc(x/y) will be off, so you need to guard
for that.

One option would be to see if trunc(x / y)*y is bigger than x, and if so
then reduce this value by y to effecitvely decrease the trunc by one (or
even save the trunc value and decrease it by 1). (This assumes that x
and y are positive, if they might be negative you need to handle those
signs properly).

You also need to handle the case where the value of x/y doesn't even
have a 1's digit because the value is too big, in this case you have
really had a total loss of precision, I would suspect the defined answer
in this case is 0, and accept that in this case it just won't match at
all what you would get with x - trunc(x/y)*y since that might require
the value to be bigger than y.

Bart <bc@freeuk.com>: Apr 05 01:00AM +0100

On 05/04/2021 00:23, Keith Thompson wrote:
> The trunc functions round their argument to the integer value, in
> floating format, nearest to but no larger in magnitude than the
> argument.

I forget to check other compilers which do have it.

But it's not part of msvcrt.dll which my bcc relies on.

It doesn't seem to be defined on top of anything else either. But there
/is/ 'floor' which I'd forgotten about.

Keith Thompson <Keith.S.Thompson+u@gmail.com>: Apr 04 07:04PM -0700

>> argument.

> I forget to check other compilers which do have it.

> But it's not part of msvcrt.dll which my bcc relies on.

Wikipedia says that's the C standard library for the Visual C++ (MSVC)
compiler from version 4.2 to 6.0. I have that file on my Windows 10
laptop, and it appears that it doesn't provide the trunc() function --
which means that the compiler has to find it somewhere else.

Are you using an old version of Microsoft's C compiler? Or some other
compiler that doesn't know where to look for library functions that
aren't in msvcrt.dll? I'm able to use trunc() with Visual Studio 2019.

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Philips Healthcare
void Void(void) { Void(); } /* The recursive call of the void */

Kaz Kylheku <563-365-8930@kylheku.com>: Apr 05 03:06AM

["Followup-To:" header set to comp.lang.c.]
> ... this doesn't always give correct values, f.e. if x is very large
> without a fraction and y is very samll. So what's the correct alogithm
> for fmod ?

While mathematically, fmod(x, y) is always defined for real x, y, y /=
0, it simply might not make sense in floating point for
fabs(x) >> fabs(y).

Think about it.

Suppose that x is so larger that the distance between x and x + epsilon
(x and the next consecutive floating-point value) is larger than y.

The intuitition behind fmod(x, y) is that if we repeat y enough times
(start with 0 and add y), we will get to a value that is just below x.
The remaining gap to x is then fmod(x, y). But that point "just below x"
is not necessarily representable in that range where consecutive
floats are farther apart than y!

Nevertheless, the mathematical value fabs(y) exists, and is some real
number in the [0, y) range, which may have an excellent floating-point
approximation, regardless of the above.

Take the next part with a grain of salt; I am no numerical analyst.

I'm thinking that it could be fruitful to scale the exponent (by powers
of two) of y, to get it into the "ballpark" as x.

So that is, suppose we multiply y by pow(2, k), just so that
y * pow(2, k) <= x, but y * pow(2, k + 1) > x. I.e. k is the highest
exponent such that y * pow(2, k) does not exceed x.

Multiplications by powers of two preserve the mantissa exactly: they are
precise.

Having procured this k we, could subtract these two terms to obtain z:

z = x - y*pow(2, k)

Note that all we have done is subtract a multiple of y from x. The
result has the same fmod against y, namely:

fmod(z, y) = fmod(x, y)

we can calculate fmod(z, y) instead. Even though x >> y, it is not
the case that z >> y.

If we can work with floats at the bit level, the calculation for determining k
should be fast. Basically we just take the exponent field from x, and plant it
into y, tweaking it by at most one to produce a value that does not exceed x,
by comparing mantissas.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Apr 04 09:06PM -0700

On 4/4/2021 7:32 AM, Bonita Montero wrote:
> ... this doesn't always give correct values, f.e. if x is very large
> without a fraction and y is very samll. So what's the correct alogithm
> for fmod ?

I do not think that standard C/C++ gives any hardcore guarantees on the
level of precision its going to give a fmod. You should reap existing
code in compilers.

antispam@math.uni.wroc.pl: Apr 05 04:46AM

> x/y =5290912749423342.00000000000000000000
> trunc(x/y) =5290912749423342.00000000000000000000
> (x/y)*y =52909127494233424.00000000000000000000
Actually (x/y)*y does not equal 52909127494233424, and that is
core problem in this example. Namely, decimal 52909127494233424
is not representable exactly in floating point.

> x/y (y is 10.0) should be 5290912749423341.6, but it gives ...3342.0.

This does not matter much in this example: 5290912749423342 is the
closest integer to quitient. Look:

52909127494233416 - 5290912749423342*10

(33) - 4

The result is -4 and it is enough to add 10 to get exact result.

--
Waldek Hebisch

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Apr 04 11:39PM -0700

On 4/4/2021 7:32 AM, Bonita Montero wrote:
> ... this doesn't always give correct values, f.e. if x is very large
> without a fraction and y is very samll. So what's the correct alogithm
> for fmod ?

The fight between a programmer and a floating point number:

https://youtu.be/TJRaLnLDMWg

Bonita Montero <Bonita.Montero@gmail.com>: Apr 05 08:45AM +0200

> I do not think that standard C/C++ gives any hardcore guarantees on the
> level of precision its going to give a fmod. You should reap existing
> code in compilers.

fmod works correctly in current implementations
and gives results >= 0 and < y;

Bonita Montero <Bonita.Montero@gmail.com>: Apr 05 08:46AM +0200

> if (std::signbit(result)) result += y;
> return std::copysign(result, x);
> }
Sorry, that doesn't represent the difficulty in implenting fmod().

Bonita Montero <Bonita.Montero@gmail.com>: Apr 05 08:49AM +0200

That's a correct trunc():

double trunc( double d )
{
static_assert(sizeof(double) == sizeof(uint64_t), "sizeof(double)
not equal to sizeof(uint64_t)");
static_assert(numeric_limits<double>::is_iec559, "double must be
IEEE-754");
// assume size_t is our CPU's native register-width
static_assert(sizeof(size_t) == sizeof(uint64_t) || sizeof(size_t)
== sizeof(uint32_t), "register-width must be 32 or 64 bit");
if constexpr( sizeof(size_t) == sizeof(uint64_t) )
// we have 64 bit registers
{
unsigned const MANTISSA_BITS = 52,
EXP_BIAS = 0x3FF,
INF_NAN_BASE = 0x7FF;
uint64_t const EXP_MASK = (uint64_t)0x7FF
<< MANTISSA_BITS,
SIGN_MASK = (uint64_t)0x800
<< MANTISSA_BITS ,
MANTISSA_MASK = 0x000FFFFFFFFFFFFFu,
NAN_MASK = 0x0008000000000000u,
MIN_INTEGRAL_DIGITS_EXP = (uint64_t) EXP_BIAS
<< MANTISSA_BITS,
MIN_INTEGRAL_ONLY_EXP = (uint64_t)(EXP_BIAS +
MANTISSA_BITS) << MANTISSA_BITS,
INF_NAN_EXP = (uint64_t)INF_NAN_BASE
<< MANTISSA_BITS;
int64_t const MANTISSA_SHIFT_MASK = 0xFFF0000000000000u;
union
{
double du;
uint64_t dx;
};
du = d;
uint64_t exp = dx & EXP_MASK;
if( exp >= MIN_INTEGRAL_DIGITS_EXP )
// value has integral digits
if( exp < MIN_INTEGRAL_ONLY_EXP )
{
// there are fraction-digits to mask out, mask them
unsigned shift = (unsigned)(exp >> MANTISSA_BITS) -
EXP_BIAS;
dx &= MANTISSA_SHIFT_MASK >> shift;
return du;
}
else
if( exp < INF_NAN_EXP )
// value is integral
return du;
else
{
uint64_t mantissa = dx & MANTISSA_MASK;
// infinite, NaN: return value
if( !mantissa || mantissa & NAN_MASK )
return du;
// SNaN: raise exception on SNaN if necessary
feraiseexcept( FE_INVALID );
return du;
}
else
{
// below +/-1.0
// return +/-0.0
dx &= SIGN_MASK;
return du;
}
}
else if constexpr( sizeof(size_t) == sizeof(uint32_t) )
// we have 32 bit registers
{
unsigned const MANTISSA_BITS = 52,
HI_MANTISSA_BITS = 20,
EXP_BIAS = 0x3FF,
INF_NAN_BASE = 0x7FF;
uint32_t const EXP_MASK = (uint32_t)0x7FFu
<< HI_MANTISSA_BITS,
SIGN_MASK = (uint32_t)0x800u
<< HI_MANTISSA_BITS,
HI_MANTISSA_MASK = 0x000FFFFFu,
NAN_MASK = 0x00080000u,
MIN_INTEGRAL_DIGITS_EXP = (uint32_t) EXP_BIAS
<< HI_MANTISSA_BITS,
MAX_INTEGRAL32_EXP = (uint32_t)(EXP_BIAS
+ HI_MANTISSA_BITS) << HI_MANTISSA_BITS,
MIN_INTEGRAL_ONLY_EXP = (uint32_t)(EXP_BIAS
+ MANTISSA_BITS) << HI_MANTISSA_BITS,
INF_NAN_EXP =
(uint32_t)INF_NAN_BASE << HI_MANTISSA_BITS,
NEG_LO_MANTISSA_SHIFT_MASK = 0xFFFFFFFFu;
int32_t const HI_MANTISSA_SHIFT_MASK = 0xFFF00000;
union
{
double du;
struct
{
uint32_t dxLo;
uint32_t dxHi;
};
};
du = d;
uint32_t exp = dxHi & EXP_MASK;
if( exp >= MIN_INTEGRAL_DIGITS_EXP )
// value has integral digits
if( exp < MIN_INTEGRAL_ONLY_EXP )
// there are fraction-digits to mask out
if( exp <= MAX_INTEGRAL32_EXP )
{
// the fraction digits are in the upper dword, mask
them and zero the lower dword
unsigned shift = (unsigned)(exp >>
HI_MANTISSA_BITS) - EXP_BIAS;
dxHi &= HI_MANTISSA_SHIFT_MASK >> shift;
dxLo = 0;
return du;
}
else
{
// the fraction digits are in the lower dword, mask
them
unsigned shift = (unsigned)(exp >>
HI_MANTISSA_BITS) - EXP_BIAS - HI_MANTISSA_BITS;
dxLo &= ~(NEG_LO_MANTISSA_SHIFT_MASK >>
shift);
return du;
}
else
if( exp < INF_NAN_EXP )
// value is integral
return du;
else
{
uint32_t hiMantissa = dxHi & HI_MANTISSA_MASK;
// infinite, NaN: return value
if( !(hiMantissa | dxLo) || hiMantissa & NAN_MASK )
return du;
// SNaN: raise exception on SNaN if necessary
feraiseexcept( FE_INVALID );
return du;
}
else
{
// below +/-1.0
// return +/-0.0
dxHi &= SIGN_MASK;
dxLo = 0;
return du;
}
}
}

Paavo Helde <myfirstname@osa.pri.ee>: Apr 05 10:17AM +0300

05.04.2021 03:00 Bart kirjutas:
>> argument.

> I forget to check other compilers which do have it.

> But it's not part of msvcrt.dll which my bcc relies on.

In my setup, trunc() resides in ucrtbase.dll
(C:\Windows\System32\ucrtbase.dll). UCRT stands for "universal C
run-time" which is an optional component when installing Visual Studio
(I gather "universal" means it does not depend on the MSVC version).

Bonita Montero <Bonita.Montero@gmail.com>: Apr 05 09:49AM +0200

That's the code from glibc:

/* @(#)e_fmod.c 5.1 93/09/24 */
/*
* ====================================================
* Copyright (C) 1993 by Sun Microsystems, Inc. All rights reserved.
*
* Developed at SunPro, a Sun Microsystems, Inc. business.
* Permission to use, copy, modify, and distribute this
* software is freely granted, provided that this notice
* is preserved.
* ====================================================
*/

#if defined(LIBM_SCCS) && !defined(lint)
static char rcsid[] = "$NetBSD: e_fmod.c,v 1.8 1995/05/10 20:45:07 jtc
Exp $";

soft and program

Monday, April 5, 2021

Digest for comp.lang.c++@googlegroups.com - 18 updates in 1 topic

No comments:

Blog Archive

About Me