- What's the correct fmod()-algorithm - 25 Updates
| James Kuyper <jameskuyper@alumni.caltech.edu>: Apr 05 10:44PM -0400 On 4/5/21 12:06 AM, Chris M. Thomasson wrote: ... > I do not think that standard C/C++ gives any hardcore guarantees on the > level of precision it's going to give for fmod. You should read existing > code in compilers. C: "The accuracy of the floating-point operations (+, -, *, /) and of the library functions in <math.h> and <complex.h> that return floating-point results is implementation-defined, as is the accuracy of the conversion between floating-point internal representations and string representations performed by the library functions in <stdio.h>, <stdlib.h>, and <wchar.h>. The implementation may state that the accuracy is unknown." (5.2.4.2.2p6) This means that, should an implementation's documentation of the accuracy of those operations so specify, the subtractions in the expression LDBL_MAX - LDBL_MIN < LDBL_MIN - LDBL_MAX could be evaluated with sufficient inaccuracy to give a result of "true". Despite that fact, C is not in fact useless for portable floating point math. "The following macro names are conditionally defined by the implementation: ... __STDC_IEC_559__ The integer constant 1, intended to indicate conformance to the specifications in annex F (IEC 60559 floating-point arithmetic)." (6.10.8.3p1) While annex F never directly addresses the accuracy requirements, it does say: "An implementation that defines __STDC_IEC_559__ shall conform to the specifications in this annex. 356) Where a binding between the C language and IEC 60559 is indicated, the IEC 60559-specified behavior is adopted by reference, unless stated otherwise." (F.1p1) IEC 60559 (== IEC/IEEE 754) specifies the behavior of those operations far more tightly than the C standard itself does. In fact, its accuracy requirements are pretty much the tightest that it's feasible to mandate. If you're pushing the limits of floating point accuracy, make sure to check whether __STDC_IEC_559__ is defined. 
C++: "Note: This document imposes no requirements on the accuracy of floating-point operations; see also 17.3. — end note" (6.8.1p12). Re: std::numeric_limits "static constexpr bool is_iec559; true if and only if the type adheres to ISO/IEC/IEEE 60559." (17.3.5.1p57) So the C++ standard says pretty much the same thing as the C standard, though more succinctly and in less detail, and it doesn't require documentation of the accuracy. If you're pushing the limits of floating point accuracy in C++, you should always check the value of is_iec559. |
| James Kuyper <jameskuyper@alumni.caltech.edu>: Apr 05 10:44PM -0400 On 4/5/21 7:34 AM, Alf P. Steinbach wrote: >>> if (std::signbit(result)) result += y; >>> return std::copysign(result, x); >>> } That is essentially the same as the code from the C standard, F.10.7.1p3, except for the addition of "std::". As such, it only applies when __STDC_IEC_559__ is predefined by the implementation. >> Sorry, that doesn't represent the difficulty in implementing fmod(). > Maybe not, but as I understand it that's the spec. Note that (as indicated by my response on another thread), unless __STDC_IEC_559__ is predefined by the implementation, the accuracy of fmod() can be arbitrarily bad, so long as the implementation documents that it is that bad. However, since remainder()'s behavior is defined by a cross-reference to IEC 60559, its accuracy is required to match the much tighter specifications of that standard, whether or not __STDC_IEC_559__ is predefined. The C++ standard says nothing of its own about this issue, merely cross-referencing the C standard. |
| Juha Nieminen <nospam@thanks.invalid>: Apr 06 05:21AM > Sorry, you're naive; I don't understand why you feel the need to do this. Are you really such a petty person that you can't just respond normally to people, and instead you just have to belittle people? |
| Juha Nieminen <nospam@thanks.invalid>: Apr 06 05:30AM > This won't work for: > x = 5290912749423342.0 and y = 10.0 A double-precision floating-point number has a 53-bit mantissa, which is enough to represent accurately 15 significant decimal digits. If you try a value with 16 significant decimal digits, the last digit will not be accurate (because there aren't enough bits to represent it completely). This means that a double cannot store 5290912749423342.0 accurately. The last digit will be wrong. |
| Bonita Montero <Bonita.Montero@gmail.com>: Apr 06 08:50AM +0200 > to represent accurately 15 significant decimal digits. If you try a value > with 16 significant decimal digits, the last digit will not be accurate > (because there aren't enough bits to represent completely). That was clear from the start and there is no need to emphasize it again. I just asked for a solution. |
| MrSpook_jzAsla44@u3_eza.net: Apr 06 08:05AM On Tue, 6 Apr 2021 08:50:36 +0200 >> (because there aren't enough bits to represent completely). >That was clear from the start and there is no need >to emphasize it again. I just asked for a solution. If you don't like fmod() then implement it yourself using some big number library. The maths is trivial: ie: printf("%f = %f\n",fmod(f1,f2),f1 - f2 * (int)(f1 / f2)); |
| Bonita Montero <Bonita.Montero@gmail.com>: Apr 06 10:21AM +0200 > If you don't like fmod() then implement it yourself using some big number > library. ... That would be too slow. The solution given in the glibc source isn't fast either, but it's surely an order of magnitude faster than using a bignum lib. |
| MrSpook_k_2jtdp@59hx8krs2bo3s8a.com: Apr 06 08:59AM On Tue, 6 Apr 2021 10:21:56 +0200 >That would be too slow. >The solution given from the glibc-source isn't fast as well, >but it's for sure a magnitude faster than using a bignum-lib. You can have speed or you can have unlimited floating point accuracy, pick one. Fast floating point is limited by the abilities of the processor's FPU. |
| Ian Collins <ian-news@hotmail.com>: Apr 06 09:14PM +1200 On 05/04/2021 10:38, Ian Collins wrote: >> the converts to int64 then I get the same result as you. > https://code.woboq.org/userspace/glibc/sysdeps/ieee754/dbl-64/s_trunc.c.html > It's usually implemented with a builtin (__builtin_trunc) in gcc. For

double f( double d ) { return trunc(d); }

gcc generates

f:
.LFB0:
        .cfi_startproc
        endbr64
        movsd   .LC1(%rip), %xmm2
        movsd   .LC0(%rip), %xmm4
        movapd  %xmm0, %xmm3
        movapd  %xmm0, %xmm1
        andpd   %xmm2, %xmm3
        ucomisd %xmm3, %xmm4
        jbe     .L2
        cvttsd2siq      %xmm0, %rax
        pxor    %xmm0, %xmm0
        andnpd  %xmm1, %xmm2
        cvtsi2sdq       %rax, %xmm0
        orpd    %xmm2, %xmm0
.L2:
        ret

-- Ian. |
| Bonita Montero <Bonita.Montero@gmail.com>: Apr 06 12:37PM +0200 >> The solution given from the glibc-source isn't fast as well, >> but it's for sure a magnitude faster than using a bignum-lib. > You can have speed or you can have unlimited floating point accuracy, pick one. No one needs unlimited floating-point accuracy. Bignum libraries are only suitable for cryptographic purposes, and these don't need floating-point but simple integer arithmetic. |
| Bonita Montero <Bonita.Montero@gmail.com>: Apr 06 12:42PM +0200 > orpd %xmm2, %xmm0 > .L2: > ret If I may translate this: the code compares the magnitude of the value against a threshold (2^52). Values at least that large have an integral-only mantissa, so they are returned unchanged. Smaller values fit in a 64-bit integer; they are converted to one, which chops off the fraction, and the truncated value is loaded back as a floating-point value. |
| MrSpook_v310_69@j1w2mu8nr6agvg.biz: Apr 06 10:45AM On Tue, 6 Apr 2021 12:37:34 +0200 >> You can have speed or you can have unlimited floating point accuracy, pick >one. >No one needs unlimtited floating-point accuracy. I meant in the sense that you can dictate how much accuracy you need. >Bignum-libaries are only suitable for cryptographic purposes, >and these don't need floating-point- but simple integer-arithmetics. That's all well and good, but if the accuracy you want exceeds the maximum mantissa length the CPU can provide then you're going to need some sort of software algorithm to do your calculations, even if it's invisibly built in to one of the standard libraries. |
| Juha Nieminen <nospam@thanks.invalid>: Apr 06 10:57AM >> (because there aren't enough bits to represent completely). > That was clear from the start and there is no need > to emphasize it again. I just asked for a solution. If the last digit of your number isn't being stored accurately in the variable, then there is no solution. The original digit has been lost. You can't perform an operation on something that doesn't exist. If you want to use digits of that magnitude, try long double instead. |
| Bonita Montero <Bonita.Montero@gmail.com>: Apr 06 01:10PM +0200 > Thats all well and good, but if the accuracy you want ... Where did I write that I have these precision concerns? I simply talked about a solution for valid values. |
| Bonita Montero <Bonita.Montero@gmail.com>: Apr 06 01:12PM +0200 > If the last digit if your number isn't being stored accurately in the > variable, then there is no solution. The original digit has been lost. > You can't perform an operation on something that doesn't exist. Look at the glibc code I already posted. It is slow, but it gives sufficiently accurate results with double precision. > If you want to use digits of that magnitude, try long double instead. long double will have the same issue for values where x and y have a larger distance between their exponents. |
| Richard Damon <Richard@Damon-Family.org>: Apr 06 07:45AM -0400 On 4/6/21 1:30 AM, Juha Nieminen wrote: > (because there aren't enough bits to represent completely). > This means that a double cannot store 5290912749423342.0 accurately. > The last digit will be wrong. Actually, if I am doing my math right, 53 bits can represent integers up to 9,007,199,254,740,991 (2**53 - 1) exactly with the units bit expressed, so that number is expressible exactly. 50 bits is about 15 digits (10 bits being about 3 digits), so the 3 extra bits give us most of an extra digit. The issue is that after the divide, the result can't be expressed precisely enough and the round-off errors cause the issues. If you think of floating point numbers as representing ranges, then 5290912749423342.0 represents the interval 5290912749423341.5 to 5290912749423342.5, and thus the answer to fmod should be the interval 1.5 to 2.5 (but a float can't express that level of imprecision itself). If you think of them as precise numbers that are an approximation to reality, then fmod would be precisely 2.000, but you might get wrong values from the math due to round off. In this case the formula x/y - trunc(x/y) gave 0.1875 instead of 0.2, because 0.1875 is 3/16 and x/y will be expressed to the nearest 1/16th due to precision round off. |
| Bonita Montero <Bonita.Montero@gmail.com>: Apr 06 02:29PM +0200 > In this case the formula x/y - trunc(x/y) gave 0.1875 instead of 0.2 > because is 3/16 and x/y will be expressed to the nearest 1/16th due to > precision round off. I'm not concerned about imprecise results, but about invalid results with the naive implementations x - trunc( x / y ) * y and x / y * trunc( x / y ) If x is very high and y is very small, you might get results which are outside the modulo range. |
| Bonita Montero <Bonita.Montero@gmail.com>: Apr 06 02:49PM +0200 Because this algorithm is so similar to the usual FP divide algorithms in FPUs, I'm asking myself why the latter don't integrate a path for fmod() operations. It wouldn't be a big deal to integrate an option for fmod() into these circuits for a particular FP instruction. |
| Richard Damon <Richard@Damon-Family.org>: Apr 06 09:01AM -0400 On 4/6/21 8:29 AM, Bonita Montero wrote: > x / y * trunc( x / y ) > If x is very high and y is very small, you might get results wich > are outside the modulo-range. Floating point should ALWAYS be treated as 'imprecise', and sometimes imprecision can create errors big enough to cause other issues. You need to remember that when you compute x / y, that value will have a precision; taking the trunc implies an assumption that the precision is well better than counting units, which it might not be (and isn't in your cases). The answer to this is to add a bit of complexity to the method, to make sure that you adjust the numbers to bring things into a better range accurately. In general, this is not simple, especially if both x and y have lots of significant bits and are of very different magnitudes. Note that if the LSB of x is bigger than the MSB of y, then the question actually doesn't really make sense; you are getting close to that point, and that is why the system is very sensitive to errors. |
| Bonita Montero <Bonita.Montero@gmail.com>: Apr 06 03:14PM +0200 > Floating Point should ALWAYS be treated is 'imprecise', ... You're telling things which are common knowledge. |
| Richard Damon <Richard@Damon-Family.org>: Apr 06 10:06AM -0400 On 4/6/21 9:14 AM, Bonita Montero wrote: >> Floating Point should ALWAYS be treated is 'imprecise', ... > You're telling things which are common knowledge. The problem is that the cases you're complaining about are the cases where that imprecision becomes large compared to the numbers you are dealing with. trunc(z) when z is on the order of 2**53 has no precision left to express a fraction. x/y will have a rounding error in it. If it rounds up and that round-up causes a 'bump' in trunc, then trunc might have amplified the error to a full integer count. Take a number just a count or two below 1000 and divide it by 10: you will get 100, but if done to infinite precision it would be 99.99999.. and trunc would round down. Thus trunc(x/y) always has a chance, if the division is close enough to an integer value, of rounding up and causing that range error. All the problems you are complaining about are ultimately rooted in the fact that floating point math is imprecise, and it is hard, and slow, to do the math so you get the absolute best answer you can. |
| James Kuyper <jameskuyper@alumni.caltech.edu>: Apr 06 10:11AM -0400 > If you don't like fmod() then implement it yourself using some big number > library. The maths is trivial: > ie: printf("%f = %f\n",fmod(f1,f2),f1 - f2 * (int)(f1 / f2)); You don't need bignums to do that. There may be more clever ways to do it, but as proof of principle, an algorithm has already been sketched out in this thread. It depends upon the facts that multiplication by powers of FLT_RADIX can be done exactly, that finding the difference between two numbers of the same sign and the same exponent is also exact, and that fmod(difference, f2) is equal to either fmod(f1,f2) or f2-fmod(f1,f2), depending upon what you did to give them the same sign. fmod() is required to give the exact result #ifdef __STDC_IEC_559__ |
| Bonita Montero <Bonita.Montero@gmail.com>: Apr 06 04:19PM +0200 > The problem is that the cases your complaining about are the cases where > that impression becomes large compared to the numbers you are dealing with. I'm not complaining about that, I'm just saying that the problems resulting from inaccuracy aren't as obvious as in many other cases. Rest of your text not read. |
| Bart <bc@freeuk.com>: Apr 06 03:34PM +0100 On 06/04/2021 15:19, Bonita Montero wrote: > I'm not complaining about that, I'm just telling that the problems > resulting from inacurrarcy aren't as obvious as in many other cases. > Rest of your text not read. Your OP gave the impression that you were looking for the correct formula for fmod, and gave two variations. The correct one was the second (x-y*trunc(x/y)); the first gives the wrong results. You suggested that neither was correct because they gave the wrong results on certain inputs. That is due to the limitations of 64-bit floats. The actual implementation of fmod may use a /better/ method which preserves more, which is not to say that the trunc approach is wrong. fmod will start giving invalid values too if you add a few more significant digits. I've used x-y*trunc(x/y) in a bigfloat library and it will always give the correct results (subject to whatever precision cap you apply), so it is not wrong. But it is just not tuned to the ieee754 64-bit format. |
| Bonita Montero <Bonita.Montero@gmail.com>: Apr 06 04:42PM +0200 > fmod will start giving invalid values too if you add a few more > significant digits. That's wrong. The results become more inaccurate the larger the exponent differences are, but the modulo results are always in the correct range. |