- A right alternative to IEEE-754's format - 18 Updates
- Simple Proxy Collector... - 1 Update
Bonita Montero <Bonita.Montero@gmail.com>: Mar 28 02:30PM +0200 > There are many problems with IEEE-754. Different architectures > aren't even required to produce the same result. You can tune each compiler and FPU (through setting the control word) so that you can have identical results on different platforms. On x87 this will result in slower code, because results have to be chopped to double precision, but with SSE and AVX the code won't run slower. |
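A minimal sketch of the kind of pinning described above, assuming a standard C++11 <cfenv> implementation; the x87 control word and the SSE MXCSR register are platform-specific details and are not shown here:

#include <cfenv>
#include <cstdio>

// Assumption: the compiler is told the FP environment is accessed
// (e.g. -frounding-math, /fp:strict, or #pragma STDC FENV_ACCESS ON).
int main() {
    if (std::fesetround(FE_TONEAREST) != 0) {  // pin round-to-nearest-even
        std::puts("could not set rounding mode");
        return 1;
    }
    // With the rounding mode pinned, double-precision SSE/AVX arithmetic
    // used throughout (no x87 extended intermediates) and contraction
    // disabled, the same sequence of operations is expected to round
    // identically on conforming platforms.
    volatile double a = 0.1, b = 0.3;
    std::printf("%.17g\n", a + b);
}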
"Rick C. Hodgin" <rick.c.hodgin@gmail.com>: Mar 28 05:45AM -0700 On Wednesday, March 28, 2018 at 8:30:37 AM UTC-4, Bonita Montero wrote: > so that you can have identical results on different platforms. On x87 > this will result in slower code to chop results but with SSE and AVX > the code won't run slower. The compiler enables you to overcome limitations and variations allowed for by the IEEE-754 standard so as to obtain reproducible results across platforms ... in that compiler, and possibly to a (or an implicit) C++ standard observed by various compiler authors. But you can write a Java-based compute engine, and a C++-base compute engine, and they may not produce the same results using IEEE-754 because they have different mechanisms to process data using the same IEEE-754 "compliant" FPU. Some compilers also have a fused_multiply_add operation, and many compilers will try to optimize the multiple, add into a single multiply_add operation, and the results can be different on each because of the way different portions of the computation are completed. IEEE-754 has a lot of issues. It is unsuitable for high-precision numerical work, but is good enough to get you in the ball-park. But all truly detailed (often arbitrary precision) work will use a different software-based engine that bypasses the shortcomings of IEEE-754 in favor of accuracy at the sacrifice of much speed. My goal in creating a new numeric processing unit is to overcome those issues by allowing for arbitrary precision computation in hardware, and to define a standard that does not allow for any ambiguity in implementation. Corner cases and results will be explicitly defined, for example, so that any compliant implemen- tation will produce identical results. It's part of my long-term goal and Arxoda CPU project. -- Rick C. Hodgin |
Bonita Montero <Bonita.Montero@gmail.com>: Mar 28 03:32PM +0200 > compute engine, and they may not produce the same results using > IEEE-754 because they have different mechanisms to process data > using the same IEEE-754 "compliant" FPU. That's irrelevant because Java is a different language with a different behaviour. And Java doesn't even claim to be IEEE-754-compliant. > multiply_add operation, and the results can be different on each > because of the way different portions of the computation are > completed. FMA produces the same results as explicit instructions with some special behaviour when some operands are -0, Inf or NaN. > It is unsuitable for high-precision numerical work, ... That's not true. You can control how the FPU behaves through instructing the compiler or the FPU control word. |
David Brown <david.brown@hesbynett.no>: Mar 28 04:38PM +0200 On 28/03/18 14:45, Rick C. Hodgin wrote: > multiply_add operation, and the results can be different on each > because of the way different portions of the computation are > completed. For most floating point work, that is absolutely fine. Doubles give you about 16 significant digits. That is high enough resolution to measure the distance across the USA to the nearest /atom/. Who cares if an atom or two gets lost in the rounding when you do some arithmetic with it? Yes, there are some uses of floating point where people want replicable results across different machines (or different implementations - software, FPU, SIMD, etc. on the same machine). But these are rarities, and IEEE and suitable compilers support them. It is not uncommon to have to use slower modes and drop optimisation (x * (y * z) is no longer always (x * y) * z) - for the best guarantees of repeatability, use a software floating point library. In most cases, however, it is fine to think that floating point calculations give a close but imprecise result. Simply assume that you will lose a bit of precision for each calculation, and order your code appropriately. (Fused multiply-add, and other such optimisations, reduce the precision loss.) Sometimes you need to do a lot of calculations, or you need to mix wildly different sizes of numbers (such as for big numerical calculations, if you can't order them appropriately) - there are use-cases for 128-bit IEEE numbers. (256-bit IEEE is also defined, but rarely used.) And there are use-cases for arbitrary precision numbers too. These are more typically integers than floating point. > But all truly detailed (often arbitrary precision) work will use > a different software-based engine that bypasses the shortcomings > of IEEE-754 in favor of accuracy at the sacrifice of much speed. Do you have any references or statistics to back this up? For most uses, AFAIK, strict IEEE-754 is used when you favour repeatability (not accuracy) over speed - and something like "gcc -ffast-math" is for when you favour speed. The IEEE formats are well established as a practical and sensible format for floating point that cover a huge range of use-cases. The situations where they are not good enough are rare. |
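A tiny example of the "order your code appropriately" advice, assuming ordinary IEEE doubles: addition is not associative, so adding the small terms together before the large one preserves their contribution:

#include <cstdio>

int main() {
    volatile double big = 1e16, tiny = 1.0;
    double left_to_right = (big + tiny) + tiny;  // each tiny add is absorbed
    double tiny_first    = big + (tiny + tiny);  // tiny terms combined first
    std::printf("%.17g\n", left_to_right);       // 10000000000000000
    std::printf("%.17g\n", tiny_first);          // 10000000000000002
}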
"Rick C. Hodgin" <rick.c.hodgin@gmail.com>: Mar 28 08:16AM -0700 On Wednesday, March 28, 2018 at 9:32:27 AM UTC-4, Bonita Montero wrote: > That's irrelevant because Java is a different language with a > different behaviour. And Java even doesn't claim to be IEEE-754 > -compliant. Java running on the same computer using the same sequence of math ops can produce a different total than the same sequence coded in C++. It is not supposed to, but C++ will inject periodic stores and reloads to ensure proper rounding at times. This means the IEEE-754 "standard" allows for variability in computationn based on when certain steps are performed. It is not mathematically accurate, nor is it required to produce identical results across architectures. > > completed. > FMA produces the same results as expliticit instructoins with > some special behaviour when some operands are -0, Inf or NaN. See section 2.3. FMA performs one round. Separate multiply-add perform two. The results arwe different: 2.3. The Fused Multiply-Add (FMA) http://docs.nvidia.com/cuda/floating-point/index.html > > It is unsuitable for high-precision numerical work, ... > That's not true. You can control how the FPU behaves through > instructing the compiler or the FPU control word. But you cannot guarantee the same operation across platforms. IEEE-754 is not a true standard as it leaves some wriggle room in actual implementation. And compilers often introduce optimizations which change the result: https://www.nccs.nasa.gov/images/FloatingPoint_consistency.pdf John Gustafson also talks about it in his Stanford talk posted above. -- Rick C. Hodgin |
"Rick C. Hodgin" <rick.c.hodgin@gmail.com>: Mar 28 08:27AM -0700 On Wednesday, March 28, 2018 at 10:38:30 AM UTC-4, David Brown wrote: > uses, AFAIK, strict IEEE-754 is used when you favour repeatability (not > accuracy) over speed - and something like "gcc -ffast-math" is for when > you favour speed. Weather modeling uses the double-double and quad-double libraries, which leverage x87 hardware to chain operations to create 128-bit, and 256-bit computations. http://crd-legacy.lbl.gov/~dhbailey/mpdist/ MPFR maintains a list of citations using their work: http://www.mpfr.org/pub.html And John Gustafson discusses the shortcomings in his Stanford video you didn't watch I posted above. -- Rick C. Hodgin |
Bonita Montero <Bonita.Montero@gmail.com>: Mar 28 05:59PM +0200 > Java running on the same computer using the same sequence of math > ops can produce a different total than the same sequence coded in C++. Where's the problem? Java is a different language and may have a different arithmetic behaviour. > It is not supposed to, but C++ will inject periodic stores and reloads > to ensure proper rounding at times. Only with the x87-FPU. With SSE or AVX, there are explicit operations with less precision. And other architectures have such instructions as well. > This means the IEEE-754 "standard" allows for variability > in computation based on when certain steps are performed. But a consistent behaviour, as one would naively expect, can be configured. > It is not mathematically accurate, nor is it required to > produce identical results across architectures. You can easily configure different compilers of the same language on different architectures to give identical binary results. > perform two. The results are different: > 2.3. The Fused Multiply-Add (FMA) > http://docs.nvidia.com/cuda/floating-point/index.html NVidia GPUs aren't general purpose CPUs. And for the purpose these GPUs are designed, this behaviour isn't a restriction. >> That's not true. You can control how the FPU behaves through >> instructing the compiler or the FPU control word. > But you cannot guarantee the same operation across platforms. It can be configured so. |
Bonita Montero <Bonita.Montero@gmail.com>: Mar 28 06:03PM +0200 > Weather modeling uses the double-double and quad-double libraries, > which leverage x87 hardware to chain operations to create 128-bit > and 256-bit computations. I'll bet my right hand that weather modelling usually isn't done with 128- or 256-bit FPs as these operations are extremely slow when done in software. If there were a noteworthy demand for 128- or 256-bit FP, there would be hardware support in many CPUs. |
scott@slp53.sl.home (Scott Lurndal): Mar 28 04:13PM >I'll bet my right hand that weather modelling usually isn't done >with 128- or 256-bit FPs as these operations are extremely slow >when done in software. A large fraction of it is either done using SIMD operations or GPU (OpenCL/Cuda). Machine learning is leaning towards highly-parallel (SIMD) 16-bit FP. |
"Rick C. Hodgin" <rick.c.hodgin@gmail.com>: Mar 28 09:27AM -0700 On Wednesday, March 28, 2018 at 11:59:25 AM UTC-4, Bonita Montero wrote: > > ops can produce a different total than the same sequence coded in C++. > Where's the problem? Java is a different language and may have a > different arithmetic behaviour. That is the problem. It's using the same IEEE-754 engine under the hood. The result of a calculation using the same source code copied from a C++ app into a Java app (something like "c = a + b", but more complex) can produce different results in the Java app than it can in the C++ app, or vice-versa. This is directly the result of the IEEE-754 standard being insufficient in and of itself to perform math properly. It requires coddling to get it to produce identical results across platforms (and even sometimes on the same machine across processes which are scheduled on different cores). > Only with the x87-FPU. With SSE or AVX, there are explicit operations > with less precsion. And other architectures behave also such instruc- > tions. Various SIMD extensions have improved upon the issue relative to the x87 FPU, but the results still remain. You cannot guarantee that an ARM-based CPU will produce the same result as an AMD64- based CPU if they both use the same sequence of operations. You have to manually validate it, and that's the issue being discussed by math-based hardware people and IEEE-754, and specifically at the present time, John Gustafson. > > in computationn based on when certain steps are performed. > But a consistent behaviour as one would naively would expect > can be configured. Not without manual effort. You can't take a CPU that sports a fully IEEE-754 compliant FPU and guarantee the results will be the same. They likely will be, or as David points out, will be close enough that no one would ever see the variance ... but I'm talking about the actual variance which is there. In the last few bits, rounding at various levels enters in and it does change the result. It may have no real impact on the final result rounded to four base-10 decimal places, but if you look at the result the hardware computed, it is (can be) different. > > produce identical results across architectures. > You can easily configure different compilers of the same language > on different architectures to give identical binary results. This has not been my experience. I have listened to many lectures where they state this is one of the biggest problems with IEEE-754. In fact, one of the authors of Java was absolutely stunned to learn that IEEE-754 on one machine may produce different results than the same series of calculations on another. It floored him that such a result was possible in a "standard" such as IEEE-754: https://people.eecs.berkeley.edu/~wkahan/JAVAhurt.pdf > > http://docs.nvidia.com/cuda/floating-point/index.html > NVidia GPUs aren't general purpose CPUs. And for the purpose these > GPUs are designed, this behaviour isn't a restriction. The document I posted outlines the issue. It is prevalent in any implementation of the IEEE-754-2008 FMA extension. The operation requires only one rounding, whereas separate multiply-add ops will have two. This invariably produces different results, be it on a GPU, CPU, or with trained mice mimicking FPU gates switching on and off in a laboratory. FMA is a different mathematical operation than two separate multi- ply add operations. That's the end of that discussion. 
The results from FMA are typically better than those of the two separate multiply and add operations, but the results are different. One of the biggest arguments is whether or not a compiler should be able to replace a multiply followed by an add with a single fused_multiply_add operation. Most people think it shouldn't, but to the compiler author, and in a pure mathematical sense, it should not make any difference, so they optimize in that way. But it does make a difference, and they shouldn't. Some modern compilers are wising up to that realization. Intel's compiler, for example, will only use FMA under certain flags for optimization. > >> instructing the compiler or the FPU control word. > > But you cannot guarantee the same operation across platforms. > It can be configured so. It cannot be guaranteed. You can work around it. You can wriggle your way into it, but you cannot know with certainty if a particular IEEE-754-compliant FPU will produce the same result on architectures A, B, C, and D, without testing them. IIRC, even some Pentium III-era and older CPUs produce some different results than modern Intel-based hardware (and not due to the Pentium FDIV bug), but due to the way Intel removed "legacy DOS support" in the Pentium 4 and later. You must now explicitly enable backward compatibility flags in the Pentium 4's x87 FPU and later CPUs to get the same results in your debugger as you would've observed previously. Such things have been altered for performance, but they impact the fundamental operations of the x87 FPU. With all things floating point, you must rigorously test things to see if they are the same across platforms, and even arguably revisions within platforms. -- Rick C. Hodgin |
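A small, hedged probe (not from the post) for the contraction question above: with operands for which one rounding and two roundings differ, comparing the plain expression against std::fma shows whether the compiler fused it under the current flags:

#include <cmath>
#include <cstdio>

int main() {
    volatile double a = 1.0 + std::ldexp(1.0, -27);
    volatile double b = 1.0 - std::ldexp(1.0, -27);
    volatile double c = -1.0;
    double expr  = a * b + c;         // may or may not be contracted
    double fused = std::fma(a, b, c); // always a single rounding
    // Caveat: on an x87 build with extended-precision intermediates the two
    // can match even without contraction, so this probe is only indicative.
    std::printf("%s\n", expr == fused ? "contracted to FMA"
                                      : "two separate roundings");
}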
"Rick C. Hodgin" <rick.c.hodgin@gmail.com>: Mar 28 09:33AM -0700 On Wednesday, March 28, 2018 at 12:04:13 PM UTC-4, Bonita Montero wrote: > I'll bet my right hand that weather modelling usually isn't done > with 128- or 256-bit FPs as these operations are ultimatively slow > when done in software. The QD library I cited above was created to use x87-based hardware to allow for fast 128-bit and 256-bit computation. This information came to me first-hand during an interview with John Stone who leases time on the supercomputers in Urbana, IL. > If there would be a noteworthy demand for 128- or 256-bit FP, there > would be a hardware-support by many CPUs. A discussion today on this subject on comp.arch yielded Terje Mathisen writing this as an aid introduced into hardware to make software-based 128-bit and 256-bit compute go faster: https://groups.google.com/d/msg/comp.arch/igzzuO9cwwM/lzqZ5KROAAAJ You should download and benchmark that 128-bit and 256-bit QD library, and compare it to the same size operations in MPFR. You will be amazed how fast it is. In 2010 when I was doing extensive large FPU computation, I was using MPFR for many months. I interviewed John Stone at that time and mentioned my work (as I was using a widely parallel math system that I commented would work really well on the supercomputers there). He asked me what I was using and I told him. He said most people he knows use the QD library. I downloaded it and tried it and it was notably faster, something like 6x to 10x faster IIRC. He said it was expressly developed or enhanced to work with the weather modeling done on supercomputers because it uses the FPU hardware but is more precise than even the 80-bit formats. It uses the 64-bit double format and spreads the mantissa bits across two 64-bit quantities for 128-bit, and across four for 256-bit. Search for "QD": http://crd-legacy.lbl.gov/~dhbailey/mpdist/ -- Rick C. Hodgin |
asetofsymbols@gmail.com: Mar 28 10:23AM -0700 I like fixed-point arithmetic, but I don't have much experience with it to know whether it is OK. At first glance it seems better than any IEEE float implementation. To implement it, it is enough that the CPU can do operations on unsigned integers, as the usual x86 CPUs can. |
Hergen Lehmann <hlehmann.expires.5-11@snafu.de>: Mar 28 07:53PM +0200 > I like fixed-point arithmetic, but I don't have much experience with it > to know whether it is OK. At first glance it seems better than any IEEE > float implementation. To implement it, it is enough that the CPU can do > operations on unsigned integers, as the usual x86 CPUs can. Fixed-point arithmetic is fine if your application has a predictable and rather limited value range. However, proper overflow handling will be a pain in the ass (many CPUs do not even generate an exception for integer overflows!). And more complex operations like logarithms or trigonometric functions will be slow without the help of the FPU. |
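A minimal Q16.16 fixed-point sketch (hypothetical, just to make the trade-off concrete): the value range is fixed up front, and overflow has to be checked by hand through a wider intermediate, which is exactly the pain point mentioned above:

#include <cstdint>
#include <cstdio>

struct q16_16 { std::int32_t raw; };   // value = raw / 65536.0

static q16_16 from_double(double v) { return { static_cast<std::int32_t>(v * 65536.0) }; }
static double to_double(q16_16 v)   { return v.raw / 65536.0; }

// Multiply with manual overflow detection via a 64-bit intermediate.
// (Arithmetic right shift of a negative value is assumed here.)
static bool mul(q16_16 a, q16_16 b, q16_16* out) {
    std::int64_t wide = (static_cast<std::int64_t>(a.raw) * b.raw) >> 16;
    if (wide > INT32_MAX || wide < INT32_MIN) return false;   // overflow
    out->raw = static_cast<std::int32_t>(wide);
    return true;
}

int main() {
    q16_16 a = from_double(123.25), b = from_double(-0.5), r{};
    if (mul(a, b, &r))
        std::printf("%g\n", to_double(r));   // -61.625
    else
        std::puts("overflow");
}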
Gareth Owen <gwowen@gmail.com>: Mar 28 07:00PM +0100 > I'll bet my right hand that weather modelling usually isn't done > with 128- or 256-bit FPs as these operations are extremely slow > when done in software. There's very little benefit in keeping precision that's that many orders of magnitude more precise than the measurements driving the model -- doubly so when the process you're modelling is the poster child for sensitive dependence on those initial conditions. Back when I did a bit of it professionally, we'd run the same models on multiple architectures, and the variation in FP behaviour was just another way to add to the ensembles. |
asetofsymbols@gmail.com: Mar 28 11:08AM -0700 I don't agree with a word you say. The real problem for error is IEEE floating point, not fixed point, because with IEEE floats the error also depends on the size of the number. |
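A quick way to see the point being made, assuming ordinary doubles: the gap between adjacent floating-point values (one ulp) grows with the magnitude of the number, whereas a fixed-point format has a constant step:

#include <cmath>
#include <cstdio>
#include <initializer_list>

int main() {
    for (double x : {1.0, 1e8, 1e16})
        std::printf("ulp near %-6g is %g\n", x, std::nextafter(x, 2.0 * x) - x);
    // Prints roughly 2.2e-16, 1.5e-08 and 2: the absolute error of one
    // rounding step scales with the size of the number.
}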
Mr Flibble <flibbleREMOVETHISBIT@i42.co.uk>: Mar 28 09:12PM +0100 On 27/03/2018 19:22, Rick C. Hodgin wrote: >> The only thing wrong with IEEE-754 is allowing a representation of >> negative zero. There is no such thing as negative zero. > There are many problems with IEEE-754. IEEE-754 serves its purpose quite adequately evidenced by its successful pervasiveness. If there was a significantly better way of doing things we would be doing it that way by now, modulo Gustafson's promising unums (rather than any amateurish God bothering alternative). /Flibble -- "Suppose it's all true, and you walk up to the pearly gates, and are confronted by God," Bryne asked on his show The Meaning of Life. "What will Stephen Fry say to him, her, or it?" "I'd say, bone cancer in children? What's that about?" Fry replied. "How dare you? How dare you create a world to which there is such misery that is not our fault. It's not right, it's utterly, utterly evil." "Why should I respect a capricious, mean-minded, stupid God who creates a world that is so full of injustice and pain. That's what I would say." |
"Rick C. Hodgin" <rick.c.hodgin@gmail.com>: Mar 28 01:19PM -0700 On Wednesday, March 28, 2018 at 4:13:03 PM UTC-4, Mr Flibble wrote: > IEEE-754 serves its purpose quite adequately evidenced by its successful > pervasiveness. If there was a significantly better way of doing things > we would be doing it that way by now, modulo Gustafson's promising unums There is inertia at this point, and the transformation from IEEE-754 to another format is a lot different than moving from FPUs to SIMD. It will require a major overhaul of many apps to be able to handle the new format. It's not as easy a sell as you might think. -- Rick C. Hodgin |
Mr Flibble <flibbleREMOVETHISBIT@i42.co.uk>: Mar 28 09:33PM +0100 On 28/03/2018 21:19, Rick C. Hodgin wrote: > It will require a major overhaul of many apps to be able to handle > the new format. > It's not as easy a sell as you might think. Read my reply again: I used the word "pervasiveness" and what I didn't say is that moving away from the status quo would be "an easy sell". Changing from something that is pervasively bedded in is never easy. /Flibble -- "Suppose it's all true, and you walk up to the pearly gates, and are confronted by God," Bryne asked on his show The Meaning of Life. "What will Stephen Fry say to him, her, or it?" "I'd say, bone cancer in children? What's that about?" Fry replied. "How dare you? How dare you create a world to which there is such misery that is not our fault. It's not right, it's utterly, utterly evil." "Why should I respect a capricious, mean-minded, stupid God who creates a world that is so full of injustice and pain. That's what I would say." |
"Chris M. Thomasson" <invalid_chris_thomasson@invalid.invalid>: Mar 27 08:55PM -0700 Fwiw, since I have been working on fractals so much, I was wondering if creating a C++11 implementation of a proxy collector of mine would be as easy as I did it in the past. For some more information please read here: https://groups.google.com/d/topic/lock-free/X3fuuXknQF0/discussion Think of user-space RCU https://en.wikipedia.org/wiki/Read-copy-update Here is my C++11 code, it should compile and run on 32-bit x86 with support for DWCAS. Well, it should run on any other arch with support for lock-free DWCAS. In other words, the following struct needs to be at least lock-free atomic wrt std::atomic<T>: _______________________ struct anchor { std::intptr_t count; node* head; }; _______________________ http://en.cppreference.com/w/cpp/atomic/atomic/is_lock_free Has to return true. The lock-free part is the DWCAS in the acquire function. Here is my C++11 code [1]: https://pastebin.com/raw/KAt4nhCj The core algorithm has been verified with Relacy Race Detector: https://groups.google.com/d/topic/lock-free/QuLBH87z6B4/discussion I am currently writing up how this can be useful, sorry about that. Basically, its like RCU where reader threads can read a data-structure concurrently along with writer threads mutating it. Writer threads can collect objects while reader threads are accessing them. Every thing is accounted for using differential counting. I can get the code to compile and run on 32-bit x86. Can anybody else get it to compile and/or run? Thanks. [1] code: ___________________________ /* Simple Proxy Collector (DWCAS) - C++11 Version http://www.1024cores.net/home/relacy-race-detector Copyright 3/27/2018 ___________________________________________________*/ #if defined (_MSC_VER) # define _ENABLE_ATOMIC_ALIGNMENT_FIX // for dwcas