Wednesday, March 28, 2018

Digest for comp.lang.c++@googlegroups.com - 19 updates in 2 topics

Bonita Montero <Bonita.Montero@gmail.com>: Mar 28 02:30PM +0200

> There are many problems with IEEE-754. Different architectures
> aren't even required to produce the same result.
You can tune each compiler and FPU (through setting the control word)
so that you can have identical results on different platforms. On x87
this will result in slower code to chop results but with SSE and AVX
the code won't run slower.
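 
To give a concrete (if minimal) sketch of that kind of tuning, the
standard <cfenv> interface exposes the rounding mode that the control
word governs; this is only an illustration, not a complete recipe:
_______________________
#include <cfenv>    // std::fesetround, FE_* rounding-mode macros
#include <cstdio>

#pragma STDC FENV_ACCESS ON

int main()
{
    // Chop (truncate) results instead of rounding to nearest.
    std::fesetround(FE_TOWARDZERO);

    volatile double a = 1.0, b = 3.0;
    std::printf("%.17g\n", a / b);   // quotient rounded toward zero

    std::fesetround(FE_TONEAREST);   // restore the default mode
    return 0;
}
_______________________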
"Rick C. Hodgin" <rick.c.hodgin@gmail.com>: Mar 28 05:45AM -0700

On Wednesday, March 28, 2018 at 8:30:37 AM UTC-4, Bonita Montero wrote:
> so that you can have identical results on different platforms. On x87
> this will result in slower code to chop results but with SSE and AVX
> the code won't run slower.
 
The compiler enables you to overcome limitations and variations
allowed for by the IEEE-754 standard so as to obtain reproducible
results across platforms ... in that compiler, and possibly in others
that observe the same (explicit or implicit) C++ conventions.
 
But you can write a Java-based compute engine, and a C++-based
compute engine, and they may not produce the same results using
IEEE-754 because they have different mechanisms to process data
using the same IEEE-754 "compliant" FPU.
 
Some compilers also have a fused_multiply_add operation, and many
compilers will try to optimize a multiply followed by an add into a single
multiply_add operation, and the results can be different on each
because of the way different portions of the computation are
completed.
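 
The difference is easy to see for yourself with std::fma from <cmath>;
a small sketch with illustrative values (nothing compiler-specific):
_______________________
#include <cmath>
#include <cstdio>

int main()
{
    double x = 1.0 / 3.0;
    double p = x * x;                   // the multiply rounds once here

    // Separate multiply then subtract: the rounded product cancels itself.
    double unfused = p - p;             // exactly 0.0

    // Fused multiply-add: x*x - p done with a single rounding, which
    // recovers the rounding error the separate multiply threw away.
    double fused = std::fma(x, x, -p);  // non-zero with IEEE-754 doubles

    std::printf("unfused = %.17g\nfused = %.17g\n", unfused, fused);
    return 0;
}
_______________________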
 
IEEE-754 has a lot of issues. It is unsuitable for high-precision
numerical work, but is good enough to get you in the ball-park.
But all truly detailed (often arbitrary precision) work will use
a different software-based engine that bypasses the shortcomings
of IEEE-754 in favor of accuracy at the sacrifice of much speed.
 
My goal in creating a new numeric processing unit is to overcome
those issues by allowing for arbitrary precision computation in
hardware, and to define a standard that does not allow for any
ambiguity in implementation. Corner cases and results will be
explicitly defined, for example, so that any compliant
implementation will produce identical results.
 
It's part of my long-term goal and Arxoda CPU project.
 
--
Rick C. Hodgin
Bonita Montero <Bonita.Montero@gmail.com>: Mar 28 03:32PM +0200

> compute engine, and they may not produce the same results using
> IEEE-754 because they have different mechanisms to process data
> using the same IEEE-754 "compliant" FPU.
 
That's irrelevant because Java is a different language with a
different behaviour. And Java doesn't even claim to be
IEEE-754-compliant.
 
> multiply_add operation, and the results can be different on each
> because of the way different portions of the computation are
> completed.
 
FMA produces the same results as explicit instructions, with
some special behaviour when some operands are -0, Inf or NaN.
 
> It is unsuitable for high-precision numerical work, ...
 
That's not true. You can control how the FPU behaves through
instructing the compiler or the FPU control word.
David Brown <david.brown@hesbynett.no>: Mar 28 04:38PM +0200

On 28/03/18 14:45, Rick C. Hodgin wrote:
> multiply_add operation, and the results can be different on each
> because of the way different portions of the computation are
> completed.
 
For most floating point work, that is absolutely fine. Doubles give you
about 16 significant digits. That is high enough resolution to measure
the distance across the USA to the nearest /atom/. Who cares if an atom
or two gets lost in the rounding when you do some arithmetic with it?
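 
A rough back-of-envelope illustration of the resolution involved,
using assumed figures (about 4.5e6 m across the continental USA and
about 1e-10 m per atom):
_______________________
#include <cmath>
#include <cstdio>

int main()
{
    double usa  = 4.5e6;    // metres across the continental USA (assumed)
    double atom = 1e-10;    // metres per atom (assumed)

    // Spacing between adjacent doubles near 4.5e6 m, i.e. the finest
    // resolution a double offers at that magnitude.
    double ulp = std::nextafter(usa, 2.0 * usa) - usa;

    std::printf("ulp near %.2g m = %.3g m, about %.0f atom radii\n",
                usa, ulp, ulp / atom);
    return 0;
}
_______________________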
 
Yes, there are some uses of floating point where people want replicable
results across different machines (or different implementations -
software, FPU, SIMD, etc. on the same machine). But these are rarities,
and IEEE and suitable compilers support them. It is not uncommon to
have to use slower modes and drop optimisation (x * y is no longer
always y * x) - for the best guarantees of repeatability, use a software
floating point library.
 
In most cases, however, it is fine to think that floating point
calculations give a close but imprecise result. Simply assume that you
will lose a bit of precision for each calculation, and order your code
appropriately. (Fused multiply-add, and other such optimisations,
reduce the precision loss.)
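 
A small illustration of why ordering matters: floating-point addition
is not associative, so the same three values can sum to two different
doubles depending on the grouping (illustrative values only):
_______________________
#include <cstdio>

int main()
{
    double big = 1.0e16, small = 1.0;

    // Each small addend is rounded away against the big value...
    double left = (big + small) + small;

    // ...but survives when the small terms are combined first.
    double right = big + (small + small);

    std::printf("left  = %.17g\nright = %.17g\n", left, right);
    return 0;
}
_______________________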
 
Sometimes you need to do a lot of calculations, or you need to mix
wildly different sizes of numbers (such as for big numerical
calculations, if you can't order them appropriately) - there are
use-cases for 128-bit IEEE numbers. (256-bit IEEE is also defined, but
rarely used.)
 
And there are use-cases for arbitrary precision numbers too. These are
more typically integers than floating point.
 
> But all truly detailed (often arbitrary precision) work will use
> a different software-based engine that bypasses the shortcomings
> of IEEE-754 in favor of accuracy at the sacrifice of much speed.
 
Do you have any references or statistics to back this up? For most
uses, AFAIK, strict IEEE-754 is used when you favour repeatability (not
accuracy) over speed - and something like "gcc -ffast-math" is for when
you favour speed. The IEEE formats are well established as a practical
and sensible format for floating point that cover a huge range of
use-cases. The situations where they are not good enough are rare.
 
"Rick C. Hodgin" <rick.c.hodgin@gmail.com>: Mar 28 08:16AM -0700

On Wednesday, March 28, 2018 at 9:32:27 AM UTC-4, Bonita Montero wrote:
 
> That's irrelevant because Java is a different language with a
> different behaviour. And Java doesn't even claim to be
> IEEE-754-compliant.
 
Java running on the same computer using the same sequence of math
ops can produce a different total than the same sequence coded in C++.
It is not supposed to, but C++ will inject periodic stores and reloads
to ensure proper rounding at times.
 
This means the IEEE-754 "standard" allows for variability in computation
based on when certain steps are performed. It is not mathematically
accurate, nor is it required to produce identical results across
architectures.
 
> > completed.
 
> FMA produces the same results as explicit instructions, with
> some special behaviour when some operands are -0, Inf or NaN.
 
See section 2.3. FMA performs one rounding. Separate multiply and add
perform two. The results are different:
 
2.3. The Fused Multiply-Add (FMA)
http://docs.nvidia.com/cuda/floating-point/index.html
 
> > It is unsuitable for high-precision numerical work, ...
 
> That's not true. You can control how the FPU behaves through
> instructing the compiler or the FPU control word.
 
But you cannot guarantee the same operation across platforms.
 
IEEE-754 is not a true standard as it leaves some wriggle room in
actual implementation. And compilers often introduce optimizations
which change the result:
 
https://www.nccs.nasa.gov/images/FloatingPoint_consistency.pdf
 
John Gustafson also talks about it in his Stanford talk posted above.
 
--
Rick C. Hodgin
"Rick C. Hodgin" <rick.c.hodgin@gmail.com>: Mar 28 08:27AM -0700

On Wednesday, March 28, 2018 at 10:38:30 AM UTC-4, David Brown wrote:
> uses, AFAIK, strict IEEE-754 is used when you favour repeatability (not
> accuracy) over speed - and something like "gcc -ffast-math" is for when
> you favour speed.
 
Weather modeling uses the double-double and quad-double libraries,
which leverage x87 hardware to chain operations to create 128-bit,
and 256-bit computations.
 
http://crd-legacy.lbl.gov/~dhbailey/mpdist/
 
MPFR maintains a list of citations using their work:
 
http://www.mpfr.org/pub.html
 
And John Gustafson discusses the shortcomings in the Stanford video
I posted above, which you didn't watch.
 
--
Rick C. Hodgin
Bonita Montero <Bonita.Montero@gmail.com>: Mar 28 05:59PM +0200

> Java running on the same computer using the same sequence of math
> ops can produce a different total than the same sequence coded in C++.
 
Where's the problem? Java is a different language and may have a
different arithmetic behaviour.
 
> It is not supposed to, but C++ will inject periodic stores and reloads
> to ensure proper rounding at times.
 
Only with the x87-FPU. With SSE or AVX, there are explicit operations
with less precision. And other architectures also have such
instructions.
 
> This means the IEEE-754 "standard" allows for variability
> in computation based on when certain steps are performed.
 
But a consistent behaviour, as one would naively expect,
can be configured.
 
> It is not mathematically accurate, nor is it required to
> produce identical results across architectures.
 
You can easily configure different compilers of the same language
on different architectures to give identical binary results.
 
> perform two. The results are different:
> 2.3. The Fused Multiply-Add (FMA)
> http://docs.nvidia.com/cuda/floating-point/index.html
 
NVidia GPUs aren't general purpose CPUs. And for the purpose these
GPUs are designed, this behaviour isn't a restriction.
 
>> That's not true. You can control how the FPU behaves through
>> instructing the compiler or the FPU control word.
 
> But you cannot guarantee the same operation across platforms.
 
It can be configured so.
Bonita Montero <Bonita.Montero@gmail.com>: Mar 28 06:03PM +0200

> Weather modeling uses the double-double and quad-double libraries,
> which leverage x87 hardware to chain operations to create 128-bit,
> and 256-bit computations.
 
I'll bet my right hand that weather modelling usually isn't done
with 128- or 256-bit FPs, as these operations are extremely slow
when done in software.
If there were noteworthy demand for 128- or 256-bit FP, there
would be hardware support in many CPUs.
scott@slp53.sl.home (Scott Lurndal): Mar 28 04:13PM


>I'll bet my right hand that weather modelling usually isn't done
>with 128- or 256-bit FPs, as these operations are extremely slow
>when done in software.
 
A large fraction of it is either done using SIMD operations or GPU (OpenCL/Cuda).
 
Machine learning is leaning towards highly-parallel (SIMD) 16-bit FP.
"Rick C. Hodgin" <rick.c.hodgin@gmail.com>: Mar 28 09:27AM -0700

On Wednesday, March 28, 2018 at 11:59:25 AM UTC-4, Bonita Montero wrote:
> > ops can produce a different total than the same sequence coded in C++.
 
> Where's the problem? Java is a different language and may have a
> different arithmetic behaviour.
 
That is the problem. It's using the same IEEE-754 engine under the
hood. The result of a calculation using the same source code copied
from a C++ app into a Java app (something like "c = a + b", but more
complex) can produce different results in the Java app than it can
in the C++ app, or vice-versa.
 
This is directly the result of the IEEE-754 standard being insufficient
in and of itself to perform math properly. It requires coddling to get
it to produce identical results across platforms (and even sometimes on
the same machine across processes which are scheduled on different
cores).
 
 
> Only with the x87-FPU. With SSE or AVX, there are explicit operations
> with less precision. And other architectures also have such
> instructions.
 
Various SIMD extensions have improved upon the issue relative to
the x87 FPU, but the issue still remains. You cannot guarantee
that an ARM-based CPU will produce the same result as an AMD64-
based CPU if they both use the same sequence of operations. You
have to manually validate it, and that's the issue being discussed
by math-based hardware people and IEEE-754, and specifically at
the present time, John Gustafson.
 
> > in computation based on when certain steps are performed.
 
> But a consistent behaviour, as one would naively expect,
> can be configured.
 
Not without manual effort. You can't take a CPU that sports a fully
IEEE-754 compliant FPU and guarantee the results will be the same.
They likely will be, or as David points out, will be close enough
that no one would ever see the variance ... but I'm talking about
the actual variance which is there.
 
In the last few bits, rounding at various levels enters in and it
does change the result. It may have no real impact on the final
result rounded to four base-10 decimal places, but if you look at
the result the hardware computed, it is (can be) different.
 
> > produce identical results across architectures.
 
> You can easily configure different compilers of the same language
> on different architectures to give identical binary results.
 
This has not been my experience. I have listened to many lectures
where they state this is one of the biggest problems with IEEE-754.
In fact, one of the authors of Java was absolutely stunned to learn
that IEEE-754 on one machine may produce different results than the
same series of calculations on another. It floored him that such a
result was possible in a "standard" such as IEEE-754:
 
https://people.eecs.berkeley.edu/~wkahan/JAVAhurt.pdf
 
> > http://docs.nvidia.com/cuda/floating-point/index.html
 
> NVidia GPUs aren't general purpose CPUs. And for the purpose these
> GPUs are designed, this behaviour isn't a restriction.
 
The document I posted outlines the issue. It is prevalent in any
implementation of the IEEE-754-2008 FMA extension. The operation
requires only one rounding, whereas separate multiply-add ops will
have two. This invariably produces different results, be it on a
GPU, CPU, or with trained mice mimicking FPU gates switching on
and off in a laboratory.
 
FMA is a different mathematical operation than a separate multiply
followed by an add. That's the end of that discussion.
 
The results from FMA are typically better than those of the separate
multiply and add operations, but the results are different.
One of the biggest arguments is whether or not a compiler should
be able to replace a multiply followed by an add with a single
fused_multiply_add operation. Most people think it shouldn't,
but to the compiler author, and in a pure mathematical sense,
it should not make any difference, so they optimize in that way.
But it does make a difference, and they shouldn't.
 
Some modern compilers are wising up to that realization. Intel's
compiler, for example, will only use FMA under certain flags for
optimization.
 
> >> instructing the compiler or the FPU control word.
 
> > But you cannot guarantee the same operation across platforms.
 
> It can be configured so.
 
It cannot be guaranteed. You can work around it. You can wriggle
your way into it, but you cannot know with certainty if a particular
IEEE-754 compliant FPU will produce the same result on architectures
A, B, C, and D, without testing them.
 
IIRC, even some Pentium III-era and older CPUs produce some different
results than modern Intel-based hardware (and not due to the Pentium
FDIV bug), but due to the way Intel removed "legacy DOS support" in the
Pentium 4 and later. You must now explicitly enable backward
compatibility flags in the Pentium 4's x87 FPU and later CPUs to get
the same results in your debugger as you would've observed previously.
 
Such things have been altered for performance, but they impact the
fundamental operations of the x87 FPU.
 
With all things floating point, you must rigorously test things to
see if they are the same across platforms, and even arguably revisions
within platforms.
 
--
Rick C. Hodgin
"Rick C. Hodgin" <rick.c.hodgin@gmail.com>: Mar 28 09:33AM -0700

On Wednesday, March 28, 2018 at 12:04:13 PM UTC-4, Bonita Montero wrote:
 
> I'll bet my right hand that weather modelling usually isn't done
> with 128- or 256-bit FPs, as these operations are extremely slow
> when done in software.
 
The QD library I cited above was created to use x87-based hardware
to allow for fast 128-bit and 256-bit computation.
 
This information came to me first-hand during an interview with
John Stone who leases time on the supercomputers in Urbana, IL.
 
> If there were noteworthy demand for 128- or 256-bit FP, there
> would be hardware support in many CPUs.
 
A discussion today on this subject on comp.arch yielded this from
Terje Mathisen, describing an aid introduced into hardware to make
software-based 128-bit and 256-bit compute go faster:
 
https://groups.google.com/d/msg/comp.arch/igzzuO9cwwM/lzqZ5KROAAAJ
 
You should download and benchmark that 128-bit and 256-bit QD
library, and compare it to the same size operations in MPFR. You
will be amazed how fast it is.
 
In 2010 when I was doing extensive large FPU computation, I was
using MPFR for many months. I interviewed John Stone at that time
and mentioned my work (as I was using a widely parallel math system
that I commented would work really well on the supercomputers there).
He asked me what I was using and I told him. He said most people he
knows use the QD library. I downloaded it and tried it and it was
notably faster, something like 6x to 10x faster IIRC.
 
He said it was expressly developed or enhanced to work with the
weather modeling done on supercomputers because it uses the FPU
hardware but is more precise than even the 80-bit formats. It
uses the 64-bit double format and spreads the mantissa bits across
two 64-bit quantities for 128-bit, and across four for 256-bit.
 
Search for "QD":
http://crd-legacy.lbl.gov/~dhbailey/mpdist/
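 
For anyone curious what "spreading the mantissa across two doubles"
looks like, here is a bare-bones sketch of the underlying idea (an
error-free TwoSum, which double-double packages build on). It is my
own illustration, not QD's actual code:
_______________________
#include <cstdio>

struct dd { double hi, lo; };   // unevaluated sum hi + lo

// TwoSum: hi + lo == a + b exactly, with hi = fl(a + b).
static dd two_sum(double a, double b)
{
    double hi = a + b;
    double v  = hi - a;
    double lo = (a - (hi - v)) + (b - v);
    return { hi, lo };
}

// Add a plain double to a double-double (simplified; no special
// handling of Inf/NaN or renormalisation corner cases).
static dd dd_add(dd x, double y)
{
    dd s = two_sum(x.hi, y);
    s.lo += x.lo;
    return two_sum(s.hi, s.lo);
}

int main()
{
    dd acc{ 1e16, 0.0 };
    acc = dd_add(acc, 1.0);     // the 1.0 survives in the low word
    std::printf("hi = %.17g  lo = %.17g\n", acc.hi, acc.lo);
    return 0;
}
_______________________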
 
--
Rick C. Hodgin
asetofsymbols@gmail.com: Mar 28 10:23AM -0700

I like fixed-point, but I don't have much experience with whether it is OK.
It seems at first sight better than any IEEE float implementation.
To implement it, it is enough that the CPU can do operations on unsigned integers, as x86 usually does.
Hergen Lehmann <hlehmann.expires.5-11@snafu.de>: Mar 28 07:53PM +0200


> I like fixed-point, but I don't have much experience with whether it is OK.
> It seems at first sight better than any IEEE float implementation.
> To implement it, it is enough that the CPU can do operations on unsigned integers, as x86 usually does.
 
Fixed point arithmetic is fine, if your application has a predictable
and rather limited value range.
 
However, proper overflow handling will be a pain in the ass (many CPUs
do not even generate an exception for integer overflows!). And more
complex operations like logarithms or trigonometric functions will be
slow without the help of the FPU.
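 
For illustration, a bare-bones Q16.16 sketch (hypothetical helpers,
not any particular library) showing both the simplicity and the
manual overflow check mentioned above:
_______________________
#include <cstdint>
#include <cstdio>
#include <stdexcept>

// Q16.16: values stored as int32_t scaled by 2^16.
using q16_16 = std::int32_t;

constexpr q16_16 from_double(double d) { return static_cast<q16_16>(d * 65536.0); }
constexpr double to_double(q16_16 q)   { return q / 65536.0; }

q16_16 q_mul(q16_16 a, q16_16 b)
{
    // Widen to 64 bits, rescale, then check the range by hand -
    // the CPU will not trap on integer overflow by itself.
    std::int64_t wide = static_cast<std::int64_t>(a) * b / 65536;
    if (wide > INT32_MAX || wide < INT32_MIN)
        throw std::overflow_error("Q16.16 multiply overflowed");
    return static_cast<q16_16>(wide);
}

int main()
{
    q16_16 a = from_double(3.25), b = from_double(-2.5);
    std::printf("3.25 * -2.5 = %g\n", to_double(q_mul(a, b)));  // -8.125
    return 0;
}
_______________________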
Gareth Owen <gwowen@gmail.com>: Mar 28 07:00PM +0100


> I'll bet my right hand that weather modelling usually isn't done
> with 128- or 256-bit FPs, as these operations are extremely slow
> when done in software.
 
There's very little benefit in keeping precision that's that many
orders of magnitude more precise than the measurements driving the
model -- doubly so when the process you're modelling is the poster child
for sensitive dependence on those initial conditions.
 
Back when I did a bit of it professionally, we'd run the same models on
multiple architectures, and the variation in FP behaviour was just
another way to add to the ensembles.
asetofsymbols@gmail.com: Mar 28 11:08AM -0700

I don't agree with one word you say.
The real problem for error is IEEE floating point, not fixed point,
because in IEEE float the error also depends on the size of the number.
Mr Flibble <flibbleREMOVETHISBIT@i42.co.uk>: Mar 28 09:12PM +0100

On 27/03/2018 19:22, Rick C. Hodgin wrote:
 
>> The only thing wrong with IEEE-754 is allowing a representation of
>> negative zero. There is no such thing as negative zero.
 
> There are many problems with IEEE-754.
 
IEEE-754 serves its purpose quite adequately, as evidenced by its successful
pervasiveness. If there were a significantly better way of doing things
we would be doing it that way by now, modulo Gustafson's promising unums
(rather than any amateurish God bothering alternative).
 
/Flibble
 
--
"Suppose it's all true, and you walk up to the pearly gates, and are
confronted by God," Bryne asked on his show The Meaning of Life. "What
will Stephen Fry say to him, her, or it?"
"I'd say, bone cancer in children? What's that about?" Fry replied.
"How dare you? How dare you create a world to which there is such misery
that is not our fault. It's not right, it's utterly, utterly evil."
"Why should I respect a capricious, mean-minded, stupid God who creates
a world that is so full of injustice and pain. That's what I would say."
"Rick C. Hodgin" <rick.c.hodgin@gmail.com>: Mar 28 01:19PM -0700

On Wednesday, March 28, 2018 at 4:13:03 PM UTC-4, Mr Flibble wrote:
 
> IEEE-754 serves its purpose quite adequately, as evidenced by its successful
> pervasiveness. If there were a significantly better way of doing things
> we would be doing it that way by now, modulo Gustafson's promising unums
 
There is inertia at this point, and the transformation from IEEE-754
to another format is a lot different than moving from FPUs to SIMD.
It will require a major overhaul of many apps to be able to handle
the new format.
 
It's not as easy a sell as you might think.
 
--
Rick C. Hodgin
Mr Flibble <flibbleREMOVETHISBIT@i42.co.uk>: Mar 28 09:33PM +0100

On 28/03/2018 21:19, Rick C. Hodgin wrote:
> It will require a major overhaul of many apps to be able to handle
> the new format.
 
> It's not as easy a sell as you might think.
 
Read my reply again: I used the word "pervasiveness", and what I didn't
say is that moving away from the status quo would be "an easy sell".
Changing from something that is pervasively bedded in is never easy.
 
/Flibble
 
--
"Suppose it's all true, and you walk up to the pearly gates, and are
confronted by God," Bryne asked on his show The Meaning of Life. "What
will Stephen Fry say to him, her, or it?"
"I'd say, bone cancer in children? What's that about?" Fry replied.
"How dare you? How dare you create a world to which there is such misery
that is not our fault. It's not right, it's utterly, utterly evil."
"Why should I respect a capricious, mean-minded, stupid God who creates
a world that is so full of injustice and pain. That's what I would say."
"Chris M. Thomasson" <invalid_chris_thomasson@invalid.invalid>: Mar 27 08:55PM -0700

Fwiw, since I have been working on fractals so much, I was wondering if
creating a C++11 implementation of a proxy collector of mine would be as
easy as it was when I did it in the past. For some more information please read here:
 
https://groups.google.com/d/topic/lock-free/X3fuuXknQF0/discussion
 
Think of user-space RCU
 
https://en.wikipedia.org/wiki/Read-copy-update
 
Here is my C++11 code, it should compile and run on 32-bit x86 with
support for DWCAS. Well, it should run on any other arch with support
for lock-free DWCAS. In other words, the following struct needs to be at
least lock-free atomic wrt std::atomic<T>:
_______________________
struct anchor
{
    std::intptr_t count;
    node* head;
};
_______________________
 
http://en.cppreference.com/w/cpp/atomic/atomic/is_lock_free
 
Has to return true. The lock-free part is the DWCAS in the acquire function.
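 
A quick way to probe that precondition on a given platform (my own
check, separate from the collector itself); with gcc on some 64-bit
targets you may need -mcx16 and -latomic for the DWCAS to come out
lock-free:
_______________________
#include <atomic>
#include <cstdint>
#include <cstdio>

struct node;                    // opaque here; only the pointer matters

struct anchor
{
    std::intptr_t count;
    node* head;
};

int main()
{
    std::atomic<anchor> a{ anchor{ 0, nullptr } };
    std::printf("std::atomic<anchor> lock-free: %d\n",
                static_cast<int>(a.is_lock_free()));
    return 0;
}
_______________________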
 
Here is my C++11 code [1]:
 
https://pastebin.com/raw/KAt4nhCj
 
The core algorithm has been verified with Relacy Race Detector:
 
https://groups.google.com/d/topic/lock-free/QuLBH87z6B4/discussion
 
I am currently writing up how this can be useful, sorry about that.
Basically, its like RCU where reader threads can read a data-structure
concurrently along with writer threads mutating it. Writer threads can
collect objects while reader threads are accessing them. Everything is
accounted for using differential counting.
 
I can get the code to compile and run on 32-bit x86.
 
Can anybody else get it to compile and/or run?
 
Thanks.
 
 
[1] code:
___________________________
/* Simple Proxy Collector (DWCAS) - C++11 Version
 
http://www.1024cores.net/home/relacy-race-detector
 
Copyright 3/27/2018
___________________________________________________*/
 
 
#if defined (_MSC_VER)
# define _ENABLE_ATOMIC_ALIGNMENT_FIX // for dwcas
