Tuesday, May 9, 2017

Digest for comp.lang.c++@googlegroups.com - 25 updates in 3 topics

Cholo Lennon <chololennon@hotmail.com>: May 09 12:10AM -0300

On 05/08/2017 07:41 PM, jacobnavia wrote:
 
> In that example, a slow asm program is compared to the compiler output.
> Yes, the compiler wins. You can write fast asm programs, but you can
> write also so slow ones, that those programs are slower than a compiler.
 
That happens a lot with C++. Everyone who wants to compare her/his
language of choice feels the need to do it against C++, but usually
the comparisons are not fair, due to personal bias and because she/he is
really not proficient in C++. Result: horrible C++ code vs
optimized/beautiful "foo" code
 
 
--
Cholo Lennon
Bs.As.
ARG
David Brown <david.brown@hesbynett.no>: May 09 01:29PM +0200

On 09/05/17 00:29, jacobnavia wrote:
>> Java or C#.
 
> Maybe you can explain what you mean with
 
> "memory-mapped read-only registers"
 
On a lot of systems, hardware registers are accessed as though they were
memory at specific fixed locations. (On x86, they often use a different
memory space - specific I/O operations rather than memory load/store
operations. But on many other processors it all just looks like memory
to the cpu.) Sometimes these registers are read/write, sometimes they
are read-only, occasionally they are write-only, and sometimes the very
act of reading or writing them triggers other events. For example,
on microcontrollers it is common for a serial port (UART) to have a
"data register": writing a value to it triggers sending the character
out on the serial port, while reading from it takes the next character
out of a receive FIFO. Such memory accesses must therefore be done very
carefully - using "volatile" in C or C++.
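The "data register" access pattern described above might be sketched like
this (the address in the comment and all names are invented for
illustration; a plain byte stands in for the hardware register so the
snippet is self-contained and runnable):

```cpp
#include <cstdint>

// On real hardware UART_DATA would be a fixed address from the chip's
// datasheet, e.g. reinterpret_cast<volatile std::uint8_t*>(0x4000C000).
// Here an ordinary byte stands in for the register for this sketch.
static std::uint8_t fake_uart_data = 0;
volatile std::uint8_t* const UART_DATA = &fake_uart_data;

void uart_send(std::uint8_t c) {
    *UART_DATA = c;    // volatile store: the compiler may not elide or reorder it
}

std::uint8_t uart_recv() {
    return *UART_DATA; // volatile load: exactly one real load per call
}
```

The point of "volatile" is that each access in the source becomes a real
access in the generated code - essential when reading or writing the
location has side effects.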
 
I have no experience with Java or C#, but I would think the virtual
machine layer between the source code and the actual operations would
make it very difficult to get this to work as the programmer expects.
 
scott@slp53.sl.home (Scott Lurndal): May 09 01:12PM

>memory at specific fixed locations. (On x86, they often use a different
>memory space - specific I/O operations rather than memory load/store
>operations.
 
X86 has three distinct "address spaces" for I/O:
 
1) A 64-kbyte I/O space accessed by the IN & OUT
instructions. This space is used by various legacy
(ISA bus) devices such as the keyboard controller and
serial ports typically provided on the Platform Controller
Hub (PCH) (southbridge) chip. For example, the BIOS/UEFI
firmware writes an 8-bit value to port 0x80 that indicates
which part of the BIOS is currently running for debugging
purposes.
2) One or more regions carved out of the address space
that map to the PCI address space for any plug-in cards
via Base Address Registers (BARs) in the PCI configuration space.
3) The PCI configuration space (accessed indirectly via addresses
0xCF8 and 0xCFC in the I/O space or directly via PCI-Express
extended configuration access method (ECAM) which is mapped
into the physical address space).
 
As David points out, these registers often have side effects upon
access (for example, some bits in a register may have R/W1C semantics,
where writing a 1 to a bit causes that bit to be cleared in the register;
typically used with interrupt status registers). This precludes
any speculative operations to these registers by the CPU (which is why
they're mapped into non-cacheable memory types by the host kernel using
the x86/AMD MTRR registers).
 
> Such memory accesses must therefore be done very
>carefully - using "volatile" in C or C++.
 
volatile only affects compiler optimizations, not the hardware, so
volatile may not be sufficient to ensure correct ordering on
an OOO core with a weaker memory model than x86 (e.g. ARM64), in
which case additional precautions (barriers like DSB) must be
taken by software to ensure the external observer (the device)
sees memory accesses in the correct order.
 
Even x86 requires memory barrier instructions in certain multithreaded
situations (see, for example, the network stack in linux where skb
handling requires fences for certain operations, DAMHIKT).
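The ordering problem above can be sketched in portable C++11, with
std::atomic_thread_fence standing in for the DSB barrier (the buffer and
doorbell names are invented, and an ordinary array stands in for device
memory so the sketch runs anywhere; real device memory needs the MMU
attributes discussed elsewhere in the thread):

```cpp
#include <atomic>
#include <cstdint>

// Invented stand-ins for a DMA buffer and a device "doorbell" register.
static std::uint32_t dma_buffer[16];
static volatile std::uint32_t doorbell = 0;

void kick_device(std::uint32_t len) {
    for (std::uint32_t i = 0; i < len; ++i)
        dma_buffer[i] = i;          // fill the buffer the device will read

    // Release fence: on a weakly ordered CPU (e.g. ARM64) this emits a
    // barrier so the buffer writes become visible before the doorbell write.
    std::atomic_thread_fence(std::memory_order_release);

    doorbell = len;                 // volatile store: the "go" signal
}
```

Without the fence, an out-of-order core may let the doorbell store become
visible first, and the device would read a stale buffer.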
 
>I have no experience with Java or C#, but I would think the virtual
>machine layer between the source code and the actual operations would
>make it very difficult to get this to work as the programmer expects.
 
Technically, one can use JNI with Java to access hardware - but that
is more like bolting a drone motor onto an F-18 and expecting it to
perform well.
scott@slp53.sl.home (Scott Lurndal): May 09 01:14PM

>the comparisons are not fair due to person bias and because she/he is
>really not proficient in C++. Result: Horrible C++ code vs
>optimized/beautiful "foo" code
 
I challenge you to take any reasonable COBOL program and try to make it
"optimized/beautiful C++" code. Good luck with that.
Cholo Lennon <chololennon@hotmail.com>: May 09 10:56AM -0300

On 09/05/17 10:14, Scott Lurndal wrote:
>> optimized/beautiful "foo" code
 
> I challenge you to take any reasonable COBOL program and try to make it
> "optimized/beautiful C++" code. Good luck with that.
 
You are being too literal... what I wanted to say is that people usually
put effort into tuning their code (in their favourite language), but not
into the C++ version, which ends up awful, or is just a transliteration
of the code from the other language (ignoring the way C++ is meant to be
used). This, of course, results in an unfair comparison.
 
 
--
Cholo Lennon
Bs.As.
ARG
David Brown <david.brown@hesbynett.no>: May 09 04:19PM +0200

On 09/05/17 15:12, Scott Lurndal wrote:
> which case additional precautions (barriers like DSB) must be
> taken by software to ensure the external observer (the device)
> sees memory accesses in the correct order.
 
Yes, indeed. You usually also need help from the MMU to mark the area
as non-cacheable, and perhaps other details to avoid store buffers,
write combining, speculative reads, etc. I don't do this kind of thing
on x86 so I don't know the details here, but it was "fun" getting it all
correct on a dual core PowerPC microcontroller.
 
bitrex <bitrex@de.lete.earthlink.net>: May 09 11:24AM -0400

On 05/08/2017 06:25 PM, jacobnavia wrote:
 
> based on the C language, The compiler generates assembly.
 
> So, I like (and I have done it very often) to outsmart ANY compiler in
> just a few lines of ASM. It is fun.
 
It's probably true on x86, but I never really knew much x86 assembly.
However, back when I was pretty good at 8-bit AVR assembly and also wrote
code in C, I found avr-gcc extremely difficult to beat, at least when it
came to relatively short procedural functions. I knew the best way to do
that sequence of, say, 20 operations. It knew, too. The ISA for those RISC
microcontrollers is pretty brief and there are only so many ways to do
things, I guess, and the manufacturer sez its ISA was optimized to help
C compilers output efficient assembly anyway.
 
The one thing it sometimes seemed pathologically reluctant to do was
convert switch statements/if-then-else structures into indirect
jump tables when that would've made sense from both an execution-speed
and code-space perspective.
bitrex <bitrex@de.lete.earthlink.net>: May 09 11:33AM -0400

On 05/08/2017 05:31 PM, Richard wrote:
 
 
> Modeling memory-mapped read-only registers with zero overhead compared
> to assembly is no big deal for C++. It would be amazingly painful in
> Java or C#.
 
Basically impossible in Java. If you need to access
architecture-specific machine registers at all, Java is definitely the
wrong language. Unless you mean "model" in the sense of writing an
emulator or something...
spud@potato.field: May 09 03:37PM

On Tue, 9 May 2017 11:24:42 -0400
 
>> So, I like (and I have done it very often) to outsmart ANY compiler in
>> just a few lines of ASM. It is fun.
 
>It's probably true on x86, but I never really knew much x86 assembly.
 
The optimisation techniques used in modern compilers have been developed by
teams of dozens or even hundreds of people over the years. Anyone who thinks
they can outsmart a modern compiler at assembly optimisation for x86 in all but
a few edge cases is deluding themselves. The instruction set is now so large
and the pipelines so complex that it's next to impossible for most people to
really get a good idea of which instructions to use and how to sequence them
to get the best result.
 
--
Spud
bitrex <bitrex@de.lete.earthlink.net>: May 09 11:37AM -0400

On 05/09/2017 09:12 AM, Scott Lurndal wrote:
 
 
> Technically, one can use JNI with java to access hardware - but that
> is more like bolting a drone motor on a F-18 and expecting it to
> perform well.
 
Sort of defeats the point of the whole "walled garden" virtual machine
paradigm of the language, yeah?
 
If your application requires that you have to access machine-specific
hardware via some method other than the APIs provided then I'd say
you're SOL and shouldn't be using that language.
jacobnavia <jacob@jacob.remcomp.fr>: May 09 05:48PM +0200

> and the pipelines so complex that its next to impossible for most people to
> really get a good idea of which instructions to use and how to sequence them
> to get the best result.
 
I have optimized my 128 bit code in asm and it is about 100% to 200%
faster than gcc.
 
I said (and you cite my words, so you must have read them...) that the
scope reduction possible for a human but not for a compiler is the
crucial point that makes the difference.
 
A compiler must respect calling conventions, for instance. An asm
programmer need not.
bitrex <bitrex@de.lete.earthlink.net>: May 09 11:53AM -0400

> and the pipelines so complex that its next to impossible for most people to
> really get a good idea of which instructions to use and how to sequence them
> to get the best result.
 
I do know it very much seems that way on 8 bit, at least. ;-)
 
The few times I have found myself resorting to inline assembly
recently have been in little embedded applications where it sometimes makes
sense to enforce certain things that a universal C/C++ compiler like gcc
wouldn't know about, as you're essentially dealing with programmable
hardware that interfaces with other hardware, not a general purpose
computer.
 
For example, GCC pretty much ignores the "register" keyword, but with a
little hacking in asm you can enforce that yes, this 4-byte global
variable should remain in this set of registers forever and ever. As you
have dozens and dozens of GPRs available, occupying a couple of them
permanently doesn't affect the efficiency of the rest of the code at
all, and as the architecture is very simple you know that it's always
going to be more expensive to keep reading and writing this particular
variable - which you need to update every interrupt cycle - in and out
of SRAM than just to leave it in place.
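The trick described above is GCC's global register variable extension. A
minimal sketch (the register name "r3", the variable, and the functions
are all invented for illustration; the non-AVR branch uses a plain
variable so the snippet builds on a PC):

```cpp
#include <cstdint>

// On avr-gcc one can pin a global to a GPR so it never touches SRAM:
#if defined(__AVR__)
register std::uint8_t tick asm("r3");   // hypothetical register choice
#else
static std::uint8_t tick = 0;           // portable stand-in for this sketch
#endif

// An interrupt handler updating `tick` then costs a single register
// operation instead of a load/modify/store round trip to SRAM.
void on_timer_tick() { ++tick; }
std::uint8_t read_tick() { return tick; }
```

The cost of the trick is that the chosen register is permanently
unavailable to the compiler across the whole program, which is why it
only pays off when registers are plentiful and the variable is hot.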
Bonita Montero <Bonita.Montero@gmail.com>: May 09 06:03PM +0200

> I have optimized my 128 bit code in asm and it is about 100% to 200%
> faster than gcc.
 
I think you're thinking of SSE code. You can use SSE intrinsics
in C++ and get the same or even better performance than hand-written asm.
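For instance, a lane-wise 64-bit add via SSE2 intrinsics (a minimal
sketch; the function name is invented, and whether this beats hand-written
asm depends on the surrounding code - the point is that the compiler still
handles register allocation and scheduling around the intrinsic):

```cpp
#include <emmintrin.h>   // SSE2 intrinsics
#include <cstdint>

// Adds two pairs of 64-bit integers with a single SSE2 instruction.
void add2x64(const std::uint64_t* x, const std::uint64_t* y,
             std::uint64_t* out) {
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi64(a, b));
}
```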
bitrex <bitrex@de.lete.earthlink.net>: May 09 12:08PM -0400

On 05/09/2017 11:48 AM, jacobnavia wrote:
 
 
> I said (and you cite my words so you must have read it...) that the
> scope reduction possible for a human but not for a compiler is the
> crucial point that makes the difference.
 
It seems like a tautological statement that's true on its face, though.
To take an absurd example: if you know your architecture, your register
size, your instruction set, and the size of your data, and your data
set is small enough, then you could certainly write an asm program that
never once strikes out to main memory to do any of its real work. You
could write a bare-metal asm program that writes pseudorandom 8-bit
characters to the display buffer faster than any std::string-based
implementation ever could.
 
Okay. So what.
scott@slp53.sl.home (Scott Lurndal): May 09 04:51PM

>> faster than gcc.
 
>I think you're thinking about SSE-code. You can use SSE-intrinsics
>in C++ and have the same or even better performance of a compiler.
 
Or just use a good auto-vectorizing compiler.
 
https://en.wikipedia.org/wiki/Automatic_vectorization
jacobnavia <jacob@jacob.remcomp.fr>: May 09 07:03PM +0200

Le 09/05/2017 à 18:03, Bonita Montero a écrit :
>> faster than gcc.
 
> I think you're thinking about SSE-code. You can use SSE-intrinsics
> in C++ and have the same or even better performance of a compiler.
 
Sure but then... you are programming in asm dear!
Bonita Montero <Bonita.Montero@gmail.com>: May 09 07:11PM +0200

>> I think you're thinking about SSE-code. You can use SSE-intrinsics
>> in C++ and have the same or even better performance of a compiler.
 
> Sure but then... you are programming in asm dear!
 
That's like saying you're programming in asm when you write
"a = b + c" because you can imagine what the resulting code is.
Bonita Montero <Bonita.Montero@gmail.com>: May 09 07:16PM +0200

>> in C++ and have the same or even better performance of a compiler.
 
> Or just use a good auto-vectorizing compiler.
> https://en.wikipedia.org/wiki/Automatic_vectorization
 
Automatic vectorization only works on code patterns closely matching
those the compiler knows; slight differences often prevent it.
And consider the main loop of the STREAM Triad benchmark:
 
void tuned_STREAM_Triad(STREAM_TYPE scalar)
{
    ssize_t j;
 
    for (j = 0; j < STREAM_ARRAY_SIZE; j++)
        a[j] = b[j] + scalar * c[j];
}
 
This loop can easily be detected by a compiler to be vectorizable.
But the compiler doesn't know whether the arrays are properly aligned
for the SSE/AVX data types, so it can't use the aligned loads/stores.
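One way to hand the compiler that missing alignment fact is to align the
arrays explicitly and assert the alignment at the use site. A sketch
using the GCC/Clang `__builtin_assume_aligned` extension (the array names
mirror the Triad loop above, but the sizes and function name are
invented):

```cpp
#include <cstddef>

constexpr std::size_t N = 1024;
alignas(32) double a[N], b[N], c[N];   // 32-byte alignment suits AVX

void tuned_triad(double scalar) {
    // Promise the compiler the pointers are 32-byte aligned, making it
    // free to emit aligned vector loads/stores (GCC/Clang extension).
    double* pa = static_cast<double*>(__builtin_assume_aligned(a, 32));
    const double* pb = static_cast<const double*>(__builtin_assume_aligned(b, 32));
    const double* pc = static_cast<const double*>(__builtin_assume_aligned(c, 32));
    for (std::size_t j = 0; j < N; ++j)
        pa[j] = pb[j] + scalar * pc[j];
}
```

C++17 later standardized a similar facility as std::assume_aligned in
<memory> (C++20), but the extension above was the usual route in 2017.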
scott@slp53.sl.home (Scott Lurndal): May 09 05:30PM


>This loop can be detected by a compiler to be vectorizable easily.
>But the compiler doesn't know if the arrays are aligned properly
>for the SSE/AVX datatypes. So it can't use the aligned loads/stores.
 
Except STREAM is benchmarking the memory system (specifically bandwidth),
and benchmarks are generally written in such a way as to prevent overly
aggressive compiler optimizations.
Bonita Montero <Bonita.Montero@gmail.com>: May 09 08:29PM +0200

> ..., and benchmarks are generally written in such a way
> as to prevent overly aggressive compiler optimizations.
 
And compiler optimizations are, in turn, arranged to cope with
common benchmarks as well as possible.
Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: May 09 07:34PM +0100

On Tue, 9 May 2017 19:03:04 +0200
 
> > I think you're thinking about SSE-code. You can use SSE-intrinsics
> > in C++ and have the same or even better performance of a compiler.
 
> Sure but then... you are programming in asm dear!
 
Don't be so fucking sexist. Apart from which, you have missed the
point.
jacobnavia <jacob@jacob.remcomp.fr>: May 09 09:16PM +0200

Le 09/05/2017 à 20:34, Chris Vine a écrit :
> Don't be so fucking sexist.
 
????
Sexist because I say that using intrinsics is using assembly language?
 
?????
bitrex <bitrex@de.lete.earthlink.net>: May 09 11:05AM -0400

On 05/07/2017 02:29 AM, Marcel Mueller wrote:
 
 
> Probably. This explains the pure function exception, but a race with
> placement new could result in a similar problem.
 
> Marcel
 
Thanks, the reason for the segfault does indeed appear to be that I was,
like a dummy, not destroying the old objects prior to placing the new ones.
 
The reason I'm recycling the pointers from back to front is basically
because those addresses have already been allocated and "handed off" to
the display thread in the form of weak_ptrs at startup prior to
initialization of the logic thread. This buffer of "effect objects" is
contained within another object that handles the logic of the main
object to which the motion blur effect is being applied; a pointer to
its sprite object is also handed off to the display thread in the same
fashion.
 
I don't think I can dynamically allocate new blocks of the buffer with
new addresses without having to implement some observer for the display
thread that watches every object for whenever anything changes.
 
If the containing object is destroyed then the use count of all those
shared_ptrs drops to zero, and the display thread can simply check that
to dump them from its list of active objects on the display. If I then
want to resurrect Super Mario and make him appear back on the screen
then yeah I will need to have the display-thread re-acquire a new set of
pointers, but I can have it simply observe the logic thread as a whole
for when things change, rather than every individual type of object.
That's the idea.
 
 
If I perform the operations in this sequence:
 
auto obj_ptr = _effect_object_buf.back().get();   // raw pointer to the oldest slot
obj_ptr->~DisplayObject2D();                      // destroy the old object in place
new (obj_ptr) DisplayObject2D(*_effect_target);   // placement-new the replacement
_effect_object_buf.push_front(std::move(_effect_object_buf.back()));
_effect_object_buf.pop_back();                    // rotate the recycled slot to the front
 
everything seems to work fine with no lock on these operations required,
only the weak_ptr lock when the display thread accesses them. No crash
has happened in testing at least, but I'm not experienced enough with
threading to know whether there's actually a race hazard doing it this
way or not.
bitrex <bitrex@de.lete.earthlink.net>: May 09 11:09AM -0400

On 05/09/2017 11:05 AM, bitrex wrote:
 
> Thanks, the reason for the segfault appears to be indeed that I was,
> like a dummy, not destructing the old objects prior to placing the new
> ones.
 
I think the effect looks pretty nice for a software-generated effect,
though! Someday I'll learn more about OpenGL shaders and stuff...
 
http://imgur.com/a/TmFyc
spud@potato.field: May 09 01:34PM

Hi
 
I was wondering if it's possible for an object constructor to be passed its
position in an array, e.g.:
 
class my_class
{
:
:
} myarray[10][20];
 
Is there a way for each individual object to find its x,y location? Obviously
I can just declare **myarray and then do a load of messy manual allocation,
passing the x,y to the constructor at "new", but perhaps the 2011 or 2014
standards have automated this at all? Or is there some STL trick that can
achieve this using a container?
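For what it's worth, a sketch of one common workaround (not something the
2011/2014 standards added; the member function and names here are
invented): instead of passing x,y into the constructor, let each object
compute its coordinates on demand from its own address relative to the
array base. Strictly speaking, pointer arithmetic across sub-array
boundaries is murky in the standard, but the technique is widespread in
practice.

```cpp
#include <cstddef>

struct my_class {
    // Computes this object's (x, y) position within myarray.
    void where(std::size_t& x, std::size_t& y) const;
};

constexpr std::size_t ROWS = 10, COLS = 20;
my_class myarray[ROWS][COLS];

void my_class::where(std::size_t& x, std::size_t& y) const {
    // Offset of this element from the first element of the 2D array.
    const std::ptrdiff_t i = this - &myarray[0][0];
    x = static_cast<std::size_t>(i) / COLS;   // row index
    y = static_cast<std::size_t>(i) % COLS;   // column index
}
```

The same arithmetic works inside the constructor too, but doing it on
demand keeps the class decoupled from any single array instance's
construction order.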
 
Thanks for any help
 
--
Spud
