- "Need for Speed - C++ versus Assembly Language" - 22 Updates
- std::experimental::atomic_shared_ptr - 2 Updates
- Finding an object's location in an array - 1 Update
Cholo Lennon <chololennon@hotmail.com>: May 09 12:10AM -0300 On 05/08/2017 07:41 PM, jacobnavia wrote: > In that example, a slow asm program is compared to the compiler output. > Yes, the compiler wins. You can write fast asm programs, but you can > write also so slow ones, that those programs are slower than a compiler. That happens a lot with C++. Everyone who wants to compare their language of choice feels the need to do it against C++, but the comparisons are usually unfair due to personal bias and because the author is not really proficient in C++. Result: horrible C++ code vs optimized/beautiful "foo" code -- Cholo Lennon Bs.As. ARG |
David Brown <david.brown@hesbynett.no>: May 09 01:29PM +0200 On 09/05/17 00:29, jacobnavia wrote: >> Java or C#. > Maybe you can explain what you mean with > "memory-mapped read-only registers" On a lot of systems, hardware registers are accessed as though they were memory at specific fixed locations. (On x86, they often use a different memory space - specific I/O operations rather than memory load/store operations. But on many other processors it all just looks like memory to the cpu.) Sometimes these registers are read/write, sometimes they are read-only, occasionally they are write-only, and sometimes the very act of reading or writing them triggers other events. For example, on microcontrollers it is common for a serial port (UART) to have a "data register". Writing a value to it triggers sending the character out on the serial port; reading from it takes the next character out of a receive FIFO. Such memory accesses must therefore be done very carefully - using "volatile" in C or C++. I have no experience with Java or C#, but I would think the virtual machine layer between the source code and the actual operations would make it very difficult to get this to work as the programmer expects. |
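[Editor's sketch of the volatile point above. The function names are made up for illustration; on real hardware the pointer would be a fixed, device-specific address.]

```cpp
#include <cstdint>

// Every access through a volatile-qualified lvalue must be performed
// exactly as written -- the compiler may not cache, elide, or coalesce
// the loads and stores, which is essential when the access itself has
// a side effect (sending a character, popping a FIFO).
inline void uart_put(volatile std::uint8_t* data_reg, std::uint8_t c) {
    *data_reg = c;        // the store itself triggers transmission
}

inline std::uint8_t uart_get(volatile std::uint8_t* data_reg) {
    return *data_reg;     // the load pops the receive FIFO
}
```

On a real microcontroller the register pointer would be something like `reinterpret_cast<volatile std::uint8_t*>(0x40000000)` (that address is invented here, not from any datasheet).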
scott@slp53.sl.home (Scott Lurndal): May 09 01:12PM >memory at specific fixed locations. (On x86, they often use a different >memory space - specific I/O operations rather than memory load/store >operations. X86 has three distinct "address spaces" for I/O:

1) One 64 kbyte I/O space accessed by the IN & OUT instructions. This space is used by various legacy (ISA bus) devices such as the keyboard controller and serial ports, typically provided on the Platform Controller Hub (PCH) (southbridge) chip. For example, the BIOS/UEFI firmware writes an 8-bit value to port 0x80 that indicates which part of the BIOS is currently running, for debugging purposes.

2) One or more regions carved out of the address space that map to the PCI address space for any plug-in cards via Base Address Registers (BARs) in the PCI configuration space.

3) The PCI configuration space (accessed indirectly via addresses 0xCF8 and 0xCFC in the I/O space, or directly via the PCI-Express extended configuration access method (ECAM), which is mapped into the physical address space).

As David points out, these registers often have side effects upon access (for example, some bits in a register may have R/W1C semantics, where writing a bit will cause that bit to be reset in the register; typically used with interrupt status registers). This precludes any speculative operations to these registers by the CPU (which is why they're mapped into non-cacheable memory types by the host kernel using the x86/amd MTRR registers). > Such memory accesses must therefore be done very >carefully - using "volatile" in C or C++. volatile only affects compiler optimizations, not the hardware, so volatile may not be sufficient to ensure correct ordering on an OOO core with a weaker memory model than x86 (e.g. ARM64), in which case additional precautions (barriers like DSB) must be taken by software to ensure the external observer (the device) sees memory accesses in the correct order.
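[Editor's sketch of the R/W1C idiom mentioned above. The register name is hypothetical; the point is that writing back exactly the bits you just read acknowledges those interrupts without clobbering any bit that became set between the read and the write.]

```cpp
#include <cstdint>

// Write-1-to-clear handling: snapshot the pending bits, then write
// that snapshot back. On W1C hardware, only the bits that were set
// in the written value are cleared; all other bits are untouched.
inline std::uint32_t ack_pending(volatile std::uint32_t* status_reg) {
    std::uint32_t pending = *status_reg;  // snapshot pending interrupt bits
    *status_reg = pending;                // W1C: clears only the bits we saw
    return pending;                       // caller dispatches on these
}
```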
Even x86 requires memory barrier instructions in certain multithreaded situations (see, for example, the network stack in Linux, where skb handling requires fences for certain operations, DAMHIKT). >I have no experience with Java or C#, but I would think the virtual >machine layer between the source code and the actual operations would >make it very difficult to get this to work as the programmer expects. Technically, one can use JNI with Java to access hardware - but that is more like bolting a drone motor on an F-18 and expecting it to perform well. |
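[Editor's sketch of the barrier point, using the standard C++ fence. The descriptor/doorbell names are illustrative, not from any real driver; on ARM64 the fence typically compiles to a DMB-class instruction.]

```cpp
#include <atomic>
#include <cstdint>

// On a weakly ordered core, the release fence keeps the device from
// observing the doorbell write before the descriptor contents are
// visible; volatile alone would not prevent the hardware reordering.
void post_descriptor(std::uint32_t* desc,
                     volatile std::uint32_t* doorbell,
                     std::uint32_t payload) {
    desc[0] = payload;                                    // fill the descriptor
    std::atomic_thread_fence(std::memory_order_release);  // order it first
    *doorbell = 1;                                        // then notify device
}
```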
scott@slp53.sl.home (Scott Lurndal): May 09 01:14PM >the comparisons are not fair due to person bias and because she/he is >really not proficient in C++. Result: Horrible C++ code vs >optimized/beautiful "foo" code I challenge you to take any reasonable COBOL program and try to make it "optimized/beautiful C++" code. Good luck with that. |
Cholo Lennon <chololennon@hotmail.com>: May 09 10:56AM -0300 On 09/05/17 10:14, Scott Lurndal wrote: >> optimized/beautiful "foo" code > I challenge you to take any reasonable COBOL program and try to make it > "optimized/beautiful C++" code. Good luck with that. You are being too literal... what I wanted to say is that people usually put effort into tuning their code (in their favourite language), but not into the C++ version, which ends up awful, or is just a copy of the code in the other language (ignoring the way C++ is meant to be used). This, of course, results in an unfair comparison. -- Cholo Lennon Bs.As. ARG |
David Brown <david.brown@hesbynett.no>: May 09 04:19PM +0200 On 09/05/17 15:12, Scott Lurndal wrote: > which case additional precautions (barriers like DSB) must be > taken by software to ensure the external observer (the device) > sees memory accesses in the correct order. Yes, indeed. You usually also need help from the MMU to mark the area as non-cacheable, and perhaps other details to avoid store buffers, write combining, speculative reads, etc. I don't do this kind of thing on x86 so I don't know the details here, but it was "fun" getting it all correct on a dual core PowerPC microcontroller. |
bitrex <bitrex@de.lete.earthlink.net>: May 09 11:24AM -0400 On 05/08/2017 06:25 PM, jacobnavia wrote: > based on the C language, The compiler generates assembly. > So, I like (and I have done it very often) to outsmart ANY compiler in > just a few lines of ASM. It is fun. It's probably true on x86, but I never really knew much x86 assembly. However, back when I was pretty good at 8-bit AVR assembly and also wrote code in C, I found avr-gcc extremely difficult to beat, at least when it came to relatively short procedural functions. I knew the best way to do that sequence of, say, 20 operations. It knew, too. The ISA for those RISC microcontrollers is pretty brief and there's only so many ways to do things, I guess, and the manufacturer sez its ISA was optimized to help C compilers output efficient assembly anyway. The only thing it sometimes seemed pathological about was not converting switch statements/if-then-else structures into indirect jump tables when it would've made sense from both an execution speed and code space perspective. |
bitrex <bitrex@de.lete.earthlink.net>: May 09 11:33AM -0400 On 05/08/2017 05:31 PM, Richard wrote: > Modeling memory-mapped read-only registers with zero overhead compared > to assembly is no big deal for C++. It would be amazingly painful in > Java or C#. Basically impossible in Java. If you need to access architecture-specific machine registers at all Java is definitely the wrong language. Unless you mean "model" in the sense of writing an emulator or something... |
spud@potato.field: May 09 03:37PM On Tue, 9 May 2017 11:24:42 -0400 >> So, I like (and I have done it very often) to outsmart ANY compiler in >> just a few lines of ASM. It is fun. >It's probably true on x86, but I never really knew much x86 assembly. The optimisation techniques used in modern compilers have been developed by teams of dozens or even hundreds over the years. Anyone who thinks they can outsmart a modern compiler in assembly optimisation for x86 in all but a few edge cases is deluding themselves. The instruction set is now so large and the pipelines so complex that it's next to impossible for most people to really get a good idea of which instructions to use and how to sequence them to get the best result. -- Spud |
bitrex <bitrex@de.lete.earthlink.net>: May 09 11:37AM -0400 On 05/09/2017 09:12 AM, Scott Lurndal wrote: > Technically, one can use JNI with java to access hardware - but that > is more like bolting a drone motor on a F-18 and expecting it to > perform well. Sort of defeats the point of the whole "walled garden" virtual machine paradigm of the language, yeah? If your application requires that you have to access machine-specific hardware via some method other than the APIs provided then I'd say you're SOL and shouldn't be using that language. |
jacobnavia <jacob@jacob.remcomp.fr>: May 09 05:48PM +0200 > and the pipelines so complex that its next to impossible for most people to > really get a good idea of which instructions to use and how to sequence them > to get the best result. I have optimized my 128-bit code in asm and it is about 100% to 200% faster than gcc's output. I said (and you cite my words, so you must have read it...) that the scope reduction possible for a human but not for a compiler is the crucial point that makes the difference. A compiler must respect calling conventions, for instance. An asm language programmer need not. |
bitrex <bitrex@de.lete.earthlink.net>: May 09 11:53AM -0400 > and the pipelines so complex that its next to impossible for most people to > really get a good idea of which instructions to use and how to sequence them > to get the best result. I do know it very much seems that way on 8 bit, at least. ;-) The few times I have found myself resorting to inline assembly there recently, it's because in little embedded applications it sometimes makes sense to enforce certain things that a universal C/C++ compiler like gcc wouldn't know about, as you're essentially dealing with programmable hardware that interfaces with other hardware, not a general-purpose computer. For example, GCC pretty much ignores the "register" keyword, but with a little hacking in asm you can enforce that yes, this 4-byte global variable should remain in this set of registers forever and ever. As you have dozens and dozens of GPRs available, occupying a couple of them permanently doesn't affect the efficiency of the rest of the code at all, and as the architecture is very simple you know that it's always going to be more expensive to read and write this particular variable (which you need to update every interrupt cycle) in and out of SRAM than just to leave it in place. |
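[Editor's sketch of the register-pinning trick, using GCC's non-standard global register variable extension. The choice of r3 and the variable name are assumptions for illustration; the non-AVR branch is only there so the sketch compiles on a host.]

```cpp
#include <cstdint>

#if defined(__AVR__)
// Pinned into r3 for the whole program: every update is a single
// register operation, never a load/store to SRAM.
register std::uint8_t isr_tick asm("r3");
#else
std::uint8_t isr_tick = 0;  // plain global, host-side stand-in
#endif

inline std::uint8_t bump_tick() {
    return ++isr_tick;  // on AVR: one INC of r3, no memory traffic
}
```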
Bonita Montero <Bonita.Montero@gmail.com>: May 09 06:03PM +0200 > I have optimized my 128 bit code in asm and it is about 100% to 200% > faster than gcc. I think you're thinking about SSE code. You can use SSE intrinsics in C++ and get the same or even better performance than the compiler. |
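[Editor's sketch of what the intrinsics route looks like, for readers who haven't seen it; x86-only, using the documented SSE2 intrinsics. You choose the instructions, but the compiler still does register allocation and scheduling, which is exactly the "is this asm or not?" argument below.]

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (x86/x86-64 only)

// Add two arrays of 4 doubles, two lanes at a time, with unaligned
// loads/stores so no alignment promise is needed.
void add4(const double* a, const double* b, double* out) {
    __m128d lo = _mm_add_pd(_mm_loadu_pd(a),     _mm_loadu_pd(b));
    __m128d hi = _mm_add_pd(_mm_loadu_pd(a + 2), _mm_loadu_pd(b + 2));
    _mm_storeu_pd(out,     lo);
    _mm_storeu_pd(out + 2, hi);
}
```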
bitrex <bitrex@de.lete.earthlink.net>: May 09 12:08PM -0400 On 05/09/2017 11:48 AM, jacobnavia wrote: > I said (and you cite my words so you must have read it...) that the > scope reduction possible for a human but not for a compiler is the > crucial point that makes the difference. It seems like a tautological statement that's true on its face, though. To take an absurd example if you know your architecture, you know your register size, you know your instruction set, you know the size of your data, you're intimately familiar with all those aspects and your data set is small enough then you could certainly write an asm program that never once struck out to main memory to do any of its real work. You could write an asm program to write pseudorandom 8 bit characters to the display buffer bare-metal faster than any std::string based implementation ever could. Okay. So what. |
scott@slp53.sl.home (Scott Lurndal): May 09 04:51PM >> faster than gcc. >I think you're thinking about SSE-code. You can use SSE-intrinsics >in C++ and have the same or even better performance of a compiler. Or just use a good auto-vectorizing compiler. https://en.wikipedia.org/wiki/Automatic_vectorization |
jacobnavia <jacob@jacob.remcomp.fr>: May 09 07:03PM +0200 Le 09/05/2017 à 18:03, Bonita Montero a écrit : >> faster than gcc. > I think you're thinking about SSE-code. You can use SSE-intrinsics > in C++ and have the same or even better performance of a compiler. Sure but then... you are programming in asm dear! |
Bonita Montero <Bonita.Montero@gmail.com>: May 09 07:11PM +0200 >> I think you're thinking about SSE-code. You can use SSE-intrinsics >> in C++ and have the same or even better performance of a compiler. > Sure but then... you are programming in asm dear! That's like saying you're programming in asm when you write "a = b + c" because you can imagine what the resulting code will be. |
Bonita Montero <Bonita.Montero@gmail.com>: May 09 07:16PM +0200 >> in C++ and have the same or even better performance of a compiler. > Or just use a good auto-vectorizing compiler. > https://en.wikipedia.org/wiki/Automatic_vectorization Automatic vectorization only works on code that closely matches patterns the compiler knows; slight differences often prevent it. And consider the main loop of the STREAM Triad benchmark:

void tuned_STREAM_Triad(STREAM_TYPE scalar)
{
    ssize_t j;
    for (j = 0; j < STREAM_ARRAY_SIZE; j++)
        a[j] = b[j] + scalar * c[j];
}

This loop can easily be detected by a compiler to be vectorizable. But the compiler doesn't know whether the arrays are aligned properly for the SSE/AVX datatypes, so it can't use the aligned loads/stores. |
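[Editor's note: if alignment really is the blocker, you can promise it to the compiler yourself. This sketch uses the GCC/Clang `__builtin_assume_aligned` extension (C++20 added `std::assume_aligned` for the same purpose); the 32-byte figure is an assumption sized for AVX vectors.]

```cpp
#include <cstddef>

// Same triad loop, but with the arrays promised to be 32-byte aligned,
// letting an auto-vectorizer emit aligned vector loads/stores. The
// promise is unchecked: passing misaligned pointers is undefined.
void triad(double* a, const double* b, const double* c,
           double scalar, std::size_t n) {
    double* pa = static_cast<double*>(__builtin_assume_aligned(a, 32));
    auto* pb = static_cast<const double*>(__builtin_assume_aligned(b, 32));
    auto* pc = static_cast<const double*>(__builtin_assume_aligned(c, 32));
    for (std::size_t j = 0; j < n; ++j)
        pa[j] = pb[j] + scalar * pc[j];
}
```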
scott@slp53.sl.home (Scott Lurndal): May 09 05:30PM >This loop can be detected by a compiler to be vectorizable easily. >But the compiler doesn't know if the arrays are aligned properly >for the SSE/AVX datatypes. So it can't use the aligned loads/stores. Except STREAM is benchmarking the memory system (specifically bandwidth), and benchmarks are generally written in such a way as to prevent overly aggressive compiler optimizations. |
Bonita Montero <Bonita.Montero@gmail.com>: May 09 08:29PM +0200 > ..., and benchmarks are generally written in such a way > as to prevent overly agressive compiler optimizations. And compiler optimizations are, in turn, arranged to cope with common benchmarks as well as possible. |
Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: May 09 07:34PM +0100 On Tue, 9 May 2017 19:03:04 +0200 > > I think you're thinking about SSE-code. You can use SSE-intrinsics > > in C++ and have the same or even better performance of a compiler. > Sure but then... you are programming in asm dear! Don't be so fucking sexist. Apart from which, you have missed the point. |
jacobnavia <jacob@jacob.remcomp.fr>: May 09 09:16PM +0200 Le 09/05/2017 à 20:34, Chris Vine a écrit : > Don't be so fucking sexist. ???? Sexist because I tell that using intrinsics is using assembly language? ????? |
bitrex <bitrex@de.lete.earthlink.net>: May 09 11:05AM -0400 On 05/07/2017 02:29 AM, Marcel Mueller wrote: > Probably. This explains the pure function exception, but a race with > placement new could result in a similar problem. > Marcel Thanks, the reason for the segfault appears to be indeed that I was, like a dummy, not destructing the old objects prior to placing the new ones. The reason I'm recycling the pointers from back to front is basically because those addresses have already been allocated and "handed off" to the display thread in the form of weak_ptrs at startup prior to initialization of the logic thread. This buffer of "effect objects" is contained within another object that handles the logic of the main object to which the motion blur effect is being applied; a pointer to its sprite object is also handed off to the display thread in the same fashion. I don't think I can dynamically allocate new blocks of the buffer with new addresses without having to implement some observer for the display thread that watches every object for whenever anything changes. If the containing object is destroyed then the use count of all those shared_ptrs drops to zero, and the display thread can simply check that to dump them from its list of active objects on the display. If I then want to resurrect Super Mario and make him appear back on the screen then yeah I will need to have the display-thread re-acquire a new set of pointers, but I can have it simply observe the logic thread as a whole for when things change, rather than every individual type of object. That's the idea. 
If I perform the operations in this sequence:

auto obj_ptr = _effect_object_buf.back().get();
obj_ptr->~DisplayObject2D();
new (obj_ptr) DisplayObject2D(*_effect_target);
_effect_object_buf.push_front(std::move(_effect_object_buf.back()));
_effect_object_buf.pop_back();

everything seems to work fine with no lock on these operations required, only the weak_ptr lock when the display thread accesses them. It hasn't happened in testing at least, but I'm not experienced enough with threading to know if there's actually a race hazard doing it this way, or not. |
bitrex <bitrex@de.lete.earthlink.net>: May 09 11:09AM -0400 On 05/09/2017 11:05 AM, bitrex wrote: > Thanks, the reason for the segfault appears to be indeed that I was, > like a dummy, not destructing the old objects prior to placing the new > ones. I think the effect looks pretty nice for a software-generated effect, though! Someday I'll learn more about OpenGL shaders and stuff... http://imgur.com/a/TmFyc |
spud@potato.field: May 09 01:34PM Hi, I was wondering if it's possible for an object constructor to be passed its position in an array. E.g.:

class my_class {
    :
    :
} myarray[10][20];

Is there a way for each individual object to find its x,y location? Obviously I can just declare **myarray and then do a load of messy manual allocation, passing the x,y to the constructor at "new", but perhaps the 2011 or 2014 standards have automated this at all? Or is there some STL trick that can achieve this using a container? Thanks for any help -- Spud |
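[Editor's sketch of one common answer: since the array is contiguous, an element can recover its (row, col) from its own address relative to the array base, with no per-element storage and no constructor arguments. Hedge: the subtraction across sub-array boundaries is technically outside what the standard guarantees for pointer arithmetic, though it works on flat-memory implementations.]

```cpp
#include <cstddef>

struct my_class {
    // Recover this object's (row, col) within the 2D array it lives in.
    template <std::size_t R, std::size_t C>
    void locate(const my_class (&arr)[R][C],
                std::size_t& row, std::size_t& col) const {
        const std::ptrdiff_t off = this - &arr[0][0];  // flat element index
        row = static_cast<std::size_t>(off) / C;
        col = static_cast<std::size_t>(off) % C;
    }
};

my_class myarray[10][20];
```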