- "Need for Speed - C++ versus Assembly Language" - 10 Updates
- Why does VC++ ignore the 'private' of private inheritance? - 2 Updates
- Why this UTF-8-conversion code works with Visual C++ and not with g++? - 2 Updates
- compiler generate jmp to another jmp - 6 Updates
- The worm that spreads WanaCrypt0r - 5 Updates
Jerry Stuckle <jstucklex@attglobal.net>: May 15 12:02AM -0400 On 5/12/2017 4:29 AM, David Brown wrote: > might not mean /big/ changes, but it will mean changes. In C, you can > make another local variable when you want it - in assembly, if you run > out of registers you need to make changes. But in that case you're not talking about unrelated parts of the code. They are quite closely related. > has an "if-then-else" construct that supports up to four conditional > instructions. Make a change that goes over that limit of four, and you > have to re-structure the code to use conditional branches. Once again, you're not talking about unrelated code. > limited in flexibility. > All this might not have a noticeable effect on code performance - but it > does have an effect on the time and effort spent. But it *can* have an effect on code performance, and in some cases that is quite important. > is still subject to caches, interrupts, etc.), and much higher > development costs and risks. Programmable logic lets them get latency > variation down to a single clock cycle - it is worth the cost. Those are excellent examples, and the people who run those projects pay top dollar for programmers who can do just what I described. Virtually all of the time-critical code is assembler. Only non-critical sections are C, C++ or other languages. > target processor, and saving 2% average time will save 2% hardware > budget and 2% electricity costs. It is absolutely worth the effort > going for assembly there. Throughput after the fact is not as critical as throughput during the event. For real-time processing, if you miss something, it is gone forever. > it is far easier to track exactly what you are doing because there is a > small instruction set, no scheduling or multiple issues, very clear and > simple pipelines, no caches, etc. I've seen million+ lines of assembler code, even on mainframes. It is done much more often than you think. 
> I agree that it /is/ done, I merely say that it is very rarely worth the > effort. Massive scientific processing is an example where it sometimes > /is/ worth the effort. And I'm saying you're wrong - there are times when it is quite important, and it is done this way. Much more often than you think. And BTW - another example is in weather prediction. Most of the models are written in assembler and run on supercomputers. Even at that, to get an accurate 24 hour forecast would take 72 hours of processing time. -- ================== Remove the "x" from my email address Jerry Stuckle jstucklex@attglobal.net ================== |
Christian Gollwitzer <auriocus@gmx.de>: May 15 09:32PM +0200 On 15.05.17 at 06:02, Jerry Stuckle wrote: > And BTW - another example is in weather prediction. Most of the models > are written in assembler and run on supercomputers. Do you have a reference for this? Supercomputers are used, no doubt, but I wouldn't have expected that they write the PDE solvers in assembly. I found one at https://github.com/yyr/wrf which is written in Fortran. Christian |
Gareth Owen <gwowen@gmail.com>: May 15 08:48PM +0100 > Do you have a reference for this? Supercomputers are used, no doubt, > but I wouldn't have expected that they write the PDE solvers in > assembly. Most of them are in Fortran - some are in C and C++. I've worked on and seen codebases for large-scale climate and weather modeling, and I've seen literally no assembler. In fact, few of them are even running on the same architectures on which they originally ran (and many are run on multiple archs, as most predictions are ensemble averages of deliberately mildly different runs - different architectures help with that). It might have been true 30 years ago, which is Jerry's default point of reference. This is hardly surprising, as easy portability to the latest-and-fastest processors is far more critical for performance than squeezing 3% out of the current processor using assembler. |
Jerry Stuckle <jstucklex@attglobal.net>: May 15 04:12PM -0400 On 5/15/2017 3:32 PM, Christian Gollwitzer wrote: > I wouldn't have expected that they write the PDE solvers in assembly. > I found one at https://github.com/yyr/wrf which is written in Fortran. > Christian You won't find these programs on github. And anything you find on github in this area is not used by serious forecasters. -- ================== Remove the "x" from my email address Jerry Stuckle jstucklex@attglobal.net ================== |
Jerry Stuckle <jstucklex@attglobal.net>: May 15 04:18PM -0400 On 5/15/2017 3:48 PM, Gareth Owen wrote: > the same architectures on which they originally run (and many are run on > multiple-archs as most predictions are ensemble averages of deliberately > mildly different runs - different architectures help with that). Yes, some code is written in Fortran. But there is also a significant amount of assembler - I've seen it. > It might have been true 30 years ago, which is Jerry's default point of > reference. My reference is fresh - this year, in fact. > This is hardly surprising, as easy portability to the latest-and-fastest > processors is far more critical for performance than squeezing 3% out of > the current processor using assembler. Then you haven't seen the code for the models used on those supercomputers. I'm referring specifically to GFS and European models, although others are similar. -- ================== Remove the "x" from my email address Jerry Stuckle jstucklex@attglobal.net ================== |
fir <profesor.fir@gmail.com>: May 15 01:40PM -0700 On Monday, May 15, 2017 at 22:18:10 UTC+2, Jerry Stuckle wrote: > Then you haven't seen the code for the models used on those > supercomputers. I'm referring specifically to GFS and European models, > although others are similar. i once remember a thread on a big number library - it was probably GMP - it uses assembly as far as i remember (some may try to find some info on how assembly is faster there)... i also remember some probably-fastest fractal/mandelbrot 'explorer', it also uses assembly afair .. i would tend to believe that most of the fastest cpu solutions tend to go to assembly anyway (but i dont do research in this field) |
Gareth Owen <gwowen@gmail.com>: May 15 09:43PM +0100 > Then you haven't seen the code for the models used on those > supercomputers. I'm referring specifically to GFS and European models, > although others are similar. https://www.researchgate.net/publication/228791697_FAMOUS_faster_using_parallel_computing_techniques_to_accelerate_the_FAMOUSHadCM3_climate_model_with_a_focus_on_the_radiative_transfer_algorithm Here's a paper about the HadCM3 model - which is written in C and Fortran. As discussed there, when they need real performance they use massively parallel frameworks like OpenCL, not assembler. Most of the discussion in this paper talks about porting the speed-critical sections *from* Fortran. |
Gareth Owen <gwowen@gmail.com>: May 15 09:44PM +0100 > fractal/mandelbrot 'explorer' it also uses assembly afair .. i would > tend to belive that most fastests cpu solutions tend to go to assembly > anyway (but i dont do research in this field) One place where assembly is widely used is FFmpeg, for encoding video. |
"Chris M. Thomasson" <invalid@invalid.invalid>: May 15 01:54PM -0700 On 5/15/2017 1:40 PM, fir wrote: > 'explorer' it also uses assembly afair .. i would tend to belive that most > fastests cpu solutions tend to go to assembly anyway (but i dont do research > in this field) Wrt fast fractal explorers, assembly on the cpu is very slow compared to implementing the fractal in a GPU shader. |
fir <profesor.fir@gmail.com>: May 15 02:10PM -0700 On Monday, May 15, 2017 at 22:43:31 UTC+2, gwowen wrote: > parallel systems like OpenCL, not assembler. Most of the discussion in > this paper talks about porting the speed-critical sections *from* > Fortran.

well thats obvious (that gpu has more processing power) but we talk here about cpu; as to gpu, gpu also has its c and has its assembly too ;c i once was interested how much faster assembly coding on gpu is than c coding on gpu, but it is hard to find people able to answer that

ps i posted once on comp.lang.c simple tests related to this topic (i mean comparing scalar c / asm (well, half-asm/intrinsics) and gpu-c); this test is interesting by its simplicity so i repaste it, someone could take it as a base for some benchmark maybe

>>>paste below>>>>

This night i ran my first gpu c code that is doing something and made some tests. this is a simple mandelbrot drawing code; first i ran the scalar version

    int mandelbrot_n( float cRe, float cIm, int max_iter )
    {
        float re = cRe;
        float im = cIm;
        float rere = re*re;
        float imim = im*im;
        for(int n=1; n<=max_iter; n++)
        {
            im = (re+re)*im + cIm;
            re = rere - imim + cRe;
            rere = re*re;
            imim = im*im;
            if ( (rere + imim) > 4.0 ) return n;
        }
        return 0;
    }

for 256 x 256 x 1000 iterations it takes 90 ms. then i made an sse intrinsic version

    __attribute__((force_align_arg_pointer))
    __m128i mandelbrot_n_sse( __m128 cre, __m128 cim, int max_iter )
    {
        __m128 re = _mm_setzero_ps();
        __m128 im = _mm_setzero_ps();
        __m128 _1 = _mm_set_ps1(1.);
        __m128 _4 = _mm_set_ps1(4.);
        __m128 iteration_counter = _mm_set_ps1(0.);
        for(int n=0; n<=max_iter; n++)
        {
            __m128 re2 = _mm_mul_ps(re, re);
            __m128 im2 = _mm_mul_ps(im, im);
            __m128 radius2 = _mm_add_ps(re2, im2);
            __m128 compare_mask = _mm_cmplt_ps( radius2, _4 );
            iteration_counter = _mm_add_ps( iteration_counter,
                                            _mm_and_ps(compare_mask, _1) );
            if (_mm_movemask_ps(compare_mask) == 0) break;
            __m128 ren  = _mm_add_ps( _mm_sub_ps(re2, im2), cre );
            __m128 reim = _mm_mul_ps(re, im);
            __m128 imn  = _mm_add_ps( _mm_add_ps(reim, reim), cim );
            re = ren;
            im = imn;
        }
        __m128i n = _mm_cvtps_epi32(iteration_counter);
        return n;
    }

this runs in 20 ms (more than 4 times faster, dont know why) (the processor i ran it on is anyway an old core2 e6550 2.33GHz - i got a better machine with avx support but didnt use it here yet). then i wrote the opencl code

    __kernel void square( __global int* input,
                          __global int* output,
                          const unsigned int count )
    {
        int i = get_global_id(0);
        if(i < count)
        {
            int x = i%256;
            // if(x>=256) return;
            int y = i/256;
            // if(y>=256) return;
            float cRe = -0.5 + -1.5 + x/256.*3.;
            float cIm =  0.0 + -1.5 + y/256.*3.;
            float re = 0;
            float im = 0;
            int n = 0;
            for( n=0; n<=1000; n++) {
                if( re * re + im * im > 4.0 )
                    { output[256*y+x] = n + 256*n + 256*256*n; return; }
                float re_n = re * re - im * im + cRe;
                float im_n = 2 * re * im + cIm;
                re = re_n;
                im = im_n;
            }
            output[256*y+x] = 250<<8;
        }
    }

this works with no problem and runs at 7 ms (i got a weak gpu, a gt610). How to optimise this gpu version? Is it common to write such scalar code on gpu - maybe there is some way of writing something like sse intrinsics here? or some other kind of optimisation? (anyway i must say that with those critics of gpu/opencl coding i dont fully agree - this works easy and fine, at least for some cases (esp good is that there is not too much slowdown when running the gpu from the cpu and getting back the results - it seems i can run it in a 1 millisecond window, so its very fine); i believe that with harder codes it may get slower, but i also believe with a better card i may go better than 7 ms) |
praxisjuergenschmidt@gmail.com: May 15 06:12AM -0700 I tried a variation of the following code fragment:

    class Base {
    public:
        virtual ~Base() {}
    };

    class Derived : Base { };

    class Test {
        Base* b_;
    public:
        Test( Base* b ) : b_( b ) {}
        ~Test() { delete b_; }
    };

    Test testA() { return Test( new Derived ); }

It compiles fine with VC++ (cl versions 15 and 17), which ignores the private inheritance, but doesn't with any other compiler. Is it just a bug? Thanks J |
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: May 15 11:09PM +0200 > It is compiling well with VC++ (cl version 15 and 17) ignoring > private inheritance, but doesn't with any other compiler. Is it just > a bug? That's clearly a compiler bug. Please report it. If this is Visual C++ 2015 or earlier then you can use Microsoft Connect. I think there's a new error reporting site for Visual C++ 2017. Cheers!, - Alf |
scott@slp53.sl.home (Scott Lurndal): May 15 01:11PM >platform is. :) I think the C++ standard library's i/o and text handling >is fundamentally broken, because it was designed with the codepage model >in mind, and now is used for variable length encodings. I don't believe the SGI folks ever even considered windows, or codepages when designing the STL. |
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: May 15 11:04PM +0200 On 15-May-17 3:11 PM, Scott Lurndal wrote: >> in mind, and now is used for variable length encodings. > I don't believe the SGI folks ever even considered windows, or codepages > when designing the STL. The STL was designed by Alexander Stepanov, starting in 1992 at HP. See <url: https://en.wikipedia.org/wiki/Standard_Template_Library#History>. "SGI folks" did not design the STL but provided an STL implementation, apparently still available at <url: https://www.sgi.com/tech/stl/>. Here's a nice article about it: <url: http://www.drdobbs.com/cpp/the-sgi-standard-template-library/184410249>. The STL has roughly nothing to do with text handling and streams. Stepanov focused on (the separation of) algorithms and containers. Possibly you're conflating the STL with the standard library. That's natural because it's a common misconception that's floating around. Codepages are not particular to Windows. "Codepage" is a term that originally denoted only single byte encodings and that stems from IBM, as far as I know. Today that's the main meaning. Most systems including Unix were based on the codepage model (single byte encodings, with some special support for Shift-JIS etc.) before the advent of Unicode. See <url: https://en.wikipedia.org/wiki/Code_page>. Summing up, your comment provides many opportunities to learn. ;-) Cheers & hth., - Alf |
qak <q3k@mts.net.NOSPAM>: May 15 04:56PM I disassembled 2 highly optimized DirectShow filters (CoreAVC and Lentoid, known to be fastest for H264 and H265) and noticed many (more than a few dozen) patterns like the following:

        jz   Label_1     ; this is a short jump, but some of them are unconditional jumps
        ...
    Label_1:
        jmp  Label_123   ; quite a few of them jump again to another jump
        ...
        ...
    Label_123:
        jmp  Label_234

I could replace the first line with 'jz Label_234' and reassemble without problem. Could the optimizer miss the opportunity? or I know too little! The first line is not always a short jump, sometimes it is even 'call Label_1'. |
"Rick C. Hodgin" <rick.c.hodgin@gmail.com>: May 15 10:14AM -0700 On Monday, May 15, 2017 at 12:56:49 PM UTC-4, qak wrote: > problem. > Could the optimizer miss the oportunity ? or I know too little! > The first line is not always short jump, sometimes is even 'call Label_1'. Some CPUs only allow conditional jumps which go for a certain distance (number of bytes) away from their instruction. As such, it's a common occurrence to see the conditional jump only move to another instruction which then does a hard jump to some location further away. A more common practice is to jump locally on the alternate condition, and then hard-jump if the opposite condition test fails. However, for optimization, when it's known that a particular branch is more likely or less likely to branch, it can be coded in those ways. In this case, since you are able to modify the source code and generate a binary without an error, it's likely legacy code left over from a 32-bit codec, which supported only +/- 128 bytes for branch target offsets, whereas in 64-bit code it can go pretty much anywhere needed. It may also be operating that way for alignment, so the code nearby is aligned at a particular boundary. Without knowing the specifics of the algorithm, it would be hard to say for certain, but it's not uncommon to see what you find there, and the reasons it exists typically are the limitations of the instruction ISA, or for some reasons related to optimization. Thank you, Rick C. Hodgin |
"Rick C. Hodgin" <rick.c.hodgin@gmail.com>: May 15 10:42AM -0700 On Monday, May 15, 2017 at 1:14:31 PM UTC-4, Rick C. Hodgin wrote: > the instruction ISA, or for some reasons related to optimization. > Thank you, > Rick C. Hodgin Something else ... a hard JMP instruction forces a refill of the instruction cache. If the algorithm is using self-modifying code, that could be the reason for the dual-target jump. On a conditional branch, the pipeline is only invalidated and refilled if the branch unit mis-predicted the branch. By forcing a hard branch, it will then re-fill the instruction code in the pipeline, which will read the recently altered self-modifying code (if it existed). If you re-post your question to alt.lang.asm or comp.lang.asm.x86 if it's x86-specific, then you'll get better answers. Thank you, Rick C. Hodgin |
scott@slp53.sl.home (Scott Lurndal): May 15 05:54PM >> distance (number of bytes) away from their instruction. >Something else ... a hard JMP instruction forces a refill of the >instruction cache. No, it doesn't. The only difference between an unconditional branch operation and a condition branch operation is whether or not instructions are speculatively executed. Modern branch predictors are pretty good at preventing speculation down the wrong branch path. Intel's instruction cache is snooped specifically to support self-modifying code without the programmer having to do anything special (like flushing the L1I cache). > By forcing a hard branch, it will >then re-fill the instruction code in the pipeline, which will read >the recently altered self-modifying code (if it existed). No, the unconditional branch does nothing other than changing the flow of execution. It has no effect on cache coherency. |
"Rick C. Hodgin" <rick.c.hodgin@gmail.com>: May 15 11:19AM -0700 On Monday, May 15, 2017 at 1:54:54 PM UTC-4, Scott Lurndal wrote: > No, the unconditional branch does nothing other than > changing the flow of execution. It has no effect on cache > coherency. That is information beyond my existing understanding. I just did a search online and Randall Hyde wrote on page 447 of his book: "Write Great Code, Vol. 2: Thinking Low-Level, Writing High-Level" https://books.google.com/books?id=mM58oD4LATUC&pg=PA447 "Although these statements typically compile to a single machine instruction (jmp), don't get the impression they are efficient to use. Even ignoring the fact that a jmp can be somewhat expensive (because it forces the CPU to flush the instruction pipeline), statements that branch out of a loop can have..." When did the JMP instruction stop forcing a refill of the cache? Thank you, Rick C. Hodgin |
"Rick C. Hodgin" <rick.c.hodgin@gmail.com>: May 15 11:27AM -0700 On Monday, May 15, 2017 at 1:54:54 PM UTC-4, Scott Lurndal wrote: > "Rick C. Hodgin" <rick.c.hodgin@gmail.com> writes: > >Something else ... a hard JMP instruction forces a refill of the > >instruction cache. I just realized that I wrote "instruction cache" here when I meant to write "instruction pipeline." I realize the cache is not invalidated, but only what has already been pre-loaded into the instruction pipeline. Those pre-decoded instructions already in the pipeline, which may have been received from prior reads before SMC updated something, are invalidated by the JMP and it re-fills from the instruction cache. If that is incorrect, then my information is notably out of date because that's how it used to work. > No, the unconditional branch does nothing other than > changing the flow of execution. It has no effect on cache > coherency. How does the CPU synchronize instructions which have been pre-fetched from now stale instruction data for an upcoming instruction that's already begun decoding for its pipeline, to then later signal without the hard JMP to know that it's going to be executing stale SMC? That would be quite a slick feature to have in a CPU, so that no matter when SMC was used, it always executed the correct version. Thank you, Rick C. Hodgin |
Real Troll <real.troll@trolls.com>: May 14 09:59PM -0400 The snippet code is here: <https://blog.malwarebytes.com/threat-analysis/2017/05/the-worm-that-spreads-wanacrypt0r/> Use your brains to get it to work for you in your test labs. No Jesus needed here. |
"Chris M. Thomasson" <invalid@invalid.invalid>: May 14 09:57PM -0700 On 5/14/2017 6:59 PM, Real Troll wrote: > <https://blog.malwarebytes.com/threat-analysis/2017/05/the-worm-that-spreads-wanacrypt0r/> > Use your brains to get it work for you in your test labs. > No Jesus needed here. Thank you for posting this. |
"Chris M. Thomasson" <invalid@invalid.invalid>: May 14 10:18PM -0700 On 5/14/2017 9:57 PM, Chris M. Thomasson wrote: >> Use your brains to get it work for you in your test labs. >> No Jesus needed here. > Thank you for posting this. The one point that has me thinking is the missing leading underscore to the _beginthreadex function: https://msdn.microsoft.com/en-us/library/kdzttdcb.aspx It should not link? Weird. Also, that function is used to create threads along with an initialization of the CRT for said thread. So, the virus is written in C. Code on Windows that does not use the CRT can make use of CreateThread instead. Strange that a virus is not written in assembly language. |
"Rick C. Hodgin" <rick.c.hodgin@gmail.com>: May 15 12:18AM -0700 Real Troll wrote: > No Jesus needed here. This belief will seem to work well until you leave this world and are summoned by name to give an account of your life and it's realized your sin is still with you, God is real, judgment for sin is real, and Hellfire is real. It's why God warns you in advance ... to give you space and time to repent, and to come to Jesus asking forgiveness. You must do this today, because none of us are promised tomorrow. Thank you, Rick C. Hodgin |
Cholo Lennon <chololennon@hotmail.com>: May 15 10:33AM -0300 On 15/05/17 02:18, Chris M. Thomasson wrote: > Strange that a virus is not written in assembly language. Well, nowadays it's very unusual for a virus/malware to be written in assembler; it makes no sense (IMO) -- Cholo Lennon Bs.As. ARG |
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page. To unsubscribe from this group and stop receiving emails from it send an email to comp.lang.c+++unsubscribe@googlegroups.com. |