- "Need for Speed - C++ versus Assembly Language" - 10 Updates
- Why does VC++ ignore the 'private' of private inheritance? - 2 Updates
- Why this UTF-8-conversion code works with Visual C++ and not with g++? - 2 Updates
- compiler generate jmp to another jmp - 6 Updates
- The worm that spreads WanaCrypt0r - 5 Updates
Jerry Stuckle <jstucklex@attglobal.net>: May 15 12:02AM -0400 On 5/12/2017 4:29 AM, David Brown wrote: > might not mean /big/ changes, but it will mean changes. In C, you can > make another local variable when you want it - in assembly, if you run > out of registers you need to make changes. But in that case you're not talking about unrelated parts of the code. They are quite closely related. > has an "if-then-else" construct that supports up to four conditional > instructions. Make a change that goes over that limit of four, and you > have to re-structure the code to use conditional branches. Once again, you're not talking about unrelated code. > limited in flexibility. > All this might not have a noticeable effect on code performance - but it > does have an effect on the time and effort spent. But it *can* have an effect on code performance, and in some cases that is quite important. > is still subject to caches, interrupts, etc.), and much higher > development costs and risks. Programmable logic lets them get latency > variation down to a single clock cycle - it is worth the cost. Those are excellent examples, and the people who run those projects pay top dollar for programmers who can do just what I described. Virtually all of the time-critical code is assembler. Only non-critical sections are C, C++ or other languages. > target processor, and saving 2% average time will save 2% hardware > budget and 2% electricity costs. It is absolutely worth the effort > going for assembly there. Throughput after the fact is not as critical as throughput during the event. For real-time processing, if you miss something, it is gone forever. > it is far easier to track exactly what you are doing because there is a > small instruction set, no scheduling or multiple issues, very clear and > simple pipelines, no caches, etc. I've seen million+ lines of assembler code, even on mainframes. It is done much more often than you think. 
> I agree that it /is/ done, I merely say that it is very rarely worth the > effort. Massive scientific processing is an example where it sometimes > /is/ worth the effort. And I'm saying you're wrong - there are times when it is quite important, and it is done this way. Much more often than you think. And BTW - another example is in weather prediction. Most of the models are written in assembler and run on supercomputers. Even at that, to get an accurate 24 hour forecast would take 72 hours of processing time. -- ================== Remove the "x" from my email address Jerry Stuckle jstucklex@attglobal.net ================== |
Christian Gollwitzer <auriocus@gmx.de>: May 15 09:32PM +0200 On 15.05.17 at 06:02, Jerry Stuckle wrote: > And BTW - another example is in weather prediction. Most of the models > are written in assembler and run on supercomputers. Do you have a reference for this? Supercomputers are used, no doubt, but I wouldn't have expected that they write the PDE solvers in assembly. I found one at https://github.com/yyr/wrf which is written in Fortran. Christian |
Gareth Owen <gwowen@gmail.com>: May 15 08:48PM +0100 > Do you have a reference for this? Supercomputers are used, no doubt, > but I wouldn't have expected that they write the PDE solvers in > assembly. Most of them are in Fortran - some are in C and C++. I've worked on and seen codebases for large-scale climate and weather modeling, and I've seen literally no assembler. In fact, few of them are even running on the same architectures on which they originally ran (and many are run on multiple archs, as most predictions are ensemble averages of deliberately mildly different runs - different architectures help with that). It might have been true 30 years ago, which is Jerry's default point of reference. This is hardly surprising, as easy portability to the latest-and-fastest processors is far more critical for performance than squeezing 3% out of the current processor using assembler. |
Jerry Stuckle <jstucklex@attglobal.net>: May 15 04:12PM -0400 On 5/15/2017 3:32 PM, Christian Gollwitzer wrote: > I wouldn't have expected that they write the PDE solvers in assembly. > I found one at https://github.com/yyr/wrf which is written in Fortran. > Christian You won't find these programs on github. And anything you find on github in this area is not used by serious forecasters. -- ================== Remove the "x" from my email address Jerry Stuckle jstucklex@attglobal.net ================== |
Jerry Stuckle <jstucklex@attglobal.net>: May 15 04:18PM -0400 On 5/15/2017 3:48 PM, Gareth Owen wrote: > the same architectures on which they originally run (and many are run on > multiple-archs as most predictions are ensemble averages of deliberately > mildly different runs - different architectures help with that). Yes, some code is written in Fortran. But there is also a significant amount of assembler - I've seen it. > It might have been true 30 years ago, which is Jerry's default point of > reference. My reference is fresh - this year, in fact. > This is hardly surprising, as easy portability to the latest-and-fastest > processors is far more critical for performance than squeezing 3% out of > the current processor using assembler. Then you haven't seen the code for the models used on those supercomputers. I'm referring specifically to GFS and European models, although others are similar. -- ================== Remove the "x" from my email address Jerry Stuckle jstucklex@attglobal.net ================== |
fir <profesor.fir@gmail.com>: May 15 01:40PM -0700 On Monday, May 15, 2017 at 22:18:10 UTC+2, Jerry Stuckle wrote: > Then you haven't seen the code for the models used on those > supercomputers. I'm referring specifically to GFS and European models, > although others are similar. i once remember a thread on a big number library - it was probably GMP - it uses assembly as far as i remember (some may try to find some info on how assembly is faster there)... i also remember some probably-fastest fractal/mandelbrot 'explorer', it also uses assembly afair .. i would tend to believe that most of the fastest cpu solutions tend to go to assembly anyway (but i dont do research in this field) |
Gareth Owen <gwowen@gmail.com>: May 15 09:43PM +0100 > Then you haven't seen the code for the models used on those > supercomputers. I'm referring specifically to GFS and European models, > although others are similar. https://www.researchgate.net/publication/228791697_FAMOUS_faster_using_parallel_computing_techniques_to_accelerate_the_FAMOUSHadCM3_climate_model_with_a_focus_on_the_radiative_transfer_algorithm Here's a paper about the HadCM3 model - which is written in C and Fortran. As discussed there, when they need real performance they use massively parallel frameworks like OpenCL, not assembler. Most of the discussion in this paper talks about porting the speed-critical sections *from* Fortran. |
Gareth Owen <gwowen@gmail.com>: May 15 09:44PM +0100 > fractal/mandelbrot 'explorer' it also uses assembly afair .. i would > tend to belive that most fastests cpu solutions tend to go to assembly > anyway (but i dont do research in this field) One place where assembly is widely used is FFmpeg, for encoding video. |
"Chris M. Thomasson" <invalid@invalid.invalid>: May 15 01:54PM -0700 On 5/15/2017 1:40 PM, fir wrote: > 'explorer' it also uses assembly afair .. i would tend to belive that most > fastests cpu solutions tend to go to assembly anyway (but i dont do research > in this field) Wrt fast fractal explorers, assembly on the cpu is very slow compared to implementing the fractal in a GPU shader. |
fir <profesor.fir@gmail.com>: May 15 02:10PM -0700 On Monday, May 15, 2017 at 22:43:31 UTC+2, gwowen wrote: > parallel systems like OpenCL, not assembler. Most of the discussion in > this paper talks about porting the speed-critical sections *from* > Fortran.

well thats obvious (that gpu has more processing power) but we talk here about cpu; as to gpu, gpu also has its c and has its assembly too ;c i once was interested how much faster assembly coding on gpu is than c coding on gpu, but it is hard to find people able to answer that

ps i posted once on comp.lang.c simple tests related to this topic (i mean comparing scalar c / asm (well, half-asm/intrinsics) and gpu-c); this test is interesting by its simplicity so i repaste it, someone could take it as a base for some benchmark maybe

>>>paste below>>>>

This night i ran my first gpu c code that is doing something and made some tests. this is a simple mandelbrot drawing code; first i ran the scalar version

    int mandelbrot_n( float cRe, float cIm, int max_iter )
    {
        float re = cRe;
        float im = cIm;
        float rere = re*re;
        float imim = im*im;
        for(int n=1; n<=max_iter; n++)
        {
            im = (re+re)*im + cIm;
            re = rere - imim + cRe;
            rere = re*re;
            imim = im*im;
            if ( (rere + imim) > 4.0 ) return n;
        }
        return 0;
    }

for 256 x 256 x 1000 iterations it takes 90 ms. then i made an sse intrinsic version

    __attribute__((force_align_arg_pointer))
    __m128i mandelbrot_n_sse( __m128 cre, __m128 cim, int max_iter )
    {
        __m128 re = _mm_setzero_ps();
        __m128 im = _mm_setzero_ps();
        __m128 _1 = _mm_set_ps1(1.);
        __m128 _4 = _mm_set_ps1(4.);
        __m128 iteration_counter = _mm_set_ps1(0.);
        for(int n=0; n<=max_iter; n++)
        {
            __m128 re2 = _mm_mul_ps(re, re);
            __m128 im2 = _mm_mul_ps(im, im);
            __m128 radius2 = _mm_add_ps(re2, im2);
            __m128 compare_mask = _mm_cmplt_ps( radius2, _4 );
            iteration_counter = _mm_add_ps( iteration_counter,
                                            _mm_and_ps(compare_mask, _1) );
            if (_mm_movemask_ps(compare_mask) == 0) break;
            __m128 ren  = _mm_add_ps( _mm_sub_ps(re2, im2), cre );
            __m128 reim = _mm_mul_ps(re, im);
            __m128 imn  = _mm_add_ps( _mm_add_ps(reim, reim), cim );
            re = ren;
            im = imn;
        }
        __m128i n = _mm_cvtps_epi32(iteration_counter);
        return n;
    }

this runs in 20 ms (more than 4 times faster, dont know why) (the processor i ran it on is anyway an old core2 e6550 2.33GHz - i got a better machine with avx support but didnt use it here yet). then i wrote the opencl code

    __kernel void square( __global int* input,
                          __global int* output,
                          const unsigned int count )
    {
        int i = get_global_id(0);
        if(i < count)
        {
            int x = i%256;
            // if(x>=256) return;
            int y = i/256;
            // if(y>=256) return;
            float cRe = -0.5 + -1.5 + x/256.*3.;
            float cIm =  0.0 + -1.5 + y/256.*3.;
            float re = 0;
            float im = 0;
            int n = 0;
            for( n=0; n<=1000; n++) {
                if( re * re + im * im > 4.0 )
                    { output[256*y+x] = n + 256*n + 256*256*n; return; }
                float re_n = re * re - im * im + cRe;
                float im_n = 2 * re * im + cIm;
                re = re_n;
                im = im_n;
            }
            output[256*y+x] = 250<<8;
        }
    }

this works with no problem and runs at 7 ms (i got a weak gpu, a gt610). How to optimise this gpu version? Is it common to write such scalar code on gpu - maybe there is some way of writing something like sse intrinsics here? or some other kind of optimisation? (anyway i must say that with those critics of gpu/opencl coding i dont fully agree - this works easy and fine, at least for some cases (esp good is that there is not too much slowdown when running the gpu from the cpu and getting back the results - it seems i can run it in a 1 millisecond window, so its very fine); i believe that with harder codes it may get slower, but i also believe with a better card i may go better than 7 ms) |
praxisjuergenschmidt@gmail.com: May 15 06:12AM -0700 I tried a variation of the following code fragment:

    class Base {
    public:
        virtual ~Base() {}
    };

    class Derived : Base { };

    class Test {
        Base* b_;
    public:
        Test( Base* b ) : b_( b ) {}
        ~Test() { delete b_; }
    };

    Test testA() { return Test( new Derived ); }

It compiles fine with VC++ (cl versions 15 and 17), which ignores the private inheritance, but doesn't with any other compiler. Is it just a bug? Thanks J |
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: May 15 11:09PM +0200 > It is compiling well with VC++ (cl version 15 and 17) ignoring > private inheritance, but doesn't with any other compiler. Is it just > a bug? That's clearly a compiler bug. Please report it. If this is Visual C++ 2015 or earlier then you can use Microsoft Connect. I think there's a new error reporting site for Visual C++ 2017. Cheers!, - Alf |
scott@slp53.sl.home (Scott Lurndal): May 15 01:11PM >platform is. :) I think the C++ standard library's i/o and text handling >is fundamentally broken, because it was designed with the codepage model >in mind, and now is used for variable length encodings. I don't believe the SGI folks ever even considered windows, or codepages when designing the STL. |
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: May 15 11:04PM +0200 On 15-May-17 3:11 PM, Scott Lurndal wrote: >> in mind, and now is used for variable length encodings. > I don't believe the SGI folks ever even considered windows, or codepages > when designing the STL. The STL was designed by Alexander Stepanov, starting in 1992 at HP. See <url: https://en.wikipedia.org/wiki/Standard_Template_Library#History>. "SGI folks" did not design the STL but provided an STL implementation, apparently still available at <url: https://www.sgi.com/tech/stl/>. Here's a nice article about it: <url: http://www.drdobbs.com/cpp/the-sgi-standard-template-library/184410249>. The STL has roughly nothing to do with text handling and streams. Stepanov focused on (the separation of) algorithms and containers. Possibly you're conflating the STL with the standard library. That's natural because it's a common misconception that's floating around. Codepages are not particular to Windows. "Codepage" is a term that originally denoted only single byte encodings and that stems from IBM, as far as I know. Today that's the main meaning. Most systems including Unix were based on the codepage model (single byte encodings, with some special support for Shift-JIS etc.) before the advent of Unicode. See <url: https://en.wikipedia.org/wiki/Code_page>. Summing up, your comment provides many opportunities to learn. ;-) Cheers & hth., - Alf |
qak <q3k@mts.net.NOSPAM>: May 15 04:56PM I disassembled 2 highly optimized DirectShow filters (CoreAVC and Lentoid, known to be fastest for H264 and H265) and noticed many (more than a few dozen) patterns like the following:

        jz   Label_1     ; this is a short jump, but some of them are unconditional jumps
        ...
    Label_1:
        jmp  Label_123   ; quite a few of them jump again to another jump
        ...
        ...
    Label_123:
        jmp  Label_234

I could replace the first line with 'jz Label_234' and reassemble without problem. Could the optimizer miss the opportunity? or I know too little! The first line is not always a short jump, sometimes it is even 'call Label_1'. |
"Rick C. Hodgin" <rick.c.hodgin@gmail.com>: May 15 10:14AM -0700 On Monday, May 15, 2017 at 12:56:49 PM UTC-4, qak wrote: > problem. > Could the optimizer miss the oportunity ? or I know too little! > The first line is not always short jump, sometimes is even 'call Label_1'. Some CPUs only allow conditional jumps which go for a certain distance (number of bytes) away from their instruction. As such, it's a common occurrence to see the conditional jump only move to another instruction which then does a hard jump to some location further away. A more common practice is to jump locally on the alternate condition, and then hard-jump if the opposite condition test fails. However, for optimization, when it's known that a particular branch is more likely or less likely to branch, it can be coded in those ways. In this case, since you are able to modify the source code and generate a binary without an error, it's likely legacy code left over from a 32-bit codec, which supported only +/- 128 bytes for branch target offsets, whereas in 64-bit code it can go pretty much anywhere needed. It may also be operating that way for alignment, so the code nearby is aligned at a particular boundary. Without knowing the specifics of the algorithm, it would be hard to say for certain, but it's not uncommon to see what you find there, and the reasons it exists typically are the limitations of the instruction ISA, or for some reasons related to optimization. Thank you, Rick C. Hodgin |
"Rick C. Hodgin" <rick.c.hodgin@gmail.com>: May 15 10:42AM -0700 On Monday, May 15, 2017 at 1:14:31 PM UTC-4, Rick C. Hodgin wrote: > the instruction ISA, or for some reasons related to optimization. > Thank you, > Rick C. Hodgin Something else ... a hard JMP instruction forces a refill of the instruction cache. If the algorithm is using self-modifying code, that could be the reason for the dual-target jump. On a conditional branch, the pipeline is only invalidated and refilled if the branch unit mis-predicted the branch. By forcing a hard branch, it will then re-fill the instruction code in the pipeline, which will read the recently altered self-modifying code (if it existed). If you re-post your question to alt.lang.asm or comp.lang.asm.x86 if it's x86-specific, then you'll get better answers. Thank you, Rick C. Hodgin |
scott@slp53.sl.home (Scott Lurndal): May 15 05:54PM >> distance (number of bytes) away from their instruction. >Something else ... a hard JMP instruction forces a refill of the >instruction cache. No, it doesn't. The only difference between an unconditional branch operation and a condition branch operation is whether or not instructions are speculatively executed. Modern branch predictors are pretty good at preventing speculation down the wrong branch path. Intel's instruction cache is snooped specifically to support self-modifying code without the programmer having to do anything special (like flushing the L1I cache). > By forcing a hard branch, it will >then re-fill the instruction code in the pipeline, which will read >the recently altered self-modifying code (if it existed). No, the unconditional branch does nothing other than changing the flow of execution. It has no effect on cache coherency. |
"Rick C. Hodgin" <rick.c.hodgin@gmail.com>: May 15 11:19AM -0700 On Monday, May 15, 2017 at 1:54:54 PM UTC-4, Scott Lurndal wrote: > No, the unconditional branch does nothing other than > changing the flow of execution. It has no effect on cache > coherency. That is information beyond my existing understanding. I just did a search online and Randall Hyde wrote on page 447 of his book: "Write Great Code, Vol. 2: Thinking Low-Level, Writing High-Level" https://books.google.com/books?id=mM58oD4LATUC&pg=PA447 "Although these statements typically compile to a single machine instruction (jmp), don't get the impression they are efficient to use. Even ignoring the fact that a jmp can be somewhat expensive (because it forces the CPU to flush the instruction pipeline), statements that branch out of a loop can have..." When did the JMP instruction stop forcing a refill of the cache? Thank you, Rick C. Hodgin |
"Rick C. Hodgin" <rick.c.hodgin@gmail.com>: May 15 11:27AM -0700 On Monday, May 15, 2017 at 1:54:54 PM UTC-4, Scott Lurndal wrote: > "Rick C. Hodgin" <rick.c.hodgin@gmail.com> writes: > >Something else ... a hard JMP instruction forces a refill of the > >instruction cache. I just realized that I wrote "instruction cache" here when I meant to write "instruction pipeline." I realize the cache is not invalidated, but only what has already been pre-loaded into the instruction pipeline. Those pre-decoded instructions already in the pipeline, which may have been received from prior reads before SMC updated something, are invalidated by the JMP and it re-fills from the instruction cache. If that is incorrect, then my information is notably out of date because that's how it used to work. > No, the unconditional branch does nothing other than > changing the flow of execution. It has no effect on cache > coherency. How does the CPU synchronize instructions which have been pre-fetched from now stale instruction data for an upcoming instruction that's already begun decoding for its pipeline, to then later signal without the hard JMP to know that it's going to be executing stale SMC? That would be quite a slick feature to have in a CPU, so that no matter when SMC was used, it always executed the correct version. Thank you, Rick C. Hodgin |
Real Troll <real.troll@trolls.com>: May 14 09:59PM -0400 The snippet code is here: <https://blog.malwarebytes.com/threat-analysis/2017/05/the-worm-that-spreads-wanacrypt0r/> Use your brains to get it to work for you in your test labs. No Jesus needed here. |
"Chris M. Thomasson" <invalid@invalid.invalid>: May 14 09:57PM -0700 On 5/14/2017 6:59 PM, Real Troll wrote: > <https://blog.malwarebytes.com/threat-analysis/2017/05/the-worm-that-spreads-wanacrypt0r/> > Use your brains to get it work for you in your test labs. > No Jesus needed here. Thank you for posting this. |
"Chris M. Thomasson" <invalid@invalid.invalid>: May 14 10:18PM -0700 On 5/14/2017 9:57 PM, Chris M. Thomasson wrote: >> Use your brains to get it work for you in your test labs. >> No Jesus needed here. > Thank you for posting this. The one point that has me thinking is the missing leading underscore to the _beginthreadex function: https://msdn.microsoft.com/en-us/library/kdzttdcb.aspx It should not link? Weird. Also, that function is used to create threads along with an initialization of the CRT for said thread. So, the virus is written in C. Code on Windows that does not use the CRT can make use of CreateThread instead. Strange that a virus is not written in assembly language. |
"Rick C. Hodgin" <rick.c.hodgin@gmail.com>: May 15 12:18AM -0700 Real Troll wrote: > No Jesus needed here. This belief will seem to work well until you leave this world and are summoned by name to give an account of your life and it's realized your sin is still with you, God is real, judgment for sin is real, and Hellfire is real. It's why God warns you in advance ... to give you space and time to repent, and to come to Jesus asking forgiveness. You must do this today, because none of us are promised tomorrow. Thank you, Rick C. Hodgin |
Cholo Lennon <chololennon@hotmail.com>: May 15 10:33AM -0300 On 15/05/17 02:18, Chris M. Thomasson wrote: > Strange that a virus is not written in assembly language. Well, nowadays it's very unusual for a virus/malware to be written in assembler; it makes no sense (IMO) -- Cholo Lennon Bs.As. ARG |
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page. To unsubscribe from this group and stop receiving emails from it send an email to comp.lang.c+++unsubscribe@googlegroups.com. |