Sunday, November 18, 2018

Digest for comp.lang.c++@googlegroups.com - 18 updates in 5 topics

David Brown <david.brown@hesbynett.no>: Nov 18 02:21PM +0100

On 17/11/2018 18:42, Melzzzzz wrote:
>> Templates, etc. But if you know what you're sorting and are willing to
>> put that into the sort code - I don't know why C++ should be faster.
 
> Well, qsort in clib is much slower then C++ sort...
 
That is for two reasons. One is that the standard C library qsort
function is generic - as Vir said, /generic/ functions can be more
efficient in C++ (at least while you want things to be clear,
maintainable, etc.), but a specialised sort for a specific type will be
as fast in C as in C++. The other issue is that the actual algorithm
for sorting is not specified - it is quite possible that the C++ library
you tested has a more efficient algorithm than the C library you tested.
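A minimal sketch of the two styles being compared (function names here are just for illustration): the C library's qsort calls its comparator through a function pointer across a translation-unit boundary, while std::sort is a template instantiated at the call site, so the comparison can be inlined.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// C-style comparator: a separately compiled qsort calls this through a
// function pointer for every single comparison.
int cmp_int(const void* a, const void* b)
{
    int x = *static_cast<const int*>(a);
    int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

std::vector<int> sort_with_qsort(std::vector<int> v)
{
    std::qsort(v.data(), v.size(), sizeof(int), cmp_int);
    return v;
}

// C++-style: std::sort is instantiated in this translation unit, so the
// default operator< for int is a direct, inlinable comparison.
std::vector<int> sort_with_std_sort(std::vector<int> v)
{
    std::sort(v.begin(), v.end());
    return v;
}
```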
Christian Gollwitzer <auriocus@gmx.de>: Nov 18 07:29PM +0100

Am 18.11.18 um 14:21 schrieb David Brown:
> function is generic - as Vir said, /generic/ functions can be more
> efficient in C++ (at least while you want things to be clear,
> maintainable, etc.),
 
 
I'm not even sure that this holds for a sort function using a callback.
The main difference between the C and the C++ version is that "qsort" is
defined in a separately compiled file, while std::sort is defined in the
header file and compiled simultaneously.
 
If you define qsort like
 
inline void qsort(void* base, size_t num, size_t size,
                  int (*compar)(const void*, const void*))
{ ... code... }
 
*inside* a header file, and you call it with a fixed callback function,
then the compiler can inline the comparator in the same manner it
inlines it in C++ and I see no reason why it should be slower.
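A fleshed-out sketch of that idea, with a simple insertion sort standing in for the elided qsort body (the name my_qsort, the fixed 64-byte element limit, and the helper functions are all assumptions of this sketch):

```cpp
#include <cstddef>
#include <cstring>

// Same signature as qsort, but defined inline in a "header" so it is
// compiled together with its caller and its comparator.
inline void my_qsort(void* base, std::size_t num, std::size_t size,
                     int (*compar)(const void*, const void*))
{
    char* b = static_cast<char*>(base);
    char tmp[64];                       // sketch assumes size <= 64
    for (std::size_t i = 1; i < num; ++i) {
        std::memcpy(tmp, b + i * size, size);
        std::size_t j = i;
        for (; j > 0 && compar(b + (j - 1) * size, tmp) > 0; --j)
            std::memcpy(b + j * size, b + (j - 1) * size, size);
        std::memcpy(b + j * size, tmp, size);
    }
}

inline int cmp_int(const void* a, const void* b)
{
    int x = *static_cast<const int*>(a);
    int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

// With both my_qsort and cmp_int visible here, an optimiser can resolve
// the function-pointer call to cmp_int and inline the comparison.
inline void sort_ints(int* p, std::size_t n)
{
    my_qsort(p, n, sizeof(int), cmp_int);
}
```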
 
Christian
David Brown <david.brown@hesbynett.no>: Nov 18 07:40PM +0100

On 18/11/2018 19:29, Christian Gollwitzer wrote:
 
> *inside* a header file, and you call it with a fixed callback function,
> then the compiler can inline the comparator in the same manner it
> inlines it in C++ and I see no reason why it should be slower.
 
Yes, you could do that (with a compiler that is smart enough to inline
functions called through pointers). But then you would not be using the
qsort from the C library.
 
So this is all proving the same point - you /can/ get the same
performance from C as from C++, but it is harder to do so when you are
talking about generic functions.
"Rick C. Hodgin" <rick.c.hodgin@gmail.com>: Nov 18 03:23PM -0800

On Saturday, November 17, 2018 at 12:42:21 PM UTC-5, Melzzzzz wrote:
> Well, qsort in clib is much slower then C++ sort...
 
It's much slower still in Hungarian folk dance:
 
Quick-sort with Hungarian (Küküllőmenti legényes) folk dance
https://www.youtube.com/watch?v=ywWBy6J5gz8
 
--
Rick C. Hodgin
Ben Bacarisse <ben.usenet@bsb.me.uk>: Nov 18 02:35AM

> is known to be much faster in general than naive bit by bit methods.
 
> It's a better big O. Your method is O(number of bits set) which is
> better than O(log n).
 
You can't really say one is asymptotically better than the other unless
you start to talk about the distribution of data values, and most people
don't want to go there when using O(...) notation. When not otherwise
stated O(...) is usually used to denote the worst-case performance, and
here both algorithms are O(log n).
 
And if you switch to "average case" from "worst case", for uniformly
distributed data, O(number of bits set) = O((log n)/2) = O(log n).
You'd need some non-uniform assumption about the values to be able to
say that O(number of bits set) is better than O(log n).
 
--
Ben.
Paul <pepstein5@gmail.com>: Nov 18 01:18AM -0800

On Sunday, November 18, 2018 at 2:36:04 AM UTC, Ben Bacarisse wrote:
> distributed data, O(number of bits set) = O((log n)/2) = O(log n).
> You'd need some non-uniform assumption about the values to be able to
> say that O(number of bits set) is better than O(log n).
 
Ben,
 
Thanks. Some very valid and good points here.
For clarity, the two methods being discussed on these last few posts are
posted at the end of this thread.
The fact of the matter, which you will see if you test them, is that
(with the possible exception of some rare cases) the first implementation
is much faster. Your initial post on the first method doesn't
acknowledge that. Try it and see if you agree with me.
Naively, you'd expect it to take about half the time in general (because with random numbers you'd expect roughly half the bits to be set to 1).
I just tried testing the two with numbers up to 10 million.
The x &= x-1 trick reduced the time by 40%. The fact that this is not
quite 50% suggests that, if nearly all the original bits are set to 1,
iterating x by x/=2 is narrowly faster. Obviously, these thoughts are
very tentative because they depend on compiler settings, systematic trials
etc.
Paul
 
int popcnt(unsigned x)
{
    int count = 0;
    for (; x; count++) x &= x - 1;
    return count;
}
 
and
 
int popcnt(unsigned x)
{
    int count = 0;
    while (x)
    {
        count += x & 1;
        x /= 2;
    }
    return count;
}
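The two loops above can be cross-checked against each other (and, on GCC or Clang, against the builtin) over a range of values; the renamed copies and the helper below are just for this sketch.

```cpp
// The two popcount loops from above, renamed for the cross-check.
int popcnt_clear(unsigned x)            // x &= x - 1 clears the lowest set bit
{
    int count = 0;
    for (; x; count++) x &= x - 1;
    return count;
}

int popcnt_halve(unsigned x)            // x /= 2 drains bits out the bottom
{
    int count = 0;
    while (x) {
        count += x & 1;
        x /= 2;
    }
    return count;
}

// Returns true if all methods agree for every value below limit.
bool methods_agree(unsigned limit)
{
    for (unsigned x = 0; x < limit; ++x) {
        if (popcnt_clear(x) != popcnt_halve(x)) return false;
#if defined(__GNUC__)
        if (popcnt_clear(x) != __builtin_popcount(x)) return false;
#endif
    }
    return true;
}
```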
Ben Bacarisse <ben.usenet@bsb.me.uk>: Nov 18 12:41PM

> (with the possible exception of some rare cases) the first implementation
> is much faster. Your initial post on the first method doesn't
> acknowledge that.
 
Yes, I was way too vague. It's going to take a bad compiler to make it
slower than the other method given here. Other constant-time methods
may beat it depending on the architecture, compiler etc.
 
> expect roughly half the bits to be set to 1). I just tried testing
> the two with numbers up to 10 million. The x &= x-1 trick reduced the
> time by 40%.
 
I got a slightly better reduction in time on uniform data. On worst
case data they take about the same time (as you would expect). But on
my machine __builtin_popcount beats all other methods hands-down by at
least a factor of 5.
 
By the way, the while (x) x /= 2 loop does not test all bits either but
it does test about twice the number that the other method tests. (I've
not done the exact computation. It would be a neat exercise for a maths
class.)
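One way to see that "about twice" figure empirically is to count loop iterations rather than time anything (the struct and the 16-bit limit here are arbitrary choices for the sketch):

```cpp
// Total loop iterations each method performs over all values below limit.
struct IterCounts {
    unsigned long long clear_loop;
    unsigned long long halve_loop;
};

IterCounts count_iterations(unsigned limit)
{
    IterCounts c{0, 0};
    for (unsigned v = 1; v < limit; ++v) {
        for (unsigned x = v; x; x &= x - 1) ++c.clear_loop;  // one pass per set bit
        for (unsigned x = v; x; x /= 2)     ++c.halve_loop;  // one pass per bit up to the top set bit
    }
    return c;
}
```

Over all 16-bit values the halving loop does a bit under twice as many iterations as the bit-clearing loop, which lines up with the observed timing difference.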
 
If I get time, I may try to time these against a table look-up method.
It's a long time since I've tried that.
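For reference, a common 8-bit-table variant looks roughly like this (a sketch assuming 32-bit unsigned; the names are made up for the example):

```cpp
#include <array>

// 256-entry table built once: popcount of each possible byte value,
// using the recurrence count(i) = count(i / 2) + (i & 1).
inline const std::array<unsigned char, 256>& byte_counts()
{
    static const auto table = [] {
        std::array<unsigned char, 256> t{};
        for (int i = 1; i < 256; ++i)
            t[i] = t[i / 2] + (i & 1);
        return t;
    }();
    return table;
}

// Sum the table entries for the four bytes (assumes 32-bit unsigned).
inline int table_popcount(unsigned x)
{
    const auto& t = byte_counts();
    return t[x & 0xFF] + t[(x >> 8) & 0xFF]
         + t[(x >> 16) & 0xFF] + t[(x >> 24) & 0xFF];
}
```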
 
 
--
Ben.
Paul <pepstein5@gmail.com>: Nov 18 05:12AM -0800

On Sunday, November 18, 2018 at 12:41:25 PM UTC, Ben Bacarisse wrote:
... on
> my machine __builtin_popcount beats all other methods hands-down by at
> least a factor of 5.
...
 
Thanks for your thoughts. Could you clarify "all other methods" please?
How about the below?
My tests make it more or less an exact match for the builtin, and googling
for discussions, that's what others are saying, too.
I don't think the below is unambiguously slower than the builtin.
Admittedly, SWAR is very difficult to explain to someone who hasn't seen
it before. Perhaps you mean "All other methods which are easily
explainable in a few minutes so long as someone knows a bit of C++".
 
Thanks again,
Paul
 
int swarPopcount(unsigned i)
{
    // Assumes 32-bit unsigned int.
    i = i - ((i >> 1) & 0x55555555);                 // count bits in each 2-bit field
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);  // sum adjacent pairs into 4-bit fields
    return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;  // fold to bytes; multiply sums them into the top byte
}
Ben Bacarisse <ben.usenet@bsb.me.uk>: Nov 18 03:37PM

> ...
 
> Thanks for your thoughts. Could you clarify "all other methods"
> please?
 
A mistake. I should have said both other methods -- the two loops you
were comparing. I have not tested all others by any means!
 
> How about the below?
> My tests make it more or less an exact match for the builtin, and googling
> for discussions, that's what others are saying, too.
 
Optimised (-O2 and -O3), it's faster than __builtin_popcount on my
system; unoptimised, it takes about twice the time.
 
This:
 
unsigned c;
__asm__ volatile ("POPCNT %1, %0;"
                  : "=r"(c)
                  : "r"(x));
return c;
 
is faster than the SWAR version (and therefore the builtin one and the
loops) but the difference becomes small when both are optimised. But of
course it's horribly non-portable. I'd assumed that __builtin_popcount
would use this method, but it appears not.
 
(BTW, I don't know much about __asm__ in gcc so if anyone wants to tell
me how I should have done it, I'd be grateful.)
 
> Admittedly, SWAR is very difficult to explain to someone who hasn't seen
> it before. Perhaps you mean "All other methods which are easily
> explainable in a few minutes so long as someone knows a bit of C++".
 
No, just a slip of the fingers. "All" should have been just the two
loops being compared in those recent posts.
 
 
--
Ben.
Jorgen Grahn <grahn+nntp@snipabacken.se>: Nov 18 04:44PM

On Sat, 2018-11-17, Paul wrote:
> On Saturday, November 17, 2018 at 3:38:07 PM UTC, robert...@yahoo.com wrote:
...
 
> did is that I was looking for the most direct approach so I said
> "What's happening in the kth binary place" So this led to me
> introducing 1 << place.
 
I was wondering about that (but I didn't ask). What you write makes
sense: scan through position 0, 1, 2, ... and look for set bits.
 
Yet my instinct is to do it like robert: draining all the bits out
through the bottom, as it were.
 
> x/=2; seems to execute faster than x >>= 1; so maybe a problem with
> my code is that bit shifting isn't as fast as the alternatives.
 
That seems unlikely. Back in the 1980s, people used to write n>>2
instead of n/4, because compilers were poor and DIV instructions very
expensive. But I soon learned that a half-decent optimizer would turn
both into the same assembly code (if n was unsigned, anyway).
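That equivalence is easy to check at the value level (for unsigned operands only; for negative signed values / truncates toward zero while >> of a negative number is not portable, which is why the "if n was unsigned" caveat matters). The function names are just for the sketch:

```cpp
// For unsigned n, n / 4 and n >> 2 produce the same value, and a
// half-decent optimiser emits the same shift instruction for both.
unsigned div4(unsigned n) { return n / 4; }
unsigned shr2(unsigned n) { return n >> 2; }

bool same_for_all_below(unsigned limit)
{
    for (unsigned n = 0; n < limit; ++n)
        if (div4(n) != shr2(n)) return false;
    return true;
}
```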
 
/Jorgen
 
--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
David Brown <david.brown@hesbynett.no>: Nov 18 07:36PM +0100

On 18/11/2018 16:37, Ben Bacarisse wrote:
> loops) but the difference becomes small when both are optimised. But of
> course it's horribly non-portable. I'd assumed that __builtin_popcount
> would use this method but it appear not.)
 
POPCNT was added with the "nehalem" generation of Intel chips, along
with the SSE4 instructions, as far as I can tell. In the AMD line, it
came a little later. __builtin_popcount on gcc (and presumably also
clang) will use POPCNT on x86, and equivalent instructions for other
processors, if it is told to generate code for a device family that
supports it.
 
Thus if you just ask for "x86-64" instructions, gcc won't use the POPCNT
instruction as it is not supported by all x86-64 devices. But if you
ask for "-march=nehalem", you /will/ get it for __builtin_popcount.
 
In particular, if you use "-march=native", gcc will generate code that
makes use of all the features of the cpu you are using at the time.
That is a good idea for making programs that you will run on your own
system, but limits portability of the binary on other processors. If
you really want fast code /and/ portability, gcc supports generation of
multiple copies of functions with run-time selection of the actual
version based on the processor in use at the time.
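A sketch of that run-time-selection mechanism using GCC's target_clones attribute on x86 (the guard keeps the snippet compilable elsewhere; the function name is made up for the example):

```cpp
// GCC function multiversioning: one clone is compiled with POPCNT
// enabled, one as a generic fallback, and a resolver picks the right
// one for the CPU at load time.
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__)
__attribute__((target_clones("popcnt", "default")))
#endif
int popcount_mv(unsigned x)
{
#if defined(__GNUC__)
    return __builtin_popcount(x);
#else
    int c = 0;                  // portable fallback for other compilers
    for (; x; ++c) x &= x - 1;
    return c;
#endif
}
```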
 
> (BTW, I don't know much about __asm__ in gcc so if anyone wants to tell
> me how I should have done it, I'd be grateful.)
 
It looks fine to me. You can use names instead of numbered parameters,
which is useful for more advanced inline assembly, but numbered
parameters are okay here.
 
Ben Bacarisse <ben.usenet@bsb.me.uk>: Nov 18 07:48PM


> On 18/11/2018 16:37, Ben Bacarisse wrote:
>> Paul <pepstein5@gmail.com> writes:
<snip>
> clang) will use POPCNT on x86, and equivalent instructions for other
> processors, if it is told to generate code for a device family that
> supports it.
<snip>
> In particular, if you use "-march=native", gcc will generate code that
> makes use of all the features of the cpu you are using at the
> time.
 
Interesting. I had forgotten that I was just doing a generic compile.
I've tried -march=native and the timing is exactly as for the __asm__
version. Thanks for that tip.
 
(Yet another way in which benchmarking code like this is increasingly
complicated.)
 
<snip>
--
Ben.
Paavo Helde <myfirstname@osa.pri.ee>: Nov 18 09:09AM +0200

On 18.11.2018 3:56, Stefan Ram wrote:
 
> . But an evaluation happens at /run time/. How is an
> evaluation which happens at run time supposed to evaluate
> an expression which only exists until compile-time?
 
Because the object code for evaluating the expression has been prepared
at compile time, doh.
 
This is the same as how we can have chicken soup for dinner even though no
chicken exists in the supermarket.
ram@zedat.fu-berlin.de (Stefan Ram): Nov 18 01:08AM

Ways how to explain object lifetime:
 
"When the destructor is called, this is like when the doctor
says to the object, 'I'm sorry, but you should get your
things in order.'".
 
"The type of the expression (rvalue or glvalue) is like the
last will and testament of the object, determining whether
or not its estate can be taken by others."
ram@zedat.fu-berlin.de (Stefan Ram): Nov 18 01:56AM

Supersedes: <evaluation-20181118024605@ram.dialup.fu-berlin.de>
 
An expression is an entity of the source-code model.
It sits right there in your code and consists of
characters you typed.
 
At run-time, there are no expressions; they have been
lost during the translation. They are not part of the
run-time model. For example, a program usually cannot
print its expressions.
 
Yet, we use the term "to evaluate an expression". E.g.,
 
|The postfix expression before the dot or arrow is evaluated;
n4762
 
. But an evaluation happens at /run time/. How is an
evaluation which happens at run time supposed to evaluate
an expression which only exists until compile-time?
"Chris M. Thomasson" <invalid_chris_thomasson@invalid.invalid>: Nov 17 03:45PM -0800

On 11/14/2018 7:25 PM, Pavel wrote:
> alien-like fungi that enslave and kill ants (warning: not for the faint of
> heart). https://www.youtube.com/watch?v=XuKjBIBBAL8 ... sorry for the
> offshoot-off-topic.
 
Will take a look. Thank you. Busy working on some animations right now.
Fwiw, my new software has dozens of points to interpolate with fun
functions. Here is one:
 
https://youtu.be/1jjiCMC2izc
 
This animation has created some interesting 3d volumetric renderings:
 
https://plus.google.com/101799841244447089430/posts/9s9DZ6SYz5x
(BUGS!)
 
https://plus.google.com/101799841244447089430/posts/E75s7Ya6fkg
 
https://plus.google.com/101799841244447089430/posts/VLBFc2cS9ap
(the GHOST!)
 
;^D
"Chris M. Thomasson" <invalid_chris_thomasson@invalid.invalid>: Nov 17 03:47PM -0800

On 11/14/2018 7:25 PM, Pavel wrote:
>>>>> On Monday, November 12, 2018 at 1:38:05 PM UTC-5, Kenny McCormack wrote:
>>>>>> In article <9a34066c-9aa4-4512-a9ef-3d8af47206cc@googlegroups.com>,
>>>>>> Rick C. Hodgin <rick.c.hodgin@gmail.com> wrote:
[...]
> alien-like fungi that enslave and kill ants (warning: not for the faint of
> heart). https://www.youtube.com/watch?v=XuKjBIBBAL8 ... sorry for the
> offshoot-off-topic.
 
ARGH! That is gnarly!
 
https://youtu.be/XuKjBIBBAL8?t=110
 
Not for the faint of heart. ouch. ;^o
 
https://youtu.be/XuKjBIBBAL8?t=134
 
oohhh...
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page.
To unsubscribe from this group and stop receiving emails from it send an email to comp.lang.c+++unsubscribe@googlegroups.com.
