Wednesday, August 7, 2019

Digest for comp.lang.c++@googlegroups.com - 25 updates in 3 topics

JiiPee <no@notvalid.com>: Aug 07 01:10PM +0100

On 25/07/2019 03:45, Alf P. Steinbach wrote:
 
>> and that didn't seem like a fruitful avenue.  What am I
>> missing?  Thanks in advance.
 
> There's the math constants, finally an official standard library pi! :)
 
 
woo, no need to always define it ourselves
Bonita Montero <Bonita.Montero@gmail.com>: Aug 07 07:13AM +0200

I just wrote a little test that checks the performance of unaligned
memory accesses on x86 / Win32. I've run this code on my Ryzen 1800X:
 
#pragma warning(disable: 6387)
#pragma warning(disable: 6001)
#pragma warning(disable: 4244)
 
#include <Windows.h>
#include <iostream>
#include <cstdint>
#include <cstring>
#include <algorithm>
#include <intrin.h>
 
using namespace std;
 
struct unaligned_uint32
{
uint8_t u[sizeof(uint32_t)];
operator uint32_t();
unaligned_uint32 &operator = ( uint32_t ui );
};
 
inline
unaligned_uint32::operator uint32_t()
{
return (uint32_t)u[0] | (uint32_t)u[1] << 8 |
(uint32_t)u[2] << 16 | (uint32_t)u[3] << 24;
}
 
inline
unaligned_uint32 &unaligned_uint32::operator = ( uint32_t ui )
{
u[0] = (uint8_t)ui;
u[1] = (uint8_t)(ui >> 8);
u[2] = (uint8_t)(ui >> 16);
u[3] = (uint8_t)(ui >> 24);
return *this;
}
 
 
template<typename TUI32>
void memkill( TUI32 *m, size_t elems );
 
int main()
{
size_t const SIZE = (size_t)1024 * 1024;
size_t const ELEMS = SIZE / sizeof(uint32_t);
unsigned const ITERATIONS = 10'000;
char *m;
LONGLONG llFreq;
double freq;
LONGLONG start, end;
double seconds;
 
m = (char *)VirtualAlloc( nullptr, SIZE, MEM_RESERVE | MEM_COMMIT,
PAGE_READWRITE );
memset( m, 0, SIZE );
 
QueryPerformanceFrequency( &(LARGE_INTEGER &)llFreq );
freq = llFreq;
 
QueryPerformanceCounter( &(LARGE_INTEGER &)start );
for( unsigned i = 0; i != ITERATIONS; i++ )
memkill( (uint32_t *)m, ELEMS );
QueryPerformanceCounter( &(LARGE_INTEGER &)end );
seconds = (end - start) / freq;
cout << "aligned native: " << seconds << endl;
 
QueryPerformanceCounter( &(LARGE_INTEGER &)start );
for( unsigned i = 0; i != ITERATIONS; i++ )
memkill( (uint32_t *)(m + 1), ELEMS - 1 );
QueryPerformanceCounter( &(LARGE_INTEGER &)end );
seconds = (end - start) / freq;
cout << "unaligned native: " << seconds << endl;
 
QueryPerformanceCounter( &(LARGE_INTEGER &)start );
for( unsigned i = 0; i != ITERATIONS; i++ )
memkill( (unaligned_uint32 *)(m + 1), ELEMS - 1 );
QueryPerformanceCounter( &(LARGE_INTEGER &)end );
seconds = (end - start) / freq;
cout << "unaligned wrapped: " << seconds << endl;
 
}
 
template<typename TUI32>
void memkill( TUI32 *m, size_t elems )
{
for_each( m, m + elems, []( TUI32 &e ) { e = (uint32_t)e + 1; } );
}
 
As you can see, the code also tests accessing the unaligned memory
with manual shifting of the bytes.
Here are the results of my 1800X:
 
aligned native: 0.244328
unaligned native: 0.437457
unaligned wrapped: 2.12482
 
I was very surprised that unaligned memory access is less than twice
as slow on my PC.
It would be nice to see results from Intel-CPUs here. Thanks in advance.
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Aug 06 11:20PM -0700

On 8/6/2019 10:13 PM, Bonita Montero wrote:
 
> I was very surprised that unaligned memory-access is less than twice as
> slow on my PC.
> It would be nice to see results from Intel-CPUs here. Thanks in advance.
 
Try using unaligned addresses with several threads. Try doing a LOCK
XADD on a location that straddles two cache lines, and is not aligned on
a line, vs one that is aligned on a cache line, and properly padded.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 07 08:57AM +0200

> Try using unaligned addresses with several threads.
 
That's not relevant to me because I only wanted to measure the
cost of an unaligned access.
 
> Try doing a LOCK XADD on a location that straddles two cache
> lines, ..
 
That's also not relevant to me, and not only to me: no one would
do that in reality because it isn't of any use.
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Aug 07 12:22AM -0700

On 8/6/2019 11:57 PM, Bonita Montero wrote:
>> lines, ..
 
> That's also not relevant to me; and not to me because no one
> would do that in reality because this isn't of any use.
 
Okay. I was just sort of thinking out loud. Sorry. FWIW, using LOCK
XADD on an unaligned address that straddles a cache line can invoke a
bus lock!
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Aug 07 12:24AM -0700

On 8/6/2019 11:57 PM, Bonita Montero wrote:
>> lines, ..
 
> That's also not relevant to me; and not to me because no one
> would do that in reality because this isn't of any use.
 
Actually, it is of some use. It can trigger a BUS lock. So one can use
it for a forced quiescence period in an RCU algorithm. Actually, Windows
has a nice way to do this without totally abusing x86/64:
 
https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-flushprocesswritebuffers
Bonita Montero <Bonita.Montero@gmail.com>: Aug 07 09:26AM +0200

> Fwiw, using LOCK XADD on unaligned address that
> straddles a cache line can invoke a Bus lock!
 
It's logical that Intel implemented unaligned loads and stores
for backward compatibility. And these loads / stores are useful
for efficiently modifying data structures for persistence or
transmission over the network. So this is clearly a unique
advantage of the Intel architecture.
But what should an unaligned LOCK * be useful for? I can see
no sense in that.
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Aug 07 12:33AM -0700

On 8/7/2019 12:26 AM, Bonita Montero wrote:
> advantage of the Intel-Architecture.
> But for what should an unaligned LOCK * be useful? I can see
> no sense in that.
 
To trigger a full blown bus lock, to be used as a quiescence point in
user space RCU. There is an older write up on this, back in 2010:
 
https://blogs.oracle.com/dave/qpi-quiescence
 
I have implemented RCU in userspace before.
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Aug 07 12:35AM -0700

On 8/7/2019 12:26 AM, Bonita Montero wrote:
> advantage of the Intel-Architecture.
> But for what should an unaligned LOCK * be useful? I can see
> no sense in that.
 
There are a whole class of exotic asymmetric synchronization algorithms
that can use this. Although, FlushProcessWriteBuffers can work okay.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 07 09:52AM +0200

>> no sense in that.
 
> To trigger a full blown bus lock, to be used as a quiescence point
> in user space RCU. There is an older write up on this, back in 2010:
 
OK, but that's extremely exotic.
 
> There are a whole class of exotic asymmetric synchronization
> algorithms that can use this. Although, FlushProcessWriteBuffers
> can work okay.
 
Do you have a link?
Paavo Helde <myfirstname@osa.pri.ee>: Aug 07 11:00AM +0300

On 7.08.2019 8:13, Bonita Montero wrote:
 
> I was very surprised that unaligned memory-access is less than twice as
> slow on my PC.
> It would be nice to see results from Intel-CPUs here. Thanks in advance.
 
Results from x64 build on Intel Core i7-6600U:
 
aligned native: 0.30228
unaligned native: 0.31015
unaligned wrapped: 2.64527
 
Hmm, seems not so unaligned at all... Trying 32-bit build:
 
aligned native: 0.362762
unaligned native: 0.42812
unaligned wrapped: 2.63736
 
 
BTW, it looks to me like your code has a buffer overrun bug. Probably
won't affect the benchmarks, but still.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 07 10:03AM +0200

Am 07.08.2019 um 10:00 schrieb Paavo Helde:
 
> BTW, it looks to me your code has a buffer overrun bug.
> Probably won't affect benchmarks, but still.
 
No, look at the "- 1"!
 
memkill( (unaligned_uint32 *)(m + 1), ELEMS - 1 );
 
Without this, the code would touch the next page, which is very likely
not allocated, so it would crash.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 07 10:07AM +0200

> https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-flushprocesswritebuffers
 
Why is this necessary? We have M(O)ESI.
Paavo Helde <myfirstname@osa.pri.ee>: Aug 07 11:46AM +0300

On 7.08.2019 11:03, Bonita Montero wrote:
 
> memkill( (unaligned_uint32 *)(m + 1), ELEMS - 1 );
 
> Without this the code would touch the next page which is very likely
> to be not allocated so that it would crash.
 
I see. It seems I looked only at the first (aligned) call and assumed
the test set size would always be the same.
David Brown <david.brown@hesbynett.no>: Aug 07 11:19AM +0200

On 07/08/2019 10:00, Paavo Helde wrote:
>>      u[3] = (uint8_t)(ui >> 24);
>>      return *this;
>> }
 
Can MSVC not turn these into optimised unaligned accesses? gcc and
clang treat them exactly like accesses via a cast to a uint32_t pointer,
except that the behaviour is defined and portable. (On targets that
don't support unaligned access, gcc will access data by bytes.)
Bonita Montero <Bonita.Montero@gmail.com>: Aug 07 01:13PM +0200

> Can MSVC not turn these into optimised unaligned accesses?
 
Do you really think there will ever be a compiler that "optimizes"
away the unaligned loads? The compiler wouldn't be doing what I told
it if it did that. I bet there won't be any C/C++ compiler that does
this until the earth is burnt by the sun.
David Brown <david.brown@hesbynett.no>: Aug 07 01:29PM +0200

On 07/08/2019 13:13, Bonita Montero wrote:
> away the unaligned loads? The compiler wouldn't be doing what I told
> it if it did that. I bet there won't be any C/C++ compiler that does
> this until the earth is burnt by the sun.
 
 
#include <stdint.h>
 
struct unaligned_uint32
{
uint8_t u[sizeof(uint32_t)];
operator uint32_t();
unaligned_uint32 &operator = ( uint32_t ui );
};
 
unaligned_uint32::operator uint32_t()
{
return (uint32_t)u[0] | (uint32_t)u[1] << 8 |
(uint32_t)u[2] << 16 | (uint32_t)u[3] << 24;
}
 
 
gcc -O2:
 
unaligned_uint32::operator unsigned int():
movl (%rdi), %eax
ret
Bonita Montero <Bonita.Montero@gmail.com>: Aug 07 02:02PM +0200

> unaligned_uint32::operator unsigned int():
> movl (%rdi), %eax
> ret
 
The compiler isn't optimizing away an unaligned load here.
If rdi is unaligned, then so is the load.
Bonita Montero <Bonita.Montero@gmail.com>: Aug 07 02:08PM +0200

>>          ret
 
> The compiler isn't optimizing away an unaligned load here.
> If rdi is unaligned the load is also.
 
Sorry, you mis-worded what you wanted to say.
> Can MSVC not turn these into optimised unaligned accesses?
... is not what you meant. You meant that the operator is compiled
in such a way that the shifts and byte loads are folded into a
single load. So I misunderstood you.
scott@slp53.sl.home (Scott Lurndal): Aug 06 11:49PM

>> don't - so they are irrelevant to your posting about epoll.
 
>AIO should be fine, its POSIX but oh well:
 
>http://man7.org/linux/man-pages/man7/aio.7.html
 
Indeed, and that's what I used in the Oracle RDBMS OS-dependent code I wrote.
(well, technically, lio_listio() because we'd submit hundreds of
aiocbs with a single request - we had 128 physical disk drives on
64 SCSI controllers running Oracle Parallel Server on 64 tightly coupled
P6 processors running a distributed single-system-image micro-kernel
based operating system (written in C++, using cfront)).
 
Let the OS I/O subsystem do the scheduling, don't try to do it in user mode.
 
Note that modern network workloads that require high packet throughput
generally use DPDK which provides the capability to leverage any hardware
acceleration or offloading capabilities to improve I/O performance. This
uses threads for parallelism.
 
https://www.dpdk.org/
 
Responding to Sam's assertion regarding the TOP500: they, for the
most part, use the InfiniBand RDMA Verbs interface for communications
between nodes, not Ethernet. And it's all buried under OpenMP, PVM or
other standard communication library APIs integrated with C and C++.
scott@slp53.sl.home (Scott Lurndal): Aug 06 11:52PM

>synchronously I would have thought. The alternative seems to be to use
>signals.
 
>Do you have any experience of AIO? (As I say, I don't.)
 
I used it on real Unix extensively (mainly lio_listio()). It's a good
way to let the operating system I/O scheduler handle the I/O in the
most efficient way, and it was quite efficient, particularly with
multiple disk drives on multiple controllers. Granted, the application
was the Oracle RDBMS, for which the I/O requirements are not typical
of most applications.
"Öö Tiib" <ootiib@hot.ee>: Aug 06 04:57PM -0700

On Wednesday, 7 August 2019 02:01:21 UTC+3, Chris Vine wrote:
> you don't do it for the fun of it: you do it to process the data you
> obtain, which is likely to require use of the CPU. So why not just use
> threads without AIO?
 
It is better to use either async or blocking I/O, but not to mix
them. Translation between the two is possible with a thread. What
difference is there if such a translation from blocking to async I/O
is made by glibc or by ourselves (other than less work for ourselves
when glibc does it)?
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: Aug 07 12:52AM +0200

On 07.08.2019 00:38, Stefan Ram wrote:
 
> A coroutine is a function that contains
> either a coroutine-return-statement or
> an await-expression or a yield-expression.
 
You mean, "A C++20 coroutine is...".
 
General implementations of coroutines have no such restriction.
 
Even if the restriction is a special case of coroutines, I think it can
be argued that "coroutine" is a misnomer for the C++20 thing. :-)
Because it implies much that isn't there.
 
---
 
By the way, I learned a bit more, by looking around at tutorials, and
I'm appalled at all the machinery it seems one must define for the
return type of even the simplest C++20 coroutine.
 
It's akin to the infamous 600+ line Microsoft OLE technology "Hello,
world!"...
 
Hopefully they'll get around to providing some default machinery
before it's accepted into the standard.
 
 
Cheers!,
 
- Alf
Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Aug 07 12:01AM +0100

On Tue, 6 Aug 2019 15:19:48 -0700
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>
wrote: [snip]
> AIO should be fine, its POSIX but oh well:
 
> http://man7.org/linux/man-pages/man7/aio.7.html
 
> https://pubs.opengroup.org/onlinepubs/009695399/basedefs/aio.h.html
 
Yes as I understand it AIO works for asynchronous i/o with block devices
representing hard disks, even though you can't use select/poll and I
think epoll for them. But AIO is not my area and I have to say that I
can't really understand the point of it with disks: the issue with such
block devices is that in polling terms they are always available to
read or write in the absence of an error. For reading in particular,
you don't do it for the fun of it: you do it to process the data you
obtain, which is likely to require use of the CPU. So why not just use
threads without AIO? You are not dealing with network latency.
 
I notice incidentally that the documentation for glibc's implementation
of AIO says: "This has a number of limitations, most notably that
maintaining multiple threads to perform I/O operations is expensive
and scales poorly". All the more reason to do that particular work
synchronously I would have thought. The alternative seems to be to use
signals.
 
Do you have any experience of AIO? (As I say, I don't.)
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page.
To unsubscribe from this group and stop receiving emails from it send an email to comp.lang.c+++unsubscribe@googlegroups.com.
