soft and program: Digest for comp.lang.c++@googlegroups.com

In the end, rason will come - 1 Update
Performance of unaligned memory-accesses - 24 Updates

Ralf Goertz <me@myprovider.invalid>: Aug 08 12:14PM +0200

On Wed, 7 Aug 2019 21:20:36 +0200

> Backslash is used for set difference and a few other things in
> mathematics. Quotient groups use forward slash in my experience as
> does integer division. So I don't see any logic here at all.

I have the same experience, except perhaps for integer division. I haven't
seen the forward slash used to indicate *integer* division, but then I am
usually more interested in the remainder than the quotient. :-)

> However, sometimes mathematical symbol conventions vary by language -
> I have no idea about conventions in German.

The conventions in German don't differ from those in English in this
respect AFAIK.

Performance of unaligned memory-accesses

Keith Thompson <kst-u@mib.org>: Aug 07 05:27PM -0700

David Brown <david.brown@hesbynett.no> writes:
[...]
> make the code clearer and more efficient - assuming, of course, that the
> lack of portability is not a problem. Compilers that don't support such
> features will complain, so you don't get silent problems.

Compiler extensions like "packed" can cause problems. For example,
suppose you have something like this (using gcc syntax):

    struct foo {
        char c;
        int i;
    } __attribute__((packed));
    foo obj;
    some_func(&obj.i);

some_func() takes an argument of type int*, but there's no indication
that it's misaligned, so it can't take any special steps to avoid
blowing up when it dereferences its argument.

Recent versions of gcc and clang warn about taking the address of a
misaligned member of a packed structure.

https://stackoverflow.com/q/8568432/827263
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51628
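
A minimal sketch of one way around the problem, assuming GCC or Clang: read the packed member by value, so that the compiler (which knows the member may be misaligned) generates a safe access, and hand the callee the address of an aligned copy. The body of `some_func` and the helper `call_with_aligned_copy` are hypothetical, invented only for illustration.

```cpp
#include <stddef.h>

// GCC/Clang extension: with "packed", `i` sits at offset 1 and is misaligned.
struct foo {
    char c;
    int i;
} __attribute__((packed));

static_assert(offsetof(foo, i) == 1, "packed layout expected");

// Hypothetical callee: it assumes its argument is a properly aligned int*.
inline int some_func(const int *p) { return *p + 1; }

// Read the member by value (the compiler knows it is packed), then pass the
// address of the aligned local instead of &obj.i.
inline int call_with_aligned_copy(const foo &obj) {
    int tmp = obj.i;
    return some_func(&tmp);
}
```

This only helps when the callee needs the value; if it must write through the pointer, the result has to be assigned back to the member afterwards.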

--
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

Keith Thompson <kst-u@mib.org>: Aug 07 05:27PM -0700

> /Please/ learn to use Usenet properly! Keep attributions, and quote an
> appropriate amount of context!

> On 07/08/2019 16:53, Bonita Montero wrote:
[SNIP]

It's not a matter of learning. Bonita Montero's headers indicate
that she(?) is using Thunderbird, which I'm reasonably sure handles
attribution lines correctly. She must be removing them deliberately.
I asked her to keep attribution lines a while ago. She responded
with insults.

Do you use a killfile?

--
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

Christopher Collins <ccollins476ad@gmail.com>: Aug 07 05:28PM -0700

> is the programmer's responsibility. Turning it into small and fast
> object code is the compiler's responsibility. Don't try to do the
> compiler's job - work with it so that it can do the best job it can.

I did an experiment using godbolt (ARM gcc 8.2, compiler settings:
-mcpu=cortex-m4 -Os). In the below code, there are two functions,
`marshal1` and `marshal2`. Both functions serialize an object of type
`struct msg` into a byte array. `marshal1` does it member-by-member;
`marshal2` does it with a single memcpy as I described in my previous
post.

// Code:

#include <inttypes.h>
#include <string.h>

struct msg {
    uint8_t a;
    uint32_t b;
    uint16_t c;
    uint8_t d;
    uint16_t e[4];
} __attribute__((packed));

void marshal1(const struct msg *m, uint8_t *out) {
    out[0] = m->a;
    memcpy(&out[1], &m->b, 4);
    memcpy(&out[5], &m->c, 2);
    out[7] = m->d;
    memcpy(&out[8], m->e, 8);
}

void marshal2(const struct msg *m, uint8_t *out) {
    memcpy(out, m, sizeof *m);
}

// Result:

marshal1:
ldrb r3, [r0] @ zero_extendqisi2
strb r3, [r1]
ldr r3, [r0, #1] @ unaligned
str r3, [r1, #1] @ unaligned
ldrh r3, [r0, #5] @ unaligned
strh r3, [r1, #5] @ unaligned
ldrb r3, [r0, #7] @ zero_extendqisi2
strb r3, [r1, #7]
ldr r3, [r0, #8]! @ unaligned
str r3, [r1, #8] @ unaligned
ldr r3, [r0, #4] @ unaligned
str r3, [r1, #12] @ unaligned
bx lr
marshal2:
add r3, r0, #16
.L3:
ldr r2, [r0], #4 @ unaligned
str r2, [r1], #4 @ unaligned
cmp r0, r3
bne .L3
bx lr

The output for `marshal2` is obviously smaller than that for `marshal1`.
How can I achieve the same results without relying on this technique?
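
One way to make the single-`memcpy` approach less of a leap of faith is to pin the layout down at compile time. The sketch below assumes GCC or Clang (for `__attribute__((packed))`) and that the wire format is exactly the packed in-memory layout in native byte order; the helper name `same_bytes` is made up for illustration.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct msg {
    uint8_t  a;
    uint32_t b;
    uint16_t c;
    uint8_t  d;
    uint16_t e[4];
} __attribute__((packed));

// If any of these fail, a whole-struct memcpy no longer matches the
// member-by-member serialization.
static_assert(sizeof(msg) == 16, "unexpected size");
static_assert(offsetof(msg, b) == 1, "unexpected offset of b");
static_assert(offsetof(msg, c) == 5, "unexpected offset of c");
static_assert(offsetof(msg, d) == 7, "unexpected offset of d");
static_assert(offsetof(msg, e) == 8, "unexpected offset of e");

// Member by member, reading the scalar fields by value first.
inline void marshal1(const msg &m, uint8_t *out) {
    const uint32_t b = m.b;
    const uint16_t c = m.c;
    out[0] = m.a;
    memcpy(out + 1, &b, sizeof b);
    memcpy(out + 5, &c, sizeof c);
    out[7] = m.d;
    memcpy(out + 8, m.e, sizeof m.e);
}

// Whole struct in one copy.
inline void marshal2(const msg &m, uint8_t *out) {
    memcpy(out, &m, sizeof m);
}

// Hypothetical helper: true if both functions emit identical bytes.
inline bool same_bytes(const msg &m) {
    uint8_t x[sizeof(msg)], y[sizeof(msg)];
    marshal1(m, x);
    marshal2(m, y);
    return memcmp(x, y, sizeof x) == 0;
}
```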

Chris

Keith Thompson <kst-u@mib.org>: Aug 07 05:52PM -0700

Bart <bc@freeuk.com> writes:
[...]
> that, or just accept it, and try and create arrays with a stride of 65
> bytes, or pad them to 128 bytes?

> I doubt it will re-design the struct to keep it to a power-of-two.
[...]

If the target imposes alignment constraints and you don't use any
compiler extensions to force packing, the compiler will do what it needs
to do to prevent unaligned access. There's unlikely to be any need to
pad a 65-byte structure to 128 bytes. It would probably just be padded
to 68 or perhaps 72 bytes, depending on the alignment requirements of
its members.

For example:

#include <stdio.h>
#include <stdint.h>
int main(void) {
    struct foo {
        uint32_t arr[16];
        char c;
    };
    struct bar {
        uint64_t arr[8];
        char c;
    };
    printf("struct foo is %zu bytes\n", sizeof (struct foo));
    printf("struct bar is %zu bytes\n", sizeof (struct bar));
}

On my system:

struct foo is 68 bytes
struct bar is 72 bytes

The stride of an array of FOO is always equal to sizeof (FOO); there is
no padding between array elements.
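
The same point can be checked at compile time. This is only a sketch: the exact sizes are implementation-specific, so it asserts the general relationships rather than 68 or 72, and the helper name `foo_stride` is hypothetical.

```cpp
#include <stddef.h>
#include <stdint.h>

struct foo { uint32_t arr[16]; char c; };   // 65 bytes of members
struct bar { uint64_t arr[8];  char c; };   // 65 bytes of members

// The size is rounded up to a multiple of the alignment...
static_assert(sizeof(foo) % alignof(foo) == 0, "size is a multiple of alignment");
static_assert(sizeof(bar) % alignof(bar) == 0, "size is a multiple of alignment");
// ...so array elements can sit back to back with no extra padding.
static_assert(sizeof(foo[4]) == 4 * sizeof(foo), "no padding between elements");

// Distance in bytes between two consecutive array elements.
inline size_t foo_stride() {
    foo a[2] = {};
    return (size_t)((const char *)&a[1] - (const char *)&a[0]);
}
```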

--
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Aug 07 08:55PM -0700

On 8/7/2019 1:07 AM, Bonita Montero wrote:
>> https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-flushprocesswritebuffers

> Why is this necessary? We have M(O)ESI.

The inherent acquire and release wrt loads and stores on Intel is simply
not strong enough to get hazard pointers up and running. Think about it.
Loading/storing from memory into a local variable implies
acquire/release semantics wrt a MOV instruction. IIRC, it's in WB memory,
or something. So, this is not strong enough for SMR, or Safe Memory
Reclamation, aka hazard pointers. Intel is NOT seq_cst... ;^)

SMR needs to load back a value _after_ the previous store obtained real
data. Intel does NOT allow for this without an explicit fence. A store
followed by a load to another location, can and will be reordered. Well,
this is not Kosher wrt SMR!

load from A
store A in B
load from A
does A == B?

the store of A into B needs to be _committed_ before the subsequent load
from A. This can use MFENCE or a LOCK prefix wrt a "dummy" RMW, or even
directly. So:

load from A
store A in B
MFENCE
load from A
does A == B

This is an explicit memory barrier that needs to be added, even on
Intel, believe it or not. It's due to the usage pattern of the algorithm
and the details of Intel.
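
In C++11 atomics, the load / store / fence / re-load sequence above might be sketched as follows. This is an illustration under assumptions, not code from any particular SMR library: `node`, `shared` (location A), `hazard` (location B) and `acquire_hazard` are hypothetical names. On x86, compilers typically implement the `seq_cst` fence with MFENCE or a LOCK-prefixed instruction.

```cpp
#include <atomic>

struct node { int value; };

// Publish a hazard pointer for the node currently stored in `shared`, and
// return that node once the publication is known to cover it.
inline node *acquire_hazard(const std::atomic<node *> &shared,
                            std::atomic<node *> &hazard) {
    node *p = shared.load(std::memory_order_relaxed);      // load from A
    for (;;) {
        hazard.store(p, std::memory_order_relaxed);         // store A in B
        // Store-load barrier: the store to `hazard` must be globally
        // visible before `shared` is read again.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        node *q = shared.load(std::memory_order_acquire);   // load from A
        if (q == p)                                         // does A == B?
            return p;
        p = q;  // `shared` changed in between; retry with the new value
    }
}
```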

"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Aug 07 10:42PM -0700

On 8/6/2019 11:57 PM, Bonita Montero wrote:
>> lines, ..

> That's also not relevant to me; and not only to me because no
> one would do that in reality because this isn't of any use.

Fair enough.

Bonita Montero <Bonita.Montero@gmail.com>: Aug 08 07:43AM +0200

> Out of interest, how are you getting an unaligned_uint32 object
> which is in fact unaligned for the target in question?

I also saw that, but I thought it was sufficient for an example and that
David would be able to work out the rest. And even if there's a missing
directive to enforce unaligned placement, the object might be placed
unaligned by casting a pointer; so a missing directive doesn't count.

"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Aug 07 10:54PM -0700

On 8/7/2019 10:43 PM, Bonita Montero wrote:
> David would be able to work out the rest. And even if there's a missing
> directive to enforce unaligned placement, the object might be placed
> unaligned by casting a pointer; so a missing directive doesn't count.

Think of unaligned straddling a cache line!
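
Whether a given access straddles a cache line is simple address arithmetic. A small sketch, where the 64-byte default line size is an assumption (typical for current x86 parts, but not universal) and the function name is hypothetical:

```cpp
#include <stddef.h>
#include <stdint.h>

// True if the byte range [addr, addr + size) touches more than one cache line.
inline bool straddles_cache_line(uintptr_t addr, size_t size,
                                 size_t line_size = 64) {
    if (size == 0)
        return false;
    // Compare the line index of the first and the last byte accessed.
    return addr / line_size != (addr + size - 1) / line_size;
}
```

For example, a 4-byte access at address 62 covers bytes 62 to 65 and therefore spans two 64-byte lines, while the same access at address 60 does not.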

Bonita Montero <Bonita.Montero@gmail.com>: Aug 08 08:03AM +0200

If you are doing lock-free synchronization and thereby calling
FlushProcessWriteBuffers, you've lost anyway. FlushProcessWriteBuffers
is a kernel call and slow. If you are doing it that way, you could
stick with the usual locking and have better performance.

Bonita Montero <Bonita.Montero@gmail.com>: Aug 08 08:04AM +0200

>> directive to enforce unaligned placement, the object might be placed
>> unaligned by casting a pointer; so a missing directive doesn't count.

> Think of unaligned straddling a cache line!

That has nothing to do with Chris's or my statement.

Bonita Montero <Bonita.Montero@gmail.com>: Aug 08 08:06AM +0200

I don't think FPWB is suitable for lock-free programming because
it is a kernel call and therefore very slow.

Bonita Montero <Bonita.Montero@gmail.com>: Aug 08 08:14AM +0200

Consider this:

#include <Windows.h>
#include <intrin.h>
#include <cstdint>
#include <iostream>

using namespace std;

int main()
{
    unsigned const ROUNDS = 10'000'000;
    int64_t start, end;
    double ticksPerCall;

    // Time ROUNDS back-to-back calls with the timestamp counter.
    start = (int64_t)__rdtsc();
    for( unsigned i = 0; i != ROUNDS; ++i )
        FlushProcessWriteBuffers();
    end = (int64_t)__rdtsc();

    // Average TSC ticks per call.
    ticksPerCall = (end - start) / (double)ROUNDS;
    cout << ticksPerCall << endl;
}

This gives about 1,100 clock cycles per call on my 1800.
Even this is not accurate, because the TSC might be counting at the
base clock; that's horrible.

"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Aug 07 11:54PM -0700

On 8/7/2019 11:03 PM, Bonita Montero wrote:
> FlushProcessWriteBuffers, you've lost anyway. FlushProcessWriteBuffers
> is a kernel call and slow. If you are doing it that way, you could
> stick with the usual locking and have better performance.

Calls to FlushProcessWriteBuffers are all on the slow side, that is, the
slow side of the asymmetric sync. Think about it for a moment.

"Chris M. Thomasson " <ahh_f_it@crap.nothing>: Aug 07 11:59PM -0700

On 8/7/2019 11:06 PM, Bonita Montero wrote:
> I don't think FPWB is suitable for lock-free programming because
> it is a kernel call and therefore very slow.

Abusing the bus lock can give excellent performance, e.g. user-space
RCU wrt read-"mostly" workloads.

"Chris M. Thomasson " <ahh_f_it@crap.nothing>: Aug 08 12:00AM -0700

On 8/7/2019 11:14 PM, Bonita Montero wrote:

> This gives about 1,100 clock cycles per call on my 1800.
> Even this is not accurate, because the TSC might be counting at the
> base clock; that's horrible.

FlushProcessWriteBuffers() is called on the slowpath. Did you read my links?

"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Aug 08 12:16AM -0700

On 8/7/2019 11:04 PM, Bonita Montero wrote:
>>> unaligned by casting a pointer; so a missing directive doesn't count.

>> Think of unaligned straddling a cache line!

> That has nothing to do with Chris's or my statement.

Ahh shi%. Sorry again. I keep thinking of where unaligned access on x86
wrt LOCK can possibly be "useful".

Bonita Montero <Bonita.Montero@gmail.com>: Aug 08 09:30AM +0200

On 08.08.2019 at 02:28, Christopher Collins wrote:

> The output for `marshal2` is obviously smaller than that for `marshal1`.
> How can I achieve the same results without relying on this technique?

> Chris

What you are doing doesn't make sense. When you have a packed data
structure, it is exactly the thing you want to persist or send over the
network. So what you put into out should be directly what the data
structure holds.

David Brown <david.brown@hesbynett.no>: Aug 08 09:56AM +0200

On 07/08/2019 22:48, Bonita Montero wrote:
> unaligned loads with proper code and without vaguely relying on the
> compiler to strip away the assembly of a uint32_t with shifts and
> ORs with a single load / store.

I support the principle of using conditional compilation to let you use
known efficient methods on platforms that support it, and fall back to
generic but safe methods elsewhere. That is a good way to handle
getting maximal efficiency on platforms you view as important, while
keeping portability.
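
The "generic but safe" fallback can be sketched as below. Assembling the value from individual bytes with shifts and ORs is valid on every target and does not depend on the host byte order; the function name `load_le32` is hypothetical, and little-endian wire order is an assumption made for the example.

```cpp
#include <stdint.h>

// Read a 32-bit little-endian value from a possibly unaligned address.
// Only byte accesses are performed, so no alignment requirement applies.
inline uint32_t load_le32(const unsigned char *p) {
    return  (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```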

Unfortunately, you are completely wrong about where unaligned accesses
are actually supported by the compiler. You have identified some
/targets/ that support unaligned access, but the /compilers/ do not
support it.

Note that this does /not/ mean the compilers will not generate code that
does what you want. The code you write might work fine in testing. You
could look at the assembly, and it looks fine. And then one day you
make a minor change to another part of the code, and suddenly it does
/not/ work as you expect. Or you upgrade your compiler. Or you change
an optimisation flag.

C (C++ follows here) does not allow you to create an unaligned pointer
by conversions - see C11 6.3.2.3p7 "A pointer to an object type may be
converted to a pointer to a different object type. If the resulting
pointer is not correctly aligned for the referenced type, the
behavior is undefined." It does not, except for certain cases, allow
you to take a pointer to one type, convert it to a pointer to a
different type, and use that to access data (6.5p7). It does not allow
you to access data through an unaligned pointer (6.5.3.2p4).

Nothing in the gcc documentation, nor the MSVC documentation I have
read, nor the documentation for any other compiler I have ever used (and
that's a lot, on many embedded systems) gives the impression that the
compilers support unaligned access.

You are living on luck. That is a ridiculous attitude to take, when it
is not difficult to get code that is correct /and/ efficient.
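
A commonly used way to get both, sketched here as an illustration rather than quoted from anyone in the thread, is to go through `memcpy` with a fixed size. The names `load_u32` and `store_u32` are hypothetical. The `marshal1` listing earlier in this digest shows gcc turning exactly this kind of fixed-size `memcpy` into single unaligned `ldr`/`str` instructions on a Cortex-M4; other compilers and targets may generate different code.

```cpp
#include <stdint.h>
#include <string.h>

// Load a uint32_t in native byte order from a possibly unaligned address.
// memcpy has no alignment requirement, so no misaligned pointer is ever
// dereferenced.
inline uint32_t load_u32(const void *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

// Store a uint32_t in native byte order to a possibly unaligned address.
inline void store_u32(void *p, uint32_t v) {
    memcpy(p, &v, sizeof v);
}
```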

> #ifdef _MSC_VER
>     // MSC only works on x86/x64 and ARMv8
>     #define SUPPORTS_UNALIGNED

MSVC does not, as far as I can tell, support unaligned accesses on any
platform unless you use the __unaligned keyword.

> #elif(__GNUC__)
>     #if defined(__x86_64__) || defined(__i386__)
>         #define SUPPORTS_UNALIGNED

gcc does not support unaligned accesses on any platform.

>     #elif defined(__aarch64__)
>         #define SUPPORTS_UNALIGNED

Compilers for aarch64 do not support unaligned accesses unless they
specifically document that they do.

>     return *this;
>

Thursday, August 8, 2019

Digest for comp.lang.c++@googlegroups.com - 25 updates in 2 topics
