- In the end, rason will come - 1 Update
- Performance of unaligned memory-accesses - 24 Updates
Ralf Goertz <me@myprovider.invalid>: Aug 08 12:14PM +0200

On Wed, 7 Aug 2019 21:20:36 +0200:

> Backslash is used for set difference and a few other things in
> mathematics. Quotient groups use forward slash in my experience as
> does integer division. So I don't see any logic here at all.

I have the same experience. Maybe apart from integer division. I haven't
seen the forward slash as an indicator of *integer* division, but then I
am usually more interested in the remainder than the quotient. :-)

> However, sometimes mathematical symbol conventions vary by language -
> I have no idea about conventions in German.

The conventions in German don't differ from those in English in this
respect AFAIK.

Keith Thompson <kst-u@mib.org>: Aug 07 05:27PM -0700

David Brown <david.brown@hesbynett.no> writes:
[...]
> make the code clearer and more efficient - assuming, of course, that the
> lack of portability is not a problem. Compilers that don't support such
> features will complain, so you don't get silent problems.

Compiler extensions like "packed" can cause problems. For example,
suppose you have something like this (using gcc syntax):

    struct foo {
        char c;
        int i;
    } __attribute__((packed));

    foo obj;
    some_func(&obj.i);

some_func() takes an argument of type int*, but there's no indication
that it's misaligned, so it can't take any special steps to avoid
blowing up when it dereferences its argument.

Recent versions of gcc and clang warn about taking the address of a
misaligned member of a packed structure.

https://stackoverflow.com/q/8568432/827263
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51628

--
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

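One way to sidestep the hazard described in the post above is to copy the packed member into an ordinary local and pass the address of the copy instead. The following is only a minimal sketch under stated assumptions: it uses the gcc/clang attribute syntax from the post, and `some_func` and `call_with_copy` are hypothetical stand-ins, not code from the thread.

```c
#include <stdint.h>

/* Packed layout as in the post (gcc/clang extension). */
struct foo {
    char c;
    int i;
} __attribute__((packed));

/* Hypothetical stand-in for a function that expects an aligned int*. */
static int some_func(const int *p)
{
    return *p + 1;
}

/* Hypothetical helper: read the member by value, then pass an aligned copy. */
static int call_with_copy(const struct foo *obj)
{
    int tmp = obj->i;        /* the compiler knows i is packed and loads it safely */
    return some_func(&tmp);  /* &tmp is a correctly aligned int* */
}
```

The cost is one extra copy, and any write-back to the member has to be done explicitly by assigning to `obj->i` afterwards.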
Keith Thompson <kst-u@mib.org>: Aug 07 05:27PM -0700

> /Please/ learn to use Usenet properly! Keep attributions, and quote an
> appropriate amount of context!

> On 07/08/2019 16:53, Bonita Montero wrote:
[SNIP]

It's not a matter of learning. Bonita Montero's headers indicate that
she(?) is using Thunderbird, which I'm reasonably sure handles
attribution lines correctly. She must be removing them deliberately.

I asked her to keep attribution lines a while ago. She responded with
insults.

Do you use a killfile?

--
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

Christopher Collins <ccollins476ad@gmail.com>: Aug 07 05:28PM -0700

> is the programmer's responsibility. Turning it into small and fast
> object code is the compiler's responsibility. Don't try to do the
> compiler's job - work with it so that it can do the best job it can.

I did an experiment using godbolt (ARM gcc 8.2, compiler settings:
-mcpu=cortex-m4 -Os). In the code below, there are two functions,
`marshal1` and `marshal2`. Both functions serialize an object of type
`struct msg` into a byte array. `marshal1` does it member-by-member;
`marshal2` does it with a single memcpy as I described in my previous
post.

// Code:

    #include <inttypes.h>
    #include <string.h>

    struct msg {
        uint8_t a;
        uint32_t b;
        uint16_t c;
        uint8_t d;
        uint16_t e[4];
    } __attribute__((packed));

    void marshal1(const struct msg *m, uint8_t *out)
    {
        out[0] = m->a;
        memcpy(&out[1], &m->b, 4);
        memcpy(&out[5], &m->c, 2);
        out[7] = m->d;
        memcpy(&out[8], m->e, 8);
    }

    void marshal2(const struct msg *m, uint8_t *out)
    {
        memcpy(out, m, sizeof *m);
    }

// Result:

    marshal1:
            ldrb    r3, [r0]        @ zero_extendqisi2
            strb    r3, [r1]
            ldr     r3, [r0, #1]    @ unaligned
            str     r3, [r1, #1]    @ unaligned
            ldrh    r3, [r0, #5]    @ unaligned
            strh    r3, [r1, #5]    @ unaligned
            ldrb    r3, [r0, #7]    @ zero_extendqisi2
            strb    r3, [r1, #7]
            ldr     r3, [r0, #8]!   @ unaligned
            str     r3, [r1, #8]    @ unaligned
            ldr     r3, [r0, #4]    @ unaligned
            str     r3, [r1, #12]   @ unaligned
            bx      lr
    marshal2:
            add     r3, r0, #16
    .L3:
            ldr     r2, [r0], #4    @ unaligned
            str     r2, [r1], #4    @ unaligned
            cmp     r0, r3
            bne     .L3
            bx      lr

The output for `marshal2` is obviously smaller than that for `marshal1`.
How can I achieve the same results without relying on this technique?

Chris

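For comparison, here is a rough sketch of what member-by-member marshalling looks like when the struct is not packed at all. It is not code from the thread: `struct msg_plain`, `put_u16`, `put_u32` and `marshal3` are hypothetical names, and the little-endian wire order is an assumption made for the example. It makes no claim about matching the code size of `marshal2`; that would have to be checked against the compiler output.

```c
#include <stddef.h>
#include <stdint.h>

/* Same fields as struct msg in the post, but with natural (unpacked) layout. */
struct msg_plain {
    uint8_t  a;
    uint32_t b;
    uint16_t c;
    uint8_t  d;
    uint16_t e[4];
};

/* Write a 16-bit value in little-endian order; returns the next output position. */
static uint8_t *put_u16(uint8_t *out, uint16_t v)
{
    out[0] = (uint8_t)(v & 0xff);
    out[1] = (uint8_t)(v >> 8);
    return out + 2;
}

/* Write a 32-bit value in little-endian order; returns the next output position. */
static uint8_t *put_u32(uint8_t *out, uint32_t v)
{
    out[0] = (uint8_t)(v & 0xff);
    out[1] = (uint8_t)((v >> 8) & 0xff);
    out[2] = (uint8_t)((v >> 16) & 0xff);
    out[3] = (uint8_t)(v >> 24);
    return out + 4;
}

/* Serialize into a 16-byte buffer with no padding and a fixed byte order. */
static void marshal3(const struct msg_plain *m, uint8_t *out)
{
    *out++ = m->a;
    out = put_u32(out, m->b);
    out = put_u16(out, m->c);
    *out++ = m->d;
    for (size_t i = 0; i < 4; ++i)
        out = put_u16(out, m->e[i]);
}
```

Unlike a `memcpy` of a packed struct, this fixes the byte order of the output regardless of the host's endianness, which may or may not be what a given protocol needs.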
Keith Thompson <kst-u@mib.org>: Aug 07 05:52PM -0700

Bart <bc@freeuk.com> writes:
[...]
> that, or just accept it, and try and create arrays with a stride of 65
> bytes, or pad them to 128 bytes?

> I doubt it will re-design the struct to keep it to a power-of-two.
[...]

If the target imposes alignment constraints and you don't use any
compiler extensions to force packing, the compiler will do what it needs
to do to prevent unaligned access.

There's unlikely to be any need to pad a 65-byte structure to 128 bytes.
It would probably just be padded to 68 or perhaps 72 bytes, depending on
the alignment requirements of its members. For example:

    #include <stdio.h>
    #include <stdint.h>
    int main(void) {
        struct foo {
            uint32_t arr[16];
            char c;
        };
        struct bar {
            uint64_t arr[8];
            char c;
        };
        printf("struct foo is %zu bytes\n", sizeof (struct foo));
        printf("struct bar is %zu bytes\n", sizeof (struct bar));
    }

On my system:

    struct foo is 68 bytes
    struct bar is 72 bytes

The stride of an array of FOO is always equal to sizeof (FOO); there is
no padding between array elements.

--
Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

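The relationship between size, alignment and array stride described above can also be checked at compile time. This is a small sketch assuming a C11 compiler; `foo_stride` is a hypothetical helper added for illustration, and the exact sizes remain implementation-specific.

```c
#include <stddef.h>
#include <stdint.h>

struct foo { uint32_t arr[16]; char c; };
struct bar { uint64_t arr[8];  char c; };

/* sizeof is always a multiple of the alignment, so arrays need no extra gaps. */
_Static_assert(sizeof(struct foo) % _Alignof(struct foo) == 0,
               "size of struct foo must be a multiple of its alignment");
_Static_assert(sizeof(struct bar) % _Alignof(struct bar) == 0,
               "size of struct bar must be a multiple of its alignment");

/* Hypothetical helper: distance in bytes between consecutive array elements. */
static size_t foo_stride(void)
{
    struct foo a[2];
    return (size_t)((char *)&a[1] - (char *)&a[0]);
}
```

The trailing padding is part of the struct itself, which is why the stride and `sizeof` agree.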
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Aug 07 08:55PM -0700 On 8/7/2019 1:07 AM, Bonita Montero wrote: >> https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-flushprocesswritebuffers > Why is this necessary? We have M(O)ESI. The inherent acquire and release wrt loads and stores on Intel is simply not strong enough to get hazard points up and running. Think about it. Loading/Storing from memory into a local variable implies acquire/release semantics wrt a MOV instruction. Iirc, its in WB memory, or something. so, this is not strong enough for SMR, or Safe Memory Reclamation, aka: Hazard Pointers. Intel is NOT seq_cst... ;^) SMR needs to load back a value _after_ the previous store obtained real data. Intel does NOT allow for this without an explicit fence. A store followed by a load to another location, can and will be reordered. Well, this is not Kosher wrt SMR! load from A store A in B load from A does A == B? the store of A into B needs to be _committed_ before the subsequent load from A. This can use MFENCE or a LOCK prefix wrt a "dummy" RMW, or even directly. So: load from A store A in B MFENCE load from A does A == B This is an explicit memory barrier that needs to be added, even on Intel, believe it or not. Its do to the usage pattern of the algorihtm and the details of Intel. |
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Aug 07 10:42PM -0700 On 8/6/2019 11:57 PM, Bonita Montero wrote: >> lines, .. > That's also not relevant to me; and not only to me because no > one would do that in reality because this isn't of any use. Fair enough. |
Bonita Montero <Bonita.Montero@gmail.com>: Aug 08 07:43AM +0200

> Out of interest, how are you getting an unaligned_uint32 object
> which is in fact unaligned for the target in question?

I also saw that, but I thought it was sufficient for an example and that
David would be able to work out the rest. And even though a directive to
enforce unaligned placement is missing, the object might be placed
unaligned by casting a pointer, so a missing directive doesn't count.

"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Aug 07 10:54PM -0700 On 8/7/2019 10:43 PM, Bonita Montero wrote: > David would be able to think the rest. And even there's a missing > directive to enforce unaligned placement: the object might be placed > unaligned by casting a pointet; so a missing directive doesn't count. Think of unaligned straddling a cache line! |
Bonita Montero <Bonita.Montero@gmail.com>: Aug 08 08:03AM +0200

If you are doing lock-free synchronization and thereby calling
FlushProcessWriteBuffers, you've lost anyway. FlushProcessWriteBuffers
is a kernel call and slow. If you are doing it that way, you could
stick with ordinary locking and get better performance.

Bonita Montero <Bonita.Montero@gmail.com>: Aug 08 08:04AM +0200

>> directive to enforce unaligned placement: the object might be placed
>> unaligned by casting a pointer; so a missing directive doesn't count.

> Think of unaligned straddling a cache line!

That has nothing to do with Chris's statement or mine.

Bonita Montero <Bonita.Montero@gmail.com>: Aug 08 08:06AM +0200

I don't think FPWB is suitable for lock-free programming because
it is a kernel call and therefore very slow.

Bonita Montero <Bonita.Montero@gmail.com>: Aug 08 08:14AM +0200

Consider this:

    #include <Windows.h>
    #include <iostream>
    #include <intrin.h>
    #include <cstdint>  // int64_t

    using namespace std;

    int main()
    {
        unsigned const ROUNDS = 10'000'000;
        int64_t start, end;
        double ticksPerCall;

        // time ROUNDS calls using the timestamp counter
        start = (int64_t)__rdtsc();
        for( unsigned i = 0; i != ROUNDS; ++i )
            FlushProcessWriteBuffers();
        end = (int64_t)__rdtsc();

        ticksPerCall = (end - start) / (double)ROUNDS;
        cout << ticksPerCall << endl;
    }

This gives about 1,100 clock cycles per call on my 1800.
Even this is not accurate, because it might be counting at the base
clock; that's horrible.

"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Aug 07 11:54PM -0700 On 8/7/2019 11:03 PM, Bonita Montero wrote: > FlushProcessWriteBuffers, you've lost anyway. FlushProcessWriteBuffers > is a kernel call and slow. If you are doing it that way, you could > stick with usual locking and having a better performance. Calls to FlushProcessWriteBuffers are all on the slow side. The slow side of the asymmetric sync, think about it for a moment. |
"Chris M. Thomasson " <ahh_f_it@crap.nothing>: Aug 07 11:59PM -0700 On 8/7/2019 11:06 PM, Bonita Montero wrote: > I don't think FPWB is suitable for lock-free-programming because > it a kernel-call and thereby very slow. Abusing the bus lock can create very excellent performance. User space RCU wrt read "mostly" work loads.. |
"Chris M. Thomasson " <ahh_f_it@crap.nothing>: Aug 08 12:00AM -0700 On 8/7/2019 11:14 PM, Bonita Montero wrote: > This gives about 1.100 clock-cycles per call on mmy 1800. > Even this is not accurate because it might be the base-clock; > that's horrible. FlushProcessWriteBuffers() is called on the slowpath. Did you read my links? |
"Chris M. Thomasson" <invalid_chris_thomasson_invalid@invalid.com>: Aug 08 12:16AM -0700 On 8/7/2019 11:04 PM, Bonita Montero wrote: >>> unaligned by casting a pointet; so a missing directive doesn't count. >> Think of unaligned straddling a cache line! > That has nothing to do with Chris or mine statement. Ahh shi%. Sorry again. I keep thinking of where unaligned access on x86 wrt LOCK can possibly be "useful". |
Bonita Montero <Bonita.Montero@gmail.com>: Aug 08 09:30AM +0200

On 08.08.2019 at 02:28, Christopher Collins wrote:
> The output for `marshal2` is obviously smaller than `marshal1`. How can
> I achieve the same results without relying on this technique?
> Chris

What you do doesn't make sense. When you have a packed data structure,
it is exactly what you want to persist or send over the network.
So what you put into out should be exactly what the data structure holds.

David Brown <david.brown@hesbynett.no>: Aug 08 09:56AM +0200

On 07/08/2019 22:48, Bonita Montero wrote:
> unaligned loads with proper code and without vaguely relying on the
> compiler to strip away the assembly of a uint32_t with shifts and
> ORs with a single load / store.

I support the principle of using conditional compilation to let you use
known efficient methods on platforms that support it, and fall back to
generic but safe methods elsewhere. That is a good way to handle
getting maximal efficiency on platforms you view as important, while
keeping portability.

Unfortunately, you are completely wrong about where unaligned accesses
are actually supported by the compiler. You have identified some
/targets/ that support unaligned access, but the /compilers/ do not
support it.

Note that this does /not/ mean the compilers will not generate code that
does what you want. The code you write might work fine in testing. You
could look at the assembly, and it looks fine. And then one day you
make a minor change to another part of the code, and suddenly it does
/not/ work as you expect. Or you upgrade your compiler. Or you change
an optimisation flag.

C (C++ follows here) does not allow you to create an unaligned pointer
by conversions - see C11 6.3.2.3p7 "A pointer to an object type may be
converted to a pointer to a different object type. If the resulting
pointer is not correctly aligned 68) for the referenced type, the
behavior is undefined." It does not, except for certain cases, allow
you to take a pointer to one type, convert it to a pointer to a
different type, and use that to access data (6.5p7). It does not allow
you to access data through an unaligned pointer (6.5.3.2p4).

Nothing in the gcc documentation, nor the MSVC documentation I have
read, nor the documentation for any other compiler I have ever used (and
that's a lot, on many embedded systems) gives the impression that the
compilers support unaligned access.

You are living on luck.
That is a ridiculous attitude to take, when it is not difficult
to get code that is correct /and/ efficient.

> #ifdef _MSC_VER
> // MSC only works on x86/x64 and ARMv8
> #define SUPPORTS_UNALIGNED

MSVC does not, as far as I can tell, support unaligned accesses on any
platform unless you use the __unaligned keyword.

> #elif(__GNUC__)
> #if defined(__x86_64__) || defined(__i386__)
> #define SUPPORTS_UNALIGNED

gcc does not support unaligned accesses on any platform.

> #elif defined(__aarch64__)
> #define SUPPORTS_UNALIGNED

Compilers for aarch64 do not support unaligned accesses unless they
specifically document that they do.

> return *this;
>
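A commonly used way to read a 32-bit value from an arbitrary address without creating a misaligned pointer is to go through `memcpy` or through individual bytes. The following is a minimal sketch of that idea, not code from the thread; `load_u32` and `load_u32_le` are hypothetical helper names.

```c
#include <stdint.h>
#include <string.h>

/* Read a uint32_t in host byte order from any address, aligned or not. */
static uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);  /* memcpy has no alignment requirement on p */
    return v;
}

/* Read a little-endian uint32_t from any address, independent of host order. */
static uint32_t load_u32_le(const void *p)
{
    const unsigned char *b = p;
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}
```

Both functions have defined behaviour for any readable address. Whether a particular compiler turns them into a single load on a given target is an optimisation detail that would need to be confirmed from its output.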