Pavel <pauldontspamtolk@removeyourself.dontspam.yahoo>: Sep 06 11:56PM -0400 Chris Vine wrote: > (not C++17), allowing type deduction for operator(). It does not need > to rely on C++17 deduction guides - you can supply an empty template > type specifier of <> when instantiating std::greater in C++14. OP asked how std::greater() works (i.e. w/o <>). I think for this to work, guides are needed, no? -Pavel |
Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Sep 07 03:25PM +0100 On Fri, 6 Sep 2019 23:56:30 -0400 > > to rely on C++17 deduction guides - you can supply an empty template > > type specifier of <> when instantiating std::greater in C++14. > work, guides are needed, no? -Pavel As I read it, he asked about the specialization of std::greater<void> which allows type deduction in its call operator. That was made available in C++14. In any event, although I am willing to be corrected I thought deduction guides were only relevant to constructors, and the constructor of std::greater takes no arguments from which types can be deduced. It is the call operator which carries out type deduction where the std::greater type is instantiated (explicitly or implicitly) for the void type. C++17 happens (as I understand it) to allow the <> to be omitted where the object is instantiated for the default type. |
Juha Nieminen <nospam@thanks.invalid>: Sep 07 06:55PM > As I read it, he asked about the specialization of std::greater<void> > which allows type deduction in its call operator. That was made > available in C++14. I was, in fact, asking about both things. It was completely mysterious to me how exactly std::greater would know that it was supposed to be comparing doubles, when nowhere in its instantiation it's told that. I didn't know that it has a specialization for void that works for exactly this kind of situation. It's also a bit unclear to me how it works without the empty <> brackets. I know it's related to C++ automatic template type deduction, but it's unclear to me how it works in this case. Curiously, it appears that it works without any explicit deduction guides. It works for the mere reason that the default template parameter for std::greater is void. The rules of automatic template parameter type deduction are still a bit unclear to me. |
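To make the question concrete, here is a minimal sketch of the usage under discussion; the vector of doubles is an assumed stand-in for the original code, not taken from it:

    #include <algorithm>
    #include <functional>
    #include <vector>

    int main() {
        std::vector<double> v{3.14, 1.0, 2.72};
        // C++14: <> selects the default argument, i.e. std::greater<void>,
        // whose call operator is a function template that deduces double.
        std::sort(v.begin(), v.end(), std::greater<>());
        // C++17: the empty <> may be omitted entirely.
        std::sort(v.begin(), v.end(), std::greater());
    }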
Bo Persson <bo@bo-persson.se>: Sep 07 09:28PM +0200 On 2019-09-07 at 20:55, Juha Nieminen wrote: > parameter for std::greater is void. > The rules of automatic template parameter type deduction are still > a bit unclear to me. The possibility of omitting an empty <> is just mentioned in passing, while describing "Explicit template argument specification": "Trailing template arguments that can be deduced or obtained from default template-arguments may be omitted from the list of explicit template-arguments. A trailing template parameter pack ([temp.variadic]) not otherwise deduced will be deduced as an empty sequence of template arguments. If all of the template arguments can be deduced, they may all be omitted; in this case, **the empty template argument list <> itself may also be omitted.**" http://eel.is/c++draft/temp.arg.explicit#4 Not obvious at all, if you happened to skip over the last part of this sentence. :-) Bo Persson |
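Applied to an ordinary function template, the quoted passage covers, for example (a minimal sketch, not from the thread):

    template <class T>
    T twice(T x) { return x + x; }

    int main() {
        twice<int>(1); // explicit template argument
        twice<>(1);    // empty list: T deduced as int
        twice(1);      // the <> itself may also be omitted
        return 0;
    }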
Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Sep 07 11:06PM +0100 On Sat, 7 Sep 2019 18:55:59 -0000 (UTC)
> parameter for std::greater is void.
> The rules of automatic template parameter type deduction are still
> a bit unclear to me.
For instantiations of std::greater for other than the void type, the call operator takes arguments of the type for which the struct has been instantiated. In particular, std::greater is a class template, but the call operator is not a function template so no type deduction takes place. std::greater<void> is a specialization of std::greater. In this specialization, the call operator is a function template. So it does deduce the type. If this is still confusing, note that:

    template <class T>
    struct A {                     // class template
        void do_it(T) { /*...*/ }  // not a function template, no deduction
    };

    struct B {
        template <class T>
        void do_it(T) { /*...*/ }  // function template, argument types deduced
    };
|
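The void specialization being described looks roughly like the following simplified sketch (the real std::greater<void> also uses a deduced return type and carries an is_transparent typedef):

    #include <utility>

    template <class T = void>
    struct my_greater {    // primary template: T fixed at instantiation,
                           // so the call operator deduces nothing
        bool operator()(const T& a, const T& b) const { return a > b; }
    };

    template <>
    struct my_greater<void> {    // the C++14 "diamond" specialization
        template <class T, class U>
        bool operator()(T&& a, U&& b) const {
            return std::forward<T>(a) > std::forward<U>(b);
        }
    };

    // my_greater<double>{}(1.0, 2.0); // argument types fixed by T
    // my_greater<>{}(1.0, 2.0);       // argument types deduced per call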
Pavel <pauldontspamtolk@removeyourself.dontspam.yahoo>: Sep 07 06:35PM -0400 Chris Vine wrote: > where the std::greater type is instantiated (explicitly or implicitly) > for the void type. C++17 happens (as I understand it) to allow the <> > to be omitted where the object is instantiated for the default type. Thanks, good to know; always glad to stand corrected. As I mentioned, I did not yet understand the guides enough so I guess I was ascribing all superpowers to them. Turned out, the explanation was much simpler: the syntax with <> could even be implemented at C++11 level if defined for the library. -Pavel |
BGB <cr88192@gmail.com>: Sep 06 10:03PM -0500 On 9/6/2019 8:51 AM, Scott Lurndal wrote:
> even given the infrequency of longjmp calls/throw (exceptions in production
> code are rare; mainly when the stack limit is reached and more stack needs
> to be allocated by the OS to the application).
OK. In my emulators, I instead worked by taking a snapshot of the state at the time of the fault (and disabling memory IO operations and similar at this time), setting things up so that the inner trampoline loop would terminate (transferring control back to an outer loop which handles hardware faults/exceptions), which would then restore the relevant state (and restore IO) while transferring control to the interrupt handler. In my case, the emulator was mostly operating via "continuation passing style", and would typically set up the pointer to the "next instruction trace" prior to executing the current trace, so if an exception occurs it could set this pointer to NULL and thus cause the outer trampoline to terminate. More or less (note, written in C):

    while(tr && (cpu->lim > 0))
    {
        cpu->lim -= tr->n_cyc;
        tr = tr->Run(cpu, tr);
    }

In this case, no use is made of either exceptions or longjmp... Some of my other stuff uses longjmp though. The 'lim' stuff is mostly related to time-slicing and hardware interrupts and similar (and can be in the CPU Context structure partly to facilitate "penalties" such as accounting for cache misses and similar). This part would be omitted if one doesn't need time-slicing, leaving a trampoline loop more like:

    while(tr)
    {
        tr = tr->Run(cpu, tr);
    }

Frequently, these trampoline loops can become significant in the profiler, so they need to be kept fast. Typically, my emulators are mostly one-off, typically written for a single ISA and a set of hardware devices (as opposed to trying to support "everything under the sun"). With an emulator for my custom hobby ISA (*1), on my Ryzen 2700X (single threaded), I can generally get ~ 200 MIPS (~ 300 MHz) with a plain C interpreter in this style, and ~ 650 MIPS (~ 800 MHz) with a quick/dirty JIT (output is mostly in a "call-threaded" style, with the output consisting mostly of function calls; still operating via the trampoline loops). In the past (with an emulator for SH4), I had managed to break 1000 MIPS with a similar design. Though, SH4 does less useful work per instruction, owing to the limitations of fixed-length 16-bit instructions. Though, this emulator is written more to model the expected hardware behavior than for maximum performance. The emulator is a bit slower running on ARM hardware (eg: on a Raspberry Pi or on my phone), but still manages to be considerably faster than DOSBox and similar (in that a Doom port to my ISA is playable; whereas with DOSBox on a RasPi it is basically unplayable). This is still true even when just using the interpreter.

*1: Takes inspiration from various ISAs, general summary:
    Load/Store (RISC style; Base+Disp and Base+Index*Sc addressing);
        Also supports PC-rel and GBR-rel as special cases.
    Constrained variable length (16/32 or 16/32/48-bit encodings);
    Predicated instructions (sorta like ARM, but T/F vs full CC's);
    Explicitly parallel instructions (vaguely akin to IA64 or similar);
        Bundles are encoded as variable-length chains of instructions;
        Parallel ops and predicated instructions use similar encodings.
    Supports specialized ops to help with constant loading;
        Constants are composed inline, rather than using memory loads;
    Is 64-bit with 32 registers in the main RF (27 GPRs + 5 SPRs);
    Double precision FPU (uses GPRs);
    Optional packed-integer SIMD support (also via GPRs);
    ...

It can execute multiple instructions per clock, but this depends on the compiler (or explicitly via ASM code) to figure out how to do so (similar to VLIW style ISA's, such as Itanium/IA64 or TMS320C6xxx or similar). This avoids some of the relative hardware cost and complexity of "going superscalar". However, its encoding also avoids the severe code-density penalties typical of VLIW architectures while doing so (overall code density remains competitive with scalar RISCs); code can also remain binary compatible with a scalar-only implementation (if it follows the rules). So, for a lot of the "fast" ASM code, there is some amount of:

    ADD R23, R19, R9 | AND R29, #511, R13 | MUL R30, R27, R21

Or similar, with '|' designating ops which are to be run in parallel (though, the ISA does impose some limits as to what sorts of things are allowed with this; if one goes outside the rules for a given profile, the results are undefined). Otherwise, my C compiler will attempt to shuffle the instructions around to find cases where it can put instructions in parallel with each other.

Compiler:
    Mostly outputs flat ROM images, and a tweaked PE/COFF variant;
    Currently only really supports C. There is a partially-implemented / mostly-untested C++ subset though.

Still working on having a usable Verilog/FPGA implementation. CPU core mostly works in simulation; mostly working on the various peripheral hardware and trying to fill in the unit-testing at this point. Which features can be enabled will depend some on the FPGA (currently mostly targeting medium-range Spartan-7 and Artix-7 devices). |
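For readers picturing the data flow, the trampoline loop quoted above implies structures roughly like the following. This is a hypothetical sketch with invented names (Trace, Cpu, run_slice), not the emulator's actual definitions; C-style, as in the post:

    /* Each trace returns its successor, so no host-stack depth builds
       up, and the fault path can simply NULL the next-trace pointer
       to unwind the loop without exceptions or longjmp. */
    typedef struct Trace Trace;
    typedef struct Cpu Cpu;

    struct Cpu {
        long long lim;   /* remaining cycle budget for this time slice  */
        Trace *tr_next;  /* successor trace; NULL'ed when a fault fires */
        /* ... register file, MMU state, IO function pointers ...       */
    };

    struct Trace {
        int n_cyc;                           /* cycle cost of this trace */
        Trace *(*Run)(Cpu *cpu, Trace *tr);  /* run it, return successor */
    };

    void run_slice(Cpu *cpu, Trace *tr) {
        while (tr && (cpu->lim > 0)) {
            cpu->lim -= tr->n_cyc;
            tr = tr->Run(cpu, tr);
        }
    }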
Tim Rentsch <tr.17687@z991.linuxsc.com>: Sep 07 06:59AM -0700 > On Tuesday, September 3, 2019 at 4:36:02 PM UTC-4, Scott Lurndal wrote: >> Funny, I get the opposite impression. > True or not, it's not necessary to share these impressions, [...] It was you giving your impression that started it. If you think other people shouldn't do something then you shouldn't do it either. |
BGB <cr88192@hotmail.com>: Sep 07 10:57AM -0500 ( Mostly trying to clarify some things ) On 9/6/2019 10:03 PM, BGB wrote:
> trace" prior to executing the current trace, so if an exception occurs
> it could set this pointer to NULL and thus cause the outer trampoline to
> terminate. ...
> trampoline loop more like:
> while(tr)
> { tr=tr->Run(cpu, tr); }
Missing some relevant details here. Trace Run was typically:

    Set up some initial trace state ("cpu->tr_next" pointer, ...);
    Execute an unrolled loop of indirect function calls;
        Each unrolled loop-length, ..., has its own Run functions;
    Return the "cpu->tr_next" pointer.
        This may have been NULL'ed if an exception was thrown.

As-needed, the trace function may fetch traces for connected control-flow paths, which will go through a big hash table (if the trace has already been decoded for a given PC address) and may trigger the emulator's instruction-decoding logic as-needed (which decodes a non-branching trace of up to 32 instructions). Ideally, one wants to avoid going through decoding if possible, as doing so is fairly expensive. Following the fetch, the Run function may replace itself with a version which does not fetch the traces (using the 'tr'/'Trace' structure to cache the relevant pointers, setting the next pointers from the cached values). Note that conditional direct-branch instructions are handled by conditionally copying the value from one pointer to another and setting a different value for the CPU Context's PC register. If the emulated program triggers an I-cache flush, it may be necessary to tear down the entire trace-graph (and flush any JIT related state, ..., if applicable). In my tests, full teardown was generally the faster option. Typically, memory IO was also via wrapper functions, which then call through function pointers held in the CPU Context structure. These pointers may be changed depending on state (such as whether or not the MMU has been enabled, whether the address space is 32 or 48 bits, whether or not an exception is in the process of being raised, ...). The memory address space is fully emulated in my case, generally:

    (If JIT/Fast case)
        Checks address against a cached set of 2 pages;
        If found, do IO via these and return.
    (If MMU)
        Translate address via TLB;
        May trigger TLB miss exception (ISA uses semi-exposed TLB)
    (If Slow case)
        Model L1 and L2 cache misses
        Accumulates counts of cache hits/misses, adds "penalty cycles", ...
    Lookup the "memory span" associated with this address
        Searches a spans array via a binary search.
    If it is a "basic memory" span:
        Add to 2-page cache, perform IO.
    Else:
        Call span's appropriate Get/Set handlers.
        (Each span is set up with pointers for its matching HW device).

I guess some other emulators go and use OS-level memory-mapping tricks to try to implement the address space, but I didn't want to deal with this sort of hassle (similar reasons for it being single-threaded). It is generally fast enough for what I need it for, so is acceptable. All this is considerably different from how it would work in Verilog, but I wanted it fast enough to be able to keep up with real-time. A lot of the peripheral hardware is written using a lot more if/else branches and "switch()" and similar though. Similarly, what makes for good results in Verilog gives particularly bad performance if done in C (and vice versa). Though, some parts of the emulator have been reused in some of the unit tests for Verilog modules. Not really sure how common or uncommon of a design any of this is.
Note that the underlying design can be applied to emulating other ISA designs, eg, x86 or ARM, but in both cases, their use of condition codes makes achieving decent emulator performance more difficult (CC's are more difficult to check/update efficiently, vs a SuperH style True/False flag or an explicit boolean register).
> Frequently, these trampoline loops can become significant in the
> profiler, so need to be kept fast. ...
> JIT (output is mostly in a "call-threaded" style, with the output
> consisting mostly of function calls; still operating via the trampoline
> loops). ...
On x86, much of the speedup from the JIT seems to come from eliminating all the indirect calls in favor of direct function calls. So, an unrolled loop something like:

    (*ops)->Run(cpu, *ops); ops++;
    (*ops)->Run(cpu, *ops); ops++;
    (*ops)->Run(cpu, *ops); ops++;
    ...

Becomes effectively something like, eg:

    JX2VM_OpXOR_RIR(cpu, op1);
    JX2VM_OpADD_RRR(cpu, op2);
    ...

(just in machine code). Though various common operations are translated directly, much of the overall speedup seems to be due to this transformation. JIT is only really enabled at higher clock-speed targets for now, as it interferes with some of the modeling and analysis features. If I am telling it to operate at 50 or 100 MHz, chances are I care about analysis, and if I tell it to go at 500MHz, it assumes I probably don't care as much (and the plain interpreter can't keep up with real-time emulation at 500MHz). On PC's though, even just the interpreter is faster than what I can expect from the hardware on an FPGA (short of getting a faster and much more expensive FPGA, like a Kintex or Virtex), so it is already basically fast enough for what I am doing. I have noted, however, that this does not seem to apply to ARM; I saw very little gain from call-threaded code over the plain interpreter (and no JIT yet exists for AArch64). I had planned to move parts of the ARM JIT from using GPRs over to NEON (NEON is a closer fit to my ISA than ARM's base integer ISA, but needs a Pi3 or newer), however I had not done so yet. On a RasPi, the emulator isn't currently fast enough (even with JIT) to give acceptable framerates if running Quake, which is kinda lame... Though, it isn't exactly like the FPGA version would do much better here (currently limited to ~ 50MHz). Quake could probably be made faster, but it would likely involve rewriting big chunks of the renderer in ASM or similar (like Quake did originally on x86; rather than using the C version of the renderer). |
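The fast-path memory access described in the previous post can be pictured with a sketch like this; the names and the page size are invented for illustration, and only the read path is shown:

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SHIFT 12
    #define PAGE_MASK  ((1ULL << PAGE_SHIFT) - 1)

    typedef struct Cpu Cpu;

    typedef struct {
        uint64_t page_base;  /* guest address of the cached page */
        uint8_t *host_ptr;   /* host memory backing that page    */
    } PageCacheEnt;

    struct Cpu {
        PageCacheEnt pcache[2];                   /* 2-entry page cache */
        uint32_t (*GetU32_Slow)(Cpu *, uint64_t); /* TLB/span slow path */
    };

    static uint32_t mem_get_u32(Cpu *cpu, uint64_t addr) {
        uint64_t base = addr & ~PAGE_MASK;
        for (int i = 0; i < 2; i++) {
            if (cpu->pcache[i].page_base == base) {
                /* fast path: direct host access, no TLB or span search */
                uint32_t v;
                memcpy(&v, cpu->pcache[i].host_ptr + (addr & PAGE_MASK), 4);
                return v;
            }
        }
        /* slow path: TLB translate, cache-miss penalties, span lookup;
           a "basic memory" hit would install the page into pcache. */
        return cpu->GetU32_Slow(cpu, addr);
    }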
Bonita Montero <Bonita.Montero@gmail.com>: Sep 07 06:17PM +0200
> It's the unwinding support that adds overhead. Regardless of
> whether the exception is ever thrown in any particular codepath.
Deallocation of the resources happens anyway, regardless of whether you program in C or C++. So if you have the same kind of structures that hold the resources, and the deallocation is inlined or not in both cases, the performance of resource-deallocation is actually the same. But the way in C++ is more convenient when you consider something like the return-code evaluation and goto orgies in the Linux kernel. But throwing an exception itself has a high overhead, independently of calling any destructors. I wrote some little test code that measures the overhead of catching the same exception as thrown, and of throwing an exception two inheritance levels deeper than the one caught:

    #include <iostream>
    #include <chrono>
    #include <exception>

    using namespace std;
    using namespace std::chrono;

    struct ExcBase : public exception { };
    struct ExcDerivedA : public ExcBase { };
    struct ExcDerivedB : public ExcDerivedA { };

    typedef long long LL;

    void BaseThrower();
    void DerivedBThrower();
    LL Benchmark( void (*thrower)(), unsigned const rep );

    int main()
    {
        unsigned const REP = 1'000'000;
        LL ns;
        ns = Benchmark( BaseThrower, REP );
        cout << "BaseThrower: " << (double)ns / REP << "ns" << endl;
        ns = Benchmark( DerivedBThrower, REP );
        cout << "DerivedBThrower: " << (double)ns / REP << "ns" << endl;
    }

    LL Benchmark( void (*thrower)(), unsigned const rep )
    {
        time_point<high_resolution_clock> start = high_resolution_clock::now();
        for( unsigned i = rep; i; --i )
            try
            {
                thrower();
            }
            catch( ExcBase & )
            {
            }
        return (LL)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count();
    }

    void BaseThrower()
    {
        throw ExcBase();
    }

    void DerivedBThrower()
    {
        throw ExcDerivedB();
    }

I assume that this code ramps up the clock of my 3.6GHz Ryzen 7 1800X to its boost of 4GHz as there's no load on the other cores. And it turns out that throwing and catching the same exception takes about 9,300 clock cycles, and catching the derived exception about 10,600 clock cycles. So there might be many cases where calling the destructors when throwing an exception isn't the part that consumes most of the time, e.g. when you only deallocate memory to a memory pool, which is usually very fast. |
Bonita Montero <Bonita.Montero@gmail.com>: Sep 07 06:48PM +0200
> So there might be many cases where calling the destructors when throwing
> an exception isn't the part that consumes most of the time, e.g. when
> you only deallocate memory to a memory pool, which is usually very fast.
And I'm astonished that when I compile this code in 32 bit with VC++ 2019, the overhead of catching and throwing an exception is roughly twice as high. In 32-bit mode the stack frames are chained through a linked list whose head is stored in thread-local storage at FS:0. |
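For background: the 32-bit MSVC scheme is frame-based SEH, where each function with handlers links a registration record into that per-thread chain. Roughly (a simplified sketch; the real record layout is compiler/OS internal):

    // Shape of the x86 SEH registration chain; FS:[0] points at the
    // innermost record. Prologs of functions with handlers push a
    // record and epilogs pop it, so some cost is paid on the normal
    // (non-throwing) path too, unlike the table-driven 64-bit scheme.
    struct ExceptionRegistration {
        ExceptionRegistration *prev;  // next record up the stack
        void *handler;                // handler routine for this frame
    };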
Bart <bc@freeuk.com>: Sep 07 06:09PM +0100 On 06/09/2019 12:24, BGB wrote:
> A sane implementation of exceptions can use a lookup table for
> PC/EIP/RIP/whatever, and so only involves significant time overhead when
> actually throwing an exception.
I've never used exceptions (apart from once, see below**). But I understood that in a call-chain like this (so F calls G etc):

    F -> G -> H -> I

If code inside I throws/raises an exception, then it somehow has to work its way up the call stack, releasing any resources attached to local variables, until it finds the place where that exception is caught (ie. checked for in the code). If that happens to be in F, then all the active locals of the intervening G and H functions need to be processed (I guess, call their destructors in C++). Block-scopes within G and H would make that harder. But how does it do that? Does it need extra information pushed at every call to make it possible? If so, then that would be extra overhead that will affect even this call-chain:

    A -> G -> H -> B

where exceptions aren't used at all. (** I've used exceptions once, in a language where I myself implemented an experimental version. But that was dynamic byte-code with a tagged stack structure, so unwinding was trivial. And also much slower than you would expect in C++) |
BGB <cr88192@gmail.com>: Sep 07 03:32PM -0500 On 9/7/2019 12:09 PM, Bart wrote:
> will affect even this call-chain:
> A -> G -> H -> B
> where exceptions aren't used at all.
No extra info is needed on the stack or at runtime, as it is generally provided indirectly via the program binary (such as via the PE/COFF headers or similar). It mostly uses a big lookup table, so an exception will occur at one address, and the dispatcher looks up this address in the table. This could contain either:

    A pointer to the function's prolog or epilog:
        It will parse these to restore the caller's frame.
        The allowed instruction sequences are defined in the ABI.
        The process repeats, this time using the caller's address.
    A pointer to some handler or unwind logic:
        Control is passed to this logic, which does its thing.
        If it catches the exception, we are done here.
        If it is unwind logic, or doesn't catch the exception:
            Control returns back to the dispatcher.
            We restore the caller's frame and continue as above.
    Nothing is found:
        Typically the process terminates or similar.
        At this point, unwinding further is effectively impossible.

The table format is target specific, but generally encodes (in some form):

    The starting address (as an RVA);
    (MIPS, etc) RVA of ending address, exception handler, handler info, etc.
    (x86-64 or IA64) Pointer to an UNWIND_INFO structure (process is more involved).
    (ARM/SH/Thumb/etc) Second DWORD holds several bit-twiddled fields:
        Whether or not an exception-handler exists is given as a flag.
        The entry point for the handler will directly follow the epilog.
        Otherwise, the prolog sequence is parsed to unwind the function.

In my ISA (BJX2), it is encoded similarly to SH and Thumb and similar, but with a little more flexibility regarding the prolog and epilog. For example, the compiler uses a space-saving feature where the common parts of the prolog and epilog sequences are "folded off" and reused across multiple functions; the entry point for a function may simply preserve the Link Register and then direct-call into the reused prolog sequence. In this case, the unwind logic would need to follow the branch to find the "real" prolog sequence (which is terminated by a function return). AFAIK, on ELF based systems, a specialized subset of the DWARF format is used for unwinding from exceptions, rather than the Win64 approach of parsing fixed-format epilog sequences.
> an experimental version. But that was dynamic byte-code with a tagged
> stack structure, so unwinding was trivial. And also much slower than you
> would expect in C++)
OK. I sort of have a lot of this stuff in place in my compiler and ABI spec, but the relevant exception-handling machinery in the runtime hasn't been implemented yet IIRC (not gotten around to it). |
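The dispatcher's first step, mapping the faulting address into the table, can be sketched as a plain binary search over RVA ranges (hypothetical structure and names; real formats differ per target, as described above):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint32_t begin_rva; /* function start, image-relative */
        uint32_t end_rva;   /* one past the function's end    */
        uint32_t info_rva;  /* unwind info / handler data     */
    } UnwindEntry;

    /* Find the entry whose [begin,end) range contains pc_rva; the table
       is sorted by begin_rva with non-overlapping ranges. Returns NULL
       if no entry matches, i.e. unwinding further is impossible. */
    static const UnwindEntry *find_unwind_entry(
        const UnwindEntry *tab, size_t n, uint32_t pc_rva)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (pc_rva < tab[mid].begin_rva)
                hi = mid;
            else if (pc_rva >= tab[mid].end_rva)
                lo = mid + 1;
            else
                return &tab[mid];
        }
        return NULL;
    }

No bookkeeping is pushed on ordinary calls; the entire cost of locating the handler and re-parsing prologs is deferred to the (rare) throw, which is why the happy path stays free.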
Pavel <pauldontspamtolk@removeyourself.dontspam.yahoo>: Sep 06 11:48PM -0400 Tim Rentsch wrote: > template argument types explicitly rather than having them be > deduced. I have a similar disclaimer about not having any > level of expertise in template type deduction. Thanks, I thought about it but wanted to avoid it if I could. I think I will be able to resolve the by-reference-vs-by-value issue specifically by using some pattern on the argument, something like, in your terms, invoke<some_type_calculation<Stuff>::type...>() and then maybe wrapping it in a macro? But this of course sucks -- because of the macro and because it would take care of only one of the transformations that function template parameter substitution may do to the pack. I would rather somehow extract the pack of argument types from the method precisely (this is what I am focusing on now, without much success so far). -Pavel |
Pavel <pauldontspamtolk@removeyourself.dontspam.yahoo>: Sep 07 10:16AM -0400 I just realized the problem was solved (not by me but by a high-class C++ expert who is a colleague of mine, to my luck). The below solution works with gcc 4.8.5. The key is to apply the "Identity" pattern to the variadic parameter type list: typename Identity<A>::type is a dependent qualified name and therefore a non-deduced context, so the pack A... is deduced solely from the member-function-pointer argument, and the call arguments are then simply converted to exactly those types. The prototypes of the generated callMethod member functions are as below:

    void Caller<IA, 1>::callMethod<I, I>(void (IA::*)(I, I),
        Identity<I>::type, Identity<I>::type)
    void Caller<IA, 2>::callMethod<I const&, I const&>(void (IA::*)(I const&, I const&),
        Identity<I const&>::type, Identity<I const&>::type)
    void Caller<IA, 3>::callMethod<I const&, I>(void (IA::*)(I const&, I),
        Identity<I const&>::type, Identity<I>::type)

where Identity<T>::type is actually T, so everything is to my satisfaction:

    #include <iostream>
    #include <functional>

    using namespace std;

    template<class M> struct MethodTraits;

    template<class T> struct Identity { using type = T; };

    struct I;
    ostream& operator<<(ostream&, const I&);

    struct I {
        int i_;
        I(int i): i_(i) {}
        I(const I& i): i_(i.i_) { cout << "I(c&" << i << ")\n"; }
        I(I&& i): i_(i.i_) { cout << "I(&&" << i << ")\n"; }
    };

    ostream& operator<<(ostream& os, const I& i) { return os << i.i_; }

    struct IA {
        virtual void ma(I, I) = 0;
        virtual void mb(const I&, const I&) = 0;
        virtual void mc(const I&, I) = 0;
    };

    template <class T, const int>
    struct Caller {
        Caller(T& to): o(to) {}
        T& o;
        template<class... A>
        void callMethod(void (T::*m)(A...), typename Identity<A>::type... args) {
            (o.*m)(args...);
        }
    };

    struct A: public IA {
        void ma(I, I) override { cout << "A::ma(I, I)\n"; }
        void mb(const I&, const I&) override { cout << "A::mb(const I& , const I&)\n"; }
        void mc(const I&, I) override { cout << "A::mc(const I& , I)\n"; }
    };

    int main(int, char*[]) {
        A a;
        const I i1{1};
        const I i2{2};
        I i3{3};
        I i4{4};
        const I i5{5};
        const I i6{6};
        Caller<IA, 1>(a).callMethod(&IA::ma, i1, i2);
        Caller<IA, 2>(a).callMethod(&IA::mb, i3, i4);
        Caller<IA, 3>(a).callMethod(&IA::mc, i5, i6);
        return 0;
    }
|