Pavel <pauldontspamtolk@removeyourself.dontspam.yahoo>: Sep 06 11:56PM -0400 Chris Vine wrote: > (not C++17), allowing type deduction for operator(). It does not need > to rely on C++17 deduction guides - you can supply an empty template > type specifier of <> when instantiating std::greater in C++14. OP asked how std::greater() works (i.e. w/o <>). I think for this to work, guides are needed, no? -Pavel |
Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Sep 07 03:25PM +0100 On Fri, 6 Sep 2019 23:56:30 -0400 > > to rely on C++17 deduction guides - you can supply an empty template > > type specifier of <> when instantiating std::greater in C++14. > work, guides are needed, no? -Pavel As I read it, he asked about the specialization of std::greater<void> which allows type deduction in its call operator. That was made available in C++14. In any event, although I am willing to be corrected I thought deduction guides were only relevant to constructors, and the constructor of std::greater takes no arguments from which types can be deduced. It is the call operator which carries out type deduction where the std::greater type is instantiated (explicitly or implicitly) for the void type. C++17 happens (as I understand it) to allow the <> to be omitted where the object is instantiated for the default type. |
Juha Nieminen <nospam@thanks.invalid>: Sep 07 06:55PM > As I read it, he asked about the specialization of std::greater<void> > which allows type deduction in its call operator. That was made > available in C++14. I was, in fact, asking about both things. It was completely mysterious to me how exactly std::greater would know that it was supposed to be comparing doubles, when nowhere in its instantiation it's told that. I didn't know that it has a specialization for void that works for exactly this kind of situation. It's also a bit unclear to me how it works without the empty <> brackets. I know it's related to C++ automatic template type deduction, but it's unclear to me how it works in this case. Curiously, it appears that it works without any explicit deduction guides. It works for the mere reason that the default template parameter for std::greater is void. The rules of automatic template parameter type deduction are still a bit unclear to me. |
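To make the question concrete, here is a minimal sketch of the usage under discussion; the vector of doubles is an assumed stand-in for the original code, not taken from it:

    #include <algorithm>
    #include <functional>
    #include <vector>

    int main() {
        std::vector<double> v{3.14, 1.0, 2.72};
        // C++14: <> selects the default argument, i.e. std::greater<void>,
        // whose call operator is a function template that deduces double.
        std::sort(v.begin(), v.end(), std::greater<>());
        // C++17: the empty <> may be omitted entirely.
        std::sort(v.begin(), v.end(), std::greater());
    }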
Bo Persson <bo@bo-persson.se>: Sep 07 09:28PM +0200 On 2019-09-07 at 20:55, Juha Nieminen wrote: > parameter for std::greater is void. > The rules of automatic template parameter type deduction are still > a bit unclear to me. The possibility of omitting an empty <> is just mentioned in passing, while describing "Explicit template argument specification": "Trailing template arguments that can be deduced or obtained from default template-arguments may be omitted from the list of explicit template-arguments. A trailing template parameter pack ([temp.variadic]) not otherwise deduced will be deduced as an empty sequence of template arguments. If all of the template arguments can be deduced, they may all be omitted; in this case, **the empty template argument list <> itself may also be omitted.**" http://eel.is/c++draft/temp.arg.explicit#4 Not obvious at all, if you happened to skip over the last part of this sentence. :-) Bo Persson |
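Applied to an ordinary function template, the quoted passage covers, for example (a minimal sketch, not from the thread):

    template <class T>
    T twice(T x) { return x + x; }

    int main() {
        twice<int>(1); // explicit template argument
        twice<>(1);    // empty list: T deduced as int
        twice(1);      // the <> itself may also be omitted
        return 0;
    }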
Chris Vine <chris@cvine--nospam--.freeserve.co.uk>: Sep 07 11:06PM +0100 On Sat, 7 Sep 2019 18:55:59 -0000 (UTC)
> parameter for std::greater is void.
> The rules of automatic template parameter type deduction are still
> a bit unclear to me.
For instantiations of std::greater for other than the void type, the call operator takes arguments of the type for which the struct has been instantiated. In particular, std::greater is a class template, but the call operator is not a function template so no type deduction takes place. std::greater<void> is a specialization of std::greater. In this specialization, the call operator is a function template. So it does deduce the type. If this is still confusing, note that:

    template <class T>
    struct A {                     // class template
        void do_it(T) { /*...*/ }  // not a function template, no deduction
    };

    struct B {
        template <class T>
        void do_it(T) { /*...*/ }  // function template, argument types deduced
    };
|
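The void specialization being described looks roughly like the following simplified sketch (the real std::greater<void> also uses a deduced return type and carries an is_transparent typedef):

    #include <utility>

    template <class T = void>
    struct my_greater {    // primary template: T fixed at instantiation,
                           // so the call operator deduces nothing
        bool operator()(const T& a, const T& b) const { return a > b; }
    };

    template <>
    struct my_greater<void> {    // the C++14 "diamond" specialization
        template <class T, class U>
        bool operator()(T&& a, U&& b) const {
            return std::forward<T>(a) > std::forward<U>(b);
        }
    };

    // my_greater<double>{}(1.0, 2.0); // argument types fixed by T
    // my_greater<>{}(1.0, 2.0);       // argument types deduced per call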
Pavel <pauldontspamtolk@removeyourself.dontspam.yahoo>: Sep 07 06:35PM -0400 Chris Vine wrote: > where the std::greater type is instantiated (explicitly or implicitly) > for the void type. C++17 happens (as I understand it) to allow the <> > to be omitted where the object is instantiated for the default type. Thanks, good to know; always glad to stand corrected. As I mentioned, I did not yet understand the guides enough so I guess I was ascribing all superpowers to them. Turned out, the explanation was much simpler: the syntax with <> could even be implemented at C++11 level if defined for the library. -Pavel |
BGB <cr88192@gmail.com>: Sep 06 10:03PM -0500 On 9/6/2019 8:51 AM, Scott Lurndal wrote:
> even given the infrequency of longjmp calls/throw (exceptions in production
> code are rare; mainly when the stack limit is reached and more stack needs
> to be allocated by the OS to the application).
OK. In my emulators, I instead worked by taking a snapshot of the state at the time of the fault (and disabling memory IO operations and similar at this time), setting things up so that the inner trampoline loop would terminate (transferring control back to an outer loop which handles hardware faults/exceptions), which would then restore the relevant state (and restore IO) while transferring control to the interrupt handler. In my case, the emulator was mostly operating via "continuation passing style", and would typically set up the pointer to the "next instruction trace" prior to executing the current trace, so if an exception occurs it could set this pointer to NULL and thus cause the outer trampoline to terminate. More or less (note, written in C):

    while(tr && (cpu->lim > 0))
    {
        cpu->lim -= tr->n_cyc;
        tr = tr->Run(cpu, tr);
    }

In this case, no use is made of either exceptions or longjmp... Some of my other stuff uses longjmp though. The 'lim' stuff is mostly related to time-slicing and hardware interrupts and similar (and can be in the CPU Context structure partly to facilitate "penalties" such as accounting for cache misses and similar). This part would be omitted if one doesn't need time-slicing, leaving a trampoline loop more like:

    while(tr)
    {
        tr = tr->Run(cpu, tr);
    }

Frequently, these trampoline loops can become significant in the profiler, so they need to be kept fast. Typically, my emulators are mostly one-off, typically written for a single ISA and a set of hardware devices (as opposed to trying to support "everything under the sun"). With an emulator for my custom hobby ISA (*1), on my Ryzen 2700X (single threaded), I can generally get ~ 200 MIPS (~ 300 MHz) with a plain C interpreter in this style, and ~ 650 MIPS (~ 800 MHz) with a quick/dirty JIT (output is mostly in a "call-threaded" style, with the output consisting mostly of function calls; still operating via the trampoline loops). In the past (with an emulator for SH4), I had managed to break 1000 MIPS with a similar design. Though, SH4 does less useful work per instruction, owing to the limitations of fixed-length 16-bit instructions. Though, this emulator is written more to model the expected hardware behavior than for maximum performance. The emulator is a bit slower running on ARM hardware (eg: on a Raspberry Pi or on my phone), but still manages to be considerably faster than DOSBox and similar (in that a Doom port to my ISA is playable; whereas with DOSBox on a RasPi it is basically unplayable). This is still true even when just using the interpreter.

*1: Takes inspiration from various ISAs, general summary:
    Load/Store (RISC style; Base+Disp and Base+Index*Sc addressing);
        Also supports PC-rel and GBR-rel as special cases.
    Constrained variable length (16/32 or 16/32/48-bit encodings);
    Predicated instructions (sorta like ARM, but T/F vs full CC's);
    Explicitly parallel instructions (vaguely akin to IA64 or similar);
        Bundles are encoded as variable-length chains of instructions;
        Parallel ops and predicated instructions use similar encodings.
    Supports specialized ops to help with constant loading;
        Constants are composed inline, rather than using memory loads;
    Is 64-bit with 32 registers in the main RF (27 GPRs + 5 SPRs);
    Double precision FPU (uses GPRs);
    Optional packed-integer SIMD support (also via GPRs);
    ...

It can execute multiple instructions per clock, but this depends on the compiler (or explicitly via ASM code) to figure out how to do so (similar to VLIW style ISA's, such as Itanium/IA64 or TMS320C6xxx or similar). This avoids some of the relative hardware cost and complexity of "going superscalar". However, its encoding also avoids the severe code-density penalties typical of VLIW architectures while doing so (overall code density remains competitive with scalar RISCs); code can also remain binary compatible with a scalar-only implementation (if it follows the rules). So, for a lot of the "fast" ASM code, there is some amount of:

    ADD R23, R19, R9 | AND R29, #511, R13 | MUL R30, R27, R21

Or similar, with '|' designating ops which are to be run in parallel (though, the ISA does impose some limits as to what sorts of things are allowed with this; if one goes outside the rules for a given profile, the results are undefined). Otherwise, my C compiler will attempt to shuffle the instructions around to find cases where it can put instructions in parallel with each other.

Compiler:
    Mostly outputs flat ROM images, and a tweaked PE/COFF variant;
    Currently only really supports C. There is a partially-implemented / mostly-untested C++ subset though.

Still working on having a usable Verilog/FPGA implementation. CPU core mostly works in simulation; mostly working on the various peripheral hardware and trying to fill in the unit-testing at this point. Which features can be enabled will depend some on the FPGA (currently mostly targeting medium-range Spartan-7 and Artix-7 devices). |
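For readers picturing the data flow, the trampoline loop quoted above implies structures roughly like the following. This is a hypothetical sketch with invented names (Trace, Cpu, run_slice), not the emulator's actual definitions; C-style, as in the post:

    /* Each trace returns its successor, so no host-stack depth builds
       up, and the fault path can simply NULL the next-trace pointer
       to unwind the loop without exceptions or longjmp. */
    typedef struct Trace Trace;
    typedef struct Cpu Cpu;

    struct Cpu {
        long long lim;   /* remaining cycle budget for this time slice  */
        Trace *tr_next;  /* successor trace; NULL'ed when a fault fires */
        /* ... register file, MMU state, IO function pointers ...       */
    };

    struct Trace {
        int n_cyc;                           /* cycle cost of this trace */
        Trace *(*Run)(Cpu *cpu, Trace *tr);  /* run it, return successor */
    };

    void run_slice(Cpu *cpu, Trace *tr) {
        while (tr && (cpu->lim > 0)) {
            cpu->lim -= tr->n_cyc;
            tr = tr->Run(cpu, tr);
        }
    }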
Tim Rentsch <tr.17687@z991.linuxsc.com>: Sep 07 06:59AM -0700 > On Tuesday, September 3, 2019 at 4:36:02 PM UTC-4, Scott Lurndal wrote: >> Funny, I get the opposite impression. > True or not, it's not necessary to share these impressions, [...] It was you giving your impression that started it. If you think other people shouldn't do something then you shouldn't do it either. |
BGB <cr88192@hotmail.com>: Sep 07 10:57AM -0500 ( Mostly trying to clarify some things ) On 9/6/2019 10:03 PM, BGB wrote:
> trace" prior to executing the current trace, so if an exception occurs
> it could set this pointer to NULL and thus cause the outer trampoline to
> terminate. ...
> trampoline loop more like:
> while(tr)
> { tr=tr->Run(cpu, tr); }
Missing some relevant details here. Trace Run was typically:

    Set up some initial trace state ("cpu->tr_next" pointer, ...);
    Execute an unrolled loop of indirect function calls;
        Each unrolled loop-length, ..., has its own Run functions;
    Return the "cpu->tr_next" pointer.
        This may have been NULL'ed if an exception was thrown.

As-needed, the trace function may fetch traces for connected control-flow paths, which will go through a big hash table (if the trace has already been decoded for a given PC address) and may trigger the emulator's instruction-decoding logic as-needed (which decodes a non-branching trace of up to 32 instructions). Ideally, one wants to avoid going through decoding if possible, as doing so is fairly expensive. Following the fetch, the Run function may replace itself with a version which does not fetch the traces (using the 'tr'/'Trace' structure to cache the relevant pointers, setting the next pointers from the cached values). Note that conditional direct-branch instructions are handled by conditionally copying the value from one pointer to another and setting a different value for the CPU Context's PC register. If the emulated program triggers an I-cache flush, it may be necessary to tear down the entire trace-graph (and flush any JIT related state, ..., if applicable). In my tests, full teardown was generally the faster option. Typically, memory IO was also via wrapper functions, which then call through function pointers held in the CPU Context structure. These pointers may be changed depending on state (such as whether or not the MMU has been enabled, whether the address space is 32 or 48 bits, whether or not an exception is in the process of being raised, ...). The memory address space is fully emulated in my case, generally:

    (If JIT/Fast case)
        Checks address against a cached set of 2 pages;
        If found, do IO via these and return.
    (If MMU)
        Translate address via TLB;
        May trigger TLB miss exception (ISA uses semi-exposed TLB)
    (If Slow case)
        Model L1 and L2 cache misses
        Accumulates counts of cache hits/misses, adds "penalty cycles", ...
    Lookup the "memory span" associated with this address
        Searches a spans array via a binary search.
    If it is a "basic memory" span:
        Add to 2-page cache, perform IO.
    Else:
        Call span's appropriate Get/Set handlers.
        (Each span is set up with pointers for its matching HW device).

I guess some other emulators go and use OS-level memory-mapping tricks to try to implement the address space, but I didn't want to deal with this sort of hassle (similar reasons for it being single-threaded). It is generally fast enough for what I need it for, so is acceptable. All this is considerably different from how it would work in Verilog, but I wanted it fast enough to be able to keep up with real-time. A lot of the peripheral hardware is written using a lot more if/else branches and "switch()" and similar though. Similarly, what makes for good results in Verilog gives particularly bad performance if done in C (and vice versa). Though, some parts of the emulator have been reused in some of the unit tests for Verilog modules. Not really sure how common or uncommon of a design any of this is.
Note that the underlying design can be applied to emulating other ISA designs, eg, x86 or ARM, but in both cases, their use of condition codes makes achieving decent emulator performance more difficult (CC's are more difficult to check/update efficiently, vs a SuperH style True/False flag or an explicit boolean register).
> Frequently, these trampoline loops can become significant in the
> profiler, so need to be kept fast. ...
> JIT (output is mostly in a "call-threaded" style, with the output
> consisting mostly of function calls; still operating via the trampoline
> loops). ...
On x86, much of the speedup from the JIT seems to come from eliminating all the indirect calls in favor of direct function calls. So, an unrolled loop something like:

    (*ops)->Run(cpu, *ops); ops++;
    (*ops)->Run(cpu, *ops); ops++;
    (*ops)->Run(cpu, *ops); ops++;
    ...

Becomes effectively something like, eg:

    JX2VM_OpXOR_RIR(cpu, op1);
    JX2VM_OpADD_RRR(cpu, op2);
    ...

(just in machine code). Though various common operations are translated directly, much of the overall speedup seems to be due to this transformation. JIT is only really enabled at higher clock-speed targets for now, as it interferes with some of the modeling and analysis features. If I am telling it to operate at 50 or 100 MHz, chances are I care about analysis, and if I tell it to go at 500MHz, it assumes I probably don't care as much (and the plain interpreter can't keep up with real-time emulation at 500MHz). On PC's though, even just the interpreter is faster than what I can expect from the hardware on an FPGA (short of getting a faster and much more expensive FPGA, like a Kintex or Virtex), so it is already basically fast enough for what I am doing. I have noted, however, that this does not seem to apply to ARM; I saw very little gain from call-threaded code over the plain interpreter (and no JIT yet exists for AArch64). I had planned to move parts of the ARM JIT from using GPRs over to NEON (NEON is a closer fit to my ISA than ARM's base integer ISA, but needs a Pi3 or newer), however I had not done so yet. On a RasPi, the emulator isn't currently fast enough (even with JIT) to give acceptable framerates if running Quake, which is kinda lame... Though, it isn't exactly like the FPGA version would do much better here (currently limited to ~ 50MHz). Quake could probably be made faster, but it would likely involve rewriting big chunks of the renderer in ASM or similar (like Quake did originally on x86; rather than using the C version of the renderer). |
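The fast-path memory access described in the previous post can be pictured with a sketch like this; the names and the page size are invented for illustration, and only the read path is shown:

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SHIFT 12
    #define PAGE_MASK  ((1ULL << PAGE_SHIFT) - 1)

    typedef struct Cpu Cpu;

    typedef struct {
        uint64_t page_base;  /* guest address of the cached page */
        uint8_t *host_ptr;   /* host memory backing that page    */
    } PageCacheEnt;

    struct Cpu {
        PageCacheEnt pcache[2];                   /* 2-entry page cache */
        uint32_t (*GetU32_Slow)(Cpu *, uint64_t); /* TLB/span slow path */
    };

    static uint32_t mem_get_u32(Cpu *cpu, uint64_t addr) {
        uint64_t base = addr & ~PAGE_MASK;
        for (int i = 0; i < 2; i++) {
            if (cpu->pcache[i].page_base == base) {
                /* fast path: direct host access, no TLB or span search */
                uint32_t v;
                memcpy(&v, cpu->pcache[i].host_ptr + (addr & PAGE_MASK), 4);
                return v;
            }
        }
        /* slow path: TLB translate, cache-miss penalties, span lookup;
           a "basic memory" hit would install the page into pcache. */
        return cpu->GetU32_Slow(cpu, addr);
    }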
Bonita Montero <Bonita.Montero@gmail.com>: Sep 07 06:17PM +0200
> It's the unwinding support that adds overhead. Regardless of
> whether the exception is ever thrown in any particular codepath.
Deallocation of the resources happens anyway, regardless of whether you program in C or C++. So if you have the same kind of structures that hold the resources, and the deallocation is inlined or not in both cases, the performance of resource-deallocation is actually the same. But the way in C++ is more convenient when you consider something like the return-code evaluation and goto orgies in the Linux kernel. But throwing an exception itself has a high overhead, independently of calling any destructors. I wrote some little test code that measures the overhead of catching the same exception as thrown, and of throwing an exception two inheritance levels deeper than the one caught:

    #include <iostream>
    #include <chrono>
    #include <exception>

    using namespace std;
    using namespace std::chrono;

    struct ExcBase : public exception { };
    struct ExcDerivedA : public ExcBase { };
    struct ExcDerivedB : public ExcDerivedA { };

    typedef long long LL;

    void BaseThrower();
    void DerivedBThrower();
    LL Benchmark( void (*thrower)(), unsigned const rep );

    int main()
    {
        unsigned const REP = 1'000'000;
        LL ns;
        ns = Benchmark( BaseThrower, REP );
        cout << "BaseThrower: " << (double)ns / REP << "ns" << endl;
        ns = Benchmark( DerivedBThrower, REP );
        cout << "DerivedBThrower: " << (double)ns / REP << "ns" << endl;
    }

    LL Benchmark( void (*thrower)(), unsigned const rep )
    {
        time_point<high_resolution_clock> start = high_resolution_clock::now();
        for( unsigned i = rep; i; --i )
            try
            {
                thrower();
            }
            catch( ExcBase & )
            {
            }
        return (LL)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count();
    }

    void BaseThrower()
    {
        throw ExcBase();
    }

    void DerivedBThrower()
    {
        throw ExcDerivedB();
    }

I assume that this code ramps up the clock of my 3.6GHz Ryzen 7 1800X to its boost of 4GHz as there's no load on the other cores. And it turns out that throwing and catching the same exception takes about 9,300 clock cycles, and catching the derived exception about 10,600 clock cycles. So there might be many cases where calling the destructors when throwing an exception isn't the part that consumes most of the time, e.g. when you only deallocate memory to a memory pool, which is usually very fast. |
Bonita Montero <Bonita.Montero@gmail.com>: Sep 07 06:48PM +0200
> So there might be many cases where calling the destructors when throwing
> an exception isn't the part that consumes most of the time, e.g. when
> you only deallocate memory to a memory pool, which is usually very fast.
And I'm astonished that when I compile this code in 32 bit with VC++ 2019, the overhead of catching and throwing an exception is roughly twice as high. In 32-bit mode the stack frames are chained through a linked list whose head is stored in thread-local storage at FS:0. |
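For background: the 32-bit MSVC scheme is frame-based SEH, where each function with handlers links a registration record into that per-thread chain. Roughly (a simplified sketch; the real record layout is compiler/OS internal):

    // Shape of the x86 SEH registration chain; FS:[0] points at the
    // innermost record. Prologs of functions with handlers push a
    // record and epilogs pop it, so some cost is paid on the normal
    // (non-throwing) path too, unlike the table-driven 64-bit scheme.
    struct ExceptionRegistration {
        ExceptionRegistration *prev;  // next record up the stack
        void *handler;                // handler routine for this frame
    };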
Bart <bc@freeuk.com>: Sep 07 06:09PM +0100 On 06/09/2019 12:24, BGB wrote:
> A sane implementation of exceptions can use a lookup table for
> PC/EIP/RIP/whatever, and so only involves significant time overhead when
> actually throwing an exception.
I've never used exceptions (apart from once, see below**). But I understood that in a call-chain like this (so F calls G etc):

    F -> G -> H -> I

If code inside I throws/raises an exception, then it somehow has to work its way up the call stack, releasing any resources attached to local variables, until it finds the place where that exception is caught (ie. checked for in the code). If that happens to be in F, then all the active locals of the intervening G and H functions need to be processed (I guess, call their destructors in C++). Block-scopes within G and H would make that harder. But how does it do that? Does it need extra information pushed at every call to make it possible? If so, then that would be extra overhead that will affect even this call-chain:

    A -> G -> H -> B

where exceptions aren't used at all. (** I've used exceptions once, in a language where I myself implemented an experimental version. But that was dynamic byte-code with a tagged stack structure, so unwinding was trivial. And also much slower than you would expect in C++) |
BGB <cr88192@gmail.com>: Sep 07 03:32PM -0500 On 9/7/2019 12:09 PM, Bart wrote:
> will affect even this call-chain:
> A -> G -> H -> B
> where exceptions aren't used at all.
No extra info is needed on the stack or at runtime, as it is generally provided indirectly via the program binary (such as via the PE/COFF headers or similar). It mostly uses a big lookup table, so an exception will occur at one address, and the dispatcher looks up this address in the table. This could contain either:

    A pointer to the function's prolog or epilog:
        It will parse these to restore the caller's frame.
        The allowed instruction sequences are defined in the ABI.
        The process repeats, this time using the caller's address.
    A pointer to some handler or unwind logic:
        Control is passed to this logic, which does its thing.
        If it catches the exception, we are done here.
        If it is unwind logic, or doesn't catch the exception:
            Control returns back to the dispatcher.
            We restore the caller's frame and continue as above.
    Nothing is found:
        Typically the process terminates or similar.
        At this point, unwinding further is effectively impossible.

The table format is target specific, but generally encodes (in some form):

    The starting address (as an RVA);
    (MIPS, etc) RVA of ending address, exception handler, handler info, etc.
    (x86-64 or IA64) Pointer to an UNWIND_INFO structure (process is more involved).
    (ARM/SH/Thumb/etc) Second DWORD holds several bit-twiddled fields:
        Whether or not an exception-handler exists is given as a flag.
        The entry point for the handler will directly follow the epilog.
        Otherwise, the prolog sequence is parsed to unwind the function.

In my ISA (BJX2), it is encoded similarly to SH and Thumb and similar, but with a little more flexibility regarding the prolog and epilog. For example, the compiler uses a space-saving feature where the common parts of the prolog and epilog sequences are "folded off" and reused across multiple functions; the entry point for a function may simply preserve the Link Register and then direct-call into the reused prolog sequence. In this case, the unwind logic would need to follow the branch to find the "real" prolog sequence (which is terminated by a function return). AFAIK, on ELF based systems, a specialized subset of the DWARF format is used for unwinding from exceptions, rather than the Win64 approach of parsing fixed-format epilog sequences.
> an experimental version. But that was dynamic byte-code with a tagged
> stack structure, so unwinding was trivial. And also much slower than you
> would expect in C++)
OK. I sort of have a lot of this stuff in place in my compiler and ABI spec, but the relevant exception-handling machinery in the runtime hasn't been implemented yet IIRC (not gotten around to it). |
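The dispatcher's first step, mapping the faulting address into the table, can be sketched as a plain binary search over RVA ranges (hypothetical structure and names; real formats differ per target, as described above):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint32_t begin_rva; /* function start, image-relative */
        uint32_t end_rva;   /* one past the function's end    */
        uint32_t info_rva;  /* unwind info / handler data     */
    } UnwindEntry;

    /* Find the entry whose [begin,end) range contains pc_rva; the table
       is sorted by begin_rva with non-overlapping ranges. Returns NULL
       if no entry matches, i.e. unwinding further is impossible. */
    static const UnwindEntry *find_unwind_entry(
        const UnwindEntry *tab, size_t n, uint32_t pc_rva)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (pc_rva < tab[mid].begin_rva)
                hi = mid;
            else if (pc_rva >= tab[mid].end_rva)
                lo = mid + 1;
            else
                return &tab[mid];
        }
        return NULL;
    }

No bookkeeping is pushed on ordinary calls; the entire cost of locating the handler and re-parsing prologs is deferred to the (rare) throw, which is why the happy path stays free.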
Pavel <pauldontspamtolk@removeyourself.dontspam.yahoo>: Sep 06 11:48PM -0400 Tim Rentsch wrote: > template argument types explicitly rather than having them be > deduced. I have a similar disclaimer about not having any > level of expertise in template type deduction. Thanks, I thought about it but wanted to avoid it if I could. I think I will be able to resolve the by-reference-vs-by-value issue specifically by using some pattern on the argument, something like, in your terms, invoke<some_type_calculation<Stuff>::type...>() and then maybe wrapping it in a macro? But this of course sucks -- because of the macro and because it would take care of only one of the transformations that function template parameter substitution may do to the pack. I would rather somehow extract the pack of argument types from the method precisely (this is what I am focusing on now, without much success so far). -Pavel |
Pavel <pauldontspamtolk@removeyourself.dontspam.yahoo>: Sep 07 10:16AM -0400 I just realized the problem was solved (not by me but by a high-class C++ expert who is a colleague of mine, to my luck). The below solution works with gcc 4.8.5. The key is to apply the "Identity" pattern to the variadic parameter type list: typename Identity<A>::type is a dependent qualified name and therefore a non-deduced context, so the pack A... is deduced solely from the member-function-pointer argument, and the call arguments are then simply converted to exactly those types. The prototypes of the generated callMethod member functions are as below:

    void Caller<IA, 1>::callMethod<I, I>(void (IA::*)(I, I),
        Identity<I>::type, Identity<I>::type)
    void Caller<IA, 2>::callMethod<I const&, I const&>(void (IA::*)(I const&, I const&),
        Identity<I const&>::type, Identity<I const&>::type)
    void Caller<IA, 3>::callMethod<I const&, I>(void (IA::*)(I const&, I),
        Identity<I const&>::type, Identity<I>::type)

where Identity<T>::type is actually T, so everything is to my satisfaction:

    #include <iostream>
    #include <functional>

    using namespace std;

    template<class M> struct MethodTraits;

    template<class T> struct Identity { using type = T; };

    struct I;
    ostream& operator<<(ostream&, const I&);

    struct I {
        int i_;
        I(int i): i_(i) {}
        I(const I& i): i_(i.i_) { cout << "I(c&" << i << ")\n"; }
        I(I&& i): i_(i.i_) { cout << "I(&&" << i << ")\n"; }
    };

    ostream& operator<<(ostream& os, const I& i) { return os << i.i_; }

    struct IA {
        virtual void ma(I, I) = 0;
        virtual void mb(const I&, const I&) = 0;
        virtual void mc(const I&, I) = 0;
    };

    template <class T, const int>
    struct Caller {
        Caller(T& to): o(to) {}
        T& o;
        template<class... A>
        void callMethod(void (T::*m)(A...), typename Identity<A>::type... args) {
            (o.*m)(args...);
        }
    };

    struct A: public IA {
        void ma(I, I) override { cout << "A::ma(I, I)\n"; }
        void mb(const I&, const I&) override { cout << "A::mb(const I& , const I&)\n"; }
        void mc(const I&, I) override { cout << "A::mc(const I& , I)\n"; }
    };

    int main(int, char*[]) {
        A a;
        const I i1{1};
        const I i2{2};
        I i3{3};
        I i4{4};
        const I i5{5};
        const I i6{6};
        Caller<IA, 1>(a).callMethod(&IA::ma, i1, i2);
        Caller<IA, 2>(a).callMethod(&IA::mb, i3, i4);
        Caller<IA, 3>(a).callMethod(&IA::mc, i5, i6);
        return 0;
    }
|