soft and program: Digest for comp.lang.c++@googlegroups.com

DataItem cache - 4 Updates
Commenting code considered harmful - 6 Updates
Jerry Stuckle - 9 Updates
Derivation without size increase? - 5 Updates

Marcel Mueller <news.5.maazl@spamgourmet.org>: Feb 05 08:34AM +0100

On 04.02.16 22.19, Lynn McGuire wrote:
> have decided to build a DataItem cache and use one DataItem for many of
> the same objects wherever possible. I will use copy on write mechanism
> to create new DataItems that are being modified.

Whether COW is efficient or not depends on the recombination rate, i.e.
how often a modified item turns out to be identical to another instance.
If this is likely, COW is bad.

> I have structured the DataItem cache using a vector inside a vector
> inside a map:

If you use COW you do not need a cache at all.
You just need to deal with references that are aware of COW. As soon as
you intend to modify the item a copy is made.

> static std::map <int, std::vector <std::vector <DataItem *>>>
> g_DataItem_Cache;

If you need an index then you probably want to do deduplication rather
than COW. I.e. you seek for identical or matching instances before (or
after) you create new ones.

Because you will easily run into serious race conditions here, I strongly
recommend using smart pointers. An intrusive reference count is the
first choice.
Of course, if your application never uses a second thread you are safe.

> In other words, a sparse cube. The outside map is for the identity of
> the major object type, i.e. SYM_AirCoolerGroup. The middle vector is
> for the index of the DataItems in that group, i.e. AIR_DUT.

Note that vector is quite inefficient in dealing with sparse content.
Direct lookup is only efficient for types with a small domain like enum
types where most of the values are really used.

Furthermore, std::map is not efficient with respect to memory usage and
cache locality if the number of nodes becomes large. Typical
implementations use a red-black tree. You should prefer a B-tree if
memory counts. Almost every database does so. Unfortunately this is not
part of the standard. But there are good public implementations
available. E.g. Google published a quite good Java implementation that
can be ported to C++ with reasonable effort (take care of license issues).

> vector is for the various different copies of DataItems that are
> referenced by that Data Group type and index. Each Data Group will need
> to know which version of the Data Item that it is using.

Version?

> I am wondering if I can make this more efficient (less memory usage).
> Any thoughts here? One of my staff is totally for this and another is
> not. I am about 25% complete on making the code changes for testing.

Deduplication can significantly reduce memory usage. For business
databases, factors on the order of 10 are typical. This is basically one
of the concepts behind in-memory databases. I have also achieved factors
up to 100 in some applications.

But there are challenges, too.
First of all you strictly need to distinguish between read and write
access. For efficiency reasons writable instances should never make it
to the main index. You should share only immutable instances. Otherwise
you will need to synchronize until death.
I recommend putting this in the type system. I.e. make DataItem immutable
and use LocalDataItem as a local, writable copy without deduplication.
LocalDataItem should inherit from DataItem so that reading code can
deal with a mix of both, i.e. many immutable instances and a
few mutable ones.
At the very least you should ensure that the shared instances use a compact
memory representation, i.e. no half-full vectors and so on. Even
std::string is not the best choice as it is optimized for mutability.

To give further hints, more knowledge about the structure of the
DataItems and your application is required.
What are their properties?
Why do many instances have the same data? (Otherwise your concept would
not work.)
What kind of data do they contain? Strings? Maybe it is easier to
deduplicate them.
What is the typical access pattern to look up the DataItems? Do they
have something like a primary key?
How do changes apply to the data structures? Transactions? Revisions?
Snapshot isolation?
Do you have a database backend?
What about concurrency? Is it likely that the same items are accessed
concurrently? For writing or only for reading?
What about the object lifetime? Could you have a memory leak? Since you
deal with raw pointers (I strongly advise against doing so) this is not that
unlikely.
And last but not least: is it really the space for the DataItems that
clobbers your memory? Or is it management overhead? Or maybe even
fragmentation of the virtual address space?

Marcel

Lynn McGuire <lmc@winsim.com>: Feb 05 04:26PM -0600

On 2/5/2016 1:34 AM, Marcel Mueller wrote:
> And last but not least: is it really the space for the DataItems that clobbers your memory? Or is it management overhead? Or maybe
> even fragmentation of the virtual address space?

> Marcel

Answers to your questions:
1. part of the DataItem declaration is below
2. many of the DataItem instances are exactly alike since they are snapshots of a user's workspace
3. all kinds of data: strings, integers, doubles, string arrays, double arrays, integer array, strings larger than 300 characters are
compressed using zlib
4. DataItems are stored in a hierarchical object system using a primary key in DataGroup objects
5. not sure what you are asking
6. no database backend
7. no concurrency (yet)
8. when the storage used is 1.5 GB, the memory leakage is 10 MB (observed)
8a. the lifetime of the objects is controlled by the user by opening a file or closing a file
9. I think that it is DataItems but will not know for sure until completion of the current deduplication project

Here is part of the declaration for the DataItem and DesValue classes. There are no member variables in the ObjPtr class.

class DataItem : public ObjPtr
{
private:

int datatype; // Either #Int, #Real, #String or #Enumerated
int vectorFlag; // Flag indicating value contains an Array.
int descriptorName; // name of Corresponding DataDescriptor
// DataGroup * owner; // The DataGroup instance to which this item belongs
std::vector <DataGroup *> owners; // The DataGroup instance(s) to which this item belongs
DesValue * inputValue; // DesValue containing permanent input value
DesValue * scratchValue; // DesValue containing scratch input value
int writeTag; // a Long representing the object for purposes of reading/writing
int unitsClass; // nil or the symbol of the class
std::string unitsArgs; // a coded string of disallowed units
std::map <int, std::vector <int> > dependentsListMap;
DataDescriptor * myDataDescriptor;
BOOL scratchChangedComVector; // if the scratch value was changed in the changeComVector() method

protected:
virtual void discardInput (DataGroup * ownerDG);

public:
// constructor
DataItem ();
DataItem (const DataItem & rhs);
DataItem & operator = (const DataItem & rhs);

// destructor
virtual ~DataItem ();

// comparison of equality
virtual bool operator == (DataItem const & right) const;
virtual bool operator != (DataItem const & right) const;

virtual int isDataItem () { return true; };

class DesValue : public ObjPtr
{
public:

int datatype; // Either #Int, #Real, or #String.
int vectorFlag; // Flag indicating value contains an Array.
int optionListName; // name of the option list item
int * intValue; // Either nil, an Int, a Real, a String, or an Array thereof.
double * doubleValue;
std::string * stringValue;
std::vector <int> * intArrayValue;
std::vector <double> * doubleArrayValue;
std::vector <std::string> * stringArrayValue;
unsigned char * compressedData;
unsigned long compressedDataLength;
std::vector <unsigned long> uncompressedStringLengths;
int isTouched; // Flag indicating if value, stringValue, or units have been modified since this DesValue was created. Set to true by setValue, setString, setUnits, and convertUnits.
int isSetFlag; // Flag indicating whether the contents of the DesValue is defined or undefined. If isSet is false, getValue returns nil despite the contents of value, while getString and getUnits return the empty string despite the contents of stringValue and units.
int unitsValue; // current string value index in $UnitsList (single or top)
int unitsValue2; // current string value index in $UnitsList (bottom)
std::string errorMessage; // message about last conversion of string to value
std::string unitsArgs; // a coded string of disallowed units

protected:

virtual void deleteValues ();

public:

// constructor
DesValue ();
DesValue (const DesValue & rhs);
DesValue & operator = (const DesValue & rhs);

// destructor
virtual ~DesValue ();

// comparison of equality
virtual bool operator == (DesValue const & right) const;
virtual bool operator != (DesValue const & right) const;

Lynn

Ian Collins <ian-news@hotmail.com>: Feb 06 12:01PM +1300

Lynn McGuire wrote:
> 4. DataItems are stored in a hierarchical object system using a primary key in DataGroup objects
> 5. not sure what you are asking
> 6. no database backend

If you have a large number of objects that are suitable for
deduplication, you might be better off using a proper in-memory
database. Then you wouldn't have to worry about such things yourself.
Being a JSON/BSON fan, I tend towards MongoDB for this type of data.
If the data is relational, I would look to MySQL in-memory tables.

> int unitsClass; // nil or the symbol of the class
> std::string unitsArgs; // a coded string of disallowed units
> std::map <int, std::vector <int> > dependentsListMap;

Could these (and the vector above) be fixed size? If not, you might
benefit in both space and performance if you use a custom allocator for
them.

Maps within vectors within maps will probably lead to a very fragmented
heap, wasting both memory and possible cache hits.

> DataDescriptor * myDataDescriptor;
> BOOL scratchChangedComVector; // if the scratch value was changed in the changeComVector() method

BOOL?

> virtual void discardInput (DataGroup * ownerDG);

> public:
> // constructor

Time for Flibble to start a "Gratuitous considered harmful" thread :)

<snip>

> std::vector <int> * intArrayValue;
> std::vector <double> * doubleArrayValue;
> std::vector <std::string> * stringArrayValue;

Are these local to the class? If so, the allocator comment above might
be relevant.

--
Ian Collins

Ian Collins <ian-news@hotmail.com>: Feb 06 12:21PM +1300

Ian Collins wrote:

>> public:
>> // constructor

> Time for Flibble to start a "Gratuitous considered harmful" thread :)

D'oh! "Gratuitous comments considered harmful"

--
Ian Collins

Commenting code considered harmful

JiiPee <no@notvalid.com>: Feb 05 10:57AM

On 04/02/2016 21:40, Juha Nieminen wrote:
> In that case it's good to explain the algorithm.

> Without comments it's hard to remember months later what does what,
> so they are really useful.

Yes, I have had this problem before... it's difficult to remember the
formulas and why they were there.

scott@slp53.sl.home (Scott Lurndal): Feb 05 02:28PM

>Andy
>--
>p.s. Didn't anyone like my German sausages?

One of the wurst jokes ever.

Prroffessorr Fir Kenobi <profesor.fir@gmail.com>: Feb 05 01:46PM -0800

On Wednesday, 3 February 2016 22:47:08 UTC+1, Vir Campestris wrote:
> why it doesn't do something else that seems obvious.

> The code tells you what it does. Nothing more.

> Andy

That's true: code tells you what, comments can tell you why (which is probably the harder part of experience).

I personally don't write such "why" comments, though (I rely on my memory;
it seems I mostly don't forget why
I coded something a given way, even
15 years later).

Prroffessorr Fir Kenobi <profesor.fir@gmail.com>: Feb 05 02:02PM -0800

On Friday, 5 February 2016 22:46:50 UTC+1, Prroffessorr Fir Kenobi wrote:
> it seems im mostly not forgeting why
> i coded something given way even
> 15 years later)

Alternatively, comments could also describe a bit of the "what", by giving an overview of the architecture of some parts or of the whole
program. That could be helpful even for the original coder. This is probably a fault of the language, as C probably has no good way to express it locally (or at least clearly and quickly) in the code.

Prroffessorr Fir Kenobi <profesor.fir@gmail.com>: Feb 05 02:18PM -0800

On Thursday, 4 February 2016 22:40:34 UTC+1, Juha Nieminen wrote:

> Except in cases of very complex algorithms, where it just can't be
> clear from the implementation itself how and why the algorithm works.
> In that case it's good to explain the algorithm.

When it's an algorithm, you give its name.

> Without comments it's hard to remember months later what does what,
> so they are really useful.

For me it is the opposite: when looking at my code much later I see more clearly what I coded than at the moment I originally coded it.
(Recently I looked at my oldest codebase in C and found that I had written it a bit dirty but very clearly. This makes me a bit depressed,
as today I know more (I spent a very long time on that) but no longer have such a clear style.)
Vir Campestris <vir.campestris@invalid.invalid>: Feb 05 10:46PM

On 05/02/2016 07:21, Alf P. Steinbach wrote:
> Assuming it's C code there should be no cast, and of course malloc is OK
> in C.

As usual Alf gets it right ;)

It's in a Linux kernel driver.

Andy

Jerry Stuckle

Jerry Stuckle <jstucklex@attglobal.net>: Feb 05 11:35AM -0500

On 2/5/2016 10:32 AM, Mr Flibble wrote:
> Who is this Jerry Stuckle guy? He is giving me a headache.

> /Flibble

Someone who's been programming longer than you've been alive, and has
consulted on three continents and for many Fortune 500 companies.

--
==================
Remove the "x" from my email address
Jerry Stuckle
jstucklex@attglobal.net
==================

Mr Flibble <flibbleREMOVETHISBIT@i42.co.uk>: Feb 05 06:22PM

On 05/02/2016 16:35, Jerry Stuckle wrote:

>> /Flibble

> Someone who's been programming longer than you've been alive, and has
> consulted on three continents and for many Fortune 500 companies.

But you have no idea how old I am or how long I have been programming so
how can you make such an assertion?

/Flibble

Jerry Stuckle <jstucklex@attglobal.net>: Feb 05 02:06PM -0500

On 2/5/2016 1:22 PM, Mr Flibble wrote:

> But you have no idea how old I am or how long I have been programming so
> how can you make such an assertion?

> /Flibble

Let's just say an educated guess based on decades of experience.

--
==================
Remove the "x" from my email address
Jerry Stuckle
jstucklex@attglobal.net
==================

Mr Flibble <flibbleREMOVETHISBIT@i42.co.uk>: Feb 05 07:12PM

On 05/02/2016 19:06, Jerry Stuckle wrote:
>> how can you make such an assertion?

>> /Flibble

> Let's just say an educated guess based on decades of experience.

The problem is that the contents of most of your posts to this ng also
appear to be educated guesses.

/Flibble

Jerry Stuckle <jstucklex@attglobal.net>: Feb 05 02:34PM -0500

On 2/5/2016 2:12 PM, Mr Flibble wrote:

> The problem is that the contents of most of your posts to this ng also
> appear to be educated guesses.

> /Flibble

Wrong again - and even more proof that I was right in my educated guess.

--
==================
Remove the "x" from my email address
Jerry Stuckle
jstucklex@attglobal.net
==================

Mr Flibble <flibbleREMOVETHISBIT@i42.co.uk>: Feb 05 07:37PM

On 05/02/2016 19:34, Jerry Stuckle wrote:
>> appear to be educated guesses.

>> /Flibble

> Wrong again - and even more proof that I was right in my educated guess.

Get a proper hobby mate.

/Flibble

Ian Collins <ian-news@hotmail.com>: Feb 06 09:18AM +1300

Mr Flibble wrote:

> The problem is that the contents of most of your posts to this ng also
> appear to be educated guesses.

Educated?

--
Ian Collins

Gareth Owen <gwowen@gmail.com>: Feb 05 08:29PM

> Who is this Jerry Stuckle guy? He is giving me a headache.

https://en.wikipedia.org/wiki/Ugly_American_%28pejorative%29

Jerry Stuckle <jstucklex@attglobal.net>: Feb 05 03:52PM -0500

On 2/5/2016 2:37 PM, Mr Flibble wrote:

>> Wrong again - and even more proof that I was right in my educated guess.

> Get a proper hobby mate.

> /Flibble

You just admitted I'm right. Thanks for the confirmation.

--
==================
Remove the "x" from my email address
Jerry Stuckle
jstucklex@attglobal.net
==================

Derivation without size increase?

Paavo Helde <myfirstname@osa.pri.ee>: Feb 05 06:08PM +0200

On 5.02.2016 16:43, Marcel Mueller wrote:

>> The size of B is not 24 or 25, but 32, because alignment.

> You are on a 64 bit platform?
> I would have expected 28 otherwise.

Doesn't matter, the doubles in A require 8-byte alignment, which 28 is not.

> return 0;
> }

> => 24

#pragma pack(1) does indeed work (with MSVC), but then sizeof(A)==23.
This means that when placed in an array, larger members like doubles
inside A would be misaligned and cause performance hits.

If I put #pragma pack(1) only before B, then sizeof(A)==24 and
sizeof(B)==25, which is not better.

I could probably derive both A and B from a common sizeof 23 base and
add 'char unused;' or something like that to A. Then sizeof(A)==24, but
I'm afraid the compiler would still think the required alignment is 1
and would not align the objects in memory properly.

Interestingly enough, the other platform which we need to support (gcc
on x64 Linux) seems to do the right thing (sizeof(A)==24,
sizeof(B)==24), with no pragmas or additional flags needed.

Cheers,
Paavo

Paavo Helde <myfirstname@osa.pri.ee>: Feb 05 06:39PM +0200

On 5.02.2016 17:32, Jerry Stuckle wrote:
> Exactly how many are "zillions"? Even if you have 1,000,000 of them at
> the same time (very doubtful - trying to process that many items would
> take a lot of time), you're only talking about 8 MB of additional memory.

Yes, it might well be that I am wasting my time on a premature
optimization, but I was just curious if using C++ abstractions would
indeed somehow work against the zero overhead principle.

But as I have now seen that gcc does the right thing without any tricks,
I am starting to think this is just a quality of implementation issue.

Cheers
Paavo

Marcel Mueller <news.5.maazl@spamgourmet.org>: Feb 05 06:00PM +0100

On 05.02.16 17.08, Paavo Helde wrote:

>> You are on a 64 bit platform?
>> I would have expected 28 otherwise.

> Doesn't matter, the doubles in A require 8-byte alignment, which 28 is not.

I think you are a bit too fast.
A 32 bit platform never needs 64 bit alignment.

> #pragma pack(1) does indeed work (with MSVC),

I tested with gcc.

> but then sizeof(A)==23.

That was the intention of the OP.

> This means that when placed in an array, larger members like doubles
> inside A would be misaligned and cause performance hits.

True. One has to take care of that.

> I could probably derive both A and B from a common sizeof 23 base and
> add 'char unused;' or something like that to A.

No #pragma pack should do the job as well. Hmm, seems only to work if
new members are added.

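class A
{ union
  { A_base a;
    // sizeof(A) is not available inside A's own definition,
    // so the padding is sized from A_base instead
    int dummy[(sizeof(A_base)-1)/sizeof(int)+1];
  };
};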

... not that pretty.

> Interestingly enough, the other platform which we need to support (gcc
> on x64 Linux) seems to do the right thing (sizeof(A)==24,
> sizeof(B)==24), with no pragmas or additional flags needed.

So probably the optimizer has been enhanced.

This is always the problem with optimizations. At some point they work
against you.

[...]
It does not work for me without pragma. B grows one machine size word.
Neither gcc 4.7.2 @ Debian amd64 nor gcc 4.8.4 @ Mint 17 x86 nor gcc
4.9.2 @ Raspbian ARMv6 nor gcc 3.3.5 @ eCS x86 nor IBM icc 3.08 @ eCS x86.

But the above hack works with all of them.

Marcel

Jerry Stuckle <jstucklex@attglobal.net>: Feb 05 12:46PM -0500

On 2/5/2016 11:39 AM, Paavo Helde wrote:
> I am starting to think this is just a quality of implementation issue.

> Cheers
> Paavo

Like any performance issue, I concern myself with it when it becomes a
problem, not before. For instance, in your case, I would be very
concerned if this were running on an embedded processor with 512K of
RAM. However, if it's running on a machine with 5GB (or even 50MB) of
RAM, I'd be less concerned until it became a problem.

It may very well be this can cause a problem. If it is, you'll need to
fix it. But remember also that the compilers align data for performance
reasons, also. It's faster to load/store an integer when it's on a 4
byte boundary in most 32 bit processors, for instance. Compressing your
structure may end up slowing the application down. Or, by not using as
much memory, you may cause less paging and application performance may
improve. You just don't know at this point.

Almost every performance issue is a tradeoff somewhere. It's a matter
of finding the right tradeoff.

--
==================
Remove the "x" from my email address
Jerry Stuckle
jstucklex@attglobal.net
==================

Paavo Helde <myfirstname@osa.pri.ee>: Feb 05 08:16PM +0200

On 5.02.2016 19:00, Marcel Mueller wrote:
>> not.

> Think, you are a bit too fast.
> A 32 bit platform never needs 64 bit alignment.

Yes, you are right. It seems I have started to forget about the 32-bit
world after only a few years.

> I tested with gcc.

>> but then sizeof(A)==23.

> That was the intention of the OP.

No, not at all.

> It does not work for me without pragma. B grows one machine size word.
> Neither gcc 4.7.2 @ Debian amd64 nor gcc 4.8.4 @ Mint 17 x86 nor gcc
> 4.9.2 @ Raspbian ARMv6 nor gcc 3.3.5 @ eCS x86 nor IBM icc 3.08 @ eCS x86.

Just in case, I tried with an example which resembles my real code more
closely:

~>cat main.cpp
#include <iostream>
#include <stdint.h>
#include <string>

class X;

class A {
    union xunion {
        int64_t k;
        double f;
        const void *p;
        struct Extra {
            std::string* s;
            X* x;
        } x;
        char buf[16];
    } x;
    struct {
        char buf_continued[6];
    } y;
    char tag;
};

class B: A {
    char anotherTag;
public:
    // ...
};

int main() {
    std::cout << "sizeof(A)==" << sizeof(A) << "\nsizeof(B)==" << sizeof(B) << "\n";
    return 0;
}

~>g++ -Wall main.cpp

~>./a.out
sizeof(A)==24
sizeof(B)==24

~>g++ -v
...
gcc version 4.6.2 (SUSE Linux)

Maybe they have abandoned this optimization in later versions...

MSVC produces 24 and 32 for this code, or 23 and 24 with #pragma pack(1).

Friday, February 5, 2016

Digest for comp.lang.c++@googlegroups.com - 24 updates in 4 topics
