Tuesday, February 9, 2016

Digest for comp.lang.c++@googlegroups.com - 25 updates in 2 topics

Paavo Helde <myfirstname@osa.pri.ee>: Feb 04 11:29PM +0200

On 4.02.2016 23:19, Lynn McGuire wrote:
> for our software. In the latest release, there can be tens of millions
> of these objects which is using up all our memory in Windows (we run out
> at 1.9 GB of memory usage).
 
AFAIK there is a simple way to increase that to 3 GB.
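
(A minimal sketch of that recipe, assuming an MSVC toolchain; "your_app.exe" is just a placeholder.
The executable has to be linked /LARGEADDRESSAWARE, and on 32-bit Windows the machine must also be
told to grant the larger user address space:

editbin /LARGEADDRESSAWARE your_app.exe
bcdedit /set increaseuserva 3072

On 64-bit Windows a large-address-aware 32-bit process gets close to 4 GB without the boot setting.)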
 
 
> I am wondering if I can make this more efficient (less memory usage).
> Any thoughts here? One of my staff is totally for this and another is
> not.
 
Totally for what? What is the alternative?
Lynn McGuire <lmc@winsim.com>: Feb 04 03:19PM -0600

I have an object called DataItem that is the basic variant storage unit for our software. In the latest release there can be tens
of millions of these objects, which uses up all our memory on Windows (we run out at 1.9 GB of memory usage). We cannot change to
x64 at the moment, so I have decided to build a DataItem cache and use one DataItem for many identical objects wherever possible. I
will use a copy-on-write mechanism to create new DataItems when they are modified.
 
I have structured the DataItem cache using a vector inside a vector inside a map:
 
static std::map <int, std::vector <std::vector <DataItem *>>> g_DataItem_Cache;
 
In other words, a sparse cube. The outer map is keyed by the identity of the major object type, e.g. SYM_AirCoolerGroup. The middle
vector is indexed by the DataItem's position in that group, e.g. AIR_DUT. The inner vector holds the various different copies of
DataItems that are referenced by that Data Group type and index. Each Data Group will need to know which version of the Data Item
it is using.
 
I am wondering if I can make this more efficient (less memory usage). Any thoughts here? One of my staff is totally for this and
another is not. I am about 25% complete on making the code changes for testing.
 
Thanks,
Lynn
Lynn McGuire <lmc@winsim.com>: Feb 04 04:03PM -0600

On 2/4/2016 3:29 PM, Paavo Helde wrote:
>> Any thoughts here? One of my staff is totally for this and another is
>> not.
 
> Totally for what? What is the alternative?
 
We will hit the 3 GB barrier with the current storage methodology as well.
 
My other programmer wants to move to x64. At least 1/4 of our customers are still running x86 Windows, not gonna happen yet.
 
Thanks,
Lynn
Marcel Mueller <news.5.maazl@spamgourmet.org>: Feb 05 08:34AM +0100

On 04.02.16 22.19, Lynn McGuire wrote:
> have decided to build a DataItem cache and use one DataItem for many of
> the same objects wherever possible. I will use copy on write mechanism
> to create new DataItems that are being modified.
 
Whether COW is efficient or not depends on the recombination rate, i.e.
how often a modified item ends up identical to another existing instance.
If that is likely, COW alone is a poor fit.
 
> I have structured the DataItem cache using a vector inside a vector
> inside a map:
 
If you use COW you do not need a cache at all.
You just need to deal with references that are aware of COW. As soon as
you intend to modify the item a copy is made.
 
> static std::map <int, std::vector <std::vector <DataItem *>>>
> g_DataItem_Cache;
 
If you need an index then you probably want deduplication rather than
COW, i.e. you look for identical or matching instances before (or
after) you create new ones.
 
Because you can easily run into serious race conditions here, I strongly
recommend using smart pointers. An intrusive reference count is the
first choice.
Of course, if your application never uses a second thread you are safe.
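
A minimal sketch of what such an intrusive-refcount, COW-aware handle
could look like (illustrative only; DataItemHandle and the placeholder
payload are not from your code):

#include <atomic>
#include <utility>

class DataItem {                          // shared, conceptually immutable payload
public:
    DataItem() = default;
    DataItem(const DataItem& other) : value(other.value) {}  // a fresh copy starts unshared
    void addRef() { ++refCount; }
    void release() { if (--refCount == 0) delete this; }
    long useCount() const { return refCount; }
    int value = 0;                        // placeholder for the real members
private:
    std::atomic<long> refCount{0};        // intrusive count, safe if threads appear later
};

class DataItemHandle {                    // COW reference held by each owner
public:
    explicit DataItemHandle(DataItem* p = nullptr) : ptr(p) { if (ptr) ptr->addRef(); }
    DataItemHandle(const DataItemHandle& o) : ptr(o.ptr) { if (ptr) ptr->addRef(); }
    DataItemHandle& operator=(DataItemHandle o) { std::swap(ptr, o.ptr); return *this; }
    ~DataItemHandle() { if (ptr) ptr->release(); }

    const DataItem* read() const { return ptr; }  // cheap shared read access
    DataItem* write() {                           // clone only when actually shared
        if (ptr && ptr->useCount() > 1) {
            DataItem* copy = new DataItem(*ptr);
            copy->addRef();
            ptr->release();
            ptr = copy;
        }
        return ptr;
    }
private:
    DataItem* ptr;
};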
 
> In other words, a sparse cube. The outside map is for the identity of
> the major object type, i.e. SYM_AirCoolerGroup. The middle vector is
> for the index of the DataItems in that group, i.e. AIR_DUT.
 
Note that vector is quite inefficient for sparse content. Direct
indexing only pays off for types with a small domain, like enum types
where most of the values are actually used.
 
Furthermore, std::map is neither memory-efficient nor cache-friendly
once the number of nodes becomes large. Typical implementations use a
red-black tree. You should prefer a B-tree if memory matters; almost
every database does so. Unfortunately B-trees are not part of the
standard, but there are good public implementations available. E.g.
Google published a quite good Java implementation that can be ported to
C++ with reasonable effort (take care of license issues).
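
If a full B-tree is more than you want to maintain, a sorted
std::vector searched with std::lower_bound (a "flat map") already
removes the per-node allocation and pointer chasing. A minimal sketch,
with placeholder key/value types:

#include <algorithm>
#include <utility>
#include <vector>

using Key = int;
struct Value { /* placeholder payload */ };

std::vector<std::pair<Key, Value>> table;   // kept sorted by key

Value* find(Key k) {
    auto it = std::lower_bound(table.begin(), table.end(), k,
        [](const std::pair<Key, Value>& e, Key key) { return e.first < key; });
    return (it != table.end() && it->first == k) ? &it->second : nullptr;
}

The trade-off is O(n) insertion in the middle, so this fits read-mostly
data that is built once and then looked up many times.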
 
> vector is for the various different copies of DataItems that are
> referenced by that Data Group type and index. Each Data Group will need
> to know which version of the Data Item that it is using.
 
Version?
 
 
> I am wondering if I can make this more efficient (less memory usage).
> Any thoughts here? One of my staff is totally for this and another is
> not. I am about 25% complete on making the code changes for testing.
 
Deduplication can significantly reduce memory usage. For typical
business databases the factor is on the order of 10; this is basically
one of the concepts behind in-memory databases. I have achieved factors
of up to 100 in some applications.
 
But there are challenges, too.
First of all, you strictly need to distinguish between read and write
access. For efficiency reasons, writable instances should never make it
into the main index: share only immutable instances, otherwise you will
need to synchronize until death.
I recommend putting this into the type system, i.e. make DataItem
immutable and use LocalDataItem as a local, writable copy that is not
deduplicated. LocalDataItem should inherit from DataItem so that reading
code can deal with a mix of both, i.e. many immutable instances and a
few mutable ones.
At the very least you should ensure that the shared instances use a
compact memory representation, i.e. no half-full vectors and so on. Even
std::string is not the best choice, as it is optimized for mutability.
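
A sketch of that split, with a value-keyed pool so that equal immutable
instances are shared (std::shared_ptr keeps the example short; an
intrusive count would be leaner, and the payload and ordering here are
placeholders):

#include <memory>
#include <set>

struct DataItem {                         // immutable once published
    int key = 0; double value = 0;        // placeholder payload
    bool operator<(const DataItem& rhs) const {
        return key != rhs.key ? key < rhs.key : value < rhs.value;
    }
};

struct LocalDataItem : DataItem {};       // mutable working copy, never pooled

class DataItemPool {
public:
    // Return the shared instance equal to 'candidate', inserting it if new.
    std::shared_ptr<const DataItem> intern(const DataItem& candidate) {
        auto tentative = std::make_shared<const DataItem>(candidate);
        auto result = pool.insert(tentative);    // no-op if an equal one exists
        return *result.first;                    // tentative copy is dropped on a duplicate
    }
private:
    struct ByValue {                      // order the pointers by the pointed-to value
        bool operator()(const std::shared_ptr<const DataItem>& a,
                        const std::shared_ptr<const DataItem>& b) const { return *a < *b; }
    };
    std::set<std::shared_ptr<const DataItem>, ByValue> pool;
};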
 
 
To give further hints, more knowledge about the structure of the
DataItems and your application is required.
What are their properties?
Why do many instances have the same data? (Otherwise your concept would
not work.)
What kind of data do they contain? Strings? Maybe it is easier to
deduplicate them.
What is the typical access pattern to look up the DataItems? Do they
have something like a primary key?
How do changes apply to the data structures? Transactions? Revisions?
Snapshot isolation?
Do you have a database backend?
What about concurrency? Is it likely that the same items are accessed
concurrently? For writing or only for reading?
What about object lifetime? Might you have a memory leak? Since you
deal with raw pointers (which I strongly advise against), that is not
unlikely.
And last but not least: is it really the space for the DataItems that
clobbers your memory? Or is it management overhead? Or maybe even
fragmentation of the virtual address space?
 
 
Marcel
Ian Collins <ian-news@hotmail.com>: Feb 06 12:01PM +1300

Lynn McGuire wrote:
> 4. DataItems are stored in a hierarchical object system using a primary key in DataGroup objects
> 5. not sure what you are asking
> 6. no database backend
 
If you have a large number of objects that are suitable for
deduplication, you might be better off using a proper in-memory
database; then you wouldn't have to worry about such things yourself.
Being a JSON/BSON fan, I tend towards MongoDB for this type of data.
If the data is relational, I would look at MySQL in-memory tables.
 
> int unitsClass; // nil or the symbol of the class
> std::string unitsArgs; // a coded string of disallowed units
> std::map <int, std::vector <int> > dependentsListMap;
 
Could these (and the vector above) be fixed size? If not, you might
benefit in both space and performance if you use a custom allocator for
them.
 
Maps within vectors within maps will probably lead to a very fragmented
heap, wasting memory and costing you cache hits.
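
A sketch of the custom-allocator idea using Boost's pool allocators
(the element types are placeholders; the same idea works with any
conforming allocator):

#include <functional>
#include <map>
#include <vector>
#include <boost/pool/pool_alloc.hpp>

// Pooled allocations: map nodes and vector buffers stop competing
// with the rest of the heap, which reduces fragmentation.
using IntVector = std::vector<int, boost::pool_allocator<int>>;

using DependentsMap = std::map<int, IntVector, std::less<int>,
    boost::fast_pool_allocator<std::pair<const int, IntVector>>>;

Whether this actually wins depends on the allocation pattern, so it is
worth measuring with real data before committing to it.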
 
> DataDescriptor * myDataDescriptor;
> BOOL scratchChangedComVector; // if the scratch value was changed in the changeComVector() method
 
BOOL?
 
> virtual void discardInput (DataGroup * ownerDG);
 
> public:
> // constructor
 
Time for Flibble to start a "Gratuitous considered harmful" thread :)
 
<snip>
 
> std::vector <int> * intArrayValue;
> std::vector <double> * doubleArrayValue;
> std::vector <std::string> * stringArrayValue;
 
Are these local to the class? If so, the allocator comment above might
be relevant.
 
--
Ian Collins
Lynn McGuire <lmc@winsim.com>: Feb 05 04:26PM -0600

On 2/5/2016 1:34 AM, Marcel Mueller wrote:
> And last but not least: is it really the space for the DataItems that clobbers your memory? Or is it management overhead? Or maybe
> even fragmentation of the virtual address space?
 
> Marcel
 
Answers to your questions:
1. part of the DataItem declaration is below
2. many of the DataItem instances are exactly alike since they are snapshots of a user's workspace
3. all kinds of data: strings, integers, doubles, string arrays, double arrays, integer arrays; strings larger than 300 characters are
compressed using zlib
4. DataItems are stored in a hierarchical object system using a primary key in DataGroup objects
5. not sure what you are asking
6. no database backend
7. no concurrency (yet)
8. when the storage used is 1.5 GB, the memory leakage is 10 MB (observed)
8a. the lifetime of the objects is controlled by the user by opening a file or closing a file
9. I think that it is DataItems but will not know for sure until completion of the current deduplication project
 
Here is part of the declaration for the DataItem and DesValue classes. There are no member variables in the ObjPtr class.
 
class DataItem : public ObjPtr
{
private:
 
int datatype; // Either #Int, #Real, #String or #Enumerated
int vectorFlag; // Flag indicating value contains an Array.
int descriptorName; // name of Corresponding DataDescriptor
// DataGroup * owner; // The DataGroup instance to which this item belongs
std::vector <DataGroup *> owners; // The DataGroup instance(s) to which this item belongs
DesValue * inputValue; // DesValue containing permanent input value
DesValue * scratchValue; // DesValue containing scratch input value
int writeTag; // a Long representing the object for purposes of reading/writing
int unitsClass; // nil or the symbol of the class
std::string unitsArgs; // a coded string of disallowed units
std::map <int, std::vector <int> > dependentsListMap;
DataDescriptor * myDataDescriptor;
BOOL scratchChangedComVector; // if the scratch value was changed in the changeComVector() method
 
protected:
virtual void discardInput (DataGroup * ownerDG);
 
public:
// constructor
DataItem ();
DataItem (const DataItem & rhs);
DataItem & operator = (const DataItem & rhs);
 
// destructor
virtual ~DataItem ();
 
// comparison of equality
virtual bool operator == (DataItem const & right) const;
virtual bool operator != (DataItem const & right) const;
 
virtual int isDataItem () { return true; };

// ... (rest of the DataItem declaration elided)
};
 
 
class DesValue : public ObjPtr
{
public:
 
int datatype; // Either #Int, #Real, or #String.
int vectorFlag; // Flag indicating value contains an Array.
int optionListName; // name of the option list item
int * intValue; // Either nil, an Int, a Real, a String, or an Array thereof.
double * doubleValue;
std::string * stringValue;
std::vector <int> * intArrayValue;
std::vector <double> * doubleArrayValue;
std::vector <std::string> * stringArrayValue;
unsigned char * compressedData;
unsigned long compressedDataLength;
std::vector <unsigned long> uncompressedStringLengths;
int isTouched; // Flag indicating if value, stringValue, or units have been modified since this DesValue was created.
// Set to true by setValue, setString, setUnits, and convertUnits.
int isSetFlag; // Flag indicating whether the contents of the DesValue is defined or undefined. If isSet is false, getValue
// returns nil despite the contents of value, while getString and getUnits return the empty string despite the contents of
// stringValue and units.
int unitsValue; // current string value index in $UnitsList (single or top)
int unitsValue2; // current string value index in $UnitsList (bottom)
std::string errorMessage; // message about last conversion of string to value
std::string unitsArgs; // a coded string of disallowed units
 
protected:
 
virtual void deleteValues ();
 
public:
 
// constructor
DesValue ();
DesValue (const DesValue & rhs);
DesValue & operator = (const DesValue & rhs);
 
// destructor
virtual ~DesValue ();
 
// comparison of equality
virtual bool operator == (DesValue const & right) const;
virtual bool operator != (DesValue const & right) const;

// ... (rest of the DesValue declaration elided)
};
 
 
Lynn
Mr Flibble <flibbleREMOVETHISBIT@i42.co.uk>: Feb 03 05:35PM

Verbose commenting of code (especially of implementation details) can be
dangerous: quite often the code evolves, the old comments are not
updated in tandem with the code changes, and they end up out of date, no
longer reflecting what the code actually does. Such erroneous comments
can have disastrous consequences if incorrect assumptions are made based
on them.
 
The best form of documentation is the code itself!
 
/Flibble
Christian Gollwitzer <auriocus@gmx.de>: Feb 03 07:08PM +0100

Am 03.02.16 um 18:44 schrieb Gareth Owen:
>> be dangerous ... The best form of documentation is the code itself!
 
> Bravo! You have excelled yourself.
 
> Expect a million bites, and not just on sausages.
 
Where it hurts most :P
Gareth Owen <gwowen@gmail.com>: Feb 03 05:44PM


> Verbose commenting of code (especially of implementation details) can
> be dangerous ... The best form of documentation is the code itself!
 
Bravo! You have excelled yourself.
 
Expect a million bites, and not just on sausages.
Ian Collins <ian-news@hotmail.com>: Feb 04 08:01AM +1300

Mr Flibble wrote:
> have disastrous consequences if incorrect assumptions are made based on
> them.
 
> The best form of documentation is the code itself!
 
Aren't you going to offer up your critique of Uncle Bob's TDD sausages?
 
--
Ian Collins
Vir Campestris <vir.campestris@invalid.invalid>: Feb 03 09:46PM

On 03/02/2016 17:35, Mr Flibble wrote:
> them.
 
> The best form of documentation is the code itself!
 
> /Flibble
 
I'll bite. It can't make things wurst.
 
A good comment tells you what the code is supposed to do, and tells you
why it doesn't do something else that seems obvious.
 
The code tells you what it does. Nothing more.
 
Andy
Mr Flibble <flibbleREMOVETHISBIT@i42.co.uk>: Feb 03 11:04PM

On 03/02/2016 21:46, Vir Campestris wrote:
 
> A good comment tells you what the code is supposed to do, and tells you
> why it doesn't do something else that seems obvious.
 
> The code tells you what it does. Nothing more.
 
Well written and designed code with sensible, descriptive variable,
function and class names is virtually self-documenting.
 
/Flibble
Ian Collins <ian-news@hotmail.com>: Feb 04 12:09PM +1300

Mr Flibble wrote:
 
> Well written and designed code with sensible, descriptive variable,
> function and class names is virtually self-documenting.
 
Especially if it was written with TDD where you have a lovely set of
tests that tell you exactly what the code does :)
 
Aren't you going to offer up your critique of Uncle Bob's TDD sausages?
 
--
Ian Collins
Mr Flibble <flibbleREMOVETHISBIT@i42.co.uk>: Feb 03 11:19PM

On 03/02/2016 23:09, Ian Collins wrote:
 
> Especially if it was written with TDD where you have a lovely set of
> tests that tell you exactly what the code does :)
 
> Aren't you going to offer up your critique of Uncle Bob's TDD sausages?
 
Perhaps your problem is that you are confusing TDD with unit testing?
Unit tests are great, TDD isn't.
 
/Flibble
Ian Collins <ian-news@hotmail.com>: Feb 04 12:24PM +1300

Mr Flibble wrote:
 
>> Aren't you going to offer up your critique of Uncle Bob's TDD sausages?
 
> Perhaps your problem is that you are confusing TDD with unit testing?
> Unit tests are great, TDD isn't.
 
Nope.
 
Aren't you going to offer up your critique of Uncle Bob's TDD sausages?
 
--
Ian Collins
Mr Flibble <flibbleREMOVETHISBIT@i42.co.uk>: Feb 04 10:35AM

On 04/02/2016 07:11, Öö Tiib wrote:
> break some unit test naively and waste her time. If there are also no
> unit tests that demonstrate the reason why then that typically results
> with regression.
 
Nonsense. The why is not important, the what is. If you were implementing
std::copy would you comment why? Of course not; the what is what matters,
and the code itself tells you what.
 
/Flibble
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: Feb 04 12:17PM +0100

On 2/3/2016 6:35 PM, Mr Flibble wrote:
> have disastrous consequences if incorrect assumptions are made based on
> them.
 
> The best form of documentation is the code itself!
 
I agree with all that.
 
Of course there are exceptions.
 
But in favor of your view, I once had to help a colleague with a little
Java class dealing with timestamps. I first sent her a simple
non-commented class she could use as starting point, and she was well
satisfied with that. However, our project coding guidelines required
comments on everything, to serve as automatically generated
documentation, and I had a little free time so I added what I thought
was reasonable commenting and sent that. This would be very helpful, I
thought, and the code was exactly the same. But now the clear
understanding evaporated, "I don't understand any of this!".
 
I guess what happened was not that the comments misled intellectually,
but that with comments added the code LOOKED MORE COMPLICATED.
 
In a similar vein, my late father once thought he couldn't use my
calculator, because it looked so complex, lots of "math" keys. It didn't
matter that the keys he'd use were the same as on other calculators he'd
used. There was the uncertainty about the thing.
 
Francis Glassborow once remarked that the nice thing about the
introduction of syntax colouring was that one could now configure the
editor to show comments as white on white. ;-)
 
Which, I think, goes to show that your sentiment is not new, and is
shared by many who have suffered other's "well-commented" code.
 
Looks, not content.
 
 
 
Cheers,
 
- Alf
JiiPee <no@notvalid.com>: Feb 04 12:05PM

On 04/02/2016 11:17, Alf P. Steinbach wrote:
>> them.
 
>> The best form of documentation is the code itself!
 
> I agree with all that.
 
Does this mean no comments at all, not even outside the code? Say I
write code to handle base-3 numbers (as I need three values per slot,
so not binary values like 101100, but something like 201200). Explaining
the theory near that code (with a couple of examples) helps me when I
come back a year later. It speeds things up.
In a comment I describe the mathematical logic behind it and give a
couple of short examples. Then it's easy to understand the code after that.
 
Mr Flibble <flibbleREMOVETHISBIT@i42.co.uk>: Feb 04 12:08PM

On 04/02/2016 12:05, JiiPee wrote:
> helps me when I come back year after. It speeds up things.
> In a comment I tell what is the mathematical logic behind it and coupld
> of short examples. Then its easy to understand the code after that.
 
I guess we can summarize both those points as never document HOW you are
doing something as the code itself does that.
 
/Flibble
David Brown <david.brown@hesbynett.no>: Feb 04 12:33PM +0100

On 04/02/16 00:09, Ian Collins wrote:
 
> Especially if it was written with TDD where you have a lovely set of
> tests that tell you exactly what the code does :)
 
> Aren't you going to offer up your critique of Uncle Bob's TDD sausages?
 
Not long ago, I had the pleasure of bug-fixing code from a different
company that combined incomprehensible code, badly named variables and
functions, minimal commenting (some of which was in other languages), and
no possibility of any sort of testing. However, the authors clearly
understood the importance of testing, since the one appropriate comment
was "// Test this shit!".
JiiPee <no@notvalid.com>: Feb 04 12:32PM

On 04/02/2016 12:08, Mr Flibble wrote:
>> of short examples. Then its easy to understand the code after that.
 
> I guess we can summarize both those points as never document HOW you
> are doing something as the code itself does that.
 
If I also explain in the code why I use that base-3 system, it helps to
understand the code around it. The first question when seeing base-3
calculations there is: "why are we doing it like this? why use base-3
numbers here?". I did have that question when I came back to the code
months later... and the comments above it helped me understand the motive
behind it.
 
The code does not answer questions like "why are we doing it like this?
what is the motive for doing this? why not do it another way? why is this
the best way to do it?"
 
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: Feb 04 01:43PM +0100

On 2/4/2016 1:32 PM, JiiPee wrote:
> helps to understand the code around it. The first question when seeing
> 3-base calculations there is: "why are we doing it like this? why use
> 3-base numbers here?".
 
That's because the NIM game with 3 heaps has a simple solution in base 3.
 
 
 
> The code does not answer questions like "why are we doing like this?
> what is the motive doing this? why not doing another way? why is this
> the best way to do this?"
 
Could be useful.
 
IMHO it all depends on whether the comments really add something that is
useful and can't be easily expressed in the code itself.
 
 
Cheers!,
 
- Alf
JiiPee <no@notvalid.com>: Feb 04 04:07PM

On 04/02/2016 12:43, Alf P. Steinbach wrote:
> On 2/4/2016 1:32 PM, JiiPee wrote:
 
> IMHO it all depends on whether the comments really add something that
> is useful and can't be easily expressed in the code itself.
 
you mean not like this:
 
// here we are looping through all the humans in the vector and
// printing their information!
for (const auto& a : humans)
    a.print();
 
 
hehe
 
 
red floyd <no.spam@its.invalid>: Feb 04 11:11AM -0800

On 2/4/2016 10:38 AM, Andrea Venturoli wrote:
 
>> BOOST_CHECK_THROW(z(3,4))
 
> Think 20-40 lines of those comments, followed by the 20-40 lines of
> code, which soon will get out of sync.
 
I once had the dubious "pleasure" of examining code where the standards
said that each function would have a block comment describing
functionality.
 
So far, so good.
 
However, this code was written in Ada by an idiot who figured that since
well written Ada code was "self-documenting", the block comment was
literally a commented out copy of the function.
 
Of course, the code was neither well written nor self documenting.
 
FWIW, my block comments look like this:
 
//
// func_name() -- one line description
//
// INPUTS: [input 1] -- description or NONE
// ...
//
// OUTPUTS: [output 1] -- description
// ...
//
// RETURNS: description, or NONE
//
// Short description of what function is intended to do and why
//
red floyd <no.spam@its.invalid>: Feb 04 11:16AM -0800

On 2/4/2016 6:03 AM, Jerry Stuckle wrote:
 
> Well written comments indicate WHY the code does what it does. It also
> defines input and output conditions to a function, and other information
> not part of the code.
 
One time, I wrote a comment that was about five times the length of the
actual code. I was working on a Z80, and using 14-bit scaled fixed
point trig. To avoid losing precision when I multiplied the sines and
cosines, I had worked out a whole bunch of transformations that involved
adding angles instead of multiplying sin/cos.
 
The comment was the derivation of the transformations, since the code
was non-obvious. However, upon reading the comment, anyone familiar
with trig would understand what I had done, and why.
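
(The transformations in question are presumably the standard
product-to-sum identities, e.g.

sin A * cos B = ( sin(A + B) + sin(A - B) ) / 2
cos A * cos B = ( cos(A - B) + cos(A + B) ) / 2
sin A * sin B = ( cos(A - B) - cos(A + B) ) / 2

so each product of two scaled values becomes two table lookups and an
add, avoiding the precision loss of a fixed-point multiply.)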