Monday, November 4, 2019

Digest for comp.lang.c++@googlegroups.com - 20 updates in 4 topics

Bonita Montero <Bonita.Montero@gmail.com>: Nov 04 08:48PM +0100

What would really speak against numeric_limits<type> including
a flag for whether a type is two's complement, and maybe another flag that
says whether a type has two's-complement wrap-around?
Bonita Montero <Bonita.Montero@gmail.com>: Nov 04 09:19PM +0100

> What would really speak against numeric_limits<type> including
> a flag for whether a type is two's complement, and maybe another flag that
> says whether a type has two's-complement wrap-around?
 
And a further flag that says that signed shifts have the usual behaviour
they have on most platforms?
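 
Such a flag could even be computed by hand today; a rough sketch with a
made-up name (it only covers the right-shift, since left-shifting a
negative value isn't testable in a constant expression before C++20):
 
#include <type_traits>
 
template<typename T>
constexpr bool has_arithmetic_right_shift =
    std::is_signed<T>::value && (T(-1) >> 1) == T(-1);
 
// holds on every mainstream compiler and target I know of
static_assert( has_arithmetic_right_shift<int>, "int >> doesn't sign-extend" );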
"Öö Tiib" <ootiib@hot.ee>: Nov 04 02:39PM -0800

On Monday, 4 November 2019 22:19:56 UTC+2, Bonita Montero wrote:
> > What would really speak against numeric_limits<type> including
> > a flag for whether a type is two's complement, and maybe another flag that
 
You yourself started a thread at the end of July, "In the end, reason
will come", discussing proposal paper p0907r0, which would make it
redundant.
 
> > says whether a type has two's-complement wrap-around?
 
numeric_limits<type>::is_modulo is already there. Unfortunately
no compiler sets it for signed types (even with the -fwrapv flag).
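 
For illustration, querying the existing trait is a one-liner (what it
reports is of course implementation-specific):
 
#include <iostream>
#include <limits>
 
int main()
{
    // Typically prints false for int even when the compiler is told to
    // wrap signed overflow (e.g. g++ -fwrapv), and true for unsigned.
    std::cout << std::boolalpha
              << "int: " << std::numeric_limits<int>::is_modulo << '\n'
              << "unsigned: " << std::numeric_limits<unsigned>::is_modulo << '\n';
}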
 
> And a further flag that says that signed shifts have the usual behaviour
> they have on most platforms?
 
Language lawyers at isocpp.org seem quite uncertain what they want:
<https://groups.google.com/a/isocpp.org/forum/#!msg/std-proposals/MZzCyAL1qRo/p493_UdUAgAJ>
Bo Persson <bo@bo-persson.se>: Nov 04 11:55PM +0100

On 2019-11-04 at 20:48, Bonita Montero wrote:
> What would really speak against numeric_limits<type> including
> a flag for whether a type is two's complement, and maybe another flag that
> says whether a type has two's-complement wrap-around?
 
Possibly that in C++20 all signed integers *are* two's complement. :-)
 
https://wg21.link/P0907
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Nov 04 02:02PM -0800

On 11/3/2019 5:59 AM, Bonita Montero wrote:
> The only purpose I can imagine is when I have a thread that changes
> an atomic double or float asynchronously at random points. But the
> need for this should be extremely rare.
 
For fun, I have used a table of atomic<double> that fractal computation
threads would update. The size of the table was divisible by two (real,
imag) so one can take a complex number in the form of [0][1]. [2][3] is
another one... These numbers were used by another set of threads as
starting points for an iterated function system.
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Nov 04 02:04PM -0800

On 11/4/2019 2:02 PM, Chris M. Thomasson wrote:
> imag) so one can take a complex number in the form of [0][1]. [2][3] is
> another one... These numbers were used by another set of threads as
> starting points for an iterated function system.
 
The table was a 1d array:
 
real, imag, real, imag, real, imag, ..., ...
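 
Roughly like this, if anyone is curious (a from-memory sketch, names made
up; relaxed ordering is enough since each double is published on its own):
 
#include <atomic>
#include <complex>
#include <cstddef>
 
constexpr std::size_t POINTS = 128;
std::atomic<double> g_table[POINTS * 2]; // real, imag, real, imag, ...
 
// writer side: a fractal thread stores a freshly computed point at slot i
void store_point( std::size_t i, double re, double im )
{
    g_table[i * 2 + 0].store( re, std::memory_order_relaxed );
    g_table[i * 2 + 1].store( im, std::memory_order_relaxed );
}
 
// reader side: an IFS thread grabs slot i as a starting point
std::complex<double> load_point( std::size_t i )
{
    return { g_table[i * 2 + 0].load( std::memory_order_relaxed ),
             g_table[i * 2 + 1].load( std::memory_order_relaxed ) };
}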
Bonita Montero <Bonita.Montero@gmail.com>: Nov 04 10:49AM +0100

I wrote a little program that does a parallel matrix multiplication.
It multiplies two matrices whose size in megabytes is configurable via
the first command-line parameter. The second command-line parameter is
the limit on the number of threads; if it is omitted, the number of
threads is the number of hardware threads in the system.
I found that, on my Ryzen 7 1800X, the performance is about 2.5% higher
with 64MB matrices when I limit the number of threads to 8 so that each
core is assigned only a single thread by the OS. So cache-thrashing
seems to be an issue here. This isn't a measurement error for sure,
because I've run the code with "start /b /wait /high ...".
 
So here's the code:
 
#include <iostream>
#include <vector>
#include <random>
#include <stdexcept>
#include <utility>
#include <thread>
#include <cstdlib>
#include <cmath>
 
using namespace std;
 
unsigned threadLimit = 0;
 
struct Matrix
{
Matrix() = default;
Matrix( size_t n );
Matrix( Matrix const &other );
Matrix( Matrix &&other );
double *operator []( size_t i ) const;
Matrix &operator =( Matrix const &other );
Matrix &operator =( Matrix &&other );
size_t dim() const;
 
private:
void init( size_t n );
void copy( Matrix const &other );
vector<double> m_mtx;
vector<double *> m_ptrs;
};
 
inline
Matrix::Matrix( size_t n )
{
init( n );
}
 
inline
Matrix::Matrix( Matrix const &other )
{
copy( other );
}
 
inline
Matrix::Matrix( Matrix &&other )
{
m_mtx = move( other.m_mtx );
m_ptrs = move( other.m_ptrs );
}
 
inline
double *Matrix::operator []( size_t i ) const
{
return m_ptrs[i];
}
 
Matrix &Matrix::operator =( Matrix const &other )
{
copy( other );
return *this;
}
 
inline
Matrix &Matrix::operator =( Matrix &&other )
{
m_mtx = move( other.m_mtx );
m_ptrs = move( other.m_ptrs );
return *this;
}
 
inline
size_t Matrix::dim() const
{
return m_ptrs.size();
}
 
inline
void Matrix::init( size_t n )
{
m_mtx.resize( n * n, 0.0 );
m_ptrs.resize( n, nullptr );
for( size_t i = 0, j = 0; i != n; ++i, j += n )
m_ptrs[i] = &m_mtx[j];
}
 
inline
void Matrix::copy( Matrix const &other )
{
size_t n = other.m_ptrs.size();
init( n );
for( size_t i = 0; i != n; ++i )
for( size_t j = 0; j != n; ++j )
m_ptrs[i][j] = other.m_ptrs[i][j];
}
 
Matrix operator *( Matrix const &a, Matrix const &b )
{
size_t n;
if( (n = a.dim()) != b.dim() )
throw invalid_argument( "unequal matrix-dimensions" );
Matrix mtx( n );
auto matrixpart = [&a, &b, &mtx, n]( size_t begin, size_t end )
{
for( size_t i = begin; i != end; ++i )
for( size_t j = 0; j != n; ++j )
{
mtx[i][j] = 0.0;
for( size_t k = 0; k != n; ++k )
mtx[i][j] += a[i][k] * b[k][j];
}
};
vector<thread> threads;
double part = (double)n / ::threadLimit,
p = 0.0;
for( unsigned t = 0; ; ++t, p += part )
{
size_t begin = (size_t)p,
end = (size_t)(p + part);
if( begin == end )
continue;
if( t == ::threadLimit - 1 )
{
matrixpart( begin, n );
break;
}
threads.emplace_back( matrixpart, begin, end );
}
for( thread &thr : threads )
thr.join();
return move( mtx );
}
 
int main( int argc, char **argv )
{
if( argc < 2 )
return EXIT_FAILURE;
size_t mtxDim = (size_t)sqrt( (size_t)(unsigned)atoi( argv[1]
) * 1024 * 1024 / 8 );
Matrix mtxA( mtxDim ),
mtxB( mtxDim ),
mtxC;
cout << "matrix-dimension: " << mtxDim << endl;
if( (::threadLimit = thread::hardware_concurrency()) == 0 )
::threadLimit = 1;
if( argc >= 3 )
{
unsigned tlParam = (unsigned)atoi( argv[2] );
tlParam += tlParam == 0;
if( tlParam < (::threadLimit = thread::hardware_concurrency()) )
::threadLimit = tlParam;
}
cout << "thread-limit: " << ::threadLimit << endl;
auto fillmatrix = []( Matrix &mtx )
{
vector<thread> threads;
size_t n = mtx.dim();
auto mtxRndThread = [&mtx, n]( size_t begin, size_t end )
{
static thread_local random_device rd;
static thread_local uniform_real_distribution<double> urd(
0.0, 1.0 );
for( size_t i = begin; i != end; ++i )
for( size_t j = 0; j != n; ++j )
mtx[i][j] = urd( rd );
};
double part = (double)n / ::threadLimit,
p = 0.0;
for( unsigned t = 0; ; ++t, p += part )
{
size_t begin = (size_t)p,
end = (size_t)(p + part);
if( begin == end )
continue;
if( t == ::threadLimit - 1 )
{
mtxRndThread( begin, n );
break;
}
threads.emplace_back( mtxRndThread, begin, end );
}
for( thread &thr : threads )
thr.join();
};
fillmatrix( mtxA );
fillmatrix( mtxB );
mtxC = move( mtxA * mtxB );
return EXIT_SUCCESS;
}
 
For the Windows users: here's a little program to measure the runtime of
other applications, similar to the Unix time command:
 
#include <Windows.h>
#include <string>
#include <cstring>
#include <cstdio>
#include <cstdlib>
#include <sstream>
#include <iomanip>
 
using namespace std;
 
string FormatClockCycles( ULONG64 ul64ClockCycles );
 
int wmain( int argc, wchar_t **argv )
{
if( argc < 2 )
return printf( "can't create process\n" ), EXIT_FAILURE;
 
wstring commandline;
for( size_t i = 1; i != (size_t)argc; commandline += (i !=
(size_t)argc - 1) ? L" " : L"", ++i )
if( wcschr( argv[i], L' ' ) == nullptr )
commandline += argv[i];
else
commandline += L"\"",
commandline += argv[i],
commandline += L"\"";
 
STARTUPINFOW si;
PROCESS_INFORMATION pi;
memset( &si, 0, sizeof si );
si.cb = sizeof si;
commandline += L'\0';
if( !CreateProcessW( nullptr, (LPWSTR)commandline.c_str(), nullptr,
nullptr, FALSE, 0, nullptr, nullptr, &si, &pi ) )
return printf( "can't create process\n" ), EXIT_FAILURE;
WaitForSingleObject( pi.hProcess, INFINITE );
FILETIME creationTime,
exitTime,
userFTime,
sysFTime;
GetProcessTimes( pi.hProcess, &creationTime, &exitTime, &sysFTime,
&userFTime );
ULONG64 processCycles;
QueryProcessCycleTime( pi.hProcess, &processCycles );
 
double realTime,
userTime,
sysTime;
realTime = ((LONGLONG &)exitTime - (LONGLONG &)creationTime) /
10'000.0;
userTime = (LONGLONG &)userFTime /
10'000.0;
sysTime = (LONGLONG &)sysFTime /
10'000.0;
string clockCycles = FormatClockCycles( processCycles );
printf( "real %.2lfms\n", realTime );
printf( "user %.2lfms\n", userTime );
printf( "sys %.2lfms\n", sysTime );
printf( "cycles %s\n", clockCycles.c_str() );
 
return EXIT_SUCCESS;
}
 
string FormatClockCycles( ULONG64 ul64ClockCycles )
{
string str;
ostringstream ss;
while( ul64ClockCycles )
ss.str( "" ),
ss << setw( 3 ) << setfill( '0' ) << (unsigned)(ul64ClockCycles
% 1'000u),
str = ss.str() + (str != "" ? "." : "") + str,
ul64ClockCycles /= 1'000u;
return str;
}
 
I compiled the latter tool to timep.exe. So I tested my code with:
start /b /wait /high timep matrix 64 8
and
start /b /wait /high timep matrix 64 16
to test the code with 8 and 16 threads.
It would be nice if someone here would test the matrix code on an
SMT-enabled CPU with the full and the half number of threads to see
if it shows the same thrashing problems.
Paavo Helde <myfirstname@osa.pri.ee>: Nov 04 02:20PM +0200

On 4.11.2019 11:49, Bonita Montero wrote:
 
> ...
> for( size_t k = 0; k != n; ++k )
> mtx[i][j] += a[i][k] * b[k][j];
 
 
Before you start parallelizing large matrix multiplication, maybe
you should first learn how to do this efficiently in a single thread.
 
Hint: consider what happens when each next k in the above innermost loop
misses the various CPU caches.
 
Hint 2: realize that, written in the above form, this cache thrashing
will inevitably happen if the matrices are large enough.
 
Hint 3: there is a simple little trick called "transpose" which can fix
this problem.
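 
To illustrate the idea, a rough single-threaded sketch (flat n*n arrays,
not your Matrix class): transpose b once, and the innermost loop then
reads both operands contiguously.
 
#include <vector>
#include <cstddef>
 
void multiply( std::vector<double> const &a, std::vector<double> const &b,
               std::vector<double> &c, std::size_t n )
{
    std::vector<double> bt( n * n );       // b transposed, built once
    for( std::size_t i = 0; i != n; ++i )
        for( std::size_t j = 0; j != n; ++j )
            bt[j * n + i] = b[i * n + j];
    for( std::size_t i = 0; i != n; ++i )
        for( std::size_t j = 0; j != n; ++j )
        {
            double s = 0.0;
            for( std::size_t k = 0; k != n; ++k ) // sequential loads only
                s += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = s;
        }
}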
Bonita Montero <Bonita.Montero@gmail.com>: Nov 04 01:28PM +0100

> you should first learn how to do this efficiently in a single thread.
 
> Hint: consider what happens when each next k in the above innermost loop
> misses the various CPU caches.
 
k is incrementing vertically for b[k][j]; this might not be good, but
the b[k] rows will still be in the cache for b[k + 1] ... if n isn't
too large, which should be common.
Bonita Montero <Bonita.Montero@gmail.com>: Nov 04 03:15PM +0100

>> you should first learn how to do this efficiently in a single thread.
 
>> Hint: consider what happens when each next k in the above innermost
>> loop misses the various CPU caches.
 
Ok, for large matrices, where a vertical stride through b doesn't fit
in the L2 cache, you're right. On a multiplication of a 64MB matrix
with a 64MB transposed matrix I get a superlinear speedup of 8.4x with
8 threads over one thread and a speedup of 14.5x with 16 threads; so
that's a speedup through SMT of 73%!
 
#include <iostream>
#include <vector>
#include <random>
#include <stdexcept>
#include <utility>
#include <thread>
#include <cstdlib>
#include <cmath>
 
using namespace std;
 
unsigned threadLimit = 0;
 
struct TransposedMatrix;
 
struct Matrix
{
Matrix() = default;
Matrix( size_t n );
Matrix( Matrix const &other );
Matrix( Matrix &&other );
double *operator []( size_t i ) const;
Matrix &operator =( Matrix const &other );
Matrix &operator =( Matrix &&other );
TransposedMatrix &transpose();
Matrix &resize( size_t n );
size_t dim() const;
 
private:
void init( size_t n );
void copy( Matrix const &other );
vector<double> m_mtx;
vector<double *> m_ptrs;
};
 
struct TransposedMatrix : public Matrix
{
using Matrix::Matrix;
using Matrix::operator [];
using Matrix::dim;
TransposedMatrix &operator =( TransposedMatrix const &other );
TransposedMatrix &operator =( TransposedMatrix &&other );
TransposedMatrix &resize( size_t n );
};
 
inline
Matrix::Matrix( size_t n )
{
init( n );
}
 
inline
Matrix::Matrix( Matrix const &other )
{
copy( other );
}
 
inline
Matrix::Matrix( Matrix &&other )
{
m_mtx = move( other.m_mtx );
m_ptrs = move( other.m_ptrs );
}
 
inline
double *Matrix::operator []( size_t i ) const
{
return m_ptrs[i];
}
 
Matrix &Matrix::operator =( Matrix const &other )
{
copy( other );
return *this;
}
 
inline
Matrix &Matrix::operator =( Matrix &&other )
{
m_mtx = move( other.m_mtx );
m_ptrs = move( other.m_ptrs );
return *this;
}
 
TransposedMatrix &Matrix::transpose()
{
size_t n = dim();
for( size_t i = 0; i != n; ++i )
for( size_t j = 0; j != i ; ++j )
swap( m_ptrs[i][j], m_ptrs[j][i] );
return *(TransposedMatrix *)this;
}
 
inline
size_t Matrix::dim() const
{
return m_ptrs.size();
}
 
inline
void Matrix::init( size_t n )
{
m_mtx.resize( n * n, 0.0 );
m_ptrs.resize( n, nullptr );
for( size_t i = 0, j = 0; i != n; ++i, j += n )
m_ptrs[i] = &m_mtx[j];
}
 
inline
void Matrix::copy( Matrix const &other )
{
size_t n = other.m_ptrs.size();
init( n );
for( size_t i = 0; i != n; ++i )
for( size_t j = 0; j != n; ++j )
m_ptrs[i][j] = other.m_ptrs[i][j];
}
 
Matrix operator *( Matrix const &a, TransposedMatrix const &b )
{
size_t n;
if( (n = a.dim()) == 0 )
return move( Matrix() );
if( n != b.dim() )
throw invalid_argument( "unequal matrix-dimensions" );
Matrix mtx( n );
auto mtxMultThread = [&a, &b, &mtx, n]( size_t begin, size_t end )
{
for( size_t i = begin; i != end; ++i )
for( size_t j = 0; j != n; ++j )
{
mtx[i][j] = 0.0;
for( size_t k = 0; k != n; ++k )
mtx[i][j] += a[i][k] * b[j][k];
}
};
vector<thread> threads;
double part = (double)n / ::threadLimit,
p = 0.0;
for( unsigned t = 0; ; p += part )
{
size_t begin = (size_t)p,
end = (size_t)(p + part);
if( begin == end )
continue;
if( t++ == ::threadLimit - 1 )
{
mtxMultThread( begin, n );
break;
}
threads.emplace_back( mtxMultThread, begin, end );
}
for( thread &thr : threads )
thr.join();
return move( mtx );
}
 
inline
TransposedMatrix &TransposedMatrix::operator =( TransposedMatrix const
&other )
{
return (TransposedMatrix &)(*(Matrix *)this = (Matrix &)other);
}
 
inline
TransposedMatrix &TransposedMatrix::operator =( TransposedMatrix &&other )
{
return (TransposedMatrix &)(*(Matrix *)this = move( (Matrix &)other ));
}
 
inline
TransposedMatrix &TransposedMatrix::resize( size_t n )
{
Matrix::resize( n );
return *this;
}
 
int main( int argc, char **argv )
{
if( argc < 2 )
return EXIT_FAILURE;
size_t mtxDim = (size_t)sqrt( (size_t)(unsigned)atoi( argv[1]
) * 1024 * 1024 / 8 );
Matrix mtxA( mtxDim );
TransposedMatrix tmtxB( mtxDim );
Matrix mtxC;
cout << "matrix-dimension: " << mtxDim << endl;
if( (::threadLimit = thread::hardware_concurrency()) == 0 )
::threadLimit = 1;
if( argc >= 3 )
{
unsigned tlParam = (unsigned)atoi( argv[2] );
tlParam += tlParam == 0;
if( tlParam < (::threadLimit = thread::hardware_concurrency()) )
::threadLimit = tlParam;
}
cout << "thread-limit: " << ::threadLimit << endl;
auto fillmatrix = []( Matrix &mtx )
{
size_t n = mtx.dim();
if( n == 0 )
return;
vector<thread> threads;
auto mtxRndThread = [&mtx, n]( size_t begin, size_t end )
{
static thread_local random_device rd;
static thread_local uniform_real_distribution<double> urd(
0.0, 1.0 );
for( size_t i = begin; i != end; ++i )
for( size_t j = 0; j != n; ++j )
mtx[i][j] = urd( rd );
};
double part = (double)n / ::threadLimit,
p = 0.0;
for( unsigned t = 0; ; p += part )
{
size_t begin = (size_t)p,
end = (size_t)(p + part);
if( begin == end )
continue;
if( t++ == ::threadLimit - 1 )
{
mtxRndThread( begin, n );
break;
}
threads.emplace_back( mtxRndThread, begin, end );
}
for( thread &thr : threads )
thr.join();
};
fillmatrix( mtxA );
fillmatrix( tmtxB );
mtxC = move( mtxA * tmtxB );
return EXIT_SUCCESS;
}
Bonita Montero <Bonita.Montero@gmail.com>: Nov 04 03:23PM +0100

And this has an aliasing-problem:
 
>                     mtx[i][j] += a[i][k] * b[j][k];
>             }
>     };
 
auto mtxMultThread = [&a, &b, &mtx, n]( size_t begin, size_t end )
{
for( size_t i = begin; i != end; ++i )
for( size_t j = 0; j != n; ++j )
{
double s = 0.0;
for( size_t k = 0; k != n; ++k )
s += a[i][k] * b[j][k];
mtx[i][j] = s;
}
};
 
With the changed code the 64MB-benchmark runs over three times faster.
Christian Gollwitzer <auriocus@gmx.de>: Nov 04 04:04PM +0100

Am 04.11.19 um 13:20 schrieb Paavo Helde:
> will inevitable happen if the matrixes are large enough.
 
> Hint 3: there is a simple little trick called "transpose" which can fix
> this problem.
 
Hint 4: Others have already solved it. If you download a good BLAS with
optimizations for your system, e.g. OpenBLAS, or the one from Intel, you
may find that a simple call to "dgemm" may well be 1000x faster than the
code you have.
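 
For example (a minimal sketch assuming the cblas.h interface that
OpenBLAS ships; link against the BLAS library):
 
#include <cblas.h>
#include <vector>
 
// C = A * B for row-major n x n matrices stored in std::vectors
void multiply( std::vector<double> const &a, std::vector<double> const &b,
               std::vector<double> &c, int n )
{
    cblas_dgemm( CblasRowMajor, CblasNoTrans, CblasNoTrans,
                 n, n, n, 1.0, a.data(), n, b.data(), n, 0.0, c.data(), n );
}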
 
Christian
Bonita Montero <Bonita.Montero@gmail.com>: Nov 04 04:16PM +0100

> optimizations for your system, e.g. OpenBLAS, or the one from Intel, you
> may find that a simple call to "dgemm" may well be 1000x faster than the
> code you have.
 
I'm coding it just for fun and not to solve a real issue.
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>: Nov 04 01:57PM -0800

On 11/4/2019 7:16 AM, Bonita Montero wrote:
>> Intel, you may find that a simple call to "dgemm" may well be 1000x
>> faster than the code you have.
 
> I'm coding it just for fun and not to solve a real issue.
 
No problem with that. However, try using the GPU for fun. Just to see
the difference.
woodbrian77@gmail.com: Nov 03 08:04PM -0800

On Sunday, November 3, 2019 at 4:35:56 AM UTC-6, David Brown wrote:
 
> For those that don't know who Shapiro is, but (for some reason) want to
> know, here is an introduction.
 
> <https://rationalwiki.org/wiki/Ben_Shapiro>
 
He is from Los Angeles, California. He went to Harvard
Law School and practiced law for a short time, but then
became a journalist. He's a very knowledgeable and down-
to-earth person. He's an Orthodox Jew. He's married
and has three children - one of them hasn't been born yet.
His podcasts are super.
 
I have this in a server:
 
#ifdef __linux__
#include <linux/sctp.h>
#else
#include <netinet/sctp.h>
#endif
