Tuesday, October 10, 2017

Digest for comp.lang.c++@googlegroups.com - 25 updates in 4 topics

bitrex <bitrex@de.lete.earthlink.net>: Oct 09 09:28PM -0400

How would I go about accomplishing the following, which seems like
something one might want to do regularly in a multi-threaded
data-crunching application:
 
Take a vector of some type where the required algorithm can be applied
element-wise and doesn't depend on any of the other values, split into N
chunks (ideally N = number of cores * threads per core), send off copies
to the worker threads and then recombine the result in a new vector in
the same order after completion. Or use iterators to transform the
original vector in place from different threads, if that's possible?
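 
Roughly the shape I have in mind, as an untested sketch using std::async
(the "* 2.0" is just a stand-in for whatever the real element-wise
operation is):
 
#include <algorithm>
#include <cstddef>
#include <future>
#include <thread>
#include <vector>
 
// Untested sketch of the idea: split the input into roughly core-count
// chunks, transform each chunk in its own task, and write results back
// into a same-sized output vector so the original order is preserved.
std::vector<double> parallel_transform(const std::vector<double>& in)
{
    const std::size_t n_chunks =
        std::max<std::size_t>(1, std::thread::hardware_concurrency());
    const std::size_t chunk = (in.size() + n_chunks - 1) / n_chunks;
 
    std::vector<double> out(in.size());
    std::vector<std::future<void>> tasks;
 
    for (std::size_t begin = 0; begin < in.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, in.size());
        tasks.push_back(std::async(std::launch::async, [&in, &out, begin, end] {
            for (std::size_t i = begin; i != end; ++i)
                out[i] = in[i] * 2.0;   // stand-in for the real element-wise work
        }));
    }
    for (auto& t : tasks)
        t.get();                        // wait for every chunk before returning
    return out;
}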
red floyd <no.spam@its.invalid>: Oct 09 09:22PM -0700

On 10/09/2017 06:28 PM, bitrex wrote:
> to the worker threads and then recombine the result in a new vector in
> the same order after completion. Or use iterators to transform the
> original vector in place from different threads, if that's possible?
 
Run through the vector, and pass a reference (or pointer) to each
element to a new thread for processing?
 
PSEUDOCODE:
 
for (T& elem : v)
spawn_thread(some_function, elem);
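 
Spelled out with std::thread it might look something like this (just a
sketch; spawning one thread per element is only sane for short vectors):
 
#include <functional>
#include <thread>
#include <vector>
 
void some_function(double& x) { x *= 2.0; }   // stand-in for the real per-element work
 
void process_all(std::vector<double>& v)
{
    std::vector<std::thread> threads;
    threads.reserve(v.size());
    for (double& elem : v)
        threads.emplace_back(some_function, std::ref(elem));  // one thread per element
    for (auto& t : threads)
        t.join();                                             // wait for all of them
}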
Christian Gollwitzer <auriocus@gmx.de>: Oct 10 07:28AM +0200

Am 10.10.17 um 03:28 schrieb bitrex:
> to the worker threads and then recombine the result in a new vector in
> the same order after completion. Or use iterators to transform the
> original vector in place from different threads, if that's possible?
 
Use OpenMP. It does most of that for you:
 
std::vector<double> result(old.size());
 
// this pragma does almost exactly what you describe,
// except it doesn't copy the input vector
#pragma omp parallel for
for (size_t i = 0; i < old.size(); i++) {
    result[i] = old[i] * 2;
}
 
// caveat: some OpenMP implementations do not accept unsigned loop types;
// then maybe replace size_t by intptr_t and ignore the
// comparison between signed and unsigned warning
 
Compile with openmp enabled (-fopenmp for gcc or /openmp for Visual C++)
 
 
Christian
red floyd <no.spam@its.invalid>: Oct 10 12:23AM -0700

On 10/09/2017 10:28 PM, Christian Gollwitzer wrote:
> // comparison between signed and unsigned warning
 
> Compile with openmp enabled (-fopenmp for gcc or /openmp for Visual C++)
 
>     Christian
 
Much better than mine, assuming he has OpenMP.
 
Does OpenMP work on a single system? Or does it need to hand off to
another node in a cluster? I haven't looked at it in years.
David Brown <david.brown@hesbynett.no>: Oct 10 10:32AM +0200

On 10/10/17 03:28, bitrex wrote:
> to the worker threads and then recombine the result in a new vector in
> the same order after completion. Or use iterators to transform the
> original vector in place from different threads, if that's possible?
 
If you have C++17, you can try the "execution policies":
 
<http://en.cppreference.com/w/cpp/algorithm>
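 
For example, something along these lines (a sketch - note that as of
today not many standard library implementations actually ship the
parallel algorithms yet):
 
#include <algorithm>
#include <execution>
#include <vector>
 
std::vector<double> doubled(const std::vector<double>& in)
{
    std::vector<double> out(in.size());
    // std::execution::par lets the library run the element-wise work on
    // multiple threads; the results still end up in the original order
    std::transform(std::execution::par, in.begin(), in.end(), out.begin(),
                   [](double x) { return x * 2.0; });
    return out;
}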
Jorgen Grahn <grahn+nntp@snipabacken.se>: Oct 10 08:41AM

On Tue, 2017-10-10, bitrex wrote:
> to the worker threads and then recombine the result in a new vector in
> the same order after completion. Or use iterators to transform the
> original vector in place from different threads, if that's possible?
 
I think I'd try this design:
 
- Threads with an input queue and an output queue, like a Unix filter
except maybe without flow control.
- Pools of these.
- An abstraction which need not be thread-aware but can:
  - chop up a container into N pieces
  - accept "tagged" chunks of data, gather them into a destination
    container, and flag "done" when it has all pieces matching the
    source container. E.g. insert [10..12); insert [0..5); and lastly
    insert [5..10) and then it's done, because the original container
    was [0..12) chopped up into three pieces (a rough sketch of such
    a gatherer follows the list)
  - do this without too much copying
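 
A rough, untested sketch of that gathering abstraction (names made up,
threading and copy-avoidance ignored, and it assumes each range arrives
exactly once):
 
#include <cstddef>
#include <vector>
 
// Collects "tagged" chunks into a destination container and reports
// "done" once every element of the original [0..total) range has arrived.
template <typename T>
class Gatherer {
public:
    explicit Gatherer(std::size_t total) : dest_(total), received_(0) {}
 
    // insert the chunk that was cut from [offset, offset + chunk.size())
    void insert(std::size_t offset, const std::vector<T>& chunk)
    {
        for (std::size_t i = 0; i < chunk.size(); ++i)
            dest_[offset + i] = chunk[i];
        received_ += chunk.size();
    }
 
    bool done() const { return received_ == dest_.size(); }
    const std::vector<T>& result() const { return dest_; }
 
private:
    std::vector<T> dest_;
    std::size_t received_;
};
 
So insert(10, two_elements); insert(0, five_elements); insert(5,
five_elements); and after the third call done() flips to true.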
 
Although thinking a bit further, this is a bit like TCP: the sender
chops the stream up into segments, the receiver assembles them into a
stream, and preserves order.
 
A possibly infinite stream seems like a better abstraction for general
use for two reasons:
- You may want to process the first elements even if all of them aren't
ready yet.
- You'll have idle threads when you're near the end of the container;
utilization is lower than it perhaps could be.
 
Overkill for many uses, I'm sure.
 
Disclaimer: I don't do a lot of thread programming, and I didn't learn
the C++11 stuff.
 
/Jorgen
 
--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .
scott@slp53.sl.home (Scott Lurndal): Oct 10 12:39PM

>to the worker threads and then recombine the result in a new vector in
>the same order after completion. Or use iterators to transform the
>original vector in place from different threads, if that's possible?
 
Use an autovectorizing compiler and the host SIMD instruction set?
bitrex <bitrex@de.lete.earthlink.net>: Oct 10 09:43AM -0400

On 10/10/2017 09:14 AM, Stefan Ram wrote:
> ^
> . Maybe someone can explain how to remove the error?
> (Is it my compiler not supporting all of the library?)
 
Nice!!
bitrex <bitrex@de.lete.earthlink.net>: Oct 10 09:47AM -0400

On 10/10/2017 08:39 AM, Scott Lurndal wrote:
>> the same order after completion. Or use iterators to transform the
>> original vector in place from different threads, if that's possible?
 
> Use an autovectorizing compiler and the host SIMD instruction set?
 
I think recent versions of GCC should optimize for SIMD at -O3? In the
Code::Blocks build options I also see "CPU Architecture Tuning" flags
for AMD FX-64, Intel Core, etc...
scott@slp53.sl.home (Scott Lurndal): Oct 10 02:20PM

> ::std::thread thread0{ ::std::move( package0 ), p, p + len/2, 0 };
> ::std::thread thread1{ ::std::move( package0 ), p + len/2, p + len, 0 }; }
> return future0.get() + future1.get(); }
 
Completely unreadable.
 
>{ ::std::vector const vector< double >{};
> ::std::cout << sum_in_parallel( vector )<< '\n'; }
 
> But I cannot get it compiled:
 
Not surprising.
bitrex <bitrex@de.lete.earthlink.net>: Oct 10 10:22AM -0400

On 10/10/2017 09:52 AM, Stefan Ram wrote:
 
> ::std::thread thread0{ ::std::move( package0 ), p, p + len/2, 0 };
> ::std::thread thread1{ ::std::move( package0 ), p + len/2, p + len, 0 };
 
> , it is clear that one can move from »package0« only once.
 
Also the function that's packaged takes two doubles as arguments, but in
the "thread0" and "thread1" constructors the author is trying to pass
three, not counting the rvalue reference to the package. Also I don't
think using raw pointers into the vector's data is such a good
idea.
 
This compiles with -std=c++11 on recent versions of GCC, but gives a
"terminate called without an active exception" on execution - looks like
the threads aren't being joined properly.
 
#include <future>
#include <initializer_list>
#include <iostream>
#include <thread>
#include <utility>
#include <vector>
#include <numeric>
 
double sum(double const* const beginning, double const* const end)
{
    return ::std::accumulate(beginning, end, 0.0);
}
 
double sum_in_parallel(const ::std::vector<double>& vector)
{
    using task_type = double(double const*, double const*);
    ::std::packaged_task<task_type> package0{sum};
    ::std::packaged_task<task_type> package1{sum};
    ::std::future<double> future0{package0.get_future()};
    ::std::future<double> future1{package1.get_future()};
    double const* const p = &vector[0];
    {
        auto len{vector.size()};
        ::std::thread thread0{::std::move(package0), p, p + len / 2};
        ::std::thread thread1{::std::move(package1), p + len / 2, p + len};
    }  // thread0 and thread1 go out of scope here without being joined,
       // which is what makes the runtime call std::terminate
    return future0.get() + future1.get();
}
 
int main()
{
    const ::std::vector<double> vector{1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
    ::std::cout << sum_in_parallel(vector) << ::std::endl;
}
bitrex <bitrex@de.lete.earthlink.net>: Oct 10 10:22AM -0400

On 10/10/2017 10:22 AM, bitrex wrote:
>> ::std::thread thread1{ ::std::move( package0 ), p + len/2, p + len, 0 };
 
>>    , it is clear that one can move from »package0« only once.
 
> Also the function that's packaged takes two doubles as arguments
 
pointers to doubles, rather
David Brown <david.brown@hesbynett.no>: Oct 10 04:44PM +0200

On 10/10/17 15:47, bitrex wrote:
 
> I think recent versions of GCC should optimize for SIMD at -O3? In the
> Code::Blocks build options I also see "CPU Architecture Tuning" flags
> for AMD FX-64, Intel Core, etc...
 
gcc (and other compilers) can do some auto-vectorising. But that is a
very different thing from multi-threading. Auto-vectorising means using
SIMD instructions to do multiple identical operations in parallel within
the one core. Multi-threading (as originally asked) means doing
possibly different operations in multiple threads, preferably on
multiple cores. Both techniques are useful.
 
To get the most out of auto-vectorising, you need to make sure the
compiler knows the cpu type it is targeting (such as with "-march=native"
if compiling for just your own cpu, but possibly other flags if you want
it to run on a variety of cpus in the same family). You need -O2 or -O3
(or the compiler's equivalent). You may need other flags as well. And
you need to give your compiler as much information as possible - try to
make your loops of constant size, make data aligned suitably for
vectorisation (such as with gcc's "aligned" attribute), etc.
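 
As a sketch of the kind of loop that tends to vectorise well (the 32-byte
alignment is just an assumption suited to AVX; alignas is the standard
spelling of gcc's "aligned" attribute):
 
#include <cstddef>
 
constexpr std::size_t N = 1024;            // constant trip count helps the vectoriser
alignas(32) static float a[N], b[N], c[N]; // 32-byte alignment suits AVX loads/stores
 
void add_arrays()
{
    // element-wise loop with no cross-iteration dependencies: gcc at
    // -O3 -march=native (or -O2 -ftree-vectorize) can turn this into SIMD code
    for (std::size_t i = 0; i < N; ++i)
        c[i] = a[i] + b[i];
}
 
With reasonably recent gcc, something like "g++ -O3 -march=native
-fopt-info-vec" will report which loops were vectorised.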
Alain Ketterlin <alain@universite-de-strasbourg.fr.invalid>: Oct 10 04:45PM +0200

> something one might want to do regularly in a multi-threaded
> data-crunching application:
 
> Take a vector of some type
 
"some type" is the crucial factor here.
 
> depend on any of the other values, split into N chunks (ideally N =
> number of cores * threads per core), send off copies to the worker
> threads
 
If you copy pieces of vectors, don't expect significant gains (unless
you have many cores): memory access is much more costly than mere
arithmetic. Also, simultaneous multi-threading (e.g., hyperthreading)
might be detrimental to performance (it all depends on the kind/amount
of data you process: SMT adds pressure on caches). It is very easy to
make a parallel version run slower than the sequential version.
 
> and then recombine the result in a new vector in the same order after
> completion. Or use iterators to transform the original vector in place
> from different threads, if that's possible?
 
Your best bet is OpenMP. Work in place as much as possible. For parallel
loops, adapt the scheduling strategy to the work (im)balance (static if
approximately balanced, dynamic otherwise) and to the array size (for a
static schedule, longer chunks are better). If you use small chunks and
you have small array elements, arrange for the chunks to align on
cache-line boundaries to avoid false sharing.
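 
For instance (a sketch; expensive_filter is a made-up stand-in for badly
balanced per-element work):
 
#include <cstddef>
#include <vector>
 
// made-up per-element work whose cost varies a lot from element to element
static double expensive_filter(double x)
{
    double r = x;
    for (int i = 0; i < static_cast<int>(x) % 1000; ++i)
        r = r * 0.999 + 1.0;
    return r;
}
 
void process(std::vector<double>& v)
{
    // dynamic schedule: threads grab 64-element chunks as they finish, which
    // evens out the imbalance at the price of a little scheduling overhead;
    // for balanced work, schedule(static) (one big chunk per thread) is cheaper
    #pragma omp parallel for schedule(dynamic, 64)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(v.size()); ++i)
        v[i] = expensive_filter(v[i]);
}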
 
If you plan to, e.g., sum/... short vectors of int/float/..., give up on
multi-threading and ensure your compiler vectorizes properly; if necessary
rewrite your code so that it does (use whatever options your compiler
provides to spot the problems). Also make sure the compiler targets the
correct architecture (e.g., -march=native with gcc).
 
If instead you plan to, e.g., apply various filters to large raster
images of various sizes, use OpenMP (and still make sure your compiler
optimizes the sequential part correctly). Then play with scheduling
strategies.
 
-- Alain.
Christian Gollwitzer <auriocus@gmx.de>: Oct 10 05:31PM +0200

Am 10.10.17 um 09:23 schrieb red floyd:
 
> Much better than mine, assuming he has OpenMP.
 
> Does OpenMP work on a single system? Or does it need to hand off to
> another node in a cluster? I haven't looked at it in years.
 
OpenMP only works on shared-memory systems, i.e. on a single node with
multiple CPUs. It is available in all major current C++ compilers (gcc,
clang, Intel, Visual). There used to be a discontinued product from
Intel (Cluster OpenMP) which used page faults to synchronize the memory
over the cluster, but today clustering needs different tools (MPI is
the most standard one).
 
Christian
scott@slp53.sl.home (Scott Lurndal): Oct 10 04:48PM

>Intel (cluster OpenMP) which used page faults to synchronize the memory
>over the cluster, but for today clustering needs different tools (MPI is
>the most standard one)
 
For loosely coupled systems, openMPI is the typical answer.
David Brown <david.brown@hesbynett.no>: Oct 10 07:02PM +0200

On 10/10/17 15:14, Stefan Ram wrote:
> ^
> . Maybe someone can explain how to remove the error?
> (Is it my compiler not supporting all of the library?)
 
I have tried to keep the structure and logic of your code, while
removing the worst jumbled mess of formatting and the extra includes.
And it is crazy to call your std::vector instance "vector". (I hope you
don't teach your students that weird bracketing style, unusual spacing,
and unnecessary ::std. They are just going to have to unlearn it all
before working with any real-world code.)
 
Key errors:
 
1. Messed up type for "p"
2. Using "vector.size" instead of "vector.size()"
3. Extra parameter to your thread initialisers
4. Forgetting to join your threads
5. Using an empty vector for testing!
 
 
#include <numeric>
#include <vector>
#include <iostream>
#include <future>
 
static double sum(const double * const beginning, const double * const end)
{
    return std::accumulate(beginning, end, 0.0);
}
 
static double sum_in_parallel(const std::vector<double> &vect)
{
    using task_type = double(const double *, const double *);
    std::packaged_task<task_type> package0 { sum };
    std::packaged_task<task_type> package1 { sum };
    std::future<double> future0 { package0.get_future() };
    std::future<double> future1 { package1.get_future() };
    const double * p = &vect[0];
    const auto len { vect.size() };
    std::thread thread0 { std::move(package0), p, p + len / 2 };
    std::thread thread1 { std::move(package1), p + len / 2, p + len };
    thread0.join();
    thread1.join();
    return future0.get() + future1.get();
}
 
int main()
{
    const std::vector<double> vect { 1.0, 2.0, 3.0, 4.0 };
    std::cout << sum_in_parallel(vect) << '\n';
}
David Brown <david.brown@hesbynett.no>: Oct 10 07:05PM +0200

On 10/10/17 19:02, David Brown wrote:
 
> Key errors:
 
> 1. Messed up type for "p"
Skip that one - I had merely missed out a "const" while copying the code.
red floyd <dont.bother@its.invalid>: Oct 10 10:06AM -0700

On 10/10/2017 9:48 AM, Scott Lurndal wrote:
 
> For loosely coupled systems, openMPI is the typical answer.
 
*THAT'S* the one I was thinking of. Thanks.
bitrex <bitrex@de.lete.earthlink.net>: Oct 10 01:40PM -0400

On 10/10/2017 01:02 PM, David Brown wrote:
>     const std::vector<double> vect { 1.0, 2.0, 3.0, 4.0 };
>     std::cout << sum_in_parallel(vect) << '\n';
> }
 
Nice, thank you
bitrex <bitrex@de.lete.earthlink.net>: Oct 10 01:38PM -0400

Prior to C++11 I guess it was assumed that allocators were stateless, so
you didn't have to do anything special with their copy constructors (it
seems most were just declared as "throw()").
 
Now it's possible for allocators to have internal state, but a problem
arises if I write something like the following:
 
template<typename T>
using RebindAlloc = typename std::allocator_traits<
    AllocatorBase<T, StatefulAllocatorPolicy<T>>>::template rebind_alloc<T>;
 
typedef std::basic_string<char, std::char_traits<char>,
RebindAlloc<char>> my_allocated_string_t;
 
my_allocated_string_t my_string{"abcde"};
 
Suppose my allocator policy has internal state, say a raw memory block
of some size that is allocated with "new" on construction and freed with
"delete[]" (via a raw pointer stored in a class member) in the
destructor. If I just let the policy use the default copy constructor,
the allocator gets copy-constructed inside the constructor of
std::basic_string, the trivially copyable members (including that raw
pointer) are copied over verbatim, and then I get a segmentation fault
as the code tries to delete the same block of memory twice on
destruction of the original allocator instance and the copy.
 
So if I want custom-allocated container types that use a stateful
allocator to be copy constructible, copy assignable etc., I'd need to
have some kind of state shared between instances, using either a
singleton or a reference-counting smart pointer to a common structure
holding the mutable fields, I guess?
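 
Something like this is what I'm picturing - an untested sketch of a toy
stateful (bump) allocator whose copies share the block through
shared_ptr, so it only gets freed once (names are made up, and alignment
is ignored since I only care about char here):
 
#include <cstddef>
#include <memory>
#include <new>
#include <string>
 
// Toy stateful allocator: every copy shares the same arena through
// shared_ptr, so the block is released exactly once, by the last copy.
template<typename T>
struct SharedArenaAllocator
{
    using value_type = T;
 
    explicit SharedArenaAllocator(std::size_t bytes = 4096)
        : arena_(new unsigned char[bytes],
                 std::default_delete<unsigned char[]>()),
          size_(bytes),
          used_(std::make_shared<std::size_t>(0)) {}
 
    template<typename U>
    SharedArenaAllocator(const SharedArenaAllocator<U>& other) noexcept
        : arena_(other.arena_), size_(other.size_), used_(other.used_) {}
 
    T* allocate(std::size_t n)
    {
        const std::size_t bytes = n * sizeof(T);
        if (*used_ + bytes > size_)
            throw std::bad_alloc();                 // toy bump allocation only
        T* p = reinterpret_cast<T*>(arena_.get() + *used_);
        *used_ += bytes;
        return p;
    }
    void deallocate(T*, std::size_t) noexcept {}    // arena is freed as a whole
 
    std::shared_ptr<unsigned char> arena_;  // shared, so copies never double-delete
    std::size_t size_;
    std::shared_ptr<std::size_t> used_;
};
 
template<typename T, typename U>
bool operator==(const SharedArenaAllocator<T>& a,
                const SharedArenaAllocator<U>& b) noexcept
{ return a.arena_ == b.arena_; }
 
template<typename T, typename U>
bool operator!=(const SharedArenaAllocator<T>& a,
                const SharedArenaAllocator<U>& b) noexcept
{ return !(a == b); }
 
using my_allocated_string_t =
    std::basic_string<char, std::char_traits<char>, SharedArenaAllocator<char>>;
 
With that, the copy made inside std::basic_string's constructor just
bumps a reference count instead of ending up with a second raw pointer
to the same block, so no double delete.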
ram@zedat.fu-berlin.de (Stefan Ram): Oct 10 01:14PM

>element-wise and doesn't depend on any of the other values, split into N
>chunks (ideally N = number of cores * threads per core), send off copies
>to the worker threads and then recombine the result in a new vector in
 
Here is something similar (based on code by Bjarne Stroustrup):
Calculate the sum of a vector in parallel.
 
#include <algorithm>
#include <future>
#include <initializer_list>
#include <iostream>
#include <ostream>
#include <thread>
#include <utility>
#include <vector>
 
double sum( double const * const beginning, double const * const end )
{ return ::std::accumulate( beginning, end, 0.0 ); }
 
double sum_in_parallel( ::std::vector< double > const & vector )
{ using task_type = double( double const *, double const * );
::std::packaged_task< task_type >package0{ sum };
::std::packaged_task< task_type >package1{ sum };
::std::future< double >future0{ package0.get_future() };
::std::future< double >future1{ package1.get_future() };
double const * const p = &vector[ 0 ];
{ auto len { vector.size };
::std::thread thread0{ ::std::move( package0 ), p, p + len/2, 0 };
::std::thread thread1{ ::std::move( package0 ), p + len/2, p + len, 0 }; }
return future0.get() + future1.get(); }
 
int main()
{ ::std::vector const vector< double >{};
::std::cout << sum_in_parallel( vector )<< '\n'; }
 
But I cannot get it compiled:
 
error: variable 'std::packaged_task<double(const double*, const double*)> package0'
has initializer but incomplete type
::std::packaged_task< task_type >package0{ sum };
^
. Maybe someone can explain how to remove the error?
(Is it my compiler not supporting all of the library?)
ram@zedat.fu-berlin.de (Stefan Ram): Oct 10 01:19PM

>{ ::std::vector const vector< double >{};
 
That should be
 
::std::vector< double >const vector {};
 
. (But the error reported still remains.)
ram@zedat.fu-berlin.de (Stefan Ram): Oct 10 01:52PM

> That should be
>::std::vector< double >const vector {};
> . (But the error reported still remains.)
 
Oh, and
 
::std::thread thread1{ ::std::move( package0 ), p + len/2, p + len, 0 };
 
should be
 
::std::thread thread1{ ::std::move( package1 ), p + len/2, p + len, 0 };
 
. I was not able to start this program, so I was not able to
debug it. But »move« helped me to spot the error, because
when reading,
 
::std::thread thread0{ ::std::move( package0 ), p, p + len/2, 0 };
::std::thread thread1{ ::std::move( package0 ), p + len/2, p + len, 0 };
 
, it is clear that one can move from »package0« only once.
"Rick C. Hodgin" <rick.c.hodgin@gmail.com>: Oct 10 06:21AM -0700

This is Pastor Mac, a guest pastor at a church in Hawaii I listen to
each week. He's appeared on the channel as a guest pastor a handful
of times that I've seen, but his sermons are always so powerful.
 
I urge you all to listen to this. He talks about the blood, and how
our lives are so dependent upon it here in this world, but also upon
the blood of Christ in eternity:
 
Sacrifice, Sanctity, Life to God
https://www.youtube.com/watch?v=XZfn1XLCOQo
 
Life is in the blood. You can live without an arm, or a leg, or so
many other parts. But if you lose the blood it's over.
 
-----
This relationship between the blood of man, and the blood of Christ,
is by design. Even a woman's monthly cycle is a reminder of man's
original sin in the Garden of Eden, a permanent, personal reminder
of our accountability unto God world-wide.
 
Thank you,
Rick C. Hodgin
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page.
To unsubscribe from this group and stop receiving emails from it send an email to comp.lang.c+++unsubscribe@googlegroups.com.
