- Split up vector for concurrent processing - 20 Updates
- Copy construction of stateful allocators - 1 Update
- Split up vector for concurrent processing - 3 Updates
- A teaching that's worth hearing - 1 Update
bitrex <bitrex@de.lete.earthlink.net>: Oct 09 09:28PM -0400 How would I go about accomplishing the following, which seems like something one might want to do regularly in a multi-threaded data-crunching application: Take a vector of some type where the required algorithm can be applied element-wise and doesn't depend on any of the other values, split into N chunks (ideally N = number of cores * threads per core), send off copies to the worker threads and then recombine the result in a new vector in the same order after completion. Or use iterators to transform the original vector in place from different threads, if that's possible? |
red floyd <no.spam@its.invalid>: Oct 09 09:22PM -0700 On 10/09/2017 06:28 PM, bitrex wrote: > to the worker threads and then recombine the result in a new vector in > the same order after completion. Or use iterators to transform the > original vector in place from different threads, if that's possible? Run through the vector, and pass a reference (or pointer) to each element to a new thread for processing? PSEUDOCODE: for (T& elem: v) spawn_thread(some_function, elem); |
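A minimal C++11 sketch of that idea (some_function is a hypothetical placeholder for the per-element work; in practice you would spawn far fewer threads than elements):

    // Sketch of the thread-per-element pseudocode above; some_function is a
    // placeholder. Each element is passed by reference via std::ref.
    #include <functional>
    #include <thread>
    #include <vector>

    void some_function(double& x) { x *= 2.0; }   // placeholder element-wise work

    int main() {
        std::vector<double> v{1.0, 2.0, 3.0, 4.0};
        std::vector<std::thread> workers;
        for (double& elem : v)
            workers.emplace_back(some_function, std::ref(elem));  // pass the element, not the type
        for (std::thread& t : workers)
            t.join();   // must join before reading the results (and before the threads are destroyed)
    }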
Christian Gollwitzer <auriocus@gmx.de>: Oct 10 07:28AM +0200 On 10.10.17 at 03:28, bitrex wrote: > to the worker threads and then recombine the result in a new vector in > the same order after completion. Or use iterators to transform the > original vector in place from different threads, if that's possible? Use OpenMP. It does most of that for you:

    std::vector<double> result(old.size());
    #pragma omp parallel for
    // this pragma does almost exactly what you describe,
    // except it doesn't copy the input vector
    for (size_t i = 0; i < old.size(); i++) {
        result[i] = old[i] * 2;
    }
    // caveat: some OpenMP implementations do not accept unsigned loop counters;
    // then maybe replace size_t by intptr_t and ignore the
    // comparison between signed and unsigned warning

Compile with OpenMP enabled (-fopenmp for gcc or /openmp for Visual C++) Christian |
red floyd <no.spam@its.invalid>: Oct 10 12:23AM -0700 On 10/09/2017 10:28 PM, Christian Gollwitzer wrote: > // comparison between signed and unsigned warning > Compile with openmp enabled (-fopenmp for gcc or /openmp for Visual C++) > Christian Much better than mine, assuming he has OpenMP. Does OpenMP work on a single system? Or does it need to hand off to another node in a cluster? I haven't looked at it in years. |
David Brown <david.brown@hesbynett.no>: Oct 10 10:32AM +0200 On 10/10/17 03:28, bitrex wrote: > to the worker threads and then recombine the result in a new vector in > the same order after completion. Or use iterators to transform the > original vector in place from different threads, if that's possible? If you have C++17, you can try the "execution policies": <http://en.cppreference.com/w/cpp/algorithm> |
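A sketch of what that looks like, assuming a standard library that actually ships the C++17 parallel algorithms (on libstdc++ this generally also needs TBB):

    // Sketch using a C++17 parallel execution policy: an element-wise transform
    // that the library may split across threads.
    #include <algorithm>
    #include <execution>
    #include <vector>

    int main() {
        std::vector<double> in{1.0, 2.0, 3.0, 4.0};
        std::vector<double> out(in.size());
        std::transform(std::execution::par, in.begin(), in.end(), out.begin(),
                       [](double x) { return x * 2.0; });   // order of results is preserved
    }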
Jorgen Grahn <grahn+nntp@snipabacken.se>: Oct 10 08:41AM On Tue, 2017-10-10, bitrex wrote: > to the worker threads and then recombine the result in a new vector in > the same order after completion. Or use iterators to transform the > original vector in place from different threads, if that's possible? I think I'd try this design:
- Threads with an input queue and an output queue, like a Unix filter except maybe without flow control.
- Pools of these.
- An abstraction which need not be thread-aware but can:
  - chop up a container into N pieces
  - accept "tagged" chunks of data, gather them into a destination container, and flag "done" when it has all pieces matching the source container. E.g. insert [10..12); insert [0..5); and lastly insert [5..10) and then it's done because the original container was [0..12) chopped up in three pieces.
  - do this without too much copying
Although thinking a bit further, this is a bit like TCP: the sender chops the stream up into segments, the receiver assembles them into a stream, and preserves order. A possibly infinite stream seems like a better abstraction for general use for two reasons:
- You may want to process the first elements even if all of them aren't ready yet.
- You'll have idle threads when you're near the end of the container; utilization is lower than it perhaps could be.
Overkill for many uses, I'm sure. Disclaimer: I don't do a lot of thread programming, and I didn't learn the C++11 stuff. /Jorgen -- // Jorgen Grahn <grahn@ Oo o. . . \X/ snipabacken.se> O o . |
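A rough sketch of the "tagged chunk" gatherer described above (the class name and interface are illustrative, not from the thread); it assumes chunks do not overlap, so "done" is simply when the filled count reaches the destination size:

    // Illustrative sketch: accepts chunks tagged by their offset, in any order,
    // and reports done once the whole destination has been filled.
    #include <cstddef>
    #include <mutex>
    #include <vector>

    template <typename T>
    class ChunkGatherer {
    public:
        explicit ChunkGatherer(std::size_t total) : dest_(total), filled_(0) {}

        // Insert a chunk covering [offset, offset + chunk.size()) of the result.
        void insert(std::size_t offset, const std::vector<T>& chunk) {
            std::lock_guard<std::mutex> lock(mtx_);
            for (std::size_t i = 0; i < chunk.size(); ++i)
                dest_[offset + i] = chunk[i];
            filled_ += chunk.size();
        }

        bool done() const {
            std::lock_guard<std::mutex> lock(mtx_);
            return filled_ == dest_.size();
        }

        const std::vector<T>& result() const { return dest_; }

    private:
        mutable std::mutex mtx_;
        std::vector<T> dest_;
        std::size_t filled_;
    };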
scott@slp53.sl.home (Scott Lurndal): Oct 10 12:39PM >to the worker threads and then recombine the result in a new vector in >the same order after completion. Or use iterators to transform the >original vector in place from different threads, if that's possible? Use an autovectorizing compiler and the host SIMD instruction set? |
bitrex <bitrex@de.lete.earthlink.net>: Oct 10 09:43AM -0400 On 10/10/2017 09:14 AM, Stefan Ram wrote: > ^ > . Maybe someone can explain how to remove the error? > (Is it my compiler not supporting all of the library?) Nice!! |
bitrex <bitrex@de.lete.earthlink.net>: Oct 10 09:47AM -0400 On 10/10/2017 08:39 AM, Scott Lurndal wrote: >> the same order after completion. Or use iterators to transform the >> original vector in place from different threads, if that's possible? > Use an autovectorizing compiler and the host SIMD instruction set? I think recent versions of GCC should optimize for SIMD at -O3? In the Code::Blocks build options I also see "CPU Architecture Tuning" flags for AMD FX-64, Intel Core, etc... |
scott@slp53.sl.home (Scott Lurndal): Oct 10 02:20PM > ::std::thread thread0{ ::std::move( package0 ), p, p + len/2, 0 }; > ::std::thread thread1{ ::std::move( package0 ), p + len/2, p + len, 0 }; } > return future0.get() + future1.get(); } Completely unreadable. >{ ::std::vector const vector< double >{}; > ::std::cout << sum_in_parallel( vector )<< '\n'; } > But I cannot get it compiled: Not surprising. |
bitrex <bitrex@de.lete.earthlink.net>: Oct 10 10:22AM -0400 On 10/10/2017 09:52 AM, Stefan Ram wrote: > ::std::thread thread0{ ::std::move( package0 ), p, p + len/2, 0 }; > ::std::thread thread1{ ::std::move( package0 ), p + len/2, p + len, 0 }; > , it is clear that one can move from »package0« only once. Also the function that's packaged takes two pointers to double as arguments, but in the "thread0" and "thread1" constructors the author is trying to pass three, not including the rvalue reference to the package. Also I don't think using raw pointers as indexes into the vector data is such a good idea. This compiles with -std=c++11 on recent versions of GCC but gives a "terminate without active exception" on execution - looks like the threads aren't being joined properly.

    #include <future>
    #include <initializer_list>
    #include <iostream>
    #include <thread>
    #include <utility>
    #include <vector>
    #include <numeric>

    double sum(double const* const beginning, double const* const end) {
        return ::std::accumulate(beginning, end, 0.0);
    }

    double sum_in_parallel(const ::std::vector<double>& vector) {
        using task_type = double(double const*, double const*);
        ::std::packaged_task<task_type> package0{sum};
        ::std::packaged_task<task_type> package1{sum};
        ::std::future<double> future0{package0.get_future()};
        ::std::future<double> future1{package1.get_future()};
        double const* const p = &vector[0];
        {
            auto len{vector.size()};
            ::std::thread thread0{::std::move(package0), p, p + len / 2};
            ::std::thread thread1{::std::move(package1), p + len / 2, p + len};
        }
        return future0.get() + future1.get();
    }

    int main() {
        const ::std::vector<double> vector{1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
        ::std::cout << sum_in_parallel(vector) << ::std::endl;
    }
|
bitrex <bitrex@de.lete.earthlink.net>: Oct 10 10:22AM -0400 On 10/10/2017 10:22 AM, bitrex wrote: >> ::std::thread thread1{ ::std::move( package0 ), p + len/2, p + len, 0 }; >> , it is clear that on can move from »package0« only once. > Also the function that's packaged takes two doubles as arguments pointers to doubles, rather |
David Brown <david.brown@hesbynett.no>: Oct 10 04:44PM +0200 On 10/10/17 15:47, bitrex wrote: > I think recent versions of GCC should optimize for SIMD at -O3? In the > Code::Blocks build options I also see "CPU Architecture Tuning" flags > for AMD FX-64, Intel Core, etc... gcc (and other compilers) can do some auto-vectorising. But that is a very different thing from multi-threading. Auto-vectorising means using SIMD instructions to do multiple identical operations in parallel within the one core. Multi-threading (as originally asked) means doing possibly different operations in multiple threads, preferably on multiple cores. Both techniques are useful. To get the most out of auto-vectorising, you need to make sure the compiler knows the cpu type it is targeting (such as with "-march=native" if compiling for just your own cpu, but possibly other flags if you want it to run on a variety of cpus in the same family). You need -O2 or -O3 (or the compiler's equivalent). You may need other flags as well. And you need to give your compiler as much information as possible - try to make your loops of constant size, make data aligned suitably for vectorisation (such as with gcc's "aligned" attribute), etc. |
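For illustration, a sketch of the kind of loop that auto-vectorises well: fixed trip count, aligned data, no cross-iteration dependencies (standard alignas is used here instead of gcc's "aligned" attribute; build with something like -O3 -march=native):

    // Vectorisation-friendly loop: constant size, aligned arrays, independent
    // element-wise work. Build with e.g.: g++ -O3 -march=native
    #include <cstddef>

    constexpr std::size_t N = 1024;        // constant trip count helps the vectoriser
    alignas(64) double a[N];
    alignas(64) double b[N];
    alignas(64) double c[N];

    void scale_and_add() {
        for (std::size_t i = 0; i < N; ++i)
            c[i] = 2.0 * a[i] + b[i];      // no dependence between iterations
    }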
Alain Ketterlin <alain@universite-de-strasbourg.fr.invalid>: Oct 10 04:45PM +0200 > something one might want to do regularly in a multi-threaded > data-crunching application: > Take a vector of some type "some type" is the crucial factor here. > depend on any of the other values, split into N chunks (ideally N = > number of cores * threads per core), send off copies to the worker > threads If you copy pieces of vectors, don't expect significant gains (unless you have many cores): memory access is much more costly than mere arithmetic. Also simultaneous multi-threading (e.g., hyperthreading) might be detrimental to performance (it all depends on the kind/amount of data you process: SMT adds pressure on caches). It is very easy to make a parallel version run slower than the sequential version. > and then recombine the result in a new vector in the same order after > completion. Or use iterators to transform the original vector in place > from different threads, if that's possible? Your best bet is OpenMP. Work in place as much as possible. For parallel loops, adapt the scheduling strategy to the work (im)balance (static if approximately balanced, dynamic otherwise), and to the array size (for static schedules, longer chunks are better). If you use small chunks and you have small array elements, arrange for the chunks to align on cache-line sizes to avoid false sharing. If you plan to, e.g., sum/... short vectors of int/float/..., give up on multi-threads and ensure your compiler vectorizes properly; if necessary rewrite your code so that it does (use whatever options your compiler provides to spot the problems). Also make sure the compiler targets the correct architecture (e.g., -march=native with gcc). If instead you plan to, e.g., apply various filters to large raster images of various sizes, use OpenMP (and still make sure your compiler optimizes the sequential part correctly). Then play with scheduling strategies. -- Alain. |
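To illustrate the scheduling point, a sketch with a placeholder work_item() function (the dynamic chunk size of 64 is an arbitrary example, not a recommendation from the post):

    // Sketch of the OpenMP scheduling choices mentioned above; work_item() is a
    // placeholder for the real per-element computation. Compile with -fopenmp.
    #include <cmath>
    #include <vector>

    double work_item(double x) { return std::sqrt(x) * 2.0; }

    // Roughly balanced work: static schedule gives each thread one long chunk.
    void process_balanced(const std::vector<double>& in, std::vector<double>& out) {
        const long n = static_cast<long>(in.size());
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; ++i)
            out[i] = work_item(in[i]);
    }

    // Imbalanced work: dynamic schedule; the chunk size (64, arbitrary example)
    // is kept large enough that neighbouring threads don't share cache lines.
    void process_imbalanced(const std::vector<double>& in, std::vector<double>& out) {
        const long n = static_cast<long>(in.size());
        #pragma omp parallel for schedule(dynamic, 64)
        for (long i = 0; i < n; ++i)
            out[i] = work_item(in[i]);
    }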
Christian Gollwitzer <auriocus@gmx.de>: Oct 10 05:31PM +0200 On 10.10.17 at 09:23, red floyd wrote: > Much better than mine, assuming he has OpenMP. > Does OpenMP work on a single system? Or does it need to hand off to > another node in a cluster? I haven't looked at it in years. OpenMP only works on shared memory systems, i.e. on a single node with multiple CPUs. It is available in all major current C++ compilers (gcc, clang, Intel, Visual). There used to be a discontinued product from Intel (Cluster OpenMP) which used page faults to synchronize the memory over the cluster, but today's clusters need different tools (MPI is the most standard one). Christian |
scott@slp53.sl.home (Scott Lurndal): Oct 10 04:48PM >Intel (cluster OpenMP) which used page faults to synchronize the memory >over the cluster, but for today clustering needs different tools (MPI is >the most standard one) For loosely coupled systems, openMPI is the typical answer. |
David Brown <david.brown@hesbynett.no>: Oct 10 07:02PM +0200 On 10/10/17 15:14, Stefan Ram wrote: > ^ > . Maybe someone can explain how to remove the error? > (Is it my compiler not supporting all of the library?) I have tried to keep the structure and logic of your code, while removing the worst jumbled mess of formatting and the extra includes. And it is crazy to call your std::vector instance "vector". (I hope you don't teach your students that weird bracketing style, unusual spacing, and unnecessary ::std. They are just going to have to unlearn it all before working with any real-world code.) Key errors:
1. Messed up type for "p"
2. Using "vector.size" instead of "vector.size()"
3. Extra parameter to your thread initialisers
4. Forgetting to join your threads
5. Using an empty vector for testing!

    #include <numeric>
    #include <vector>
    #include <iostream>
    #include <thread>
    #include <future>

    static double sum(const double * const beginning, const double * const end) {
        return std::accumulate(beginning, end, 0.0);
    }

    static double sum_in_parallel(const std::vector<double> &vect) {
        using task_type = double(const double *, const double *);
        std::packaged_task<task_type> package0 { sum };
        std::packaged_task<task_type> package1 { sum };
        std::future<double> future0 { package0.get_future() };
        std::future<double> future1 { package1.get_future() };
        const double * p = &vect[0];
        const auto len { vect.size() };
        std::thread thread0 { std::move(package0), p, p + len / 2 };
        std::thread thread1 { std::move(package1), p + len / 2, p + len };
        thread0.join();
        thread1.join();
        return future0.get() + future1.get();
    }

    int main() {
        const std::vector<double> vect { 1.0, 2.0, 3.0, 4.0 };
        std::cout << sum_in_parallel(vect) << '\n';
    }
|
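Generalising that two-thread sum to the N chunks the original question asked about, a sketch (not from the thread) that uses std::async and std::thread::hardware_concurrency(), combining the partial sums in order:

    // Sketch: split the sum into roughly one chunk per hardware thread with
    // std::async, then combine the partial results in a deterministic order.
    #include <algorithm>
    #include <future>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    double sum_chunks(const std::vector<double>& v) {
        const std::size_t n_threads =
            std::max<std::size_t>(1, std::thread::hardware_concurrency());
        const std::size_t chunk = (v.size() + n_threads - 1) / n_threads;

        std::vector<std::future<double>> parts;
        for (std::size_t begin = 0; begin < v.size(); begin += chunk) {
            const std::size_t end = std::min(begin + chunk, v.size());
            parts.push_back(std::async(std::launch::async, [&v, begin, end] {
                return std::accumulate(v.begin() + begin, v.begin() + end, 0.0);
            }));
        }

        double total = 0.0;
        for (auto& f : parts)   // get() in submission order, so combining is ordered
            total += f.get();
        return total;
    }

    int main() {
        const std::vector<double> v{1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
        std::cout << sum_chunks(v) << '\n';
    }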
David Brown <david.brown@hesbynett.no>: Oct 10 07:05PM +0200 On 10/10/17 19:02, David Brown wrote: > Key errors: > 1. Messed up type for "p" Skip that one - I had merely missed out a "const" while copying the code. |
red floyd <dont.bother@its.invalid>: Oct 10 10:06AM -0700 On 10/10/2017 9:48 AM, Scott Lurndal wrote: > For loosely coupled systems, openMPI is the typical answer. *THAT'S* the one I was thinking of. Thanks. |
bitrex <bitrex@de.lete.earthlink.net>: Oct 10 01:40PM -0400 On 10/10/2017 01:02 PM, David Brown wrote: > const std::vector<double> vect { 1.0, 2.0, 3.0, 4.0 }; > std::cout << sum_in_parallel(vect) << '\n'; > } Nice, thank you |
bitrex <bitrex@de.lete.earthlink.net>: Oct 10 01:38PM -0400 Prior to C++11 I guess it was assumed that allocators were stateless, so you didn't have to do anything special with their copy constructors (it seems most were just declared as "throw()"). Now it's possible for allocators to have internal state, but a problem arises if I write something like the following:

    template<typename T>
    using RebindAlloc = typename std::allocator_traits<
        AllocatorBase<T, StatefulAllocatorPolicy<T>>>::template rebind_alloc<T>;

    typedef std::basic_string<char, std::char_traits<char>, RebindAlloc<char>> my_allocated_string_t;

    my_allocated_string_t my_string{"abcde"};

If my allocator policy has internal state, say a raw memory block of some size which is initialized on instantiation with "new" and freed using "delete []" on a raw pointer stored in a class field in its destructor, and I just let the policy use the default copy constructor, the allocator is copy constructed in the constructor of std::basic_string, all the trivially copyable internal fields are copied over verbatim, and then I get a segmentation fault as the code tries to delete the same block of memory twice on destruction of both the original allocator instance and the copy. So if I want custom allocated container types that use a stateful allocator to be copy constructible, copy assignable, etc., I'd need to have some kind of shared state between instances, using either a singleton or a reference-counting smart pointer to a common structure holding the mutable fields, I guess? |
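A minimal sketch of the reference-counted approach (all names here are hypothetical, not the poster's actual policy classes): the allocator holds a std::shared_ptr to its pool, so copies and rebound copies share the state and the underlying block is freed exactly once:

    // Sketch of a stateful allocator whose copies share their state through a
    // shared_ptr; the pool is a deliberately naive bump allocator.
    #include <cstddef>
    #include <memory>
    #include <new>
    #include <string>

    struct Pool {
        explicit Pool(std::size_t bytes) : storage(new char[bytes]), size(bytes), used(0) {}
        std::unique_ptr<char[]> storage;   // freed once, when the last allocator copy dies
        std::size_t size;
        std::size_t used;
    };

    template <typename T>
    class PoolAllocator {
    public:
        using value_type = T;
        template <typename U> friend class PoolAllocator;

        explicit PoolAllocator(std::shared_ptr<Pool> pool) : pool_(std::move(pool)) {}

        template <typename U>
        PoolAllocator(const PoolAllocator<U>& other) : pool_(other.pool_) {}   // rebound copies share the pool

        T* allocate(std::size_t n) {
            const std::size_t bytes = n * sizeof(T);
            if (pool_->used + bytes > pool_->size) throw std::bad_alloc{};
            T* p = reinterpret_cast<T*>(pool_->storage.get() + pool_->used);
            pool_->used += bytes;
            return p;
        }
        void deallocate(T*, std::size_t) {}   // bump allocator: nothing to release per object

        template <typename U>
        bool operator==(const PoolAllocator<U>& o) const { return pool_ == o.pool_; }
        template <typename U>
        bool operator!=(const PoolAllocator<U>& o) const { return pool_ != o.pool_; }

    private:
        std::shared_ptr<Pool> pool_;
    };

    int main() {
        auto pool = std::make_shared<Pool>(1024);
        using MyString = std::basic_string<char, std::char_traits<char>, PoolAllocator<char>>;
        MyString a{"abcde", PoolAllocator<char>(pool)};
        MyString b = a;   // the allocator copy shares the same Pool; no double delete
    }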
ram@zedat.fu-berlin.de (Stefan Ram): Oct 10 01:14PM >element-wise and doesn't depend on any of the other values, split into N >chunks (ideally N = number of cores * threads per core), send off copies >to the worker threads and then recombine the result in a new vector in Here is something similar (based on code by Bjarne Stroustrup): Calculate the sum of a vector in parallel.

    #include <algorithm>
    #include <future>
    #include <initializer_list>
    #include <iostream>
    #include <ostream>
    #include <thread>
    #include <utility>
    #include <vector>

    double sum( double const * const beginning, double const * const end )
    { return ::std::accumulate( beginning, end, 0.0 ); }

    double sum_in_parallel( ::std::vector< double > const & vector )
    { using task_type = double( double const *, double const * );
      ::std::packaged_task< task_type >package0{ sum };
      ::std::packaged_task< task_type >package1{ sum };
      ::std::future< double >future0{ package0.get_future() };
      ::std::future< double >future1{ package1.get_future() };
      double const * const p = &vector[ 0 ];
      { auto len { vector.size };
        ::std::thread thread0{ ::std::move( package0 ), p, p + len/2, 0 };
        ::std::thread thread1{ ::std::move( package0 ), p + len/2, p + len, 0 }; }
      return future0.get() + future1.get(); }

    int main()
    { ::std::vector const vector< double >{};
      ::std::cout << sum_in_parallel( vector )<< '\n'; }

But I cannot get it compiled:

    error: variable 'std::packaged_task<double(const double*, const double*)> package0' has initializer but incomplete type
     ::std::packaged_task< task_type >package0{ sum };
     ^

. Maybe someone can explain how to remove the error? (Is it my compiler not supporting all of the library?) |
ram@zedat.fu-berlin.de (Stefan Ram): Oct 10 01:19PM >{ ::std::vector const vector< double >{}; That should be ::std::vector< double >const vector {}; . (But the error reported still remains.) |
ram@zedat.fu-berlin.de (Stefan Ram): Oct 10 01:52PM > That should be >::std::vector< double >const vector {}; > . (But the error reported still remains.) Oh, and ::std::thread thread1{ ::std::move( package0 ), p + len/2, p + len, 0 }; should be ::std::thread thread1{ ::std::move( package1 ), p + len/2, p + len, 0 }; . I was not able to start this program, so I was not able to debug it. But »move« helped me to spot the error, because when reading, ::std::thread thread0{ ::std::move( package0 ), p, p + len/2, 0 }; ::std::thread thread1{ ::std::move( package0 ), p + len/2, p + len, 0 }; , it is clear that one can move from »package0« only once. |
"Rick C. Hodgin" <rick.c.hodgin@gmail.com>: Oct 10 06:21AM -0700 This is Pastor Mac, a guest pastor at a church in Hawaii I listen to each week. He's appeared on the channel as a guest pastor a handful of times that I've seen, but his sermons are always so powerful. I urge you all to listen to this. He talks about the blood, and how our lives are so dependent upon it here in this world, but also upon the blood of Christ in eternity: Sacrifice, Sanctity, Life to God https://www.youtube.com/watch?v=XZfn1XLCOQo Life is in the blood. You can live without an arm, or a leg, or so many other parts. But if you lose the blood it's over. ----- This relationship between the blood of man, and the blood of Christ, is by design. Even a woman's monthly cycle is a reminder of man's original sin in the Garden of Eden, a permanent, personal reminder of our accountability unto God world-wide. Thank you, Rick C. Hodgin |