- You can run it on... - 1 Update
- Scalable Parallel implementation of Conjugate Gradient Linear System solver library version 1.2 - 1 Update
- A "better" C++ - 3 Updates
Ramine <ramine@1.1>: Nov 28 01:16PM -0800 Hello.... You can run my Scalable Parallel implementation of Conjugate Gradient Linear System solver library, which is NUMA-aware and cache-aware, version 1.2, on large multicore NUMA architectures, for example on the following 16-socket system with many NUMA nodes; read the following from HP: http://www8.hp.com/h20195/v2/GetPDF.aspx%2F4AA5-1507ENW.pdf You can download my new Scalable Parallel implementation of Conjugate Gradient Linear System solver library version 1.2 from: https://sites.google.com/site/aminer68/scalable-parallel-implementation-of-conjugate-gradient-linear-system-solver-library-that-is-numa-aware-and-cache-aware Thank you, Amine Moulay Ramdane |
Ramine <ramine@1.1>: Nov 28 09:58AM -0800 Hello..... I have updated my Scalable Parallel implementation of Conjugate Gradient Linear System solver library, which is NUMA-aware and cache-aware, to version 1.2. In the previous version, the FIFO queues of the thread pools were not allocated on different NUMA nodes; I have enhanced this, and now each thread pool's FIFO queue is allocated on a different NUMA node. The rest of my algorithm was already NUMA-aware and cache-aware, so the whole algorithm is now fully NUMA-aware and cache-aware, and it scales on multicores and on NUMA architectures of the x86 architecture. You can download my new Scalable Parallel implementation of Conjugate Gradient Linear System solver library version 1.2 from: https://sites.google.com/site/aminer68/scalable-parallel-implementation-of-conjugate-gradient-linear-system-solver-library-that-is-numa-aware-and-cache-aware Feel free to port it to C++ and to other programming languages... Thank you, Amine Moulay Ramdane. |
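[Editor's note: a minimal C++ sketch of the "one FIFO queue per NUMA node" idea described above, not the author's code. The names (`NodeQueue`, `take`) are invented for illustration, and real NUMA placement would use something like libnuma's `numa_alloc_onnode` and thread pinning, which are omitted here for portability.]

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <vector>

// One mutex-protected FIFO queue per NUMA node. In a real NUMA-aware
// implementation each queue's storage would be placed on its own node
// (e.g. with libnuma's numa_alloc_onnode); here placement is only implied.
struct NodeQueue {
    std::mutex m;
    std::deque<int> q;  // work items (plain task ids in this sketch)

    void push(int task) {
        std::lock_guard<std::mutex> lock(m);
        q.push_back(task);
    }
    std::optional<int> pop() {
        std::lock_guard<std::mutex> lock(m);
        if (q.empty()) return std::nullopt;
        int t = q.front();
        q.pop_front();
        return t;
    }
};

// A worker thread pinned to NUMA node `node` prefers its local queue and
// falls back to taking work from the other nodes' queues when it is empty.
std::optional<int> take(std::vector<NodeQueue>& queues, std::size_t node) {
    if (auto t = queues[node].pop()) return t;
    for (std::size_t i = 0; i < queues.size(); ++i)
        if (i != node)
            if (auto t = queues[i].pop()) return t;
    return std::nullopt;
}
```

The point of the per-node split is that a worker's common-case pops touch only memory on its own node, avoiding cross-node traffic on the hot path.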
"Öö Tiib" <ootiib@hot.ee>: Nov 27 06:14PM -0800 On Friday, 27 November 2015 18:41:29 UTC+2, J. Clarke wrote: > > need to resort to low-level trickery to achieve this speed. > Do qsort and std::sort() implement the same algorithm? If not then you > are not comparing languages, you are comparing libraries. The exact algorithm is not required for either, but in recent years it is often single-threaded quicksort with a fallback to heap sort. The speed difference typically stays about the same as the amount of sorted data grows. It is not about the algorithms used but about micro-optimizations made by the compiler. Since 'std::sort' is a template, it is simpler for the compiler to optimize. When performance matters seriously, both C and C++ developers can implement a sort that best matches the nature of the sortable data and the situation at hand. There is a quite large pile of sorting algorithms invented, and some are more parallelizable than others. When performance does not matter much, it is an irrelevant non-issue anyway. |
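[Editor's note: a minimal illustration of the template point above. `qsort` calls its comparator through an opaque function pointer, while `std::sort` instantiates the comparison into the sort itself, which is exactly what the compiler can then inline and micro-optimize.]

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// qsort: the comparator is a function pointer, invoked through an
// indirect call per comparison that the compiler usually cannot inline.
int cmp_int(const void* a, const void* b) {
    int x = *static_cast<const int*>(a);
    int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

void sort_c(std::vector<int>& v) {
    std::qsort(v.data(), v.size(), sizeof(int), cmp_int);
}

// std::sort: the comparison is part of the template instantiation, so
// the compiler sees its body and can inline it and optimize around it.
void sort_cpp(std::vector<int>& v) {
    std::sort(v.begin(), v.end(), [](int a, int b) { return a < b; });
}
```

Both functions produce the same result; the difference the thread is discussing is purely in the generated code per comparison.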
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: Nov 28 04:53AM +0100 On 11/28/2015 3:14 AM, Öö Tiib wrote: > algorithms invented and some are more parallelizable than others. > When the performance does not matter much then it is irrelevant > non-issue anyway. This is an interesting exercise. With 4 cores available, when is it faster to partition the sort into 4 threads, each on its own core, and merge the results? A perhaps too-simple back-of-the-envelope inequality said this would be faster for anything above a handful of items, but that didn't include the setup time for the threads. I wonder what actual measurements would say. I would guess a cross-over point (to more efficient) at a few hundred items. Cheers, - Alf |
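[Editor's note: one way to answer Alf's question by measurement. This is a sketch (the name `sort4` is invented), not a tuned implementation: it sorts four chunks on four `std::thread`s and merges pairwise with `std::inplace_merge`; timing it against plain `std::sort` over growing input sizes would locate the cross-over point, thread setup cost included.]

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Sort `v` with 4 threads: sort four chunks concurrently, then merge.
// Fixed thread count and no setup-cost amortization, by design: the
// thread creation overhead is part of what one would be measuring.
void sort4(std::vector<int>& v) {
    const std::size_t q = v.size() / 4;
    auto b = v.begin();
    // Chunk boundaries: [0,q), [q,2q), [2q,3q), [3q,n).
    std::thread t1([&] { std::sort(b, b + q); });
    std::thread t2([&] { std::sort(b + q, b + 2 * q); });
    std::thread t3([&] { std::sort(b + 2 * q, b + 3 * q); });
    std::thread t4([&] { std::sort(b + 3 * q, v.end()); });
    t1.join(); t2.join(); t3.join(); t4.join();
    // Merge tree: two independent merges, then one final merge.
    std::inplace_merge(b, b + q, b + 2 * q);
    std::inplace_merge(b + 2 * q, b + 3 * q, v.end());
    std::inplace_merge(b, b + 2 * q, v.end());
}
```

Wrapping calls to `sort4` and `std::sort` in `std::chrono::steady_clock` timing over a range of sizes would give the actual measurements Alf asks about.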
"Öö Tiib" <ootiib@hot.ee>: Nov 27 11:37PM -0800 On Saturday, 28 November 2015 05:53:46 UTC+2, Alf P. Steinbach wrote: > include the setup time for the threads. I wonder what actual > measurements would say. I would guess a cross-over point (to more > efficient) for a few hundred items. I am afraid I haven't even tested that; I can't find any notes. The logic is that it typically takes hundreds of thousands of items before we humans start to perceive a difference in performance. Still, something I do have: Intel's TBB has 'tbb::parallel_sort' that, by my notes, always beats 'std::sort' somewhat even on small sequences, but with hundreds of thousands or more items it is about 3 times faster on a quad core. Note that with a few hundred items we would have to sort thousands of times for performance to matter, which means we likely have thousands of such unrelated little chunks to sort. So we sort those little buggers on different cores. ;) The other thing is that the GPU is worth using for something as mundane as sorting only when the data (or an index of it) is already on the GPU. Transporting it there and back cripples the results, and the sum then only barely beats TBB. |
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page. To unsubscribe from this group and stop receiving emails from it send an email to comp.lang.c+++unsubscribe@googlegroups.com. |