- You can run it on... - 1 Update
- Scalable Parallel implementation of Conjugate Gradient Linear System solver library version 1.2 - 1 Update
- A "better" C++ - 3 Updates
Ramine <ramine@1.1>: Nov 28 01:16PM -0800 Hello.... You can run my Scalable Parallel implementation of Conjugate Gradient Linear System solver library, which is NUMA-aware and cache-aware, version 1.2, on large multicore NUMA architectures, for example on the following 16-socket system with many NUMA nodes; read the following from HP: http://www8.hp.com/h20195/v2/GetPDF.aspx%2F4AA5-1507ENW.pdf You can download my new Scalable Parallel implementation of Conjugate Gradient Linear System solver library version 1.2 from: https://sites.google.com/site/aminer68/scalable-parallel-implementation-of-conjugate-gradient-linear-system-solver-library-that-is-numa-aware-and-cache-aware Thank you, Amine Moulay Ramdane |
Ramine <ramine@1.1>: Nov 28 09:58AM -0800 Hello..... I have updated my Scalable Parallel implementation of Conjugate Gradient Linear System solver library, which is NUMA-aware and cache-aware, to version 1.2. In the previous version, the FIFO queues of the thread pools were not allocated on different NUMA nodes; I have enhanced this, and now each thread pool's FIFO queue is allocated on a different NUMA node. The rest of my algorithm was already NUMA-aware and cache-aware, so the whole algorithm is now fully NUMA-aware and cache-aware, and it scales on multicores and on NUMA architectures of the x86 architecture. You can download my new Scalable Parallel implementation of Conjugate Gradient Linear System solver library version 1.2 from: https://sites.google.com/site/aminer68/scalable-parallel-implementation-of-conjugate-gradient-linear-system-solver-library-that-is-numa-aware-and-cache-aware Feel free to port it to C++ and to other programming languages... Thank you, Amine Moulay Ramdane. |
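[Editor's note: a minimal C++ sketch of the "one FIFO queue per NUMA node" idea described above, not the author's code. The names (`NodeQueue`, `take`) are invented for illustration, and real NUMA placement would use something like libnuma's `numa_alloc_onnode` and thread pinning, which are omitted here for portability.]

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <vector>

// One mutex-protected FIFO queue per NUMA node. In a real NUMA-aware
// implementation each queue's storage would be placed on its own node
// (e.g. with libnuma's numa_alloc_onnode); here placement is only implied.
struct NodeQueue {
    std::mutex m;
    std::deque<int> q;  // work items (plain task ids in this sketch)

    void push(int task) {
        std::lock_guard<std::mutex> lock(m);
        q.push_back(task);
    }
    std::optional<int> pop() {
        std::lock_guard<std::mutex> lock(m);
        if (q.empty()) return std::nullopt;
        int t = q.front();
        q.pop_front();
        return t;
    }
};

// A worker thread pinned to NUMA node `node` prefers its local queue and
// falls back to taking work from the other nodes' queues when it is empty.
std::optional<int> take(std::vector<NodeQueue>& queues, std::size_t node) {
    if (auto t = queues[node].pop()) return t;
    for (std::size_t i = 0; i < queues.size(); ++i)
        if (i != node)
            if (auto t = queues[i].pop()) return t;
    return std::nullopt;
}
```

The point of the per-node split is that a worker's common-case pops touch only memory on its own node, avoiding cross-node traffic on the hot path.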
"Öö Tiib" <ootiib@hot.ee>: Nov 27 06:14PM -0800 On Friday, 27 November 2015 18:41:29 UTC+2, J. Clarke wrote: > > need to resort to low-level trickery to achieve this speed. > Do qsort and std::sort() implement the same algorithm? If not then you > are not comparing languages, you are comparing libraries. The exact algorithm is not required for either, but in recent years it is often single-threaded quicksort with a fallback to heap sort. The speed difference typically stays about the same as the amount of sorted data grows. It is not about the algorithms used but about micro-optimizations made by the compiler. Since 'std::sort' is a template, it is simpler for the compiler to optimize. When performance matters seriously, both C and C++ developers can implement a sort that best matches the nature of the sortable data and the situation at hand. There is a quite large pile of sorting algorithms invented, and some are more parallelizable than others. When performance does not matter much, it is an irrelevant non-issue anyway. |
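[Editor's note: a minimal illustration of the template point above. `qsort` calls its comparator through an opaque function pointer, while `std::sort` instantiates the comparison into the sort itself, which is exactly what the compiler can then inline and micro-optimize.]

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// qsort: the comparator is a function pointer, invoked through an
// indirect call per comparison that the compiler usually cannot inline.
int cmp_int(const void* a, const void* b) {
    int x = *static_cast<const int*>(a);
    int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

void sort_c(std::vector<int>& v) {
    std::qsort(v.data(), v.size(), sizeof(int), cmp_int);
}

// std::sort: the comparison is part of the template instantiation, so
// the compiler sees its body and can inline it and optimize around it.
void sort_cpp(std::vector<int>& v) {
    std::sort(v.begin(), v.end(), [](int a, int b) { return a < b; });
}
```

Both functions produce the same result; the difference the thread is discussing is purely in the generated code per comparison.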
"Alf P. Steinbach" <alf.p.steinbach+usenet@gmail.com>: Nov 28 04:53AM +0100 On 11/28/2015 3:14 AM, Öö Tiib wrote: > algorithms invented and some are more parallelizable than others. > When the performance does not matter much then it is irrelevant > non-issue anyway. This is an interesting exercise. With 4 cores available, when is it faster to partition the sort into 4 threads, each on its own core, and merge the results? A perhaps too-simple back-of-the-envelope inequality said this would be faster for anything above a handful of items, but that didn't include the setup time for the threads. I wonder what actual measurements would say. I would guess a cross-over point (to more efficient) at a few hundred items. Cheers, - Alf |
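[Editor's note: one way to answer Alf's question by measurement. This is a sketch (the name `sort4` is invented), not a tuned implementation: it sorts four chunks on four `std::thread`s and merges pairwise with `std::inplace_merge`; timing it against plain `std::sort` over growing input sizes would locate the cross-over point, thread setup cost included.]

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Sort `v` with 4 threads: sort four chunks concurrently, then merge.
// Fixed thread count and no setup-cost amortization, by design: the
// thread creation overhead is part of what one would be measuring.
void sort4(std::vector<int>& v) {
    const std::size_t q = v.size() / 4;
    auto b = v.begin();
    // Chunk boundaries: [0,q), [q,2q), [2q,3q), [3q,n).
    std::thread t1([&] { std::sort(b, b + q); });
    std::thread t2([&] { std::sort(b + q, b + 2 * q); });
    std::thread t3([&] { std::sort(b + 2 * q, b + 3 * q); });
    std::thread t4([&] { std::sort(b + 3 * q, v.end()); });
    t1.join(); t2.join(); t3.join(); t4.join();
    // Merge tree: two independent merges, then one final merge.
    std::inplace_merge(b, b + q, b + 2 * q);
    std::inplace_merge(b + 2 * q, b + 3 * q, v.end());
    std::inplace_merge(b, b + 2 * q, v.end());
}
```

Wrapping calls to `sort4` and `std::sort` in `std::chrono::steady_clock` timing over a range of sizes would give the actual measurements Alf asks about.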
"Öö Tiib" <ootiib@hot.ee>: Nov 27 11:37PM -0800 On Saturday, 28 November 2015 05:53:46 UTC+2, Alf P. Steinbach wrote: > include the setup time for the threads. I wonder what actual > measurements would say. I would guess a cross-over point (to more > efficient) for a few hundred items. I am afraid I haven't even tested that; I can't find any notes. The logic is that it typically takes hundreds of thousands of items before we humans start to perceive a difference in performance. Still, something I do have: Intel's TBB has 'tbb::parallel_sort' that, by my notes, always beats 'std::sort' somewhat even on small sequences, but with hundreds of thousands or more items it is about 3 times faster on a quad core. Note that with a few hundred items we would have to sort thousands of times for performance to matter, which means we likely have thousands of such unrelated little chunks to sort. So we sort those little buggers on different cores. ;) The other thing is that the GPU is worth using for something as mundane as sorting only when the data (or an index of it) is already on the GPU. Transporting it there and back cripples the results, and the sum then only barely beats TBB. |
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page. To unsubscribe from this group and stop receiving emails from it send an email to comp.lang.c+++unsubscribe@googlegroups.com. |