Wednesday, February 18, 2015

Digest for comp.programming.threads@googlegroups.com - 8 updates in 6 topics

Ramine <ramine@1.1>: Feb 17 07:16PM -0800

Hello again my dear programmers,
 
 
Let's talk about my parallel sort library. It divides the array to be
sorted into multiple sub-arrays, sorts those sub-arrays in parallel, and
then uses a parallel merge algorithm to merge the sorted sub-arrays.
 
Here is the kind of parallel merge that is used; read here:
 
http://dzmitryhuba.blogspot.ca/2010/10/parallel-merge-sort.html
 
 
Here are the steps of the parallel merge algorithm:
 
"Let's assume we want to merge sorted arrays X and Y. Select X[m] median
element in X. Elements in X[ .. m-1] are less than or equal to X[m].
Using binary search find index k of the first element in Y greater than
X[m]. Thus Y[ .. k-1] are less than or equal to X[m] as well. Elements
in X[m+1 .. ] are greater than or equal to X[m] and Y[k .. ] are
greater. So merge(X, Y) can be defined as concat(merge(X[ .. m–1], Y[ ..
k–1]), X[m], merge(X[m+1 .. ], Y[k .. ])),
now we can recursively in parallel do merge(X[ .. m-1], Y[ .. k–1]) and
merge(X[m+1 .. ], Y[k .. ]) and then concat results."
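 
Here is a minimal C++ sketch of this kind of recursive parallel merge
(just an illustration of the steps above, not the code of my library;
the element type, the cutoff and the use of std::async are assumptions
for the example):
 
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>
 
// Merge the sorted ranges x[xLo..xHi) and y[yLo..yHi) into dest,
// starting at dest[dLo]. The two recursive sub-merges are independent,
// so one of them is run asynchronously.
void parallelMerge(const std::vector<int>& x, std::size_t xLo, std::size_t xHi,
                   const std::vector<int>& y, std::size_t yLo, std::size_t yHi,
                   std::vector<int>& dest, std::size_t dLo,
                   std::size_t cutoff = 4096)
{
    std::size_t xLen = xHi - xLo, yLen = yHi - yLo;
    if (xLen < yLen) {                       // keep the larger range in x
        parallelMerge(y, yLo, yHi, x, xLo, xHi, dest, dLo, cutoff);
        return;
    }
    if (xLen == 0) return;                   // both ranges are empty
    if (xLen + yLen <= cutoff) {             // small input: serial merge
        std::merge(x.begin() + xLo, x.begin() + xHi,
                   y.begin() + yLo, y.begin() + yHi,
                   dest.begin() + dLo);
        return;
    }
    std::size_t m = xLo + xLen / 2;          // median element of x
    // index of the first element in y that is greater than x[m]
    std::size_t k = std::upper_bound(y.begin() + yLo, y.begin() + yHi,
                                     x[m]) - y.begin();
    std::size_t dMid = dLo + (m - xLo) + (k - yLo);
    dest[dMid] = x[m];                       // x[m] is now at its final place
    auto left = std::async(std::launch::async, [&] {
        parallelMerge(x, xLo, m, y, yLo, k, dest, dLo, cutoff);
    });
    parallelMerge(x, m + 1, xHi, y, k, yHi, dest, dMid + 1, cutoff);
    left.get();
}
 
To build a parallel sort on top of it, you split the array into
sub-arrays, sort each one in its own thread (with std::sort for
example), and then merge the sorted sub-arrays pairwise with
parallelMerge.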
 
 
But there is still a problem that I have talked about before: this
parallel merge sort, and the parallel sort algorithms of my parallel
sort library in general, are limited by the serial part of Amdahl's law
that corresponds to the memory transfers from main memory to the L2
caches. This serial part limits the scalability to not more than 4X or
8X; for example, by Amdahl's law the speedup is 1/(s + (1-s)/N), so a
serial fraction s of about 12% caps the speedup near 8X no matter how
many cores N you add. So, as I have said before, in my next project I
will make my parallel sort library scalable on NUMA architecture, so
that those memory transfers from main memory to the L2 caches are
parallelized, and this will make my parallel sort library a very
powerful library. So stay tuned...
 
 
You can also use samplesort, but my parallel sort library will also be
very good when it becomes scalable on NUMA architecture.
 
 
 
Thank you,
Amine Moulay Ramdane.
Ramine <ramine@1.1>: Feb 17 06:17PM -0800

Hello,
 
 
I have done a benchmark of my StringTree: I have created 10 directories
with 100000 files each, on my quad-core Q6600 CPU clocked at 2.4 GHz,
and it took only 2.8 seconds to create the one million files, which is
really very fast! And I have done a FindFile() on a file name that is
contained in each directory of 100000 files, and it took 7200
microseconds. So my StringTree library is really very fast and it can
be considered to have the same level of quality as an industrial
library.
 
 
You can download my StringTree from:
 
https://sites.google.com/site/aminer68/stringtree
 
 
Thank you,
Amine Moulay Ramdane.
Ramine <ramine@1.1>: Feb 17 06:22PM -0800

I correct some typos...
 
Hello,
 
 
I have done a benchmark of my StringTree: I have created 10 directories
with 100000 files each, on my quad-core Q6600 CPU clocked at 2.4 GHz,
and it took only 2.8 seconds to create the one million files, which is
really very fast! And I have done a FindFile() on a file name that is
contained in each directory of 100000 files, and it took 7200
microseconds. So my StringTree library is really very fast and it can
be considered to have the same level of quality as an industrial
library.
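 
Just to illustrate how such a timing can be measured, here is a minimal
C++ sketch of this kind of benchmark (this is not the API of my
StringTree library; std::map is used only as a stand-in for the tree,
and the counts are the ones of the benchmark above):
 
#include <chrono>
#include <cstdio>
#include <map>
#include <string>
 
int main()
{
    // Stand-in for the tree: one million "directory/file" string keys.
    std::map<std::string, int> tree;
 
    auto t0 = std::chrono::steady_clock::now();
    for (int dir = 0; dir < 10; ++dir)              // 10 directories
        for (int file = 0; file < 100000; ++file)   // 100000 files each
            tree["dir" + std::to_string(dir) + "/file"
                 + std::to_string(file)] = file;
    auto t1 = std::chrono::steady_clock::now();
 
    // Look up one file name in each of the 10 directories.
    for (int dir = 0; dir < 10; ++dir)
        tree.find("dir" + std::to_string(dir) + "/file99999");
    auto t2 = std::chrono::steady_clock::now();
 
    std::printf("create: %lld ms, find: %lld us\n",
        (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count(),
        (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());
    return 0;
}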
 
 
You can download my StringTree from:
 
https://sites.google.com/site/aminer68/stringtree
 
 
Thank you,
Amine Moulay Ramdane.
Ramine <ramine@1.1>: Feb 17 05:41PM -0800

Greinder wrote:
>My processor has 6 cores. Your benchmark program reports 12 cores.
 
 
I am using the Windows function GetSystemInfo(), which returns the
number of logical processors.
 
So my program reports the number of logical processors: on your
computer there are 12 logical processors. I have called them 12 cores,
but they are in fact 12 logical processors.
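 
Here is a minimal sketch of that call using the standard Win32 API
(this is not the exact code of my benchmark program, just an
illustration):
 
#include <windows.h>
#include <cstdio>
 
int main()
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    // dwNumberOfProcessors is the number of logical processors
    // (hardware threads), not the number of physical cores.
    std::printf("Logical processors: %lu\n",
                (unsigned long)si.dwNumberOfProcessors);
    return 0;
}
 
On a 6-core CPU with Hyper-Threading enabled, it reports 12.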
 
 
 
Thank you,
Amine Moulay Ramdane.
Ramine <ramine@1.1>: Feb 17 05:29PM -0800

Hello,
 
 
I hope you have read my previous post titled "About my parallel
algorithms and NUMA". As you have noticed, in my NUMA-aware algorithms
I am using my classical threadpool and I am not scheduling the jobs by
specifying an explicit NUMA node; instead I am dividing the parallel
memory processing between the NUMA nodes. By doing so you get a
scalable algorithm, but with about 40% less throughput than if you
design a more optimized parallel algorithm and threadpool that
schedules the jobs by specifying a NUMA node explicitly, so as to avoid
remote memory accesses on the NUMA nodes as much as possible; that
would get you about 40% more throughput. Still, I think my NUMA-aware
parallel algorithms that use a classical threadpool are also good, and
they are still good enough. But if you need me to optimize my
threadpool further to get that 40% more throughput, I will do it as a
next project.
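 
Here is a rough C++ sketch of what I mean by dividing the parallel
memory processing between the NUMA nodes with a classical threadpool
(std::thread stands in for my threadpool here; this is only an
illustration of the idea, not the code of my library):
 
#include <windows.h>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>
 
// Process one slice of the data; in my real algorithms this job would
// be submitted to the classical threadpool instead of a raw std::thread.
static void processSlice(std::vector<double>& data, size_t first, size_t last)
{
    for (size_t i = first; i < last; ++i)
        data[i] *= 2.0;          // placeholder for the real memory processing
}
 
int main()
{
    ULONG highestNode = 0;
    GetNumaHighestNodeNumber(&highestNode);  // nodes are numbered 0..highestNode
    size_t nodes = (size_t)highestNode + 1;
 
    std::vector<double> data(10000000, 1.0);
 
    // Divide the memory processing into one slice per NUMA node and run
    // the slices in parallel; the threads are not bound to the nodes.
    std::vector<std::thread> workers;
    size_t chunk = (data.size() + nodes - 1) / nodes;
    for (size_t n = 0; n < nodes; ++n) {
        size_t first = n * chunk;
        size_t last  = (first + chunk < data.size()) ? first + chunk : data.size();
        workers.emplace_back(processSlice, std::ref(data), first, last);
    }
    for (auto& w : workers) w.join();
 
    std::printf("Processed %zu elements in %zu slice(s)\n", data.size(), nodes);
    return 0;
}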
 
 
 
 
Thank you,
Amine Moulay Ramdane.
Ramine <ramine@1.1>: Feb 17 04:46PM -0800

Hello,
 
We have to be smart, so please follow along with me...
 
As you have noticed, I have invented and implemented a parallel
conjugate gradient linear system solver library...
 
Here it is:
 
https://sites.google.com/site/aminer68/scalable-parallel-implementation-of-conjugate-gradient-linear-system-solver-library-that-is-numa-aware-and-cache-aware
 
 
My parallel algorithm is scalable on NUMA architecture...
 
But you have to understand my way of designing my NUMA-aware parallel
algorithms. The first way of implementing a NUMA-aware parallel
algorithm is to implement a threadpool that schedules a job on a given
thread by specifying the NUMA node explicitly, depending on which NUMA
node's memory you will do your processing on... this way buys you about
40% more throughput on NUMA architecture. The other way is to use the
classical threadpool without specifying the NUMA node explicitly, and
to divide your parallel memory processing between the NUMA nodes; this
is the way I have implemented my NUMA-aware parallel algorithms. My way
of doing it is scalable on NUMA architecture, but you get about 40%
less throughput. Even with 40% less throughput, I think my NUMA-aware
parallel algorithms are scalable on NUMA architecture and are still
good enough. My next parallel sort library will also be scalable on
NUMA architecture.
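 
Here is a minimal sketch of that first way, using the standard Windows
NUMA functions (just an illustration of the idea, not my threadpool's
code; the node number and the buffer size are only for the example):
 
// Pin the current thread to the processors of one NUMA node and allocate
// its working memory on that same node, so the accesses stay local.
#define _WIN32_WINNT 0x0601     // Windows 7 or later for the group functions
#include <windows.h>
#include <cstdio>
 
int main()
{
    ULONG highestNode = 0;
    if (!GetNumaHighestNodeNumber(&highestNode)) return 1;
 
    USHORT node = 0;                         // use node 0 for the example
    GROUP_AFFINITY affinity = {};
    if (!GetNumaNodeProcessorMaskEx(node, &affinity)) return 1;
 
    // restrict the current thread to the processors of that node
    if (!SetThreadGroupAffinity(GetCurrentThread(), &affinity, NULL)) return 1;
 
    // allocate the buffer in that node's local memory
    SIZE_T size = 64 * 1024 * 1024;
    void* buffer = VirtualAllocExNuma(GetCurrentProcess(), NULL, size,
                                      MEM_RESERVE | MEM_COMMIT,
                                      PAGE_READWRITE, node);
    if (buffer == NULL) return 1;
 
    // ... do this node's part of the processing on the buffer here ...
 
    VirtualFree(buffer, 0, MEM_RELEASE);
    std::printf("Highest NUMA node on this machine: %lu\n", highestNode);
    return 0;
}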
 
Where have I got this 40% figure from? Please read here:
 
 
"Performance impact: the cost of NUMA remote memory access
 
For instance, this Dell whitepaper has some test results on the Xeon
5500 processors, showing that local memory access can have 40% higher
bandwidth than remote memory access, and the latency of local memory
access is around 70 nanoseconds whereas remote memory access has a
latency of about 100 nanoseconds."
 
Read more here:
 
http://sqlblog.com/blogs/linchi_shea/archive/2012/01/30/performance-impact-the-cost-of-numa-remote-memory-access.aspx
 
 
 
 
Amine Moulay Ramdane.
Ramine <ramine@1.1>: Feb 17 04:57PM -0800

On 2/17/2015 4:46 PM, Ramine wrote:
> i have implemented my parallel algorithms that are NUMA-aware, my way of
> doing is scalable on NUMA architecture but you will get 40% less
> throughput on NUMA architecture, but even if it's 40% throughput i think
 
I mean: even if it's 40% less throughput...
 
 
Ramine <ramine@1.1>: Feb 17 02:33PM -0800

Hello,
 
Xeon E7-v3 CPUs To Carry Up To 18 Cores
 
read more here:
 
http://www.tomshardware.com/news/intel-xeon-e7-v3-cpus,28550.html
 
 
 
Amine Moulay Ramdane.
