- About parallel sort algorithms - 1 Update
- About my StringTree and benchmark - 1 Update
- About my parallel sort library's benchmark program - 1 Update
- More about my NUMA-aware algorithms - 1 Update
- About my parallel algorithms and NUMA - 1 Update
- Xeon E7-v3 CPUs To Carry Up To 18 Cores - 1 Update
Ramine <ramine@1.1>: Feb 17 07:16PM -0800

Hello again my dear programmers,

Let's talk about my parallel sort library. It divides the array to be sorted into multiple sub-arrays, sorts those sub-arrays in parallel, and then uses a parallel merge algorithm to merge the sorted sub-arrays.

Here is the kind of parallel merge that is used, read here:

http://dzmitryhuba.blogspot.ca/2010/10/parallel-merge-sort.html

Here are the steps of the parallel merge algorithm:

"Let's assume we want to merge sorted arrays X and Y. Select X[m] median element in X. Elements in X[ .. m-1] are less than or equal to X[m]. Using binary search find index k of the first element in Y greater than X[m]. Thus Y[ .. k-1] are less than or equal to X[m] as well. Elements in X[m+1 .. ] are greater than or equal to X[m] and Y[k .. ] are greater. So merge(X, Y) can be defined as concat(merge(X[ .. m-1], Y[ .. k-1]), X[m], merge(X[m+1 .. ], Y[k .. ])), now we can recursively in parallel do merge(X[ .. m-1], Y[ .. k-1]) and merge(X[m+1 .. ], Y[k .. ]) and then concat results."

But there is still a problem that I have talked about before: this parallel merge sort, like the other algorithms of my parallel sort library, is limited by the serial part of Amdahl's law that corresponds to the memory transfers from main memory to the L2 caches. This serial part limits the scalability to no more than 4X or 8X. So, as I have said before, on my next project I will make my parallel sort library scalable on NUMA architecture, so that those memory transfers from memory to the L2 caches are parallelized across the NUMA nodes, and this will make my parallel sort library a very powerful library. So stay tuned...

You can also use samplesort, but my parallel sort library will also be very good once it becomes scalable on NUMA architecture.

Thank you,
Amine Moulay Ramdane.
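To make the quoted steps concrete, here is a minimal C++ sketch of that divide-and-conquer parallel merge, assuming std::async for the parallel recursion; it illustrates the quoted algorithm only, and is not the code of the library itself:

    // Merge sorted ranges X (length nx) and Y (length ny) into out.
    // Split on the median of the larger range, binary-search the split
    // point in the other range, then recurse on both halves in parallel.
    #include <algorithm>
    #include <cstddef>
    #include <future>
    #include <utility>

    void parallel_merge(const int* x, std::size_t nx,
                        const int* y, std::size_t ny,
                        int* out, std::size_t cutoff = 4096) {
        if (nx < ny) { std::swap(x, y); std::swap(nx, ny); } // keep X the larger range
        if (nx == 0) return;                                 // both ranges are empty
        if (nx + ny <= cutoff) {                             // small input: serial merge
            std::merge(x, x + nx, y, y + ny, out);
            return;
        }
        std::size_t m = nx / 2;                              // median index in X
        // index of the first element in Y greater than X[m]
        std::size_t k = std::upper_bound(y, y + ny, x[m]) - y;
        out[m + k] = x[m];                                   // X[m] is now in its final place
        auto left = std::async(std::launch::async, [=] {     // merge X[..m-1] with Y[..k-1]
            parallel_merge(x, m, y, k, out, cutoff);
        });
        parallel_merge(x + m + 1, nx - m - 1,                // merge X[m+1..] with Y[k..]
                       y + k, ny - k, out + m + k + 1, cutoff);
        left.wait();
    }

A top-level parallel merge sort would then sort each sub-array in parallel (for example with std::sort on each chunk) and combine the results pairwise with parallel_merge().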
Ramine <ramine@1.1>: Feb 17 06:17PM -0800

Hello,

I have done a benchmark for my StringTree: I created 10 directories with 100000 files each on my quad-core Q6600 CPU clocked at 2.4 GHz, and it took only 2.8 seconds to create the one million files, which is really very fast! I have also done a FindFile() on a file's name that is contained in each directory of 100000 files, and it took 7200 microseconds. So my StringTree library is really very fast, and I consider it to be of the same level of quality as an industrial library.

You can download my StringTree from:

https://sites.google.com/site/aminer68/stringtree

Thank you,
Amine Moulay Ramdane.
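For readers who want to reproduce this kind of measurement against a container of their own, here is a minimal C++ timing sketch in the same spirit; std::map is only a stand-in (StringTree itself is the author's library and its API is not shown here), and the 10 x 100000 key layout mirrors the benchmark above:

    #include <chrono>
    #include <cstdio>
    #include <map>
    #include <string>

    int main() {
        using clock = std::chrono::steady_clock;
        std::map<std::string, int> tree;              // stand-in for the StringTree

        auto t0 = clock::now();
        for (int d = 0; d < 10; ++d)                  // 10 directories
            for (int f = 0; f < 100000; ++f)          // 100000 files each
                tree["dir" + std::to_string(d) + "/file" + std::to_string(f)] = f;
        auto t1 = clock::now();

        // one lookup, analogous to a FindFile() call
        volatile bool found = tree.find("dir5/file99999") != tree.end();
        auto t2 = clock::now();

        std::printf("insert: %lld ms, find: %lld us, found=%d\n",
            (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count(),
            (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count(),
            (int)found);
        return 0;
    }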
Ramine <ramine@1.1>: Feb 17 05:41PM -0800

Greinder wrote:
>My processor has 6 cores. Your benchmark program reports 12 cores.

I am using the Windows function GetSystemInfo(), which returns the number of logical processors. So my program reports the number of logical processors: your computer has 12 logical processors. I have called them 12 cores, but they are in fact 12 logical processors.

Thank you,
Amine Moulay Ramdane.
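For reference, here is a minimal sketch of that call. GetSystemInfo() is a documented Win32 API, and its dwNumberOfProcessors field counts logical processors (hardware threads), which is why a 6-core CPU with Hyper-Threading reports 12:

    #include <windows.h>
    #include <cstdio>

    int main() {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        // dwNumberOfProcessors counts logical processors, not physical cores:
        // on a 6-core Hyper-Threaded CPU this prints 12.
        std::printf("Logical processors: %lu\n", si.dwNumberOfProcessors);
        return 0;
    }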
Ramine <ramine@1.1>: Feb 17 05:29PM -0800

Hello,

I hope you have read my previous post titled "About my parallel algorithms and NUMA". As you have noticed, in my NUMA-aware algorithms I am using my classical threadpool: I am not scheduling the jobs by specifying an explicit NUMA node, but I am dividing the parallel memory processing between the NUMA nodes. By doing so you get a scalable algorithm, but with 40% less throughput than if you design a more optimized parallel algorithm and threadpool that schedules the jobs by specifying a NUMA node explicitly, so as to best avoid remote memory accesses on the NUMA nodes; that explicit scheduling would give you 40% more throughput. But I think my NUMA-aware parallel algorithms that use a classical threadpool are also good, and they are still good enough; if you need me to optimize my threadpool further to get that 40% more throughput, I will do it as a next project.

Thank you,
Amine Moulay Ramdane.
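As an illustration of this second way of doing it, here is a minimal Windows/C++ sketch: one chunk of memory is homed on each NUMA node with VirtualAllocExNuma(), and plain std::thread workers (standing in for the classical threadpool, and deliberately not pinned to any node) each process one chunk, so the memory traffic is divided between the nodes' memory controllers:

    #include <windows.h>
    #include <thread>
    #include <vector>

    int main() {
        ULONG highestNode = 0;
        GetNumaHighestNodeNumber(&highestNode);          // highest NUMA node number
        const ULONG nodes = highestNode + 1;
        const SIZE_T chunkBytes = 1 << 20;               // 1 MB per node, for the example

        std::vector<void*> chunks(nodes);
        for (ULONG n = 0; n < nodes; ++n)                // home one buffer on each node
            chunks[n] = VirtualAllocExNuma(GetCurrentProcess(), nullptr, chunkBytes,
                                           MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE, n);

        std::vector<std::thread> workers;
        for (ULONG n = 0; n < nodes; ++n)
            workers.emplace_back([&chunks, chunkBytes, n] {
                // Each worker processes the chunk homed on node n; the thread
                // itself is unpinned, so some of its accesses may still be remote.
                char* p = static_cast<char*>(chunks[n]);
                if (p) for (SIZE_T i = 0; i < chunkBytes; ++i) p[i] = static_cast<char>(i);
            });
        for (auto& w : workers) w.join();

        for (void* c : chunks)
            if (c) VirtualFreeEx(GetCurrentProcess(), c, 0, MEM_RELEASE);
        return 0;
    }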
Ramine <ramine@1.1>: Feb 17 04:46PM -0800

Hello,

We have to be smart, so follow with me please...

As you have noticed, I have implemented and invented a parallel conjugate gradient linear system solver library. Here it is:

https://sites.google.com/site/aminer68/scalable-parallel-implementation-of-conjugate-gradient-linear-system-solver-library-that-is-numa-aware-and-cache-aware

My parallel algorithm is scalable on NUMA architecture. But you have to understand my way of designing my NUMA-aware parallel algorithms. The first way of implementing a NUMA-aware parallel algorithm is to implement a threadpool that schedules a job on a given thread by specifying the NUMA node explicitly, depending on which NUMA node's memory you will do your processing on; this way buys you 40% more throughput on NUMA architecture. The other way is to use the classical threadpool without specifying the NUMA node explicitly, but to divide, for example, your parallel memory processing between the NUMA nodes. This is the way I have implemented my NUMA-aware parallel algorithms: my way of doing it is scalable on NUMA architecture, but you will get 40% less throughput. Even so, I think my NUMA-aware parallel algorithms are scalable on NUMA architecture and are still good enough. My next parallel sort library will also be scalable on NUMA architecture.

Where did I get this 40% from? Please read here:

"Performance impact: the cost of NUMA remote memory access. For instance, this Dell whitepaper has some test results on the Xeon 5500 processors, showing that local memory access can have 40% higher bandwidth than remote memory access, and the latency of local memory access is around 70 nanoseconds whereas remote memory access has a latency of about 100 nanoseconds."

Read more here:

http://sqlblog.com/blogs/linchi_shea/archive/2012/01/30/performance-impact-the-cost-of-numa-remote-memory-access.aspx

Amine Moulay Ramdane.
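By contrast, here is a hedged sketch of the first, more optimized way: each worker is pinned to the processors of the NUMA node that owns its data before touching that memory, which is what avoids the remote accesses measured in the Dell whitepaper. It uses the documented Win32 APIs GetNumaNodeProcessorMaskEx() and SetThreadGroupAffinity() (Windows 7 and later); run_on_node() and work() are illustrative names, not the API of the author's threadpool:

    #include <windows.h>
    #include <thread>

    // Run work(node) on a thread restricted to the processors of NUMA node `node`.
    // For simplicity this sketch handles one node at a time; a real threadpool
    // would keep one pinned worker per node running concurrently.
    void run_on_node(USHORT node, void (*work)(USHORT)) {
        std::thread t([=] {
            GROUP_AFFINITY ga = {};
            if (GetNumaNodeProcessorMaskEx(node, &ga))       // processors of this node
                SetThreadGroupAffinity(GetCurrentThread(), &ga, nullptr);
            work(node);                                      // memory accesses now stay local
        });
        t.join();
    }

    static void work(USHORT node) {
        // process data that was allocated on `node` (e.g. with VirtualAllocExNuma)
    }

    int main() {
        ULONG highest = 0;
        GetNumaHighestNodeNumber(&highest);
        for (USHORT n = 0; n <= static_cast<USHORT>(highest); ++n)
            run_on_node(n, work);
        return 0;
    }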
Ramine <ramine@1.1>: Feb 17 02:33PM -0800

Hello,

Xeon E7-v3 CPUs To Carry Up To 18 Cores. Read more here:

http://www.tomshardware.com/news/intel-xeon-e7-v3-cpus,28550.html

Amine Moulay Ramdane.
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page. To unsubscribe from this group and stop receiving emails from it send an email to comp.programming.threads+unsubscribe@googlegroups.com.