- Parallel Sort library is here... - 1 Update
- Scalable Parallel Conjugate Gradient Linear System solver library - 1 Update
- Scalable MLock is here - 1 Update
- Winmenus was updated to version 1.4 - 1 Update
Ramine <ramine@1.1>: Feb 07 04:23PM -0800

Hello...

My Parallel Sort library version 3.3 is here:

https://sites.google.com/site/aminer68/parallel-sort-library

As I have told you, my new parallel sort algorithm has become more cache-aware, and because it has become more cache-aware it has shown a super-linear speedup and super-linear scalability when using more cores and more L2 caches. I have done some benchmarks on my quadcore, which uses two L2 caches, and it has given a super-linear speedup of 5X on my quadcore when sorting strings, even though I am using only 4 cores. That is easy to understand: when you use only one thread, it uses only one L2 cache, but when you use more threads on multiple cores with multiple L2 caches, it uses more L2 caches and parallelizes the accesses to them; this is why my new parallel algorithm has given a super-linear speedup when sorting strings.

I have to be frank: we have to be smart when inventing or designing parallel sort algorithms. My new parallel sort algorithm is smart because it is more cache-aware, but you have to be careful. Look at my other parallel quicksort algorithm here:

https://sites.google.com/site/aminer68/parallel-quicksort

You have to know that this parallel quicksort, which uses the classical way of sorting, is not so good: when its partition function is called recursively, this parallel quicksort algorithm dispatches the arrays to be partitioned each time to different threads, and this makes it less cache-aware than my new parallel sort algorithm. Since it is less cache-aware, you will not get more than 3X scalability when sorting strings, and you will get less than 3X when sorting integers or doubles, for example. So I advise you to use the new parallel sort algorithm of my Parallel Sort library version 3.3, which is more cache-aware: it gives you super-linear scalability when sorting strings and good scalability when sorting integers, doubles, etc.

Let's talk more computer science...

I have finally arrived at an important subject: "optimization". Computer science teaches you some of the most important things. First, how to amortize and optimize by improving the time complexity of your algorithm, which we express mathematically with O() notation; this is what I have done with my new parallel sort algorithm, which has a very good time complexity. Second, to optimize the part or parts of the source code that take the greatest percentage of the running time. Third, to do the optimizations that account for the greatest percentage of the possible gain and to skip the small ones; this is a good way to optimize your code. So if you look at my new parallel conjugate gradient solver library, I am not using SIMD SSE2 instructions; instead I am parallelizing the memory transfers, because my parallel algorithm is NUMA-aware, and I am also parallelizing the multiplications and additions on doubles, and since my parallel algorithm is cache-aware I am also parallelizing the L2 cache memory transfers. This optimization accounts for the greatest share of all the optimizations that can be done on my parallel algorithm, so on large-scale industrial problems I don't think my parallel conjugate gradient algorithm needs SIMD SSE2 or AVX optimization: SSE2 or AVX would give only a small improvement to my parallel algorithm.
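Coming back to the cache-aware point above, here is a minimal sketch in FreePascal of the idea of sorting per-thread chunks so that each thread's working set can stay in the L2 cache of its core. This is only an illustration under assumed chunk sizes and core counts, not the actual code of my library:

---

program CacheAwareSketch;
{$mode objfpc}

uses
  {$IFDEF UNIX} cthreads, {$ENDIF}
  Classes, SysUtils;

const
  NumThreads = 4;   // assumed core count, one chunk per core
  N = 40000;

type
  TIntArray = array of integer;

  TSortChunk = class(TThread)
  private
    FData: TIntArray;
    FLo, FHi: integer;
  public
    constructor Create(const AData: TIntArray; ALo, AHi: integer);
    procedure Execute; override;
  end;

constructor TSortChunk.Create(const AData: TIntArray; ALo, AHi: integer);
begin
  FData := AData;   // dynamic arrays are reference types: no copy is made
  FLo := ALo;
  FHi := AHi;
  inherited Create(False);
end;

procedure TSortChunk.Execute;
var
  i, j, tmp: integer;
begin
  { Insertion sort stands in for the real sort; the point is only that this
    thread touches nothing but its own chunk, so the chunk can stay resident
    in the L2 cache of the core that runs this thread. }
  for i := FLo + 1 to FHi do
  begin
    tmp := FData[i];
    j := i - 1;
    while (j >= FLo) and (FData[j] > tmp) do
    begin
      FData[j + 1] := FData[j];
      dec(j);
    end;
    FData[j + 1] := tmp;
  end;
end;

var
  data: TIntArray;
  threads: array[0..NumThreads - 1] of TSortChunk;
  i, chunk: integer;
begin
  SetLength(data, N);
  for i := 0 to N - 1 do
    data[i] := Random(N);
  chunk := N div NumThreads;
  for i := 0 to NumThreads - 1 do
    threads[i] := TSortChunk.Create(data, i * chunk, (i + 1) * chunk - 1);
  for i := 0 to NumThreads - 1 do
  begin
    threads[i].WaitFor;
    threads[i].Free;
  end;
  { The sorted chunks would then be merged, as the library does with its
    merge() procedure. }
end.

---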
I have finally arrived at an important subject... it's still the subject of parallel optimization, but this time I will talk more about the Cilk-style scheduler. As you have seen in my previous post, I talked about Dmitry Vyukov's Cilk-style scheduler, which supports system-topology awareness, hyper-threading awareness, affinity awareness, batch-spawn capability and manual task-depth control; here it is:

http://www.1024cores.net/home/parallel-computing/radix-sort

And as you have seen me talking about my parallel quicksort, I have said that it is not so good, because it dispatches the indexes of the array parts to be partitioned each time to different threads of my classical threadpool, and since my threadpool doesn't look like Dmitry Vyukov's scheduler, this renders the parallel quicksort algorithm less cache-aware, so it will not scale to more than 3X for strings and it will scale less for integers and doubles. If we used Dmitry Vyukov's scheduler instead, then as we recursively dispatch the array's indexes to the scheduler's threads that call the partition function, the Cilk-style scheduler would dispatch those indexes to threads that reuse the data of the L2 and L3 caches efficiently. But even though it reuses the cache data efficiently, this optimization of Dmitry Vyukov's scheduler will not bring more than a 2X improvement, and in fact you get less and less improvement as the data gets bigger and bigger, so the improvement that Dmitry Vyukov's scheduler brings is not that important. What you really have to do is render your parallel sort algorithm NUMA-aware; that way you render your parallel algorithm scalable. This is why I told you in my previous post that Dmitry's scheduler will not bring an important improvement. And if you take a look at my new parallel sort algorithm in version 3.3 of my library, neither its parallel sorting part nor its parallel merging part contains parallel recursive calls, so Dmitry Vyukov's scheduler would not bring an improvement to my parallel algorithm. For all those reasons, I have decided not to implement a scheduler that looks like Dmitry Vyukov's and to use my classical threadpool engine instead.

Finally, my Parallel Sort library, which supports parallel quicksort, parallel heapsort and parallel mergesort on multicore systems, is here. Parallel Sort Library uses my Thread Pool Engine and sorts many parts of your array in parallel, using quicksort, heapsort or mergesort, and finally merges them with the merge() procedure. In the previous version I had parallelized only the sort part; in this new version I have also parallelized the merge procedure part, and it gives better performance. My new parallel sort algorithm has become more cache-aware, and my benchmarks with it have given up to 5X scalability on a quadcore when sorting strings. Other than that, I have cleaned up the code more, and I think my Parallel Sort library has become a more professional and industrial parallel sort library. You can be confident, because I have tested it thoroughly and no bugs have shown up, so I hope you will be happy with my new Parallel Sort library.
I have also included a "test.pas" example, just compile first the "gendata.pas" inside the zip file and run it first, after that compile the "test.pas" example and run it and do your benchmarks. I have implemented a Parallel hybrid divide-and-conquer merge algorithm that performs 0.9-5.8 times better than sequential merge, on a quad-core processor, with larger arrays outperforming by over 5 times. Parallel processing combined with a hybrid algorithm approach provides a powerful high performance result. The best case complexity of ParallelSort using mergesort is: ((n/p)* log(n/p)) + O(n/p) p: is the number of cores the ((n/p)* log(n/p)) is the complexity of the sorting part. O(n/p) is the best case complexity of the merging part. so the best case complexity is: ((n/p)* log(n/p)) The worst case complexity of parallel sort library using mergesort is: ((n/p)* log(n/p)) + O(n) the ((n/p)* log(n/p)) is the complexity of the sorting part. O(n) is the worst case complexity of the merging part. so the worst case complexity of parallelsort using mergesort is approximatly: O(n) I have done some tests with my ParallelSort library and i have noticed that it can give up to 5X scalability with strings, and it gives 3x scalability with integers on a quad cores. So, why it scales to 5X with strings and only 3x with integers on a quad cores ? I explain: In the SequentialMerge() method and QSort() method inside Parallel Sort library, i am calling the Scompare() method and also in both of them i am copying to the memory system. So when i am using strings the SCompare() method is more expensive, so the parallel part p in the Amdahl equation 1/ S + P/N (S: the serial part, P: parallel part and N: the number of cores) is bigger than with integers so the Amdahl equation will scale better, but when we are using integers the SCompare() method is less expensive than the SCompare() with strings, so the parallel part p in the Amdahl equation is less bigger than with strings. so this is why parallel sorting with strings scales better than with integers. I have implemented mergsort and quicksort, but as you know the complexity of mergesort in the worst case is better than quicksort , and the mergesort that i have implemented is faster than quicksort, but mergesort takes more space.. The reccurence equation of the complexity of mergesort is: T(n) = 2 * T(n/2) + n cause it take O(n) for the merging part. It gives: T(n) = 2 * T(n/2) + n = 2 (2T(n/4) +n/2) + n =4T(n/4)+n+n =4T(n/4)+2*n =4 (2T(n/8) +n/4) + 2*n =8T(n/8)+n+2n =8T(n/8)+3*n =2k T(n/2^k) + k*n We want: n/2^k = 1 n = 2^k log n = k so the reccurence equation gives: = n*T(1) +n*log(n) = n+ (n * log(n)) So the mergesort complexity in the best case and in the worst case is: n * log(n) But the complexity of the quicksort in the worst case is: T(n)= n + T(n-1) it gives: T(n) = n + (n-1) + T(n-2) T(n) = n + (n-1) + (n-2)+ T(n-3) T(n) = 1 + 2+ 3+.+N .T(n) = O(n^2) And Quicksort in best case performance is: T(n) = T(n/2) + T(n/2) + O(n) = 2T(n/2) + (n) this gives: T(n) = (n lg n) Parallelizing the Sorts One way to parallelize the sorts is: Divide the data among the processors Sort the data on the individual processors. Merge the various data Note that the merge operation is a reduction operation ! You can dowload my Parallel Sort library that is much more faster than my Parallel Quickort library from: https://sites.google.com/site/aminer68/parallel-sort-library Thank you for your time. Amine Moulay Ramdane. |
Ramine <ramine@1.1>: Feb 07 03:49PM -0800

Hello..

My scalable parallel implementation of a Conjugate Gradient linear system solver library, which is NUMA-aware and cache-aware, is here...

I have spoken before about my Parallel Conjugate Gradient linear system solver library, and I have said that it uses a probabilistic mechanism that is very efficient and that renders my parallel algorithm into a scalable parallel algorithm on NUMA architecture; now I will prove to you what I am saying. If you look at my parallel algorithm, it divides each array into parts of 250 elements, and if you look carefully I am using two functions that consume the greater part of the CPU time, atsub() and asub(), and inside those functions I am using a probabilistic mechanism to render my algorithm scalable on NUMA architecture: what I am doing is scrambling the array parts using a probabilistic function, and I have noticed that this probabilistic mechanism is very efficient. To prove to you what I am saying, please look at the following simulation that I have done using a variable that contains the number of NUMA nodes; what I have noticed is that my simulation gives almost perfect scalability on NUMA architecture. For example, let us give the "numa_nodes" variable a value of 4 and our array a size of 250: the simulation below gives a number of contention points of about a quarter of the array. That is what you should expect: with two independent uniform node assignments over numa_nodes nodes, the probability that a given position falls on the same node in both permutations is 1/numa_nodes, so with an array of 250 and 4 NUMA nodes you expect about 250/4, that is around 62 contention points. So if I am using 16 cores, in the worst case it will scale to 4X throughput on NUMA architecture, because since I am using an array of 250 and a quarter of the array are contention points, from Amdahl's law this gives a scalability of almost 4X throughput on four NUMA nodes, and almost perfect scalability on more and more NUMA nodes; so my parallel algorithm is scalable on NUMA architecture.

Here is the simulation that I have done; please run it and you will notice yourself that my parallel algorithm is scalable on NUMA architecture. Here it is:

---

program test;

uses
  math;

var
  tab, tab1, tab2: array of integer;
  a, n1, k, i, n2, tmp, j, numa_nodes: integer;

begin
  a := 250;
  numa_nodes := 4;

  { assign each array element to a NUMA node in round-robin order }
  setlength(tab2, a);
  for i := 0 to a - 1 do
    tab2[i] := i mod numa_nodes;

  { first random permutation of the element indexes }
  setlength(tab, a);
  randomize;
  for k := 0 to a - 1 do
    tab[k] := k;
  n2 := a - 1;
  for k := 0 to a - 1 do
  begin
    n1 := random(n2);
    tmp := tab[k];
    tab[k] := tab[n1];
    tab[n1] := tmp;
  end;

  { second random permutation of the element indexes }
  setlength(tab1, a);
  randomize;
  for k := 0 to a - 1 do
    tab1[k] := k;
  n2 := a - 1;
  for k := 0 to a - 1 do
  begin
    n1 := random(n2);
    tmp := tab1[k];
    tab1[k] := tab1[n1];
    tab1[n1] := tmp;
  end;

  { count the positions where both permutations land on the same NUMA node }
  j := 0;
  for i := 0 to a - 1 do
    if tab2[tab[i]] = tab2[tab1[i]] then
    begin
      inc(j);
      writeln('A contention at: ', i);
    end;
  writeln('Number of contention points: ', j);

  setlength(tab, 0);
  setlength(tab1, 0);
  setlength(tab2, 0);
end.

---

You can download my Parallel Conjugate Gradient solver library from:

https://sites.google.com/site/aminer68/scalable-parallel-implementation-of-conjugate-gradient-linear-system-solver-library-that-is-numa-aware-and-cache-aware

Thank you for your time.

Amine Moulay Ramdane.
Ramine <ramine@1.1>: Feb 07 03:07PM -0800

Hello,

So in this post I will explain to you what exactly my invention, my scalable MLock algorithm, is. If you look at the Enter() method of my scalable MLock algorithm you will read the following (the code is in Object Pascal and is easy to understand); read my explanation below it. You can download the source code here:

https://sites.google.com/site/aminer68/scalable-mlock

==

procedure TMLock.Enter;
var
  prev, fcount1: PListEntry;
  k: integer;
  Buffer1: pointer;
begin
  { allocate a node and align it on the Alignment boundary (a cache line) }
  Buffer1 := AllocMem(SizeOf(TListEntry) + Alignment);
  fcount1 := PListEntry((int(Buffer1) + Alignment - 1) and not (Alignment - 1));
  fcount1^.Data := 0;
  fcount1^.Next := nil;
  fcount1^.mem := Buffer1;

  { wait-free push of our node onto the queue }
  long(prev) := LockedExchange(long(m_head), long(fcount1));
  prev.next := fcount1;

  { fast path: the first thread to arrive takes the lock with one CAS }
  if (flag = 1) then
  begin
    if (m_tail = prev) then
      if CAS(flag, 1, 0) then
      begin
        fcount1^.Data := -1;
        exit;
      end;
  end;

  { slow path: spin on our own node, which stays in our local cache }
  k := 1;
  repeat
    if fcount1^.data = 1 then
    begin
      freemem(fcount1^.mem);
      break;
    end
    else if fcount1^.data = 2 then
    begin
      fcount1^.Data := -1;
      break;
    end;
    if (flag = 1) then
    begin
      if (m_tail.next.next = fcount1) then
        if CAS(flag, 1, 0) then
        begin
          fcount1^.Data := -1;
          break;
        end;
    end;
    inc(k);
    { yield from time to time and issue a pause instruction while spinning }
    if (k mod 140) = 0 then
      {$IFDEF FPC}
      ThreadSwitch;
      {$ENDIF}
      {$IFDEF Delphi}
      sleep(0);
      {$ENDIF}
    asm
      pause
    end;
  until false;
end;

===

First I am allocating a Buffer1 struct and using fcount1, which is a pointer to a 64-byte-aligned record, and I am assigning zero to fcount1^.Data, because the threads that enter will have to wait for their turn, that is, for fcount1^.Data to become 1, before entering the scalable lock, except for the first thread that enters the lock: the first thread will find the "flag" variable equal to 1, and notice with me that CAS(flag,1,0) will succeed for the first thread that enters the Enter() method.

Look at what I am using inside the source code:

long(prev) := LockedExchange(long(m_head), long(fcount1));
prev.next := fcount1;

This is a wait-free queue, and the push() and the pop() inside my scalable MLock algorithm are wait-free. So notice with me that the threads in the Enter() method enter a repeat-until loop waiting for their turn, and notice that the CAS is touched rarely: I have benchmarked and collected some empirical statistics, and I have noticed that the CAS inside the repeat-until loop of the Enter() method is touched only 1% of the time, so the threads wait 99% of the time for fcount1^.data to become 1 or 2. Also notice that fcount1^.data inside the Enter() method does not generate cache-line transfers for each thread waiting in the Enter() method, because fcount1^.data is only touched by its corresponding thread in its local cache; this is why my MLock algorithm is scalable: it efficiently minimizes the cache-coherence traffic.

The Leave() method is wait-free, and that is easy to prove: just look at the source code and you will notice that even though there is a repeat-until loop inside the Leave() method, my algorithm will not loop more than 2 times inside it, which makes my algorithm wait-free, so it can be used in parallel and real-time safety-critical systems. And you will also easily notice that my scalable MLock is FIFO fair, so it's starvation-free. This is a node-based lock that is scalable, FIFO fair and starvation-free.
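To make the wait-free push idiom above concrete, here is a minimal, self-contained sketch in FreePascal. It is only an illustration of the technique, not the library's source code: InterLockedExchange (the pointer overload from the FPC RTL) stands in for the LockedExchange used in my code, and a nil test replaces the dummy head node:

---

program WaitFreePushSketch;
{$mode objfpc}

type
  PNode = ^TNode;
  TNode = record
    Data: longint;
    Next: PNode;
  end;

var
  Head: PNode = nil;   // most recently pushed node

{ One atomic exchange publishes the new node as the head; linking the old
  head to the new node afterwards completes the enqueue. There is no loop
  and no CAS retry, which is what makes the push wait-free. }
procedure Push(Node: PNode);
var
  Prev: PNode;
begin
  Node^.Next := nil;
  Prev := PNode(InterLockedExchange(pointer(Head), pointer(Node)));
  if Prev <> nil then
    Prev^.Next := Node;
end;

var
  a, b: PNode;
begin
  New(a); a^.Data := 1;
  New(b); b^.Data := 2;
  Push(a);
  Push(b);
  writeln(a^.Next^.Data);   // prints 2: the second push linked a to b
  Dispose(a);
  Dispose(b);
end.

---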
To summarize, my scalable MLock:

- Was discovered by Amine Moulay Ramdane
- Is scalable
- Has the same space requirement as the scalable MCS lock
- Doesn't require a local "queue node" to be passed in as a parameter, as the MCS and CLH locks do
- Spins only on local locations on a cache-coherent machine
- And it's fast.

Please read also this: a bigger problem with the MCS lock is its API. It requires a second structure to be passed in, in addition to the address of the lock. The algorithm uses this second structure to store the information which describes the queue of threads waiting for the lock. Unfortunately, most code written using spinlocks doesn't have this extra information, so the fact that the MCS algorithm isn't a drop-in replacement for a standard spinlock is a problem. An IBM working group found a way to improve the MCS algorithm to remove the need to pass the extra structure as a parameter; on-stack information is used instead. The result is the K42 lock algorithm. Unfortunately, the K42 algorithm has another problem: it appears that it may be patented by IBM, so it cannot be used either (without perhaps paying royalties to IBM).

So you have to know that my scalable MLock doesn't require a local "queue node" to be passed in as a parameter, as the MCS and CLH locks do; my scalable MLock doesn't require any parameter to be passed, you just call the Enter() and Leave() methods and that's all, as the usage sketch below shows.

You can download my scalable node-based lock, which is FIFO fair and wait-free, from:

https://sites.google.com/site/aminer68/scalable-mlock

Thank you,

Amine Moulay Ramdane.
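PS: here is that minimal usage sketch; the unit name and the parameterless constructor are assumptions for the illustration, take the real ones from the downloaded source:

---

program MLockUsage;
{$mode objfpc}

uses
  MLock;   // assumed unit name of the scalable MLock library

var
  Lock: TMLock;     // class name as in the Enter() listing above
  Counter: integer;

procedure Increment;
begin
  Lock.Enter;       // no queue node to pass, unlike the MCS and CLH locks
  try
    inc(Counter);   // the critical section
  finally
    Lock.Leave;
  end;
end;

begin
  Lock := TMLock.Create;   // assumed parameterless constructor
  Increment;
  writeln(Counter);
  Lock.Free;
end.

---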
Ramine <ramine@1.1>: Feb 07 02:16PM -0800

Hello,

I have just updated my Winmenus to version 1.4; now it supports all the Delphi XE versions, and it also supports FreePascal and Delphi 7 to 2007. Winmenus is a very fast and powerful text-mode drop-down menu widget using the Object Pascal CRT unit; you can read more about it and download it from here:

https://sites.google.com/site/aminer68/winmenus

Also, I am thinking of designing and implementing a kind of graphical user interface for my Parallel archiver using my Winmenus. I am currently enhancing an old component called TStringTree and making it a very fast TStringTree; TStringTree is a kind of graphical TreeView component, but for text mode only. I am currently benchmarking my TStringTree and it is giving good results on the speed criterion.

So all in all, Delphi and FreePascal are wonderful languages and compilers that have allowed me, and will allow me, to implement my new projects: an enhanced TStringTree and a kind of graphical user interface for my Parallel archiver, so stay tuned...

Thank you,

Amine Moulay Ramdane.