- More about Parallel Disk IO - 1 Update
- Efficiency of my parallel algorithm - 1 Update
Ramine <ramine@1.1>: Jan 13 03:14PM -0800

Hello,

I have read the following webpage about Parallel Disk IO by Dmitry Vyukov, here it is:

http://www.1024cores.net/home/scalable-architecture/parallel-disk-io

Read it carefully and you will notice that it is not optimal. A better way to do it is to let the threads read directly from the hard disk, using the following mechanism: an independent thread, separate from the reader threads, calculates the throughput of the system. Since the reader threads read equal-sized chunks of data from the file, each reader thread increments a shared counter every time it has processed a chunk. The independent thread reads this counter twice, at the beginning and at the end of an interval of length t, and computes the throughput as:

throughput = (counter_2 - counter_1) * size_of_the_chunk_of_data / t

where counter_2 - counter_1 is the number of chunks processed during the interval and t is the length of the interval.

The independent thread then starts more reader threads and recalculates the throughput: if the throughput goes up, it starts more threads; if the throughput goes down, it goes back to the previous count of threads and blocks the extra threads. It follows this same mechanism throughout the duration of the reading of the file, so the number of reader threads keeps tracking the count that gives the best throughput. This is my own mechanism that I am presenting to you, and I think it is more optimal than Dmitry Vyukov's way of doing it.

If the chunks of data are not equal in size, the processing threads can update both a counter of the total number of processed chunks and a counter of the total size of the data that has been processed, so the independent thread that adjusts the number of threads can do its throughput calculation with the mean size of a chunk.

That's the way to do it, I think. This algorithm is called the Hill Climbing (HC) algorithm; it is an auto-tuning approach based on an input-output feedback loop. A minimal sketch of this mechanism is given below, after this message.

Thank you for your time.

Amine Moulay Ramdane.
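Below is a minimal sketch, in Free Pascal, of the hill-climbing feedback loop described above. It models only the controller: measure_throughput is a made-up, hypothetical model that stands in for the real measurement (counter_2 - counter_1) * size_of_the_chunk_of_data / t, and no disk, file or reader threads are actually created. The point of the sketch is the feedback logic: add a reader thread while the throughput goes up, and revert to the previous count of threads as soon as it goes down.

---

program hillclimb_sketch;

var
  threads, best_threads: integer;
  throughput, best_throughput: double;

{ Hypothetical throughput model standing in for the real measurement
  (counter_2 - counter_1) * chunk_size / t; with these made-up numbers
  it peaks at an assumed optimum of 10 reader threads. }
function measure_throughput(nthreads: integer): double;
begin
  measure_throughput := 100.0 - sqr(nthreads - 10);
end;

begin
  threads := 1;
  best_threads := 1;
  best_throughput := measure_throughput(threads);
  writeln('threads = ', threads, '  throughput = ', best_throughput:0:1);

  { Hill climbing: add a reader thread while the measured throughput
    goes up; when it goes down, revert to the previous (best) count
    of threads and stop climbing. }
  repeat
    inc(threads);
    throughput := measure_throughput(threads);
    writeln('threads = ', threads, '  throughput = ', throughput:0:1);
    if throughput > best_throughput then
    begin
      best_throughput := throughput;
      best_threads := threads;
    end
    else
    begin
      threads := best_threads; { keep the previous count of threads }
      break;
    end;
  until threads >= 64;

  writeln('selected thread count: ', best_threads);
end.

---

In a real implementation the controller would run in its own thread, sample the shared chunk counter at the start and at the end of each interval to measure the throughput, and start or block actual reader threads instead of just changing an integer.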
Ramine <ramine@1.1>: Jan 13 02:32PM -0800

Hello,

I have spoken before about my Parallel Conjugate Gradient linear system solver library, and I have said that it uses a probabilistic mechanism that is very efficient and that makes my parallel algorithm scalable on NUMA architecture. Now I will show you what I mean.

My parallel algorithm divides the array into parts of 250 elements, and the two functions that consume the greater part of the CPU time are atsub() and asub(). Inside those functions I use a probabilistic mechanism to make the algorithm scalable on NUMA architecture: I scramble the array parts using a probabilistic function, and I have noticed that this probabilistic mechanism is very efficient.

To show you what I mean, please look at the following simulation that I have done using a variable that contains the number of NUMA nodes; it indicates almost perfect scalability on NUMA architecture. For example, give the "numa_nodes" variable a value of 4 and the array a size of 250 (since I divide the array into parts of 250 elements): the simulation then reports a number of contention points of about a quarter of the array. So with 16 cores, in the worst case it will scale to about 4X throughput on NUMA architecture, because with an array of 250 elements and a quarter of the array as contention points, Amdahl's law gives a scalability of about 4X throughput on four NUMA nodes, and it will give almost perfect scalability on more and more NUMA nodes. So my parallel algorithm is scalable on NUMA architecture.

Here is the simulation that I have done; please run it and you will notice yourself that my parallel algorithm is scalable on NUMA architecture. Here it is:

---

program test;

uses math;

var
  tab, tab1, tab2: array of integer;
  a, n1, k, i, n2, tmp, j, numa_nodes: integer;

begin
  a := 250;
  numa_nodes := 4;
  j := 0; { counts the contention points }

  { tab2 maps each array slot to a NUMA node, round robin }
  setlength(tab2, a);
  for i := 0 to a - 1 do
    tab2[i] := i mod numa_nodes;

  { first random permutation of the indices }
  setlength(tab, a);
  randomize;
  for k := 0 to a - 1 do
    tab[k] := k;
  n2 := a - 1;
  for k := 0 to a - 1 do
  begin
    n1 := random(n2);
    tmp := tab[k];
    tab[k] := tab[n1];
    tab[n1] := tmp;
  end;

  { second random permutation of the indices }
  setlength(tab1, a);
  randomize;
  for k := 0 to a - 1 do
    tab1[k] := k;
  n2 := a - 1;
  for k := 0 to a - 1 do
  begin
    n1 := random(n2);
    tmp := tab1[k];
    tab1[k] := tab1[n1];
    tab1[n1] := tmp;
  end;

  { a contention point is a position where both permutations
    fall on the same NUMA node }
  for i := 0 to a - 1 do
    if tab2[tab[i]] = tab2[tab1[i]] then
    begin
      inc(j);
      writeln('A contention at: ', i);
    end;

  writeln('Number of contention points: ', j);

  setlength(tab, 0);
  setlength(tab1, 0);
  setlength(tab2, 0);
end.

---

A small Amdahl's law cross-check of the 4X figure is given below, after this message.

You can download my Parallel Conjugate Gradient solver library from:

https://sites.google.com/site/aminer68/scalable-parallel-implementation-of-conjugate-gradient-linear-system-solver-library-that-is-numa-aware-and-cache-aware

Thank you for your time.

Amine Moulay Ramdane.
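For reference, here is a small sketch, in Free Pascal, that plugs the figures above into Amdahl's law, speedup(N) = 1 / (s + (1 - s) / N). The serial fraction s = 0.25 is an assumption taken from the roughly one quarter of the array that the simulation reports as contention points; under that assumption, 1/s = 4X is the upper bound on the throughput scaling.

---

program amdahl_check;

{ Amdahl's law: speedup(N) = 1 / (s + (1 - s) / N), where s is the
  serial (contended) fraction. s = 0.25 is assumed here from the
  quarter of the 250-element array reported as contention points. }

var
  s: double;

function speedup(frac: double; n: integer): double;
begin
  speedup := 1.0 / (frac + (1.0 - frac) / n);
end;

begin
  s := 0.25;
  writeln('speedup on  4 cores : ', speedup(s, 4):0:2);  { about 2.29 }
  writeln('speedup on 16 cores : ', speedup(s, 16):0:2); { about 3.37 }
  writeln('upper bound (1/s)   : ', (1.0 / s):0:2);      { 4.00 }
end.

---

With these assumed numbers, 16 cores give about 3.4X, which is close to the 4X bound quoted above.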