Wednesday, January 14, 2015

Digest for comp.programming.threads@googlegroups.com - 2 updates in 2 topics

Ramine <ramine@1.1>: Jan 13 03:14PM -0800

Hello,
 
 
I have read the following webpage about Parallel Disk IO by Dmitry
Vyukov, here it is:
 
http://www.1024cores.net/home/scalable-architecture/parallel-disk-io
 
 
Read it carefully and you will notice that it is not optimal, because a
better way to do it is to have the threads read directly from the hard
disk, but using the following mechanism of mine: a thread that is
independent of the reader threads calculates the throughput of the
system. Since you have to read equal chunks of data from the file, each
reader thread updates a shared "counter" every time it finishes
processing a chunk of data, and the independent thread that calculates
the throughput reads the "counter" two times over an interval of t and
computes the throughput as:
 
throughput = (counter_2 - counter_1) * size_of_the_chunk_of_data / t
 
where counter_2 - counter_1 is the number of chunks processed during
the interval t. The independent thread then starts more reader threads
and recalculates the throughput: if the throughput goes up, it starts
more threads; if the throughput goes down, it goes back to the previous
count of threads, starts that number of threads and blocks the other
threads. The independent thread follows this same mechanism throughout
the duration of the reading of the file. This is my own mechanism that
I am presenting to you, and I think it is more optimal than the Dmitry
Vyukov way of doing it.
 
If the chunks of data are not equal in size, the processing threads can
update a counter of the total number of processed chunks of data and
also a counter of the total size of the data that has been processed,
and this way the independent thread that adjusts the number of threads
can do its throughput calculation with the mean size of a chunk of
data. That is the way to do it, I think.
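For example (the numbers here are only illustrative, not measurements):
if over an interval of t = 0.5 s the chunk counter advances by 40 and
the size counter advances by 10 MB, the mean chunk size is
10 MB / 40 = 256 KB and the throughput is 40 * 256 KB / 0.5 s = 20 MB/s.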
 
This algorithm is called the Hill Climbing (HC) algorithm. This
technique is an auto-tuning approach based on an input-output feedback
loop.
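 
Here is a minimal Free Pascal sketch of that feedback loop, just to
illustrate the idea; it is not the real reader code. The chunk size, the
sampling interval and the chunks_per_interval() curve are assumptions
made up to stand in for the shared counter that real reader threads
would be incrementing.
 
---
program hillclimb_sketch;
 
const
chunk_size = 64 * 1024;   { assumed equal chunk size, in bytes }
interval_t = 0.5;         { assumed sampling interval t, in seconds }
max_threads = 32;
 
{ Simulated value of counter_2 - counter_1, i.e. the number of chunks
processed during one interval for a given number of reader threads.
The made-up curve saturates and then degrades, because past some point
the disk is the bottleneck and extra threads only add overhead. }
function chunks_per_interval(threads:integer):double;
begin
chunks_per_interval:=100.0*threads/(1.0+0.15*threads*threads/8.0);
end;
 
var
prev_threads,cur_threads:integer;
prev_throughput,cur_throughput:double;
 
begin
 
prev_threads:=1;
prev_throughput:=chunks_per_interval(prev_threads)*chunk_size/interval_t;
 
cur_threads:=2;
 
while cur_threads<=max_threads do
begin
 
{ throughput = (counter_2 - counter_1) * size_of_the_chunk_of_data / t }
cur_throughput:=chunks_per_interval(cur_threads)*chunk_size/interval_t;
 
writeln('threads: ',cur_threads,'  throughput: ',cur_throughput:0:0,' bytes/s');
 
if cur_throughput>prev_throughput then
begin
{ throughput went up: keep this count and start one more reader thread }
prev_threads:=cur_threads;
prev_throughput:=cur_throughput;
inc(cur_threads);
end
else
begin
{ throughput went down: go back to the previous count of threads and
block the extra threads (here the loop simply stops climbing) }
cur_threads:=prev_threads;
break;
end;
 
end;
 
writeln('settled on ',prev_threads,' reader threads');
 
end.
---
 
With the made-up curve above it settles on 7 reader threads; with a
real disk the controller would of course settle wherever the measured
throughput stops improving.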
 
 
 
Thank you for your time.
 
 
 
Amine Moulay Ramdane.
Ramine <ramine@1.1>: Jan 13 02:32PM -0800

Hello,
 
 
I have spoken before about my Parallel Conjugate Gradient linear system
solver library, and I have said that it uses a probabilistic mechanism
that is very efficient and that renders my parallel algorithm into a
scalable parallel algorithm on NUMA architecture, so now I will prove
to you what I am saying:
 
If you look at my parallel algorithm, it divides the array into parts
of 250 elements, and if you look carefully, the two functions that
consume the greater part of the CPU time are atsub() and asub(). Inside
those functions I am using a probabilistic mechanism to render my
algorithm scalable on NUMA architecture: what I am doing is scrambling
the array parts using a probabilistic function, and I have noticed that
this probabilistic mechanism is very efficient. To prove to you what I
am saying, please look at the following simulation that I have done
using a variable that contains the number of NUMA nodes; what I have
noticed is that my simulation gives almost perfect scalability on NUMA
architecture. For example, let us give the "NUMA_nodes" variable a
value of 4 and our array a size of 250. Since I am dividing all the
array by 250, the simulation gives a number of contention points of
about a quarter of the array, so if I am using 16 cores, in the worst
case it will scale to 4X throughput on NUMA architecture: with an array
of 250 and a quarter of the array being contention points, Amdahl's law
gives a scalability of 4X throughput on four NUMA nodes, and this gives
almost perfect scalability on more and more NUMA nodes, so my parallel
algorithm is scalable on NUMA architecture.
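 
To make the expectation behind that "quarter of the array" figure
explicit (this is just a reading of the simulation below, not a
statement about the library's internals): tab2[i] := i mod 4 assigns
roughly a quarter of the 250 indices to each NUMA node, and for two
independent scramblings the probability that position i lands on the
same node in both is about 1/NUMA_nodes = 1/4, so the expected number
of contention points is about 250/4, that is, roughly 62, which is
about what the program below reports.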
 
Here is the simulation that I have done; please run it and you will see
for yourself that my parallel algorithm is scalable on NUMA architecture.
 
Here it is:
 
---
program test;
 
uses math;
 
var
tab,tab1,tab2:array of integer;
a,n1,k,i,n2,tmp,j,numa_nodes:integer;
 
begin
 
a:=250;
 
numa_nodes:=4;
 
{ tab2 maps each index to a NUMA node: index i belongs to node i mod numa_nodes }
 
setlength(tab2,a);
 
for i:=0 to a-1 do
begin
tab2[i]:=i mod numa_nodes;
end;
 
{ tab is a first random scrambling of the indices 0..a-1 }
 
setlength(tab,a);
 
randomize;
 
for k:=0 to a-1 do
tab[k]:=k;
 
n2:=a-1;
 
for k:=0 to a-1 do
begin
n1:=random(n2);
tmp:=tab[k];
tab[k]:=tab[n1];
tab[n1]:=tmp;
end;
 
{ tab1 is a second, independent random scrambling of the indices 0..a-1 }
 
setlength(tab1,a);
 
for k:=0 to a-1 do
tab1[k]:=k;
 
n2:=a-1;
 
for k:=0 to a-1 do
begin
n1:=random(n2);
tmp:=tab1[k];
tab1[k]:=tab1[n1];
tab1[n1]:=tmp;
end;
 
{ count the contention points: positions where both scramblings fall on the same NUMA node }
 
j:=0;
 
for i:=0 to a-1 do
if tab2[tab[i]]=tab2[tab1[i]] then
begin
inc(j);
writeln('A contention at: ',i);
end;
 
writeln('Number of contention points: ',j);
 
setlength(tab,0);
setlength(tab1,0);
setlength(tab2,0);
 
end.
 
 
---
 
 
You can download my Parallel Conjugate gradient solver library from:
 
https://sites.google.com/site/aminer68/scalable-parallel-implementation-of-conjugate-gradient-linear-system-solver-library-that-is-numa-aware-and-cache-aware
 
 
 
 
Thank you for your time.
 
 
Amine Moulay Ramdane.
