Sunday, February 8, 2015

Digest for comp.programming.threads@googlegroups.com - 4 updates in 4 topics

Ramine <ramine@1.1>: Feb 07 04:23PM -0800

Hello...
 
 
My Parallel Sort library version 3.3 is here:
 
https://sites.google.com/site/aminer68/parallel-sort-library
 
 
As I have told you, my new parallel sort algorithm has become more
cache-aware, and this has induced a super-linear speedup and
super-linear scalability when using more cores and more L2 caches. I
have done some benchmarks on my quad-core machine, which has two L2
caches, and it has given a super-linear speedup of 5X when sorting
strings, even though I am using only 4 cores. That is easy to
understand: when you use only one thread, it uses only one L2 cache,
but when you use more threads on multiple cores with multiple L2
caches, the accesses to those L2 caches happen in parallel and the
aggregate cache capacity grows. This is why my new parallel algorithm
has given a super-linear speedup when sorting strings.
 
I have to be frank: we have to be smart when designing parallel sort
algorithms. My new parallel sort algorithm is smart because it is more
cache-aware, but you have to be careful; look at my other parallel
quicksort algorithm here:
 
https://sites.google.com/site/aminer68/parallel-quicksort
 
You have to know that this parallel quicksort, which sorts in the
classical way, is not so good: when its partition function is used
recursively, the algorithm dispatches the array ranges to be
partitioned to different threads each time, and this makes it less
cache-aware than my new parallel sort algorithm. Because it is less
cache-aware, you will not get more than 3X scalability when sorting
strings, and you will get less than 3X scalability when sorting
integers or doubles, for example. So I advise you to use the new
parallel sort algorithm of my Parallel Sort Library version 3.3, which
is more cache-aware and gives you super-linear scalability when sorting
strings and good scalability when sorting integers, doubles, etc.
 
Let's talk more computer science...
 
Finally I have arrived at an important subject, which is
"optimization". Computer science teaches you several important things:
how to amortize and optimize by improving your time complexity, which
we express mathematically with O() notation (this is how I have done it
with my new parallel sort algorithm, which has a "very" good time
complexity); how to optimize the part or parts of the source code that
take the greatest percentage of the running time; and how to
concentrate on the optimizations that bring the greatest percentage of
the improvement rather than on the small ones. This is a good way to
optimize your code. So if you look at my new parallel conjugate
gradient solver library, I am not using SIMD SSE2 instructions; instead
I am parallelizing the memory transfers, because my parallel algorithm
is NUMA-aware, and I am also parallelizing the multiplications and
additions on doubles, etc. My parallel algorithm is also cache-aware,
so I am parallelizing the L2 cache memory transfers as well. These
optimizations account for the greater percentage of all the
optimization that can be done on my parallel algorithm, so in
large-scale industrial problems I don't think my parallel conjugate
gradient algorithm needs SIMD SSE2 or AVX optimization, because SSE2 or
AVX would give only a small improvement: for example, if the memory
transfers take 90% of the running time, then vectorizing the remaining
arithmetic can improve the total running time by at most about 11%,
even if that arithmetic became free.
 
I have finally arrived at an important subject... it is still about
parallel optimization, but this time I will talk more about the
Cilk-style scheduler. As you have seen in my previous post, I have
talked about Dmitry Vyukov's Cilk-style scheduler, which supports
system-topology awareness, hyper-threading awareness, affinity
awareness, batch-spawn capability, and manual task-depth control; here
it is:
 
http://www.1024cores.net/home/parallel-computing/radix-sort
 
As you have seen me say about my parallel quicksort, it is not so good,
because it dispatches the index ranges of the arrays to be partitioned
to different threads of my classical threadpool each time, and since my
threadpool does not work like Dmitry Vyukov's scheduler, this renders
the parallel quicksort algorithm less cache-aware, so it will not scale
to more than 3X for strings, and it will scale less for integers and
doubles. But if we used Dmitry Vyukov's scheduler, then as we
recursively dispatch the array index ranges to the scheduler's threads
that call the partition function, the Cilk-style scheduler would
dispatch those ranges to threads that reuse the data in the L2 and L3
caches "efficiently". Even so, this optimization of Dmitry Vyukov's
scheduler would not bring more than a 2X improvement; in fact, you get
less and less improvement as the data gets bigger and bigger, so the
improvement the scheduler would bring is not so important. What you
have to do instead is make your parallel sort algorithm NUMA-aware;
that is the way to make it scalable, and this is why I told you in my
previous post that Dmitry's scheduler would not bring an important
improvement. Moreover, if you take a look at the new parallel sort
algorithm of my Parallel Sort Library version 3.3, neither its parallel
sorting part nor its parallel merging part contains parallel recursive
calls, so Dmitry Vyukov's scheduler would not bring an improvement to
my parallel algorithm. For all those reasons, I have decided not to
implement a scheduler like Dmitry Vyukov's and to use my classical
threadpool engine instead.
 
Finally, my Parallel Sort Library, which supports parallel quicksort,
parallel heapsort, and parallel mergesort on multicore systems, is here.
 
Parallel Sort Library uses my Thread Pool Engine: it sorts many parts
of your array in parallel using quicksort, heapsort, or mergesort, and
after that it merges them with the merge() procedure.
 
In the previous version I had parallelized only the sorting part; in
this new version I have also parallelized the merge procedure, and it
gives better performance.
 
My new parallel sort algorithm has become more cache-aware. I have done
some benchmarks with it, and it has given up to 5X scalability on a
quad-core when sorting strings. Other than that, I have cleaned up the
code, and I think my Parallel Sort Library has become a more
professional and industrial library. You can be confident: I have
tested it thoroughly and no bugs have shown up, so I hope you will be
happy with my new Parallel Sort Library.
 
 
I have also included a "test.pas" example: first compile and run
"gendata.pas" inside the zip file, then compile the "test.pas" example,
run it, and do your benchmarks.
 
I have implemented a parallel hybrid divide-and-conquer merge algorithm
that performs 0.9 to 5.8 times better than a sequential merge on a
quad-core processor, with larger arrays outperforming it by over 5
times. Parallel processing combined with a hybrid algorithm approach
gives a powerful high-performance result.
 
The best-case complexity of ParallelSort using mergesort is:
 
O((n/p) * log(n/p)) + O(n/p)
 
where p is the number of cores,
 
O((n/p) * log(n/p)) is the complexity of the sorting part, and
 
O(n/p) is the best-case complexity of the merging part,
 
so the best-case complexity is O((n/p) * log(n/p)).
 
The worst-case complexity of ParallelSort using mergesort is:
 
O((n/p) * log(n/p)) + O(n)
 
where O((n/p) * log(n/p)) is the complexity of the sorting part and
 
O(n) is the worst-case complexity of the merging part,
 
so the worst-case complexity of ParallelSort using mergesort is
approximately O((n/p) * log(n/p) + n), where the O(n) merging term
dominates as the number of cores grows.
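 
To make these formulas concrete, here is a small worked example (the
numbers are illustrative, not a benchmark): with n = 1,000,000 elements
and p = 4 cores, the sorting part costs about (n/p) * log2(n/p) =
250,000 * 18, roughly 4.5 million comparisons per core, while the
worst-case merging part costs on the order of n = 1,000,000 element
moves; so at this size the per-core sorting work still dominates, and
that is the part that parallelizes.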
 
I have done some tests with my Parallel Sort Library, and I have
noticed that it can give up to 5X scalability with strings, while it
gives 3X scalability with integers, on a quad-core.
 
So why does it scale to 5X with strings and only 3X with integers on a
quad-core?
 
 
I explain:
 
In the SequentialMerge() method and the QSort() method inside Parallel
Sort Library, I am calling the SCompare() method, and in both of them I
am also copying to system memory.
 
So when I am sorting strings, the SCompare() method is more expensive,
so the parallel part P in the Amdahl equation, speedup = 1 / (S + P/N)
(where S is the serial part, P the parallel part, and N the number of
cores), is bigger than with integers, and the Amdahl equation scales
better. But when we are sorting integers, the SCompare() method is less
expensive than with strings, so the parallel part P in the Amdahl
equation is smaller than with strings. This is why parallel sorting
with strings scales better than with integers.
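 
To illustrate this reasoning, here is a minimal sketch in FreePascal
that evaluates the Amdahl equation above; the serial fractions S used
for strings and integers are illustrative guesses of mine, not
measurements from the library:
 
---
program amdahl;
 
{ Amdahl's law: speedup = 1 / (S + P/N), with P = 1 - S. }
function Speedup(S, N: double): double;
begin
  Speedup := 1.0 / (S + (1.0 - S) / N);
end;
 
begin
  { Strings: the expensive SCompare() makes the parallel part bigger. }
  writeln('Strings  (assumed S = 0.05): ', Speedup(0.05, 4):0:2, 'X');
  { Integers: cheap comparisons make the serial part weigh more. }
  writeln('Integers (assumed S = 0.12): ', Speedup(0.12, 4):0:2, 'X');
end.
---
 
Note that the Amdahl equation by itself caps the speedup at N = 4X on a
quad-core; the extra scaling up to 5X with strings comes from the cache
effects described earlier in this post.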
 
I have implemented mergesort and quicksort, but as you know, the
worst-case complexity of mergesort is better than that of quicksort,
and the mergesort that I have implemented is faster than the quicksort,
but mergesort takes more space.
 
The recurrence equation for the complexity of mergesort is:
 
T(n) = 2 * T(n/2) + n
 
because the merging part takes O(n).
 
It gives:
 
T(n) = 2 * T(n/2) + n
 
= 2 * (2*T(n/4) + n/2) + n
 
= 4*T(n/4) + n + n
 
= 4*T(n/4) + 2*n
 
= 4 * (2*T(n/8) + n/4) + 2*n
 
= 8*T(n/8) + n + 2*n
 
= 8*T(n/8) + 3*n
 
= 2^k * T(n/2^k) + k*n
 
We want:
 
n/2^k = 1
 
so n = 2^k, that is, k = log(n),
 
and the recurrence equation gives:
 
= n*T(1) + n*log(n)
 
= n + n*log(n)
 
So the mergesort complexity, in the best case and in the worst case, is:
 
O(n * log(n))
 
But the complexity of quicksort in the worst case is:
 
T(n) = n + T(n-1)
 
which gives:
 
T(n) = n + (n-1) + T(n-2)
 
T(n) = n + (n-1) + (n-2) + T(n-3)
 
T(n) = n + (n-1) + (n-2) + ... + 2 + 1
 
T(n) = O(n^2)
 
And quicksort's best-case performance is:
 
T(n) = T(n/2) + T(n/2) + O(n)
= 2*T(n/2) + O(n)
 
which gives: T(n) = O(n * log(n))
 
Parallelizing the Sorts
 
One way to parallelize the sorts is:
 
- Divide the data among the processors.
- Sort the data on the individual processors.
- Merge the various data.
 
Note that the merge operation is a reduction operation! A minimal
sketch of this pattern follows below.
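 
Here is a minimal sketch of this divide/sort/merge pattern in
FreePascal, using plain TThread from the Classes unit. This is not the
library's actual code: the chunk count, the insertion sort inside each
thread, and the bottom-up pairwise merge are simplifications chosen for
illustration.
 
---
program parsketch;
 
{$mode objfpc}
 
uses
  {$IFDEF UNIX} cthreads, {$ENDIF}
  Classes, SysUtils;
 
const
  N = 1000; { array size (illustrative) }
  P = 4;    { number of chunks / threads (illustrative) }
 
type
  TIntArray = array of integer;
 
  { Sorts one chunk [FLo..FHi] of a shared array in its own thread. }
  TSortThread = class(TThread)
    FData: TIntArray;
    FLo, FHi: integer;
    constructor Create(const data: TIntArray; lo, hi: integer);
    procedure Execute; override;
  end;
 
constructor TSortThread.Create(const data: TIntArray; lo, hi: integer);
begin
  FData := data; { dynamic arrays share the same underlying buffer }
  FLo := lo; FHi := hi;
  inherited Create(False);
end;
 
procedure TSortThread.Execute;
var
  i, j, tmp: integer;
begin
  { Insertion sort on the chunk; the real library uses quicksort,
    heapsort or mergesort here. }
  for i := FLo + 1 to FHi do
  begin
    tmp := FData[i]; j := i - 1;
    while (j >= FLo) and (FData[j] > tmp) do
    begin FData[j + 1] := FData[j]; dec(j); end;
    FData[j + 1] := tmp;
  end;
end;
 
{ Merges the sorted runs a[lo..mid] and a[mid+1..hi] into tmp[lo..hi]. }
procedure Merge(const a: TIntArray; lo, mid, hi: integer; var tmp: TIntArray);
var
  i, j, k: integer;
begin
  i := lo; j := mid + 1;
  for k := lo to hi do
    if (i <= mid) and ((j > hi) or (a[i] <= a[j])) then
    begin tmp[k] := a[i]; inc(i); end
    else
    begin tmp[k] := a[j]; inc(j); end;
end;
 
var
  a, tmp: TIntArray;
  threads: array[0..P - 1] of TSortThread;
  i, lo, mid, hi, width: integer;
begin
  SetLength(a, N); SetLength(tmp, N);
  Randomize;
  for i := 0 to N - 1 do a[i] := Random(100000);
 
  { 1. Divide the data among the processors and sort in parallel. }
  for i := 0 to P - 1 do
    threads[i] := TSortThread.Create(a, i * (N div P), (i + 1) * (N div P) - 1);
  for i := 0 to P - 1 do
  begin threads[i].WaitFor; threads[i].Free; end;
 
  { 2. Merge the sorted chunks pairwise: the merge is a reduction. }
  width := N div P;
  while width < N do
  begin
    lo := 0;
    while lo + width < N do
    begin
      mid := lo + width - 1;
      hi := lo + 2 * width - 1;
      if hi > N - 1 then hi := N - 1;
      Merge(a, lo, mid, hi, tmp);
      for i := lo to hi do a[i] := tmp[i];
      lo := lo + 2 * width;
    end;
    width := width * 2;
  end;
 
  for i := 1 to N - 1 do
    if a[i - 1] > a[i] then begin writeln('not sorted!'); halt(1); end;
  writeln('sorted OK');
end.
---
 
Note how the pairwise merging halves the number of sorted runs at each
level; that is what makes the merge a reduction operation.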
 
 
 
You can download my Parallel Sort Library, which is much faster than
my Parallel Quicksort library, from:
 
https://sites.google.com/site/aminer68/parallel-sort-library
 
 
 
 
Thank you for your time.
 
 
Amine Moulay Ramdane.
Ramine <ramine@1.1>: Feb 07 03:49PM -0800

Hello..
 
 
My scalable parallel implementation of a Conjugate Gradient Linear
System Solver library, which is NUMA-aware and cache-aware, is here...
I have spoken before about my Parallel Conjugate Gradient linear system
solver library, and I have said that it uses a probabilistic mechanism
that is very efficient and that renders my parallel algorithm into a
scalable parallel algorithm on NUMA architecture; now I will prove to
you what I am saying:
 
If you look at my parallel algorithm, it divides each array into parts
of 250 elements, and if you look carefully, I am using two functions
that consume the greater part of the CPU time: atsub() and asub().
Inside those functions I am using a probabilistic mechanism to render
my algorithm scalable on NUMA architecture: I am scrambling the array
parts using a probabilistic function. I have noticed that this
probabilistic mechanism is very efficient. To prove to you what I am
saying, please look at the following simulation, which uses a variable
holding the number of NUMA nodes; my simulation gives almost perfect
scalability on NUMA architecture. For example, let us give the
"NUMA_nodes" variable a value of 4 and our array a size of 250. The
simulation below gives a number of contention points of about a quarter
of the array, which is what we expect: with 4 NUMA nodes, two
independent random placements land on the same node with probability
1/4. So if I am using 16 cores, in the worst case it will scale to 4X
throughput on NUMA architecture, because with an array of 250 and a
quarter of it being contention points, Amdahl's law gives a scalability
of almost 4X throughput on four NUMA nodes, and this gives almost
perfect scalability on more and more NUMA nodes; so my parallel
algorithm is scalable on NUMA architecture.
 
Here is the simulation that I have done; please run it, and you will
notice for yourself that my parallel algorithm is scalable on NUMA
architecture.
 
Here it is:
 
---
program test;
 
var
  tab, tab1, tab2: array of integer;
  a, n1, k, i, n2, tmp, j, numa_nodes: integer;
 
begin
  a := 250;          { size of the array part }
  numa_nodes := 4;   { number of NUMA nodes to simulate }
  j := 0;            { contention counter }
 
  { tab2 maps each array slot to a NUMA node, round-robin. }
  setlength(tab2, a);
  for i := 0 to a - 1 do
    tab2[i] := i mod numa_nodes;
 
  { tab is a random permutation of the slots: the first scrambling. }
  setlength(tab, a);
  randomize;
  for k := 0 to a - 1 do
    tab[k] := k;
  n2 := a - 1;
  for k := 0 to a - 1 do
  begin
    n1 := random(n2);
    tmp := tab[k];
    tab[k] := tab[n1];
    tab[n1] := tmp;
  end;
 
  { tab1 is a second, independent random permutation. }
  setlength(tab1, a);
  for k := 0 to a - 1 do
    tab1[k] := k;
  for k := 0 to a - 1 do
  begin
    n1 := random(n2);
    tmp := tab1[k];
    tab1[k] := tab1[n1];
    tab1[n1] := tmp;
  end;
 
  { A contention point is a slot where both permutations land on the
    same NUMA node; with 4 nodes, expect about a quarter of the array. }
  for i := 0 to a - 1 do
    if tab2[tab[i]] = tab2[tab1[i]] then
    begin
      inc(j);
      writeln('A contention at: ', i);
    end;
 
  writeln('Number of contention points: ', j);
 
  setlength(tab, 0);
  setlength(tab1, 0);
  setlength(tab2, 0);
end.
---
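 
When you run it, the number of contention points should come out near
250/4, that is, about 62, matching the quarter-of-the-array estimate
above.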
You can download my Parallel Conjugate gradient solver library from:
 
https://sites.google.com/site/aminer68/scalable-parallel-implementation-of-conjugate-gradient-linear-system-solver-library-that-is-numa-aware-and-cache-aware
 
Thank you for your time.
 
 
Amine Moulay Ramdane.
Ramine <ramine@1.1>: Feb 07 03:07PM -0800

Hello,
 
 
So in this post I will explain to you what exactly my invention, my
scalable MLock algorithm, is.
 
If you look at the Enter() method of my scalable MLock algorithm, you
will read the following (the code, in Object Pascal, is easy to
understand), and then read my explanation below:
 
You can download the source code here:
 
https://sites.google.com/site/aminer68/scalable-mlock
 
 
 
===
procedure TMLock.Enter;
var
  prev, fcount1: PListEntry;
  k: integer;
  Buffer1: pointer;
begin
  { Allocate a node and align it on an Alignment boundary (a cache
    line), so that each waiting thread spins on its own cache line. }
  Buffer1 := AllocMem(SizeOf(TListEntry) + Alignment);
  fcount1 := PListEntry((int(Buffer1) + Alignment - 1)
    and not (Alignment - 1));
 
  fcount1^.Data := 0;   { 0 = still waiting for our turn }
  fcount1^.Next := nil;
  fcount1^.mem := Buffer1;
 
  { Wait-free push of our node onto the queue. }
  long(prev) := LockedExchange(long(m_head), long(fcount1));
  prev.next := fcount1;
 
  { Fast path: the first thread finds flag = 1, wins the CAS, and
    enters the lock immediately. }
  if (flag = 1) then
  begin
    if (m_tail = prev) then
      if CAS(flag, 1, 0) then
      begin
        fcount1^.Data := -1;
        exit;
      end;
  end;
 
  { Slow path: spin on our own node's Data field (local spinning). }
  k := 1;
  repeat
    if fcount1^.data = 1 then
    begin
      freemem(fcount1^.mem);
      break;
    end
    else if fcount1^.data = 2 then
    begin
      fcount1^.Data := -1;
      break;
    end;
    if (flag = 1) then
    begin
      if (m_tail.next.next = fcount1) then
        if CAS(flag, 1, 0) then
        begin
          fcount1^.Data := -1;
          break;
        end;
    end;
    inc(k);
    { Yield the processor every 140 iterations, then pause. }
    if (k mod 140) = 0 then
    {$IFDEF FPC}
      ThreadSwitch;
    {$ENDIF}
    {$IFDEF Delphi}
      sleep(0);
    {$ENDIF}
    asm pause end;
  until false;
end;
===
 
 
 
First I am allocating a Buffer1 structure and using fcount1, which is a
pointer to a 64-byte-aligned record, and I am assigning zero to
fcount1^.Data, because the threads that enter have to wait for their
turn, that is, for fcount1^.Data to become 1, before entering the
scalable lock, except for the first thread that enters the lock: the
first thread will find the "flag" variable equal to 1, and notice with
me that CAS(flag,1,0) will succeed for the first thread that enters the
Enter() method...
 
Look at what I am using inside the source code:
 
long(prev) := LockedExchange(long(m_head), long(fcount1));
 
prev.next := fcount1;
 
This is a wait-free queue, and the push() and pop() inside my scalable
MLock algorithm are wait-free...
 
So notice with me that the threads in the Enter() method enter a
repeat-until loop, waiting for their turn to enter, and notice that the
CAS is touched rarely: I have benchmarked and collected some empirical
statistics, and I have noticed that the CAS inside the repeat-until
loop of the Enter() method is touched only 1% of the time, so the
threads wait 99% of the time for fcount1^.data to become 1 or 2. Also
notice that fcount1^.data inside the Enter() method does not generate
cache-line transfers for each waiting thread, because fcount1^.data is
touched only by its corresponding thread in its local cache. This is
why my MLock algorithm is scalable: it efficiently minimizes the
cache-coherence traffic.
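 
To see concretely why this local spinning minimizes cache-coherence
traffic, here is a schematic fragment (mine, not the library's code):
each waiter spins reading a field of its own queue node, so the reads
are satisfied from its local cache, and the cache line is transferred
only once, when the predecessor writes the hand-off value:
 
===
{ Schematic only: 'node' is this thread's own aligned queue entry. }
while node^.Data = 0 do   { reads hit the thread's local cache line }
  asm pause end;          { the single hand-off write by the
                            predecessor is the only line transfer }
===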
 
The Leave() method is wait-free, and that is easy to prove: just look
at the source code, and you will notice that even though there is a
repeat-until loop inside the Leave() method, my algorithm will not loop
more than 2 times inside it. This makes my algorithm wait-free, so it
can be used in parallel and real-time safety-critical systems.
 
And you will also easily notice that my scalable MLock is FIFO fair, so
it is starvation-free.
 
This is a node-based lock that is scalable, FIFO fair, and
starvation-free.
 
- Discovered by Amine Moulay Ramdane
 
- This lock is scalable
 
- It has the same space requirement as the scalable MCS lock
 
- It doesn't require a local "queue node" to be passed in as a
parameter, as the MCS and CLH locks do
 
- It spins only on local locations on a cache-coherent machine
 
- And it's fast.
 
 
 
Please read also this:
 
A bigger problem with the MCS lock is its API. It requires a second
structure to be passed in addition to the address of the lock. The
algorithm uses this second structure to store the information which
describes the queue of threads waiting for the lock. Unfortunately, most
code written using spinlocks doesn't have this extra information, so the
fact that the MCS algorithm isn't a drop-in replacement to a standard
spin lock is a problem.
 
An IBM working group found a way to improve the MCS algorithm to remove
the need to pass the extra structure as a parameter. Instead, on-stack
information was used instead. The result is the K42 lock algorithm:
 
Unfortunately, the K42 algorithm has another problem. It appears that it
may be patented by IBM. Thus it cannot be used either. (Without perhaps
paying royalties to IBM.)
 
So you have to know that my scalable MLock doesn't require a local
"queue node" to be passed in as a parameter, as the MCS and CLH locks
do; my scalable MLock doesn't require any parameter to be passed. Just
call the Enter() and Leave() methods, and that's all.
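 
Here is a minimal usage sketch; the post shows only Enter() and
Leave(), so the unit name and the construction code below are
assumptions of mine about the library's API:
 
===
uses mlock; { hypothetical unit name }
 
var
  lock: TMLock;
  counter: integer;
 
procedure Worker;
begin
  lock.Enter;      { no queue-node parameter, unlike MCS and CLH }
  inc(counter);    { critical section }
  lock.Leave;
end;
 
begin
  lock := TMLock.Create; { assumed constructor }
  { ... spawn threads that call Worker ... }
  lock.Free;
end.
===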
 
 
 
You can download my scalable node-based lock, which is FIFO fair and
wait-free, from:
 
 
https://sites.google.com/site/aminer68/scalable-mlock
 
 
Thank you,
Amine Moulay Ramdane.
Ramine <ramine@1.1>: Feb 07 02:16PM -0800

Hello,
 
 
I have just updated my Winmenus to version 1.4; now it supports all the
Delphi XE versions, and it also supports FreePascal and Delphi 7 to
2007.
 
Winmenus is a very fast and powerful text-mode drop-down menu widget
using the Object Pascal CRT unit; you can read more about it and
download it from here:
 
https://sites.google.com/site/aminer68/winmenus
 
 
Also, I am thinking of designing and implementing a kind of graphical
user interface for my Parallel Archiver using my Winmenus. I am
currently enhancing an old component called TStringTree and turning it
into a very fast TStringTree; TStringTree is a kind of graphical
TreeView component, but for text mode only. I am currently benchmarking
my TStringTree, and it is giving good results on the speed criterion.
 
So, all in all, Delphi and FreePascal are wonderful languages and
compilers that have allowed me, and will allow me, to implement my new
projects: an enhanced TStringTree and a kind of graphical user
interface for my Parallel Archiver. So stay tuned...
 
 
 
Thank you,
Amine Moulay Ramdane.
