Saturday, February 14, 2015

Digest for comp.programming.threads@googlegroups.com - 15 updates in 7 topics

Ramine <ramine@1.1>: Feb 13 07:43PM -0800

Hello,
 
We have to be smart, so please read what follows...
 
I have said before that my parallel heapsort is more cache-efficient,
and that this is why it scales almost perfectly on an 8-core machine,
but I think I made a mistake: I have just looked carefully at my
parallel heapsort and noticed that it contains two string compares,
while my parallel quicksort contains only one string compare in its
partition function. So, by Amdahl's equation, since the string compare
is the more expensive part, the parallel heapsort will scale almost
perfectly on 8-core machines, but I don't think it will scale on
machines with more than 8 cores... it's Amdahl's equation that says so,
and I think all my parallel algorithms have the same cache efficiency...
so by nature parallel sort algorithms such as parallel mergesort,
parallel quicksort and parallel heapsort have a scalability limit at 8X
or so, and they don't scale beyond that as you add more and more cores
beyond 8, so the solution is to implement a NUMA-aware parallel sort
algorithm to make it scale on more and more NUMA nodes...
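
To give an idea of what I mean by Amdahl's equation, here is a minimal
sketch in FreePascal (the parallel fraction P = 0.92 below is only an
illustrative value, it is not measured from my library):

===

program AmdahlSketch;
{ Amdahl's equation: with a parallel fraction P of the work and N cores,
  the predicted speedup is S = 1 / ((1 - P) + P / N). }
const
  Cores: array[1..4] of Integer = (4, 8, 16, 32);
var
  P: Double;
  I: Integer;
begin
  P := 0.92; { illustrative parallel fraction, not a measured value }
  for I := Low(Cores) to High(Cores) do
    WriteLn(Cores[I], ' cores -> predicted speedup: ',
            1.0 / ((1.0 - P) + P / Cores[I]):0:2);
end.

===

With P = 0.92 the predicted speedup is about 3.2 on 4 cores, 5.1 on 8
cores, 7.3 on 16 cores and 9.2 on 32 cores, so the curve flattens as
you add cores, which is exactly the kind of scalability limit I am
talking about.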
 
 
 
Thank you,
Amine Moulay Ramdane.
Ramine <ramine@1.1>: Feb 13 07:12PM -0800

Hello,
 
 
Here are the results of my benchmark on an 8-core machine, as posted
by a guy called Melzzzzz on Usenet; read them carefully, and read my
conclusion below:
 
===
 
Please press a key to exit...:
[bmaxa@maxa-pc aminer]$ taskset -c 0,1,2,3,4,5,6,7 wine test.exe
 
 
Number of cores is: 8
Scalability with parallel mergesort on 8 cores is: 3.23
Time of parallel mergesort on 8 cores is: 298091 microseconds
 
 
Number of cores is: 8
Scalability with parallel quicksort on 8 cores is: 3.52
Time of parallel quicksort on 8 cores is: 348340 microseconds
 
 
Number of cores is: 8
Scalability with parallel heapsort on 8 cores is: 7.36
Time of parallel heapsort on 8 cores is: 807979 microseconds
 
===
 
 
I think I have finally understood my parallel algorithms:
 
Look at my parallel heapsort results:
 
Number of cores is: 8
Scalability with parallel heapsort on 8 cores is: 7.36
Time of parallel heapsort on 8 cores is: 807979 microseconds
 
 
I think that my parallel heapsort algorithm is by nature more
cache-efficient, and this is why it scales very well on more and more
cores, so if you have more than 8 cores, I think that the parallel
heapsort of my parallel sort library is the better choice, and it
should replace the other parallel algorithms of my parallel sort
library, such as my parallel mergesort and my parallel quicksort.


The benchmark results also tell us something important: the parallel
mergesort and parallel quicksort of my parallel sort library are by
nature much less cache-efficient than the parallel heapsort of my
parallel sort library.
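
If we assume that the scalability number printed by my benchmark is
simply the serial time divided by the parallel time, we can also plug
it back into Amdahl's equation and estimate the parallel fraction of
each algorithm; this is only a back-of-the-envelope sketch, not part
of my library:

===

program ImpliedParallelFraction;
{ Inverting Amdahl's equation: given a measured speedup S on N cores,
  S = 1 / ((1 - P) + P / N) gives P = (1/S - 1) / (1/N - 1). }

procedure Report(const Name: string; S: Double; N: Integer);
var
  P: Double;
begin
  P := (1.0 / S - 1.0) / (1.0 / N - 1.0);
  WriteLn(Name, ': speedup ', S:0:2, ' on ', N,
          ' cores -> parallel fraction of about ', P * 100:0:1, '%');
end;

begin
  Report('mergesort', 3.23, 8);
  Report('quicksort', 3.52, 8);
  Report('heapsort ', 7.36, 8);
end.

===

This estimates a parallel fraction of roughly 79% for mergesort, 82%
for quicksort and almost 99% for heapsort on this machine.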
 
 
Thank you Melzzzzz, you are such a great guy for having helped me run
the benchmark.
 
 
 
 
Thank you,
Amine Moulay Ramdane.
Ramine <ramine@1.1>: Feb 13 05:43PM -0800

Hello,
 
 
I have implemented a benchmark for my Parallel Sort library, and I
want to test it on other computers, so please help me do it. This
benchmark runs on Windows; please download it and run it, and report
to me your kind of processor (and, if possible, how many L2 caches you
have) together with the output of the benchmark. This benchmark tests
the parallel mergesort of my parallel sort library; please download it
from here and run it:
 
https://sites.google.com/site/aminer68/benchmark-for-parallel-sort-library
 
 
To download it, please click on the small "arrow" to the right of the
"test.zip" text on your screen...
 
 
Thank you,
Amine Moulay Ramdane.
Ramine <ramine@1.1>: Feb 13 06:14PM -0800

On 2/13/2015 5:43 PM, Ramine wrote:
> "test.zip" text on your screen...
 
> Thank you,
> Amine Moulay Ramdane.
 
If you have more than 4 cores on your computer, it will be really
interesting to see the result of my benchmark on it.

If you have an Intel i7 CPU, it will also be interesting to see the
result on it...


So please help me see the results of my parallel sort library by
running the benchmark.
 
 
Thank you,
Amine Moulay Ramdane.
Melzzzzz <mel@zzzzz.com>: Feb 14 12:34AM +0100

On Fri, 13 Feb 2015 17:43:16 -0800
> "test.zip" text on your screen...
 
> Thank you,
> Amine Moulay Ramdane.
 
[bmaxa@maxa-pc aminer]$ taskset -c 0,1,2,3 wine test.exe
 
 
Number of cores is: 4
Scalability on 4 cores is: 2.42
 
Please press a key to exit...:
[bmaxa@maxa-pc aminer]$ taskset -c 0,1,2,3,4,5,6,7 wine test.exe
 
 
Number of cores is: 8
Scalability on 8 cores is: 3.30
 
Please press a key to exit...:
[bmaxa@maxa-pc aminer]$
 
[bmaxa@maxa-pc aminer]$ inxi -C
CPU: Quad core Intel Core i7-4790 (-HT-MCP-) cache: 8192 KB
clock speeds: max: 4000 MHz 1: 3958 MHz 2: 3977 MHz 3: 3995 MHz 4: 3994 MHz 5: 4000 MHz 6: 3968 MHz
7: 3998 MHz 8: 3924 MHz
[bmaxa@maxa-pc aminer]$
Ramine <ramine@1.1>: Feb 13 06:53PM -0800

Hello,
 
 
Please, Melzzzzz, I have just implemented and uploaded a full
benchmark; this one tests all the sorting algorithms. Can you download
the new benchmark again and test it on your computers with 4 cores and
with 8 cores?
 
Here it is:
 
https://sites.google.com/site/aminer68/benchmark-for-parallel-sort-library
 
 
Thank you,
Amine Moulay Ramdane.
Melzzzzz <mel@zzzzz.com>: Feb 14 12:57AM +0100

On Fri, 13 Feb 2015 18:53:05 -0800
> this one will test all the sorting algorithms, can you download the
> new benchmark again and test it with your computers that have 4 cores
> and the other that have 8 cores.
 
It's i7 4790. 4 cores with HT
 
 
> https://sites.google.com/site/aminer68/benchmark-for-parallel-sort-library
 
> Thank you,
> Amine Moulay Ramdane.
 
[bmaxa@maxa-pc aminer]$ taskset -c 0,1,2,3 wine test.exe
 
 
Number of cores is: 4
Scalability with parallel mergesort on 4 cores is: 2.38
Time of parallel mergesort on 4 cores is: 304864 microseconds
 
 
Number of cores is: 4
Scalability with parallel quicksort on 4 cores is: 2.64
Time of parallel quicksort on 4 cores is: 334282 microseconds
 
 
Number of cores is: 4
Scalability with parallel heapsort on 4 cores is: 4.30
Time of parallel heapsort on 4 cores is: 753477 microseconds
 
Please press a key to exit...:
[bmaxa@maxa-pc aminer]$ taskset -c 0,1,2,3,4,5,6,7 wine test.exe
 
 
Number of cores is: 8
Scalability with parallel mergesort on 8 cores is: 3.23
Time of parallel mergesort on 8 cores is: 298091 microseconds
 
 
Number of cores is: 8
Scalability with parallel quicksort on 8 cores is: 3.52
Time of parallel quicksort on 8 cores is: 348340 microseconds
 
 
Number of cores is: 8
Scalability with parallel heapsort on 8 cores is: 7.36
Time of parallel heapsort on 8 cores is: 807979 microseconds
 
Please press a key to exit...:
Ramine <ramine@1.1>: Feb 13 07:08PM -0800

Hello,
 
 
I think I have finally understood my parallel algorithms:
 
Look at my parallel heapsort results:
 
Number of cores is: 8
Scalability with parallel heapsort on 8 cores is: 7.36
Time of parallel heapsort on 8 cores is: 807979 microseconds
 
 
I think that my parallel heapsort algorithm is by nature more
cache-efficient, and this is why it scales very well on more and more
cores, so if you have more than 8 cores, I think that the parallel
heapsort of my parallel sort library is the better choice, and it
should replace the other parallel algorithms of my parallel sort
library, such as my parallel mergesort and my parallel quicksort.


The benchmark results also tell us something important: the parallel
mergesort and parallel quicksort of my parallel sort library are by
nature much less cache-efficient than the parallel heapsort of my
parallel sort library.
 
 
Thank you Melzzzzz, you are such a great guy for having helped me run
the benchmark.
 
 
 
Amine Moulay Ramdane.
Ramine <ramine@1.1>: Feb 13 03:10PM -0800

Hello,
 
 
As you have noticed, I am implementing my libraries using the Delphi
and FreePascal compilers. I must say that the Object Pascal language I
am using is a fantastic language, because it has allowed me, for
example, to code 2000 lines of "stable" StringTree code in one day;
it's really amazing how efficient the Delphi and FreePascal language
is. What is even more amazing is that Object Pascal is so easy that I
have not even used a debugger in any of my projects that I think are
stable now; I have used only a few writeln() calls and that's all. So
I think FreePascal and Delphi are powerful compilers. I have also
tested the new 64-bit FPC here:
ftp://ftp.freepascal.org/pub/fpc/snapshot/trunk/

and I have run the scimark2 benchmarks with it, and it gives really
amazing performance with the 64-bit compiler that was optimized more:
 
For Visual C++ 32 bit it gives:
 
Using 2.00 seconds min time per kenel.
 
Composite Score: 701.59
FFT Mflops: 519.89 (N=1024)
SOR Mflops: 622.40 (100 x 100)
MonteCarlo: Mflops: 101.07
Sparse matmult Mflops: 893.77 (N=1000, nz=5000)
LU Mflops: 1370.82 (M=100, N=100)
 
 
For csharp it gives:
 
Composite Score: 531.97 MFlops
FFT : 501.18 - (1024)
SOR : 711.62 - (100x100)
Monte Carlo : 31.85
Sparse MatMult : 553.74 - (N=1000, nz=5000)
LU : 861.49 - (100x100)
 
 
And for FreePascal 64 bit (from the optimized trunk) it gives:
 
Composite Score MFlops: 581.20
FFT Mflops: 404.46 (N=1024)
SOR Mflops: 717.01 (100 x 100)
MonteCarlo: Mflops: 113.12
Sparse matmult Mflops: 810.49 (N=1000, nz=5000)
LU Mflops: 860.90 (M=100, N=100)
 
 
So all in all, that's good news for FreePascal and Delphi...
 
 
 
Thank you,
Amine Moulay Ramdane.
Bonita Montero <Bonita.Montero@gmail.com>: Feb 13 10:02PM +0100

Ramine wrote:
 
> For Visual C++ 32 bit it gives:
^^

> And for FreePascal 64 bit (from the trunk that was optimized)
^^
 
Comparing 32-bit and 64-bit code is comparing apples and pears.
64-bit code has more registers and is therefore faster.
I'll bet that 64-bit C++ code optimized the same way as the Pascal
code will outperform the Pascal code, because of the better
C++ compilers.
And that's ignoring that C++ is the smarter language.
Ramine <ramine@1.1>: Feb 13 04:23PM -0800

Hello,
 
 
I have compiled the scimark2 benchmark with a newer 64-bit gcc
(mingw64) with -O2 optimization, and it has given a composite score of:
 
Composite Score: 760.43
 
The composite score of FreePascal 64 bit (the optimized version from the
trunk) is:
 
Composite Score MFlops: 581.20
 
 
So FreePascal is slower than gcc mingw64 by only about 24%
((760.43 - 581.20) / 760.43 is roughly 0.24), and that's great news
for FreePascal.
 
 
 
Thank you,
Amine Moulay Ramdane.
Melzzzzz <mel@zzzzz.com>: Feb 13 10:25PM +0100

On Fri, 13 Feb 2015 22:02:38 +0100
> -code will outperform the pascal-code because of the better
> C++-compilers.
> And just ignoring, that C++ is the smarter language.
 
For reference:
 
[bmaxa@maxa-pc sci]$ java jnt.scimark2.commandline
 
SciMark 2.0a
 
Composite Score: 2266.6183001230893
FFT (1024): 1389.279726779432
SOR (100x100): 1653.3291557147395
Monte Carlo : 877.5985388498531
Sparse matmult (N=1000, nz=5000): 1828.6989227937966
LU (100x100): 5584.185156477624
 
java.vendor: Oracle Corporation
java.version: 1.7.0_76
os.arch: amd64
os.name: Linux
os.version: 3.18.6-1-MANJARO
 
[bmaxa@maxa-pc sci]$ gcc -Wall -O3 -march=native *.c -o scimark2 -lm
[bmaxa@maxa-pc sci]$ ./scimark2
** **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov) **
** **
Using 2.00 seconds min time per kenel.
Composite Score: 2813.76
FFT Mflops: 2292.31 (N=1024)
SOR Mflops: 2446.00 (100 x 100)
MonteCarlo: Mflops: 658.08
Sparse matmult Mflops: 3028.21 (N=1000, nz=5000)
LU Mflops: 5644.21 (M=100, N=100)
[bmaxa@maxa-pc sci]$ clang -Wall -O3 -march=native *.c -o scimark2 -lm
[bmaxa@maxa-pc sci]$ ./scimark2
** **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov) **
** **
Using 2.00 seconds min time per kenel.
Composite Score: 2743.54
FFT Mflops: 1640.43 (N=1024)
SOR Mflops: 1875.82 (100 x 100)
MonteCarlo: Mflops: 629.14
Sparse matmult Mflops: 2904.31 (N=1000, nz=5000)
LU Mflops: 6668.00 (M=100, N=100)
[bmaxa@maxa-pc sci]$
Ramine <ramine@1.1>: Feb 13 02:48PM -0800

Hello,
 
 
I was implementing StringTree, and I first used TStringList, but it
was too slow; after that I used THashedStringList from the inifiles
unit and found it slow as well. This is why I decided to implement a
much faster HashStringList that is even faster than THashedStringList
of the inifiles unit. I have benchmarked my new HashStringList and it
gives really amazing performance, so I advise you to use my new
HashStringList, which you will find inside the zip file of
StringTree...
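
To give an idea of why a hashed lookup matters, here is a small
illustrative test. It uses only the standard TStringList and the
THashedStringList of the inifiles unit (not my HashStringList, whose
interface is not shown here), so it only shows the general idea of a
hashed lookup versus a linear IndexOf scan:

===

program HashLookupSketch;
{ Compares the linear IndexOf of TStringList with the hashed IndexOf
  of THashedStringList from the inifiles unit. }
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

uses
  SysUtils, Classes, IniFiles;

const
  N = 200000;
var
  Plain: TStringList;
  Hashed: THashedStringList;
  I, Idx: Integer;
  T0: TDateTime;
begin
  Plain := TStringList.Create;
  Hashed := THashedStringList.Create;
  try
    for I := 1 to N do
    begin
      Plain.Add('key' + IntToStr(I));
      Hashed.Add('key' + IntToStr(I));
    end;

    T0 := Now;
    for I := 1 to 1000 do
      Idx := Plain.IndexOf('key' + IntToStr(N - I));  { linear scan: O(n) per lookup }
    WriteLn('TStringList.IndexOf:       ', FormatDateTime('ss.zzz', Now - T0));

    T0 := Now;
    for I := 1 to 1000 do
      Idx := Hashed.IndexOf('key' + IntToStr(N - I)); { hashed lookup: roughly O(1) per lookup }
    WriteLn('THashedStringList.IndexOf: ', FormatDateTime('ss.zzz', Now - T0));

    if Idx < 0 then
      WriteLn('lookup failed');
  finally
    Hashed.Free;
    Plain.Free;
  end;
end.

===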
 
 
You can download my new HashStringList by downloading the zip file of
StringTree from:
 
https://sites.google.com/site/aminer68/stringtree
 
 
I have implemented the necessary methods, and it works with all the
Delphi versions and also with FreePascal, and it compiles to both
32-bit and 64-bit binaries.
 
 
 
Thank you,
Amine Moulay Ramdane.
Ramine <ramine@1.1>: Feb 13 02:18PM -0800

Hello,
 
 
I have updated my Parallel Sort library to version 3.32; I have just
corrected a minor bug and stress tested it, and I think it is now
really stable.
 
 
You can download my Parallel Sort library version 3.32 from:
 
https://sites.google.com/site/aminer68/parallel-sort-library
 
 
But don't forget to put the "cmem" unit as the first unit in your
"uses" statement when you want to use my Parallel Sort library with
the Delphi graphical user interface; that's mandatory, and it is also
mandatory with my other parallel libraries.
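
For example, the main program file should start something like this
(the "ParallelSort" unit name below is only a placeholder for the
library's real unit, which is not named in this post; what matters is
that cmem comes first):

===

program MyApplication;

uses
  cmem,          { must be the very first unit in the uses clause }
  SysUtils,
  ParallelSort;  { placeholder for the actual unit of the Parallel Sort library }

begin
  { ... create the sort object here and call its sort routines ... }
end.

===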
 
 
Thank you,
Amine Moulay Ramdane.
Ramine <ramine@1.1>: Feb 13 01:59PM -0800

Hello,
 
 
As I have promised, I have finally implemented a very fast StringTree
that will also be used to design a kind of graphical interface for my
parallel archiver. I have worked hard and written 2000 lines of
StringTree code in one day, and I have also taken one day to stress
test it, and I think it is now stable. Please read the description
below...
 
 
Authors: Amine Moulay Ramdane and Kjell Hasthi (it is based on Kjell
Hasthi's unit)
 
 
Description:
 
The TStringTree class implements a non-visual tree structure like that
found in a TreeView. TStringTree is a class for handling a
tree-structured string list. TStringTree is very similar to a
directory structure: it uses the familiar terms of "directories" and
"files" instead of nodes and child nodes. This unit is based on Kjell
Hasthi's unit, but I have redesigned and enhanced it much more, and it
is now much faster than Kjell Hasthi's unit; it also uses my Parallel
Sort library and my faster HashStringList.
 
And please look at the test.pas demo inside the zip file; compile it
and execute it...
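
To picture the "directories" and "files" idea, here is a toy sketch;
it is not the TStringTree class itself (its methods are not shown
here), it just illustrates the concept with a plain sorted TStringList
of full path strings:

===

program StringTreeIdeaSketch;
{ Toy illustration of the "directories and files" idea: full paths are
  kept in a sorted string list, and listing a "directory" means finding
  the paths directly under it. }
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

uses
  SysUtils, Classes;

procedure ListDirectory(Paths: TStringList; const Dir: string);
var
  I: Integer;
begin
  WriteLn('Contents of ', Dir, ':');
  for I := 0 to Paths.Count - 1 do
    { keep only paths that start with Dir + '/' and have no further '/' }
    if (Pos(Dir + '/', Paths[I]) = 1) and
       (Pos('/', Copy(Paths[I], Length(Dir) + 2, MaxInt)) = 0) then
      WriteLn('  ', Paths[I]);
end;

var
  Paths: TStringList;
begin
  Paths := TStringList.Create;
  try
    Paths.Sorted := True;
    Paths.Add('root/docs/license.txt');
    Paths.Add('root/docs/readme.txt');
    Paths.Add('root/src/main.pas');
    ListDirectory(Paths, 'root/docs');  { prints the two "files" under root/docs }
  finally
    Paths.Free;
  end;
end.

===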
 
 
You can download StringTree from:
 
https://sites.google.com/site/aminer68/stringtree
 
 
You have to download the zip file called stringtree_xe.zip for the
Delphi XE versions; the other zip file to download, called
stringtree.zip, is for FreePascal and Delphi 7 to 2007.
 
 
Language: FPC Pascal v2.2.0+ / Delphi 7+: http://www.freepascal.org/
 
Operating Systems: Win , Linux and Mac (x86).
 
Required FPC switches: -O3 -Sd -dFPC -dWin32 -dFreePascal
 
-Sd for delphi mode....
 
Required Delphi 7 to 2007 switches: -$H+ -DDelphi
 
Required Delphi XE switches: -$H+ -DXE
 
The defines options inside defines.inc are:
 
{$DEFINE CPU3} and {$DEFINE Windows32} for 32 bit systems
{$DEFINE CPU64} and {$DEFINE Windows64} for 64 bit systems
 
 
 
Thank you,
Amine Moulay Ramdane.
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page.
To unsubscribe from this group and stop receiving emails from it send an email to comp.programming.threads+unsubscribe@googlegroups.com.
