Saturday, November 1, 2014

Digest for comp.programming.threads@googlegroups.com - 5 updates in 5 topics

comp.programming.threads@googlegroups.com Google Groups
Unsure why you received this message? You previously subscribed to digests from this group, but we haven't been sending them for a while. We fixed that, but if you don't want to get these messages, send an email to comp.programming.threads+unsubscribe@googlegroups.com.
Ramine <ramine@1.1>: Oct 31 05:36PM -0700

Hello,
 
We have to speak frankly and clearly, as the PhD papers of the IEEE
do, for example...
 
 
What I want to say this time is that you have to know that parallel
programming is hard, because you have to think about sequential
consistency and about race conditions and about deadlocks and about
lock convoys etc., and you have to know also that proving lock-free
algorithms correct is hard to do... so how can we proceed and walk
confidently in this complex area that is parallel programming? My way
of doing it is to implement a reduced number of libraries that are
stable and to use them as base foundations for other projects; this is
my way of reducing or eliminating the risk. So what I have done by
inventing my libraries and parallel algorithms and synchronization
algorithms is to reduce the complexity of parallel programming by
using those same parallel projects as building blocks for other
projects. I have taken a lot of care to verify that my parallel
algorithms are free of bugs and free of problems, and I think I have
succeeded in that regard by inventing, implementing and bringing some
stable parallel libraries and stable synchronization algorithms. And
as you have noticed, I have not used Throttling Concurrency in my
Threadpool like they are doing in the following link:
 
http://msdn.microsoft.com/en-us/magazine/ff960958.aspx
 
 
Simply because my parallel projects don't contain waiting instructions
like sleep() or I/O waits or that kind of waiting function, and this
makes my parallel algorithms efficient when using my Threadpool, I
think.
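 
To illustrate why no throttling is needed when the tasks never block,
here is a minimal sketch of a fixed-size pool sized to the number of
cores, in C++11. It is only an illustration under that assumption; it
is not the code of my Threadpool, and the name FixedThreadPool is made
up for the example:
 
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>
 
// One worker per hardware core: since the tasks never sleep() or wait
// on I/O, every core stays busy and no concurrency throttling is
// needed.
class FixedThreadPool {
public:
  explicit FixedThreadPool(
      unsigned n = std::thread::hardware_concurrency()) {
    if (n == 0) n = 1;
    for (unsigned i = 0; i < n; ++i)
      workers_.emplace_back([this] { run(); });
  }
  ~FixedThreadPool() {
    { std::lock_guard<std::mutex> lk(m_); done_ = true; }
    cv_.notify_all();
    for (auto &t : workers_) t.join();
  }
  void submit(std::function<void()> task) {
    { std::lock_guard<std::mutex> lk(m_); tasks_.push(std::move(task)); }
    cv_.notify_one();
  }
private:
  void run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return done_ || !tasks_.empty(); });
        if (done_ && tasks_.empty()) return;
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task(); // CPU-bound work only, no blocking calls
    }
  }
  std::vector<std::thread> workers_;
  std::queue<std::function<void()>> tasks_;
  std::mutex m_;
  std::condition_variable cv_;
  bool done_ = false;
};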
 
 
So I hope this clears up my intentions, which were to bring you
stable and fast parallel algorithms, and I think I have succeeded
in this regard.
 
 
You can download all my parallel projects from:
 
https://sites.google.com/site/aminer68/
 
 
 
Thank you,
Amine Moulay Ramdane.
Ramine <ramine@1.1>: Oct 31 04:32PM -0700

Hello,
 
 
Frankly, I think I have to write a PhD paper about my Parallel Sort library...
 
 
If you look at my Parallel Quicksort, you will notice that when the
main thread enters the partition() procedure for the first time,
before starting the threads, it has to serially partition the "whole"
array, and this makes the serial part of the Amdahl equation bigger,
so it makes my Parallel Quicksort less scalable... hope this is clear.
 
But what about my Parallel Sort library? We have to be smart: when you
take a look at the PhD papers, you will notice that they benchmark
their parallel algorithms empirically, but since I don't have a
computer with more than 4 cores, I cannot benchmark my Parallel Sort
library with more than 4 cores, so I have to do mathematical
calculations to make a scalability prediction for it.
 
If you take a look at the source code of my Parallel Sort library, you
will notice that I am using two steps, one step for the sorting part
and one step for the merging part, and my parallel algorithm has
completely parallelized the sorting part and completely parallelized
the merging part; this has made my parallel algorithm more scalable,
because its serial part is much smaller than the serial part of my
Parallel Quicksort library. A sketch of this two-step structure
follows below.
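 
To make the two-step structure concrete, here is a minimal sketch in
C++11 that sorts the chunks in parallel and then merges the sorted
runs pairwise, also in parallel. It is only an illustration of the
structure, not the code of my Parallel Sort library, and the name
parallel_sort is made up for the example:
 
#include <algorithm>
#include <thread>
#include <vector>
 
// Step 1 sorts nthreads chunks concurrently; step 2 merges the sorted
// runs pairwise on threads, halving the number of runs each round.
void parallel_sort(std::vector<int> &a, size_t nthreads) {
  size_t chunk = a.size() / nthreads;
  std::vector<size_t> bounds;
  for (size_t i = 0; i <= nthreads; ++i)
    bounds.push_back(i == nthreads ? a.size() : i * chunk);
  {
    std::vector<std::thread> ts;
    for (size_t i = 0; i < nthreads; ++i)
      ts.emplace_back([&, i] {
        std::sort(a.begin() + bounds[i], a.begin() + bounds[i + 1]);
      });
    for (auto &t : ts) t.join();
  }
  for (size_t width = 1; width < nthreads; width *= 2) {
    std::vector<std::thread> ts;
    for (size_t i = 0; i + width < nthreads; i += 2 * width)
      ts.emplace_back([&, i, width] {
        std::inplace_merge(
            a.begin() + bounds[i], a.begin() + bounds[i + width],
            a.begin() + bounds[std::min(i + 2 * width, nthreads)]);
      });
    for (auto &t : ts) t.join();
  }
}
 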
So how can we make a scalability prediction for my Parallel Sort
library? You have to do it the same way as you do it in assembler: you
calculate the number of CPU clocks that the parallel part of the
Amdahl equation takes, and the number of CPU clocks that the serial
part takes. I have done it for the Parallel Quicksort inside my
Parallel Sort library. Although this depends on the size of the
"string" when you are sorting strings, we will take as an example an
average string of 10 characters, and that gives around 16 clocks for
the parallel part of the Amdahl equation, for the compare() between
strings etc., because I have done the calculation myself. For the
serial part, we have to know that memory transfers are serialized on
x86 computers, and since on DDR3 the memory throughput can go up to
17 GB/s, on my 2.4 GHz CPU this gives around 8 bytes transferred per
clock from the memory to the CPU, so it takes around 2 clocks to
transfer two strings with an average size of 10 characters before
doing the compare(). So finally this gives a serial part of 2 clocks
and a parallel part of 16 clocks, which gives a scalability on
multicores of about 9x on x86 multicores with DDR3 memory. That was
the scalability prediction for "strings" with an average size of 10
characters; for strings of a bigger size it can scale more. But for
integer sorting, two integers will be transferred from the memory to
the CPU in 1 CPU clock before doing the compare(), and the compare()
function between integers takes only one clock, so for integers it
takes 7 CPU clocks for the parallel part and 1 CPU clock for the
serial part, and after doing the Amdahl equation calculation for
integers, it gives a scalability on multicores of no more than around
8x.
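 
Here is the same back-of-the-envelope Amdahl calculation as a small
program, so you can redo it with your own clock estimates; the 16/2
and 7/1 clock counts are my estimates from above:
 
#include <cstdio>
 
// Amdahl speedup with s serial clocks, p parallel clocks, n cores:
// speedup = (s + p) / (s + p / n), tending to (s + p) / s as n grows.
double amdahl(double s, double p, double n) {
  return (s + p) / (s + p / n);
}
 
int main() {
  // Strings of ~10 characters: serial = 2 clocks, parallel = 16 clocks.
  std::printf("strings  limit: %.1fx\n", amdahl(2, 16, 1e9)); // ~9.0x
  // Integers: serial = 1 clock, parallel = 7 clocks.
  std::printf("integers limit: %.1fx\n", amdahl(1, 7, 1e9));  // ~8.0x
  return 0;
}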
 
 
So my Parallel Sort library is much more scalable than my Parallel
Quicksort library...
 
This is very good to know...
 
 
You can download my Parallel Sort library from:
 
https://sites.google.com/site/aminer68/parallel-sort-library
 
 
 
Thank you,
Amine Moulay Ramdane.
Ramine <ramine@1.1>: Oct 31 02:59PM -0700

Hello,
 
 
I have just read the following PhD paper about NUMA-Aware Reader-Writer
Locks...
 
Here it is:
 
http://mcg.cs.tau.ac.il/papers/ppopp2013-rwlocks.pdf
 
 
You will notice that it scales better than the DV reader-writer lock;
I think that is because there are fewer NUMA nodes in this
reader-writer lock than there are CPU ids on the DV side, which makes
the DV lock more expensive on the writer side, so it makes the serial
part of the Amdahl equation bigger, and that makes the DV
reader-writer lock less scalable... but the DV reader-writer lock is
good also.
 
 
But I have also invented and implemented scalable RWLocks; here they are:
 
https://sites.google.com/site/aminer68/scalable-rwlock
 
 
The scalability of my scalable RWLocks is good at 1% to 3% writes.
Even more, I have invented scalable and starvation-free RWLocks, which
you will find inside the same zipfile, and this makes my scalable
RWLocks attractive and powerful, I think.
 
Please notice that the reader-writer lock that is included inside the
Omnithread library uses an expensive atomic operation on the reader
side, which makes it less scalable than my scalable RWLocks, so I
advise you to use my scalable RWLocks, because they are more powerful
and they are really good.
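 
To show the idea behind distributing the reader side, here is a
minimal sketch of a distributed reader-writer lock in C++11, where
each reader slot lives on its own cache line so readers never contend
on a shared atomic, while the writer must sweep every slot. It only
illustrates the trade-off (fewer slots make the writer's serial sweep
cheaper); it is not the code of my scalable RWLocks, and it is not
starvation-free:
 
#include <atomic>
#include <thread>
 
const int kSlots = 64; // e.g. one slot per CPU id or per NUMA node
 
// Each slot sits alone on a cache line, so readers never share a line.
struct alignas(64) Slot { std::atomic<int> readers{0}; };
 
class DistributedRWLock {
public:
  void read_lock(int slot) {
    for (;;) {
      slots_[slot].readers.fetch_add(1);   // publish this read
      if (!writer_.load()) return;         // no writer, proceed
      slots_[slot].readers.fetch_sub(1);   // writer active, undo
      while (writer_.load()) std::this_thread::yield();
    }
  }
  void read_unlock(int slot) { slots_[slot].readers.fetch_sub(1); }
  void write_lock() {
    bool expected = false;
    while (!writer_.compare_exchange_weak(expected, true)) {
      expected = false;
      std::this_thread::yield();
    }
    // The writer's serial sweep: wait for every slot to drain. Fewer
    // slots (NUMA nodes instead of CPU ids) make this sweep cheaper.
    for (auto &s : slots_)
      while (s.readers.load() > 0) std::this_thread::yield();
  }
  void write_unlock() { writer_.store(false); }
private:
  Slot slots_[kSlots];
  std::atomic<bool> writer_{false};
};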
 
 
You can download my scalable RWLocks from:
 
https://sites.google.com/site/aminer68/scalable-rwlock
 
 
 
 
Thank you,
Amine Moulay Ramdane.
Ramine <ramine@1.1>: Oct 31 02:07PM -0700

Hello,
 
 
My Parallel archiver is really more powerful than my Parallel
compression library, so I advise you to use my Parallel archiver
instead of my Parallel compression library.
 
So I hope I have brought you an interesting and powerful parallel
library to add to your libraries.
 
 
You can download Parallel archiver 2.23 from:
 
https://sites.google.com/site/aminer68/parallel-archiver
 
 
 
Thank you,
Amine Moulay Ramdane.
Ramine <ramine@1.1>: Oct 31 01:39PM -0700

Hello,
 
 
I have updated my Parallel archiver to version 2.23; I had forgotten
to update the AddStream() method, and I have now done it, so now it is
rock solid and stable.
 
 
Please read this about my Parallel archiver:
 
This software is provided on an "as-is" basis, with no warranties,
express or implied. The entire risk and liability of using it is yours.
Any damages resulting from the use or misuse of this software will
be the responsibility of the user.
 
 
You can download Parallel archiver 2.23 from:
 
https://sites.google.com/site/aminer68/parallel-archiver
 
 
 
 
Thank you,
Amine Moulay Ramdane.
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page.
To unsubscribe from this group and stop receiving emails from it send an email to comp.programming.threads+unsubscribe@googlegroups.com.
