Wednesday, November 20, 2019

Digest for comp.programming.threads@googlegroups.com - 7 updates in 7 topics

aminer68@gmail.com: Nov 19 03:14PM -0800

Hello,
 
 
 
Canada is becoming very attractive with its remarkable economy; read more to see why:
 
 
Why so many Silicon Valley companies are moving to Vancouver Canada
 
https://www.straight.com/tech/1261681/why-so-many-silicon-valley-companies-are-moving-vancouver
 
 
As you have noticed, I am a white arab and I am living in Canada. Speaking about Canada, look at this interesting video about it:
 
The Remarkable Economy of Canada
 
https://www.youtube.com/watch?v=uRnIHyI02EU
 
 
 
Thank you,
Amine Moulay Ramdane.
aminer68@gmail.com: Nov 19 02:31PM -0800

Hello,
 
 
My inventions, my SemaMonitor and my SemaCondvar, were updated to version 2.3; they have become more efficient and powerful. Please read the readme file to learn more about the changes. I have also implemented an efficient Monitor over my SemaCondvar. Here is the description of my efficient Monitor inside the Monitor.pas file that you will find inside the zip file:
 
 
Description:
 
This is my implementation of a Monitor over my SemaCondvar.
 
You will find the Monitor class inside the Monitor.pas file inside the zip file.
 
When you set the first parameter of the constructor to true, the signal will not be lost if the threads are not waiting with the wait() method; but when you set the first parameter of the constructor to false, the signal will be lost if the threads are not waiting with the wait() method.
 
The second parameter of the constructor is the kind of lock: you can set it to ctMLock to use my scalable node-based lock called MLock, to ctMutex to use a Mutex, or to ctCriticalSection to use the TCriticalSection.
 
Here are the methods of my efficient Monitor that I have implemented:
 
TMonitor = class
private
  cache0: typecache0;
  lock1: TSyncLock;
  obj: TSemaCondvar;
  cache1: typecache0;

public

  constructor Create(bool: boolean = true; lock: TMyLocks = ctMLock);
  destructor Destroy; override;
  procedure Enter();
  procedure Leave();
  function Signal(): boolean; overload;
  function Signal(nbr: long; var remains: long): boolean; overload;
  procedure Signal_All();
  function Wait(const AMilliseconds: longword = INFINITE): boolean;
  function WaitersBlocked(): long;

end;
 
 
The wait() method is for the threads to wait on the Monitor object for the signal to be signaled. If wait() fails, it can be because the number of waiters is greater than high(longword).
 
The signal() method signals one waiting thread on the Monitor object; if signal() fails, the returned value is false.
 
The signal_all() method signals all the threads waiting on the Monitor object.
 
The signal(nbr:long;var remains:long) method signals nbr waiting threads; if signal() fails, the remaining number of signals that were not sent is returned in the remains variable.
 
WaitersBlocked() returns the number of threads waiting on the Monitor object.
 
The Enter() and Leave() methods enter and leave the Monitor's lock.
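
Here is a minimal usage sketch (my illustration, it is not code from the zip file): one thread waits on the Monitor for a notification while another thread publishes a value and signals it. The program frame, the unit name "Monitor" and the way the threads are started are assumptions; only the TMonitor methods listed above are taken from the description.

program MonitorExample;
{$mode objfpc}

uses
  {$ifdef unix}cthreads,{$endif} Monitor; // the unit name "Monitor" is an assumption

var
  mon: TMonitor;
  shared: integer;

function Writer(p: pointer): ptrint;
begin
  mon.Enter();   // protect the shared data with the Monitor's lock
  shared := 42;
  mon.Leave();
  mon.Signal();  // wake one thread blocked in Wait()
  Result := 0;
end;

function Reader(p: pointer): ptrint;
begin
  if mon.Wait() then  // block until signal() or signal_all() is called
  begin
    mon.Enter();
    WriteLn('shared = ', shared);
    mon.Leave();
  end;
  Result := 0;
end;

var
  t1, t2: TThreadID;
begin
  mon := TMonitor.Create(true, ctMLock); // true: a signal sent before Wait() is not lost
  t1 := BeginThread(@Reader);
  t2 := BeginThread(@Writer);
  WaitForThreadTerminate(t1, 0);
  WaitForThreadTerminate(t2, 0);
  mon.Free;
end.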
 
 
You can download the zip files from:
 
https://sites.google.com/site/scalable68/semacondvar-semamonitor
 
and the lightweight version is here:
 
https://sites.google.com/site/scalable68/light-weight-semacondvar-semamonitor
 
 
Thank you,
Amine Moulay Ramdane.
aminer68@gmail.com: Nov 19 02:25PM -0800

Hello,
 
 
Here are my new inventions: my new variants of scalable RWLocks that are powerful.
 
 
Author: Amine Moulay Ramdane
 
Description:
 
A fast, scalable, starvation-free, fair and lightweight Multiple-Readers-Exclusive-Writer Lock called LW_RWLockX; the scalable LW_RWLockX does spin-wait. And a fast, scalable, starvation-free and fair Multiple-Readers-Exclusive-Writer Lock called RWLockX; the scalable RWLockX doesn't spin-wait but uses my portable SemaMonitor and portable event objects, so it is energy efficient.
 
The parameter of the constructors is the size of the array of readers: if the size of the array is equal to the number of parallel readers, the lock will be scalable, but if the number of readers is greater than the size of the array, you will start to have contention. Please look at the source code of my scalable algorithms to understand.
 
 
I have used my following hash function to make my new variants of RWLocks scalable:
 
---
 
function DJB2aHash(key: int64): uint64;
var
  i: integer;
  key1: uint64;
begin
  Result := 5381;
  for i := 1 to 8 do
  begin
    key1 := (key shr ((i-1)*8)) and $00000000000000ff;
    Result := ((Result shl 5) xor Result) xor key1;
  end;
end;
 
---
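
Here is a minimal sketch of the idea (my assumption of how the per-reader slots could be indexed, not the author's actual code): each reader hashes its thread id into one slot of a per-reader array whose size is the constructor parameter, so parallel readers update different slots and do not contend on a single shared counter.

const
  NbrReaderSlots = 16; // illustrative: set it to the expected number of parallel readers

function ReaderSlot(threadId: int64): integer;
begin
  // DJB2aHash is the hash function shown above
  Result := integer(DJB2aHash(threadId) mod NbrReaderSlots);
end;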
 
You can download them from:
 
https://sites.google.com/site/scalable68/new-variants-of-scalable-rwlocks
 
 
Thank you,
Amine Moulay Ramdane.
aminer68@gmail.com: Nov 19 02:21PM -0800

Hello,
 
 
 
About the store buffer and memory visibility..
 
 
 
More about memory visibility..
 
I said before:
 
As you know, in parallel programming you have to take care not only of memory ordering but also of memory visibility; read this to notice it:
 
A store barrier, "sfence" instruction on x86, forces all store instructions prior to the barrier to happen before the barrier and have the store buffers flushed to cache for the CPU on which it is issued. This will make the program state "visible" to other CPUs so they can act on it if necessary.
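
Here is a minimal sketch of that publication pattern (my illustration, not code from the post): the writer stores the payload, issues a store barrier so the store leaves the store buffer, then sets a flag; the reader reads the payload only after it sees the flag. WriteBarrier and ReadBarrier are the barrier procedures declared in the FreePascal system unit; on x86 the store barrier corresponds to an sfence-style fence.

var
  data: integer;
  ready: boolean = false;

procedure Publisher;
begin
  data := 123;    // store the payload
  WriteBarrier;   // drain the store buffer before publishing the flag
  ready := true;  // publish
end;

procedure Consumer;
begin
  while not ready do ; // spin until the flag becomes visible
  ReadBarrier;         // order the flag load before the payload load
  WriteLn(data);       // sees 123
end;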
 
 
Read more here to understand correctly:
 
"However under x86-TSO, the stores are cached in the store buffers,
a load consult only shared memory and the store buffer of the given thread, wich means it can load data from memory and ignore values from
the other thread."
 
Read more here:
 
https://books.google.ca/books?id=C2R2DwAAQBAJ&pg=PA127&lpg=PA127&dq=immediately+visible+and+m+fence+and+store+buffer+and+x86&source=bl&ots=yfGI17x1YZ&sig=ACfU3U2EYRawTkQmi3s5wY-sM7IgowDlWg&hl=en&sa=X&ved=2ahUKEwi_nq3duYPkAhVDx1kKHYoyA5UQ6AEwAnoECAgQAQ#v=onepage&q=immediately%20visible%20and%20m%20fence%20and%20store%20buffer%20and%20x86&f=false
 
 
 
Now we can ask the question: how much time does it take for the store buffer to drain?
 
 
So read here to notice:
 
https://nicknash.me/2018/04/07/speculating-about-store-buffer-capacity/
 
 
So as you notice, he is allowing around 500 no-ops for the store buffer to drain, and I think that it can take less than that for the store buffer to drain.
 
 
 
Thank you,
Amine Moulay Ramdane.
aminer68@gmail.com: Nov 19 12:02PM -0800

Hello,
 
 
And now i will talk about data dependency and parallel loops..
 
For a loop to be parallelized, every iteration must be independent of the others. One way to be sure of it is to execute the loop in the direction of the incremented index and in the direction of the decremented index and verify that the results are the same. A data dependency happens if memory is modified: a loop has a data dependency if an iteration writes a variable that is read or written in another iteration of the loop. There is no data dependency if only one iteration reads or writes a variable, or if many iterations read the same variable without modifying it. So these are the general rules.
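
As a small illustration of these rules (my example, not from the original post), in the same Object Pascal-like pseudocode:

// Independent iterations: each iteration writes only A[I] and reads only B[I],
// so the loop can be parallelized.
For I := 1 to N
Do
Begin
  A[I] := B[I] * 2;
End;

// Data dependency: iteration I reads A[I-1], which iteration I-1 writes,
// so the iterations cannot simply be run in parallel.
For I := 2 to N
Do
Begin
  A[I] := A[I-1] + 1;
End;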
 
Now you also have to know how to construct the parallel for loop when there is an induction variable or when there is a reduction operation; I will give an example of each:
 
If we have the following (the code looks like Algol or modern Object Pascal):
 
 
IND := 0;

For I := 1 to N
Do
Begin
  IND := IND + I;
  A[I] := B[IND];
End;
 
 
So as you notice, since IND is an induction variable, to parallelize the loop you have to do the following:
 
 
For I := 1 to N
Do
Begin
  IND := (I*(I+1)) div 2;
  A[I] := B[IND];
End;
 
 
Now for the reduction operation example: you will notice that my invention that is my Threadpool with priorities that scales very well (read about it below) supports a Parallel For that scales very well and that supports a "grainsize". You will notice that the grainsize can be used in the ParallelFor() with a reduction operation, and that my following powerful scalable Adder is also used in this scenario; here it is:
 
https://sites.google.com/site/scalable68/scalable-adder-for-delphi-and-freepascal
 
 
So here is the example with a reduction operation in modern Object Pascal:
 
 
TOTAL := 0.0;

For I := 1 to N
Do
Begin
  TOTAL := TOTAL + A[I];
End;
 
 
So with my powerful scalable Adder and with my powerful invention that is my ParallelFor() that scales very well, you can parallelize the above like this:
 
procedure test1(j: integer; ptr: pointer);
begin

  t.add(A[j]); // "t" is my scalable Adder object

end;

// Let's suppose that N is 100000
// In the following, 10000 is the grainsize

obj.ParallelFor(1, N, test1, 10000, pointer(0));

TOTAL := t.get();
 
 
And read the following to understand how to use the grainsize of my Parallel For that scales well:
 
 
About my ParallelFor() that scales very well that uses my efficient Threadpool that scales very well:
 
With ParallelFor() you have to:
 
1- Ensure Sufficient Work
 
Each iteration of a loop involves a certain amount of work, so you have to ensure a sufficient amount of work; read below about the "grainsize" that I have implemented.
 
2- In OpenMP we have that:
 
Static and Dynamic Scheduling
 
One basic characteristic of a loop schedule is whether it is static or dynamic:
 
• In a static schedule, the choice of which thread performs a particular
iteration is purely a function of the iteration number and number of
threads. Each thread performs only the iterations assigned to it at the
beginning of the loop.
 
• In a dynamic schedule, the assignment of iterations to threads can
vary at runtime from one execution to another. Not all iterations are
assigned to threads at the start of the loop. Instead, each thread
requests more iterations after it has completed the work already
assigned to it.
 
But with my ParallelFor() that scales very well, since it is using my efficient Threadpool that scales very well, it uses Round-robin scheduling and it also uses work stealing, so I think that this is sufficient.
 
Read the rest:
 
My Threadpool engine with priorities that scales very well is really powerful because it scales very well on multicore and NUMA systems, also it comes with a ParallelFor() that scales very well on multicores and NUMA systems.
 
You can download it from:
 
https://sites.google.com/site/scalable68/an-efficient-threadpool-engine-with-priorities-that-scales-very-well
 
 
Here is the explanation of my ParallelFor() that scales very well:
 
I have also implemented a ParallelFor() that scales very well, here is the method:
 
procedure ParallelFor(nMin, nMax:integer;aProc: TParallelProc;GrainSize:integer=1;Ptr:pointer=nil;pmode:TParallelMode=pmBlocking;Priority:TPriorities=NORMAL_PRIORITY);
 
The nMin and nMax parameters of ParallelFor() are the minimum and maximum integer values of the variable of the ParallelFor() loop, the aProc parameter of ParallelFor() is the procedure to call, and the GrainSize integer parameter of ParallelFor() is the following:
 
The grainsize sets a minimum threshold for parallelization.
 
A rule of thumb is that grainsize iterations should take at least 100,000 clock cycles to execute.
 
For example, if a single iteration takes 100 clocks, then the grainsize needs to be at least 1000 iterations. When in doubt, do the following experiment:
 
1- Set the grainsize parameter higher than necessary. The grainsize is specified in units of loop iterations.
 
If you have no idea of how many clock cycles an iteration might take, start with grainsize=100,000.
 
The rationale is that each iteration normally requires at least one clock per iteration. In most cases, step 3 will guide you to a much smaller value.
 
2- Run your algorithm.
 
3- Iteratively halve the grainsize parameter and see how much the algorithm slows down or speeds up as the value decreases.
 
A drawback of setting a grainsize too high is that it can reduce parallelism. For example, if the grainsize is 1000 and the loop has 2000 iterations, the ParallelFor() method distributes the loop across only two processors, even if more are available.
 
And you can pass a parameter in Ptr as a pointer to ParallelFor(), and you can set the pmode parameter to pmBlocking so that ParallelFor() is blocking or to pmNonBlocking so that ParallelFor() is non-blocking, and the Priority parameter is the priority of ParallelFor(). Look inside the test.pas example to see how to use it, and see the short sketch below.
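
Here is a small usage sketch based on the signature above (the names "obj", "test1", N and the grainsize of 10000 follow the earlier reduction example; how to wait for completion of a non-blocking call is not shown here, see test.pas in the zip file):

// Blocking call (the default pmBlocking mode): returns when all iterations are done.
obj.ParallelFor(1, N, test1, 10000, pointer(0));

// Non-blocking call: ParallelFor() returns immediately and the iterations are
// executed in the background by the Threadpool, at the default NORMAL_PRIORITY.
obj.ParallelFor(1, N, test1, 10000, pointer(0), pmNonBlocking);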
 
 
Thank you,
Amine Moulay Ramdane.
aminer68@gmail.com: Nov 19 11:32AM -0800

Hello,
 
 
And about Message Passing Process Communication Model and Shared Memory Process Communication Model:
 
An advantage of the shared memory model is that memory communication is faster than with the message passing model on the same machine.
 
Read the following to notice it:
 
Why did Windows NT move away from the microkernel?
 
"The main reason that Windows NT became a hybrid kernel is speed. A microkernel-based system puts only the bare minimum system components in the kernel and runs the rest of them as user mode processes, known as servers. A form of inter-process communication (IPC), usually message passing, is used for communication between servers and the kernel.
 
Microkernel-based systems are more stable than others; if a server crashes, it can be restarted without affecting the entire system, which couldn't be done if every system component was part of the kernel. However, because of the overhead incurred by IPC and context-switching, microkernels are slower than traditional kernels. Due to the performance costs of a microkernel, Microsoft decided to keep the structure of a microkernel, but run the system components in kernel space. Starting in Windows Vista, some drivers are also run in user mode."
 
 
More about message passing..
 
"One problem that plagues microkernel implementations is relatively poor performance. The message-passing layer that connects
different operating system components introduces an extra layer of
machine instructions. The machine instruction overhead introduced
by the message-passing subsystem manifests itself as additional
execution time. In a monolithic system, if a kernel component needs
to talk to another component, it can make direct function calls
instead of going through a third party."
 
However, the shared memory model may create problems such as synchronization and memory protection that need to be addressed.
 
Message passing's major flaw is the inversion of control: it is the moral equivalent of gotos in unstructured programming (it's about time somebody said that message passing is considered harmful).
 
Also some research shows that the total effort to write an MPI application is significantly higher than that required to write a shared-memory version of it.
 
 
 
Thank you,
Amine Moulay Ramdane.
aminer68@gmail.com: Nov 19 11:21AM -0800

Hello,
 
 
About composability of lock-based systems now:
 
Design your systems to be composable. Among the more galling claims of the detractors of lock-based systems is the notion that they are somehow uncomposable:
 
"Locks and condition variables do not support modular programming," reads one typically brazen claim, "building large programs by gluing together smaller programs[:] locks make this impossible."9 The claim, of course, is incorrect. For evidence one need only point at the composition of lock-based systems such as databases and operating systems into larger systems that remain entirely unaware of lower-level locking.
 
There are two ways to make lock-based systems completely composable, and each has its own place. First (and most obviously), one can make locking entirely internal to the subsystem. For example, in concurrent operating systems, control never returns to user level with in-kernel locks held; the locks used to implement the system itself are entirely behind the system call interface that constitutes the interface to the system. More generally, this model can work whenever a crisp interface exists between software components: as long as control flow is never returned to the caller with locks held, the subsystem will remain composable.
 
Second (and perhaps counterintuitively), one can achieve concurrency and
composability by having no locks whatsoever. In this case, there must be
no global subsystem state—subsystem state must be captured in per-instance state, and it must be up to consumers of the subsystem to assure that they do not access their instance in parallel. By leaving locking up to the client of the subsystem, the subsystem itself can be used concurrently by different subsystems and in different contexts. A concrete example of this is the AVL tree implementation used extensively in the Solaris kernel. As with any balanced binary tree, the implementation is sufficiently complex to merit componentization, but by not having any global state, the implementation may be used concurrently by disjoint subsystems—the only constraint is that manipulation of a single AVL tree instance must be serialized.
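
Here is a small sketch of that second pattern (my illustration, not from the article): the component keeps only per-instance state and takes no locks itself, and the client of the component is responsible for serializing access to each instance.

type
  // No global state and no internal locking: composability is preserved
  // because the caller decides how each instance is protected.
  TCounter = class
  private
    FValue: int64; // per-instance state only
  public
    procedure Add(n: int64);
    function Get: int64;
  end;

procedure TCounter.Add(n: int64);
begin
  FValue := FValue + n; // no lock here: the caller serializes access to this instance
end;

function TCounter.Get: int64;
begin
  Result := FValue;
end;

// Disjoint subsystems can each use their own TCounter concurrently; the only
// constraint is that access to a single instance must be serialized by its owner.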
 
Read more here:
 
https://queue.acm.org/detail.cfm?id=1454462
 
 
 
Thank you,
Amine Moulay Ramdane.
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page.
To unsubscribe from this group and stop receiving emails from it send an email to comp.programming.threads+unsubscribe@googlegroups.com.
