- In the following software project of mine I have also used more sophisticated mathematics.. - 1 Update
- My new scalable algorithm is here.. - 1 Update
- Today I will talk about data dependency and parallel loops.. - 1 Update
- Read again, I corrected some typos in Lock Versus Lock-Free - 1 Update
- About memory safety and more.. - 1 Update
- About the Linux sys_membarrier() expedited and the Windows FlushProcessWriteBuffers().. - 1 Update
aminer68@gmail.com: Dec 10 04:10PM -0800

Hello,

My Universal Scalability Law for Delphi and FreePascal version 3.22

In the following software project of mine I have also used more sophisticated mathematics..

I have included 32-bit and 64-bit Windows executables called usl.exe and usl_graph.exe inside the zip; please read the readme file to learn how to use them, it is a very powerful tool.

Here is an important and free book about the Universal Scalability Law (USL); read it to understand more about what the USL is all about. You can download it from this link:

https://www.vividcortex.com/resources/universal-scalability-law/

And you have to optimize the cost criterion for a better QoS, and for this I have supplied a second option called -d. So you type at the command prompt:

usl data.csv -d 0.3 0.1

Here 0.3 is the slope of the secant and 0.1 is the step delta(x), so this approximates the point where the derivative of the USL equation equals 0.3. Here is the output of my program when you run it with -d 0.3 0.1:

--
Peak number is: 449.188
Predicted scalability peak is: 18.434
Coefficient of determination R-squared is: 0.995
The derivative of the USL equation at delta(y)/delta(x)=0.300 with a step delta(x)=0.100 gives a number and a derivative of a secant, or a derivative delta(y)/delta(x), of: 16.600 and 0.300
--

So as you have noticed, a good approximation of the derivative of the USL equation arrives at 16.600 cores, and this also gives the derivative of the secant that approximates the derivative of the USL equation. To optimize the cost criterion further for a better QoS, you have to choose a good delta(y)/delta(x) that optimizes the cost criterion of your system, balancing performance against cost.

The -nlr option means that the problem will be solved with mathematical nonlinear regression using the simplex method as a minimization; if you don't specify -nlr, the problem is solved by default with mathematical polynomial regression.

You can read about it and download it from:

https://sites.google.com/site/scalable68/universal-scalability-law-for-delphi-and-freepascal

Thank you,
Amine Moulay Ramdane.
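To make the secant approximation above concrete, here is a minimal FreePascal sketch of the idea, assuming the standard USL form C(N) = N / (1 + alpha*(N-1) + beta*N*(N-1)); the alpha and beta coefficients below are hypothetical placeholders, since in practice they come from the regression fit done by usl.exe:

---
program usl_secant;
{$mode objfpc}

const
  // Hypothetical USL coefficients; real values come from the fit.
  alpha = 0.02;   // contention (serialization) coefficient
  beta  = 0.0002; // coherency (crosstalk) coefficient

// Universal Scalability Law: C(N) = N / (1 + alpha*(N-1) + beta*N*(N-1))
function USL(n: double): double;
begin
  USL := n / (1.0 + alpha*(n - 1.0) + beta*n*(n - 1.0));
end;

var
  x, step, target, slope: double;
begin
  step := 0.1;   // delta(x)
  target := 0.3; // desired secant slope delta(y)/delta(x)
  x := 1.0;
  repeat
    slope := (USL(x + step) - USL(x)) / step; // secant approximates the derivative at x
    x := x + step;
  until (slope <= target) or (x > 1000.0);
  writeln('secant slope ', slope:0:3, ' reached near ', x - step:0:1, ' cores');
end.
---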
aminer68@gmail.com: Dec 10 03:17PM -0800

Hello,

My new scalable algorithm is here..

As you have noticed, I am an inventor of many scalable algorithms and their implementations, but I have just thought more about it and I think I have again invented a new algorithm that is scalable. I will explain it; read my following previous writing:

-----------------------------------------------------------

About parallel programming and concurrency..

Look at the following concurrency abstractions of Microsoft:

https://docs.microsoft.com/en-us/dotnet/api/system.threading.tasks.task.waitany?view=netframework-4.8

https://docs.microsoft.com/en-us/dotnet/api/system.threading.tasks.task.waitall?view=netframework-4.8

I will soon implement the WaitAny() and WaitAll() concurrency abstractions for Delphi and FreePascal, with a timeout in milliseconds of course, and they will work with my efficient implementation of a Future, so you will be able to wait for many futures with WaitAny() and WaitAll().

And about task cancellation as in Microsoft's TPL: I think it is not a good abstraction, because how do you know when you have to efficiently cancel a task or tasks? So you understand that task cancellation is not such an efficient abstraction, and I will not implement it, because I think that WaitAny() and WaitAll() over Futures with a timeout in milliseconds are good concurrency abstractions.

--------------------------------------------------------------

But my new algorithm is a WaitAny() that is fully "scalable": it can be called from multiple threads and it will remain scalable. And my WaitAny() and WaitAll() will work with Futures and with Threads and with Event objects and such, and they will be portable and scalable.

Thank you,
Amine Moulay Ramdane.
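My scalable implementation is not public yet, so to illustrate only the contract of such an abstraction, here is a minimal, hypothetical WaitAny() over TEvent handles in FreePascal, done by naive polling; it is nothing like the scalable algorithm described above, and the WaitAny name and signature are assumptions for this sketch:

---
program waitany_sketch;
{$mode objfpc}

uses
  SysUtils, syncobjs;

// Naive WaitAny: returns the index of the first signaled event,
// or -1 on timeout. Polling is simple but not scalable; this only
// shows the semantics of the abstraction.
function WaitAny(const events: array of TEvent; timeoutMs: cardinal): integer;
var
  waited: cardinal;
  i: integer;
begin
  waited := 0;
  repeat
    for i := 0 to High(events) do
      if events[i].WaitFor(0) = wrSignaled then
        exit(i);
    Sleep(1);
    inc(waited);
  until waited >= timeoutMs;
  result := -1;
end;

var
  e1, e2: TEvent;
begin
  e1 := TEvent.Create(nil, true, false, '');
  e2 := TEvent.Create(nil, true, false, '');
  e2.SetEvent; // simulate one completed task
  writeln('signaled index: ', WaitAny([e1, e2], 1000));
  e1.Free;
  e2.Free;
end.
---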
aminer68@gmail.com: Dec 10 03:00PM -0800

Hello,

Today I will talk about data dependency and parallel loops..

For a loop to be parallelized, every iteration must be independent of the others. One way to check this is to execute the loop in the direction of the incremented index and in the direction of the decremented index and verify that the results are the same.

A data dependency happens when memory is modified: a loop has a data dependency if an iteration writes a variable that is read or written in another iteration of the loop. There is no data dependency if only one iteration reads or writes a variable, or if many iterations read the same variable without modifying it. Those are the general rules.

Now it remains to know how to construct the parallel for loop when there is an induction variable or a reduction operation. I will give an example of each. Suppose we have the following (the code looks like Algol or modern Object Pascal):

IND:=0;
For I:=1 to N Do
Begin
  IND:=IND+I;
  A[I]:=B[IND];
End;

Since IND is an induction variable, to parallelize the loop you replace it with its closed form, which makes every iteration independent of the others:

For I:=1 to N Do
Begin
  IND:=(I*(I+1)) div 2;
  A[I]:=B[IND];
End;

Now for the reduction operation example: you will notice that my invention, my Threadpool with priorities that scales very well (read about it below), supports a ParallelFor() that scales very well and supports a "grainsize", and the grainsize can be used in the ParallelFor() with a reduction operation; my following powerful scalable Adder is also used in this scenario, here it is:

https://sites.google.com/site/scalable68/scalable-adder-for-delphi-and-freepascal

So here is the example with a reduction operation in modern Object Pascal:

TOTAL:=0.0;
For I:=1 to N Do
Begin
  TOTAL:=TOTAL+A[I];
End;

With my powerful scalable Adder and with my powerful invention that is my ParallelFor() that scales very well, you parallelize the above like this (a self-contained sketch of this reduction pattern also follows at the end of this post):

procedure test1(j:integer;ptr:pointer);
begin
  t.add(A[j]); // "t" is my scalable Adder object
end;

// Let's suppose that N is 100000
// In the following, 10000 is the grainsize
obj.ParallelFor(1,N,test1,10000,pointer(0));
TOTAL:=t.get();

And read the following to understand how to use the grainsize of my ParallelFor() that scales well.

About my ParallelFor() that scales very well and that uses my efficient Threadpool that scales very well:

With ParallelFor() you have to ensure sufficient work: each iteration of a loop involves a certain amount of work, so you have to ensure a sufficient amount of it; read below about the "grainsize" that I have implemented.

In OpenMP, one basic characteristic of a loop schedule is whether it is static or dynamic:

• In a static schedule, the choice of which thread performs a particular iteration is purely a function of the iteration number and the number of threads. Each thread performs only the iterations assigned to it at the beginning of the loop.

• In a dynamic schedule, the assignment of iterations to threads can vary at runtime from one execution to another. Not all iterations are assigned to threads at the start of the loop; instead, each thread requests more iterations after it has completed the work already assigned to it.

But my ParallelFor() that scales very well, since it uses my efficient Threadpool that scales very well, uses round-robin scheduling together with work stealing, and I think that this is sufficient.

Read the rest:

My Threadpool engine with priorities that scales very well is really powerful because it scales very well on multicore and NUMA systems; it also comes with a ParallelFor() that scales very well on multicore and NUMA systems. You can download it from:

https://sites.google.com/site/scalable68/an-efficient-threadpool-engine-with-priorities-that-scales-very-well

Here is the explanation of my ParallelFor() that scales very well. I have implemented it with the following method:

procedure ParallelFor(nMin, nMax:integer; aProc: TParallelProc; GrainSize:integer=1; Ptr:pointer=nil; pmode:TParallelMode=pmBlocking; Priority:TPriorities=NORMAL_PRIORITY);

The nMin and nMax parameters of ParallelFor() are the minimum and maximum integer values of the loop variable, the aProc parameter is the procedure to call, and the GrainSize integer parameter is the following:

The grainsize sets a minimum threshold for parallelization. A rule of thumb is that grainsize iterations should take at least 100,000 clock cycles to execute. For example, if a single iteration takes 100 clocks, then the grainsize needs to be at least 1000 iterations. When in doubt, do the following experiment:

1- Set the grainsize parameter higher than necessary. The grainsize is specified in units of loop iterations. If you have no idea how many clock cycles an iteration might take, start with grainsize=100,000. The rationale is that each iteration normally requires at least one clock per iteration. In most cases, step 3 will guide you to a much smaller value.

2- Run your algorithm.

3- Iteratively halve the grainsize parameter and see how much the algorithm slows down or speeds up as the value decreases.

A drawback of setting a grainsize too high is that it can reduce parallelism. For example, if the grainsize is 1000 and the loop has 2000 iterations, the ParallelFor() method distributes the loop across only two processors, even if more are available.

And you can pass a parameter in Ptr as a pointer to ParallelFor(); you can set the pmode parameter to pmBlocking so that ParallelFor() is blocking, or to pmNonBlocking so that it is non-blocking; and the Priority parameter is the priority of ParallelFor(). Look inside the test.pas example to see how to use it.

Thank you,
Amine Moulay Ramdane.
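And for readers who don't have my Threadpool library above, here is a self-contained FreePascal sketch of the same reduction pattern, using plain TThread objects and per-thread partial sums; it does not use my scalable Adder or my ParallelFor(), it only illustrates the standard partial-sum pattern, with the chunk size playing the role of the grainsize:

---
program reduction_sketch;
{$mode objfpc}

uses
  Classes, SysUtils;

const
  N = 100000;
  NbThreads = 4;
  Chunk = N div NbThreads; // plays the role of the grainsize

var
  A: array[1..N] of double;
  Partial: array[0..NbThreads-1] of double;

type
  TSumThread = class(TThread)
  private
    FId, FLo, FHi: integer;
  public
    constructor Create(id, lo, hi: integer);
    procedure Execute; override;
  end;

constructor TSumThread.Create(id, lo, hi: integer);
begin
  FId := id;
  FLo := lo;
  FHi := hi;
  inherited Create(false); // start immediately
end;

procedure TSumThread.Execute;
var
  i: integer;
  s: double;
begin
  s := 0.0;
  for i := FLo to FHi do
    s := s + A[i];   // each thread reduces only its own chunk
  Partial[FId] := s; // one writer per slot: no race, no lock needed
end;

var
  t: array[0..NbThreads-1] of TSumThread;
  i: integer;
  total: double;
begin
  for i := 1 to N do
    A[i] := 1.0;
  for i := 0 to NbThreads-1 do
    if i = NbThreads-1 then
      t[i] := TSumThread.Create(i, i*Chunk+1, N) // last chunk takes the remainder
    else
      t[i] := TSumThread.Create(i, i*Chunk+1, (i+1)*Chunk);
  total := 0.0;
  for i := 0 to NbThreads-1 do
  begin
    t[i].WaitFor;
    total := total + Partial[i]; // combine the partial sums
    t[i].Free;
  end;
  writeln('TOTAL = ', total:0:1);
end.
---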
aminer68@gmail.com: Dec 10 02:31PM -0800

Hello,

Read again, I corrected some typos in Lock Versus Lock-Free

Lock Versus Lock-Free

The class of problems that can be solved by lock-free approaches is limited. Furthermore, lock-free approaches can require restructuring a problem. As soon as multiple shared data structures are modified simultaneously, the only practical approach is to use a lock.

All lock-free dynamic-size data structures using CAA (compare-and-assign) require some form of garbage collector to lazily delete storage when it is no longer referenced. In languages with garbage collection, this capability comes for free (at the cost of garbage collection). For languages without garbage collection, the code is complex and error-prone in comparison with locks, requiring epoch-based reclamation, read-copy-update (RCU), or hazard pointers.

While better performance is claimed for lock-free data structures, there is no long-term evidence to support this claim. Many high-performance locking situations, e.g., operating-system kernels and databases, continue to use locking in various forms, even though a broad class of lock-free data structures is readily available.

While lock-free data structures cannot have deadlock, there is seldom deadlock when using locks for the simple class of problems solvable with lock-free approaches. For example, protecting basic data-structure operations with locks is usually very straightforward. Normally deadlock occurs when accessing multiple resources simultaneously, which is not a class of problems dealt with by lock-free approaches. Furthermore, disciplined lock usage, such as ranking locks to avoid deadlock, works well in practice and is not onerous for the programmer (see the lock-ranking sketch at the end of this post). Finally, some static-analysis tools are helpful for detecting deadlock scenarios.

Lock-free approaches have thread-kill tolerance, meaning no thread owns a lock, so any thread can terminate at an arbitrary point without leaving a lock in the closed state. However, within an application, thread kill is an unusual operation, and thread failure means an unrecoverable error or a major reset.

A lock-free approach always allows progress of other threads, whereas locks can cause delays if the lock owner is preempted. However, this issue is a foundational aspect of preemptive concurrency, and there are ways to mitigate it for locks using scheduler-activation techniques. Lock-free is not immune to delays either: if a page is evicted containing part of the lock-based or lock-free data, there is a delay. Hence, lock-free is no better than lock-based if the page fault occurs on frequently accessed shared data. Given the increasing number of processors and the large amount of memory on modern computers, neither of these delays should occur often.

Lock-free approaches are reentrant, and hence can be used in signal handlers, which are implicitly concurrent. Locking approaches cannot deal with this issue.

Lock-free approaches are claimed not to have priority inversion. However, inversion can occur because of the spinning required with atomic instructions like CAA, as the hardware does not provide a bound for spinning threads. Hence, a low-priority thread can barge ahead of a high-priority thread because the low-priority thread just happens to win the race at the CAA instruction. Essentially, priority inversion is a foundational aspect of preemptive concurrency and can only be mitigated.
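To make the unbounded-spinning point concrete, here is a minimal FreePascal sketch of the retry loop behind a typical lock-free update; whichever thread wins the compare-and-exchange race proceeds, regardless of its priority:

---
program cas_sketch;
{$mode objfpc}

// Lock-free counter increment: the classic CAA/CAS retry loop.
function LockFreeIncrement(var counter: longint): longint;
var
  old: longint;
begin
  repeat
    old := counter;
    // Retry until no other thread changed counter between our read and
    // our compare-and-exchange; the hardware puts no bound on retries,
    // so a low-priority thread can win this race over a high-priority one.
  until InterlockedCompareExchange(counter, old + 1, old) = old;
  LockFreeIncrement := old + 1;
end;

var
  c: longint = 0;
begin
  writeln(LockFreeIncrement(c)); // prints 1
end.
---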
The conclusion is that for an unmanaged programming language (i.e., no garbage collection), using classical locks is simple, efficient, and general, and causes issues only when the problem scales to multiple locks. For managed programming languages, lock-free data structures are easier to implement, but they only handle a specific set of problems, and the programmer must accept other idiosyncrasies, like pauses in execution for garbage collection.

Thank you,
Amine Moulay Ramdane.
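To illustrate the lock-ranking discipline mentioned in the post above, here is a minimal FreePascal sketch, assuming two shared resources protected by two critical sections that every thread acquires in the same global rank order:

---
program lockranking_sketch;
{$mode objfpc}

uses
  SysUtils, syncobjs;

var
  // Rank 1 must always be acquired before rank 2; any set of threads
  // that follows this global order cannot form a deadlock cycle.
  LockRank1, LockRank2: TCriticalSection;

procedure TransferWork;
begin
  LockRank1.Acquire;   // lower rank first...
  try
    LockRank2.Acquire; // ...then higher rank
    try
      // ... touch both shared resources here ...
    finally
      LockRank2.Release;
    end;
  finally
    LockRank1.Release;
  end;
end;

begin
  LockRank1 := TCriticalSection.Create;
  LockRank2 := TCriticalSection.Create;
  TransferWork;
  writeln('done, locks always taken in rank order');
  LockRank1.Free;
  LockRank2.Free;
end.
---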
aminer68@gmail.com: Dec 10 01:34PM -0800

Hello,

About memory safety and more..

I have just read the following webpage about memory safety:

Microsoft: 70 percent of all security bugs are memory safety issues

https://www.zdnet.com/article/microsoft-70-percent-of-all-security-bugs-are-memory-safety-issues/

And it says:

"Users who often read vulnerability reports come across terms over and over again. Terms like buffer overflow, race condition, page fault, null pointer, stack exhaustion, heap exhaustion/corruption, use after free, or double free --all describe memory safety vulnerabilities."

So as you will notice below, the following memory safety problems have been solved in Delphi.

And I have just read the following webpage about "Fearless Security: Memory safety":

https://hacks.mozilla.org/2019/01/fearless-security-memory-safety/

Here are the memory safety problems:

1- Misusing Free (use-after-free, double free): I have solved this in Delphi and FreePascal by inventing a "scalable" reference counting with efficient support for weak references. Read below about it.

2- Uninitialized variables: this can be detected by the compilers of Delphi and FreePascal.

3- Dereferencing null pointers: I have solved this in Delphi and FreePascal by inventing a "scalable" reference counting with efficient support for weak references. Read below about it.

4- Buffer overflow and underflow: this has been solved in Delphi by using madExcept, read here about it:

http://help.madshi.net/DebugMm.htm

You can buy it from here:

http://www.madshi.net/

There also remains the stack exhaustion memory safety problem, and here is how to detect it in Delphi: call the function DoStackOverflow below once from your code and you will get the EStackOverflow error raised by Delphi with the message "stack overflow", and you can print the line of the source code where EStackOverflow is raised with JCLDebug and such:

----
// Unbounded recursion: each call consumes stack until Delphi raises
// EStackOverflow, which you can then catch and report.
function DoStackOverflow: integer;
begin
  result := 1 + DoStackOverflow;
end;
----

About my scalable algorithm inventions..

I am a white Arab, and I am a gentleman type of person, and I think that you also know me by the poetry that I wrote in front of you and posted here, but I am also a more serious computer developer, and I am also an inventor who has invented many scalable algorithms; read about them in my writing below.

Here is my latest scalable algorithm invention; read what I have just responded in comp.programming.threads:

About my LRU scalable algorithm..

On 10/16/2019 7:48 AM, Bonita Montero wrote on comp.programming.threads:

> in locked mode in very rare cases. And as I said inserting and
> flushing is conventional locked access.
> So the quest is for you: Can you guess what I did?

And here is what I have just responded:

I think I am also smart, so I have just quickly found a solution that is scalable and that is not your solution: it needs my hashtable that is scalable and my fully scalable FIFO queue that I have invented. And I think I will not patent it. My solution is not lock-free; it uses locks in a lock-striping manner and it is scalable.

And read about my other scalable algorithm inventions in my writing below.

About the buffer overflow problem..

I wrote yesterday about buffer overflow in Delphi and FreePascal..
I think there is a "higher" abstraction in Delphi and FreePascal that does the job of avoiding buffer overflow very well: the TMemoryStream class, since it also behaves like a pointer and supports the equivalent of ReallocMem() and FreeMem() on the pointer but at a higher level of abstraction. Look for example at my following example in Delphi and FreePascal; you will notice that, contrary to raw pointers, the memory stream grows as needed with WriteBuffer() without the need to reserve the memory up front, and this is why it avoids the buffer overflow problem. Read the following example to notice how I am using it with a PAnsiChar type:

========================================
program test;

uses
  system.classes, system.sysutils;

var
  P: PAnsiChar;
  mem: TMemoryStream;
begin
  P := 'Amine';
  mem := TMemoryStream.Create;
  mem.Position := 0;
  // WriteBuffer() grows the stream as needed, so there is no
  // fixed-size buffer to overflow; 6 bytes = 'Amine' + #0.
  mem.WriteBuffer(pointer(P)^, 6);
  mem.Position := 0;
  writeln(PAnsiChar(mem.Memory));
  mem.Free;
end.
===================================

And since Delphi and FreePascal also detect buffer overflow on dynamic arrays (with range checking enabled; see the sketch below), I think that Delphi and FreePascal are powerful tools. Read my previous thoughts above about the Mozilla list of memory safety problems to understand more.

And about race conditions and deadlock problems and more, read my following thoughts to understand:

I will reformulate more smartly what race condition detection in Rust is about, so read carefully: you can think of the borrow checker of Rust as a validator for a locking system: immutable references are shared read locks and mutable references are exclusive write locks. Under this mental model, accessing data via two independent write locks is not a safe thing to do, and modifying data via a write lock while there are readers alive is not safe either.

So as you are noticing, the "mutable" references in Rust follow the read-write lock pattern, and this is not good, because it is not like more fine-grained parallelism that permits us to run the writes in "parallel" and gain more performance from parallelizing the writes.

Read more about Rust and Delphi and my inventions..

I think the spirit of Rust is like the spirit of Ada: they are especially designed for very high standards of safety, like those of Ada. "But" I don't think we have to fear the race conditions that Rust solves, because I think that race conditions are not so difficult to avoid when you are a decent, knowledgeable programmer in parallel programming, so you have to understand what I mean. Now we have to talk about the rest of the safety guarantees of Rust. There remains the problem of deadlocks, and I think that Rust does not solve this problem, but I have provided you with an enhanced DelphiConcurrent library for Delphi and FreePascal that detects deadlocks. And there are also the memory safety guarantees of Rust; here they are:

1- No null pointer dereferences
2- No dangling pointers
3- No buffer overruns

But notice that I have solved number 1 and number 2 by inventing my scalable reference counting with efficient support for weak references for Delphi and FreePascal (read below to notice it), and for number 3 read my thoughts above about buffer overflow.

More about research and software development..

I have just looked at the following new video:

Why is coding so hard...

https://www.youtube.com/watch?v=TAAXwrgd1U8

I understand this video, but I have to explain my work: I am not like this techlead in the video above, because I am also an "inventor" who has invented many scalable algorithms and their implementations, and I am also inventing effective abstractions. I give you an example: read the following from the senior research scientist called Dave Dice:

Preemption tolerant MCS locks

https://blogs.oracle.com/dave/preemption-tolerant-mcs-locks

As you are noticing, he is trying to invent a new lock that is preemption tolerant, but his lock lacks some important characteristics. This is why I have just invented a new Fast Mutex that is adaptive and that is much, much better, and I think mine is the "best", and I think you will not find it anywhere. My new Fast Mutex has the following characteristics:

1- Starvation-free
2- Good fairness
3- It efficiently keeps the cache-coherence traffic very low
4- Very good fast-path performance (it has the same performance as the scalable MCS lock when there is contention)
5- And it has a decent preemption tolerance

This is how I am an "inventor", and I have also invented other scalable algorithms such as a scalable reference counting with efficient support for weak references, a fully scalable Threadpool, a fully scalable FIFO queue, and other scalable algorithms and their implementations, and I think I will sell some of them to Microsoft or to Google or Embarcadero or such software companies. Read my following writing to know me more:

More about computing and parallel computing..

The important guarantees of memory safety in Rust are the three listed above. I think I have solved null pointer dereferences and dangling pointers, and also memory leaks, for Delphi and FreePascal by inventing my "scalable" reference counting with efficient support for weak references, and I have implemented it in Delphi and FreePascal (read about it below); reference counting in Rust and C++ is "not" scalable. About number 3 above, buffer overruns, read my thoughts above about what a buffer overflow is and how to avoid it in Delphi.
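About the dynamic-array overflow detection mentioned above: it relies on the compiler's range checking being enabled. Here is a minimal FreePascal sketch, assuming the {$R+} switch is on:

---
program rangecheck_sketch;
{$mode objfpc}
{$R+} // enable range checking: out-of-bounds access raises ERangeError

uses
  SysUtils;

var
  a: array of integer;
  i: integer;
begin
  SetLength(a, 4);
  try
    for i := 0 to 4 do // off-by-one: a[4] is out of bounds
      a[i] := i;
  except
    on E: ERangeError do
      writeln('buffer overflow caught: ', E.Message);
  end;
end.
---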
About deadlock and race conditions in Delphi and FreePascal:

I have ported DelphiConcurrent to FreePascal, and I have also extended both with the support of my scalable RWLocks for Windows and Linux and with the support of my scalable lock called MLock for Windows and Linux, and I have also added the support for a Mutex for Windows and Linux. Please look inside the DelphiConcurrent.pas and FreepascalConcurrent.pas files inside the zip file to understand more. You can download DelphiConcurrent and FreepascalConcurrent for Delphi and FreePascal from:

https://sites.google.com/site/scalable68/delphiconcurrent-and-freepascalconcurrent

DelphiConcurrent and FreepascalConcurrent by Moualek Adlene are a new way to build Delphi applications which involve parallel executed code based on threads, like application servers. DelphiConcurrent provides the programmer with internal mechanisms to write safer multithreaded code while taking special care of performance and genericity.

In concurrent applications a deadlock may occur when two or more threads try to lock two or more consecutive shared resources in a different order. With DelphiConcurrent and FreepascalConcurrent, a deadlock is detected and automatically skipped before it occurs, and the programmer gets an explicit exception describing the multithreading problem instead of a blocking deadlock which freezes the application with no output log (and perhaps also the linked client sessions, if we talk about an application server). Amine Moulay Ramdane has extended them with the support of his scalable RWLocks for Windows and Linux and with the support of his scalable lock called MLock for Windows and Linux, and he has also added the support for a Mutex for Windows and Linux; please look inside the DelphiConcurrent.pas and FreepascalConcurrent.pas files to understand more, and please read the HTML file inside to learn how to use it.

About race conditions now:

My scalable Adder is here..

As you have noticed, I have just posted my modified versions of DelphiConcurrent and FreepascalConcurrent to deal with deadlocks in parallel programs. But I have just read the following about how to avoid race conditions in parallel programming in most cases:

https://vitaliburkov.wordpress.com/2011/10/28/parallel-programming-with-delphi-part-ii-resolving-race-conditions/

This is why I have invented my following powerful scalable Adder to help you do the same as the above; please take a look at its source code to understand more, here it is:

https://sites.google.com/site/scalable68/scalable-adder-for-delphi-and-freepascal

Other than that, about the composability of lock-based systems now:

Design your systems to be composable. Among the more galling claims of the detractors of lock-based systems is the notion that they are somehow uncomposable: "Locks and condition variables do not support modular programming," reads one typically brazen claim, "building large programs by gluing together smaller programs[:] locks make this impossible." [9] The claim, of course, is incorrect. For evidence one need only point at the composition of lock-based systems such as databases and operating systems into larger systems that remain entirely unaware of lower-level locking.

There are two ways to make lock-based systems completely composable, and each has its own place. First (and most obviously), one can make locking entirely internal to the subsystem. For example, in concurrent operating systems, control never returns to user level with in-kernel locks held; the locks used to implement the system itself are entirely behind the system call interface that constitutes the interface to the system. More generally, this model can work whenever a crisp interface exists between software components: as long as control flow is never returned to the caller with locks held, the subsystem will remain composable.

Second (and perhaps counterintuitively), one can achieve concurrency and composability by having no locks whatsoever. In this case, there must be no global subsystem state; subsystem state must be captured in per-instance state, and it must be up to consumers of the subsystem to assure that they do not access their instance in parallel. By leaving locking up to the client of the subsystem, the subsystem itself can be used concurrently by different subsystems and in different contexts. A concrete example of this is the AVL tree implementation used extensively in the Solaris kernel. As with any balanced binary tree, the implementation is sufficiently complex to merit componentization, but by not having any global state, the implementation may be used concurrently by disjoint subsystems; the only constraint is that manipulation of a single AVL tree instance must be serialized.

Read more here:

https://queue.acm.org/detail.cfm?id=1454462

And about the message-passing process communication model and the shared-memory process communication model: an advantage of the shared memory model is that memory communication is faster than the message passing model on the same machine. Read the following to notice it:

Why did Windows NT move away from the microkernel?

"The main reason that Windows NT became a hybrid kernel is speed. A microkernel-based system puts only the bare minimum system components in the kernel and runs the rest of them as user mode processes, known as servers. A form of inter-process communication (IPC), usually message passing, is used for communication between servers and the kernel. Microkernel-based systems are more stable than others; if a server crashes, it can be restarted without affecting the entire system, which couldn't be done if every system component was part of the kernel. However, because of the overhead incurred by IPC and context-switching, microkernels are slower than traditional kernels. Due to the performance costs of a microkernel,
aminer68@gmail.com: Dec 10 01:30PM -0800

Hello,

About the Linux sys_membarrier() expedited and the Windows FlushProcessWriteBuffers()..

I have just read the following webpage:

https://lwn.net/Articles/636878/

And it is interesting and it says:

---
Results in liburcu:

Operations in 10s, 6 readers, 2 writers:

memory barriers in reader:    1701557485 reads, 3129842 writes
signal-based scheme:          9825306874 reads,    5386 writes
sys_membarrier expedited:     6637539697 reads,  852129 writes
sys_membarrier non-expedited: 7992076602 reads,     220 writes
---

Look at how powerful "sys_membarrier expedited" is.

So as you have noticed, I have already implemented my following scalable asymmetric RWLocks that use the Windows FlushProcessWriteBuffers(); they are called Fast_RWLockX and LW_Fast_RWLockX and they are limited to 400 threads, but you can manually extend the maximum number of threads by setting the NbrThreads parameter of the constructor. You have to start your threads once and for all and work with all of them; don't start a thread each time and then exit from it.

Fast_RWLockX and LW_Fast_RWLockX don't use any atomic operations and/or StoreLoad-style memory barriers on the reader side, so they are scalable and very fast, and I will soon port them to Linux, where they will support both sys_membarrier expedited and sys_membarrier non-expedited.

You can download my inventions of scalable asymmetric RWLocks that use IPIs and that are costless on the reader side from here:

https://sites.google.com/site/scalable68/scalable-rwlock

Cache-coherency protocols do not use IPIs, and as a user-space developer you do not care about IPIs at all; one is most interested in the cost of cache coherency itself. However, the Win32 API provides a function that issues IPIs to all processors (in the affinity mask of the current process): FlushProcessWriteBuffers(). You can use it to investigate the cost of IPIs. When I did a simple synthetic test on a dual-core machine, I obtained the following numbers:

420 cycles is the minimum cost of the FlushProcessWriteBuffers() function on the issuing core.

1600 cycles is the mean cost of the FlushProcessWriteBuffers() function on the issuing core.

1300 cycles is the mean cost of the FlushProcessWriteBuffers() function on the remote core.

Note that, as far as I understand, the function issues an IPI to the remote core, the remote core acks it with another IPI, and the issuing core waits for the ack IPI and then returns. And the IPIs have the indirect cost of flushing the processor pipeline.

Thank you,
Amine Moulay Ramdane.
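As a minimal illustration of the Windows side, here is how FlushProcessWriteBuffers() can be declared and called from FreePascal; this shows only the writer-side call, not my RWLock algorithm, and it assumes Windows Vista or later, since that is when kernel32.dll gained this export:

---
program flushwb_sketch;
{$mode objfpc}

// FlushProcessWriteBuffers() takes no parameters and returns nothing;
// it sends an IPI to every core in the process's affinity mask and
// waits until each has drained its store buffer.
procedure FlushProcessWriteBuffers; stdcall;
  external 'kernel32.dll' name 'FlushProcessWriteBuffers';

begin
  // A writer in an asymmetric scheme pays for this heavy barrier so
  // that readers can omit atomics and StoreLoad fences entirely.
  FlushProcessWriteBuffers;
  writeln('issued a process-wide write-buffer flush');
end.
---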
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page. To unsubscribe from this group and stop receiving emails from it send an email to comp.programming.threads+unsubscribe@googlegroups.com. |