Sunday, April 25, 2021

Digest for comp.programming.threads@googlegroups.com - 10 updates in 5 topics

Amine Moulay Ramdane <aminer68@gmail.com>: Apr 24 06:12PM -0700

Hello...
 
 
I invite you to look at the following interesting video:
 
Why has TAIWAN become the most IMPORTANT country in the world? - VisualPolitik EN
 
https://www.youtube.com/watch?v=U1jN0Dizn9M
 
 
Thank you,
Amine Moulay Ramdane.
Amine Moulay Ramdane <aminer68@gmail.com>: Apr 24 05:23PM -0700

Hello,
 
 
More philosophy about whether we have to be positive..

I think we have to be positive, and look carefully at the following interesting video that explains it correctly:
 
https://www.youtube.com/watch?v=ZeeBUOGKz3I
 
And look carefully at the following interesting video to understand it correctly:
 
Exponential Progress: Can We Expect Mind-Blowing Changes In The Near Future
 
https://www.youtube.com/watch?v=HfM5HXpfnJQ&t=144s
 
 
Thank you,
Amine Moulay Ramdane.
Amine Moulay Ramdane <aminer68@gmail.com>: Apr 24 05:00PM -0700

Hello,
 
I am a white arab, and I think I am smart, since I have also
invented many scalable algorithms and other algorithms..

I am rapidly inventing, thinking about, and writing my following proverbs:
 
Here is my just new proverb:
 
"Even silence makes us advance, since a human life full of silence is not the right diversity as a balance that makes the good reliance"
 
Here is my other just new proverb, in English and French:
 
"Learn to lift your head with dignity because even the sea has threatening waves."
 
"Apprends à élever la tête avec dignité car même la mer a des vagues qui menacent."
 
And here are my other new proverbs; read them carefully:
 
https://groups.google.com/g/comp.programming.threads/c/w7wgcbkEEIQ
 
 
Thank you,
Amine Moulay Ramdane.
Amine Moulay Ramdane <aminer68@gmail.com>: Apr 24 04:42PM -0700

Hello,
 
Here is also my philosophy, and it is not the way of the C++ programming language, which is too complex:

In honor of Albert Einstein's birthday: "Everything should be made as simple as possible, but no simpler." This is one of the great quotes in science.
 
 
Thank you,
Amine Moulay Ramdane.
Amine Moulay Ramdane <aminer68@gmail.com>: Apr 24 01:45PM -0700

Hello,
 
 
More about WaitAny() and WaitAll() and more..
 
I am a white arab, and I think I am smart, since I have also
invented many scalable algorithms and other algorithms..
 
Look at the following concurrency abstractions of Microsoft:
 
https://docs.microsoft.com/en-us/dotnet/api/system.threading.tasks.task.waitany?view=netframework-4.8
 
https://docs.microsoft.com/en-us/dotnet/api/system.threading.tasks.task.waitall?view=netframework-4.8
 
They look like the following WaitForAny() and WaitForAll() of Delphi; here they are:
 
http://docwiki.embarcadero.com/Libraries/Sydney/en/System.Threading.TTask.WaitForAny
 
http://docwiki.embarcadero.com/Libraries/Sydney/en/System.Threading.TTask.WaitForAll
 
So WaitForAll() is easy, and I have implemented it in my Threadpool engine that scales very well and that I have invented. You can read the HTML tutorial inside its zip file to learn how to do it; you can download it from my website here:
 
https://sites.google.com/site/scalable68/an-efficient-threadpool-engine-with-priorities-that-scales-very-well
 
And as for WaitForAny(), you can also do it using my SemaMonitor,
and I will soon give you an example of how to do it; you can download my SemaMonitor invention from my website here:
 
https://sites.google.com/site/scalable68/semacondvar-semamonitor
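Here is a minimal sketch of the WaitForAny() idea using only standard C++ primitives (this is not my SemaMonitor's code, just an illustration of the pattern): each task signals a shared condition variable when it completes, and the waiter blocks until the first completion and returns its index.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Wait until any one of the given tasks finishes and return its index.
// Each task notifies a shared condition variable on completion, which is
// the same idea as waking the waiter through a semaphore/monitor object.
int WaitForAnyTask(const std::vector<std::function<void()>>& tasks) {
    std::mutex m;
    std::condition_variable cv;
    int finished = -1;                  // index of the first task to finish

    std::vector<std::thread> threads;
    for (int i = 0; i < (int)tasks.size(); ++i) {
        threads.emplace_back([&, i] {
            tasks[i]();                 // run the task body
            std::lock_guard<std::mutex> g(m);
            if (finished == -1) finished = i;  // record the first finisher
            cv.notify_one();            // wake the waiter (no busy-waiting)
        });
    }

    int result;
    {   // block until at least one task has completed
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return finished != -1; });
        result = finished;
    }
    // Join the remaining tasks so no threads outlive the call; a real
    // WaitForAny would hand the still-running tasks back to a pool.
    for (auto& t : threads) t.join();
    return result;
}
```

The waiter burns no CPU cycles while blocked, since cv.wait() sleeps until notified.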

Here are my other just new software inventions..
 
I have just looked at the source code of the following multiplatform pevents library:
 
https://github.com/neosmart/pevents
 
And notice that its WaitForMultipleEvents() is implemented with pthreads,
but it is not scalable on multicores. So I have just invented a WaitForMultipleObjects() that looks like the Windows WaitForMultipleObjects() and that is fully "scalable" on multicores, that works on Windows, Linux, and Mac OS X, and that blocks when waiting for the objects, as WaitForMultipleObjects() does, so it doesn't consume CPU cycles while waiting; and it works with events, futures, and tasks.
 
Here are my other just new software inventions..
 
I have just invented a latch that is fully "scalable" on multicores and a
thread barrier that is fully scalable on multicores; they are really powerful.
 
Read about the C++ latches and thread barriers, which are not scalable
on multicores, here:
 
https://www.modernescpp.com/index.php/latches-and-barriers
 
 
Here are my other software inventions:
 
 
More about my scalable math Linear System Solver Library...
 
As you have just noticed, I have just spoken about my Linear System Solver Library (read below). Right now it scales very well, but I will
soon make it "fully" scalable on multicores using one of the scalable algorithms that I have invented, and I will extend it much more to also support efficient matrix operations that are scalable on multicores, and more. And since it will come with one of the scalable algorithms that I have invented, I think I will sell it too.
 
More about mathematics and about scalable Linear System Solver Libraries and more..
 
I have just noticed that a software architect from Austria
called Michael Rabatscher has designed and implemented the MrMath Library, which is also a parallelized Library:
 
Here he is:
 
https://at.linkedin.com/in/michael-rabatscher-6821702b
 
And here is his MrMath Library for Delphi and Freepascal:
 
https://github.com/mikerabat/mrmath
 
But I think that he is not so smart, and I think I am smart like
a genius, and I say that his MrMath Library is not scalable on multicores. Notice that the Linear System Solver of his MrMath Library is not scalable on multicores either, and that the threaded matrix operations of his Library are not scalable on multicores either. This is why I have invented a Conjugate Gradient Linear System Solver Library for C++, Delphi, and Freepascal that is scalable on multicores; here it is, and read about it in my following thoughts (I will also soon extend my Library further to support scalable matrix operations):
 
About SOR and Conjugate gradient mathematical methods..
 
I have just looked at SOR (the Successive Over-Relaxation method),
and I think it is much less powerful than the Conjugate Gradient method;
read the following to notice it:
 
COMPARATIVE PERFORMANCE OF THE CONJUGATE GRADIENT AND SOR METHODS
FOR COMPUTATIONAL THERMAL HYDRAULICS
 
https://inis.iaea.org/collection/NCLCollectionStore/_Public/19/055/19055644.pdf?r=1&r=1
 
 
This is why I have implemented, in both C++ and Delphi, my Parallel Conjugate Gradient Linear System Solver Library that scales very well; read my following thoughts about it to understand more:
 
 
About the convergence properties of the conjugate gradient method
 
The conjugate gradient method can theoretically be viewed as a direct method, as it produces the exact solution after a finite number of iterations, which is not larger than the size of the matrix, in the absence of round-off error. However, the conjugate gradient method is unstable with respect to even small perturbations, e.g., most directions are not in practice conjugate, and the exact solution is never obtained. Fortunately, the conjugate gradient method can be used as an iterative method, as it provides monotonically improving approximations to the exact solution, which may reach the required tolerance after a relatively small (compared to the problem size) number of iterations. The improvement is typically linear, and its speed is determined by the condition number κ(A) of the system matrix A: the larger κ(A) is, the slower the improvement.
 
Read more here:
 
http://pages.stat.wisc.edu/~wahba/stat860public/pdf1/cj.pdf
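The linear convergence mentioned above can be stated precisely with the standard error bound for the conjugate gradient method in the A-norm, where x_* is the exact solution, x_k the k-th iterate, and κ(A) the condition number:

```latex
\|x_k - x_*\|_A \;\le\; 2\left(\frac{\sqrt{\kappa(A)} - 1}{\sqrt{\kappa(A)} + 1}\right)^{k} \|x_0 - x_*\|_A
```

So the closer κ(A) is to 1, the faster the factor in parentheses shrinks and the fewer iterations are needed.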
 
 
So I think my Conjugate Gradient Linear System Solver Library
that scales very well is still very useful; read about it
in my writing below:
 
Read the following interesting news:
 
The finite element method finds its place in games
 
Read more here:
 
https://translate.google.com/translate?hl=en&sl=auto&tl=en&u=https%3A%2F%2Fhpc.developpez.com%2Factu%2F288260%2FLa-methode-des-elements-finis-trouve-sa-place-dans-les-jeux-AMD-propose-la-bibliotheque-FEMFX-pour-une-simulation-en-temps-reel-des-deformations%2F
 
But you have to be aware that the finite element method uses the Conjugate Gradient method for the solution of finite element problems; read here to notice it:
 
Conjugate Gradient Method for Solution of Large Finite Element Problems on CPU and GPU
 
https://pdfs.semanticscholar.org/1f4c/f080ee622aa02623b35eda947fbc169b199d.pdf
 
 
This is why I have also designed and implemented my Parallel Conjugate Gradient Linear System Solver library that scales very well;
here it is:
 
My Parallel C++ Conjugate Gradient Linear System Solver Library
that scales very well, version 1.76, is here..
 
Author: Amine Moulay Ramdane
 
Description:
 
This library contains a Parallel implementation of a Conjugate Gradient Dense Linear System Solver that is NUMA-aware and cache-aware and that scales very well, and it also contains a Parallel implementation of a Conjugate Gradient Sparse Linear System Solver that is cache-aware and that scales very well.
 
Sparse linear system solvers are ubiquitous in high performance computing (HPC) and often are the most computationally intensive parts of scientific computing codes. A few of the many applications relying on sparse linear solvers include fusion energy simulation, space weather simulation, climate modeling, environmental modeling, the finite element method, and large-scale reservoir simulations to enhance oil recovery in the oil and gas industry.
 
Conjugate Gradient is known to converge to the exact solution in n steps for a matrix of size n, and was historically first seen as a direct method because of this. However, it was later realized that it works really well if you just stop the iteration much earlier: often you will get a very good approximation after far fewer than n steps. In fact, we can analyze how fast Conjugate Gradient converges. The end result is that Conjugate Gradient is used as an iterative method for large linear systems today.
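To make the method concrete, here is a plain sequential conjugate gradient for a symmetric positive-definite system A x = b (an illustrative sketch only; my library's parallel, NUMA-aware and cache-aware implementation is of course more involved):

```cpp
#include <cmath>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// Sequential conjugate gradient for a symmetric positive-definite system
// A x = b, stopped when the residual norm falls below tol. This is the
// iterative use of CG described above: far fewer than n steps are usually
// needed for a good approximation.
Vec ConjugateGradient(const Mat& A, const Vec& b, double tol = 1e-10) {
    size_t n = b.size();
    Vec x(n, 0.0), r = b, p = b;       // x0 = 0, so r0 = b - A*x0 = b
    auto dot = [](const Vec& u, const Vec& v) {
        double s = 0;
        for (size_t i = 0; i < u.size(); ++i) s += u[i] * v[i];
        return s;
    };
    auto matvec = [&](const Vec& v) {  // dense matrix-vector product
        Vec y(n, 0.0);
        for (size_t i = 0; i < n; ++i)
            for (size_t j = 0; j < n; ++j) y[i] += A[i][j] * v[j];
        return y;
    };
    double rs = dot(r, r);
    for (size_t k = 0; k < n && std::sqrt(rs) > tol; ++k) {
        Vec Ap = matvec(p);
        double alpha = rs / dot(p, Ap);            // step length
        for (size_t i = 0; i < n; ++i) x[i] += alpha * p[i];
        for (size_t i = 0; i < n; ++i) r[i] -= alpha * Ap[i];
        double rs_new = dot(r, r);
        double beta = rs_new / rs;                 // conjugation coefficient
        for (size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rs = rs_new;
    }
    return x;
}
```

In a parallel version, the matrix-vector product and the dot products are the natural places to split work across cores, since they dominate the cost of each iteration.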
 
Please download the zip file and read the readme file inside the zip to know how to use it.
 
You can download it from:
 
https://sites.google.com/site/scalable68/scalable-parallel-c-conjugate-gradient-linear-system-solver-library
 
Language: GNU C++ and Visual C++ and C++Builder
 
Operating Systems: Windows, Linux, Unix, and Mac OS X (on x86)
 
 
--
 
Thread Barrier for Delphi and Freepascal version 1.0 is here..
 
I have added my condition variable implementation and my scalable Lock called the scalable MLock, both of which work on both Windows and Linux, and I have made the Thread Barrier work on both Windows and Linux. Now you can pass a parameter to the constructor of the Thread Barrier: ctMutex to use a Mutex, ctMLock to use my scalable Lock called MLock, or ctCriticalSection to use a Critical Section.
 
You can download it from my website here:
 
https://sites.google.com/site/scalable68/thread-barrier-for-delphi-and-freepascal
 
Yet more precision about my inventions that are my SemaMonitor and SemaCondvar and my Monitor..
 
My inventions, the SemaMonitor and SemaCondvar, are fast-pathed: when the count of my SemaMonitor or my SemaCondvar is greater than 0, the wait() method stays in user mode and doesn't switch from user mode to kernel mode, which costs around 1500 CPU cycles and is expensive. The signal() method is also fast-pathed when there is no item in the queue and the count is less than MaximumCount. Read here about the cost (in CPU cycles) of switching between Windows user mode and kernel mode:
 
https://stackoverflow.com/questions/1368061/whats-the-cost-in-cycles-to-switch-between-windows-kernel-and-user-mode#:~:text=1%20Answer&text=Switching%20from%20%E2%80%9Cuser%20mode%E2%80%9D%20to,rest%20is%20%22kernel%20overhead%22.
 
You can read about and download my inventions of SemaMonitor and SemaCondvar from here:
 
https://sites.google.com/site/scalable68/semacondvar-semamonitor
 
And the lightweight version is here:
 
https://sites.google.com/site/scalable68/light-weight-semacondvar-semamonitor
 
And I have implemented an efficient Monitor over my SemaCondvar.
 
Here is the description of my efficient Monitor inside the Monitor.pas file that you will find inside the zip file:
 
Description:
 
This is my implementation of a Monitor over my SemaCondvar.
 
You will find the Monitor class inside the Monitor.pas file inside the zip file.
 
When you set the first parameter of the constructor to true, the signal will not be lost if the threads are not waiting with the wait() method; but when you set the first parameter of the constructor to false, if the threads are not waiting with the wait() method, the signal will be lost..
 
The second parameter of the constructor is the kind of Lock: you can set it to ctMLock to use my scalable node-based lock called MLock, to ctMutex to use a Mutex, or to ctCriticalSection to use the TCriticalSection.
 
Here are the methods of my efficient Monitor that I have implemented:
 
TMonitor = class
 private
  cache0:typecache0;
  lock1:TSyncLock;
  obj:TSemaCondvar;
  cache1:typecache0;

 public
  constructor Create(bool:boolean=true;lock:TMyLocks=ctMLock);
  destructor Destroy; override;
  procedure Enter();
  procedure Leave();
  function Signal():boolean;overload;
  function Signal(nbr:long;var remains:long):boolean;overload;
  procedure Signal_All();
  function Wait(const AMilliseconds:longword=INFINITE): boolean;
  function WaitersBlocked():long;
 end;
 
 
The wait() method is for the threads to wait on the Monitor object for
the signal to be signaled. If wait() fails, it can be because the number
of waiters is greater than high(longword).

And the signal() method will signal one waiting thread on the
Monitor object; if signal() fails, the returned value is false.

The signal_all() method will signal all the waiting threads on
the Monitor object.

The signal(nbr:long;var remains:long) method will signal nbr
waiting threads; if signal() fails, the remaining number of signals
that were not delivered will be returned in the remains variable.

And WaitersBlocked() will return the number of threads waiting on
the Monitor object.

And the Enter() and Leave() methods enter and leave the monitor's Lock.
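As a rough C++ analogue of the usage pattern above (a sketch with standard primitives, not a translation of my Monitor's code): Enter()/Leave() correspond to locking the monitor's mutex, Wait() releases the lock and blocks until signaled, and Signal()/Signal_All() wake one or all waiters. Storing the condition in a boolean reproduces the "signal is not lost" behaviour selected by the first constructor parameter.

```cpp
#include <condition_variable>
#include <mutex>

// A monitor guarding one boolean condition. Because the state is kept in
// `ready`, a signal delivered before any thread waits is not lost: a
// later Wait() sees ready == true and returns immediately.
struct Monitor {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;               // the condition guarded by the monitor

    void WaitReady() {
        std::unique_lock<std::mutex> lk(m);    // Enter()
        cv.wait(lk, [this] { return ready; }); // Wait() until signaled
    }                                          // Leave() on scope exit
    void SignalReady() {
        {
            std::lock_guard<std::mutex> g(m);  // Enter()
            ready = true;                      // persistent state: the
        }                                      // signal is not lost
        cv.notify_all();                       // Signal_All()
    }
};
```

Dropping the `ready` flag and calling notify_one() alone would give the "signal is lost if nobody waits" behaviour of the false constructor parameter.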
 
 
You can download the zip files from:
 
https://sites.google.com/site/scalable68/semacondvar-semamonitor
 
and the lightweight version is here:
 
https://sites.google.com/site/scalable68/light-weight-semacondvar-semamonitor
 
 
More about my powerful invention of a scalable reference counting algorithm, and about my other scalable algorithms..
 
I invite you to read the following web page:
 
Why is memory reclamation so important?
 
https://concurrencyfreaks.blogspot.com/search?q=resilience+and+urcu
 
Notice that it is saying the following about RCU:
 
"Reason number 4, resilience
 
Another reason to go with lock-free/wait-free data structures is because they are resilient to failures. On a shared memory system with multiples processes accessing the same data structure, even if one of the processes dies, the others will be able to progress in their work. This is the true gem of lock-free data structures: progress in the presence of failure. Blocking data structures (typically) do not have this property (there are exceptions though). If we add a blocking memory reclamation (like URCU) to a lock-free/wait-free data structure, we are loosing this resilience because one dead process will prevent further memory reclamation and eventually bring down the whole system.
There goes the resilience advantage out the window."
 
So I think that RCU cannot be used for reference counting,
since it is blocking on the writer side; it is not resilient to failures because it is not lock-free on the writer side.
 
So this is why I have invented my powerful Scalable reference counting with efficient support for weak references, which is lock-free in its scalable reference counting; here it is:
 
https://sites.google.com/site/scalable68/scalable-reference-counting-with-efficient-support-for-weak-references
 
And my scalable reference counting algorithm is in the SCU(0,1) class of algorithms, so under scheduling conditions which approximate those found in commercial hardware architectures, it becomes wait-free with a system latency of O(sqrt(k)) and an individual latency of O(k*sqrt(k)), where k is the number of threads.
 
The proof is in the following PhD paper:
 
https://arxiv.org/pdf/1311.3200.pdf
 
This paper suggests a simple solution to this problem. We show that, for a large class of lock-free algorithms, under scheduling conditions which approximate those found in commercial hardware architectures, lock-free algorithms behave as if they are wait-free. In other words, programmers can keep on designing simple lock-free algorithms instead of complex wait-free ones, and in practice, they will get wait-free progress. It says, in the analysis of the class SCU(q, s):
 
"Given an algorithm in SCU(q, s) on k correct processes under a uniform stochastic scheduler, the system latency is O(q + s*sqrt(k)), and the individual latency is O(k*(q + s*sqrt(k)))."
 
More precision about my new inventions of scalable algorithms..
 
And look at my powerful inventions below, LW_Fast_RWLockX and Fast_RWLockX, which are two powerful scalable RWLocks that are FIFO fair
and starvation-free and costless on the reader side
(that means with no atomics and no fences on the reader side). They use sys_membarrier expedited on Linux and FlushProcessWriteBuffers() on Windows, and if you look at the source code of my LW_Fast_RWLockX.pas
and Fast_RWLockX.pas inside the zip file, you will notice that on Linux they call two functions, membarrier1() and membarrier2(); membarrier1() registers the process's intent to use
Bonita Montero <Bonita.Montero@gmail.com>: Apr 25 12:10AM +0200

> https://sites.google.com/site/scalable68/an-efficient-threadpool-engine-with-priorities-that-scales-very-well
 
Having different priorities for the queue-items is absolutely
_wrong_! Instead, you should use multiple thread-pools whose
threads get different priorities, so that fairness is much more
accurately realised by the scheduler.
Amine Moulay Ramdane <aminer68@gmail.com>: Apr 24 03:50PM -0700

On Saturday, April 24, 2021 at 6:10:06 PM UTC-4, Bonita Montero wrote:
> _wrong_ ! Therefore you should use multiple thread-pools whose
> threads get different priorities and the fairness much more
> accurately realised by the scheduler.
 
 
I think you are wrong, since you should not manage the priorities of processes and threads
your way: it can introduce more complexity and can lead to priority inversion, so I think
that my methodology is good. I understand your objections, but I think that with your way
of thinking you can get into complexity and problems.
 
Thank you,
Amine Moulay Ramdane.
Amine Moulay Ramdane <aminer68@gmail.com>: Apr 24 04:00PM -0700

On Saturday, April 24, 2021 at 6:10:06 PM UTC-4, Bonita Montero wrote:
> _wrong_ ! Therefore you should use multiple thread-pools whose
> threads get different priorities and the fairness much more
> accurately realised by the scheduler.
 
I think I have answered your question before, but I will add something really important:
 
Your way of doing it is the wrong way, because you are coming from the C++ way of thinking,
and this way of thinking has a "tendency" to stupidly get into more complexity
that is more difficult to manage; my way of doing it is to look
to efficiently minimize complexity, so our philosophies are not the same.
And I will give you more logical proof by giving you some real-world examples
of this kind of mistake from PhD research, so that you understand.
 
Thank you,
Amine Moulay Ramdane.
Amine Moulay Ramdane <aminer68@gmail.com>: Apr 24 04:18PM -0700

On Saturday, April 24, 2021 at 6:10:06 PM UTC-4, Bonita Montero wrote:
> _wrong_ ! Therefore you should use multiple thread-pools whose
> threads get different priorities and the fairness much more
> accurately realised by the scheduler.
 
 
There is another problem with you getting into more complexity: you are
seeking to use multiple threadpools by giving each of them
a priority level, but what we are trying to do is to maximize parallelism.
With your methodology you are not maximizing parallelism, since you can
get into more thread switches if every threadpool has the same number of threads as the number of cores;
or, if you don't want to get into this problem of too many context switches, you will
get into another problem by not correctly maximizing parallelism.
 
 
Thank you,
Amine Moulay Ramdane.
Bonita Montero <Bonita.Montero@gmail.com>: Apr 25 01:41AM +0200

No, I'm not wrong. Having different scheduling properties is
a much more fine-grained way of having different priorities.
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page.
To unsubscribe from this group and stop receiving emails from it send an email to comp.programming.threads+unsubscribe@googlegroups.com.
