- And for a Happy new year here is my next poem - 1 Update
- About ParallelFor() and the grainsize.. - 1 Update
- Donald Knuth: Algorithms, Complexity, Life, and The Art of Computer Programming | AI Podcast - 1 Update
- More about sorting algorithms.. - 1 Update
- About Hardware Transactional Memory and my invention that is my powerful Fast Mutex - 1 Update
- Here are my new inventions that are my new variants of Scalable RWLocks that are powerful.. - 1 Update
- My inventions that are my SemaMonitor and my SemaCondvar were updated to version 2.3 - 1 Update
- Delphi Library that implements higher-order functions like Map, Reduce and Filter - 2 Updates
aminer68@gmail.com: Dec 31 03:29PM -0800 Hello, And for a Happy new year here is my next poem: As the fruits of the day and night The "insight" is thus coming to us with a delight Because we are knowing more about the evil fight Because we are knowing more how to make a beautiful bright Because we are knowing more how to make a beautiful light Because we are knowing more how to make a beautiful right Because we are knowing more how to beautifully invite Because we are knowing more how to be beautifully polite And that's alright my baby, that's alright ! Since that's also the way to unite ! Since it is like a beautiful flight ! Since it is like the words of the almight ! Thank you, Amine Moulay Ramdane. |
aminer68@gmail.com: Dec 31 02:30PM -0800 Hello,

About ParallelFor() and the grainsize..

I have just read the following web page:

Parallelization: Harder than it looks

https://www.jayconrod.com/posts/29/parallelization--harder-than-it-looks

Notice that MIT's Cilk uses a divide-and-conquer approach to cut the iteration range of the Parallel For down to the grainsize, here it is:

-------

void run_loop(int first, int last)
{
  if (last - first < grainsize)
  {
    for (int i = first; i < last; ++i) LOOP_BODY;
  }
  else
  {
    int mid = first + (last - first) / 2;
    cilk_spawn run_loop(first, mid);
    run_loop(mid, last);
  }
}

-------

But as you can notice, if I do a simulation of it by running my following Delphi program:

-------

program test;

var
 c,d:uint64;

begin
 c:=high(uint64);
 d:=0;
 repeat
  c:=c div 2;
  d:=d+1;
 until c<=1000;
 writeln(c);
 writeln(d);
end.

-------

So as you can notice, for a grainsize of 1000 the above Delphi program prints 511 for the final chunk size (after 55 halving steps): the repeated halving does not stop at the grainsize but overshoots it, so the chunks that actually get executed are only about half the requested grainsize. That means that the Cilk divide-and-conquer approach to reach the grainsize of the Parallel For is "not" good.

This is why you should take a look at my Threadpool engine with priorities that scales very well and that is really powerful: it scales very well on multicore and NUMA systems, and it comes with a ParallelFor() that also scales very well on multicore and NUMA systems. Take a look at its source code and notice that I am calculating the grainsize for ParallelFor() much more precisely and correctly than the Cilk divide-and-conquer approach, which is "not" good.

Read the rest:

Today I will talk about data dependency and parallel loops..

For a loop to be parallelized, every iteration must be independent of the others. One way to be sure of it is to execute the loop in the direction of the incremented index of the loop and in the direction of the decremented index of the loop, and verify that the results are the same.

A data dependency happens when memory is modified: a loop has a data dependency if an iteration writes a variable that is read or written in another iteration of the loop. There is no data dependency if only one iteration reads or writes a variable, or if many iterations read the same variable without modifying it. So these are the "general" rules.
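Here is a minimal self-contained example in Object Pascal to make these rules concrete (an illustrative sketch only, the arrays and loops below are hypothetical):

-------

program DependencyExample;

var
 A,B:array[1..10] of integer;
 I:integer;

begin
 for I:=1 to 10 do B[I]:=I;

 // No data dependency: iteration I writes only A[I] and reads only B[I],
 // so the iterations give the same result in any order and the loop can
 // be parallelized.
 for I:=1 to 10 do A[I]:=B[I]*2;

 // Data dependency: iteration I reads A[I-1], which is written by
 // iteration I-1, so this loop cannot be parallelized as written.
 for I:=2 to 10 do A[I]:=A[I-1]+B[I];

 writeln(A[10]);
end.

-------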
Now there remains to know how to construct the parallel for loop when there is an induction variable or when there is a reduction operation; I will give an example of each.

If we have the following (the code looks like Algol or modern Object Pascal):

IND:=0;
For I:=1 to N Do
Begin
 IND:=IND+I;
 A[I]:=B[IND];
End;

So as you can notice, IND is an induction variable, so to parallelize the loop you have to replace it with its closed form, like this:

For I:=1 to N Do
Begin
 IND:=(I*(I+1)) div 2;
 A[I]:=B[IND];
End;

Now for the reduction operation example: notice that my invention that is my Threadpool with priorities that scales very well (read about it below) supports a ParallelFor() that scales very well and that supports a "grainsize", and notice that the grainsize can be used in the ParallelFor() with a reduction operation, and that my following powerful scalable Adder is also used in this scenario, here it is:

https://sites.google.com/site/scalable68/scalable-adder-for-delphi-and-freepascal

So here is the example with a reduction operation in modern Object Pascal:

TOTAL:=0.0;
For I:=1 to N Do
Begin
 TOTAL:=TOTAL+A[I];
End;

So with my powerful scalable Adder and with my powerful invention that is my ParallelFor() that scales very well, you parallelize the above like this:

procedure test1(j:integer;ptr:pointer);
begin
 t.add(A[j]); // "t" is my scalable Adder object
end;

// Let's suppose that N is 100000
// In the following, 10000 is the grainsize
obj.ParallelFor(1,N,test1,10000,pointer(0));
TOTAL:=t.get();

And read the following to understand how to use the grainsize of my ParallelFor() that scales well:

About my ParallelFor() that scales very well and that uses my efficient Threadpool that scales very well: with ParallelFor() you have to:

1- Ensure sufficient work. Each iteration of a loop involves a certain amount of work, so you have to ensure a sufficient amount of work; read below about the "grainsize" that I have implemented.

2- Know about scheduling. In OpenMP, one basic characteristic of a loop schedule is whether it is static or dynamic:

• In a static schedule, the choice of which thread performs a particular iteration is purely a function of the iteration number and the number of threads. Each thread performs only the iterations assigned to it at the beginning of the loop.

• In a dynamic schedule, the assignment of iterations to threads can vary at runtime from one execution to another. Not all iterations are assigned to threads at the start of the loop; instead, each thread requests more iterations after it has completed the work already assigned to it.

But with my ParallelFor() that scales very well, since it is using my efficient Threadpool that scales very well, it uses round-robin scheduling and it also uses work stealing, so I think that this is sufficient. A toy sketch of this reduction-with-grainsize pattern follows just below.
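Here is the toy self-contained sketch (it is only a model and not my library: it starts one plain TThread per chunk of grainsize iterations, while my real Threadpool reuses pooled worker threads with round-robin scheduling and work stealing, and my scalable Adder replaces the simple combining loop at the end):

-------

program GrainsizeReductionSketch;

{$APPTYPE CONSOLE}
{$ifdef FPC}{$mode delphi}{$endif}

uses
 Classes;

const
 N = 100000;
 GRAINSIZE = 10000; // chunk size in loop iterations

type
 TSumThread = class(TThread)
 private
  FFirst,FLast:integer;
  FPartial:int64;
 public
  constructor Create(first,last:integer);
  procedure Execute; override;
  property Partial:int64 read FPartial;
 end;

var
 A:array[1..N] of integer;

constructor TSumThread.Create(first,last:integer);
begin
 inherited Create(true); // create suspended, Start() is called later
 FFirst:=first;
 FLast:=last;
end;

procedure TSumThread.Execute;
var
 i:integer;
begin
 // Each chunk computes a private partial sum: there is no data
 // dependency between the chunks.
 FPartial:=0;
 for i:=FFirst to FLast do
  FPartial:=FPartial+A[i];
end;

var
 threads:array of TSumThread;
 i,first,last,nbr:integer;
 total:int64;

begin
 for i:=1 to N do A[i]:=1;

 // Number of chunks: with N=100000 and GRAINSIZE=10000 we get 10
 // chunks; note that with 2000 iterations and a grainsize of 1000 we
 // would get only 2 chunks, which is the "grainsize too high reduces
 // parallelism" drawback described below.
 nbr:=(N+GRAINSIZE-1) div GRAINSIZE;
 SetLength(threads,nbr);
 first:=1;
 for i:=0 to nbr-1 do
 begin
  last:=first+GRAINSIZE-1;
  if last>N then last:=N;
  threads[i]:=TSumThread.Create(first,last);
  threads[i].Start;
  first:=last+1;
 end;

 // Combine the partial sums (the role played by the scalable Adder).
 total:=0;
 for i:=0 to nbr-1 do
 begin
  threads[i].WaitFor;
  total:=total+threads[i].Partial;
  threads[i].Free;
 end;

 writeln(total); // prints 100000
end.

-------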
Read the rest:

My Threadpool engine with priorities that scales very well is really powerful because it scales very well on multicore and NUMA systems, and it also comes with a ParallelFor() that scales very well on multicore and NUMA systems.

You can download it from:

https://sites.google.com/site/scalable68/an-efficient-threadpool-engine-with-priorities-that-scales-very-well

Here is the explanation of my ParallelFor() that scales very well. I have implemented it with the following method:

procedure ParallelFor(nMin,nMax:integer;aProc:TParallelProc;GrainSize:integer=1;Ptr:pointer=nil;pmode:TParallelMode=pmBlocking;Priority:TPriorities=NORMAL_PRIORITY);

The nMin and nMax parameters of ParallelFor() are the minimum and maximum integer values of the variable of the ParallelFor() loop, the aProc parameter of ParallelFor() is the procedure to call, and the GrainSize integer parameter of ParallelFor() is the following:

The grainsize sets a minimum threshold for parallelization. A rule of thumb is that grainsize iterations should take at least 100,000 clock cycles to execute. For example, if a single iteration takes 100 clocks, then the grainsize needs to be at least 1000 iterations. When in doubt, do the following experiment:

1- Set the grainsize parameter higher than necessary. The grainsize is specified in units of loop iterations. If you have no idea of how many clock cycles an iteration might take, start with grainsize=100,000. The rationale is that each iteration normally requires at least one clock per iteration. In most cases, step 3 will guide you to a much smaller value.

2- Run your algorithm.

3- Iteratively halve the grainsize parameter and see how much the algorithm slows down or speeds up as the value decreases.

A drawback of setting the grainsize too high is that it can reduce parallelism. For example, if the grainsize is 1000 and the loop has 2000 iterations, the ParallelFor() method distributes the loop across only two processors, even if more are available.

And you can pass a parameter to ParallelFor() as a pointer in Ptr, you can set the pmode parameter to pmBlocking so that ParallelFor() is blocking or to pmNonBlocking so that ParallelFor() is non-blocking, and the Priority parameter is the priority of ParallelFor(). Look inside the test.pas example to see how to use it.

Thank you, Amine Moulay Ramdane. |
aminer68@gmail.com: Dec 31 09:26AM -0800 Hello, I invite you to look at this new interesting video: Donald Knuth: Algorithms, Complexity, Life, and The Art of Computer Programming | AI Podcast https://www.youtube.com/watch?v=2BdBfsXbST8 Thank you, Amine Moulay Ramdane. |
aminer68@gmail.com: Dec 31 08:11AM -0800 Hello,

More about sorting algorithms..

I have just read the following webpage:

In-depth: Smoothsort vs. Timsort

https://www.gamasutra.com/view/news/172542/Indepth_Smoothsort_vs_Timsort.php

I think that since it is very "rare" that the data that is given is already sorted or reversed, and since you can notice in the above webpage that Timsort in reality gives only a small improvement on partially sorted data, I will choose to use my powerful Parallel Sort Library that is efficient, read about it below:

My Parallel Sort Library that is more efficient, version 4.03, is here..

Notice also in the source code that my Mergesort also uses insertion sort, in a Timsort manner, so it is very efficient.

You can download it from:

https://sites.google.com/site/scalable68/parallel-sort-library-that-is-more-efficient

I have come up with a "powerful" Parallel Sort library that is very efficient, and it comes with the source code, please read about it below:

Author: Amine Moulay Ramdane

Description: Parallel Sort Library that supports Parallel Quicksort, Parallel HeapSort and Parallel MergeSort on multicore systems.

Parallel Sort Library uses my Thread Pool Engine and sorts many parts of your array in parallel using Quicksort or HeapSort or MergeSort, and after that it finally merges them with the merge() procedure.

In the previous parallelsort version I had parallelized only the sort part, but in this new parallelsort version I have also parallelized the merge procedure part, and it gives better performance.

My new parallel sort algorithm has become more cache-aware, and I have done some benchmarks with my new parallel algorithm and it has given up to 5X scalability on a quadcore when sorting strings. Other than that I have cleaned up the code some more, and I think my Parallel Sort library has become a more professional and industrial parallel sort library; you can be confident because I have tested it thoroughly and no bugs have shown up, so I hope you will be happy with my new Parallel Sort library.

I have also included a "test.pas" example; just compile the "gendata.pas" inside the zip file and run it first, after that compile the "test.pas" example, run it and do your benchmarks.

I have implemented a parallel hybrid divide-and-conquer merge algorithm that performs 0.9-5.8 times better than sequential merge on a quad-core processor, with larger arrays outperforming by over 5 times. Parallel processing combined with a hybrid algorithm approach provides a powerful high performance result.

The algorithm for finding the median of the parallel merge of my Parallel Sort Library that you will find here in my website:

https://sites.google.com/site/scalable68/parallel-sort-library

is O(log(min(|A|,|B|))), where |A| is the size of A, since the binary search is performed within the smaller array and is O(log N). But the new algorithm for finding the median of the parallel merge of my Parallel Sort Library is O(log(|A|+|B|)), which is slightly worse. With further optimizations the order was reduced to O(log(2*min(|A|,|B|))), which is better, but is 2X more work, since both arrays may have to be searched. All algorithms are logarithmic. Two binary searches were necessary to find an even split that produced two equal or nearly equal halves. Luckily, this part of the merge algorithm is not performance critical. So, more effort can be spent looking for a better split.
This new algorithm in the parallel merge balances the recursive binary tree of the divide-and-conquer and improves the worst-case performance of parallel merge sort.

Why are we finding the median in the parallel algorithm? Here is my previous idea of finding the median, the one that is O(log(min(|A|,|B|))), to understand better:

Let's assume we want to merge sorted arrays X and Y. Select the median element X[m] in X. Elements in X[ .. m-1] are less than or equal to X[m]. Using binary search, find the index k of the first element in Y greater than X[m]. Thus Y[ .. k-1] are less than or equal to X[m] as well. Elements in X[m+1 .. ] are greater than or equal to X[m], and Y[k .. ] are greater. So merge(X, Y) can be defined as:

concat(merge(X[ .. m-1], Y[ .. k-1]), X[m], merge(X[m+1 .. ], Y[k .. ]))

Now we can recursively, in parallel, do merge(X[ .. m-1], Y[ .. k-1]) and merge(X[m+1 .. ], Y[k .. ]), and then concat the results.
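Here is a small self-contained Object Pascal sketch of this median split (an illustrative sketch only, not the code of my library: the two recursive sub-merges run sequentially here, whereas in the parallel merge they are executed in parallel):

-------

program MergeSplitSketch;

{$APPTYPE CONSOLE}
{$ifdef FPC}{$mode delphi}{$endif}

type
 TIntArray = array of integer;

// Binary search: the index of the first element in Y[lo..hi-1] that is
// greater than v (returns hi if there is none).
function FirstGreater(const Y:TIntArray;lo,hi,v:integer):integer;
var
 mid:integer;
begin
 while lo<hi do
 begin
  mid:=lo+(hi-lo) div 2;
  if Y[mid]<=v then lo:=mid+1 else hi:=mid;
 end;
 Result:=lo;
end;

// Merge the sorted ranges X[xlo..xhi-1] and Y[ylo..yhi-1] into R
// starting at index rlo, splitting around the median of X:
// merge(X,Y) = concat(merge(X[..m-1],Y[..k-1]), X[m], merge(X[m+1..],Y[k..]))
procedure MergeSplit(const X,Y:TIntArray;xlo,xhi,ylo,yhi:integer;
 var R:TIntArray;rlo:integer);
var
 m,k,i:integer;
begin
 if xlo>=xhi then // X range is empty: copy the Y range
 begin
  for i:=ylo to yhi-1 do R[rlo+i-ylo]:=Y[i];
  exit;
 end;
 if ylo>=yhi then // Y range is empty: copy the X range
 begin
  for i:=xlo to xhi-1 do R[rlo+i-xlo]:=X[i];
  exit;
 end;
 m:=(xlo+xhi) div 2;               // the median element is X[m]
 k:=FirstGreater(Y,ylo,yhi,X[m]);  // Y[ylo..k-1] are <= X[m]
 // The two recursive calls below are independent, so a parallel
 // version can execute them on different cores.
 MergeSplit(X,Y,xlo,m,ylo,k,R,rlo);
 R[rlo+(m-xlo)+(k-ylo)]:=X[m];
 MergeSplit(X,Y,m+1,xhi,k,yhi,R,rlo+(m-xlo)+(k-ylo)+1);
end;

var
 X,Y,R:TIntArray;
 i:integer;

begin
 SetLength(X,5);
 SetLength(Y,5);
 for i:=0 to 4 do
 begin
  X[i]:=2*i+1; // 1,3,5,7,9
  Y[i]:=2*i+2; // 2,4,6,8,10
 end;
 SetLength(R,Length(X)+Length(Y));
 MergeSplit(X,Y,0,Length(X),0,Length(Y),R,0);
 for i:=0 to High(R) do write(R[i],' '); // prints 1 2 3 4 5 6 7 8 9 10
 writeln;
end.

-------

Thank you, Amine Moulay Ramdane. |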
aminer68@gmail.com: Dec 31 07:44AM -0800 Hello,

About Hardware Transactional Memory and my invention that is my powerful Fast Mutex:

"As someone who has used TSX to optimize synchronization primitives, you can expect to see a ~15-20% performance increase, if (big if) your program is heavy on disjoint data access, i.e. a lock is needed for correctness, but conflicts are rare in practice. If you have a lot of threads frequently writing the same cache lines, you are probably going to see worse performance with TSX as opposed to traditional locking. It helps to think about TSX as transparently performing optimistic concurrency control, which is actually pretty much how it is implemented under the hood."

Read more here:

https://news.ycombinator.com/item?id=8169697

So as you can notice, HTM (hardware transactional memory) and TM cannot replace locks when doing IO and for highly contended critical sections; this is why I have invented my following powerful Fast Mutex, read about it below.

More about research and software development..

I have just looked at the following new video:

Why is coding so hard...

https://www.youtube.com/watch?v=TAAXwrgd1U8

I understand this video, but I have to explain my work: I am not like this techlead in the video above, because I am also an "inventor" who has invented many scalable algorithms and their implementations, and I am also inventing effective abstractions. I will give you an example:

Read the following from the senior research scientist called Dave Dice:

Preemption tolerant MCS locks

https://blogs.oracle.com/dave/preemption-tolerant-mcs-locks

As you can notice, he is trying to invent a new lock that is preemption tolerant, but his lock lacks some important characteristics. This is why I have just invented a new Fast Mutex that is adaptative and that is much, much better, and I think mine is the "best", and I think you will not find it anywhere. My new Fast Mutex has the following characteristics:

1- It is starvation-free
2- It has good fairness
3- It keeps the cache-coherence traffic efficiently very low
4- It has very good fast-path performance (it has the same performance as the scalable MCS lock when there is contention)
5- And it has a decent preemption tolerance

This is how I am an "inventor", and I have also invented other scalable algorithms such as a scalable reference counting with efficient support for weak references, and I have invented a fully scalable Threadpool, and I have also invented a fully scalable FIFO queue, and I have also invented other scalable algorithms and their implementations, and I think I will sell some of them to Microsoft or to Google or Embarcadero or such software companies.

And about composability of lock-based systems now:

Design your systems to be composable. Among the more galling claims of the detractors of lock-based systems is the notion that they are somehow uncomposable: "Locks and condition variables do not support modular programming," reads one typically brazen claim, "building large programs by gluing together smaller programs[:] locks make this impossible."9 The claim, of course, is incorrect. For evidence one need only point at the composition of lock-based systems such as databases and operating systems into larger systems that remain entirely unaware of lower-level locking.

There are two ways to make lock-based systems completely composable, and each has its own place. First (and most obviously), one can make locking entirely internal to the subsystem.
For example, in concurrent operating systems, control never returns to user level with in-kernel locks held; the locks used to implement the system itself are entirely behind the system call interface that constitutes the interface to the system. More generally, this model can work whenever a crisp interface exists between software components: as long as control flow is never returned to the caller with locks held, the subsystem will remain composable.

Second (and perhaps counterintuitively), one can achieve concurrency and composability by having no locks whatsoever. In this case, there must be no global subsystem state—subsystem state must be captured in per-instance state, and it must be up to consumers of the subsystem to assure that they do not access their instance in parallel. By leaving locking up to the client of the subsystem, the subsystem itself can be used concurrently by different subsystems and in different contexts. A concrete example of this is the AVL tree implementation used extensively in the Solaris kernel. As with any balanced binary tree, the implementation is sufficiently complex to merit componentization, but by not having any global state, the implementation may be used concurrently by disjoint subsystems—the only constraint is that manipulation of a single AVL tree instance must be serialized.
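Here is a minimal Object Pascal sketch of that second approach (an illustrative example only): the class below keeps no global state, so disjoint subsystems can each use their own instance without any locking, and consumers that share one instance serialize access with their own lock:

---

program PerInstanceStateSketch;

{$APPTYPE CONSOLE}
{$ifdef FPC}{$mode delphi}{$endif}

uses
 SyncObjs;

type
 // The subsystem: no global variables and no internal locks, all the
 // state is per-instance.
 TCounter = class
 private
  FValue:int64;
 public
  procedure Add(n:int64);
  function Get:int64;
 end;

procedure TCounter.Add(n:int64);
begin
 FValue:=FValue+n;
end;

function TCounter.Get:int64;
begin
 Result:=FValue;
end;

var
 shared:TCounter;
 clientLock:TCriticalSection; // the lock belongs to the client

begin
 shared:=TCounter.Create;
 clientLock:=TCriticalSection.Create;

 // A client that shares the instance between threads brackets the
 // calls with its own lock; a client with a private instance would
 // call Add() directly, with no lock at all.
 clientLock.Acquire;
 try
  shared.Add(1);
 finally
  clientLock.Release;
 end;

 writeln(shared.Get);
 clientLock.Free;
 shared.Free;
end.

---

Read more here:

https://queue.acm.org/detail.cfm?id=1454462

Thank you, Amine Moulay Ramdane. |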
aminer68@gmail.com: Dec 31 07:38AM -0800 Hello,

Here are my new inventions that are my new variants of Scalable RWLocks that are powerful..

Author: Amine Moulay Ramdane

Description: A fast and scalable and starvation-free and fair and lightweight Multiple-Readers-Exclusive-Writer Lock called LW_RWLockX (the scalable LW_RWLockX does spin-wait), and also a fast and scalable and starvation-free and fair Multiple-Readers-Exclusive-Writer Lock called RWLockX (the scalable RWLockX doesn't spin-wait but uses my portable SemaMonitor and portable event objects, so it is energy efficient).

The parameter of the constructors is the size of the array of the readers: if the size of the array is equal to the number of parallel readers, it will be scalable, but if the number of readers is greater than the size of the array, you will start to have contention. Please look at the source code of my scalable algorithms to understand.

I have used my following hash function to make my new variants of RWLocks scalable:

---

function DJB2aHash(key:int64):uint64;
var
 i: integer;
 key1:uint64;
begin
 Result := 5381;
 for i := 1 to 8 do
 begin
  key1:=(key shr ((i-1)*8)) and $00000000000000ff;
  Result := ((Result shl 5) xor Result) xor key1;
 end;
end;

---
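To show how this hash function spreads the readers across the array of the readers, here is a small illustrative sketch (the array size and the thread ids below are hypothetical; in the real locks the slot is derived from the id of the calling thread):

---

program ReaderSlotSketch;

{$APPTYPE CONSOLE}
{$ifdef FPC}{$mode delphi}{$endif}

const
 ARRAY_SIZE = 16; // the size of the array of the readers

function DJB2aHash(key:int64):uint64;
var
 i: integer;
 key1:uint64;
begin
 Result := 5381;
 for i := 1 to 8 do
 begin
  key1:=(key shr ((i-1)*8)) and $00000000000000ff;
  Result := ((Result shl 5) xor Result) xor key1;
 end;
end;

var
 tid:int64;

begin
 // Each reader hashes its thread id to pick a slot, so parallel
 // readers land on different slots (different cache lines) and the
 // cache-coherence traffic stays low.
 for tid:=1000 to 1007 do // hypothetical thread ids
  writeln('thread id ',tid,' -> reader slot ',DJB2aHash(tid) mod ARRAY_SIZE);
end.

---

You can download them from:

https://sites.google.com/site/scalable68/new-variants-of-scalable-rwlocks

Thank you, Amine Moulay Ramdane. |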
aminer68@gmail.com: Dec 31 07:37AM -0800 Hello,

My inventions that are my SemaMonitor and my SemaCondvar were updated to version 2.3; they have become efficient and powerful. Please read the readme file to know more about the changes, and notice that I have implemented an efficient Monitor over my SemaCondvar.

Here is the description of my efficient Monitor inside the Monitor.pas file that you will find inside the zip file:

Description: This is my implementation of a Monitor over my SemaCondvar. You will find the Monitor class inside the Monitor.pas file inside the zip file.

When you set the first parameter of the constructor to true, the signal will not be lost if the threads are not waiting with the wait() method, but when you set the first parameter of the constructor to false, the signal will be lost if the threads are not waiting with the wait() method.

The second parameter of the constructor is the kind of Lock: you can set it to ctMLock to use my scalable node-based lock called MLock, or to ctMutex to use a Mutex, or to ctCriticalSection to use the TCriticalSection.

Here are the methods of my efficient Monitor that I have implemented:

TMonitor = class
 private
  cache0:typecache0;
  lock1:TSyncLock;
  obj:TSemaCondvar;
  cache1:typecache0;
 public
  constructor Create(bool:boolean=true;lock:TMyLocks=ctMLock);
  destructor Destroy; override;
  procedure Enter();
  procedure Leave();
  function Signal():boolean;overload;
  function Signal(nbr:long;var remains:long):boolean;overload;
  procedure Signal_All();
  function Wait(const AMilliseconds:longword=INFINITE): boolean;
  function WaitersBlocked():long;
end;

The wait() method is for the threads to wait on the Monitor object for the signal to be signaled. If wait() fails, that can be because the number of waiters is greater than high(longword). The signal() method will signal one waiting thread on the Monitor object; if signal() fails, the returned value is false. The signal_all() method will signal all the waiting threads on the Monitor object. The signal(nbr:long;var remains:long) method will signal nbr waiting threads, but if signal() fails, the remaining number of signals that were not signaled will be returned in the remains variable. WaitersBlocked() will return the number of waiting threads on the Monitor object. And the Enter() and Leave() methods are to enter and leave the monitor's Lock.
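Here is a minimal usage sketch (illustrative only, assuming the Monitor.pas unit from the zip file and the method list above; the constructor's first parameter is true here, so the signal is not lost even if the waiter has not reached wait() yet):

---

program MonitorUsageSketch;

{$APPTYPE CONSOLE}
{$ifdef FPC}{$mode delphi}{$endif}

uses
 Classes, Monitor; // Monitor.pas from the zip file

type
 TWaiter = class(TThread)
  procedure Execute; override;
 end;

var
 m:TMonitor;
 shared:integer;

procedure TWaiter.Execute;
begin
 // Wait for the signal; wait() returns false on failure.
 if m.Wait() then
 begin
  m.Enter; // protect the shared data with the monitor's lock
  writeln('waiter read: ',shared);
  m.Leave;
 end;
end;

var
 w:TWaiter;

begin
 // true: the signal is not lost if no thread is waiting yet;
 // ctMLock: use the scalable node-based MLock.
 m:=TMonitor.Create(true,ctMLock);
 w:=TWaiter.Create(false);

 m.Enter;
 shared:=123;
 m.Leave;
 m.Signal(); // signal one waiting thread

 w.WaitFor;
 w.Free;
 m.Free;
end.

---

You can download the zip files from:

https://sites.google.com/site/scalable68/semacondvar-semamonitor

and the lightweight version is here:

https://sites.google.com/site/scalable68/light-weight-semacondvar-semamonitor

Thank you, Amine Moulay Ramdane. |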
aminer68@gmail.com: Dec 31 06:48AM -0800 Hello,

Delphi Library that implements higher-order functions like Map, Reduce and Filter

Author: Amine Moulay Ramdane

Description: Delphi Library that implements higher-order functions like Map, Reduce and Filter.

Map: This is available in many functional programming languages. It takes as arguments a function func and a list of elements, and returns a new list with func applied to each element of the list.

Reduce: This is also known as Fold. It requires a combining function, a starting point for a data structure, and possibly some default values to be used under certain conditions. The Reduce function proceeds to combine elements of the data structure using the injected function. It is used to perform operations on a set of values to get only one result (or a smaller set of values) that represents the reduction of that initial data. For example, the values 1, 2 and 3 can be reduced to the single value 6 using SUM.

Filter: This requires a data structure and a filter condition. It returns all the elements in the structure that match the filter condition.
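To make the three operations concrete, here is a small self-contained sketch of what they can look like in Object Pascal (an illustrative sketch only: the function names and signatures here are not necessarily the actual API of the library, see the demo inside the zip file for the real interface):

---

program MapReduceFilterSketch;

{$APPTYPE CONSOLE}
{$ifdef FPC}{$mode delphi}{$endif}

type
 TIntArray = array of integer;
 TMapFunc = function(x:integer):integer;
 TCombineFunc = function(acc,x:integer):integer;
 TPredicate = function(x:integer):boolean;

// Map: apply f to each element of a and return the new list.
function Map(const a:TIntArray;f:TMapFunc):TIntArray;
var
 i:integer;
begin
 SetLength(Result,Length(a));
 for i:=0 to High(a) do Result[i]:=f(a[i]);
end;

// Reduce (Fold): combine the elements of a with f, starting from start.
function Reduce(const a:TIntArray;f:TCombineFunc;start:integer):integer;
var
 i:integer;
begin
 Result:=start;
 for i:=0 to High(a) do Result:=f(Result,a[i]);
end;

// Filter: keep only the elements of a that match the condition p.
function Filter(const a:TIntArray;p:TPredicate):TIntArray;
var
 i,n:integer;
begin
 SetLength(Result,Length(a));
 n:=0;
 for i:=0 to High(a) do
  if p(a[i]) then
  begin
   Result[n]:=a[i];
   inc(n);
  end;
 SetLength(Result,n);
end;

function TimesTwo(x:integer):integer;
begin
 Result:=x*2;
end;

function Sum(acc,x:integer):integer;
begin
 Result:=acc+x;
end;

function IsOdd(x:integer):boolean;
begin
 Result:=odd(x);
end;

var
 a,b:TIntArray;
 i:integer;

begin
 SetLength(a,3);
 for i:=0 to 2 do a[i]:=i+1;      // a = 1,2,3

 b:=Map(a,TimesTwo);              // b = 2,4,6
 writeln(Reduce(a,Sum,0));        // prints 6, the SUM example above
 b:=Filter(a,IsOdd);              // b = 1,3
 for i:=0 to High(b) do write(b[i],' ');
 writeln;
end.

---

Please look at the Delphi demo inside the zip file called test_MapReduce.pas.

Language: Delphi 2009+

Operating Systems: Windows, Mac OSX, Linux

You can download it from:

https://sites.google.com/site/scalable68/delphi-library-that-implements-higher-order-functions-like-map-reduce-and-filter

Thank you, Amine Moulay Ramdane. |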
aminer68@gmail.com: Dec 31 07:16AM -0800

> https://sites.google.com/site/scalable68/delphi-library-that-implements-higher-order-functions-like-map-reduce-and-filter
> Thank you,
> Amine Moulay Ramdane.

Sorry, the Operating Systems supported are: Windows, Mac OSX, Linux, iOS and Android.

Thank you, Amine Moulay Ramdane. |