- My Parallel Sort Library that is more efficient version 4.03 is here.. - 1 Update
- Why Silicon Valley Wouldn't Work Without Immigrants - 1 Update
- More about Transactional Memory.. - 1 Update
- More about Hardware transactional memory - 1 Update
- Why HTM/STM in general will not allow concurrent execution in practice - 1 Update
aminer68@gmail.com: Sep 12 02:24PM -0700

Hello,

My Parallel Sort Library that is more efficient, version 4.03, is here.

You can download it from: https://sites.google.com/site/scalable68/parallel-sort-library-that-is-more-efficient

I have come up with a "powerful" Parallel Sort library that is very efficient, and it comes with the source code; please read about it below.

Author: Amine Moulay Ramdane

Description: A Parallel Sort Library that supports Parallel Quicksort, Parallel HeapSort, and Parallel MergeSort on multicore systems. The library uses my Thread Pool Engine to sort many parts of your array in parallel, using Quicksort, HeapSort, or MergeSort, and after that it finally merges them with the merge() procedure. In the previous version I had parallelized only the sort part; in this new version I have also parallelized the merge procedure, and it gives better performance.

My new parallel sort algorithm has become more cache-aware. I have run benchmarks with it, and it has given up to 5X scalability on a quadcore when sorting strings. Beyond that, I have cleaned up the code further, and I think my Parallel Sort library has become a more professional and industrial library. You can be confident because I have tested it thoroughly and no bugs have shown up, so I hope you will be happy with my new Parallel Sort library.

I have also included a "test.pas" example: first compile the "gendata.pas" program inside the zip file and run it, then compile the "test.pas" example, run it, and do your benchmarks.

I have implemented a parallel hybrid divide-and-conquer merge algorithm that performs 0.9 to 5.8 times better than sequential merge on a quad-core processor, with larger arrays outperforming by over 5 times. Parallel processing combined with a hybrid algorithm approach provides a powerful high-performance result.

My algorithm for finding the median in the parallel merge of my Parallel Sort Library, which you will find here on my website:

https://sites.google.com/site/scalable68/parallel-sort-library

is O(log(min(|A|,|B|))), where |A| is the size of A, since the binary search is performed within the smaller array and is O(log N). But the new algorithm for finding the median in the parallel merge of this library is O(log(|A|+|B|)), which is slightly worse. With further optimizations the order was reduced to O(log(2*min(|A|,|B|))), which is better, but is 2X more work, since both arrays may have to be searched. All of these algorithms are logarithmic. Two binary searches were necessary to find an even split that produced two equal or nearly equal halves. Luckily, this part of the merge algorithm is not performance-critical, so more effort can be spent looking for a better split. This new algorithm in the parallel merge balances the recursive binary tree of the divide-and-conquer and improves the worst-case performance of parallel merge sort.

Why are we finding the median in the parallel algorithm? Here is my previous idea of finding the median, the one that is O(log(min(|A|,|B|))), to understand it better:

Let's assume we want to merge sorted arrays X and Y. Select the median element X[m] of X. Elements in X[..m-1] are less than or equal to X[m]. Using binary search, find the index k of the first element in Y greater than X[m]. Thus Y[..k-1] are less than or equal to X[m] as well, elements in X[m+1..] are greater than or equal to X[m], and Y[k..] are greater. So merge(X, Y) can be defined as concat(merge(X[..m-1], Y[..k-1]), X[m], merge(X[m+1..], Y[k..])). Now we can recursively, in parallel, do merge(X[..m-1], Y[..k-1]) and merge(X[m+1..], Y[k..]) and then concatenate the results (a small illustrative sketch in C follows at the end of this post).

Thank you,
Amine Moulay Ramdane.
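To make the divide-and-conquer merge above concrete, here is a minimal sequential sketch in C. It is not the library's Pascal source (only the downloadable zip contains that); the names first_greater and dc_merge are illustrative only. The two tail recursions are independent, which is exactly where a parallel implementation would submit tasks to a thread pool.

/* dc_merge_sketch.c : sequential sketch of the divide-and-conquer merge. */
#include <stdio.h>

/* Binary search: index of the first element of y[lo..hi-1] that is
   greater than v; returns hi when there is none.  O(log n). */
static int first_greater(const int *y, int lo, int hi, int v)
{
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (y[mid] <= v)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Merge x[xlo..xhi-1] and y[ylo..yhi-1] into dst starting at dlo.
   The two tail recursions are independent of each other; a parallel
   implementation would hand them to a thread pool as separate tasks. */
static void dc_merge(const int *x, int xlo, int xhi,
                     const int *y, int ylo, int yhi,
                     int *dst, int dlo)
{
    if (xhi - xlo < yhi - ylo) {
        /* Keep x the larger range so the binary search below runs
           over the smaller one: the O(log(min(|A|,|B|))) bound. */
        dc_merge(y, ylo, yhi, x, xlo, xhi, dst, dlo);
        return;
    }
    if (xlo == xhi)          /* larger range empty, so both are empty */
        return;
    int m = xlo + (xhi - xlo) / 2;            /* median of x */
    int k = first_greater(y, ylo, yhi, x[m]); /* split point in y */
    dst[dlo + (m - xlo) + (k - ylo)] = x[m];  /* x[m]'s final slot */
    dc_merge(x, xlo, m, y, ylo, k, dst, dlo);
    dc_merge(x, m + 1, xhi, y, k, yhi,
             dst, dlo + (m - xlo) + (k - ylo) + 1);
}

int main(void)
{
    int a[] = {1, 4, 6, 9};
    int b[] = {2, 3, 5, 7, 8};
    int c[9];
    dc_merge(a, 0, 4, b, 0, 5, c, 0);
    for (int i = 0; i < 9; ++i)
        printf("%d ", c[i]);   /* prints: 1 2 3 4 5 6 7 8 9 */
    printf("\n");
    return 0;
}

The swap at the top keeps the binary search inside the smaller of the two ranges, which is what gives the O(log(min(|A|,|B|))) bound discussed above; replacing the two plain recursive calls with thread-pool tasks yields the parallel merge.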
aminer68@gmail.com: Sep 12 02:02PM -0700

Hello,

Why Silicon Valley Wouldn't Work Without Immigrants

There are many theories for why immigrants find so much success in tech. Many American-born tech workers point out that there is no shortage of American-born employees to fill the roles at many tech companies. Researchers have found that more than enough students graduate from American colleges to fill available tech jobs. Critics of the industry's friendliness toward immigrants say it comes down to money — that technology companies take advantage of visa programs, like the H-1B system, to get foreign workers at lower prices than they would pay American-born ones.

But if that criticism rings true in some parts of the tech industry, it misses the picture among Silicon Valley's top companies. One common misperception of Silicon Valley is that it operates like a factory; in that view, tech companies can hire just about anyone from anywhere in the world to fill a particular role. But today's most ambitious tech companies are not like factories. They're more like athletic teams. They're looking for the LeBrons and Bradys — the best people in the world to come up with some brand-new, never-before-seen widget, to completely reimagine what widgets should do in the first place.

"It's not about adding tens or hundreds of thousands of people into manufacturing plants," said Aaron Levie, the co-founder and chief executive of the cloud-storage company Box. "It's about the couple ideas that are going to be invented that are going to change everything."

Why do tech honchos believe that immigrants are better at coming up with those inventions? It's partly a numbers thing. As the tech venture capitalist Paul Graham has pointed out, the United States has only 5 percent of the world's population; it stands to reason that most of the world's best new ideas will be thought up by people who weren't born here. If you look at some of the most consequential ideas in tech, you find an unusual number that were developed by immigrants. For instance, Google's entire advertising business — that is, the basis for the vast majority of its revenues and profits, the engine that allows it to hire thousands of people in the United States — was created by three immigrants: Salar Kamangar and Omid Kordestani, who came to the United States from Iran, and Eric Veach, from Canada.

But it's not just a numbers thing. Another reason immigrants do so well in tech is that people from outside bring new perspectives that lead to new ideas.

Read more here: https://www.nytimes.com/2017/02/08/technology/personaltech/why-silicon-valley-wouldnt-work-without-immigrants.html

Thank you,
Amine Moulay Ramdane.
aminer68@gmail.com: Sep 12 12:20PM -0700

Hello,

More about Transactional Memory..

We observed that the TM programming model itself, whether implemented in hardware or software, introduces complexities that limit the expected productivity gains, thus reducing the current incentive for migration to transactional programming, and the justification at present for anything more than a small amount of hardware support.

Read more here: https://queue.acm.org/detail.cfm?id=1454466

Thank you,
Amine Moulay Ramdane.
aminer68@gmail.com: Sep 12 11:42AM -0700

Hello,

More about hardware transactional memory, and now about the disadvantages of Intel TSX.

Here is also something interesting to read about hardware transactional memory, namely Intel TSX:

TSX does not guarantee forward progress, so there must always be a fallback non-TSX pathway (a minimal sketch of such a fallback follows at the end of this post). Complex transactions might always abort, even without any contention, because they overflow the speculation buffer. Even transactions that could run in theory might livelock forever if you don't have the right pauses to allow forward progress, so the fallback path is needed then too.

TSX works by keeping a speculative set of registers and processor state. It tracks all reads done in the speculation block and enqueues all writes to be delayed until the transaction ends.

The memory tracking of the transaction is currently done using the L1 cache and the standard cache-line protocols. This means contention is detected only at cache-line granularity, so you have the standard "false sharing" issue. If your transaction reads a cache line, then any write to that cache line by another core causes the transaction to abort (reads by other cores do not cause an abort). If your transaction writes a cache line, then any read or write by another core causes the transaction to abort. If your transaction aborts, then any cache lines written are evicted from L1. If any of the cache lines involved in the transaction are evicted during the transaction (e.g. if you touch too much memory, or another core locks that line), the transaction is aborted.

TSX seems to allow quite a large working set (up to the size of L1?). Obviously, the more memory you touch, the more likely you are to abort due to contention. Obviously, you will also get aborts from anything "funny" that is not just plain code and memory access: context switches, IO, kernel calls, etc. will abort transactions.

At the moment, TSX is quite slow: even if there is no contention and you don't do anything in the block, there is a lot of overhead. Using TSX naively may slow down even threaded code. Getting significant performance gains from it is non-trivial.

Read more here: http://cbloomrants.blogspot.ca/2014/11/11-12-14-intel-tsx-notes.html

Thank you,
Amine Moulay Ramdane.
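To illustrate the mandatory fallback path, here is a minimal lock-elision sketch in C using the real RTM intrinsics from <immintrin.h> (_xbegin, _xend, _xabort). It assumes GCC or Clang on an RTM-capable x86 CPU, built with "gcc -mrtm -pthread"; the retry budget MAX_RETRIES, the lock_is_held flag, and the function names are illustrative choices for this sketch, not code from any quoted source.

/* tsx_fallback.c : lock elision with a mandatory non-TSX fallback. */
#include <immintrin.h>
#include <pthread.h>
#include <stdio.h>

#define MAX_RETRIES 3   /* arbitrary retry budget for this sketch */

static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile int lock_is_held = 0;   /* mirrors the mutex for the elided path */
static long counter = 0;                /* the shared data being protected */

static void elided_increment(void)
{
    for (int retry = 0; retry < MAX_RETRIES; ++retry) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            /* Put the lock word in our read set: if another thread
               takes the real lock, its write to lock_is_held aborts
               this transaction instead of racing with it. */
            if (lock_is_held)
                _xabort(0xff);
            counter++;            /* speculative critical section */
            _xend();              /* commit: the write becomes visible */
            return;
        }
        /* The abort code says whether retrying might help; capacity
           overflows, for example, clear the _XABORT_RETRY bit. */
        if (!(status & _XABORT_RETRY))
            break;
    }
    /* The mandatory non-TSX pathway: TSX gives no forward-progress
       guarantee, so correctness must never depend on a commit. */
    pthread_mutex_lock(&fallback_lock);
    lock_is_held = 1;
    counter++;
    lock_is_held = 0;
    pthread_mutex_unlock(&fallback_lock);
}

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; ++i)
        elided_increment();
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* expect 200000 */
    return 0;
}

Note how the transactional path reads lock_is_held before touching the shared data: that is what makes elided and non-elided critical sections mutually exclusive, at the cost of extra aborts, and it is the kind of overhead-versus-concurrency trade-off the notes above describe.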
aminer68@gmail.com: Sep 12 11:34AM -0700

Hello,

Why HTM/STM in general will not allow concurrent execution in practice:

Back in 2006-2010, Azul made custom hardware for running Big Business Java — basically a supercomputer for running, e.g., web portals written in Java. The hardware had many innovations, and most were geared towards Java's special needs (read barriers for low-latency GC, support for range-checked array operations and virtual calls). Probably the most innovative hardware was the Hardware Transactional Memory: the entire L1 of every one of the 864 cores could do a transactional update! This was specifically designed to allow transactional update of Java locks, so that Java synchronized blocks could execute concurrently as long as there was no actual memory-update conflict. The HTM was really strong — transactions could be as large as the entire L1 cache — yet in practice it was not good enough to run Java synchronized blocks concurrently. Cliff will tell you why it didn't work, and why HTM/STM in general will not allow concurrent execution in practice.

Read more here: https://hydraconf.com/2019/talks/2jix5mst7iduyp9linqhfj/

Thank you,
Amine Moulay Ramdane.