Saturday, December 28, 2019

Digest for comp.programming.threads@googlegroups.com - 6 updates in 5 topics

aminer68@gmail.com: Dec 27 12:33PM -0800

Hello..
 
 
Here is another problem with ARM processors..
 
 
About SC and TSO and RMO hardware memory models..
 
I have just read the following webpage about the performance difference between the SC, TSO, and RMO hardware memory models.

I think TSO is better: it costs only around 3% to 6% in performance compared to RMO, and it is a simpler programming model. So I think ARM must support TSO to be compatible with x86, which is TSO.
 
Read more here:
 
https://infoscience.epfl.ch/record/201695/files/CS471_proj_slides_Tao_Marc_2011_1222_1.pdf
 
About memory models and sequential consistency:
 
As you have noticed, I am working with the x86 architecture..
 
Even though x86 gives up on sequential consistency, it's among the most
well-behaved architectures in terms of the crazy behaviors it allows.
Most other architectures implement even weaker memory models.
 
The ARM memory model is notoriously underspecified, but is essentially a
form of weak ordering, which provides very few guarantees. Weak ordering
allows almost any operation to be reordered, which enables a variety of
hardware optimizations but is also a nightmare to program at the lowest
levels.
 
Read more here:
 
https://homes.cs.washington.edu/~bornholt/post/memory-models.html
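
To make this concrete, here is a minimal C++11 sketch of the classic message-passing pattern (my own illustration, not code from the cited page): with relaxed atomics, which model the reorderings a weakly ordered machine like ARM is allowed to do, the reader could see the flag set and still read stale data; the release/acquire pair restores the guarantee.

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> data{0};
std::atomic<bool> ready{false};

void writer() {
    data.store(42, std::memory_order_relaxed);
    // Release: all earlier writes become visible before "ready" does.
    // With relaxed instead of release, a weakly ordered CPU could make
    // "ready" visible first, and the assert below could fire.
    ready.store(true, std::memory_order_release);
}

void reader() {
    // Acquire: if we see ready == true, we also see data == 42.
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    assert(data.load(std::memory_order_relaxed) == 42);
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}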
 
 
Memory Models: x86 is TSO, TSO is Good
 
Essentially, the conclusion is that x86 in practice implements the old
SPARC TSO memory model.
 
The big take-away from the talk for me is that it confirms the
observation made many times before that SPARC TSO seems to be the optimal
memory model. It is sufficiently understandable that programmers can
write correct code without having barriers everywhere. It is
sufficiently weak that you can build fast hardware implementations that
can scale to big machines.
 
Read more here:
 
https://jakob.engbloms.se/archives/1435
 
 
Thank you,
Amine Moulay Ramdane.
Bonita Montero <Bonita.Montero@gmail.com>: Dec 28 12:49PM +0100

> I think TSO is better: it costs only around 3% to 6% in performance
 
It's impossible to estimate that, since you can't build an ARM CPU
with TSO for comparison.
aminer68@gmail.com: Dec 27 02:32PM -0800

Hello,
 
 
No, your smartphone isn't going to boil your skin off or give you cancer in the 5G age
 
Read more here:
 
https://business.financialpost.com/opinion/no-your-smartphone-isnt-going-to-boil-your-skin-off-or-give-you-cancer-in-the-5g-age
 
 
 
Thank you,
Amine Moulay Ramdane.
aminer68@gmail.com: Dec 27 12:34PM -0800

Hello,
 
 
About the store buffer and memory visibility..
 
 
More about memory visibility..
 
I said before:

As you know, in parallel programming you have to take care not only of memory ordering, but also of memory visibility. Read this to notice it:
 
A store barrier, "sfence" instruction on x86, forces all store instructions prior to the barrier to happen before the barrier and have the store buffers flushed to cache for the CPU on which it is issued. This will make the program state "visible" to other CPUs so they can act on it if necessary.
 
 
Read more here to understand it correctly:
 
"However under x86-TSO, the stores are cached in the store buffers,
a load consult only shared memory and the store buffer of the given thread, wich means it can load data from memory and ignore values from
the other thread."
 
Read more here:
 
https://books.google.ca/books?id=C2R2DwAAQBAJ&pg=PA127&lpg=PA127&dq=immediately+visible+and+m+fence+and+store+buffer+and+x86&source=bl&ots=yfGI17x1YZ&sig=ACfU3U2EYRawTkQmi3s5wY-sM7IgowDlWg&hl=en&sa=X&ved=2ahUKEwi_nq3duYPkAhVDx1kKHYoyA5UQ6AEwAnoECAgQAQ#v=onepage&q=immediately%20visible%20and%20m%20fence%20and%20store%20buffer%20and%20x86&f=false
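
To illustrate the quoted behavior, here is a minimal C++11 sketch (my own, not from the book) of the classic store-buffering litmus test: each thread's store can sit in its store buffer while the other thread's load reads the old value from memory, so r1 == 0 and r2 == 0 is a possible outcome even under x86-TSO; upgrading the operations to memory_order_seq_cst, which on x86 is implemented with a full fence like mfence, forbids that outcome.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1() {
    x.store(1, std::memory_order_relaxed);  // may linger in the store buffer
    r1 = y.load(std::memory_order_relaxed); // can still read 0 from memory
}

void thread2() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread a(thread1), b(thread2);
    a.join();
    b.join();
    // With relaxed atomics (and under x86-TSO) r1 == 0 && r2 == 0 can happen.
    printf("r1=%d r2=%d\n", r1, r2);
}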
 
 
 
Now we can ask the question: how much time does the store buffer
take to drain?
 
 
So read here to notice it:
 
https://nicknash.me/2018/04/07/speculating-about-store-buffer-capacity/
 
 
So as you can notice, he allows around 500 no-ops for the store
buffer to drain, and I think it can take less time than that.
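
Here is a rough C++ sketch of how one could probe that on x86-64 with GCC or Clang; the burst of 64 stores and the use of rdtsc are my own arbitrary choices, and rdtsc is not a serializing instruction, so the numbers are only indicative.

#include <cstdio>
#include <x86intrin.h> // __rdtsc, _mm_mfence

volatile long sink[64]; // targets for the burst of stores

int main() {
    // Warm up the cache lines so we time the store buffer, not misses.
    for (int i = 0; i < 64; ++i) sink[i] = 0;

    unsigned long long t0 = __rdtsc();
    for (int i = 0; i < 64; ++i) sink[i] = i; // fill the store buffer
    unsigned long long t1 = __rdtsc(); // stores may still be buffered here

    _mm_mfence(); // wait until every buffered store has drained
    unsigned long long t2 = __rdtsc();

    printf("issue: %llu cycles, drain after fence: %llu cycles\n",
           t1 - t0, t2 - t1);
}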
 
 
 
Thank you,
Amine Moulay Ramdane.
aminer68@gmail.com: Dec 27 09:02AM -0800

Hello,
 
 
Read the following webpage:
 
Concurrency and Parallelism: Understanding I/O
 
https://blog.risingstack.com/concurrency-and-parallelism-understanding-i-o/
 
 
So you have to know that my Parallel Compression Library and my
Parallel Archiver are very efficient in I/O, and here is
what I wrote about my powerful Parallel Compression Library:
 
Description:
 
Parallel Compression Library implements Parallel LZ4, Parallel LZMA, and Parallel Zstd algorithms using my Thread Pool Engine.

- It supports memory streams, file streams and files

- 64-bit support: lets you create archive files over 4 GB, supports archives up to 2^63 bytes, and compresses and decompresses files up to 2^63 bytes

- Parallel compression and parallel decompression are extremely fast

- It now supports processor groups on Windows, so that it can use more than 64 logical processors and it scales well

- It's NUMA-aware and NUMA efficient on Windows (it parallelizes the reads and writes on NUMA nodes)

- It efficiently minimizes contention so that it scales well

- It provides both compression and decompression rate indicators

- You can test the integrity of your compressed file or stream

- It is thread-safe, which means the methods can be called from multiple threads

- Easy programming interface

- Full source code available.
 
Now my Parallel Compression Library is optimized for NUMA (it parallelizes the reads and writes on NUMA nodes), it supports processor groups on Windows, and it uses only two threads that do the I/O (and they are not contending), so it keeps contention as low as possible and scales well. Also, the process of calculating the CRC is now much more optimized and fast, and the process of testing the integrity is fast.
 
I have done a quick calculation of the scalability prediction for my Parallel Compression Library, and I think it's good: it can scale beyond 100X on NUMA systems.
 
The dynamic link libraries for Windows and the dynamic shared libraries for Linux of the compression and decompression algorithms of my Parallel Compression Library and of my Parallel Archiver were compiled from C with optimization level 2 enabled, so they are very fast.
 
Here are the parameters of the constructor:
 
The first parameter is the number of cores you specify to run the compression algorithm in parallel.

The second parameter is a boolean, processorgroups, that enables support for processor groups on Windows; if it is set to true, it will enable you to scale beyond 64 logical processors and it will be NUMA efficient.
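
To make the two parameters concrete, here is a purely hypothetical C++ analogue of such a constructor; the class name ParallelCompressor and everything else in this sketch are invented for illustration and are not the library's real interface.

#include <cstdint>

class ParallelCompressor {
public:
    // cores:           how many cores run the compression in parallel
    // processorgroups: when true, use Windows processor groups so the
    //                  work can scale beyond 64 logical processors and
    //                  stay NUMA efficient
    ParallelCompressor(std::uint32_t cores, bool processorgroups)
        : cores_(cores), processorgroups_(processorgroups) {}

private:
    std::uint32_t cores_;
    bool processorgroups_;
};

int main() {
    // For example: 8 cores, with processor-group support enabled.
    ParallelCompressor pc(8, true);
}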
 
Just look at the Easy Compression Library for example; as you may have noticed, it's not a parallel compression library:
 
http://www.componentace.com/ecl_features.htm
 
And look at its pricing:
 
http://www.componentace.com/order/order_product.php?id=4
 
My parallel compression library costs you $0 and it's a parallel compression library..
 
 
You can read more about my Parallel Compression Library and download it from my website here:
 
https://sites.google.com/site/scalable68/parallel-compression-library
 
 
 
 
Thank you,
Amine Moulay Ramdane.
aminer68@gmail.com: Dec 27 08:47AM -0800

Hello,
 
 
My Scalable VarFiler was updated to version 1.92
 
Now ParallelVarFiler is fault tolerant to power failures and the like: I have done a simulation of power failures and data file damage, and ParallelVarFiler recovers from power failures and damage to the data file.
 
AnalyzeVarfiler() returns ctCorrupt if there is a power failure or something like that that causes corruption; in that case you have to fix the format of the archive with FixVarfiler(), which will fix the format of the archive and recover from the power failure corruption.
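
Here is a small C++-style sketch of that recovery flow; AnalyzeVarfiler(), FixVarfiler() and ctCorrupt are the names from above, while the enum value ctOk, the signatures and the stub bodies are invented so the sketch compiles stand-alone.

enum VarfilerState { ctOk, ctCorrupt }; // ctOk is an assumed name

// Stubs so this sketch is self-contained; the real routines live in
// ParallelVarFiler and may have different signatures.
VarfilerState AnalyzeVarfiler() { return ctCorrupt; }
void FixVarfiler() { /* repairs the archive format */ }

int main() {
    // After a power failure, check the archive and repair its format
    // before using it again.
    if (AnalyzeVarfiler() == ctCorrupt)
        FixVarfiler();
}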
 
 
You can download it from:
 
https://sites.google.com/site/scalable68/scalable-parallel-varfiler
 
 
And the Scalable Varfiler benchmarks are here:
 
https://sites.google.com/site/scalable68/parallel-varfiler-benchmarks
 
 
 
Thank you,
Amine Moulay Ramdane.
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page.
To unsubscribe from this group and stop receiving emails from it send an email to comp.programming.threads+unsubscribe@googlegroups.com.
