- Here is another problem with ARM processors.. - 2 Updates
- No, your smartphone isn't going to boil your skin off or give you cancer in the 5G age - 1 Update
- About the store buffer and memory visibility.. - 1 Update
- Concurrency and Parallelism: Understanding I/O - 1 Update
- My Scalable VarFiler was updated to version 1.92 - 1 Update
aminer68@gmail.com: Dec 27 12:33PM -0800

Hello..

Here is another problem with ARM processors.. About the SC, TSO and RMO hardware memory models..

I have just read the following webpage about the performance difference between the SC, TSO and RMO hardware memory models. I think TSO is better: it is only around 3% ~ 6% less performance than RMO, and it is a simpler programming model than RMO. So I think ARM should support TSO to be compatible with x86, which is TSO. Read more here to notice it:

https://infoscience.epfl.ch/record/201695/files/CS471_proj_slides_Tao_Marc_2011_1222_1.pdf

About memory models and sequential consistency:

As you have noticed, I am working with the x86 architecture.. Even though x86 gives up on sequential consistency, it is among the most well-behaved architectures in terms of the crazy behaviors it allows. Most other architectures implement even weaker memory models. The ARM memory model is notoriously underspecified, but it is essentially a form of weak ordering, which provides very few guarantees. Weak ordering allows almost any operation to be reordered, which enables a variety of hardware optimizations but is also a nightmare to program at the lowest levels. Read more here:

https://homes.cs.washington.edu/~bornholt/post/memory-models.html

Memory Models: x86 is TSO, TSO is Good

Essentially, the conclusion is that x86 in practice implements the old SPARC TSO memory model. The big take-away from the talk for me is that it confirms the observation, made many times before, that SPARC TSO seems to be the optimal memory model. It is sufficiently understandable that programmers can write correct code without having barriers everywhere. It is sufficiently weak that you can build fast hardware implementations that can scale to big machines. Read more here:

https://jakob.engbloms.se/archives/1435

Thank you,
Amine Moulay Ramdane.
Bonita Montero <Bonita.Montero@gmail.com>: Dec 28 12:49PM +0100

> I think TSO is better: it is only around 3% ~ 6% less performance

It's impossible to estimate that, since you can't build an ARM CPU with TSO for comparison.
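To make the memory-model discussion above concrete, here is a minimal C++11 sketch (not taken from either post) of the classic store-buffering litmus test. It shows the one reordering that x86-TSO permits: a store followed by a load to a different address can appear reordered, because the store sits in the store buffer while the load executes. Sequential consistency forbids this outcome.

// Store-buffering (SB) litmus test, a minimal sketch.
// With relaxed atomics, the outcome r1 == 0 && r2 == 0 is allowed on
// x86-TSO: each store can sit in its CPU's store buffer while the
// following load reads from memory. Under sequential consistency this
// outcome is impossible.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void thread1() {
    x.store(1, std::memory_order_relaxed);   // may linger in the store buffer
    r1 = y.load(std::memory_order_relaxed);  // can complete before the store is visible
}

void thread2() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
    if (r1 == 0 && r2 == 0)
        std::puts("store-load reordering observed");
    return 0;
}

In practice the relaxed version has to be run in a loop to catch the reordering; switching both operations to memory_order_seq_cst makes the compiler emit the full fence (MFENCE or a LOCK-prefixed instruction on x86) that rules the outcome out.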
aminer68@gmail.com: Dec 27 02:32PM -0800

Hello,

No, your smartphone isn't going to boil your skin off or give you cancer in the 5G age.

Read more here:

https://business.financialpost.com/opinion/no-your-smartphone-isnt-going-to-boil-your-skin-off-or-give-you-cancer-in-the-5g-age

Thank you,
Amine Moulay Ramdane.
aminer68@gmail.com: Dec 27 12:34PM -0800

Hello,

About the store buffer and memory visibility..

More about memory visibility..

I said before: as you know, in parallel programming you have to take care not only of memory ordering, but also of memory visibility. Read this to notice it:

A store barrier, the "sfence" instruction on x86, forces all store instructions prior to the barrier to happen before the barrier and have the store buffers flushed to cache for the CPU on which it is issued. This will make the program state "visible" to other CPUs so they can act on it if necessary.

Read more here to understand it correctly:

"However, under x86-TSO the stores are cached in the store buffers; a load consults only shared memory and the store buffer of the given thread, which means it can load data from memory and ignore values from the other thread."

Read more here:

https://books.google.ca/books?id=C2R2DwAAQBAJ&pg=PA127&lpg=PA127&dq=immediately+visible+and+m+fence+and+store+buffer+and+x86&source=bl&ots=yfGI17x1YZ&sig=ACfU3U2EYRawTkQmi3s5wY-sM7IgowDlWg&hl=en&sa=X&ved=2ahUKEwi_nq3duYPkAhVDx1kKHYoyA5UQ6AEwAnoECAgQAQ#v=onepage&q=immediately%20visible%20and%20m%20fence%20and%20store%20buffer%20and%20x86&f=false

Now we can ask the question of how long it takes for the store buffer to drain. So read here to notice:

https://nicknash.me/2018/04/07/speculating-about-store-buffer-capacity/

As you can see, he uses around 500 no-ops to allow the store buffer to drain, and I think it can take less than that for the store buffer to drain.

Thank you,
Amine Moulay Ramdane.
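As a rough illustration of the visibility point above (a sketch of my own, not code from the post or the cited book), here is the standard message-passing pattern with C++11 atomics. On x86-TSO the store buffer drains to the cache in program order, so the release store needs no extra hardware fence; a sequentially consistent fence, by contrast, compiles to MFENCE and forces the store buffer to drain before any later load.

// Message passing: the writer publishes data and then sets a flag; the
// reader spins on the flag and then reads the data. The acquire/release
// pairing guarantees that once 'ready' is seen as true, the write to
// 'data' is visible too (on x86 this falls out of TSO's FIFO store buffer).
#include <atomic>
#include <cassert>
#include <thread>

int data = 0;
std::atomic<bool> ready{false};

void writer() {
    data = 42;                                     // plain store, buffered first
    ready.store(true, std::memory_order_release);  // drains after 'data' in FIFO order
}

void reader() {
    while (!ready.load(std::memory_order_acquire))
        ;                                          // wait until the writer's stores are visible
    assert(data == 42);                            // guaranteed by the acquire/release pairing
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
    return 0;
}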
aminer68@gmail.com: Dec 27 09:02AM -0800

Hello,

Read the following webpage:

Concurrency and Parallelism: Understanding I/O

https://blog.risingstack.com/concurrency-and-parallelism-understanding-i-o/

So you have to know that my Parallel Compression Library and my Parallel Archiver are very efficient in I/O, and here is what I wrote about my powerful Parallel Compression Library:

Description:

Parallel Compression Library implements Parallel LZ4, Parallel LZMA and Parallel Zstd algorithms using my Thread Pool Engine.

- It supports memory streams, file streams and files
- 64-bit support: lets you create archive files over 4 GB, supports archives up to 2^63 bytes, and compresses and decompresses files up to 2^63 bytes
- Parallel compression and parallel decompression are extremely fast
- It now supports processor groups on Windows, so it can use more than 64 logical processors, and it scales well
- It is NUMA-aware and NUMA-efficient on Windows (it parallelizes the reads and writes on NUMA nodes)
- It efficiently minimizes contention, so it scales well
- It provides both compression and decompression rate indicators
- You can test the integrity of your compressed file or stream
- It is thread-safe, which means the methods can be called from multiple threads
- Easy programming interface
- Full source code available

Now my Parallel Compression Library is optimized for NUMA (it parallelizes the reads and writes on NUMA nodes), it supports processor groups on Windows, and it uses only two threads for the I/O (and they do not contend with each other), so it keeps contention to a minimum and scales well. Also, the process of calculating the CRC is now much more optimized and fast, and the process of testing the integrity is fast.

I have done a quick calculation of the scalability prediction for my Parallel Compression Library, and I think it is good: it can scale beyond 100X on NUMA systems.

The Dynamic Link Libraries for Windows and the dynamic shared libraries for Linux of the compression and decompression algorithms of my Parallel Compression Library and of my Parallel Archiver were compiled from C with optimization level 2 enabled, so they are very fast.

Here are the parameters of the constructor:

The first parameter is the number of cores you specify to run the compression algorithm in parallel.

The second parameter is a boolean parameter, processorgroups, that enables support for processor groups on Windows; if it is set to true, it will enable you to scale beyond 64 logical processors and it will be NUMA-efficient.

Just look at the Easy Compression Library, for example; as you will notice, it is not a parallel compression library:

http://www.componentace.com/ecl_features.htm

And look at its pricing:

http://www.componentace.com/order/order_product.php?id=4

My Parallel Compression Library costs you $0, and it is a parallel compression library..

You can read more about my Parallel Compression Library and download it from my website here:

https://sites.google.com/site/scalable68/parallel-compression-library

Thank you,
Amine Moulay Ramdane.
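Since the post only describes the library and links to it, here is a small C++ sketch, under my own assumptions, of the general technique a parallel block compressor like this uses: split the input into independent chunks and compress each chunk on its own worker thread, so the workers never contend. The names (parallel_compress, compress_chunk, chunk_size) are hypothetical and are not the library's API; a real implementation would call an actual codec such as LZ4 or Zstd inside compress_chunk and hand the finished blocks to one or two dedicated I/O threads, as the post describes.

// Hypothetical sketch: chunked parallel compression with a fixed set of
// worker threads pulling chunk indices from a shared atomic counter.
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder for a real codec call (e.g. LZ4/Zstd); it just copies the
// bytes so the sketch stays dependency-free.
static std::vector<char> compress_chunk(const char* src, std::size_t len) {
    return std::vector<char>(src, src + len);
}

std::vector<std::vector<char>>
parallel_compress(const std::vector<char>& input,
                  std::size_t num_threads,          // "number of cores" parameter
                  std::size_t chunk_size = 1 << 20) // 1 MB chunks, chosen arbitrarily
{
    const std::size_t num_chunks = (input.size() + chunk_size - 1) / chunk_size;
    std::vector<std::vector<char>> blocks(num_chunks);
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;

    for (std::size_t t = 0; t < std::min(num_threads, num_chunks); ++t) {
        workers.emplace_back([&] {
            for (std::size_t i = next++; i < num_chunks; i = next++) {
                const std::size_t off = i * chunk_size;
                const std::size_t len = std::min(chunk_size, input.size() - off);
                blocks[i] = compress_chunk(input.data() + off, len); // independent work, no locks
            }
        });
    }
    for (auto& w : workers) w.join();
    return blocks; // a dedicated I/O thread could now write the blocks out in order
}

int main() {
    std::vector<char> input(10 * (1 << 20), 'a'); // 10 MB of sample data
    const std::size_t cores =
        std::max<std::size_t>(1, std::thread::hardware_concurrency());
    auto blocks = parallel_compress(input, cores);
    return blocks.empty() ? 1 : 0;
}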
aminer68@gmail.com: Dec 27 08:47AM -0800

Hello,

My Scalable VarFiler was updated to version 1.92.

ParallelVarFiler is now fault-tolerant to power failures and the like. I have done a simulation of power failures and data-file damage, and ParallelVarFiler recovers from power failures and from damage to the data file.

If AnalyzeVarfiler() returns ctCorrupt because a power failure or something similar has corrupted the archive, you have to fix the format of the archive with FixVarfiler(), which will repair the format of the archive and recover from the corruption.

You can download it from:

https://sites.google.com/site/scalable68/scalable-parallel-varfiler

And the Scalable VarFiler benchmarks are here:

https://sites.google.com/site/scalable68/parallel-varfiler-benchmarks

Thank you,
Amine Moulay Ramdane.
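For what it's worth, here is a tiny C++ sketch of the check-then-repair workflow described above. The names AnalyzeVarfiler, FixVarfiler and ctCorrupt come from the post, but the real library is not C++ and its exact signatures are not given, so the declarations and stub bodies below are purely illustrative assumptions.

// Illustrative only: assumed bindings for the recovery workflow.
#include <cstdio>

enum VarfilerState { ctOk, ctCorrupt };              // assumed result type

// Stubs standing in for the real library routines.
VarfilerState AnalyzeVarfiler(const char* /*path*/) { return ctOk; }
bool FixVarfiler(const char* /*path*/) { return true; }

// Check the archive format first; if a power failure corrupted it,
// repair it before opening, as the post recommends.
bool open_archive_safely(const char* path) {
    if (AnalyzeVarfiler(path) == ctCorrupt) {
        if (!FixVarfiler(path)) {
            std::fprintf(stderr, "could not repair %s\n", path);
            return false;
        }
    }
    // ... proceed to open and use the archive ...
    return true;
}

int main() {
    return open_archive_safely("data.var") ? 0 : 1;  // "data.var" is a made-up file name
}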
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page. To unsubscribe from this group and stop receiving emails from it, send an email to comp.programming.threads+unsubscribe@googlegroups.com.