Thursday, December 5, 2019

Digest for comp.programming.threads@googlegroups.com - 8 updates in 8 topics

aminer68@gmail.com: Dec 04 01:31PM -0800

Hello,
 
 
More about compile time and build time..
 
Look here about Java; it says:
 
 
"Java Build Time Benchmarks
 
I'm trying to get some benchmarks for builds and I'm coming up short via Google. Of course, build times will be super dependent on a million different things, but I'm having trouble finding anything comparable.
 
Right now: We've got ~2 million lines of code and it takes about 2 hours for this portion to build (this excludes unit tests).
 
What do your build times look like for similar sized projects and what did you do to make it that fast?"
 
 
Read here to notice it:
 
https://www.reddit.com/r/java/comments/4jxs17/java_build_time_benchmarks/
 
 
So 2 million lines of Java code take about 2 hours to build.
 
 
And how long do you think 2 million lines of code take
to build with Delphi?
 
Answer: Just about 20 seconds.
 
 
Here is the proof from Embarcadero; read the post and watch the video to be convinced about Delphi:
 
https://community.idera.com/developer-tools/b/blog/posts/compiling-a-million-lines-of-code-with-delphi
 
C++ also takes "much" more time to compile than Delphi.
 
 
This is why I said the following previously:
 
 
I think Delphi is a single-pass compiler, and it is very fast at compile time; I think C++, Java, and C# are multi-pass compilers that are much slower than Delphi at compile time, but I think that the generated executable code of Delphi is still fast, and is faster than C#'s.
 
And what about the advantages and disadvantages of single-pass and multi-pass compilers?
 
And from automata theory we get that any Turing machine that makes 2 (or more) passes over the tape can be replaced with an equivalent one that makes only 1 pass, with a more complicated state machine. At the theoretical level, they are the same. At a practical level, all modern compilers make only one pass over the source code. It is typically translated into an internal representation that the different phases analyze and update. During flow analysis, basic blocks are identified. Common subexpressions are found, precomputed, and their results reused. During loop analysis, invariant code is moved out of the loop. During code emission, registers are assigned, and peephole analysis and code reduction are applied.
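To illustrate one of these phases, here is a minimal C sketch of what loop-invariant code motion conceptually does to a source-level loop (the function names are just for illustration):

/* Before optimization: the product (a * b) is invariant in the loop. */
void scale_before(int *dst, const int *src, int n, int a, int b)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * (a * b);   /* recomputed on every iteration */
}

/* What the optimizer conceptually produces: the invariant
   subexpression is computed once and hoisted out of the loop. */
void scale_after(int *dst, const int *src, int n, int a, int b)
{
    int k = a * b;                   /* loop-invariant code motion */
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * k;
}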
 
 
 
Thank you,
Amine Moulay Ramdane.
aminer68@gmail.com: Dec 04 01:08PM -0800

Hello...
 
 
About the Linux sys_membarrier() expedited and the Windows FlushProcessWriteBuffers()..
 
I have just read the following webpage:
 
https://lwn.net/Articles/636878/
 
 
It is interesting, and it says:
 
---
 
Results in liburcu:
 
Operations in 10s, 6 readers, 2 writers:
 
memory barriers in reader: 1701557485 reads, 3129842 writes
signal-based scheme: 9825306874 reads, 5386 writes
sys_membarrier expedited: 6637539697 reads, 852129 writes
sys_membarrier non-expedited: 7992076602 reads, 220 writes
 
---
 
 
Look at how powerful "sys_membarrier expedited" is.
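For reference, here is a minimal C sketch, following the membarrier(2) man page, of how the private expedited command is issued (the command constants come from <linux/membarrier.h>; older glibc versions have no wrapper, so it goes through syscall(2)):

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

static int membarrier(int cmd, unsigned int flags)
{
    return (int) syscall(__NR_membarrier, cmd, flags);
}

int main(void)
{
    /* A process must register before using the private expedited command. */
    if (membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0) != 0) {
        perror("membarrier register");
        return 1;
    }
    /* Acts as a memory barrier on all running threads of this process,
       so the matching barrier can stay off the readers' fast path. */
    if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) != 0) {
        perror("membarrier expedited");
        return 1;
    }
    puts("expedited membarrier issued");
    return 0;
}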
 
So as you have noticed, I have already implemented my scalable asymmetric RWLocks that use the Windows FlushProcessWriteBuffers(); they are called Fast_RWLockX and LW_Fast_RWLockX. They are limited to 400 threads, but you can manually extend the maximum number of threads by setting the NbrThreads parameter of the constructor, and you have to start your threads once and for all and keep working with them all; don't start a thread and exit from it every time. Fast_RWLockX and LW_Fast_RWLockX don't use any atomic operations and/or StoreLoad-style memory barriers on the reader side, so they are scalable and very fast, and I will soon port them to Linux, where they will support both sys_membarrier expedited and sys_membarrier non-expedited.
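To make the technique concrete, here is a minimal C sketch of the general asymmetric pattern, assuming a single writer and omitting cache-line padding; it is an illustration of the idea, not the implementation that you can download below:

#include <windows.h>

#define MAX_READERS 400

static volatile LONG reader_active[MAX_READERS];
static volatile LONG writer_pending;

/* Reader fast path: plain stores and loads only; no atomic
   read-modify-write and no StoreLoad fence. */
void reader_enter(int id)
{
    for (;;) {
        reader_active[id] = 1;          /* announce ourselves */
        if (!writer_pending)
            return;                     /* no writer: we hold the read lock */
        reader_active[id] = 0;          /* back off while a writer is in */
        while (writer_pending)
            Sleep(0);
    }
}

void reader_exit(int id)
{
    reader_active[id] = 0;
}

/* Writer slow path: after publishing writer_pending, the IPI broadcast
   guarantees that every reader either sees writer_pending == 1 or has
   its reader_active store made visible to us, closing the race that a
   reader-side StoreLoad barrier would normally close. */
void writer_enter(void)
{
    writer_pending = 1;
    FlushProcessWriteBuffers();
    for (int i = 0; i < MAX_READERS; i++)
        while (reader_active[i])
            Sleep(0);
}

void writer_exit(void)
{
    writer_pending = 0;
}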
 
You can download my inventions of scalable Asymmetric RWLocks that use
IPIs and that are costless on the reader side from here:
 
https://sites.google.com/site/scalable68/scalable-rwlock
 
Cache-coherency protocols do not use IPIs, and as a user-space developer you do not care about IPIs at all; one is most interested in the cost of cache coherency itself. However, the Win32 API provides a function, FlushProcessWriteBuffers(), that issues IPIs to all processors in the affinity mask of the current process. You can use it to investigate the cost of IPIs.
 
When I ran a simple synthetic test on a dual-core machine, I obtained the following numbers.
 
420 cycles is the minimum cost of the FlushProcessWriteBuffers() function on the issuing core.

1600 cycles is the mean cost of the FlushProcessWriteBuffers() function on the issuing core.

1300 cycles is the mean cost of the FlushProcessWriteBuffers() function on the remote core.
 
Note that, as far as I understand, the function issues an IPI to the remote core, then the remote core acks it with another IPI; the issuing core waits for the ack IPI and then returns.
 
And the IPIs have the indirect cost of flushing the processor pipeline.
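Here is a minimal sketch, for MSVC on Windows, of how the cost on the issuing core can be measured with the timestamp counter:

#include <windows.h>
#include <intrin.h>
#include <stdio.h>

int main(void)
{
    const int iters = 100000;
    unsigned long long best = ~0ULL, total = 0;

    for (int i = 0; i < iters; i++) {
        unsigned long long t0 = __rdtsc();
        FlushProcessWriteBuffers();
        unsigned long long dt = __rdtsc() - t0;
        if (dt < best)
            best = dt;                 /* minimum cost on the issuing core */
        total += dt;
    }
    printf("min cycles: %llu, mean cycles: %llu\n", best, total / iters);
    return 0;
}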
 
 
 
Thank you,
Amine Moulay Ramdane.
aminer68@gmail.com: Dec 04 12:37PM -0800

Hello,
 
 
 
About the store buffer and memory visibility..
 
 
 
More about memory visibility..
 
I said before:
 
As you know, in parallel programming you have to take care not only of memory ordering, but also of memory visibility; read this to notice it:
 
A store barrier, the "sfence" instruction on x86, forces all store instructions prior to the barrier to happen before the barrier and have the store buffers flushed to cache for the CPU on which it is issued. This will make the program state "visible" to other CPUs so they can act on it if necessary.
 
 
Read more here to understand correctly:
 
"However under x86-TSO, the stores are cached in the store buffers,
a load consult only shared memory and the store buffer of the given thread, wich means it can load data from memory and ignore values from
the other thread."
 
Read more here:
 
https://books.google.ca/books?id=C2R2DwAAQBAJ&pg=PA127&lpg=PA127&dq=immediately+visible+and+m+fence+and+store+buffer+and+x86&source=bl&ots=yfGI17x1YZ&sig=ACfU3U2EYRawTkQmi3s5wY-sM7IgowDlWg&hl=en&sa=X&ved=2ahUKEwi_nq3duYPkAhVDx1kKHYoyA5UQ6AEwAnoECAgQAQ#v=onepage&q=immediately%20visible%20and%20m%20fence%20and%20store%20buffer%20and%20x86&f=false
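The classic store-buffering litmus test makes this concrete; in the following minimal C11 sketch using pthreads, the outcome r1 == 0 and r2 == 0 is allowed on x86 precisely because each thread's store can still be sitting in its store buffer when the other thread loads, and uncommenting the seq_cst fences (an mfence on x86) forbids that outcome:

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int x, y;
int r1, r2;

void *thread1(void *arg)
{
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    /* atomic_thread_fence(memory_order_seq_cst); */
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

void *thread2(void *arg)
{
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    /* atomic_thread_fence(memory_order_seq_cst); */
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Run this in a loop (resetting x and y each time) to observe
       r1 == 0 && r2 == 0 without the fences. */
    printf("r1 = %d, r2 = %d\n", r1, r2);
    return 0;
}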
 
 
 
Now we can ask the question: how much time does the store buffer take to drain?
 
 
So read here to notice:
 
https://nicknash.me/2018/04/07/speculating-about-store-buffer-capacity/
 
 
So as you are noticing, he is allowing around 500 no-ops for the store buffer to drain, and I think that it can take less time than that.
 
 
 
Thank you,
Amine Moulay Ramdane.
aminer68@gmail.com: Dec 04 12:35PM -0800

Hello,
 
 
What about garbage collection?
 
Read what this serious specialist, Chris Lattner, said:
 
"One thing that I don't think is debatable is that the heap compaction
behavior of a GC (which is what provides the heap fragmentation win) is
incredibly hostile for cache (because it cycles the entire memory space
of the process) and performance predictability."
 
"Not relying on GC enables Swift to be used in domains that don't want
it - think boot loaders, kernels, real time systems like audio
processing, etc."
 
"GC also has several *huge* disadvantages that are usually glossed over:
while it is true that modern GC's can provide high performance, they can
only do that when they are granted *much* more memory than the process
is actually using. Generally, unless you give the GC 3-4x more memory
than is needed, you'll get thrashing and incredibly poor performance.
Additionally, since the sweep pass touches almost all RAM in the
process, they tend to be very power inefficient (leading to reduced
battery life)."
 
Read more here:
 
https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160208/009422.html
 
Here is Chris Lattner's Homepage:
 
http://nondot.org/sabre/
 
And here is Chris Lattner's resume:
 
http://nondot.org/sabre/Resume.html#Tesla
 
 
This is why I have invented the following scalable algorithm and its implementation, which makes Delphi and FreePascal more powerful:
 
My invention, which is my scalable reference counting with efficient support for weak references, version 1.37, is here..
 
Here I am again: I have just updated my scalable reference counting with efficient support for weak references to version 1.37. I have just added TAMInterfacedPersistent, which is a scalable reference-counted version, and now I think I have made it complete and powerful.
 
Because I have just read the following web page:
 
https://www.codeproject.com/Articles/1252175/Fixing-Delphis-Interface-Limitations
 
But I don't agree with the writing of the author of the above web page, because I think you have to understand the "spirit" of Delphi; here is why:
 
A component is supposed to be owned and destroyed by something else, "typically" a form (and "typically" means in English: in "most" cases, and this is the most important thing to understand). In that scenario, reference counting is not used.
 
If you pass a component as an interface reference, it would be very unfortunate if it was destroyed when the method returns.
 
Therefore, reference counting in TComponent has been removed.
 
Also, I have just added TAMInterfacedPersistent to my invention.
 
To use scalable reference counting with Delphi and FreePascal, just replace TInterfacedObject with my TAMInterfacedObject, which is the scalable reference-counted version, and replace TInterfacedPersistent with my TAMInterfacedPersistent, which is the scalable reference-counted version. You will find both my TAMInterfacedObject and my TAMInterfacedPersistent inside the AMInterfacedObject.pas file. To know how to use weak references, please take a look at the demo that I have included, called example.dpr, and look inside my zip file at the tutorial about weak references. To know how to use delegation, take a look at the demo that I have included, called test_delegation.pas, and at the tutorial inside my zip file that teaches you how to use delegation.
 
I think my scalable reference counting with efficient support for weak references is stable and fast; it works on both Windows and Linux, and it scales on multicore and NUMA systems. You will not find it in C++ or Rust, and I don't think you will find it anywhere else. You have to know that this invention of mine solves the problem of dangling pointers and the problem of memory leaks, and my scalable reference counting is "scalable".
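As background on why the word "scalable" matters: a single reference count updated with atomic increments becomes one contended cache line on multicore machines, and a well-known way to remove that contention is to stripe the count per thread or per core, as in the following minimal C sketch that illustrates the general idea (it is not my actual algorithm, which you can download below):

#include <stdatomic.h>

#define NBR_SLOTS 64   /* e.g. at least the number of hardware threads */

/* Each slot gets its own cache line so that threads incrementing
   different slots never contend on the same line. */
typedef struct {
    _Alignas(64) atomic_long count;
} slot_t;

typedef struct {
    slot_t slots[NBR_SLOTS];
} striped_refcount_t;

/* Fast path: each thread touches only its own slot. */
void rc_retain(striped_refcount_t *rc, unsigned thread_id)
{
    atomic_fetch_add_explicit(&rc->slots[thread_id % NBR_SLOTS].count,
                              1, memory_order_relaxed);
}

void rc_release(striped_refcount_t *rc, unsigned thread_id)
{
    atomic_fetch_add_explicit(&rc->slots[thread_id % NBR_SLOTS].count,
                              -1, memory_order_release);
}

/* Slow path: the true count is the sum of all slots. A production
   design needs extra machinery to detect the drop to zero without
   races; this sketch leaves that out. */
long rc_value(striped_refcount_t *rc)
{
    long sum = 0;
    for (int i = 0; i < NBR_SLOTS; i++)
        sum += atomic_load_explicit(&rc->slots[i].count,
                                    memory_order_acquire);
    return sum;
}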
 
And please read the readme file inside the zip file, which I have just extended to help you understand more.
 
You can download my new scalable reference counting with efficient support for weak references version 1.37 from:
 
https://sites.google.com/site/scalable68/scalable-reference-counting-with-efficient-support-for-weak-references
 
 
Thank you,
Amine Moulay Ramdane.
aminer68@gmail.com: Dec 04 12:33PM -0800

Hello,
 
 
More about Energy efficiency..
 
You have to be aware that parallelization of the software
can lower power consumption, and here is the formula
that permits you to calculate the power consumption of
"parallel" software programs:
 
Power consumption of the total cores = (The number of cores) * (1/(Parallel speedup))^3 * (Power consumption of the single core).
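For instance, under this formula, which assumes that the parallel speedup is used to lower the clock frequency and that power scales with the cube of the frequency, running on 4 cores with a parallel speedup of 2 gives:

Power consumption of the total cores = 4 * (1/2)^3 * (Power consumption of the single core) = 0.5 * (Power consumption of the single core)

That is, half the power of the single core running the sequential program at full speed.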
 
 
Also read the following about energy efficiency:
 
Energy efficiency isn't just a hardware problem. Your programming
language choices can have serious effects on the efficiency of your
energy consumption. We dive deep into what makes a programming language
energy efficient.
 
As the researchers discovered, the CPU-based energy consumption always
represents the majority of the energy consumed.
 
What Pereira et al. found wasn't entirely surprising: speed does not always equate to energy efficiency. Compiled languages like C, C++, Rust, and Ada ranked as some of the most energy-efficient languages out there, and Java and FreePascal are also good at energy efficiency.
 
Read more here:
 
https://jaxenter.com/energy-efficient-programming-languages-137264.html
 
RAM is still expensive and slow, relative to CPUs.
 
And "memory" usage efficiency is important for mobile devices.
 
So the Delphi and FreePascal compilers are also still "useful" for mobile devices, because Delphi and FreePascal are good if you are considering time and memory, or energy and memory. The following Pascal benchmark was done with FreePascal, and it shows that C, Go, and Pascal do rather better if you're ranking languages on time and memory, or energy and memory.
 
Read again here to notice it:
 
https://jaxenter.com/energy-efficient-programming-languages-137264.html
 
 
 
Thank you,
Amine Moulay Ramdane.
aminer68@gmail.com: Dec 04 12:32PM -0800

Hello,
 
 
Here is another problem with ARM processors..
 
 
About SC and TSO and RMO hardware memory models..
 
I have just read the following webpage about the performance difference between the SC, TSO, and RMO hardware memory models.
 
I think TSO is better: it gives just around 3% to 6% less performance than RMO, and it is a simpler programming model than RMO. So I think ARM must support TSO to be compatible with x86, which is TSO.
 
Read more here to notice it:
 
https://infoscience.epfl.ch/record/201695/files/CS471_proj_slides_Tao_Marc_2011_1222_1.pdf
 
About memory models and sequential consistency:
 
As you have noticed, I am working with the x86 architecture..
 
Even though x86 gives up on sequential consistency, it's among the most
well-behaved architectures in terms of the crazy behaviors it allows.
Most other architectures implement even weaker memory models.
 
The ARM memory model is notoriously underspecified, but it is essentially a form of weak ordering, which provides very few guarantees. Weak ordering allows almost any operation to be reordered, which enables a variety of hardware optimizations but is also a nightmare to program at the lowest levels.
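Here is a minimal C11 sketch of the kind of reordering this allows: in the message-passing pattern below, relaxed atomics would permit ARM's weak ordering to let the reader see flag == 1 and still read data == 0, an outcome that x86-TSO already forbids, while the release/acquire pair restores the guarantee on every architecture:

#include <stdatomic.h>

int data;
atomic_int flag;

/* Writer: publish data, then raise the flag. */
void producer(void)
{
    data = 42;
    /* With memory_order_relaxed, ARM may reorder these two stores. */
    atomic_store_explicit(&flag, 1, memory_order_release);
}

/* Reader: wait for the flag, then read data. */
int consumer(void)
{
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ; /* spin */
    /* Thanks to acquire/release, data == 42 is guaranteed here.
       With relaxed operations, weak ordering would allow 0. */
    return data;
}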
 
Read more here:
 
https://homes.cs.washington.edu/~bornholt/post/memory-models.html
 
 
Memory Models: x86 is TSO, TSO is Good
 
Essentially, the conclusion is that x86 in practice implements the old
SPARC TSO memory model.
 
The big take-away from the talk for me is that it confirms the observation, made many times before, that SPARC TSO seems to be the optimal memory model. It is sufficiently understandable that programmers can write correct code without having barriers everywhere. It is sufficiently weak that you can build fast hardware implementations that can scale to big machines.
 
Read more here:
 
https://jakob.engbloms.se/archives/1435
 
 
Thank you,
Amine Moulay Ramdane.
aminer68@gmail.com: Dec 04 12:17PM -0800

Hello,
 
 
I have just posted a few posts about my political philosophy; they were my last posts about my political philosophy.
 
 
 
Thank you,
Amine Moulay Ramdane.
aminer68@gmail.com: Dec 04 08:14AM -0800

Hello,
 
 
Seagate's Mach.2 Technology Doubles HDD Performance, Microsoft Jumps Aboard
 
Read more here:
 
https://www.tomshardware.com/news/seagates-mach2-technology-doubles-hdd-performance-microsoft-jumps-aboard
 
 
Thank you,
Amine Moulay Ramdane.
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page.
To unsubscribe from this group and stop receiving emails from it send an email to comp.programming.threads+unsubscribe@googlegroups.com.
