Amine Moulay Ramdane <aminer68@gmail.com>: Oct 31 03:07PM -0700

Hello,

More of my philosophy about the supervolcano threat and about potentially hazardous asteroids..

I am a white arab from Morocco, and I think I am smart since I have also invented many scalable algorithms and other algorithms..

And more of my philosophy about potentially hazardous asteroids..

About potentially hazardous asteroids: the largest, most devastating impacts (like the one that helped kill the dinosaurs 65 million years ago) are the rarest, but smaller, more frequent collisions also pose a marked risk. This is why a new space telescope could spot potentially hazardous asteroids heading for Earth. In 2013, an asteroid entered Earth's atmosphere over Chelyabinsk, Russia. It exploded in the air, releasing 20 to 30 times more energy than the first atomic bombs, generating brightness greater than the sun, exuding heat, damaging more than 7,000 buildings and injuring more than 1,000 people. The shock wave broke windows 58 miles away. It went undetected because the asteroid came from the same direction and path as the sun. These meteorites landed on Earth after a 22-million-year voyage. The NEO Surveyor will use infrared sensors that can help astronomers find these objects -- even ones that may approach Earth during the day from the direction of the sun. This isn't something that's possible using ground-based observatories.
Read more here: https://www.cnn.com/2021/06/30/world/international-asteroid-day-2021-nasa-telescope-scn/index.html

But I think we are becoming successful at changing the trajectory of asteroids; read the following article to see it:

NASA's DART Spacecraft Successfully Moved an Asteroid

Read more here: https://gizmodo.com/nasa-dart-didymos-dimorphos-asteroid-test-space-defense-1849644501

Nasa's ambitious plan to save Earth from a supervolcano:

"There are around 20 known supervolcanoes on Earth, with major eruptions occurring on average once every 100,000 years. One of the greatest threats an eruption may pose is thought to be starvation, with a prolonged volcanic winter potentially prohibiting civilisation from having enough food for the current population." And the supervolcano threat is substantially greater than the asteroid or comet threat.

Read more here: https://www.bbc.com/future/article/20170817-nasas-ambitious-plan-to-save-earth-from-a-supervolcano

And I invite you to read the following interesting article: a "catastrophic" supervolcano eruption could be much more likely than currently believed, according to a new study. Existing knowledge about the likelihood of eruptions is based on the presence of liquid magma under a volcano, but new research warns "eruptions can occur even if no liquid magma is found".

Read more here: https://news.sky.com/story/catastrophic-supervolcano-eruption-could-be-much-more-likely-than-previously-thought-scientists-warn-12398129

Thank you,
Amine Moulay Ramdane.
Amine Moulay Ramdane <aminer68@gmail.com>: Oct 31 02:28PM -0700

Hello,

More of my philosophy about matrix-matrix multiplication and about scalability and more of my thoughts..

I am a white arab, and I think I am smart since I have also invented many scalable algorithms and other algorithms..

I think that the time complexity of the Strassen algorithm for matrix-matrix multiplication is around O(N^2.8074), and the time complexity of the naive algorithm is O(N^3), so the difference is not that significant for practical matrix sizes. So I think I will soon implement the parallel blocked matrix-matrix multiplication, and I will implement it with a new algorithm that also uses Intel AVX-512 and fused multiply-add, and of course it will use the assembler instructions below for prefetching into the caches to gain about a 22% speedup, so I think that overall it will have around the same speed as parallel BLAS. And I say that pipelining greatly increases throughput in modern CPUs such as x86 CPUs, and another common pipelining scenario is the FMA, or fused multiply-add, which is a fundamental part of the instruction set for some processors. The basic load-operate-store sequence simply lengthens by one step to become load-multiply-add-store. The FMA is possible only if the hardware supports it, as it does in the case of the Intel Xeon Phi, for example, as well as in Skylake etc.

More of my philosophy about matrix-vector multiplication of large matrices and about scalability and more of my thoughts..

The matrix-vector multiplication of large matrices is completely limited by the memory bandwidth, as I have just said (read it below), so vector extensions like SSE or AVX are usually not necessary for matrix-vector multiplication of large matrices. It is interesting that matrix-matrix multiplications don't have this kind of problem with memory bandwidth.
Companies like Intel or AMD typically show benchmarks of matrix-matrix multiplications, and they show how nicely they scale on many more cores, but they never show matrix-vector multiplications. And notice that my powerful open source software project of Parallel C++ Conjugate Gradient Linear System Solver Library that scales very well is also memory-bound and the matrices for it are usually big, but my new algorithm for it is efficiently cache-aware and efficiently NUMA-aware, and I have implemented it for both dense and sparse matrices.

More of my philosophy about the efficient matrix-vector multiplication algorithm in MPI and about scalability and more of my thoughts..

Matrix-vector multiplication is an absolutely fundamental operation, with countless applications in computer science and scientific computing. Efficient algorithms for matrix-vector multiplication are of paramount importance, and notice that for matrix-vector multiplication, n^2 time is certainly required for an n × n dense matrix. But you have to be smart, since MPI computing, also for the exascale supercomputer systems, doesn't only take into account this n^2 time: the algorithm also has to be efficiently cache-aware, and it has to have a good complexity for how much memory is used by the parallel processes in MPI. Notice carefully with me that you also have to avoid sending both a whole row of the matrix and the whole vector to the parallel processes of MPI; you have to know how to reduce this complexity efficiently, for example by dividing each row of the matrix and by dividing the vector, and sending a part of the row of the matrix and a part of the vector to each parallel process of MPI. And I think that in an efficient algorithm for matrix-vector multiplication, the time for addition is dominated by the communication time. And of course my implementation of my powerful open source software of Parallel C++ Conjugate Gradient Linear System Solver Library that scales very well is also smart, since it is efficiently cache-aware and efficiently NUMA-aware, and it implements both the dense and the sparse cases. And of course, as I am showing below, it is scaling well on the memory channels, so it is scaling well on my 16-core dual Xeon with 8 memory channels, as I am showing below, and it will scale well on the 16-socket HPE NONSTOP X SYSTEMS or the 16-socket HPE Integrity Superdome X with above 512 cores and with 64 memory channels. So I invite you to read carefully and to download my open source software project of Parallel C++ Conjugate Gradient Linear System Solver Library that scales very well from my website here:

https://sites.google.com/site/scalable68/scalable-parallel-c-conjugate-gradient-linear-system-solver-library

MPI will continue to be a viable programming model on exascale supercomputer systems, so I will soon implement many algorithms in MPI for Delphi and Freepascal and I will provide you with them. I am currently implementing an efficient matrix-vector multiplication algorithm in MPI, and you have to know that an efficient matrix-vector multiplication algorithm is really important for scientific applications. And of course I will also soon implement many other interesting algorithms in MPI for Delphi and Freepascal and I will provide you with them, so stay tuned!

More of my philosophy about the memory bottleneck and about scalability and more of my thoughts..
I think I am highly smart since I have passed two certified IQ tests and I have scored "above" 115 IQ, and I am also specialized in parallel computing, and I know that a large cache can reduce the Amdahl's Law bottleneck of main memory. But you have to understand what I am saying, since my powerful open source software project below of Parallel C++ Conjugate Gradient Linear System Solver Library that scales very well is also memory-bound and the matrices for it are usually big, and since sparse linear system solvers are ubiquitous in high performance computing (HPC) and often are the most computationally intensive parts in scientific computing codes. A few of the many applications relying on sparse linear solvers include fusion energy simulation, space weather simulation, climate modeling, environmental modeling, the finite element method, and large-scale reservoir simulations to enhance oil recovery by the oil and gas industry. So that is why I am speaking about how many memory channels come in the 16-socket HPE NONSTOP X SYSTEMS or the 16-socket HPE Integrity Superdome X; as you notice, they can come with more than 512 cores and with 64 memory channels.

Also I have just benchmarked my Scalable Varfiler and it is scaling above 7x on my 16-core dual Xeon processor, and it is scaling well since I have 8 memory channels, and I invite you to look at my powerful Scalable Varfiler carefully in the following web link:

https://sites.google.com/site/scalable68/scalable-parallel-varfiler

More of my philosophy about how many memory channels are in the 16-socket HPE NONSTOP X SYSTEMS and more of my thoughts..

I think I was right by saying that the 16-socket HPE NONSTOP X SYSTEMS or the 16-socket HPE Integrity Superdome X have around 2 to 4 memory channels per socket on x86 with Intel Xeons, and it means that they have 32 or 64 memory channels.
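A rough roofline-style estimate (a sketch, assuming 8-byte double-precision elements) makes precise why the number of memory channels matters so much for the memory-bound kernels discussed above. A dense n × n matrix-vector multiplication performs about 2n^2 flops while reading at least 8n^2 bytes of matrix data, so its arithmetic intensity I is capped at a small constant, and attainable performance is then bounded by the memory bandwidth:

```latex
I_{\mathrm{matvec}} \approx \frac{2n^{2}\ \text{flops}}{8n^{2}\ \text{bytes}}
                  = 0.25\ \text{flops/byte},
\qquad
P_{\mathrm{attainable}} = \min\bigl(P_{\mathrm{peak}},\; B_{\mathrm{mem}} \cdot I\bigr)
```

Since I is fixed, the only way to raise the attainable performance of such a kernel is to raise the aggregate bandwidth B_mem, which grows with the number of memory channels; matrix-matrix multiplication, by contrast, can reuse each operand many times through blocking, so its intensity grows with the block size and it can stay compute-bound.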
You can read here the FAQ from Hewlett Packard Enterprise from USA to see it:

https://bugzilla.redhat.com/show_bug.cgi?id=1346327

And it says the following:

"How many memory channels per socket for specific CPU? Each of the 8 blades has 2 CPU sockets. Each CPU socket has 2 memory channels each connecting to 2 memory controllers that contain 6 Dimms each."

So I think that it can also support 4 memory channels per CPU socket with Intel Xeons.

More of my philosophy about the highest availability with HPE NONSTOP X SYSTEMS from Hewlett Packard Enterprise from USA and more of my thoughts..

I have just talked (read it below) about the 16-socket HPE Integrity Superdome X from Hewlett Packard Enterprise from USA, but for the highest "availability" on the x86 architecture, I advise you to buy the 16-socket HPE NONSTOP X SYSTEMS from Hewlett Packard Enterprise from USA, and read about it here:

https://www.hpe.com/hpe-external-resources/4aa4-2000-2999/enw/4aa4-2988?resourceTitle=Engineered+for+the+highest+availability+with+HPE+Integrity+NonStop+family+of+systems+brochure&download=true

And here are more of my thoughts about the history of HP NonStop on x86:

More of my philosophy about HP and about the Tandem team and more of my thoughts..

I invite you to read the following interesting article to notice how HP was smart by also acquiring Tandem Computers, Inc. with their "NonStop" systems, and by learning from the Tandem team that has also extended HP NonStop to the x86 server platform; you can read about it in my writing below, and you can read about Tandem Computers here: https://en.wikipedia.org/wiki/Tandem_Computers , so notice that Tandem Computers, Inc.
was the dominant manufacturer of fault-tolerant computer systems for ATM networks, banks, stock exchanges, telephone switching centers, and other similar commercial transaction processing applications requiring maximum uptime and zero data loss:

https://www.zdnet.com/article/tandem-returns-to-its-hp-roots/

More of my philosophy about HP "NonStop" on the x86 server platform and fault-tolerant computer systems and more..

HP announced in 2013 plans to extend its mission-critical HP NonStop technology to the x86 server architecture, providing the 24/7 availability required in an always-on, globally connected world, and increasing customer choice. Read the following to notice it:

https://www8.hp.com/us/en/hp-news/press-release.html?id=1519347#.YHSXT-hKiM8

And today HP provides HP NonStop on the x86 server platform, and here is an example, read here:

https://www.hpe.com/ca/en/pdfViewer.html?docId=4aa5-7443&parentPage=/ca/en/products/servers/mission-critical-servers/integrity-nonstop-systems&resourceTitle=HPE+NonStop+X+NS7+%E2%80%93+Redefining+continuous+availability+and+scalability+for+x86+data+sheet

So I think that programming HP NonStop for x86 is now compatible with standard x86 programming.

More of my philosophy about the 16-socket HPE Integrity Superdome X from Hewlett Packard Enterprise from USA and more of my thoughts..
I think I am highly smart since I have passed two certified IQ tests and I have scored "above" 115 IQ, so I think that parallel programming with memory on Intel's CXL will be different from parallel programming the many memory channels on many sockets. So, to scale the memory channels much more on many sockets and stay compatible, I advise you, for example, to buy the 16-socket HPE Integrity Superdome X from Hewlett Packard Enterprise from USA here:

https://cdn.cnetcontent.com/3b/dc/3bdcd896-f2b4-48e4-bbf6-a75234db25da.pdf

And I am sure that my powerful open source software project below of Parallel C++ Conjugate Gradient Linear System Solver Library that scales very well will work correctly on the 16-socket HPE Superdome X.

More of my philosophy about the future of system memory and more of my thoughts..

Here is the future of system memory, of how to scale as with many more memory channels:

THE FUTURE OF SYSTEM MEMORY IS MOSTLY CXL

Read more here: https://www.nextplatform.com/2022/07/05/the-future-of-system-memory-is-mostly-cxl/

So I think the way of parallel programming in the standard Intel CXL will look like parallel programming with many memory channels, as I am doing below with my powerful open source software project of Parallel C++ Conjugate Gradient Linear System Solver Library that scales very well.

More of my philosophy about x86 CPUs and about cache prefetching and more of my thoughts..
I think I am highly smart since I have passed two certified IQ tests and I have scored "above" 115 IQ, and today I will talk about how to prefetch data into the caches on x86 microprocessors. So here are my Delphi and Freepascal x86 inline assembler procedures that prefetch data into the caches.

For 32-bit Delphi and Freepascal compilers, here is how to prefetch data into the caches. Notice that, in the Delphi and Freepascal compilers, when we pass the first parameter of the procedure with the register calling convention, it is passed in the CPU register eax of the x86 microprocessor (and note that prefetcht1 prefetches into the L2 cache and higher levels, not into L1):

procedure Prefetch(p : pointer); register;
asm
  prefetcht1 byte ptr [eax]
end;

For 64-bit Delphi and Freepascal compilers, here is the same procedure; with the Win64 calling convention the first parameter is passed in the CPU register rcx of the x86 microprocessor:

procedure Prefetch(p : pointer); register;
asm
  prefetcht1 byte ptr [rcx]
end;

And you can request a loading of the data 256 bytes in advance into the caches, and it can be efficient, by doing this:

For 32-bit Delphi and Freepascal compilers:

procedure Prefetch(p : pointer); register;
asm
  prefetcht1 byte ptr [eax+256]
end;

For 64-bit Delphi and Freepascal compilers:

procedure Prefetch(p : pointer); register;
asm
  prefetcht1 byte ptr [rcx+256]
end;

You can also prefetch into other cache levels with the x86 assembler instructions prefetcht0 (which prefetches into all cache levels, including L1) and prefetcht2 (which prefetches into the outer cache levels, typically L3); just replace, in the above inline assembler procedures, prefetcht1 with prefetcht0 or prefetcht2. But I think I am highly smart and I say that those prefetch x86 assembler instructions are used since the microprocessor can be faster than memory, so then you have to understand that today the story is much nicer, since the powerful x86 processor cores can all sustain many memory requests, and we call this process:
"memory-level parallelism", and today x86 AMD or Intel processor cores can support more than 10 independent memory requests at a time; for example, the Graviton 3 ARM CPU appears to sustain about 19 simultaneous memory loads per core, against about 25 for the Intel processor. So then I think I can also say that this memory-level parallelism is a form of latency hiding that speeds things up, so that the CPU doesn't wait too much for memory.

And now I invite you to read more of my thoughts about stack memory allocations and about preemptive and non-preemptive timesharing in the following web link:

https://groups.google.com/g/alt.culture.morocco/c/JuC4jar661w

And more of my philosophy about stack traces and more of my thoughts..

I think I am highly smart, and I say that there are advantages and disadvantages to portability in software programming. For example, you can make your application run just on the Windows operating system, and that can be much more business-friendly than making it run on multiple operating systems, since in business you have, for example, to develop and sell your application faster or much faster than the competition, so then we cannot say that the tendency of C++ toward requiring portability is a good thing.
Other than that, I have just looked at Delphi and Freepascal and I have noticed that the stack trace support in Freepascal is much more enhanced than in Delphi. Look, for example, at the following Freepascal application that has made stack traces portable to different operating systems and CPU architectures; it is a much more enhanced stack trace that is better than the Delphi one, which runs just on Windows:

https://github.com/r3code/lazarus-exception-logger

But notice carefully that the Delphi one runs just on Windows:

https://docwiki.embarcadero.com/Libraries/Sydney/en/System.SysUtils.Exception.StackTrace

So I think that since a much more enhanced stack trace is important, Delphi needs to provide us with one that is portable to different operating systems and CPU architectures. Also the Free Pascal Developer
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page. To unsubscribe from this group and stop receiving emails from it send an email to comp.programming.threads+unsubscribe@googlegroups.com. |