- More of my philosophy about the highest availability with HPE NONSTOP X SYSTEMS from Hewlett Packard Enterprise from USA and more of my thoughts.. - 1 Update
- More of my philosophy about the 16 sockets HPE Integrity Superdome X from Hewlett Packard Enterprise from USA and more of my thoughts.. - 1 Update
- More of my philosophy about the future of system memory and more of my thoughts.. - 1 Update
Amine Moulay Ramdane <aminer68@gmail.com>: Oct 27 03:52PM -0700

Hello,

More of my philosophy about the highest availability with HPE NONSTOP X SYSTEMS from Hewlett Packard Enterprise from USA and more of my thoughts..

I am a white Arab, and I think I am smart since I have also invented many scalable algorithms and other algorithms..

I have just talked, read it below, about the 16 sockets HPE Integrity Superdome X from Hewlett Packard Enterprise from USA, but for the highest "availability" on the x86 architecture, I advise you to buy the 16 sockets HPE NonStop X systems from Hewlett Packard Enterprise from USA, and you can read about them here:

https://www.hpe.com/hpe-external-resources/4aa4-2000-2999/enw/4aa4-2988?resourceTitle=Engineered+for+the+highest+availability+with+HPE+Integrity+NonStop+family+of+systems+brochure&download=true

And here are more of my thoughts about the history of HP NonStop on x86:

More of my philosophy about HP and about the Tandem team and more of my thoughts..

I invite you to read the following interesting article so that you notice how HP was smart by also acquiring Tandem Computers, Inc. with their "NonStop" systems, and by learning from the Tandem team, which has also extended HP NonStop to the x86 server platform, as you can read in my writing below. You can read about Tandem Computers here:

https://en.wikipedia.org/wiki/Tandem_Computers

So notice that Tandem Computers, Inc. was the dominant manufacturer of fault-tolerant computer systems for ATM networks, banks, stock exchanges, telephone switching centers, and other similar commercial transaction processing applications requiring maximum uptime and zero data loss:

https://www.zdnet.com/article/tandem-returns-to-its-hp-roots/

More of my philosophy about HP "NonStop" on the x86 server platform and fault-tolerant computer systems and more..

Now, "HP to Extend HP NonStop to x86 Server Platform": HP announced in 2013 plans to extend its mission-critical HP NonStop technology to the x86 server architecture, providing the 24/7 availability required in an always-on, globally connected world, and increasing customer choice. Read the following to notice it:

https://www8.hp.com/us/en/hp-news/press-release.html?id=1519347#.YHSXT-hKiM8

And today HP provides HP NonStop on the x86 server platform, and here is an example, read here:

https://www.hpe.com/ca/en/pdfViewer.html?docId=4aa5-7443&parentPage=/ca/en/products/servers/mission-critical-servers/integrity-nonstop-systems&resourceTitle=HPE+NonStop+X+NS7+%E2%80%93+Redefining+continuous+availability+and+scalability+for+x86+data+sheet

So I think programming HP NonStop for x86 is now compatible with x86 programming.

More of my philosophy about the 16 sockets HPE Integrity Superdome X from Hewlett Packard Enterprise from USA and more of my thoughts..

I think I am highly smart since I have passed two certified IQ tests and I have scored "above" 115 IQ, and I think that parallel programming with memory over Intel's CXL will be different from parallel programming across many memory channels and many sockets. So, in order to scale the memory channels across many sockets and stay compatible, I advise you for example to buy the 16 sockets HPE Integrity Superdome X from Hewlett Packard Enterprise from USA, described here:

https://cdn.cnetcontent.com/3b/dc/3bdcd896-f2b4-48e4-bbf6-a75234db25da.pdf

And I am sure that my Powerful Open source software project of Parallel C++ Conjugate Gradient Linear System Solver Library below, which scales very well, will work correctly on the 16 sockets HPE Superdome X.
More of my philosophy about the future of system memory and more of my thoughts..

Here is the future of system memory and of how to scale as with many more memory channels:

THE FUTURE OF SYSTEM MEMORY IS MOSTLY CXL

Read more here:

https://www.nextplatform.com/2022/07/05/the-future-of-system-memory-is-mostly-cxl/

So I think that parallel programming with the standard Intel CXL will look like parallel programming with many memory channels, as I am doing it below with my Powerful Open source software project of Parallel C++ Conjugate Gradient Linear System Solver Library that scales very well.

More of my philosophy about x86 CPUs and about cache prefetching and more of my thoughts..

I think I am highly smart since I have passed two certified IQ tests and I have scored "above" 115 IQ, and today I will talk about how to prefetch data into the caches on x86 microprocessors:

So here are my Delphi and FreePascal x86 inline assembler procedures that prefetch data into the caches:

For 32 bit Delphi and FreePascal compilers, here is how to prefetch data into the caches. Notice that, in the Delphi and FreePascal compilers, when we pass the first parameter of the procedure with the register calling convention, it is passed in the eax CPU register of the x86 microprocessor:

procedure Prefetch(p : pointer); register;
asm
  prefetcht1 byte ptr [eax]
end;

For 64 bit Delphi and FreePascal compilers (on the Win64 target), the first parameter is passed in the rcx CPU register of the x86 microprocessor:

procedure Prefetch(p : pointer); register;
asm
  prefetcht1 byte ptr [rcx]
end;

And you can request the prefetching of the data that is 256 bytes ahead, and it can be efficient, by doing this:

For 32 bit Delphi and FreePascal compilers:

procedure Prefetch(p : pointer); register;
asm
  prefetcht1 byte ptr [eax+256]
end;

For 64 bit Delphi and FreePascal compilers:

procedure Prefetch(p : pointer); register;
asm
  prefetcht1 byte ptr [rcx+256]
end;

You can also target other cache levels with the x86 assembler instructions prefetcht0 and prefetcht2, so just replace, in the above inline assembler procedures, prefetcht1 with prefetcht0 or prefetcht2: prefetcht0 prefetches into all cache levels (including the level 1 cache), while prefetcht1 and prefetcht2 prefetch into the outer cache levels (level 2 and level 3), the exact placement depending on the microarchitecture. But I think I am highly smart and I say that you have to notice that those prefetch x86 assembler instructions exist because the microprocessor can be faster than memory. And you have to understand that today the story is much nicer, since the powerful x86 processor cores can each sustain many outstanding memory requests, and we call this "memory-level parallelism": today an x86 AMD or Intel processor core can support more than 10 independent memory requests at a time, and for example the Graviton 3 ARM CPU appears to sustain about 19 simultaneous memory loads per core, against about 25 for an Intel processor core. So I think I can also say that this memory-level parallelism is a form of latency hiding that speeds things up so that the CPU doesn't wait too much for memory.
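And here is a minimal sketch of how the above Prefetch procedure might be used, assuming a 64-bit Win64 target where the first parameter arrives in rcx, as in my 64-bit procedure above. The array size, the Sum accumulator, and the prefetch distance of 32 doubles (256 bytes, that is four 64-byte cache lines ahead) are only illustrative assumptions, not part of my library:

---

program PrefetchDemo;

{$mode delphi}
{$asmmode intel}

const
  N = 1000000; // illustrative array size

var
  A: array of double;
  Sum: double;
  i: integer;

// Prefetch the cache line that contains the given address
// (64-bit version, first parameter passed in rcx on Win64).
procedure Prefetch(p: pointer); register;
asm
  prefetcht1 byte ptr [rcx]
end;

begin
  SetLength(A, N);
  for i := 0 to N - 1 do
    A[i] := i;

  Sum := 0;
  for i := 0 to N - 1 do
  begin
    // Ask the CPU to start loading the data 32 doubles (256 bytes) ahead,
    // so it is hopefully already in the caches when the loop reaches it.
    if i + 32 < N then
      Prefetch(@A[i + 32]);
    Sum := Sum + A[i];
  end;

  writeln('Sum = ', Sum:0:0);
end.

---

Note that on a modern out-of-order core with good hardware prefetchers, such software prefetching of a simple sequential scan may bring little or nothing; it helps mostly for access patterns that the hardware prefetcher cannot predict.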
And now I invite you to read more of my thoughts about stack memory allocations and about preemptive and non-preemptive timesharing in the following web link:

https://groups.google.com/g/alt.culture.morocco/c/JuC4jar661w

And more of my philosophy about Stacktrace and more of my thoughts..

I think I am highly smart, and I say that there are advantages and disadvantages to portability in software programming. For example, you can make your application run just on the Windows operating system, and that can be much more business friendly than making it run on multiple operating systems, since in business you have, for example, to develop and sell your application faster or much faster than the competition, so we can not say that the tendency of C++ to require portability is always a good thing. Other than that, I have just looked at Delphi and FreePascal and I have just noticed that the stack trace support in FreePascal is much more enhanced than in Delphi. Look for example at the following FreePascal application that has made the stack trace portable to different operating systems and CPU architectures, and it is a much more enhanced stack trace that is better than the Delphi one that runs just on Windows:

https://github.com/r3code/lazarus-exception-logger

But notice carefully that the Delphi one runs just on Windows:

https://docwiki.embarcadero.com/Libraries/Sydney/en/System.SysUtils.Exception.StackTrace

So, since a much more enhanced stack trace is important, I think that Delphi needs to provide us with one that is portable to different operating systems and CPU architectures.

Also, the Free Pascal developer team is pleased to finally announce the addition of a long awaited feature, though to be precise it's two different, but very much related features: function references and anonymous functions. These two features can be used independently of each other, but they unfold their greatest power when used together. Read about it here, and see the small sketch just below:

https://forum.lazarus.freepascal.org/index.php/topic,59468.msg443370.html#msg443370
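Here is a minimal sketch of what those two announced features look like when used together, assuming a Free Pascal trunk (3.3.1) compiler with the new modeswitches enabled; the names TIntFunc, MakeAdder and AddFive are my own illustrative choices, not from the announcement:

---

program FuncRefDemo;

{$mode objfpc}
{$modeswitch functionreferences}
{$modeswitch anonymousfunctions}

type
  // A function reference: a managed "closure" type that can capture variables.
  TIntFunc = reference to function(x: LongInt): LongInt;

// Returns an anonymous function that captures the parameter "amount".
function MakeAdder(amount: LongInt): TIntFunc;
begin
  Result := function(x: LongInt): LongInt
            begin
              Result := x + amount; // "amount" is captured by the closure
            end;
end;

var
  AddFive: TIntFunc;
begin
  AddFive := MakeAdder(5);
  writeln(AddFive(10)); // prints 15
end.

---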
More of my philosophy about the AMD Epyc CPU and more of my thoughts..

I think I am highly smart since I have passed two certified IQ tests and I have scored "above" 115 IQ. If you want to be serious about buying a CPU and motherboard, I advise you to buy the following AMD Epyc 7313P Milan 16 cores CPU that costs much less (around 1000 US dollars) and that is reliable and fast, since it is a 16 cores CPU, it supports standard ECC memory and it supports 8 memory channels, here it is:

https://en.wikichip.org/wiki/amd/epyc/7313p

And the good Supermicro motherboard for it that supports the Epyc Milan 7003 series is the following:

https://www.newegg.com/supermicro-mbd-h12ssl-nt-o-supports-single-amd-epyc-7003-7002-series-processor/p/1B4-005W-00911?Description=amd%20epyc%20motherboard&cm_re=amd_epyc%20motherboard-_-1B4-005W-00911-_-Product

And the above AMD Epyc 7313P Milan 16 cores CPU can be configured as NUMA using the good Supermicro motherboard above as follows:

This setting enables a trade-off between minimizing local memory latency for NUMA-aware or highly parallelizable workloads vs. maximizing per-core memory bandwidth for non-NUMA-friendly workloads. The default configuration (one NUMA domain per socket) is recommended for most workloads. NPS4 is recommended for HPC and other highly parallel workloads. Here is the detailed introduction for such options:

• NPS0: Interleave memory accesses across all channels in both sockets (not recommended)
• NPS1: Interleave memory accesses across all eight channels in each socket, report one NUMA node per socket (unless L3 Cache as NUMA is enabled)
• NPS2: Interleave memory accesses across groups of four channels (ABCD and EFGH) in each socket, report two NUMA nodes per socket (unless L3 Cache as NUMA is enabled)
• NPS4: Interleave memory accesses across pairs of two channels (AB, CD, EF and GH) in each socket, report four NUMA nodes per socket (unless L3 Cache as NUMA is enabled)

And of course you have to read my following writing about DDR5 memory, which is not a fully ECC memory:

"On-die ECC: The presence of on-die ECC on DDR5 memory has been the subject of many discussions and a lot of confusion among consumers and the press alike. Unlike standard ECC, on-die ECC primarily aims to improve yields at advanced process nodes, thereby allowing for cheaper DRAM chips. On-die ECC only detects errors if they take place within a cell or row during refreshes. When the data is moved from the cell to the cache or the CPU, if there's a bit-flip or data corruption, it won't be corrected by on-die ECC. Standard ECC corrects data corruption within the cell and as it is moved to another device or an ECC-supported SoC."

Read more here to notice it:

https://www.hardwaretimes.com/ddr5-vs-ddr4-ram-quad-channel-and-on-die-ecc-explained/

So if you want to get serious and professional you can buy the above AMD Epyc 7313P Milan 16 cores CPU with the Supermicro motherboard that I am advising, which supports it, supports fully ECC memory and supports 8 memory channels.

And of course you can read my thoughts about technology in the following web link:

https://groups.google.com/g/soc.culture.usa/c/N_UxX3OECX4

And of course you have to read my following thoughts that also show how powerful it is to use 8 memory channels:

I have just said the following:

--

More of my philosophy about the new Zen 4 AMD Ryzen™ 9 7950X and more of my thoughts..

So I have just looked at the new Zen 4 AMD Ryzen™ 9 7950X CPU, and I invite you to look at it here:

https://www.amd.com/en/products/cpu/amd-ryzen-9-7950x

But notice carefully that the problem is the number of supported memory channels, since it supports just two memory channels, so it is not good, since for example my following Open source software project of Parallel C++ Conjugate Gradient Linear System Solver Library that scales very well is scaling around 8X on my 16 cores Intel Xeon with 2 NUMA nodes and with 8 memory channels, but it will not scale correctly on the new Zen 4 AMD Ryzen™ 9 7950X CPU with just 2 memory channels, since it is also memory-bound. And here is my Powerful Open source software project of Parallel C++ Conjugate Gradient Linear System Solver Library that scales very well, and I invite you to take a careful look at it:

https://sites.google.com/site/scalable68/scalable-parallel-c-conjugate-gradient-linear-system-solver-library

So I advise you to buy an AMD Epyc CPU or an Intel Xeon CPU that supports 8 memory channels.

---
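And to make the point about memory channels more concrete, here is a small back-of-the-envelope FreePascal sketch. It is only a rough model under stated assumptions, not a measurement: it assumes about 25.6 GB/s of peak bandwidth per DDR4-3200 memory channel and assumes that a memory-bound conjugate-gradient style kernel moves about 8 bytes of memory traffic per floating-point operation; both constants are illustrative assumptions.

---

program ChannelBandwidth;

{$mode objfpc}

const
  // Illustrative assumptions, not measurements:
  BytesPerSecPerChannel = 25.6e9; // peak of one DDR4-3200 channel (25.6 GB/s)
  BytesPerFlop          = 8.0;    // assumed traffic of a memory-bound CG-like kernel

procedure Report(channels: integer);
var
  bw, gflops: double;
begin
  bw     := channels * BytesPerSecPerChannel; // aggregate peak bandwidth
  gflops := bw / BytesPerFlop / 1e9;          // bandwidth-limited compute rate
  writeln(channels:2, ' channels -> ', bw / 1e9:6:1,
          ' GB/s peak, about ', gflops:6:1, ' GFLOP/s ceiling');
end;

begin
  // A memory-bound solver cannot go faster than the memory system feeds it,
  // no matter how many cores it uses, so the ceiling scales with the channels.
  Report(2);   // desktop class with two memory channels
  Report(8);   // Epyc Milan / Xeon class with eight memory channels
  Report(12);  // Zen 4 Epyc "Genoa" class with twelve memory channels
end.

---

Of course the real numbers depend on the DDR generation and on the achieved (not peak) bandwidth, but the ratio between 2, 8 and 12 channels is what limits a memory-bound kernel, and that is why two channels are not enough for it.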
And of course you can use the next twelve DDR5 memory channels of the Zen 4 AMD EPYC CPUs so as to scale my above algorithm even more, and read about it here:

https://www.tomshardware.com/news/amd-confirms-12-ddr5-memory-channels-on-genoa

And here is the simulation program that uses the probabilistic mechanism that I have talked about and that proves to you that the algorithm of my Parallel C++ Conjugate Gradient Linear System Solver Library is scalable:

If you look at my scalable parallel algorithm, it divides each array of the matrix into parts of 250 elements, and if you look carefully I am using two functions that consume the greater part of the CPU time, atsub() and asub(), and inside those functions I am using a probabilistic mechanism so as to render my algorithm scalable on the NUMA architecture, and it also makes it scale on the memory channels. What I am doing is scrambling the array parts using a probabilistic function, and what I have noticed is that this probabilistic mechanism is very efficient. To prove to you what I am saying, please look at the following simulation that I have done using a variable that contains the number of NUMA nodes, and what I have noticed is that my simulation is giving almost a perfect scalability on the NUMA architecture. For example, let us give the "NUMA_nodes" variable a value of 4, and our array a size of 250: the simulation below will give a number of contention points of about a quarter of the array, so if I am using 16 cores, in the worst case it will scale 4X in throughput on the NUMA architecture, because since I am using an array of 250 and there is a quarter of the array of contention points, from Amdahl's law this gives a scalability of almost 4X throughput on four NUMA nodes, and this gives almost a perfect scalability on more and more NUMA nodes, so my parallel algorithm is scalable on the NUMA architecture and it also scales well on the memory channels.

Here is the simulation that I have done, please run it and you will notice yourself that my parallel algorithm is scalable on the NUMA architecture.

Here it is:

---

program test;

uses math;

var
  tab, tab1, tab2: array of integer;
  a, n1, k, i, n2, tmp, j, numa_nodes: integer;

begin
  a := 250;           // size of one array part
  numa_nodes := 4;    // number of NUMA nodes of the simulation

  // tab2[i] holds the NUMA node assigned to slot i.
  setlength(tab2, a);
  for i := 0 to a - 1 do
    tab2[i] := i mod numa_nodes;

  // First scrambled permutation.
  setlength(tab, a);
  randomize;
  for k := 0 to a - 1 do
    tab[k] := k;
  n2 := a - 1;
  for k := 0 to a - 1 do
  begin
    n1 := random(n2);
    tmp := tab[k];
    tab[k] := tab[n1];
    tab[n1] := tmp;
  end;

  // Second scrambled permutation.
  setlength(tab1, a);
  randomize;
  for k := 0 to a - 1 do
    tab1[k] := k;
  n2 := a - 1;
  for k := 0 to a - 1 do
  begin
    n1 := random(n2);
    tmp := tab1[k];
    tab1[k] := tab1[n1];
    tab1[n1] := tmp;
  end;

  // Count the slots where both permutations land on the same NUMA node.
  j := 0;
  for i := 0 to a - 1 do
    if tab2[tab[i]] = tab2[tab1[i]] then
    begin
      inc(j);
      writeln('A contention at: ', i);
    end;

  writeln('Number of contention points: ', j);

  setlength(tab, 0);
  setlength(tab1, 0);
  setlength(tab2, 0);
end.

---
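As a small complement, the expected number of contention points of the simulation above can also be computed analytically: if both scrambled permutations are close to uniformly random, the probability that a given slot lands on the same NUMA node in both of them is about 1/numa_nodes, so the expected count is about a/numa_nodes. Here is a minimal FreePascal sketch of that estimate; it is only the analytical expectation under that uniformity assumption, not a measurement:

---

program ContentionEstimate;

{$mode objfpc}

const
  a = 250; // size of one array part, as in the simulation above

var
  numa_nodes: integer;

begin
  // The fraction of contention points shrinks as 1/numa_nodes, which is why
  // the simulation above reports about a quarter of the array for 4 NUMA nodes.
  for numa_nodes := 2 to 8 do
    writeln(numa_nodes, ' NUMA nodes -> about ',
            a / numa_nodes:0:1, ' expected contention points out of ', a);
end.

---

Both this small program and the simulation above compile directly with the Free Pascal command line compiler (fpc).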
Amine Moulay Ramdane <aminer68@gmail.com>: Oct 27 12:54PM -0700

Hello,

More of my philosophy about the 16 sockets HPE Integrity Superdome X from Hewlett Packard Enterprise from USA and more of my thoughts..

I am a white Arab, and I think I am smart since I have also invented many scalable algorithms and other algorithms..

I think I am highly smart since I have passed two certified IQ tests and I have scored "above" 115 IQ, and I think that parallel programming with memory over Intel's CXL will be different from parallel programming across many memory channels and many sockets. So, in order to scale the memory channels across many sockets and stay compatible, I advise you for example to buy the 16 sockets HPE Integrity Superdome X from Hewlett Packard Enterprise from USA, described here:

https://cdn.cnetcontent.com/3b/dc/3bdcd896-f2b4-48e4-bbf6-a75234db25da.pdf

And I am sure that my Powerful Open source software project of Parallel C++ Conjugate Gradient Linear System Solver Library, which scales very well, will work correctly on the 16 sockets HPE Integrity Superdome X from Hewlett Packard Enterprise from USA.
Amine Moulay Ramdane <aminer68@gmail.com>: Oct 27 11:51AM -0700

Hello,

More of my philosophy about the future of system memory and more of my thoughts..

I am a white Arab, and I think I am smart since I have also invented many scalable algorithms and other algorithms..

Here is the future of system memory and of how to scale as with many more memory channels:

THE FUTURE OF SYSTEM MEMORY IS MOSTLY CXL

Read more here:

https://www.nextplatform.com/2022/07/05/the-future-of-system-memory-is-mostly-cxl/

So I think that parallel programming with the standard Intel CXL will be the same as programming with many memory channels, as I am doing it with my Powerful Open source software project of Parallel C++ Conjugate Gradient Linear System Solver Library that scales very well.
More of my philosophy about the problem with capacity planning of a website and more of my thoughts..

I think I am highly smart since I have passed two certified IQ tests and I have scored above 115 IQ, and I have just invented a new methodology that simplifies a lot the capacity planning of a website that can have a three-tier architecture, with the web servers, the application servers and the database servers. But I have to explain more so that you understand the big problem with the capacity planning of a website: when you want, for example, to use web testing, the problem is how to choose, for example, the correct distribution of the read, write and delete transactions on the database of the website? If it is not realistic, you can go beyond the knee of the curve and get an unacceptable waiting time, and the Mean Value Analysis (MVA) algorithm has the same problem, so how do you solve the problem? So, as you are noticing, this is why I have come with my new methodology that uses mathematics and that solves the problem.

And read my previous thoughts:

More of my philosophy about website capacity planning and about Quality of Service and more of my thoughts..

I think I am highly smart since I have passed two certified IQ tests and I have scored above 115 IQ, so I think that you have to lower to a certain level the QoS (quality of service) of a website, since you have to fix a limit on the number of connections that we allow to the website so as not to go beyond the knee of the curve. And of course I will soon show you the mathematical calculations of my new methodology of how to do capacity planning of a website, and of course you have to know that we have to do capacity planning using mathematics so as to know the average waiting time etc., and this permits us to calculate the number of connections that we allow to the website.

More of my philosophy about the Mean Value Analysis (MVA) algorithm and more of my thoughts..

I think I am highly smart since I have passed two certified IQ tests and I have scored above 115 IQ, and I have just read the following paper about the Mean Value Analysis (MVA) algorithm, and I invite you to read it carefully:

https://www.cs.ucr.edu/~mart/204/MVA.pdf

But I say that I am understanding easily the above paper about the Mean Value Analysis (MVA) algorithm, but I say that the above paper doesn't say that you have to empirically collect the visit ratio and the average demand of each class, so it is not so practical, since I say that you can and you have, for example, to calculate the "tendency" by also, for example, rendering the not-memoryless service of, for example, the database into a memoryless service, but don't worry since I will soon make you understand my powerful methodology with all the mathematical
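Since the knee of the curve and the MVA algorithm are both mentioned above, here is a minimal FreePascal sketch of the standard exact Mean Value Analysis recursion for a single-class closed queueing network (the classic exact MVA due to Reiser and Lavenberg). The three stations, their service demands and the think time are purely illustrative assumptions, not measurements of any real website; the sketch computes the throughput and the average response time as the number of concurrent users grows, which is exactly what lets you see where the knee is and cap the number of allowed connections:

---

program MvaDemo;

{$mode objfpc}

const
  K = 3; // stations: web server, application server, database server
  // Illustrative service demands per visit to the site, in seconds:
  D: array[1..K] of double = (0.010, 0.020, 0.030);
  Z = 1.0;       // think time between requests, in seconds (assumed)
  MaxUsers = 50;

var
  Q: array[1..K] of double; // average queue length at each station
  R: array[1..K] of double; // average residence time at each station
  n, k: integer;
  Rtot, X: double;

begin
  for k := 1 to K do
    Q[k] := 0.0;

  writeln('users':6, 'throughput':14, 'response time':16);
  for n := 1 to MaxUsers do
  begin
    Rtot := 0.0;
    for k := 1 to K do
    begin
      // Exact MVA: an arriving customer sees the queues of a network
      // that contains n-1 customers.
      R[k] := D[k] * (1.0 + Q[k]);
      Rtot := Rtot + R[k];
    end;
    X := n / (Z + Rtot);   // throughput with n concurrent users
    for k := 1 to K do
      Q[k] := X * R[k];    // Little's law applied per station

    writeln(n:6, X:14:3, Rtot:16:4);
  end;
end.

---

With these made-up demands, the slowest station (the 0.030 s database) saturates first, and once its utilization approaches 1 the response time column starts to climb roughly linearly with the number of users: that visible knee is the point beyond which, as I say above, you should not allow more connections.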
You received this digest because you're subscribed to updates for this group. You can change your settings on the group membership page. To unsubscribe from this group and stop receiving emails from it send an email to comp.programming.threads+unsubscribe@googlegroups.com.