A Not So Simple Matter of Software



For a few moments, the atmosphere was more rock concert than supercomputing conference, with many members of a packed audience standing, cheering, and waving signs as Jack Dongarra took the stage to deliver the annual ACM Turing Award lecture at SC22. Few people are as deeply connected to the evolution of HPC software, or to the Top500 list that spotlights the fastest supercomputers in the world, as Dongarra, who created the Top500 in 1993 with Hans Meuer and Erich Strohmaier. (The latest Top500 was unveiled on Monday at SC22.)

“I wasn’t expecting this. Wow,” said Dongarra, visibly moved. “I have to say it’s a great honor to be the most recent recipient of the ACM A.M. Turing Award. An award like this could not have come about without the help and support of many people over the years.”


Considered the Nobel Prize of computer science, the ACM A.M. Turing Award, named for Alan Turing, also carries a $1 million prize. Here’s a brief excerpt from the ACM tribute to Dongarra:

“Dongarra has led the world of high-performance computing through his contributions to efficient numerical algorithms for linear algebra operations, parallel computing programming mechanisms, and performance evaluation tools. For nearly forty years, Moore’s Law produced exponential growth in hardware performance. During that same time, while most software failed to keep pace with these hardware advances, high-performance numerical software did – largely due to Dongarra’s algorithms, optimization techniques, and production-quality software implementations.

“These contributions laid a framework from which scientists and engineers made important discoveries and game-changing innovations in areas including big data analytics, healthcare, renewable energy, weather prediction, genomics, and economics, to name a few. Dongarra’s work also helped facilitate leapfrog advances in computer architecture and supported revolutions in computer graphics and deep learning.”

The title of Dongarra’s talk – A Not So Simple Matter of Software – neatly captures his decades-long work in HPC software development. Without the underlying software, the stunning advances we’ve seen in HPC hardware would never deliver on their promise. While co-design methodologies are increasingly used and seek to better combine hardware and software development – the Exascale Computing Project is a good example – the ground-level truth is that software is always a step behind.

“We’re in sort of a catch-up mode all the time, I feel,” said Dongarra. “The architecture changes, and the algorithms and software try to catch up with that architecture. I have this image of the hardware people throwing something over the fence, and the algorithms people and software guys scrambling to figure out how to fit their problems on that machine to effectively deal with it. It takes about 10 years to do that. Then [a] new machine is thrown over the fence and we start that cycle over again.”

Dongarra covered a lot of ground in his talk, starting with the early vector-based machines (think Cray-1) of the 1970s and pushing forward through multicore CPUs and clustering to today’s heterogeneous architectures (think Frontier) that combine CPUs and a variety of accelerators. Software development was the connecting thread. He has had his hands in the development of math libraries (the various BLAS, LAPACK), message passing (MPI), the LINPACK benchmark, directed acyclic graph (DAG) scheduling, and more.

He offered a brief glimpse into his roots.

“My grandfather was 42 when he took the family to Naples, boarded a ship, and sailed for Ellis Island. That was in 1929. He had $25 in his pocket and was full of hopes and dreams of a life there. My father was 10 years old,” said Dongarra. “I did pretty well in math and science, but really struggled with reading and spelling, and later as an adult I found out I was dyslexic.” He went to Chicago State University, graduating in 1972.

“My dream was to be a high school teacher. My last semester in college, I was encouraged to apply for a position at Argonne National Laboratory. This was a position where you would spend a semester with a scientist. I think I joined about 30 other undergraduates at Argonne. I was in the math and computer science division at Argonne, working with Brian Smith on mathematical software, and it was a transformational semester. I realized I had a passion for these kinds of things – developing software, mathematical software, and linear algebra. As a result of that encounter, I stayed at Argonne from that point until 1989.”

Not bad for someone of modest roots and modest early ambition. From Argonne National Laboratory he moved to Oak Ridge National Laboratory and the University of Tennessee and has been there since. He retired from teaching in July but maintains his research schedule and, of course, his role in the Top500.

Capturing all of Dongarra’s comments is beyond the scope of a short article. Presented here are a few of his comments and slides around key points in his long career. That said, no account would be complete without going back to the creation of the Top500, which Dongarra calls “The Accidental Benchmark.” The story begins in the early 1970s, when vector machines ruled the roost, with the creation of LINPACK.

“We were evolving our ideas about software, trying to match the hardware characteristics of the time. The hardware characteristic was vectors. We thought we should put in place ideas that highlighted those vector operations,” recalled Dongarra. “There was a de facto community-based standard that was proposed by four people – Chuck Lawson, Fred Krogh, David Kincaid, and Dick Hanson – for doing those vector operations, [and] I went off and immediately implemented them in Fortran and used the technique of unrolling the loops. That’s a standard technique today that we assume a compiler can do. But back in 1973, that was a rather novel thing to do. And that resulted in my first publication, unrolling loops in Fortran, and it led to an improvement in performance of about 10-to-20% on many, many systems.
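The technique Dongarra describes can be sketched in C (his original work was in Fortran; the function names here are illustrative, not from any library). Unrolling a DAXPY-style vector update reduces the loop-control overhead per element, which is exactly the hand optimization that compilers of 1973 did not yet perform:

```c
#include <stddef.h>

/* Straightforward DAXPY-style vector loop: y = a*x + y. */
void daxpy_simple(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* The same loop unrolled by four: fewer branch and index updates per
   element processed. Modern compilers do this automatically, which is
   why the trick was notable in 1973 but routine now. */
void daxpy_unrolled(size_t n, double a, const double *x, double *y) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    for (; i < n; i++)          /* cleanup when n is not a multiple of 4 */
        y[i] += a * x[i];
}
```

Both versions perform the identical arithmetic; only the loop bookkeeping changes.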

Software portability, not surprisingly, was both a growing concern and a goal.

“We wanted to take all of the ideas and the opportunities for doing portable programming that were gained in the EISPACK project and move them into another area, solving systems of linear equations. The eigenvalue problem was one of the first problems that was tackled. The follow-on project was going to do systems of linear equations, least-squares problems, and singular values. This project would respect the Fortran community and use the de facto standard that had just been put in place,” recalled Dongarra.

“That’s really the origins of the LINPACK project. So LINPACK – many people think of it as a benchmark, but it’s actually a collection of software for solving systems of linear equations. It was funded by NSF [and] involved four groups: one at Argonne that was contributed by myself, [another at the] University of New Mexico [with] Cleve Moler, the University of Maryland [with] Pete Stewart, and the University of California with Jim Bunch. So that’s a picture of us. Jim Bunch on the far right here, Pete Stewart, Cleve Moler. And that’s the 1979 version of me with a little bit more hair. And that’s my car.”

So where’s the list?

“In the appendix of this user’s guide, I put together a little table that was the result of solving a system of equations for a matrix of size 100, and it reported on 24 machines ranging from a Cray-1 to a DEC (Digital Equipment Corp.) PDP-10 computer. This table (slide below) is a record of that benchmark – if you will – solving a system of linear equations. I put down the time it took to solve it. The hand scribble is the rate of execution for each of those machines,” he said.

“So the Cray at NCAR turned out to be the fastest computer at 14 megaflops for solving that system of linear equations. The guy at the bottom of the list is a PDP system that was at Yale. That’s really the origins of the LINPACK benchmark. This is the first ranking of it (LINPACK). The Top500 hadn’t even been thought of at this point in time. But I maintained this list and [it] grew from 24 machines to 100 machines, to 1,000 machines, to about 5,000 systems at one point. So there were many, many machines and we had a good basis for looking at performance.”

Looking through the systems and vendors on the list below is a neat walk through computer history.

The Top500 was eventually created in 1993. “Since 1978, I had this list of machines for solving systems of equations. Hans and Erich had a list of the fastest computers, ranked by their theoretical peak performance. Hans and Erich approached me and said we should really merge our two lists and call it the Top500,” said Dongarra, and so they did. The Top500 list is updated twice a year, once at SC in November and again at ISC in June.

“The way to think about this [benchmark] is we’re going to solve a system of equations. The ground rules say you must use Gaussian elimination with partial pivoting, you have to do 64-bit computations, and we’re going to look at the performance. Typically, as you increase the size of the problem, the performance goes up until it reaches some asymptotic point, and what we’d like to do is capture the asymptotic performance for solving a system of equations using Gaussian elimination and 64-bit floating point arithmetic. That’s the basis for all of the numbers we have since this list was created.”
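Those ground rules can be illustrated with a toy C sketch (this is not the real LINPACK or HPL code, and the function names are invented for this example): Gaussian elimination with partial pivoting in 64-bit arithmetic, plus the nominal 2/3·n³ + 2·n² operation count that the benchmark divides by the measured time to report a flop rate.

```c
#include <math.h>

/* Toy sketch of the LINPACK benchmark ground rules: solve A x = b by
   Gaussian elimination with partial pivoting, entirely in 64-bit
   (double) arithmetic, for a matrix of order N = 100. */
#define N 100

/* Nominal operation count the benchmark charges for the solve:
   2/3 n^3 + 2 n^2 flops. Reported rate = this count / measured time. */
double linpack_flops(int n) {
    return 2.0 / 3.0 * n * n * n + 2.0 * n * n;
}

void solve(double a[N][N], double b[N], double x[N]) {
    for (int k = 0; k < N; k++) {
        int p = k;                              /* partial pivoting:  */
        for (int i = k + 1; i < N; i++)         /* pick largest |a[i][k]| */
            if (fabs(a[i][k]) > fabs(a[p][k])) p = i;
        for (int j = 0; j < N; j++) {           /* swap rows k and p  */
            double t = a[k][j]; a[k][j] = a[p][j]; a[p][j] = t;
        }
        double t = b[k]; b[k] = b[p]; b[p] = t;
        for (int i = k + 1; i < N; i++) {       /* eliminate column k */
            double m = a[i][k] / a[k][k];
            for (int j = k; j < N; j++) a[i][j] -= m * a[k][j];
            b[i] -= m * b[k];
        }
    }
    for (int i = N - 1; i >= 0; i--) {          /* back substitution  */
        x[i] = b[i];
        for (int j = i + 1; j < N; j++) x[i] -= a[i][j] * x[j];
        x[i] /= a[i][i];
    }
}
```

Timing the `solve` call and dividing `linpack_flops(100)` by the elapsed seconds gives exactly the megaflop numbers in Dongarra's hand-annotated table.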

A champion of collaborative software development generally, Dongarra reviewed somewhat similar experiences around the development of LAPACK and MPI. Throughout his talk, he emphasized that it is changes in the hardware that drive changes in software. Consider the arrival of powerful microprocessors and cache memory.

“Because the machines had cache, we realized that we needed to raise the level of granularity of the operations. Vector operations were too simple. We wanted to use the cache as much as possible. So we got together a community effort to define what we call the Level 2 and Level 3 BLAS (Basic Linear Algebra Subprograms). Level 2 BLAS perform matrix-vector operations and Level 3 do matrix-matrix operations,” he said.
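The reason the levels matter for cache is the ratio of arithmetic to data movement. Naive C sketches of one representative routine per level make this concrete (the real BLAS interfaces carry extra arguments such as strides, scaling factors, and transpose flags, omitted here):

```c
#include <stddef.h>

/* Level 1: vector-vector, y = a*x + y (axpy).
   O(n) data moved, O(n) flops: one multiply-add per element loaded. */
void axpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++) y[i] += a * x[i];
}

/* Level 2: matrix-vector, y = A*x + y (gemv), row-major A.
   O(n^2) data, O(n^2) flops: still O(1) flops per element loaded. */
void gemv(size_t n, const double *A, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            y[i] += A[i * n + j] * x[j];
}

/* Level 3: matrix-matrix, C = A*B + C (gemm), row-major.
   O(n^2) data but O(n^3) flops: O(n) flops per element loaded, which
   is what lets a blocked implementation reuse data in cache and run
   near the machine's peak. */
void gemm(size_t n, const double *A, const double *B, double *C) {
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++)
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

Only the Level 3 operation offers enough arithmetic per memory access for cache blocking to pay off, which is why LAPACK was built on it.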

“The idea being that we could store part of that data in cache and get very quick access to the elements of the matrices, and the performance would be enhanced as a result of exploiting those characteristics. We decided to form an effort to develop software for this project. It was funded primarily by the National Science Foundation and the Department of Energy. [The goal] was to take the ideas and algorithms in LINPACK together with the algorithms and ideas in EISPACK and put them together in a single package,” said Dongarra.

The result was LAPACK, designed to effectively exploit the caches of modern cache-based architectures and the instruction-level parallelism of modern processors.

The rise of distributed memory machines was another driver and helped give rise to MPI.

“Message passing was in the air. We didn’t have a standard. Each manufacturer had its own way of doing message passing, each group had its own way of doing it. The group at Argonne had P4, the guys at U Tennessee had PVM, there was a group at Caltech that had its way of doing it, another group out in Germany had their way, and there were guys at Yale doing something else. There was really a need for a standard so [we could] develop software that would be effective and portable across machines without having to do major rewrites of the software. That was the catalyst for MPI,” said Dongarra, who emphasized this too was a community-driven project.

“It was started by perhaps 35-40 people. We followed the roadmap that Ken Kennedy had laid out, using the same template that he had for the HPF Forum (High Performance Fortran Forum). The idea was to bring together that group of people every six weeks, and do that for three days, concentrating on developing the standard. We decided that around a year and a half would be the right amount of time. That was a target,” he recalled.

“We had great contributions from many people. The guys at Argonne, Bill Gropp and Rusty (Ewing) Lusk, decided to do an implementation of the standard as it was being developed. So, we had a way to try out ideas immediately. That [provided] terrific feedback that allowed us to make changes, and we ultimately ended up having the standard implemented and readily adopted by many groups.”

Dongarra had plenty to say about the Top500, the fading value of LINPACK as a metric, and his strong belief that HPCG [High Performance Conjugate Gradients] is a better measure. He also talked at length about memory-bound obstacles and the rise of CPU-supervised systems in which GPUs do the vast bulk of the work. HPCwire will have coverage of these issues in its reporting on the latest Top500 results. At his talk, he urged attendees to visit the Top500 BOF, which he said would tackle many of the thorny issues facing the Top500.

The Q&A turned up a couple of interesting discussions. One question, not surprisingly, was about future architectures.

“Today, we have machines that are built on manycore plus GPUs. I would think that in the future, we’ll see that grow, [and] have other accelerators added to that collection. So think about adding an accelerator that does something special for AI. Or think about adding an accelerator that does something like neuromorphic computing. We can add accelerators to the collection to help in solving our problems. Maybe quantum would be another accelerator – I don’t see quantum being its own computer,” said Dongarra.

The benefit, said Dongarra, is “that specific applications could draw on those components to get high performance, or a user could dial up perhaps what mixture of accelerators they choose to have on their specific system, according to the applications. It’s about making sure that we have the hardware matching the applications that are intended to run on this machine and having the right mixture.”

He had noted earlier in his talk how modern systems affect math library development and use.

“Today’s environment for developing numerical libraries is highly parallel and uses distributed memory. There’s an MPI and OpenMP programming model. It’s heterogeneous, using commodity processors and accelerators. It exploits techniques that avoid simple loop-level parallelism and instead tries to look at a directed acyclic graph for the computation. The thing to point out is that communication is tremendously expensive on these machines; these machines are over-provisioned for floating point. And the communication is really where we’re spending most of the time. And that needs to be taken into account in designing algorithms,” said Dongarra.

“Conventional wisdom would say that if we have to decide between two algorithms to use on a machine, and one algorithm does more floating point arithmetic than the other, we would choose the algorithm that does less floating point arithmetic. But because these machines are over-provisioned for floating point, it’s really communication that we’re paying for. We should look deeper and not just focus on the floating point operations, but look at what kind of communication is taking place. The other thing we have to realize is that 64-bit computation is what we commonly think of, but machines today are capable of 32-bit, 16-bit, and even 8-bit floating point operations. We should be looking at ways to leverage that increased performance by using this mixture, [and] there have been some pretty good success stories in the linear algebra space,” said Dongarra.
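One of the linear algebra success stories he alludes to is mixed-precision iterative refinement: do the expensive solve in low precision, then recover high accuracy by computing residuals in high precision and re-solving for a correction. A minimal C sketch of the idea, using a 2×2 system solved by Cramer's rule (a stand-in invented for this example; real codes apply the same pattern to an LU factorization, and the low precision on real hardware would be fp16 or fp32 tensor-core arithmetic):

```c
/* Low-precision solve of [[a,b],[c,d]] x = r: all arithmetic forced
   into float, standing in for the cheap low-precision units. */
static void solve2_float(double a, double b, double c, double d,
                         const double r[2], double x[2]) {
    float det = (float)a * (float)d - (float)b * (float)c;
    x[0] = ((float)r[0] * (float)d - (float)b * (float)r[1]) / det;
    x[1] = ((float)a * (float)r[1] - (float)r[0] * (float)c) / det;
}

/* Mixed precision: cheap float solve, then `steps` rounds of
   double-precision residual computation plus a float correction solve.
   Each round roughly squares the relative error until it hits the
   limit of double precision. */
void refine2(double a, double b, double c, double d,
             const double rhs[2], double x[2], int steps) {
    solve2_float(a, b, c, d, rhs, x);           /* initial low-precision solve */
    for (int s = 0; s < steps; s++) {
        double r[2] = {                          /* residual in double  */
            rhs[0] - (a * x[0] + b * x[1]),
            rhs[1] - (c * x[0] + d * x[1])
        };
        double e[2];
        solve2_float(a, b, c, d, r, e);          /* correction in float */
        x[0] += e[0];
        x[1] += e[1];
    }
}
```

The payoff is that almost all of the flops run at the fast low-precision rate, while the answer comes out with (nearly) full 64-bit accuracy.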

Another question noted that the Turing lecture provides an opportunity to discuss what the awardee would like to see happen and what areas might benefit from added funding.

“I’ve harped on the imbalance of the machines today. We build our machines based on commodity off-the-shelf processors, from AMD or Intel, commodity off-the-shelf accelerators, commodity off-the-shelf interconnects. [That’s] commodity stuff. We’re not designing our hardware to the specifics of the applications that are going to be used to drive them. Perhaps we should step back and take a closer look at how the architecture should interact with the applications and with the software. Co-design [is] something we talk about, but the reality is very little co-design takes place today with our hardware,” he said.

Citing Fugaku’s impressive efficiency numbers, he said, “Perhaps a better indicator is what’s happening in Japan, where they have much closer interactions with the architects, with the hardware people, to design machines that have a better balance. If I were going to look at forward-looking research projects, I’d say maybe we should spin up projects that look at architecture and have the architecture better reflected in the applications.”

Circling back to his comments on winning the ACM A.M. Turing Award, Dongarra was eager to spread the credit.

“I want to give a shout out to my mentors, colleagues, generations of postdocs, students, friends, and my staff at the University of Tennessee, who pushed things in the right direction to receive this distinction. I’m incredibly proud of the numerical software libraries that were created, the standards that were put in place, and the performance development tools that we deploy. I feel that this award is a recognition by the computer science community of the importance of HPC in computing, and of our collective contributions to computer science. So, congratulations to us.”