

For the last several years I've been involved in helping HPC applications at LLNL prepare for advanced architectures: first Sierra, and now El Capitan.

I've been asked to share with you an overview of some of the CS areas that will be coming out in the RFI. Additional details on some specific technical areas will be presented in follow-on presentations throughout the day.

## Extreme-scale HPC architectures introduce programming challenges

| System Change                    | Programming Challenge                                  |
|----------------------------------|--------------------------------------------------------|
| Increased node-level parallelism | Expressing/managing node-level & hybrid parallelism    |
| Diverse target architectures     | Performance portability across systems                 |
| Decreased system reliability     | Resilience/Fault mitigation                            |
| Increased system noise           | Increased need for effective load-balancing strategies |
| Deeper memory hierarchies        | Management of memory hierarchies/locality              |
| Increased system scale           | Increased workflow complexity                          |

Today's extreme-scale architectures come with many programming challenges, as shown here.

New node designs include manycore and multicore processors with numerous hardware threads. GPU nodes offer thousands of effective cores of parallel processing. Both GPUs and CPUs continue to require more and more parallelism to use them efficiently. The enormous compute capability per node and the decreasing amount of memory per unit of compute are also challenging traditional MPI-only decompositions and load balancing techniques.

As a greater variety of CPUs, GPUs, and potentially novel, specialized, or disaggregated hardware becomes available, maintaining portability between systems becomes more and more difficult.

With increased component count comes the potential for reduced reliability. Vendors have done an amazing job of keeping reliability high, but applications still must be prepared with fault mitigation strategies such as fast checkpoint/recovery schemes.
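As a minimal sketch of the checkpoint/recovery idea, the toy loop below saves its state periodically and resumes from the last saved state after a simulated failure. The function names, the pickle-based file format, and the `fail_at` parameter are illustrative assumptions, not any production scheme used at the labs.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # a crash mid-write never corrupts the old checkpoint


def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, checkpoint_every, fail_at=None):
    """Advance a toy simulation, resuming from the checkpoint if one is present."""
    state = load_checkpoint(path, {"step": 0, "value": 0.0})
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")  # hypothetical fault
        state["value"] += 0.5  # stand-in for one time step of physics
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state
```

Calling `run` a second time with the same path after a failure repeats only the steps since the last checkpoint, which is why the checkpoint interval trades I/O cost against lost work.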

There is a potential for increased system noise. Most centers have been able to minimize these effects by dedicating cores or threads to system tasks, but effective load balancing is still a significant challenge for many applications.

Multi-level memory management is becoming the norm with high bandwidth memories and non-volatile memories becoming more popular.

And increased system scale will result in increased workflow complexity and more complex multitasking schemes.

## PSAAP IV will support the following Math & CS topics and more

- Data Analytics for science and engineering applications
- Exploration of advanced HPC architectures
- Programming environments and runtime systems
- Workflow automation
- Productivity and performance portability
- New approaches to engineering
- Algorithms/models
- Microelectronics

This is a list of many of the CS topics of interest for the PSAAP IV program. This list is not exclusive. We will entertain other topics that are in the spirit of advancing extreme-scale HPC. These topics are offered as examples of topics of interest to the Labs.

I will talk about each of these in more detail in the slides to follow. I will be giving examples of prototype work at the National Labs in these areas. This is not to say that these are all solved problems ... because they are not. The examples are intended to give you a feel for the type of work that is needed to solve difficult national security problems.



Data analytics for science and engineering applications is an obvious topic of interest.

The pace of change in AI/ML is so rapid that it is hard to predict what kind of impact we will see on science and engineering applications. For PDE-based simulation codes, which are very common in the NNSA, we have identified at least three levels at which AI/ML techniques can be integrated into our modeling and simulation codes.

First, ML inference might be called "in the loop" one or more times every time step. Such inference might take the place of a more expensive physically-based model.

Second, ML training or inference might be called "on the loop" every 10<sup>3</sup> time steps or so. Such training might attempt to respond to the trajectory of the simulation. Although the cost of such training is amortized over many time steps, it would still have to be fairly low-cost to avoid substantially impacting the overall simulation performance.

Finally, ML training or inference might be called "around the loop" every simulation. This is the mode that is likely least sensitive to training performance.
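The three integration levels can be pictured as three different call sites in a simulation driver. The sketch below is a toy illustration only: `surrogate`, the weight updates, and the cadence parameter `on_loop_every` are hypothetical placeholders rather than any real NNSA code or ML library.

```python
def surrogate(x, weight):
    """Hypothetical cheap learned model standing in for an expensive physics kernel."""
    return weight * x


def run_simulation(n_steps, weight, on_loop_every=1000):
    """Toy time-step loop showing the 'in the loop' and 'on the loop' call sites."""
    state = 1.0
    counts = {"in_loop": 0, "on_loop": 0}
    for step in range(1, n_steps + 1):
        # In the loop: inference every time step, replacing a costly model.
        state += 1e-3 * surrogate(state, weight)
        counts["in_loop"] += 1
        # On the loop: occasional training/inference reacting to the trajectory.
        if step % on_loop_every == 0:
            weight *= 0.99  # placeholder for an incremental training update
            counts["on_loop"] += 1
    return state, weight, counts


def run_ensemble(n_sims, n_steps):
    """'Around the loop': one training/inference call per completed simulation."""
    weight = 1.0
    totals = {"in_loop": 0, "on_loop": 0, "around_loop": 0}
    for _ in range(n_sims):
        final_state, weight, counts = run_simulation(n_steps, weight)
        totals["in_loop"] += counts["in_loop"]
        totals["on_loop"] += counts["on_loop"]
        # Around the loop: update the model using the finished run's result.
        weight = 0.5 * (weight + 1.0 / (1.0 + final_state))  # placeholder
        totals["around_loop"] += 1
    return totals
```

The call counts returned by `run_ensemble` make the cost argument concrete: the in-the-loop site runs orders of magnitude more often than the other two, so its per-call latency dominates.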

For all of these modalities, it will be important to be able to quantify uncertainties.

Some work is currently going on in NNSA to explore offloading some of this training and/or inference to dedicated special-purpose accelerators. The latency imposed by the data motion for the offload will be a key metric in determining whether such accelerators can improve simulation performance.

Understanding how embedded ML impacts simulation performance is another area of active research.



One of the areas of interest is combining or statistically fusing simulation and experimental data. The national laboratories have a stockpile stewardship role, where the goal is to certify nuclear weapons designs without underground testing. Many of the components and subassemblies are individually tested, but some subassemblies and the entire system cannot be tested, and therefore we must simulate and certify these systems using calibrated models.

Unlike conventional machine learning, where large amounts of data are available and it is acceptable to classify or identify objects with accuracy at the 80-90% level, the DOE has high-consequence applications for machine learning that will require five nines of reliability, that is, at most one mistake in 100,000 predictions. We need machine learning approaches that work on far less data, perhaps taking advantage of Generative Adversarial Networks to generate synthetic data to supplement the real data. We need to quantify uncertainty in ML and to know when a machine learning algorithm is interpolating vs. extrapolating. We need a rigorous mathematical model that accompanies the ML to be able to explain the results.
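The "nines" shorthand maps to an error budget by simple arithmetic: a reliability of n nines allows at most one mistake in 10^n predictions. The helper names below are illustrative only.

```python
from fractions import Fraction


def max_error_rate(nines):
    """Largest allowed failure fraction for a reliability of `nines` nines."""
    return Fraction(1, 10 ** nines)


def one_error_in(nines):
    """Express the same bound as 'at most one mistake in N predictions'."""
    return 10 ** nines


for n in (1, 4, 5):
    accuracy = float(1 - max_error_rate(n))
    print(f"{n} nines: accuracy {accuracy:.3%}, one error in {one_error_in(n):,}")
```

Moving from the 80-90% accuracy typical of conventional classifiers to five nines therefore means shrinking the error rate by roughly four orders of magnitude.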

Data management and data curation are also important topics. The diagram here shows a timeline from right to left and indicates data transactions to and from a data store. Data access must be low cost, but we also want to maintain provenance of data and trained models, e.g., for reproducibility.



Co-design of advanced HPC architectures is a topic of interest. As the figure shows, co-design is where evolutionary and revolutionary architectures and applications come together to create new HPC designs. This includes the design of memory, CPU/GPU configurations, and message passing and network protocols. Future HPC systems might also include various specialized hardware such as FPGAs, DSPs, network accelerators, AI accelerators, graph accelerators, etc. The ability to simulate HPC architectures using tools such as the Structural Simulation Toolkit (SST) is important when evaluating architectures. At the same time, verifying and validating models on advanced architecture testbeds with the same or similar system software is necessary. The labs have a number of proxy applications that can be tested on these new architectures. The goal is to predict performance and perform design trade studies without building a full-scale system.

Understanding the performance impact of memory hierarchies, and possibly of disaggregated systems, is also of interest.



The rise of chiplet technology may be an enabling technology for greater availability of specialized hardware.



Composition of programming environments and runtime systems is an important topic of interest. The diagram below shows the library dependency graph of one of the multiphysics codes at LLNL. This gives some idea of the complexity of these applications and of the number of libraries, both lab-developed and external open source, that they rely on. These codes rely on multiple languages and programming models, and any programming environment or runtime system must interoperate in these kinds of complex builds.

The labs also have tens of millions of lines of legacy, validated code, and technologies that can integrate with those legacy code bases are especially welcome.



Portability abstractions that insulate developers from hardware and allow them to write code that performs well on multiple hardware platforms are important to NNSA.

RAJA has proven to be a very effective portability abstraction. This graph compares the performance of test kernels in the RAJA performance suite as implemented in RAJA and compiled with the HIP back end for AMD GPUs against the same kernels implemented in native HIP. The performance is nearly identical for most kernels, producing a ratio of 1.0. Note that similar comparisons for RAJA/CUDA on Nvidia or for Kokkos on AMD or Nvidia would produce very similar results.
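The core idea behind such abstractions is that a kernel body is written once and an execution policy selects the back end. The sketch below illustrates that pattern in plain Python; it is a conceptual analogy only and is not RAJA's or Kokkos's API (both are C++ libraries). The names `forall`, `seq_exec`, `threaded_exec`, and `daxpy` are chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def seq_exec(n, body):
    """Sequential back end: run the loop body for i = 0..n-1 in order."""
    for i in range(n):
        body(i)


def threaded_exec(n, body, workers=4):
    """Threaded back end: the same loop body, dispatched to a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(body, range(n)))  # list() forces completion and surfaces errors


def forall(policy, n, body):
    """Single entry point: the kernel is written once, the policy picks the back end."""
    policy(n, body)


def daxpy(policy, a, x, y):
    """y[i] += a * x[i], expressed once and runnable under any policy."""
    def body(i):
        y[i] += a * x[i]  # each iteration touches a distinct element of y
    forall(policy, len(x), body)
    return y
```

For example, `daxpy(seq_exec, 2.0, x, y)` and `daxpy(threaded_exec, 2.0, x, y)` run the same kernel source under two back ends, which is the property the performance-ratio comparison above is measuring.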



Workflow automation is another topic of interest. Simulation setup is still difficult, particularly since ensemble runs are now the norm and are typically followed by a post-processing step that analyzes the results.

Management of bulk data is another area of interest. NNSA has been developing various data warehouse strategies that allow in-situ passing of data between applications.

The role of containers in HPC is an area of interest. Can containers be used to store not only the simulation executable but also the data, so that you have a history and a provenance associated with the runs?

Interoperability and portability between cloud resources and NNSA HPC centers is of interest, including how compute cloud services contribute to workflows.

Dynamic management of resources is of interest, especially as we contemplate systems with specialized or disaggregated hardware.

Rob Neely will have more to say about workflows later today.



As system scale increases, workflows are becoming more and more complicated. It is now common to see workflows with multiple simulation and modeling tools, frequently operating at multiple fidelities in 1D, 2D, and 3D. Workflows also include ML/AI optimization loops with components that are trained from simulation results and help steer simulation ensembles as they are trained.



El Capitan will feature a near-node local storage architecture called Rabbits. Each Rabbit blade will make direct PCI connections to 8 compute blades and provide 2 TB of SSD non-volatile storage per compute node. The Rabbit SSDs can be used as burst buffers or as low-latency local file systems. Each Rabbit also has its own CPU that can run arbitrary containerized applications such as in-transit analysis. Rabbits are also connected to the high-speed network fabric.



Exascale and post-exascale systems can produce data at such high rates that saving all data for later analysis is difficult or impossible. In-situ or in-transit analysis and visualization, perhaps using dedicated hardware such as Rabbits, is of great interest.



There is a strong overlap between productivity and performance portability on the one hand and programming models and runtime systems on the other, so we've already talked about some of the main interests in this area.

All of our critical mission codes need productivity and performance portability on two axes. Our codes need to run on multiple current systems from laptops to supercomputers, and on chips from multiple vendors. Codes also need to be "future-proof" to run well on future architectures with minimal changes.

As we prepare for El Capitan, we're seeing the need for memory abstractions that can handle both traditional separate GPU/CPU memory spaces as well as the single memory space of the MI300 APU.

We have already discussed abstractions such as Kokkos and RAJA, which hide the complexity of heterogeneous computing systems.

And of course anything that can help design, build, test, and deliver applications into production is of interest.



New approaches to engineering is another topic of interest and includes using machine learning for topological design optimization. Some examples are shown here.

We are currently working on building modular applications for design optimization. Taking advantage of existing numerical methods and physics models accelerates development. Compared to monolithic approaches, it is also much easier to swap out or exchange physics, constraints, or quantities of interest.

## Algorithms/models

- Novel approaches to coupling multiphysics/multiscale
- Algorithms for increasing performance of HPC systems, e.g., latency hiding, reduction of synchronization, utilization of simultaneous execution
- Support for resilience
- Exposing more parallelism at the cost of algorithm efficiency
- Reduced order models and their use in ensemble analysis
- Stochastic algorithms and adaptive algorithms
- Applied math and numerical methods

Lawrence Livermore National Laboratory


Finally, this slide is a catch-all for algorithms and models that are of interest.

Novel approaches to coupling multiphysics at multiple scales are desired.

Algorithms for increasing the performance of HPC systems are desired.

In the cases where Exascale HPC becomes less reliable, resilience and fail-over become important.

As processors and accelerators continue to advance, there is a need to expose even more parallelism, and efficiency is more appropriately measured in wall time than in the number of operations performed.
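A classic illustration of this trade-off is the prefix sum. The sketch below compares a serial scan with the Hillis-Steele scan, which performs more additions in total but needs only a logarithmic number of dependent rounds. It is a plain-Python model that counts operations and rounds; the parallel rounds are simulated sequentially, and the function names are illustrative.

```python
def serial_scan(values):
    """Inclusive prefix sum: n - 1 additions, but also n - 1 dependent steps."""
    out = list(values)
    ops = 0
    for i in range(1, len(out)):
        out[i] += out[i - 1]
        ops += 1
    return out, ops, max(len(out) - 1, 0)


def hillis_steele_scan(values):
    """Inclusive prefix sum in ceil(log2 n) rounds; each round is fully parallel."""
    out = list(values)
    n = len(out)
    ops = 0
    rounds = 0
    stride = 1
    while stride < n:
        # Every element update in this round is independent of the others,
        # so a GPU could execute the whole round as one parallel step.
        prev = out
        out = [prev[i] + prev[i - stride] if i >= stride else prev[i]
               for i in range(n)]
        ops += n - stride
        rounds += 1
        stride *= 2
    return out, ops, rounds
```

For 1024 elements the serial version needs 1023 additions in 1023 dependent steps, while the Hillis-Steele version needs 9217 additions in only 10 rounds: about nine times the work, but potentially far less wall time on hardware with enough parallel lanes.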

Reduced order models and their use in ensemble analysis are needed.

Stochastic algorithms, such as stochastic optimization, are now feasible because of exascale computing.

And finally, we should not forget about applied mathematics research and numerical methods that are specifically adapted to GPUs or other exascale enabling architectures.



The MFEM library is just one example of NNSA efforts to develop numerical methods that are specifically tailored to GPUs. High-order finite elements and matrix-free methods help optimize time to solution on GPU architectures.
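To illustrate what "matrix-free" means, the sketch below applies a simple 1D second-difference operator two ways: by multiplying with a stored matrix and by computing the operator's action on the fly. This is a deliberately simplified finite-difference analogy in plain Python, not MFEM's API and not a high-order finite element method; all function names are illustrative.

```python
def assemble_laplacian(n):
    """Assemble the dense n x n matrix for the 1D operator -u'' (unit spacing)."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 2.0
        if i > 0:
            A[i][i - 1] = -1.0
        if i < n - 1:
            A[i][i + 1] = -1.0
    return A


def apply_assembled(A, u):
    """Matrix-vector product using the stored matrix."""
    return [sum(a_ij * u_j for a_ij, u_j in zip(row, u)) for row in A]


def apply_matrix_free(u):
    """Same operator action computed on the fly; no matrix is ever stored."""
    n = len(u)
    out = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0    # zero boundary value
        right = u[i + 1] if i < n - 1 else 0.0
        out.append(2.0 * u[i] - left - right)
    return out
```

The matrix-free version trades stored matrix entries for recomputation, which tends to suit GPUs because it reduces memory traffic relative to arithmetic.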



We anticipate that the CHIPS Act will support university research in microelectronics. Be aware of potential synergies or opportunities to leverage such research in PSAAP activities.



Please do not overlook opportunities to use and extend the many open-source software resources developed at the NNSA labs. There are many opportunities to collaborate with developers at the labs.

## Thanks to those who helped provide material for these slides

- Anna Pietarila Graham
- David Beckingsale
- Jamie Bramwell
- Bronis de Supinski
- Erik Draeger
- Charles Doutriaux
- John Feddema
- Cyrus Harrison
- Rob Hoekstra
- Tom Stitt
- Judy Hill
- Dan Laney
- Katie Lewis
- Tzanio Kolev
- Tom Scogland
- Galen Shipman


