SOLVING THE SCALING PROBLEM

EXASCALE COMPUTING

The Department of Energy's Los Alamos National Laboratory is operating one of the largest supercomputers on the planet.

Named Trinity, it boasts some impressive specifications that enable it to fulfil its NNSA mission mandate: ensuring the United States' nuclear stockpile is safe, reliable, and secure.

It does this through massively parallel nuclear simulations of ever greater geometric and physical fidelity, which in turn requires solving a range of problems that only arise when building computing systems capable of working at the required magnitude and scale.

Suffice it to say, whether simulating nuclear stockpiles or evaluating climate models, computing at this scale is challenging and very expensive.

OPERATING COSTS

Trinity runs almost 20,000 nodes with 2PB of DRAM, 4PB of flash and 100PB of disk across the cluster. Newer systems like Crossroads, currently in development, are even larger.

Systems like these draw 10 to 40MW of power, need 50kW to 250kW of cooling per rack, and the machines themselves can cost up to $250M per cluster to build.
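To put those numbers in rough perspective, here is a back-of-the-envelope sketch in Python using only the figures quoted above; the per-node averages simply divide the cluster totals by the node count, and the $0.10/kWh electricity price is an assumed illustrative figure, not an actual LANL tariff.

    # Rough, illustrative arithmetic for a Trinity-class cluster,
    # using the approximate figures quoted in this article.
    NODES = 20_000        # "almost 20,000 nodes"
    DRAM_PB = 2           # 2PB of DRAM across the cluster
    FLASH_PB = 4          # 4PB of flash
    DISK_PB = 100         # 100PB of disk
    POWER_MW = 10         # lower end of the 10 to 40MW range

    # Average resources per node (1 PB = 1,000,000 GB, decimal units).
    print(f"DRAM per node:  {DRAM_PB * 1_000_000 / NODES:,.0f} GB")
    print(f"Flash per node: {FLASH_PB * 1_000_000 / NODES:,.0f} GB")
    print(f"Disk per node:  {DISK_PB * 1_000_000 / NODES:,.0f} GB")

    # Rough annual energy use and cost at the assumed $0.10/kWh.
    kwh_per_year = POWER_MW * 1_000 * 24 * 365
    print(f"Energy per year: {kwh_per_year:,.0f} kWh")
    print(f"Cost per year:   ${kwh_per_year * 0.10:,.0f}")

Even at the bottom of that power range, the electricity alone runs to millions of dollars a year before any hardware or staffing costs are counted.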

Shown here is installation work on the water cooling system for Trinity. Despite the efficient use of water from LANL's Sanitary Effluent Reclamation Facility, it still takes megawatts of power to keep the machine cool when in operation.

Exascale clusters planned for the future are even larger and potentially more power hungry. Notwithstanding efforts to improve operational efficiency, cutting-edge systems like these will always require substantial resources. The power and cooling requirements across a large facility like LANL are already enormous, demanding a considerable amount of dedicated infrastructure to support them. Consequently, solutions that optimise the use of these large computing resources are critical.

CLUSTER RESEARCH

Building ever larger, highly parallel supercomputers involves working through a vast array of new design trade-offs.

Every design detail, from network interconnects to storage architectures and processing pipelines, needs to be considered.

A fundamental problem is that what we know works at one scale may not work well at larger scales, and building a $250M machine just to find out is not really an option.

Cluster simulations can help to some extent, but in many cases real-world issues limit their effectiveness.

What's really needed is a low-cost development platform on which to research design options and prototype new ideas without the expense of building and running a full-scale HPC cluster to do the research.

CODE READY DESIGN

Gary Grider, High Performance Computing Division Leader at Los Alamos National Laboratory, also recognized that a key challenge facing designers working towards exascale is being code-ready.

Without operational systems software, nothing will work and applications can't get the science done; yet applications rely on stable systems software, which needs to be developed for the cluster in the first place. A classic Catch-22.

Given the high cost of running production HPC clusters, it's not practical to use them to develop systems software, and these machines are rarely available anyway because they're usually running existing application software 24x7.

Earlier-generation clusters may be available at a given facility for systems software development, but they are also expensive to run and their architecture may not represent current and future challenges to the software stack.

Gary identified that the key to solving these problems is to give R&D a way to design systems software that scales well when working on solutions intended to run on 100k+ nodes. "You simply don't get the opportunity to use a large supercomputer for weeks to months at a time to try out things like scalable boot, launch, monitoring, I/O forwarding etc."
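As a toy illustration of why things like job launch become research problems at this scale (and not a description of any actual LANL tooling), the sketch below compares a naive serial launch with a k-ary tree fan-out on a hypothetical 100,000-node machine; the fan-out and per-hop latency are assumed figures chosen purely for comparison.

    import math

    # Toy model of job launch at the "100k+ nodes" scale discussed above.
    # All figures are illustrative assumptions, not measurements.
    NODES = 100_000    # hypothetical machine size
    FANOUT = 32        # hypothetical number of children each node launches
    HOP_MS = 5         # assumed cost, in ms, to start a process on one node

    # Naive approach: one head node starts every compute node in turn.
    serial_ms = NODES * HOP_MS

    # Tree fan-out: the head node starts FANOUT children, each of which
    # starts FANOUT more, and so on. If each parent issues its launches
    # sequentially, the critical path is roughly depth * FANOUT * HOP_MS.
    depth = math.ceil(math.log(NODES, FANOUT))
    tree_ms = depth * FANOUT * HOP_MS

    print(f"serial launch: {serial_ms / 1000:7.1f} s")
    print(f"tree fan-out : {tree_ms / 1000:7.1f} s (depth {depth})")

Even in this crude model the difference is minutes versus a fraction of a second, which is why scalable boot, launch and monitoring strategies need somewhere realistic to be prototyped.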

THE MODEL SOLUTION

The proposed solution is to model the problem at production scale on much lower-cost hardware while remaining fully software compatible from a systems perspective.
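As a trivial example of what "software compatible from a systems perspective" can look like in practice, the hypothetical snippet below (assuming an MPI implementation and mpi4py are installed) runs unchanged whether it is launched across a rack of low-cost model-cluster nodes or a production machine; only the node count and the site-specific launch command change.

    # hello_cluster.py -- a minimal MPI check that runs unchanged on a
    # low-cost model cluster or a production HPC system, assuming an
    # MPI stack and mpi4py are installed on both.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()           # this process's id within the job
    size = comm.Get_size()           # total number of processes launched
    node = MPI.Get_processor_name()  # hostname of the node running this rank

    print(f"rank {rank} of {size} on {node}")

    # Example launch (process count and hostfile are site-specific):
    #   mpirun -np 64 --hostfile nodes.txt python3 hello_cluster.py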

Gary said "We thought Raspberry Pi was potentially the answer but there were no good cluster packaging technologies."

"SICORP helped us find a potential solution and we jointly worked with BitScope to develop the first unit and proceeded to get 5 units to try at towards 1000 node scale."

"Subsequently other potential uses like simulation of thousands or tens of thousands of IOT devices and other similar applications have expressed interest."

To learn more, see the press conference presented by Gary Grider and Bruce Tulloch (CEO, BitScope) at SC17, where they announced details of the 3000-core pilot system for the "model cluster" project at Los Alamos National Laboratory and explained how it came about.