As home to three top-ranked supercomputers of the last decade, the US Department of Energy’s (DOE’s) Oak Ridge National Laboratory (ORNL) has become synonymous with scientific computing at the largest scales.
Getting the most out of these science machines, however, requires a willingness to experiment with problems and systems of every size and scale. This is especially important as technology vendors introduce new system architectures and as scientists’ problem-solving toolkit expands to include artificial intelligence (AI) and advanced data analysis.
In that spirit, ORNL recently installed two NVIDIA DGX-2 systems, powerful GPU-accelerated appliances that will provide ORNL researchers with enhanced opportunities to conduct science—machine learning and data-intensive workloads in particular. The appliances will also provide an onramp to ORNL’s Summit—the world’s most powerful supercomputer—by enabling smaller and more experimental projects to be developed and tested before running on the 200-petaflop machine. The DGX-2 appliances reside in the laboratory’s Compute and Data Environment for Science (CADES), which offers compute and data services for ORNL researchers.
“As Summit enters production, these DGX-2 systems supply ORNL with exploratory multipurpose computing resources,” said CADES director Arjun Shankar. “Early results suggest the DGX-2s will provide novel opportunities in data analysis, machine learning, and modeling and simulation that support the AI-driven transformation that is changing how science is conducted.”
The DGX-2 represents the latest step-change in AI appliances, housing 16 fully interconnected NVIDIA Tesla V100 GPUs with increased GPU memory, a powerful combination that expands the types of problems scientists can tackle in a unified environment. In addition to a standard DGX-2, ORNL received the newly available DGX-2H, which contains upgraded CPUs and faster-clocked GPUs that offer higher performance.
Since NVIDIA debuted the DGX line in 2016, ORNL has deployed the appliances throughout the laboratory to give researchers a platform that excels at machine learning techniques capable of automating some of the time-intensive analysis inherent in research. This is especially relevant to ORNL’s world-class experimental facilities, such as the Spallation Neutron Source, which produce large, unique datasets in need of analysis and automated data workflows.
Appliance for Science
In late 2018, Arvind Ramanathan, a staff scientist in ORNL’s computer science and engineering division, and his team became one of the first groups to get extended time on the DGX-2s. The team used the opportunity to train and optimize algorithms that belong to a class of machine learning called reinforcement learning, in which an “agent” attempts to master its environment by performing actions and evaluating the results without any preexisting knowledge.
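The core loop of reinforcement learning described above can be sketched in a few lines. The toy corridor environment, reward values, and hyperparameters below are illustrative assumptions, not the team's actual workload; the agent starts with no knowledge (an all-zero value table) and learns purely by acting and observing rewards:

```python
import random

random.seed(0)  # reproducible illustration

N_STATES = 6          # corridor cells 0..5; reaching cell 5 yields a reward
ACTIONS = [-1, +1]    # step left or step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

# Q-table: the agent's estimate of future reward for each (state, action) pair,
# initialized to zero -- i.e., no preexisting knowledge of the environment
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def choose_action(state):
    """Epsilon-greedy: mostly exploit the best-known action, sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

for episode in range(200):
    state = 0
    while state != N_STATES - 1:
        action = choose_action(state)
        nxt = min(max(state + action, 0), N_STATES - 1)   # clamp to corridor
        reward = 1.0 if nxt == N_STATES - 1 else 0.0
        # Temporal-difference update: nudge the estimate toward the observed
        # reward plus the discounted value of the best next action
        best_next = max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = nxt

# The learned policy: the best action in each non-terminal cell
policy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)]
```

After training, the agent has learned to step right from every cell, having discovered the goal entirely through trial and error rather than prior instruction.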
Reinforcement-learning algorithms, famously showcased by Google’s AlphaGo program, have proven capable of achieving prescribed goals, such as winning games, but optimizing the preset parameters that control their decision-making can be difficult. Running multiple algorithms simultaneously on the DGX-2 systems allowed Ramanathan’s team to identify superior optimization strategies via an ORNL-developed software package called HyperSpace in a fraction of the time it would have taken on another system.
“We couldn’t have done this without a DGX-2 because the problem space that we were exploring was so large and sample inefficient,” Ramanathan said. “Because these GPUs can essentially be used in a unified way, we can do things that are much more difficult to do on other systems, especially in terms of moving data and doing analysis.”
Though ORNL is known for conducting leadership-scale science on its massively parallel supercomputers, there are instances when an innovative smaller machine can be useful. Refining algorithms on the DGX-2 can improve researchers’ confidence that their AI software is ready to be deployed at scale later on. Additionally, workloads that may be poorly suited to run on a supercomputer—jobs that don’t scale or jobs that need to run for extended periods of time, for example—could be carried out on a DGX-2 appliance.
The DGX-2s also have something to offer traditional modeling and simulation. Researchers can run simulations side-by-side with AI to extend simulations further than they would otherwise go, using AI-recognized patterns in the data to “steer” the simulation correctly. A project supported by ORNL’s Laboratory Directed Research and Development program is dedicated to a molecular dynamics framework called Molecules that can execute AI-informed simulation.
“Traditionally, running AI side-by-side with simulation would be too expensive,” Ramanathan said, “but state-of-the-art systems like Summit and the DGX-2 enable this in such a way that we can think of this arrangement as a fused workflow in some sense.”
Currently, CADES staff are working to integrate the appliances into the datacenter’s shared environment so researchers can submit jobs to them as easily as to any other CADES resource. The two DGX-2 systems have been connected by a dedicated EDR InfiniBand network to combine the systems’ capabilities.
“The idea is that researchers will be able to schedule up to 32 GPUs at one time to run in parallel,” said CADES team lead Brian Zachary.
HyperSpace software development is part of the CANcer Distributed Learning Environment (CANDLE) project, a cancer research effort supported by the Exascale Computing Project.