Cori, the Cray XC40 system that is the latest addition to the National Energy Research Scientific Computing Center’s (NERSC) supercomputing repertoire, is now fully installed and ready to support scientific simulations and data-intensive workflows.
Over the summer Cori’s two phases, which together comprise more than 10,000 compute nodes featuring Intel Haswell and Xeon Phi Knights Landing (KNL) processors, were fully integrated into Berkeley Lab’s new Shyh Wang Hall, which opened just over a year ago. Construction of Shyh Wang Hall, one of the nation’s most energy-efficient supercomputer facilities, was financed by the University of California, with the utility infrastructure and computer systems provided by the U.S. Department of Energy.
NERSC users are now running a variety of science codes on the new supercomputer, which has a peak performance of 30 petaflop/s.
“Cori will provide a tremendous increase in supercomputing capability for our 6,000 users, and we worked closely with Cray and Intel to ensure that scientists could both run large-scale simulations and analyze very large datasets,” said NERSC Director Sudip Dosanjh. “We are very focused on science, and we’re excited about the breakthroughs that Cori will enable.”
Cori was delivered in two phases. Cori Phase 1—also known as the Data Partition—was installed in late 2015 and comprises 12 cabinets and more than 1,600 Haswell compute nodes. It was customized to support data-intensive science and the analysis of large datasets through a combination of hardware and software configurations and queue policies. A NERSC innovation called Shifter, an open-source software tool based on Docker containers, is one such piece of software that enables users to more easily analyze datasets from experimental facilities at NERSC.
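In practice, running a containerized analysis under Shifter looks much like an ordinary batch job: the Docker image is first pulled into the system's image gateway, then referenced from the job script. A minimal sketch, where the image name, script name, and resource values are hypothetical:

```shell
#!/bin/bash
# Sketch of a Shifter batch job (image and script names are hypothetical).
# The image would first be pulled into the system registry with:
#   shifterimg pull docker:myproject/analysis:latest
#SBATCH --image=docker:myproject/analysis:latest
#SBATCH --nodes=1
#SBATCH --time=00:30:00

# srun launches each task inside the container environment
srun shifter python3 analyze_dataset.py
```

Because the container carries its own user environment, the same analysis stack used at an experimental facility can be run unchanged on Cori's Data Partition.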
Cori Phase 2, installed in mid-2016, added another 52 cabinets and more than 9,300 KNL compute nodes, making Cori the largest supercomputing system for open science based on KNL processors. The two phases of Cori are integrated via the Cray Aries interconnect, which has a dragonfly network topology that provides scalable bandwidth without expensive external switches.
“Cori is an enormous system,” said Brian Austin, a member of NERSC’s Advanced Technologies Group who has been instrumental in the deployment of Cori. “What is really differentiating are the KNL processors, which have three distinctive features: very long vectors, high-bandwidth memory and the energy-efficient manycore architecture.”
More Storage, Better I/O
Beyond its size and speed, Cori incorporates a number of features intended to support the increasingly data-intensive workflows of NERSC’s users. For example, supercomputer users are always looking for more storage and better I/O performance. Toward this end, Cori features a Burst Buffer based on the Cray DataWarp technology. The Burst Buffer, a 1.5 PB layer of NVRAM storage that provides approximately 1.5 TB/sec of I/O bandwidth, sits between compute node memory and the Lustre parallel file system. It is designed to improve application I/O by handling spikes in I/O bandwidth requirements so that the parallel file system can be configured for capacity.
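With Cray DataWarp, a job requests Burst Buffer space through `#DW` directives in its batch script, and the allocation's mount point is exposed to the application through an environment variable. A minimal sketch of a per-job scratch allocation, with hypothetical sizes and executable name:

```shell
#!/bin/bash
#SBATCH --nodes=64
#SBATCH --time=02:00:00
#DW jobdw capacity=1TB access_mode=striped type=scratch

# DataWarp exports the mount point of the striped allocation;
# the application directs its bursty I/O there instead of at Lustre.
srun ./my_simulation --output-dir "$DW_JOB_STRIPED"
```

The job's heavy checkpoint traffic lands on the NVRAM tier, and data can be staged out to the parallel file system asynchronously.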
In addition, Cori’s Lustre scratch file system provides more than 700 GB/s of peak bandwidth and 30 PB of disk capacity, compared to about 150 GB/s and 7 PB on Edison, a Cray XC30 based on Intel Xeon Ivy Bridge processors. NERSC has also added software-defined networking features to Cori to more efficiently move data in and out of the system, giving users end-to-end connectivity and bandwidth for real-time data analysis.
“Our goal is to allow the network to become a schedulable resource, which would enable jobs and devices to schedule time on the computer and bandwidth on the network at the same time, and then run in the allocated time-slot,” said Jason Lee, a network engineer at NERSC. “This would free up engineers from having to manually set up the network for experiments.”
Another unique Cori feature is the real-time queue for time-sensitive analyses of data. Users can request a small number of on-demand nodes if their jobs have special needs that cannot be accommodated through the regular batch system.
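On the scheduler side, such a request is expressed as a quality-of-service flag on the batch job. The sketch below assumes Slurm and a QOS named `realtime`, consistent with NERSC's description of the queue; the executable and resource values are hypothetical:

```shell
#!/bin/bash
#SBATCH --qos=realtime
#SBATCH --nodes=2
#SBATCH --time=00:15:00

# Time-sensitive analysis launched against the on-demand nodes
srun ./analyze_beamline_data
```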
“There are a bunch of different ways of running jobs that we haven’t done before on our supercomputers, and the real-time queue is one of them,” said Tina Declerck, a computer system engineer at NERSC who is the lead on Cori. “A real-time queue is actually a challenge at NERSC because we don’t have idle cycles, ever. But we can arrange to have some nodes always available.”
On the Road to Exascale
For the last several months, Cori users have been road-testing these features and optimizing their codes in preparation for the addition of the KNL nodes via the NERSC Exascale Science Applications Program (NESAP). Through NESAP, NERSC partners with code teams and library and tool developers to prepare their codes for the system’s manycore architecture. Results from these initial application case studies highlight the success of NESAP in helping users optimize their codes for Cori, with significant speedups being reported.
“NERSC staff have been working with more than 20 teams for the past two years to prepare codes for Cori via the NESAP program, and NERSC has built up a team of expert performance engineers to work with and lead these collaborations,” said Jack Deslippe, acting group lead for NERSC’s Application Performance Group. “The effort is paying off with a number of early success stories porting applications to the KNL architecture.”
Among the many “lessons learned” through NESAP is the importance of taking advantage of the various Knights Landing hardware features: manycores, high-bandwidth memory (MCDRAM) and wide vector processing units. “For example, with KNL processors on Cori, you are looking at cores that can compute 32 FLOPs each cycle, on multiple vector processing units,” Deslippe said. “This is motivating developers to look at their code at a much deeper level to make sure they are exploiting the many levels of parallelism available in the system.”
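The 32-FLOPs-per-cycle figure follows directly from the KNL vector hardware: each core has two 512-bit vector processing units, each 512-bit register holds eight double-precision values, and a fused multiply-add counts as two floating-point operations. A quick sanity check of that arithmetic:

```python
# Peak double-precision FLOPs per cycle for a single KNL core
vpus = 2             # two AVX-512 vector processing units per core
lanes = 512 // 64    # 8 double-precision lanes per 512-bit register
fma_ops = 2          # a fused multiply-add counts as 2 FLOPs

flops_per_cycle = vpus * lanes * fma_ops
print(flops_per_cycle)  # 32, matching the figure quoted above
```

Reaching that peak requires the compiler (or the programmer) to keep both vector units busy with full-width, FMA-heavy loops, which is why vectorization is such a large part of the NESAP optimization work.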
Additionally, having MCDRAM right on the chip provides opportunities to accelerate the significant fraction of the NERSC workload that is sensitive to memory bandwidth. Codes in this category that are able to use the MCDRAM effectively are seeing performance boosts of up to a factor of 3, Deslippe noted.
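One common way to place data in MCDRAM is to configure KNL in “flat” mode, where the 16 GB of MCDRAM appears as a separate NUMA node, and then bind the application's memory to it. The sketch below assumes MCDRAM is exposed as NUMA node 1 (a typical arrangement in flat mode, though the exact numbering depends on the cluster configuration) and uses a hypothetical executable name:

```shell
# Flat mode: DDR is NUMA node 0, MCDRAM is NUMA node 1 (assumed layout).
# --preferred falls back to DDR once the 16 GB of MCDRAM is exhausted,
# whereas --membind would fail allocations instead.
srun numactl --preferred=1 ./my_bandwidth_bound_app
```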
The combined Cori system is the first to be specifically designed to handle the full spectrum of computational needs of DOE researchers, as well as emerging needs in which data- and compute-intensive work are part of a single workflow—a model that will be important in the coming exascale era, Dosanjh emphasized.
“We expect to see many aspects of Cori in an exascale computer, including dramatically more concurrency and on-package memory,” he said. “The response from our users has been overwhelming—they recognize that Cori is allowing them to do science that can’t be done on existing supercomputers.”
“We started the NERSC-8 Cori project way back in early 2012,” said Katie Antypas, head of NERSC’s Scientific Computing and Data Services Department. “It’s been a long road, and we are incredibly energized to have both phases of Cori on the floor at NERSC, integrated and ready to run real science applications.”