Tackling a Trillion

This simulation of the universe was done with NyX, a large-scale code for massively parallel machines. Image courtesy of Prabhat and Burlen Loring, LBNL.
To study the development of the universe or plasma physics, scientists run into what could be the biggest big-data problem: simulations based on trillions of particles. With every step forward in time, the computer model pumps out tens of terabytes of data.

To run and analyze these huge calculations, a team of scientists has developed BD-CATS, a program to run trillion-particle problems from start to finish on the most powerful supercomputers in the United States.

Each particle in a cosmology simulation, for example, represents a quantity of dark matter. As the model progresses through time, dense dark matter regions exert a gravitational pull on surrounding matter, pulling more of it into the dense region and forming clumps.

As Prabhat – group leader for Data and Analytics Services at the Department of Energy’s National Energy Research Scientific Computing Center (NERSC) – says, “If we got the physics right, we start seeing structures in the universe, like clusters of galaxies which form in the densest regions of dark matter. The more particles that you add, the more accurate the results are.”

Prabhat is part of a team presenting trillion-particle simulation results Tuesday, Nov. 17, at the SC15 supercomputing conference in Austin, Texas.

Plasma physicists, meanwhile, can simulate such natural phenomena as the interaction of charged particles from the sun with Earth’s magnetic field, leading to such familiar sights as the northern lights. As in the cosmology simulations, the more particles modeled, the more accurate the results should be.

Science is rife with simulations that generate large data sets, including climate, groundwater movement and molecular dynamics, says Suren Byna, a computer scientist in the Scientific Data Management group at Lawrence Berkeley National Laboratory (LBNL). These problems might require tracking only billions of particles today, but data analytics that work at the leading edge – as in cosmology and plasma physics – will improve future simulations across many domains.

The system that made this possible, BD-CATS, emerged from a collaboration between three other projects.

First, MANTISSA – funded by the Department of Energy’s Advanced Scientific Computing Research (ASCR) program – “formulates and implements highly scalable algorithms for big-data analytics,” Prabhat says. Scalable algorithms generally run faster as the number of processors they’re on increases. Clustering, which sorts data into related groups, is a major challenge for the kind of large data sets MANTISSA faces.

Second, for fast data input and output (I/O), BD-CATS uses Byna’s work on ExaHDF5, a version of hierarchical data format version 5 (HDF5), a file format to organize and store large amounts of information. ExaHDF5 efficiently scales the original program, making it run well even for files that are tens of terabytes in size.

Third is Intel Labs’ project to parallelize and optimize big-data analytics, making them run efficiently on a large number of computer processors. Md. Mostofa Ali Patwary, a research scientist in Intel’s Parallel Computing Lab, addressed clustering with a scalable implementation of DBSCAN, a longstanding algorithm designed for that purpose. Patwary and his colleagues balanced the analysis across supercomputer nodes to reduce time spent on communication.

How a system creates and handles clusters has a real impact on scientific results.

As Patwary explains, the projects combine to create an end-to-end workflow for trillion-particle simulations. BD-CATS can load the data from parallel file systems (which spread data across multiple storage sites), create data structures on individual computer nodes, perform clustering and store results.

I/O is a particular challenge for BD-CATS. The ExaHDF5 team has developed techniques to make writing and reading massive HDF5 files on parallel file systems at large scales more efficient. In cosmology and plasma-physics simulations, a single time step can create about 30 terabytes of data in a single file. HDF5 can move such files at nearly the peak capability of parallel file systems on Edison, NERSC’s Cray XC30 supercomputer.

With all of the data loaded, the team needed to partition it in a way that makes searching through it easier. That requires putting related data together. Imagine for instance, Prabhat says, “that you have all of the U.S. citizens in a spreadsheet, and you want to put all of the people who live in California next to each other, and all of the people in Washington next to each other and so on.”

In a plasma-physics simulation, this geometric positioning might apply to charged particles instead of people. The program uses a sampling-based median computation technique to geometrically partition the particles among supercomputer nodes.

After that, BD-CATS computes at each node a structure called a k-d tree, which arranges the particles in k-dimensional space. (The letter k stands in for the number of dimensions in the data space, which differs for each problem.) Using the people analogy, this tree creates a map of who lives together in California, and then who lives in the same county, followed by neighborhood and so on down to the desired level of granularity.

“Now we do the clustering using nearest-neighbor computation,” Patwary says. “We look at the k-d tree and get neighbors and grow regions until we get the desired cluster.”

How a system creates and handles clusters has a real impact on scientific results. For example, picture data in the shape of two clusters of points connected by a thin line of points. If the algorithm sees it as a single big cluster and computes the data’s center point, it will probably lie in neither of the real clusters. BD-CATS ignores the bridge and distinguishes between the two separate data clusters.

In a cosmology simulation, the dark matter clusters provide information on how the universe formed. To test the accuracy of the theory behind a simulation, scientists compare the number and size of clusters it shows with real astronomical data.

“If the physics of the simulation is accurate and your clustering algorithm did a good job, you will see a certain number of big clusters of dark matter, which correspond to the dense clusters of galaxies we can see in astronomical data,” Prabhat says. On the other hand, if the algorithm systematically produces mistakes – like combining clusters – that artificially inflates the clusters’ size and scientists risk misunderstanding the accuracy of the theory behind their simulation.

With all of BD-CATS’s parts operating, the team ran (among other examples) a plasma-physics simulation of 1.4 trillion particles. The entire process took about 30 minutes on roughly 100,000 Edison processor cores.

This, Patwary says, is “the first end-to-end clustering system that deals with a trillion particles and tens of terabytes of data.” In fact, he adds, it can handle simulations that are two orders of magnitude larger than anything attempted before.

Even beyond the technical challenges BD-CATS surmounts, Prabhat sees other huge victories. “This demonstrates a successful collaboration between several groups: algorithm developers and code optimization experts at Intel, applied math researchers at MANTISSA, and computer science researchers working on ExaHDF5. And this has successfully solved one of DOE’s leading big-data analytics problems on NERSC platforms.”

Besides Patwary, Byna and Prabhat, other collaborators on the SC15 paper include Nadathur Rajagopalan Satish, Narayanan Sundaram, Michael J. Anderson and Pradeep Dubey of Intel Labs; LBNL’s Zarija Lukic and Yushu Yao; and Vadim Roytershteyn of the Space Science Institute.