Supercomputing Pipeline Aids DESI’s Quest to Create 3D Map of the Universe

ESnet and NERSC Resources Enable Near-Real-Time Data Processing

July 20, 2020
By Kathy Kincade

Contact: cscomms@lbl.gov

A view of the Mayall Telescope (tallest structure) and the Kitt Peak National Observatory site near Tucson, Arizona. The Dark Energy Spectroscopic Instrument is housed within the Mayall dome. (Image: Marilyn Sargent/Berkeley Lab)

As neuroscientists work to better understand the complex inner workings of the brain, a focus of their efforts lies in reimagining and reinventing one of their most basic research tools: the microscope. Likewise, as astrophysicists and cosmologists strive to gain new insights into the universe and its origins, they are eager to observe farther, faster, and with increasing detail via enhancements to their primary instrument: the telescope.

In each case, to unravel scientific mysteries that are either too big or too small to see with a physical instrument alone, they must work in conjunction with yet another critical piece of equipment: the computer. This means more data and increasingly complex datasets, which in turn impacts how quickly scientists can sift through these datasets to find the most relevant clues about where their research should go next.

Fortunately, being able to do this sort of data collection and processing in near real time is becoming a reality for projects like the Dark Energy Spectroscopic Instrument (DESI), a multi-facility collaboration led by Lawrence Berkeley National Laboratory whose goal is to produce the largest 3D map of the universe ever created. Installed on the Mayall Telescope at Kitt Peak National Observatory near Tucson, Arizona, DESI is bringing high-speed automation, high-performance computing, and high-speed networking to its five-year galaxy-mapping mission, capturing light from 35 million galaxies and 2.4 million quasars and transmitting that data to the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy user facility based at Berkeley Lab that serves as DESI’s primary computing center.

“We turn the raw data into useful data,” said Stephen Bailey, a physicist at Berkeley Lab who is the technical lead and manager of the DESI data systems. “The raw data coming off the telescope isn’t the map, so we have to take that data, calibrate it, process it, and turn it into a 3D map that the scientists within the broader collaboration (some 600 worldwide) use for their analyses.”

Over the last several years the DESI team has been using NERSC to build catalogues of the most interesting observational targets, modeling the shapes and colors of more than 1.6 billion individual galaxies detected in 4.3 million images collected by three large-scale sky surveys. The resulting DESI Legacy Imaging Surveys, hosted at NERSC, have generated their catalogues there over the course of eight data releases. The DESI project also leverages the Cosmology Data Repository hosted at NERSC, which contains about 900 TB of data, as well as NERSC’s Community File System, scratch, and HPSS storage systems.

“The previous big survey was a few million objects, but now we are going up to 35-50 million objects,” Bailey said. “It’s a big step forward in the size of the map and the science you can do with it.”

But storage is only one part of the service NERSC delivers for DESI. The supercomputing center has also been instrumental in developing and supporting DESI’s data processing pipeline, which moves data from the surveys to the computing center and out to users. The project uses 10 dedicated nodes on the Cori supercomputer, allowing the pipeline to run throughout each night of the survey and ensuring that results are available to users by morning for same-day analysis, often informing the next night’s observation plan. The DESI team also uses hundreds of nodes for other processing and expects to scale to thousands of nodes as the dataset grows. To move data in and out, DESI depends on the NERSC data transfer nodes, which are managed jointly by ESnet and NERSC to enable high-performance data movement over ESnet’s high-bandwidth 100 Gbps wide-area network.

“DESI is using the full NERSC ecosystem: computing services, storage, the real-time queue, and real-time data transfer,” Bailey said. “It’s a real game changer for being able to keep up with the data.”
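In broad strokes, the nightly flow Bailey describes can be pictured as a watch-and-submit loop: raw exposures land at NERSC as they are transferred from the telescope, and each new one is handed to the dedicated real-time nodes for processing so results are ready by morning. The Python sketch below is purely illustrative rather than DESI’s actual pipeline code; the directory path, the process_exposure.py script, and the Slurm realtime queue settings are hypothetical stand-ins.

```python
"""Illustrative sketch only (not DESI's pipeline code): watch a landing
directory for newly transferred raw exposures and submit each one to a
set of dedicated real-time nodes via Slurm, so results are processed
through the night. Paths, the QOS name, and process_exposure.py are
hypothetical placeholders."""

import subprocess
import time
from pathlib import Path

RAW_DIR = Path("/global/cfs/desi/raw")   # hypothetical landing area for transferred exposures
submitted = set()                        # exposures already dispatched this night

def submit(exposure: Path) -> None:
    """Submit one exposure to the dedicated nodes through the real-time queue."""
    subprocess.run(
        ["sbatch", "--qos=realtime", "--nodes=1",
         "--wrap", f"python process_exposure.py {exposure}"],
        check=True,
    )

while True:
    for exposure in sorted(RAW_DIR.glob("expid-*")):   # new raw data from the telescope
        if exposure.name not in submitted:
            submit(exposure)
            submitted.add(exposure.name)
    time.sleep(60)   # poll roughly once a minute throughout the night
```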

Optimizing Python for CPUs and GPUs

While gearing up for the five-year DESI survey, which is expected to begin in late 2020, NERSC worked with the DESI team to identify the most computationally intensive parts of the data processing pipeline and implement changes to speed them up. Through the NERSC Exascale Science Applications Program (NESAP), Laurie Stephey, then a postdoctoral researcher and now a data analytics engineer at NERSC, began examining the code.

The pipeline is written almost exclusively in Python – a specialty of Stephey’s – which enables domain scientists to write readable and maintainable scientific code in a relatively short amount of time. Stephey’s goal was to improve the pipeline’s performance while satisfying the DESI team’s requirement that the software remain in Python. The challenge, she explained, was in staying true to the original code while finding new and efficient ways to speed its performance.

“It was my job to keep their code readable and maintainable and to speed it up on the Cori supercomputer’s KNL manycore architecture,” Stephey said. “In the end, we increased their processing throughput 5 to 7 times, which was a big accomplishment – bigger than I’d expected.” This means that something that previously took up to 48 hours now happens overnight, thus enabling analysis during the day and feedback to the following night’s observations, Bailey noted. It also saves the DESI project tens of millions of compute hours at NERSC annually.
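The article does not spell out which changes produced that speedup, but a common way to accelerate numerical Python on a manycore CPU such as KNL, without leaving Python, is to compile the hot inner loops just in time while keeping a readable NumPy version as the reference. The toy example below shows that general pattern using Numba; the weighted_projection function is a made-up stand-in for a spectroscopic extraction kernel, not DESI code.

```python
"""Toy example of the general optimization pattern (not the actual DESI
kernels): keep the algorithm in readable Python/NumPy and recover
performance on a manycore CPU by just-in-time compiling the hot loop,
here with Numba as one possible tool."""

import numpy as np
from numba import njit, prange

@njit(parallel=True, fastmath=True)
def weighted_projection(flux, weights, profiles):
    """Accumulate weighted per-spectrum profiles -- a stand-in for the kind
    of inner loop that dominates spectroscopic extraction."""
    nspec, npix = profiles.shape
    out = np.zeros(npix)
    for i in prange(nspec):
        out += flux[i] * weights[i] * profiles[i]   # whole-array reduction across spectra
    return out

def weighted_projection_numpy(flux, weights, profiles):
    """Readable NumPy reference implementation, kept for maintainability."""
    return (flux * weights) @ profiles
```

Keeping the plain NumPy version alongside the compiled one reflects the constraint Stephey describes: the domain scientists can still read, check, and maintain the algorithm in familiar Python.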

"New experiments funded by DOE approach NERSC for support all the time," said Rollin Thomas who runs NESAP for Data. "And experiments that already use NERSC are capitalizing on our diverse capabilities to do new and exciting things with data. DESI's sustained engagement with NERSC, through NESAP for Data, the Superfacility initiative and so on, is a model for other experiments. What we learn from these engagements helps us serve the broader experimental and observational data science community better.

And the optimization effort isn’t over yet. The next challenge is to make the DESI code compatible with the GPUs in NERSC’s Perlmutter system, which is slated to arrive in late 2020. Bailey and Stephey began this process last year – “Stephen was instrumental in rewriting the algorithm in a GPU-friendly way,” Stephey noted – but in April NERSC hired one of its newest NESAP postdocs, Daniel Margala, to take over. As a graduate student, Margala had previously worked with Bailey on the Baryon Oscillation Spectroscopic Survey, a DESI predecessor project, “so I’m familiar with a lot of the data processing that needs to be done for DESI,” he said.

So far, Margala’s focus is on preparing DESI’s code for GPUs so that it will be ready to leverage the full potential of the Perlmutter system. He is currently working with a small subset of DESI data on Cori’s GPU testbed nodes; the long-term goal is to make sure the software is ready to handle DESI’s entire five-year dataset.

“The astrophysicists and scientists on DESI are pretty comfortable using Python, so we are trying to do all of this in Python so that they will be able to understand the code we are writing and learn from it, contribute back to it, and maintain it going forward,” Margala said.
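One common pattern for keeping a GPU port in Python, sketched below, is to write the numerical core once against an array backend that resolves to NumPy on CPU-only nodes and to a GPU array library such as CuPy when an accelerator is available. Whether this is exactly the approach the DESI team has settled on is not stated here; the solve_extraction function and its arguments are illustrative placeholders.

```python
"""Hedged sketch of one common CPU/GPU porting pattern in Python (not
necessarily DESI's chosen approach): select CuPy as a drop-in array
backend when a GPU is present, otherwise fall back to NumPy."""

import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()   # raises if no usable GPU is found
    xp = cp
except Exception:
    xp = np                            # CPU-only fallback, e.g. on nodes without GPUs

def solve_extraction(design, ivar, pixels):
    """Weighted least-squares solve written once against the `xp` backend.
    The names here are placeholders, not DESI's actual API."""
    A = xp.asarray(design)
    w = xp.asarray(ivar)
    b = xp.asarray(pixels)
    lhs = A.T @ (w[:, None] * A)       # normal equations with inverse-variance weights
    rhs = A.T @ (w * b)
    return xp.linalg.solve(lhs, rhs)
```

Because NumPy and CuPy share most of their array API, a single code path like this stays readable to the collaboration while running on either Cori’s CPUs or Perlmutter’s GPUs.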

Over the next few years, NERSC resources will also be critical to another, larger goal of the DESI project: reprocessing and updating the data.

“Every year we are going to reprocess our data from the very beginning using the latest version of all of our code, and those will become our data assemblies that will then flow into the science papers for the collaboration,” Bailey said. “We only need 10 nodes at NERSC to keep up with the data in real time through the night, but if you want to go back and process 2, 3, 5 years of data, that’s where being able to use hundreds or thousands of nodes will allow us to quickly catch up on all that processing.”
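As a rough illustration of how that catch-up reprocessing can fan out across hundreds or thousands of nodes, the sketch below scatters a list of exposure IDs over MPI ranks so each rank works through an independent slice. It is a conceptual example only, assuming mpi4py and a hypothetical process_exposure function, not DESI’s production reprocessing code.

```python
"""Conceptual sketch of spreading a full reprocessing campaign across
many nodes (not DESI's production code): split the exposure list over
MPI ranks so each rank handles an independent share of the work."""

from mpi4py import MPI

def process_exposure(expid):
    """Placeholder for the per-exposure calibration and extraction step."""
    print(f"reprocessing exposure {expid}")

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

all_exposures = list(range(100000))        # stand-in for several years of exposure IDs
my_exposures = all_exposures[rank::size]   # simple round-robin split across ranks

for expid in my_exposures:
    process_exposure(expid)

comm.Barrier()   # wait for every rank before declaring the data assembly complete
```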