ESnet’s OSCARS Software Automation System Gets Major Upgrade to Meet Future Complex Data Challenges
By Bonnie Powell
Sometimes even a tried-and-true workhorse can reach a new level of performance when challenged by a new race. The software automation system OSCARS, one of the key innovations powering the Energy Sciences Network’s (ESnet’s) high-speed network for Department of Energy–funded scientific research, has just gotten a major update: OSCARS 1.1, which is designed to take advantage of the capabilities offered by ESnet6, the latest iteration of the network.
First deployed by ESnet in 2005 and short for "On-demand Secure Circuits and Advance Reservation System," OSCARS provides multi-domain, high-bandwidth virtual circuits that guarantee end-to-end network data transfer performance between specified locations. OSCARS gives ESnet the ability to engineer, manage, and automate the network according to user-specified requirements for scientific instruments, computation, and collaborations. In essence, it acts like an automated HOV lane: scientific research users can request the minimum speed limit, the number of cars, and more. OSCARS won an R&D 100 award in 2013 from R&D Magazine.
OSCARS has performed remarkably well for years, with minor adjustments. However, in 2022, ESnet launched ESnet6, the newest version of its global network, which offers more than 46 terabits per second of bandwidth and advanced, potentially transformative capabilities for orchestration and automation (among many other areas). OSCARS needed to catch up technologically with the other components in ESnet’s software stack, primarily so it could use the capabilities provided by the rest of the automation toolchain, but also to improve its own reliability and performance.
And there was a deadline approaching: the 2024 Data Challenge (DC24) of the Large Hadron Collider (LHC). Located at CERN in Switzerland, the LHC is the world’s largest and most powerful particle accelerator. ESnet carries all traffic for the LHC’s four particle detectors from Europe to the United States. Two of those detectors, ATLAS and CMS (of Higgs boson fame), are the largest sources of data traffic flowing across ESnet6, from CERN to Fermilab and Brookhaven National Laboratory as well as to thousands of physicists working in universities. (See the list of LHCONE collaborators.)
For DC24, which begins Feb. 12, ESnet will showcase its SENSE (SDN for End-to-end Networking @ Exascale) orchestration and intelligence system, which has been integrated into the Rucio data management and movement system that supports CMS workflows. This SENSE/Rucio integration can identify and prioritize LHC dataflow groups, ensuring a consistent, end-to-end class of service. SENSE and OSCARS are complementary, in that SENSE can coordinate resources across different administrative domains but requires OSCARS to manage the resources specifically within ESnet. The new version of OSCARS enables SENSE to fully leverage its automation capabilities for network provisioning.
In the years leading up to the LHC’s planned 2028 high-luminosity upgrade, which will increase the integrated luminosity by a factor of 10, the Worldwide LHC Computing Grid (WLCG) working group (tagline: “Dealing with the LHC data deluge”) has organized a series of coordinated data challenges. These challenges test end-to-end dataset movement from its members’ storage systems as well as the speed, latency, and other performance metrics of the global networks charged with carrying the LHC’s many petabytes of data. OSCARS needed to be ready by mid-February for DC24.
Starting in early 2023, ESnet’s Orchestration & Core Data group, with assistance from the Infrastructure and Networking teams, worked on integrating advanced network configuration capabilities and creating a robust continuous-integration pipeline for OSCARS, using Docker and Ansible. They also made further improvements to bring OSCARS in line with industry best practices, such as integrating single sign-on and adding troubleshooting tools for faster bug fixing, all while maintaining backward compatibility with OSCARS clients, preserving all historical data, and ensuring minimal service disruption.
The ESnet OSCARS team was able to deploy OSCARS 1.1 in time for SC23, the annual high-performance computing conference in mid-November, to support many leading-edge networking demonstrations by ESnet’s collaborators. The road-testing did reveal some minor issues in the new release, as well as in other components of the ESnet automation stack, but “for us, the most important lesson was that having a quick and painless application build-and-deployment process has given us the capability to fix bugs almost as soon as they emerge,” said ESnet Software Engineer Vangelis Chaniotakis.
Early bug diagnosis, rapid fix deployment
That ability will be key for this and future LHC Data Challenges. Collaborations like the LHC’s ATLAS and CMS experiments involve thousands of researchers all over the world and require complex transfers over multiple networks, each operated by a different organization. When transfers fail, it is important to be able to diagnose the cause and recover quickly. The ESnet OSCARS team has added operational troubleshooting capabilities that help its network engineers and collaborators quickly locate problems and point engineers in the right direction.
And they’re not finished. OSCARS 1.2, which ESnet hopes to release in February, aims to simplify the “advance reservation system” part of the acronym. Instead of supplying the exhaustive technical detail that is currently necessary, users will be able to express their needs for network resources in terms of intent, via a modern browser user interface. OSCARS 1.2 will then interpret that intent, determine the necessary technical requirements, and allocate resources. Users can modify or update their reservations if needed.
Such improvements in simplicity and flexibility are vital as global research collaborations continue to grow in complexity, with workflows involving multiple facilities and requiring swift, seamless transfer of massive quantities of data.