Graphcore System Joins Upgraded Cerebras and SambaNova Machines in Spring Deployment to ALCF AI Testbed



May 4, 2023: The AI testbed at the Argonne Leadership Computing Facility (ALCF), located at the U.S. Department of Energy’s (DOE) Argonne National Laboratory, has been upgraded to give the global user community continued access to cutting-edge AI systems for scientific research.

The AI testbed is a collection of some of the world’s most advanced AI accelerators, designed to enable researchers to explore deep learning and machine learning workloads that advance AI for science. The testbed has also helped the facility better understand how to integrate novel AI technologies with traditional supercomputing systems powered by CPUs (central processing units) and GPUs (graphics processing units). ALCF is a DOE Office of Science user facility at Argonne.

Graphcore Bow Pod64 system on the ALCF AI testbed. (Credit: Argonne National Laboratory)

The testbed’s primary focus is to help evaluate the usability and performance of machine learning-based high-performance computing applications on proprietary AI hardware. ALCF’s AI platform will complement the facility’s current and next-generation supercomputers, providing a state-of-the-art environment to support research at the intersection of AI, big data, and high-performance computing.

“Our upgraded AI testbed systems offer unparalleled power and user accessibility to enable data-driven discovery in critical DOE science areas, including high-energy particle physics. These systems empower researchers to push the boundaries of scientific understanding and accelerate progress in both applied and basic research,” according to Argonne’s Computing, Environment and Life Sciences Directorate. “Whether they need in situ analysis of beamline images, rapid training of billion-parameter models, or learning-based optimization of running simulations, ALCF users can leverage the latest AI technologies throughout their research campaigns.”

This spring, ALCF will add a new Graphcore system to its AI testbed, along with upgraded Cerebras and SambaNova machines, giving researchers around the world access to powerful AI technologies.

The ALCF AI testbed systems are available to the global research community. Interested users can apply for an allocation on these machines at any time.

Deploying the Graphcore system

The 22-petaflops Graphcore Bow Pod64, a system built on Graphcore’s Intelligence Processing Units (IPUs), is the latest platform to be deployed on the ALCF AI testbed for the scientific community.

As IPUs are well suited for both general and specialized machine learning applications, these new processors will help facilitate the use of new AI techniques and model types.

Upgrading the Cerebras and SambaNova systems

The addition of a Cerebras Wafer-Scale Cluster (WSE-2), which includes a MemoryX system and the SwarmX fabric, enhances the testbed’s existing Cerebras CS-2 system, enabling near-perfect linear scaling of large language models (LLMs). LLMs, deep learning models with billions of parameters, are well positioned to expand the possibilities of AI for science, but distributing such models across thousands of GPUs poses significant challenges, such as sublinear performance scaling.

Venkat Vishwanath, Data Science Team Leader at ALCF, said: “This can make AI at extreme scales much more manageable.”

The second-generation SambaNova DataScale system, meanwhile, consists of eight nodes with a total of 64 next-generation SN30 Reconfigurable Dataflow Units (RDUs). It enables a wider range of AI-for-science applications and makes large-scale AI models and datasets more manageable for users.

Each accelerator is allocated terabytes of memory, making the system well suited for applications involving LLMs and high-resolution imaging from experimental facilities such as the Advanced Photon Source (APS), a DOE Office of Science user facility at Argonne.

ALCF users have already begun using the AI testbed systems in their research. Some of their ongoing efforts are highlighted below.

Experimental data analysis

Mathew Cherukara, a computational scientist and group leader in Argonne’s X-ray Science Division, is exploring the use of multiple testbed systems to accelerate and scale deep learning models for the analysis of X-ray data obtained at the APS.

“Several APS beamlines regularly collect data at rates exceeding 1 gigabyte per second, and these data need to be analyzed on the millisecond timescale,” he explained.

Cherukara uses ALCF’s Cerebras system for rapid training, including training on live data from instruments. The SambaNova machine, meanwhile, is useful for training models that are too large to fit on a single GPU, such as models that produce improved 3D images from X-ray data. He noted that related work is underway on the AI testbed to investigate the use of its Groq system for fast inference applications.

Porting the models used at the APS (PtychoNN, BraggNN, and AutoPhaseNN) to ALCF’s AI systems has yielded promising initial results, including substantial speedups over traditional supercomputers. ALCF and vendor software teams are working with Cherukara’s team to achieve further progress.

“Our ultimate goal is to get these testbed systems into production. Every year, more than 5,000 scientists from across the nation and around the world use the unique capabilities the APS provides,” Cherukara said. “These users need fast and accurate data analysis that runs in sync with their experiments. By providing fast, scalable deep learning model training and inference, the AI testbed systems have the potential to meet these needs.”

Graph neural networks

Graph neural networks (GNNs) are powerful machine learning tools that can process and learn from data represented as graphs. GNNs are used in several research areas, including molecular design, financial data analysis, and social networks. Filippo Simini, a computer scientist at ALCF, is comparing the performance of GNN models across multiple AI accelerators, including the Graphcore IPU.

“My team is focusing on inference to identify which GNN-specific operators or kernels can become computational bottlenecks that affect overall runtime as parameter counts and batch sizes increase,” he said.

COVID-19 research

Arvind Ramanathan, a computational biologist in Argonne’s Data Science and Learning Division, and his team used the AI testbed in work that applied LLMs to identify SARS-CoV-2 variants. Their workflow leveraged the Cerebras CS-2 and Wafer-Scale Cluster alongside GPU-accelerated systems, including ALCF’s Polaris supercomputer.

He emphasized that one of the key challenges in the research was handling the enormous number of genome sequences, whose size overwhelms many computing systems when training foundation models. The learning-optimized architecture of the Cerebras system was key to accelerating the training process. The team’s work won the 2022 ACM Gordon Bell Special Prize for High Performance Computing-Based COVID-19 Research.

Molecular simulation

Logan Ward, a computational scientist in Argonne’s Data Science and Learning Division, has led efforts to run applications that perform two types of computations to study potential battery materials. In the first, his team performs physics-based simulations of molecules undergoing redox reactions, calculating how much energy a molecule can store when it is charged. In the second, the team trains a machine learning model to predict that amount of energy.

“Our application ends up combining the two calculations,” Ward said. “We use a trained machine learning model to predict the outcome of redox computations, which helps us run the computations that identify molecules with the capacity needed for energy storage.”

His work aims to make this process as efficient as possible.

He detailed some of the benefits that come with using this system.

Jump-start your research on the AI testbed systems by leveraging your existing workflows. ALCF’s expert staff provide hands-on assistance to help users achieve the best possible results. For more information, including how to obtain an allocation, visit https://www.alcf.anl.gov/alcf-ai-testbed.

The Argonne Leadership Computing Facility provides supercomputing capabilities to the scientific and engineering community to advance fundamental discovery and understanding across a broad range of disciplines. Supported by the Advanced Scientific Computing Research (ASCR) program in the U.S. Department of Energy’s (DOE) Office of Science, the ALCF is one of two DOE Leadership Computing Facilities in the nation dedicated to open science.

Argonne National Laboratory seeks solutions to pressing national problems in science and technology. The nation’s first national laboratory, Argonne conducts leading-edge basic and applied scientific research in virtually every scientific discipline. Argonne researchers work closely with researchers from hundreds of companies, universities, and federal, state, and municipal agencies to help them solve their specific problems, advance America’s scientific leadership, and prepare the nation for a better future. With employees from more than 60 nations, Argonne is managed by UChicago Argonne, LLC for the U.S. Department of Energy’s Office of Science.

The U.S. Department of Energy’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States and is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.


Source: Nils Heinonen, ALCF


