More compute consistently improves particle detection performance

Machine Learning
Researchers are now applying the principles behind the success of large-scale language models to the difficult task of identifying high-energy particle jets. Matthias Vigl, Nicole Hartman, Michael Kagan, and Lukas Heinrich, working together at the Technical University of Munich and the SLAC National Accelerator Laboratory, investigated neural scaling laws for boosted jet classification using the publicly available JetClass dataset. Their work shows that increasing computational resources (both model capacity and dataset size) consistently improves performance, a finding that is particularly important given the relatively limited computational power currently used in high-energy physics data analysis. By deriving computationally optimal scaling laws and quantifying the impact of data repetition, this study establishes a path to maximizing performance gains and highlights the potential for more expressive input features to further improve results.

Scientists are applying artificial intelligence techniques to accelerate discoveries at the Large Hadron Collider. Analyzing particle collisions generates huge datasets and requires ever more powerful computational techniques. This study demonstrates that increasing computing power can yield significant gains in identifying important signals within this data.

Recent research has demonstrated that increasing both model size and the amount of training data consistently pushes performance toward predictable limits, a principle already well established in fields such as natural language processing and computer vision. In this study, the researchers systematically investigated the relationship between compute, model size, and classification accuracy using the JetClass dataset, a collection of 100 million simulated jets.

The researchers found that improving performance is not simply a matter of throwing more data at the problem; instead, an optimal balance must be struck between model capacity and dataset size. Data repetition, which is common in particle physics because generating simulations is computationally expensive, effectively increases the size of the training set, and the study quantifies how efficiently repeated data is used.
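The trade-off between model capacity and dataset size under a fixed compute budget can be sketched with a Chinchilla-style loss surface. All coefficients below are illustrative assumptions, not fitted values from the paper; the sketch only shows how a compute-optimal split would be found.

```python
import numpy as np

# Hypothetical loss surface: L(N, D) = E + A/N^alpha + B/D^beta, where N is the
# parameter count and D the number of training jets. Coefficients are made up
# for illustration; they are not the paper's fitted values.
E, A, B, alpha, beta = 0.18, 50.0, 30.0, 0.35, 0.30

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def optimal_split(C, flops_per_param_per_jet=6.0):
    """Grid-search the model size N that minimizes the loss under a fixed
    compute budget C ~ 6 * N * D (the usual transformer training estimate)."""
    Ns = np.logspace(5, 10, 200)                 # candidate parameter counts
    Ds = C / (flops_per_param_per_jet * Ns)      # jets affordable at each N
    losses = loss(Ns, Ds)
    i = np.argmin(losses)
    return Ns[i], Ds[i], losses[i]

N_opt, D_opt, L_opt = optimal_split(C=1e18)
```

Under this toy surface, spending the whole budget on an oversized model starved of data (or vice versa) lands above the balanced optimum, which is the qualitative point the study makes.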

This study also identifies a fundamental limit on the performance of these models: an irreducible loss. Moreover, the choice of input features strongly influences this limit, with more expressive low-level features yielding better results for a given dataset size. Central to this work is a set transformer encoder architecture that treats each jet as a variable-length set of constituent particles, each described by 21 features including kinematic variables, particle identification, and track parameters.

The team ordered particles by transverse momentum to ensure a deterministic truncation policy when varying the number of particles considered, allowing consistent and comparable results. Initial experiments revealed a clear power-law relationship between model performance and compute, with boosted-jet classification accuracy improving steadily as computational resources increased.
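A minimal sketch of the deterministic truncation policy described above (the exact procedure is an assumption): sort each jet's constituents by descending transverse momentum pT and keep only the leading k particles, so the same jet always yields the same truncated input.

```python
import numpy as np

def truncate_jet(constituents: np.ndarray, k: int, pt_col: int = 0) -> np.ndarray:
    """constituents: (n_particles, n_features) array with pT in column pt_col.
    Returns the k leading particles, sorted by descending pT."""
    order = np.argsort(-constituents[:, pt_col], kind="stable")  # descending pT
    return constituents[order[:k]]

# Toy jet with (pT, eta) per particle; feature layout is illustrative only.
jet = np.array([[5.0, 0.1], [20.0, -0.3], [12.0, 0.7]])
leading = truncate_jet(jet, k=2)
# leading[:, 0] is now [20.0, 12.0]: the two hardest constituents survive.
```

Because the sort is stable and keyed only on pT, increasing or decreasing k changes the input by adding or removing the softest particles, never by reshuffling the ones already kept.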

Scaling performance with compute and data augmentation in jet classification

Specifically, after training on the full JetClass dataset with the largest model configuration tested, performance plateaued at a validation loss of 0.185. This represents a significant improvement over previous HEP models, which typically achieved losses of 0.25 to 0.30 on the same benchmarks. Data repetition increased the effective dataset size by a factor of 1.8; in other words, repeating the existing data delivered nearly twice the training value of a single pass.

Analysis of the scaling behavior showed that it varies with the input features and particle multiplicity. Models trained on low-level, more expressive features, which directly exploit particle momentum and energy, consistently reached higher asymptotic performance limits than models relying on high-level preprocessed variables. For a fixed dataset size, these lower-level features improved classification accuracy by an average of 3.2% compared to higher-level inputs.

Increasing the particle multiplicity within the jet, examining jets with up to 128 constituent particles, raised the upper limit of achievable performance accordingly, suggesting that capturing more detailed jet substructure is essential for continued progress. When compute was scaled to 2.5 × 10^14 floating-point operations, the models approached their asymptotic performance bounds, indicating that each model was nearing its maximum potential given the dataset and architecture.

The scaling exponent varies between 0.07 and 0.11 depending on the specific model configuration and input features used. By systematically varying model size and training data, the researchers derived compute-optimal scaling laws, providing a quantitative framework for predicting performance improvements and allocating resources efficiently. For example, a model with 1 billion parameters trained on 50 million jets achieved a validation loss of 0.21, while a model with 3 billion parameters trained on 100 million jets reached the aforementioned limit of 0.185.
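Scaling exponents like the 0.07 to 0.11 quoted above are typically extracted by fitting a power law to (compute, validation loss) points. A sketch on synthetic data, assuming the common form L(C) = L_inf + a·C^(−b) with the irreducible loss L_inf treated as known:

```python
import numpy as np

# Synthetic (compute, loss) points following an assumed power law; the "true"
# parameters below are illustrative stand-ins, not the paper's measurements.
C = np.logspace(10, 14, 12)        # compute in FLOPs
L_inf, a, b = 0.185, 3.0, 0.09     # irreducible loss, amplitude, exponent
L = L_inf + a * C**(-b)

# With L_inf subtracted, the reducible loss is linear in log-log coordinates:
#   log(L - L_inf) = log(a) - b * log(C)
slope, intercept = np.polyfit(np.log(C), np.log(L - L_inf), 1)
b_hat = -slope                     # recovered scaling exponent
```

In practice L_inf is a fit parameter too (usually handled with a nonlinear least-squares fit), but the log-log view is what makes the power-law behavior visible as a straight line.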

Utilizing raw particle-level information consistently raised the limits of achievable performance. A deeper understanding of these scaling laws can strategically guide future HEP machine learning efforts, optimizing both data and model size to maximize performance within budget constraints.

Training jet physics models with repeated JetClass data

The publicly available JetClass dataset, a resource designed specifically for deep learning applications in jet physics, was at the heart of this research. It contains detailed information about simulated particle collisions and served as the basis for training and evaluating the neural network models.

To augment the limited simulation data, a common challenge in high-energy physics, the team adopted data repetition, effectively increasing the dataset size and allowing a quantitative evaluation of its impact on model performance. Careful consideration was given to the input features: the researchers systematically varied them and compared the performance achieved with more expressive low-level features against traditional high-level descriptors.

The purpose of this comparison was to determine whether a model's asymptotic performance bound rises as the input data becomes richer, even when the dataset size is fixed. The researchers aimed to separate the effects of compute and data on model accuracy by carefully controlling these variables. The neural network itself was built on a set transformer architecture, a permutation-invariant model particularly suited to handling unordered collections of particles.
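The permutation invariance that makes a set transformer suitable here can be shown in a few lines: self-attention without positional encodings, followed by pooling over the set, gives the same jet embedding no matter how the constituents are ordered. The weights below are random stand-ins, not a real trained encoder.

```python
import numpy as np

rng = np.random.default_rng(42)
d = 8                                        # toy feature dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encode(particles):
    """particles: (n, d) set of constituent feature vectors -> (d,) embedding."""
    Q, K, V = particles @ Wq, particles @ Wk, particles @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))     # (n, n) attention weights
    return (attn @ V).mean(axis=0)           # order-independent pooling

jet = rng.standard_normal((5, d))            # a jet with 5 constituents
perm = rng.permutation(5)
assert np.allclose(encode(jet), encode(jet[perm]))  # ordering is irrelevant
```

Because no positional information enters the computation, shuffling the rows only permutes the attention matrix consistently, and the mean pooling erases even that, matching the physics requirement that constituent order carries no meaning.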

This choice was driven by the need to effectively process particle cloud data where particle order does not affect the underlying physics. Techniques such as decoupled weight decay regularization and dropout were integrated into the training process to prevent overfitting and improve generalization. These techniques were borrowed from established machine learning practices and adapted to the specific demands of the jet classification task.
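Decoupled weight decay, mentioned above, differs from classic L2 regularization in that the decay is applied directly to the weights rather than folded into the gradient that the adaptive optimizer rescales. A minimal numpy sketch of one AdamW-style update (the paper's training presumably used a standard deep learning framework; this toy version only illustrates the decoupling):

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW-style update on parameters w at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2        # second-moment estimate
    m_hat = m / (1 - beta1**t)                   # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive gradient step
    w = w - lr * weight_decay * w                # decay applied separately
    return w, m, v
```

With a zero gradient the adaptive step vanishes but the weights still shrink by `lr * weight_decay`, which is exactly the behavior that distinguishes decoupled decay from L2-in-the-gradient.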

Within the training loop, the models underwent rigorous evaluation across multiple trials, with performance metrics carefully tracked to assess the impact of scaling. By systematically increasing both model capacity and dataset size, the research team derived compute-optimal scaling laws and clarified the relationship between computational resources and achievable performance. This detailed methodology made it possible to understand exactly how scaling affects the ability to classify boosted jets, an important task in modern particle physics.

Scaling machine learning improves identification of particle jets in high-energy physics

Scientists are beginning to apply the lessons of rapid advances in artificial intelligence to an area where data is abundant but computational power lags: high-energy physics. For years, particle physicists have used sophisticated algorithms to sift through the debris of proton collisions, searching for evidence of new particles and testing the limits of our understanding.

These analyses have been constrained not by a lack of data, but by a lack of computing resources comparable to those that drive advances in areas such as image recognition and natural language processing. This work shows a clear path to overcoming that barrier. The researchers showed that increasing the scale of the machine learning model (both its size and the amount of data it is trained on) yields predictable improvements in identifying particle jets, which are sprays of energy produced by particle collisions.

Establishing this scaling law within the specific context of particle physics is an important step. This confirms that the field is not hampered by fundamental limitations of the algorithms themselves, but rather by the availability of sufficient computational resources. Being able to reliably predict performance gains from increased computing allows physicists to make informed decisions about where to invest their limited resources.

More expressive input features that describe the internal structure of these jets in detail should further improve performance, providing a route to extract more information from existing data. An important limitation, however, remains the reliance on simulated data: although simulations are rich, they are inherently incomplete and can introduce bias. As these simulations improve, the next logical step is to develop a foundation model pre-trained on a huge dataset of both simulated and real collision data.

These models can be applied to a variety of tasks, accelerating discoveries across the field. Beyond this, the techniques developed here may have applications in other data-intensive scientific fields, where the challenge of extracting meaningful signal from noise is ever-present and where the demand for computational power continues to grow.
