
Silent error in BLOOM-176B training. Credit: arXiv (2025). DOI: 10.48550/arxiv.2506.14813
TrainCheck uses training invariants to find the root cause of hard-to-detect errors before they cause downstream problems, saving time and resources.
A new open-source framework developed at the University of Michigan proactively detects silent errors that occur during deep learning training. These hard-to-detect problems do not cause obvious training failures, but they quietly degrade the model's performance while wasting valuable time and resources.
In evaluation, the TrainCheck framework identified 18 of 20 real-world silent training errors within a single training iteration, while current methods caught only two, and it also uncovered six previously unknown bugs in popular training libraries. The researchers introduced TrainCheck in a study presented at the USENIX Symposium on Operating Systems Design and Implementation (OSDI) in Boston.
“By developing TrainCheck, we aim to empower developers with better tools to deal with silent errors, ultimately enabling more robust AI systems,” said Ryan Huang, an associate professor of computer science and engineering and a senior author of the study.
During deep learning training, artificial neural networks learn to perform tasks from large amounts of data, adjusting their parameters over many cycles until they reach the desired performance. Large-scale AI models, such as large language models (LLMs) and computer vision models, are particularly expensive to train, and silent errors let training continue while quietly steering it toward a suboptimal model.
Current methods monitor deep learning training with high-level signals such as loss (how far the model's predictions are from the correct answers), accuracy (the percentage of correct responses), and gradient norms (a measure of how much the model's parameters change at each training step).
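For illustration, here is a minimal sketch of this kind of conventional signal monitoring in a PyTorch training step. The model, optimizer, loss function, and print-based reporting are hypothetical placeholders, not code from the study.

```python
import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    """L2 norm over all parameter gradients, a common training-health signal."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().norm(2).item() ** 2
    return total ** 0.5

def train_step(model, optimizer, loss_fn, inputs, labels):
    """One training step that reports the usual high-level signals."""
    optimizer.zero_grad()
    outputs = model(inputs)                  # assumes a classification model
    loss = loss_fn(outputs, labels)
    loss.backward()
    grad_norm = global_grad_norm(model)
    accuracy = (outputs.argmax(dim=1) == labels).float().mean().item()
    optimizer.step()
    # These scalars fluctuate naturally from step to step, which is why
    # fixed thresholds on them make an unreliable error detector.
    print(f"loss={loss.item():.4f} acc={accuracy:.3f} grad_norm={grad_norm:.3f}")
```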
However, these bird's-eye-view metrics are noisy and fluctuate naturally during training, making it difficult to distinguish normal variation from real problems. For example, training of the BLOOM-176B LLM at Hugging Face was hit by a silent error that went unnoticed because it caused no obvious change in loss or accuracy. The bug caused the copies of the model running on different GPUs to drift apart, making the final trained model unusable and wasting months of expensive computation.
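The replica divergence that affected the BLOOM-176B run can, in principle, be caught by directly comparing parameter copies across GPUs. Below is a minimal sketch using torch.distributed, assuming an already-initialized process group; it illustrates the idea and is not TrainCheck's implementation.

```python
import torch
import torch.distributed as dist

def check_replica_consistency(model: torch.nn.Module, atol: float = 0.0) -> None:
    """Raise if this rank's parameters have drifted from rank 0's copy.

    In correct data-parallel training, replicas should remain identical
    (or nearly so) after every synchronized update.
    """
    for name, param in model.named_parameters():
        reference = param.detach().clone()
        dist.broadcast(reference, src=0)  # all ranks receive rank 0's weights
        if not torch.allclose(param.detach(), reference, atol=atol):
            raise RuntimeError(
                f"Silent divergence: parameter {name!r} on rank "
                f"{dist.get_rank()} no longer matches rank 0"
            )
```

Running such a check every few iterations would surface this kind of drift long before months of computation were spent, at the cost of some extra communication.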
TrainCheck's new approach instead relies on training invariants, rules that should hold throughout training. The framework continuously monitors these invariants, immediately alerts developers to any deviation, and supplies detailed debugging information that helps pinpoint what went wrong. This is a major step beyond previous high-level methods, which could not locate the root cause even when they detected a problem.
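To make the idea concrete, here is a hand-written sketch of invariant monitoring in Python. TrainCheck infers invariants automatically from instrumented training runs; the class, state keys, and example rule below are hypothetical illustrations rather than the tool's actual API.

```python
import torch

class InvariantMonitor:
    """Checks a set of (description, predicate) invariants at every step."""

    def __init__(self):
        self.invariants = []

    def register(self, description, predicate):
        self.invariants.append((description, predicate))

    def check(self, state: dict, step: int) -> None:
        for description, predicate in self.invariants:
            if not predicate(state):
                # Report the violated rule itself, which localizes the bug,
                # instead of a noisy downstream symptom like a loss spike.
                raise AssertionError(
                    f"step {step}: invariant violated: {description}"
                )

monitor = InvariantMonitor()
# Example invariant: a parameter that received a nonzero gradient must
# actually change after optimizer.step(); a silently frozen layer fails this.
monitor.register(
    "parameters with nonzero gradients are updated by optimizer.step()",
    lambda s: s["grad_norm"] == 0.0
              or not torch.equal(s["param_before"], s["param_after"]),
)
```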
"By automatically inferring and monitoring training invariants, TrainCheck enables rapid identification and resolution of errors, a major advance over traditional methods. This sets a new standard for error detection in machine learning frameworks," said Yuxuan Jiang, the study's lead author.
The researchers tested TrainCheck on 20 silent errors, comparing its performance to four existing detection methods. Six of the silent errors were drawn from previous research, with the other 14 coming from issues discussed in developer forums (GitHub, Stack Overflow, social media).
TrainCheck detected 18 of the 20 silent errors, while the high-level signal detectors caught only two. For diagnosis, of the 18 errors TrainCheck detected, its violation reports pinpointed the exact root cause in 10 cases and localized close to the root cause in the other eight. In contrast, the high-level detectors could provide a diagnostic hint for only one error.
"We were impressed by how well TrainCheck performed in handling real-world issues using a principled, invariant-based approach," Huang said.
In the false-alarm evaluation, TrainCheck did occasionally alert developers to errors that were not real, but at a low rate, and the false alarms followed recognizable patterns that made them relatively easy to dismiss.
The strong results suggest that TrainCheck can be integrated into a variety of machine learning frameworks, giving developers a proactive tool to guard against errors. Detecting silent errors early minimizes wasted resources and improves the accuracy and robustness of the resulting models.
Future work could enhance TrainCheck to offer developers additional debugging assistance and extend its continuous validation approach to other computational domains, such as distributed systems, where silent errors also undermine resilience and performance.
More information:
Yuxuan Jiang et al, Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks, arXiv (2025). DOI: 10.48550/arxiv.2506.14813
GitHub: github.com/orderlab/traincheck
Journal information: arXiv
Provided by the University of Michigan
Citation: AI model improvement: Automated tool detects silent errors in deep learning training (2025, July 24). Retrieved July 26, 2025 from https://techxplore.com/news/2025-07-AI-automated-tool-silent-errors.html
This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.
