In a presentation at AI Engineer Europe, Snorkel developer advocate Kobie Crawford discussed the role of the Task Fidelity Scaling Law in advancing AI model development. Focusing on how Snorkel integrates research and production, Crawford emphasized that the company grew out of academic research, specifically a doctoral dissertation at the Stanford AI Lab, which led to a library for generating training data for foundation models. That foundational work has since evolved into providing datasets for customers’ models, with a consistent focus on integrating research and operations.
Laws of Scaling Task Fidelity: Kobie Crawford on AI Data Quality — AI Engineer
Visual TL;DR
Task quality matters: An AI model’s capabilities are fundamentally limited by the quality of its training data
Task Fidelity Scaling Law: Kobie Crawford on its critical role in advancing AI model development
Defining and assessing quality: Understanding and measuring the quality of training tasks
Failure mode analysis: Identifying the specific reasons why a task fails
High-quality tasks: Improve model performance regardless of architecture
Snorkel approach: A library for generating testable training data for foundation models
Verifiable datasets: Snorkel focuses on providing high-quality datasets to its customers
The importance of task quality in AI training
Crawford began by asking a central question: “Does the quality of the task actually matter?” The argument was that the capabilities of an AI model are fundamentally limited by the quality of its training data, and that this holds regardless of model architecture, scale, or the specific agent harness used. In agent benchmarking and evaluation, task quality is synonymous with data quality. However, Crawford noted that the field currently lacks sufficient empirical evidence to show conclusively that selecting higher-quality tasks leads to meaningfully better training outcomes. This evidence gap motivated Snorkel’s study to measure the impact of task quality on model performance.
Defining and evaluating task quality
To address this gap, Snorkel evaluated Terminal-Bench-style agentic coding tasks against four key acceptance criteria: achievability, non-obviousness, functional correctness, and reliability. Tasks that met these criteria were classified as “accepted tasks,” and tasks that did not were marked as “rejected tasks.” The objective was to compare the characteristics of accepted and rejected tasks in order to validate the curation process and show that it selects higher-quality tasks.
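The accept/reject step described above can be pictured as a simple filter over reviewed tasks. The following is a minimal Python sketch, not Snorkel's tooling: the `TaskReview` class, its field names, and the sample task IDs are hypothetical, and it assumes each criterion has already been judged as a yes/no value and that a task must pass all four to be accepted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskReview:
    """Hypothetical review record for one agentic coding task."""
    task_id: str
    achievable: bool            # criterion: achievability
    non_obvious: bool           # criterion: non-obviousness
    functionally_correct: bool  # criterion: functional correctness
    reliable: bool              # criterion: reliability

    def is_accepted(self) -> bool:
        # Assumption: a task is accepted only if all four criteria hold.
        return all((self.achievable, self.non_obvious,
                    self.functionally_correct, self.reliable))


def partition(reviews):
    """Split reviews into (accepted, rejected) lists, preserving order."""
    accepted = [r for r in reviews if r.is_accepted()]
    rejected = [r for r in reviews if not r.is_accepted()]
    return accepted, rejected


# Illustrative records only; these are not tasks from the study.
reviews = [
    TaskReview("task-a", True, True, True, True),
    TaskReview("task-b", True, False, True, True),
]
accepted, rejected = partition(reviews)
print([r.task_id for r in accepted], [r.task_id for r in rejected])
```

In practice each boolean would come from a human or automated review; the sketch only shows how the two groups compared in the study could be separated once those judgments exist.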
Crawford presented data showing that accepted tasks are typically more difficult and complex, requiring multi-step workflows rather than one-shot answers. These tasks also drew more tool calls and more inference steps from the models that attempted them. Conversely, rejected tasks were often either simple problems or produced failures that were not very informative for model improvement.
Analyzing task failure modes
The presentation then turned to task failure categories to explore where the models failed and why. By classifying failures, Snorkel aimed to understand the impact of task quality on model training. The analysis revealed that accepted tasks, although more complex, lead to cleaner failures and provide more actionable insights for model improvement. Rejected tasks, on the other hand, often result in “noisy” failures that are difficult to learn from.
Specifically, the data showed significant differences in the prevalence of particular failure modes between accepted and rejected tasks. For example, “logic errors” and “incompletes” were much more common in the accepted tasks, suggesting that the models were tackling more difficult problems. Rejected tasks, by contrast, had a high percentage of “wrong approach” and “syntax error” failures, indicating problems with the task definition or with the model’s basic understanding of the problem.
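A comparison of this kind boils down to computing each failure mode's share within each task group. The sketch below shows one way to do that in Python; the function name `failure_mode_shares` is hypothetical and the labels are made up for illustration, so the printed shares say nothing about the study's actual numbers.

```python
from collections import Counter


def failure_mode_shares(failure_labels):
    """Return each failure mode's share of all failures in one task group."""
    counts = Counter(failure_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {mode: count / total for mode, count in counts.items()}


# Made-up labels for illustration; these are not the study's data.
accepted_failures = ["logic error", "logic error", "incomplete", "syntax error"]
rejected_failures = ["wrong approach", "syntax error", "syntax error", "logic error"]

accepted_shares = failure_mode_shares(accepted_failures)
rejected_shares = failure_mode_shares(rejected_failures)

# Print the two groups side by side, one failure mode per row.
for mode in sorted(set(accepted_shares) | set(rejected_shares)):
    a = accepted_shares.get(mode, 0.0)
    r = rejected_shares.get(mode, 0.0)
    print(f"{mode:15s} accepted={a:.0%} rejected={r:.0%}")
```

Normalizing to shares rather than raw counts matters here because the accepted and rejected groups need not contain the same number of failures.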
Impact of high-quality tasks on model performance
The central finding presented was that high-quality tasks lead to dramatically better models. The experiments compared a base model against versions fine-tuned on either low-quality or high-quality data. Fine-tuning on high-quality data significantly improved test pass rates: the reported gain was +6.2 percentage points for high-quality tasks versus +1.1 percentage points for low-quality tasks, more than five times the improvement, attributable to data quality alone.
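The "5x" figure follows directly from the two reported gains. A quick check in Python, using only the numbers quoted above (the variable names are illustrative):

```python
# Gains reported in the talk, in percentage points over the base model.
high_quality_gain_pp = 6.2
low_quality_gain_pp = 1.1

# How many times larger the high-quality gain is than the low-quality gain.
ratio = high_quality_gain_pp / low_quality_gain_pp

# Extra improvement that comes from switching to high-quality data.
extra_gain_pp = high_quality_gain_pp - low_quality_gain_pp

print(f"ratio of gains: {ratio:.1f}x")
print(f"additional gain from high-quality data: {extra_gain_pp:.1f} pp")
```

The ratio works out to about 5.6, which the talk rounds to 5x. Note that this is a ratio of improvements over the base model, not a ratio of absolute pass rates.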
Crawford emphasized that while the model trained on low-quality data showed some improvement over the base model, the improvement was small compared to what was achieved with high-quality data. This underscores the importance of data quality in achieving robust and reliable AI model performance. The study also indicated that a human-in-the-loop process for generating and validating these high-quality tasks is essential to training the model on data that accurately reflects the desired functionality and challenges.
The Snorkel approach to data curation
Crawford explained that Snorkel’s platform incorporates both human expertise and programmatic techniques to create high-quality, verifiable datasets. This approach enables the generation of training data that is not only accurate but also scalable. By applying rigorous standards and leveraging a combination of human and AI-driven annotations, Snorkel aims to overcome the inherent challenges in defining and measuring task quality, ultimately leading to more effective AI models.
The presentation concluded by highlighting Snorkel’s continued efforts to improve the way tasks are created and evaluated, with a focus on building benchmarks that are both challenging and useful for AI development. The company’s commitment to data quality is a key differentiator in the rapidly evolving AI landscape.