AI Lab: Secrets to keeping machine learning engineers moving fast

  • The key to increasing overall AI development velocity is minimizing the time to first batch (TTFB) for machine learning (ML) engineers.
  • AI Lab is a pre-production framework used internally at Meta that allows for continuous A/B testing of common ML workflows, enabling proactive improvements and automatically preventing TTFB regressions.
  • AI Lab allows for experimentation and improvement while preventing TTFB regressions. For example, during the rollout of the open source Python Cinder Runtime, AI Lab enabled engineers to double the initial TTFB gains, culminating in up to a 40% improvement in TTFB.

Time to first batch (TTFB), the delay between when a workflow is submitted and when the first batch of a training job begins, plays a key role in accelerating the iteration rate of machine learning (ML) engineers. Essentially, TTFB is the time that elapses between pressing "start" on model training and the first batch of data entering the model for processing. TTFB adds overhead to every ML training job, and it is effectively the moment when a developer receives the first signal from a job.

Minimizing TTFB reduces the burden on our ML engineers, allowing them to perform more iterations per day, and improves the overall speed of innovation at Meta.

Supporting TTFB across Meta requires a scalable service that not only proactively improves this valuable metric, but also keeps it healthy autonomously. To this end, we created AI Lab, which helps infrastructure owners ship new changes with high reliability, delivering up to a 40% improvement in TTFB. This, combined with automatic prevention of regressions, enables ML engineers across Meta to move fast.

Optimizing TTFB helps ML engineers work faster

The overhead caused by TTFB sits in the critical path of most ML development. It consists of components such as configuration validation, feature pre-processing, and infrastructure overhead (such as capacity queuing). Optimizing TTFB components can even benefit the entire training cycle for some models. At Meta's scale, we frequently see changes that subtly affect TTFB as developers iterate on models, launchers, or architectures.

An example measurement of TTFB, broken down by component.
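
To make the metric concrete, here is a minimal sketch of how a per-component TTFB breakdown could be instrumented. Everything in it (the component names, the record_span helper, and the sleep stand-ins) is illustrative rather than Meta's actual instrumentation:

    import time
    from contextlib import contextmanager

    # Toy per-component TTFB breakdown; names are illustrative only.
    ttfb_spans: dict[str, float] = {}

    @contextmanager
    def record_span(name: str):
        """Record the wall-clock duration of one TTFB component."""
        start = time.monotonic()
        try:
            yield
        finally:
            ttfb_spans[name] = time.monotonic() - start

    # Stand-ins for the real work performed before the first batch.
    def validate_config(): time.sleep(0.05)
    def wait_for_capacity(): time.sleep(0.15)
    def preprocess_features(): time.sleep(0.10)

    with record_span("config_validation"):
        validate_config()
    with record_span("capacity_queuing"):
        wait_for_capacity()
    with record_span("feature_preprocessing"):
        preprocess_features()

    # TTFB is the total overhead between submission and the first batch.
    print(f"TTFB: {sum(ttfb_spans.values()):.2f}s  breakdown: {ttfb_spans}")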

To keep ML engineers moving quickly, we need two things:

  1. Proactively improve TTFB: We need an intuitive, easy-to-use experimentation framework that lets users quantify the impact of a change, enables rapid iteration on and certification of new features, and empowers infrastructure owners to ship changes with confidence.
  2. Proactively prevent TTFB regressions: We need continuous regression prevention that tests the latest changes in a low-noise environment and provides a way to monitor, detect, and stop regressions before they impact ML engineers.

Introducing AI Lab

AI Lab is a specialized pre-production framework that continuously runs common ML workflows as A/B tests to accurately measure the impact of recent changes on metrics like TTFB. Much like Mobile Lab, AI Lab automatically defends TTFB by preventing regressions before release and, as an experimentation framework, enables opportunistic, proactive TTFB improvements.

Building AI Lab presented unique challenges. GPU capacity is a precious resource, so we needed to ensure a net-positive use of capacity across Meta. Working with our partners, we crafted scaled-down models and simple configurations that could run on CPUs alone, while still catching the kinds of regressions that would otherwise hit GPU jobs. To this end, we created an auto-shrinker that ensures tests run with the same code and configuration as production but consume less compute, reducing the number of training iterations and the model size while enabling more deterministic behavior. These tests often run in under ten minutes, which is helpful for developers iterating on potential TTFB changes. We also needed a holistic strategy for operating at Meta's scale, which we'll discuss in a later section.
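
To illustrate the idea (the config fields and shrink rules below are hypothetical, not AI Lab's actual auto-shrinker), the auto-shrinker can be thought of as a pure transformation over a production training config:

    from dataclasses import dataclass, replace
    from typing import Optional

    @dataclass(frozen=True)
    class TrainingConfig:
        model_layers: int
        hidden_dim: int
        max_train_iterations: int
        use_gpu: bool
        seed: Optional[int] = None

    def auto_shrink(prod: TrainingConfig) -> TrainingConfig:
        """Scale a production config down: same code paths, far less
        compute, and pinned randomness for more deterministic runs."""
        return replace(
            prod,
            model_layers=min(prod.model_layers, 2),
            hidden_dim=min(prod.hidden_dim, 64),
            max_train_iterations=min(prod.max_train_iterations, 10),
            use_gpu=False,  # CPU-only, sparing scarce GPU capacity
            seed=42,        # fixed seed reduces noise between A/B runs
        )

    prod = TrainingConfig(model_layers=48, hidden_dim=4096,
                          max_train_iterations=100_000, use_gpu=True)
    print(auto_shrink(prod))

Keeping production code paths intact and shrinking only knobs like iteration count and model size is what lets a sub-ten-minute CPU test stand in for an expensive GPU job.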

AI Lab detecting a TTFB regression.

Let’s look at a real-world example of how you can leverage tools like AI Lab to reduce TTFB.

Reducing TTFB with Python Cinder Runtime and AI Lab

Meta's open source Python Cinder Runtime delivered up to a 40% improvement in TTFB thanks to aggressive lazy imports. Here we see the true utility of a framework like AI Lab and how it was used to facilitate this foundational change.
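
Cinder implements lazy imports at the runtime level. As a rough stand-in for the concept using only the standard library (this is not how Cinder works internally), Python's documented importlib.util.LazyLoader recipe defers a module's execution until its first attribute access:

    import importlib.util
    import sys

    def lazy_import(name: str):
        """Register a module whose code only runs on first attribute access."""
        spec = importlib.util.find_spec(name)
        loader = importlib.util.LazyLoader(spec.loader)
        spec.loader = loader
        module = importlib.util.module_from_spec(spec)
        sys.modules[name] = module
        loader.exec_module(module)  # defers execution until first use
        return module

    json = lazy_import("json")      # near-zero cost at import time
    print(json.dumps({"ttfb": 1}))  # the module actually loads here

Because imports run before any training batch, deferring unused ones cuts directly into TTFB.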

Offensively

Instead of experimenting on a real ML engineer's workflow, where validating a performance hypothesis can take days or weeks, AI Lab lets developers accurately test and measure the impact of a proposed Cinder version on TTFB across a comprehensive, representative set of ML scenarios in under an hour.

In practice, the developers turned this into an iterative loop to test further optimizations and tune Cinder, resulting in a 2x increase over the initial TTFB improvements. For example, in an early profile of a Cinder-enabled workflow, engineers found that up to 10% of execution time was spent on a function that simply performed logging. Due to the way memoization was used, repr() was being evaluated on the underlying data structure, which happens to be huge in typical ML scenarios. Instead, they created an object wrapper around this underlying data structure and memoized based on its object ID instead.
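
A minimal sketch of that pattern (the names are hypothetical, not the actual fix): rather than memoizing on the huge structure itself, which forces expensive hashing and repr() work, wrap it and key the cache on object identity:

    import functools

    class IdWrapper:
        """Wraps a large object; hashes and compares by identity."""
        __slots__ = ("obj",)
        def __init__(self, obj):
            self.obj = obj
        def __hash__(self):
            return id(self.obj)
        def __eq__(self, other):
            return isinstance(other, IdWrapper) and self.obj is other.obj

    @functools.lru_cache(maxsize=None)
    def describe(wrapped: IdWrapper) -> str:
        # Expensive work (e.g., walking a massive structure) now runs
        # once per underlying object instead of on every logging call.
        return f"<dataset with {len(wrapped.obj):,} rows>"

    big = list(range(1_000_000))
    w = IdWrapper(big)
    print(describe(w))  # computed once
    print(describe(w))  # cache hit, keyed by object identity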

AI Lab validated the improvements, allowing the team to move forward with rolling out the changes.

Defensively

Around the time the Cinder rollout began, we happened to catch a regression that was completely unrelated to the rollout. In this new regression, an engineer added some logging that they believed ran asynchronously. What they didn't know was that the call was blocking, because one of the nested clients was synchronous. Through our Incident Tracker, the regression was automatically attributed to the specific change, and the change's author was notified immediately and reverted the change before the release reached production.
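
As a toy reproduction of that failure mode (not the actual code), an apparently asynchronous call can silently stall everything when a synchronous client hides inside it:

    import asyncio
    import time

    def nested_client_call():
        """A synchronous client buried deep in the logging path."""
        time.sleep(2)  # stand-in for blocking network I/O

    async def log_event(msg: str):
        # Looks async from the call site, but invokes a synchronous
        # nested client directly, stalling the event loop throughout.
        nested_client_call()

    async def main():
        start = time.monotonic()
        await log_event("job submitted")
        print(f"blocked {time.monotonic() - start:.1f}s before the first batch")

    asyncio.run(main())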

Thanks to AI Lab, engineers working on Cinder didn't have to worry about a TTFB regression sneaking into a release they rolled out, avoiding a potential rollback.

AI Lab root-causing a TTFB regression to a specific change.

How we deliver prevention at Meta's scale

While we want to provide accurate TTFB signals as early as possible in the development cycle, it isn't feasible to benchmark every ML scenario for every change made by every engineer at Meta. Instead, similar to Predictive Test Selection, we work within a limited capacity budget and aim to find as many regressions and improvements as possible, as early in the development cycle as possible. In practice, this means:

AI Lab is integrated at various stages of pre-production.

  1. O(Code changes): We run relevant, effective, and computationally efficient (often CPU-only) AI Lab tests on potential changes before they are reviewed.
  2. O(Releases): We run more comprehensive AI Lab tests before each release, using a bisect-like attribution process to find the root cause of any regression (see the sketch below).
    1. This attribution method is highly effective and efficient, and serves as a good fallback when more computationally intensive tests are needed to track down a specific regression.
High-level end-to-end flow of AI Lab.
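
The bisect-like attribution mentioned above can be sketched as a binary search over changes in landing order (a simplification with hypothetical names; each probe in reality is a shrunk AI Lab test run):

    def first_bad_change(changes, is_regressed):
        """Find the culprit in O(log n) test runs, assuming some prefix of
        changes is healthy and everything from the culprit on regresses."""
        lo, hi = 0, len(changes) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if is_regressed(changes[mid]):  # one AI Lab test at this point
                hi = mid                    # culprit is mid or earlier
            else:
                lo = mid + 1                # culprit is after mid
        return changes[lo]

    changes = list(range(1, 101))           # hypothetical change IDs
    culprit = 73
    print(first_bad_change(changes, lambda c: c >= culprit))  # -> 73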

When a statistically significant change is detected via a t-test, we perform additional checks before marking it as a regression/improvement (a sketch of this logic follows the list):

  1. Perform confirmation runs to ensure that the expected regressions/improvements can be reliably reproduced.
  2. Validate that the magnitude of the regression/improvement exceeds a dynamic threshold based on the test's standard deviation and a tuned receiver operating characteristic (ROC) baseline. For example, a partner may request fewer than one false positive per week, which sets a threshold at which tests find as many true positives as possible while staying below that rate.
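
Here is an illustrative sketch of that decision logic, assuming SciPy is available (alpha and k_sigma are hypothetical knobs; in AI Lab the threshold is tuned per partner via ROC analysis, and confirmation runs happen before anything is flagged):

    from statistics import mean, stdev
    from scipy.stats import ttest_ind

    def is_ttfb_regression(control, test, alpha=0.01, k_sigma=3.0):
        """Flag a regression only if the shift is statistically
        significant AND exceeds a dynamic, noise-based threshold."""
        _, p_value = ttest_ind(control, test, equal_var=False)
        if p_value >= alpha:
            return False  # not statistically significant
        threshold = k_sigma * stdev(control)  # scales with test noise
        return mean(test) - mean(control) > threshold

    control = [100.0, 101.5, 99.8, 100.7, 100.2]   # baseline TTFB (s)
    test    = [118.9, 120.3, 119.5, 121.0, 119.2]  # post-change TTFB (s)
    print(is_ttfb_regression(control, test))       # -> True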

A call for industry collaboration

While AI Lab is an internal-only tool at Meta, we welcome input from community members who operate similar platforms. Generating synthetic signals like these ahead of release is a win-win for developers and users alike: when developers can evaluate hypotheses faster and users experience fewer regressions, AI innovation accelerates across the industry. We look forward to working with the industry to improve tools like AI Lab and to explore ways to further optimize metrics like TTFB.

Acknowledgements

AI Lab builds upon the foundation laid by Mobile Lab. We are also working to scale beyond TTFB to broader AI efficiency metrics, similar to Service Lab. We'd like to thank the members of the AI Training Orchestration team who helped us build AI Lab, as well as all of the users who have used the product to help improve TTFB.




