Machine learning at scale: Managing multiple models in production

Do you want to know how real machine learning products are actually implemented at big technology companies? If so, this article is for you 🙂

Before we discuss scalability, be sure to read my first article on the basics of machine learning in production.

In my last article, I shared that I have been working in the industry as an AI engineer for 10 years. Early in my career, I learned that models written in notebooks are just mathematical hypotheses: a model only becomes useful when it reaches users, powers a product, or generates money.

We have already seen what “machine learning in production” looks like in a single project. Today’s topic, however, is scale: managing dozens or even hundreds of ML projects simultaneously. In recent years, we have moved from the sandbox era to the infrastructure era. “Deploying a model” is now a non-negotiable baseline skill; the real challenge is making sure that a vast portfolio of models works reliably and securely.


1. Breaking out of the sandbox: availability strategy

To understand ML at scale, we first need to let go of the “sandbox” mentality. In the sandbox you have static data and one model: if it drifts, you look at it, stop it, and fix it.

However, when you move to scale, you are no longer managing a model, you are managing a portfolio. This is where the CAP theorem (consistency, availability, and partition tolerance) becomes a daily reality. With a single model you can balance the tradeoffs, but at scale it is impossible to perfect all three properties. You have to choose your battles, and availability is often your top priority.

Why? Because when you run 100 models, something is always breaking. If you stopped the service every time a model drifted, your product would be offline half the time.

Since you can’t stop the service, design your models to fail “cleanly”. Consider a recommendation system: even if the model receives corrupted data, it should not crash or show a “404 error”. Instead, it should revert to a safe default (such as showing the “Top 10 Most Popular” items). Users stay satisfied and the system stays available, even if the results are not optimal. But to do this, you need to know when to trigger that fallback, and that brings us to our biggest challenge at scale: monitoring.
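
Before moving on, here is what that “fail cleanly” pattern might look like. A minimal sketch in Python, assuming a hypothetical `feature_store`, a `model` whose `predict` returns a dict of item scores, and a precomputed fallback list (none of these names come from a specific library):

```python
import logging

logger = logging.getLogger("recommender")

# Safe default: a precomputed "Top 10 Most Popular" list, refreshed
# by a separate batch job (illustrative values).
FALLBACK_RECOMMENDATIONS = ["item_42", "item_7", "item_19"]

def recommend(user_id, model, feature_store):
    """Return recommendations, degrading gracefully instead of crashing."""
    try:
        features = feature_store.get(user_id)   # may raise on corrupted data
        scores = model.predict(features)        # assumed: dict of item -> score
        if not scores:                          # empty output counts as a failure
            raise ValueError("model returned no scores")
        return sorted(scores, key=scores.get, reverse=True)[:10]
    except Exception:
        # Never surface a 500/404 to the user: log the failure for the
        # monitoring system, then serve the safe default.
        logger.exception("model failed for user %s; serving fallback", user_id)
        return FALLBACK_RECOMMENDATIONS

```

The design choice that matters here is that every failure mode, including silently empty output, lands on the same safe path, and every fallback is logged so someone can later ask why it fired.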


2. Monitoring challenges, and why traditional metrics break down at scale

When I say it’s important to fail “cleanly” in a large system, you might think this is easy: just monitor accuracy. At scale, however, “accuracy” alone is not enough. Let me explain exactly why.

  • Lack of human consensus: Computer vision is comparatively easy to monitor because humans agree on the ground truth (is it a dog or not?). There is no such “gold standard” for recommendation systems or ad-ranking models. If the user doesn’t click, is the model bad, or is the user just not in the mood?
  • The feature engineering trap: Because “truth” cannot be measured on a simple scale, we overcompensate. We add hundreds of features to the model in the hope that “more data” will resolve the uncertainty.
  • The theoretical ceiling: You fight for another 0.1% of accuracy without knowing whether your data is too noisy to give you any more. You are chasing an invisible “ceiling”.

So let’s link all of this together. Monitoring the “truth” at scale is nearly impossible (a blind spot), so you can’t rely on simple alerts telling you when to stop. This is exactly why we prioritize availability and safe fallbacks: we assume a model may be failing even when the metrics don’t tell us, and we build a system that can tolerate that “ambiguous” failure.
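
When the “truth” is missing or delayed, one common proxy is to watch the shape of the model’s output instead of its accuracy. Here is a minimal sketch using the population stability index (PSI), a standard drift score; this is my illustration, not a method prescribed by any particular stack, and the data, names, and thresholds are assumptions:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Drift score between the score distribution seen at training time
    (`expected`) and the live score distribution (`actual`).

    Common rule of thumb (varies by team): < 0.1 stable,
    0.1-0.25 worth a look, > 0.25 likely real drift.
    """
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))
    cuts[0], cuts[-1] = -np.inf, np.inf              # cover the full range
    e = np.histogram(expected, cuts)[0] / len(expected)
    a = np.histogram(actual, cuts)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

# Usage: trigger the safe fallback on drift, not on (unavailable) accuracy.
baseline = np.random.normal(0.5, 0.10, 10_000)   # scores at training time
live = np.random.normal(0.6, 0.15, 10_000)       # scores in production
if population_stability_index(baseline, live) > 0.25:
    print("drift detected: consider switching to the safe fallback")
```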


3. What about the engineering wall?

We’ve covered the strategy and the monitoring challenges, but we haven’t addressed the infrastructure side, so we are not ready to scale yet. Scaling requires engineering skills as much as data science skills.

You can’t talk about scaling without a robust and secure infrastructure. Because availability is our top priority, we need to think seriously about the architecture we set up.

My honest advice at this stage is to surround yourself with a team of people who are used to building infrastructure at scale. You don’t necessarily need a large cluster or a supercomputer, but you should consider these three infrastructure basics:

  • Cloud vs. your own machines: Cloud servers provide power and are easy to monitor, but they are expensive. The choice is entirely a matter of cost and control.
  • Hardware: Not every model can live on a GPU, or you will go bankrupt. You need a tiering strategy: run simple “fallback” models on cheap CPUs and reserve the expensive GPUs for the heavy “money-making” models.
  • Optimization: At scale, a one-second delay in the fallback mechanism is itself a failure. You are no longer just writing Python; you have to learn how to compile and optimize your code for a specific chip so that the “fail cleanly” switch happens within milliseconds (see the sketch after this list).
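
As promised, a minimal sketch of the tiering and millisecond-fallback idea: the expensive model gets a strict deadline, and if it misses it, a cheap CPU model answers instead. `gpu_model`, `cpu_model`, and the 50 ms deadline are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

executor = ThreadPoolExecutor(max_workers=8)

def predict_with_deadline(request, gpu_model, cpu_model, deadline_ms=50):
    """Try the expensive tier under a strict deadline; degrade to the cheap tier."""
    future = executor.submit(gpu_model.predict, request)
    try:
        # The heavy "money-making" model gets `deadline_ms` to answer.
        return future.result(timeout=deadline_ms / 1000)
    except FutureTimeout:
        future.cancel()  # best effort; the worker may still be running
        # The "fail cleanly" switch: a small CPU model that always answers fast.
        return cpu_model.predict(request)
```

The key design choice is that the fallback path never shares the fate of the primary path: it lives on separate, cheap hardware that is always available.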

4. Be careful of label leakage

So far, we have planned for failure, addressed availability, organized monitoring, and built the infrastructure. You’re probably thinking you’re finally ready to master scale. Actually, not yet: if you’ve never worked in a real production environment, there are some issues you simply cannot predict.

Even if the engineering is perfect, a single leaked label can ruin the strategy of a whole system running multiple models.

In a single project, you might catch a leak while inspecting a notebook. But at scale, where data comes from 50 different pipelines, leaks become almost invisible.

A churn example: imagine predicting which users will cancel their subscriptions. The training data has a feature called Last_Login_Date, and the model looks perfect, with an F1 score of 99%.

But here’s what actually happened: the database team set up a trigger that clears the login-date field the moment the user presses the “cancel” button. The model sees a NULL login date and concludes, “Oh! Canceled!”

In the real world, the model has to make its prediction milliseconds before the user cancels, and at that moment the field is not yet NULL. During training, the model was seeing answers from the future.

This is a basic example to illustrate the concept. But believe me: in a complex system with real-time predictions (as often happens in IoT), this is very difficult to detect. You can only avoid it if you are aware of the problem from the beginning.

My tips:

  • Monitor feature delay: Don’t just monitor a feature’s value; monitor when it is written relative to when the event actually occurred.
  • The millisecond test: Always ask, “Does this particular database row actually still contain this value at the exact moment of the prediction?” (a sketch of this check follows).
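
Here is one way that “millisecond test” might be automated during training-set construction. A minimal sketch, assuming the (hypothetical) pipeline logs a `prediction_ts` per row and one `<feature>_written_ts` column per feature; the column-naming scheme is my invention for illustration:

```python
import pandas as pd

def assert_point_in_time(df: pd.DataFrame) -> None:
    """Fail the training pipeline if any feature was written *after* the
    moment the prediction would have been made.

    Assumes each row logs `prediction_ts` (when the model would have been
    called) plus one `<feature>_written_ts` column per feature.
    """
    for col in [c for c in df.columns if c.endswith("_written_ts")]:
        leaked = df[col] > df["prediction_ts"]
        if leaked.any():
            raise ValueError(
                f"{col}: {int(leaked.sum())} rows were written after "
                "prediction time; this feature sees the future (leakage)."
            )
```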

These are simple questions, of course, but the best time to ask them is during the design stage, before you write a single line of production code.

5. Finally, the human in the loop

The last piece of the puzzle is Accountability. At scale, metrics are fuzzy, infrastructure is complex, and data is vulnerable to leaks, so you need a “safety net.”

  • Shadow deployment: This is essential for scaling. You deploy “Model B” but never show its results to users. Let it run “in the shadows” for a week and compare its predictions with the “truth” that eventually arrives. Only if it is stable does it get promoted to live (a minimal sketch follows this list).
  • Human in the loop: A high-stakes model needs a small team auditing the “safe defaults”. If your system has been reverting to “Most Popular Items” for three days, a human needs to ask why the main model has not recovered.
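
A minimal sketch of that shadow pattern, assuming hypothetical `live_model` and `shadow_model` objects with a `predict` method:

```python
import logging
import threading

logger = logging.getLogger("shadow")

def serve(request, live_model, shadow_model):
    """Serve the live model; run the candidate silently alongside it."""
    response = live_model.predict(request)   # users only ever see this

    def log_shadow():
        try:
            shadow_pred = shadow_model.predict(request)
            # Logged for offline comparison once the delayed "truth" arrives.
            logger.info("shadow pred=%s live=%s", shadow_pred, response)
        except Exception:
            logger.exception("shadow model failed (users unaffected)")

    # Fire-and-forget so the shadow path can never slow down the live path.
    threading.Thread(target=log_shadow, daemon=True).start()
    return response
```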

Before you start using ML at scale, here’s a quick summary.

  • You will never be perfect, so choose to stay online (available) and fail safely.
  • Availability is our number one metric because large-scale monitoring is “fuzzy” and traditional metrics are unreliable.
  • We build the infrastructure (cloud/hardware) to make these safe failures fast.
  • We stay on the lookout for “bad” data (leaks) that can make those fuzzy metrics look better than they really are.
  • We use shadow deployment to prove that a model is stable before it ever touches a customer.

And remember, at scale the safety net is just as important as the model itself. Don’t let your work become part of the 87% of failed projects.


👉 LinkedIn: Sabrine Bendimerad

👉 Medium: https://medium.com/@sabrine.bendimerad1

👉 Instagram: https://tinyurl.com/datailearn


