Deploying a new machine learning model into production is one of the most important stages in the ML lifecycle. Even if a model performs well on validation or test datasets, directly replacing an existing production model can be risky. Offline assessments rarely capture the full complexity of real-world environments. The distribution of data may change, user behavior may change, and system constraints in a production environment may differ from system constraints in a controlled experiment.
As a result, a model that looked great during development may have poor performance or a negative impact on the user experience after deployment. To mitigate these risks, ML teams employ controlled rollout strategies that allow new models to be evaluated under real-world production conditions while minimizing potential disruption.
This article describes four widely used strategies to help organizations securely deploy and validate new machine learning models in production: A/B testing, canary testing, interleaved testing, and shadow testing.


A/B testing
A/B testing is one of the most widely used strategies for safely introducing new machine learning models into production. In this approach, incoming traffic is split between two versions of the system: the legacy model (the control) and the candidate model (the treatment). The split is typically uneven to limit risk. For example, 90% of requests continue to be handled by the legacy model, while only 10% are routed to the candidate model.
By exposing both models to real-world traffic, teams can compare downstream performance metrics such as click-through rate, conversions, engagement, and revenue. This controlled experiment lets organizations evaluate whether the candidate model truly improves results before gradually increasing its traffic share or replacing the legacy model completely.
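Once scores or engagement metrics have been collected from both arms, the question is whether the candidate's lift is real or noise. A minimal sketch of such a check, using Welch's t statistic computed by hand with the standard library (the `welch_t` helper and the sample sizes are illustrative, not from the article):

```python
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples (b minus a)."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (mb - ma) / ((va / len(a) + vb / len(b)) ** 0.5)

# Hypothetical per-request engagement scores from an A/B test:
# a large legacy arm and a small (10%) candidate arm.
legacy_scores = [0.12, 0.18, 0.15, 0.20, 0.11, 0.17, 0.14, 0.19]
candidate_scores = [0.25, 0.31, 0.28, 0.22]

t = welch_t(legacy_scores, candidate_scores)
print(f"candidate lift: "
      f"{statistics.mean(candidate_scores) - statistics.mean(legacy_scores):.3f}")
print(f"Welch t statistic: {t:.2f}")
```

A large positive t suggests the candidate's advantage is unlikely to be chance; in practice a proper test (e.g. `scipy.stats.ttest_ind` with `equal_var=False`) would also report a p-value.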


Canary testing
Canary testing is a controlled rollout strategy in which the new model is first released to a small number of users and then gradually rolled out to the entire user base. The name comes from an old mining practice: miners carried canaries into coal mines to detect toxic gases, and the birds reacted first, warning the miners of danger. Similarly, in a machine learning deployment, the candidate model is initially exposed to a limited group of users while the majority continue to be served by the legacy model.
Unlike A/B testing, which randomly splits traffic across all users, canary testing targets a specific subset and incrementally increases exposure if performance metrics indicate success. This gradual rollout allows teams to detect issues early and quickly roll back if needed, reducing the risk of widespread impact.
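The "roll back if needed" decision is usually automated as a metric gate evaluated at each rollout phase. A minimal sketch, assuming error rate is the guarding metric (the `canary_gate` helper and the 2% tolerance are illustrative assumptions, not from the article):

```python
def canary_gate(candidate_error_rate, legacy_error_rate, tolerance=0.02):
    """Promote the canary only if its error rate is not meaningfully
    worse than the legacy model's; otherwise roll back."""
    if candidate_error_rate <= legacy_error_rate + tolerance:
        return "promote"   # expand the canary to the next traffic phase
    return "rollback"      # route canary users back to the legacy model

print(canary_gate(0.031, 0.030))  # within tolerance -> promote
print(canary_gate(0.060, 0.030))  # clear degradation -> rollback
```

Real systems typically gate on several metrics at once (latency, error rate, engagement) and require each phase to pass before exposure increases.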


Interleaved testing
Interleaved testing evaluates multiple models by mixing their outputs within the same response shown to the user. Rather than routing an entire request to either the legacy or the candidate model, the system combines predictions from both models in real time. For example, in a recommendation system, some items in a recommendation list come from the legacy model while the others come from the candidate model.
The system then logs downstream engagement signals such as click-through rate, watch time, and negative feedback for each recommendation. Because both models are evaluated within the same user interaction, interleaved testing allows teams to more directly and efficiently compare performance while minimizing bias introduced by different user groups or traffic distributions.
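Because every click is tagged with its source model, credit assignment can be done per session: whichever model's items drew more clicks "wins" that session, and the fraction of sessions won is the preference score. A small sketch of this idea (the session data and `session_winner` helper are hypothetical):

```python
# Each session's clicks, tagged with the model that produced the clicked item.
sessions = [
    ["candidate", "candidate"],            # session 1
    ["legacy"],                            # session 2
    ["candidate", "legacy", "candidate"],  # session 3
]

def session_winner(clicks):
    """Which model drew more clicks in this session."""
    c, l = clicks.count("candidate"), clicks.count("legacy")
    return "candidate" if c > l else "legacy" if l > c else "tie"

wins = [session_winner(s) for s in sessions]
pref = wins.count("candidate") / len(wins)
print(wins, f"candidate preference: {pref:.2f}")
```

A preference score consistently above 0.5 across many sessions indicates users favor the candidate's items even when both models serve the same person at the same moment.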


shadow test
shadow testalso known as Shadow deployment or dark launchallows teams to evaluate new machine learning models in a real production environment without impacting the user experience. In this approach, Candidate model is executed in parallel with legacy model It receives the same live requests as the production system. However, only the legacy model’s predictions are returned to the user, and the candidate model’s output is only recorded for analysis.
This setup helps teams evaluate how new models perform under real-world traffic and infrastructure conditions that are difficult to reproduce in offline experiments. Shadow testing provides a low-risk way to benchmark candidate models against traditional models, but its predictions are not visible to users, so you cannot capture true user engagement metrics such as clicks, watch time, or conversions.
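The pattern boils down to a request handler that serves one model and silently logs the other, along with operational data such as latency. A minimal sketch, assuming `serve_legacy` and `score_candidate` stand in for real model calls (all names here are illustrative):

```python
import time

def serve_legacy(request):
    return {"score": 0.3}   # stand-in for the production model call

def score_candidate(request):
    return {"score": 0.5}   # stand-in for the shadow model call

shadow_log = []

def handle(request):
    start = time.perf_counter()
    live = serve_legacy(request)        # returned to the user
    shadow = score_candidate(request)   # computed but never served
    shadow_log.append({
        "request": request,
        "legacy_score": live["score"],
        "candidate_score": shadow["score"],
        "latency_s": time.perf_counter() - start,
    })
    return live                         # the user only ever sees this

for i in range(5):
    handle(f"req_{i}")
print(f"logged {len(shadow_log)} shadow records")
```

In production the shadow call would usually run asynchronously so that a slow candidate cannot add latency to the live path; the log would go to a warehouse rather than an in-memory list.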


Simulating ML model deployment strategies
Setup
Before simulating a strategy, we need two things: a way to represent incoming requests and a stand-in for each model.
Each model is simply a function that takes a request and returns a score: a number that roughly represents how good the model's recommendations are. The legacy model's score is capped at 0.35, while the candidate model's is capped at 0.55. Because the candidate is intentionally better, we can verify that each strategy actually detects the improvement.
make_requests() generates 200 requests across 40 users. This allows us to keep our simulations lightweight while still providing enough traffic to see meaningful differences between strategies.
import random
import hashlib

random.seed(42)

def legacy_model(request):
    return {"model": "legacy", "score": random.random() * 0.35}

def candidate_model(request):
    return {"model": "candidate", "score": random.random() * 0.55}

def make_requests(n=200):
    users = [f"user_{i}" for i in range(40)]
    return [{"id": f"req_{i}", "user": random.choice(users)} for i in range(n)]

requests = make_requests()
A/B testing
ab_route() is the core of this strategy. It draws a random number for each incoming request and routes the request to the candidate model only if that number is less than 0.10; otherwise, the request goes to the legacy model. This gives the candidate about 10% of the traffic.
Next, we collect prediction scores for each model separately and compute the averages. In a real system, these scores would be replaced with actual engagement metrics such as click-through rate or watch time. The score here simply represents "how good this recommendation was."
print("── 1. A/B Testing ──────────────────────────────────────────")
CANDIDATE_TRAFFIC = 0.10  # 10% of requests go to candidate

def ab_route(request):
    return candidate_model if random.random() < CANDIDATE_TRAFFIC else legacy_model

results = {"legacy": [], "candidate": []}
for req in requests:
    model = ab_route(req)
    pred = model(req)
    results[pred["model"]].append(pred["score"])

for name, scores in results.items():
    print(f"  {name:12s} | requests: {len(scores):3d} | avg score: {sum(scores)/len(scores):.3f}")
Canary testing
The key function here is get_canary_users(), which uses an MD5 hash to deterministically assign users to the canary group. The determinism is what matters: sorting users by hash means the same users land in the canary group across multiple runs. This reflects how a real canary deployment works, where a given user consistently sees the same model.
Next, we simulate three phases by simply increasing the fraction of canary users (5%, 20%, and 50%). Each request's routing is determined by whether its user is in the canary group, not by a random coin flip as in A/B testing. This is the fundamental difference between the two strategies: A/B tests split by request, while canary tests split by user.
print("\n── 2. Canary Testing ───────────────────────────────────────")

def get_canary_users(all_users, fraction):
    """Deterministic user assignment via hash -- stable across restarts."""
    n = max(1, int(len(all_users) * fraction))
    ranked = sorted(all_users, key=lambda u: hashlib.md5(u.encode()).hexdigest())
    return set(ranked[:n])

all_users = list(set(r["user"] for r in requests))

for phase, fraction in [("Phase 1 (5%)", 0.05), ("Phase 2 (20%)", 0.20), ("Phase 3 (50%)", 0.50)]:
    canary_users = get_canary_users(all_users, fraction)
    scores = {"legacy": [], "candidate": []}
    for req in requests:
        model = candidate_model if req["user"] in canary_users else legacy_model
        pred = model(req)
        scores[pred["model"]].append(pred["score"])
    print(f"  {phase} | canary users: {len(canary_users):2d} "
          f"| legacy avg: {sum(scores['legacy'])/max(1,len(scores['legacy'])):.3f} "
          f"| candidate avg: {sum(scores['candidate'])/max(1,len(scores['candidate'])):.3f}")
Interleaved testing
Both models run on every request, and interleave() merges their outputs by alternating items (one from legacy, one from candidate, and so on). Each item is tagged with its source model, so when a user clicks on something, we know exactly which model to credit.
A small random.uniform(-0.05, 0.05) noise term added to each item's score simulates the natural variation seen in real recommendations: no two items from the same model have exactly the same quality.
Finally, we compute the CTR separately for each model's items. Since both models competed for the same user on the same request at the same time, there are no confounding factors, and any difference in CTR is due purely to model quality. This is why interleaved testing provides the most statistically clean comparison of the four strategies.
print("\n── 3. Interleaved Testing ──────────────────────────────────")

def interleave(pred_a, pred_b):
    """Alternate items: A, B, A, B ... tagged with their source model."""
    items_a = [("legacy", pred_a["score"] + random.uniform(-0.05, 0.05)) for _ in range(3)]
    items_b = [("candidate", pred_b["score"] + random.uniform(-0.05, 0.05)) for _ in range(3)]
    merged = []
    for a, b in zip(items_a, items_b):
        merged += [a, b]
    return merged

clicks = {"legacy": 0, "candidate": 0}
shown = {"legacy": 0, "candidate": 0}
for req in requests:
    pred_l = legacy_model(req)
    pred_c = candidate_model(req)
    for source, score in interleave(pred_l, pred_c):
        shown[source] += 1
        clicks[source] += int(random.random() < score)  # click ~ score

for name in ["legacy", "candidate"]:
    print(f"  {name:12s} | impressions: {shown[name]:4d} "
          f"| clicks: {clicks[name]:3d} "
          f"| CTR: {clicks[name]/shown[name]:.3f}")


Shadow testing
Both models run on every request, but the loop draws a clear distinction: live_pred is what the user gets, while shadow_pred is only logged. Candidate output is never returned, displayed, or acted on. The log list is the crux of shadow testing. In a real system it would be written to a database or data warehouse, and engineers would later query it to compare the candidate's latency distribution, output patterns, or score distribution against the legacy model's. All of this happens without impacting a single user.
print("\n── 4. Shadow Testing ───────────────────────────────────────")
log = []  # candidate's shadow log
for req in requests:
    # What the user sees
    live_pred = legacy_model(req)
    # Shadow run -- never shown to user
    shadow_pred = candidate_model(req)
    log.append({
        "request_id": req["id"],
        "legacy_score": live_pred["score"],
        "candidate_score": shadow_pred["score"],  # logged, not served
    })

avg_legacy = sum(r["legacy_score"] for r in log) / len(log)
avg_candidate = sum(r["candidate_score"] for r in log) / len(log)
print(f"  Legacy avg score (served):    {avg_legacy:.3f}")
print(f"  Candidate avg score (logged): {avg_candidate:.3f}")
print(f"  Note: candidate score has no click validation -- shadow only.")



I am a Civil Engineering graduate from Jamia Millia Islamia, New Delhi (2022) and have a strong interest in data science, especially neural networks and their applications in various fields.
