Canary Deploying a Relevance Model Update
A ranking model that wins offline can still lose in production: an NDCG gain on a frozen judgment set says nothing about live click-through, latency tails, or the rare-query distribution the labels never covered. Shipping it to 100% of traffic on faith risks a relevance regression that no unit test catches and that users feel as “search got worse.” This guide canary-deploys a relevance model — shadow first, then a small live percentage gated by guardrail metrics and automated rollback — as a routine part of the observability and SRE practice for search and the wider discipline of search engine selection and architecture. The objective is to detect a regression on a 1% slice before it reaches everyone.
Prerequisites
- A ranking layer that can load two model versions concurrently and route per-request (a feature-store or model-server abstraction, not a hardcoded scorer).
- Query-level logging of impressions, clicks, and result-set fingerprints, with a stable request_id.
- A guardrail metric pipeline emitting CTR, zero-result rate, latency percentiles, and an online relevance proxy.
- A traffic-splitting mechanism keyed on a hashed, sticky identity (user or session) so a user never flickers between models mid-session.
Diagnosis: why offline wins do not guarantee online wins
A learning-to-rank model is trained against historical judgments, but production traffic drifts — new products, seasonal queries, a different head/tail mix. Offline NDCG is computed on the queries you happened to label; the model can regress badly on the unlabeled tail while improving the head, and the aggregate metric hides it. Worse, a model can win on relevance and lose on latency, pushing the p99 over the budget and degrading every query, ranked well or not.
The failure you are guarding against looks like this in the guardrail log — relevance nudges up while a guardrail quietly breaks:
model=ltr-v7 ctr=0.182 zero_result_rate=0.031 p99_ms=141 ndcg_online=0.612
model=ltr-v8 ctr=0.179 zero_result_rate=0.067 p99_ms=388 ndcg_online=0.628
# zero_result_rate 0.031 -> 0.067: more than 2x worse
# p99_ms 141 -> 388: over the 250 ms budget
ltr-v8 improved online NDCG yet doubled the zero-result rate and blew the latency budget — a net loss that a single relevance number would have green-lit. Canarying exists to catch exactly this trade.
Solution Steps
1. Shadow the new model first — no user impact
Before any user sees v8, run it in shadow: score the same live queries with both models, log both result sets, and serve only v7. This exposes latency regressions and crashes against the real query distribution without changing what any user sees; the only cost is the extra scoring load.
# ranking_handler.py: shadow stage
# stable_model, canary_model, executor, metrics, kendall_tau and log_interleave
# are assumed to be provided by the surrounding service.
import time

def rank(query, ctx):
    stable = stable_model.score(query, ctx)  # served to the user
    # Fire-and-forget; never block the response or surface canary errors.
    executor.submit(_shadow_eval, query, ctx, stable)
    return stable

def _shadow_eval(query, ctx, stable):
    t0 = time.monotonic()
    try:
        canary = canary_model.score(query, ctx)
    except Exception:
        metrics.incr("canary.error", tags={"model": "ltr-v8"})
        return
    metrics.timing("canary.latency_ms", (time.monotonic() - t0) * 1000)
    # Rank correlation between served and shadow rankings; 1.0 == identical
    log_interleave(query, stable, canary, kendall_tau(stable, canary))
Decision gate: only proceed past shadow when canary p99 sits inside the latency budget and the error rate is zero across a full traffic cycle (typically 24h to cover daily seasonality).
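The decision gate can be expressed as a small check over the shadow-stage measurements. The following is a minimal sketch, assuming the canary latencies for the traffic cycle are available as a list of milliseconds and errors as a count; the function names and the 250 ms default (taken from the latency budget used later in this guide) are illustrative, not part of any existing pipeline.

```python
def p99(samples_ms):
    """Nearest-rank 99th percentile of a list of latency samples."""
    if not samples_ms:
        raise ValueError("no latency samples collected")
    ordered = sorted(samples_ms)
    # ceil(0.99 * n) computed with integers to avoid float rounding
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def shadow_gate_passes(latencies_ms, error_count, budget_ms=250.0):
    """True only if the canary had zero errors and p99 is inside the budget."""
    return error_count == 0 and p99(latencies_ms) <= budget_ms


# Hypothetical samples: 98 fast requests and a slow 2% tail.
samples = [100.0] * 98 + [400.0, 400.0]
print(p99(samples))                     # 400.0
print(shadow_gate_passes(samples, 0))   # False: tail is over the 250 ms budget
```

In practice the percentile would come from the metrics backend rather than raw samples; the point of the sketch is that both conditions must hold over the whole cycle, not just on average.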
2. Interleave to compare relevance without a population split
Interleaving (team-draft) is more sensitive than an A/B test because both models compete within the same result list for the same user, removing population variance. Each click is attributed to the model that contributed the clicked document.
# team_draft_interleave.py
import random

def team_draft(list_a, list_b, k=10):
    """Blend two rankings; track which model placed each doc."""
    out, credit, used = [], {}, set()
    lists = {"a": list_a, "b": list_b}
    pos = {"a": 0, "b": 0}    # next unread index per ranking
    picks = {"a": 0, "b": 0}  # docs contributed so far per model
    while len(out) < k:
        # The model with fewer picks goes next; a coin flip breaks ties.
        if picks["a"] != picks["b"]:
            src = "a" if picks["a"] < picks["b"] else "b"
        else:
            src = random.choice("ab")
        lst, idx = lists[src], pos[src]
        while idx < len(lst) and lst[idx] in used:
            idx += 1  # skip docs the other model already placed
        if idx >= len(lst):
            break  # this ranking is exhausted
        doc = lst[idx]
        out.append(doc)
        used.add(doc)
        credit[doc] = src
        pos[src] = idx + 1
        picks[src] += 1
    return out, credit  # credit[clicked_doc] tells you which model won that click
A reliable preference signal (model B wins clicks with p < 0.05 over a few thousand impressions) is the green light to put real traffic on the canary.
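One way to turn click credit into the p-value mentioned above is an exact sign test. The sketch below assumes each interleaved impression has already been reduced to a winner (the model credited with more clicks), with ties dropped; the function name and the counting scheme are assumptions for illustration, not a prescribed analysis.

```python
from math import comb


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided exact binomial test of wins_a vs wins_b against p = 0.5."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no decided impressions, no evidence either way
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided under a fair coin
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: model B wins 1100 of 2000 decided impressions.
print(sign_test_p_value(900, 1100) < 0.05)  # True
```

The test only says the preference is unlikely to be noise; the size of the win and the guardrails still decide whether the canary should take live traffic.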
3. Split a small live percentage on a sticky hash
Route 1% of users to the canary using a hash of a stable identity, so the same user always lands in the same arm — flickering models mid-session corrupts the guardrail comparison and confuses users.
# traffic_split.py
import hashlib

CANARY_PCT = 1  # start at 1, ramp 1 -> 5 -> 25 -> 50 -> 100 only on clean gates

def model_for(user_id: str) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "ltr-v8" if bucket < CANARY_PCT else "ltr-v7"
4. Define guardrail metrics and rollback thresholds
Pick a small set of guardrails. A breach of zero-result rate, latency, or CTR rolls the canary back automatically; a drop in online NDCG holds the ramp and alerts. Latency and zero-result rate are pure guardrails: they only have to avoid getting worse. CTR and online NDCG are the win metrics that justify promotion.
| Metric | Threshold | Direction | Action on breach |
|---|---|---|---|
| zero_result_rate | +0.5pp vs stable | worse | auto-rollback |
| p99_latency_ms | > 250 (budget) | worse | auto-rollback |
| ctr | -1.0pp vs stable | worse | auto-rollback |
| ndcg_online | -0.5pp vs stable | worse | hold ramp, alert |
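The table translates directly into a decision function. The following is a minimal sketch, assuming each arm's metrics for one comparison window arrive as a dict keyed like the guardrail log shown earlier (with p99_ms standing in for p99_latency_ms); the function name and return values are hypothetical.

```python
def evaluate_guardrails(stable: dict, canary: dict) -> str:
    """Return 'rollback', 'hold', or 'ok' for one comparison window."""
    if canary["zero_result_rate"] - stable["zero_result_rate"] > 0.005:
        return "rollback"  # +0.5pp zero-result regression
    if canary["p99_ms"] > 250:
        return "rollback"  # over the absolute latency budget
    if canary["ctr"] - stable["ctr"] < -0.010:
        return "rollback"  # CTR down more than 1.0pp
    if canary["ndcg_online"] - stable["ndcg_online"] < -0.005:
        return "hold"      # stop ramping and alert, but keep serving
    return "ok"


# Values from the guardrail log in the diagnosis section.
v7 = {"ctr": 0.182, "zero_result_rate": 0.031, "p99_ms": 141, "ndcg_online": 0.612}
v8 = {"ctr": 0.179, "zero_result_rate": 0.067, "p99_ms": 388, "ndcg_online": 0.628}
print(evaluate_guardrails(v7, v8))  # rollback
```

Rollback conditions are checked before the hold condition so that a canary breaching both is drained rather than merely paused.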
5. Wire the automated rollback trigger
Encode the guardrails as alerting rules that flip the canary percentage to zero. A breach should page and act — humans are too slow to catch a fast relevance regression.
# rules/canary-guardrails.yml
groups:
  - name: relevance_canary
    rules:
      - alert: CanaryZeroResultRegression
        # ignoring(model) lets the two series match despite different model labels.
        expr: |
          (ranking_zero_result_rate{model="ltr-v8"}
            - ignoring(model) ranking_zero_result_rate{model="ltr-v7"}) > 0.005
        for: 5m
        labels: { severity: page, action: rollback }
        annotations:
          summary: "Canary ltr-v8 zero-result rate +{{ $value | humanizePercentage }}"
      - alert: CanaryLatencyRegression
        # 0.25 assumes the histogram is recorded in seconds (250 ms budget).
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(ranking_latency_bucket{model="ltr-v8"}[5m]))) > 0.25
        for: 5m
        labels: { severity: page, action: rollback }
        annotations:
          summary: "Canary ltr-v8 p99 over latency budget"
A webhook receiver on action="rollback" sets CANARY_PCT = 0 and redeploys config, so traffic drains to stable within one config-reload cycle.
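The receiver's core logic is small enough to sketch. This assumes the alerts arrive as an Alertmanager-style webhook payload (a JSON object with an alerts list, each alert carrying a status and labels) and that the split percentage lives in a config dict under a hypothetical canary_pct key; the HTTP server and the config reload are left out.

```python
def rollback_requested(payload: dict) -> bool:
    """True if any firing alert in the payload carries action=rollback."""
    return any(
        alert.get("status") == "firing"
        and alert.get("labels", {}).get("action") == "rollback"
        for alert in payload.get("alerts", [])
    )


def apply_rollback(config: dict, payload: dict) -> dict:
    """Return a new config with the canary drained if a rollback is requested."""
    if rollback_requested(payload):
        return {**config, "canary_pct": 0}
    return config


# Hypothetical payload, trimmed to the fields the sketch reads.
payload = {
    "status": "firing",
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "CanaryLatencyRegression", "action": "rollback"}},
    ],
}
print(apply_rollback({"canary_pct": 5}, payload))  # {'canary_pct': 0}
```

Resolved alerts are deliberately ignored: once the canary is drained, restoring traffic should be a human decision, not a side effect of the alert clearing.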
Verification
Confirm the split sends roughly the configured fraction to the canary:
curl -s 'http://localhost:9090/api/v1/query' \
--data-urlencode 'query=sum by (model) (rate(ranking_requests_total[5m]))' \
| jq -r '.data.result[] | "\(.metric.model) \(.value[1])"'
# Expected (1% canary):
# ltr-v7 98.9
# ltr-v8 1.07
Verify the guardrail comparison is populated for both arms before trusting it:
ranking_zero_result_rate{model="ltr-v8"} - ignoring(model) ranking_zero_result_rate{model="ltr-v7"}
# => 0.0003 (well under the 0.005 rollback threshold, safe to ramp)
Force a rollback in staging by injecting bad results and assert traffic drains:
# After tripping the guardrail, the alert should fire with action=rollback
curl -s 'http://localhost:9090/api/v1/alerts' \
| jq '.data.alerts[] | select(.labels.action=="rollback") | .state'
# Expected: "firing" then ranking_requests_total{model="ltr-v8"} -> 0
Common Pitfalls
Reading CTR as relevance when the result set shrank
Root cause: a model that returns fewer, safer results can show higher CTR per impression while serving worse coverage — users click more often on a tiny list but find less. Remediation: always pair CTR with zero_result_rate and result-set depth as guardrails, and prefer interleaving over raw CTR comparison, since interleaving controls for the population and the query mix that confound a naive CTR delta.
Non-sticky bucketing corrupts the experiment
Root cause: hashing per-request instead of per-user means the same session bounces between models, so clicks cannot be attributed and the user sees results re-shuffle on every keystroke in a search-as-you-type flow. Remediation: hash a stable identity (logged-in user id, else a long-lived session cookie), never the request_id, and assert the bucket function is deterministic in a unit test before rollout.
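A unit test for the bucket function might look like the sketch below. It reimplements the SHA-256 bucketing from the traffic-split step under a hypothetical name, bucket_for, and uses synthetic user ids; the tolerance on the canary share is deliberately loose and is an assumption, not a tuned bound.

```python
import hashlib


def bucket_for(identity: str) -> int:
    """Same hashing scheme as model_for: a stable bucket in [0, 100)."""
    return int(hashlib.sha256(identity.encode()).hexdigest(), 16) % 100


def test_bucket_is_deterministic_and_in_range():
    ids = [f"user-{i}" for i in range(10_000)]  # synthetic ids, illustration only
    first = [bucket_for(u) for u in ids]
    second = [bucket_for(u) for u in ids]
    assert first == second                  # same identity, same bucket
    assert all(0 <= b < 100 for b in first)
    canary_share = sum(b < 1 for b in first) / len(first)
    assert 0.003 < canary_share < 0.02      # loosely near the 1% target


test_bucket_is_deterministic_and_in_range()
```

The test matters most when someone swaps the hash: Python's built-in hash() for strings is randomized per process by default, so a bucket function built on it would pass within one process and still flicker across restarts or replicas.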
Promoting on a metric that never reached significance
Root cause: a 1% canary on a low-traffic index may take days to accumulate enough impressions for the CTR delta to clear noise; ramping on an early, lucky reading promotes a regression. Remediation: gate each ramp step on a minimum impression count and a confidence interval that excludes zero, not on elapsed time alone. Before the next iteration, refresh the model’s training judgments from production interactions captured during the canary.
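The ramp gate can be sketched as a check on impression count plus a confidence interval on the CTR delta. This is a minimal illustration using a normal approximation for the difference of two proportions; it treats impressions as independent, reads "excludes zero" as "entirely above zero in the canary's favor", and uses a hypothetical minimum of 5,000 canary impressions. All three are assumptions to adjust for the index in question.

```python
from math import sqrt


def ctr_delta_ci(clicks_stable, imps_stable, clicks_canary, imps_canary, z=1.96):
    """Approximate 95% CI for (canary CTR - stable CTR)."""
    p_s = clicks_stable / imps_stable
    p_c = clicks_canary / imps_canary
    se = sqrt(p_s * (1 - p_s) / imps_stable + p_c * (1 - p_c) / imps_canary)
    delta = p_c - p_s
    return delta - z * se, delta + z * se


def may_ramp(clicks_stable, imps_stable, clicks_canary, imps_canary,
             min_impressions=5_000):
    """Allow the next ramp step only with enough data and a clear CTR win."""
    if imps_canary < min_impressions:
        return False  # too few impressions to trust any reading
    low, _high = ctr_delta_ci(clicks_stable, imps_stable,
                              clicks_canary, imps_canary)
    return low > 0


# Hypothetical counts: the same 2pp lift, with and without enough traffic.
print(may_ramp(1800, 10_000, 2000, 10_000))  # True
print(may_ramp(18, 100, 20, 100))            # False
```

The second call shows the pitfall: the observed lift is identical, but on 100 impressions it is indistinguishable from noise, so the gate stays closed regardless of how long the canary has been running.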
Related
- Observability and SRE for search — the parent area covering the metrics, alerts, and rollout discipline this rollout depends on.
- Alerting on indexing lag with SLOs — the burn-rate alerting pattern reused here to gate the canary on guardrail breaches.
- Learning-to-rank models — the ranking model whose new version this guide ships safely to production.