Observability & SRE for Search Systems

Search reliability is a cross-cutting concern: a slow result page can be caused by a frontend render stall, a query-API thread pool exhaustion, a segment merge storm on the engine, or stale data from a lagging ingestion pipeline. None of those are visible from a single dashboard unless you instrument the whole path. This guide resolves one engineering decision — how to make a search stack observable enough that you can defend a latency SLA and ship relevance changes without flying blind — and sits inside the broader Search Engine Selection & Architecture framework. It spans both the read path (queries, ranking) and the write path (ingestion, indexing), so the telemetry model has to cross those boundaries with a shared trace and a shared set of signals.

The goal is not “more metrics.” It is a small, opinionated set: distributed traces across the query path, four golden signals adapted to search, SLOs with error budgets that gate deploys, lag-aware alerting on the ingestion side, and a canary mechanism so a bad ranking change degrades 1% of traffic instead of 100%.

Prerequisites

OpenTelemetry SDK available in your query service runtime — opentelemetry-sdk>=1.24 (Python) or @opentelemetry/sdk-node>=1.21 (Node).
An OTLP-capable collector reachable at otel-collector:4317 (gRPC) or :4318 (HTTP).
A metrics backend that supports histograms and percentile queries (Prometheus or a Prometheus-compatible store). The queries in this guide use classic le-bucketed histograms; native histograms in Prometheus 2.40+ are optional.
Engine-side metrics exposed: Elasticsearch localhost:9200/_nodes/stats and localhost:9200/_cat/thread_pool, Meilisearch localhost:7700/metrics, or Typesense localhost:8108/metrics.json.
Ingestion lag exposed as a metric — Kafka consumer-group lag or a last_indexed_at watermark per document stream.
A deployment system that can route a traffic percentage to a candidate version (service mesh, weighted DNS, or feature-flag fan-out).

Concept Deep-Dive: one trace, four signals, one feedback loop

A search request is a distributed transaction. The browser issues a request; an edge/query API parses it, applies BM25 weighting and custom scoring, round-trips to the engine, optionally reranks, and renders. Each hop is a span; the spans share a trace_id propagated via W3C traceparent headers. That single trace is what lets you answer “why was this request slow” without correlating four separate logs by timestamp.
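As a minimal sketch of what propagation looks like on the wire, the snippet below parses a W3C traceparent value and derives the header for the next hop, using only the standard library. In a real service the OpenTelemetry propagator does this for you; the helper names (parse_traceparent, child_traceparent) are hypothetical and exist only for illustration.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        raise ValueError("all-zero trace id or span id is invalid")
    sampled = (int(flags, 16) & 1) == 1  # lowest flag bit marks "sampled"
    return {"version": version, "trace_id": trace_id,
            "span_id": span_id, "sampled": sampled}


def child_traceparent(incoming: str) -> str:
    """Header for the next hop: same trace id, fresh span id, same sampled flag."""
    parent = parse_traceparent(incoming)
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{secrets.token_hex(8)}-{flags}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(child_traceparent(incoming))
```

The trace id is the only part that must survive every hop unchanged. If the engine client or the frontend drops it, the trace splits into unrelated fragments.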

Layered on top of traces are the four golden signals, adapted for search:

Latency — query response time as a histogram, reported at p50/p95/p99. Split client-observed latency from engine took_ms; the gap between them is your own overhead (parsing, rerank, serialization).
Errors — rate of 5xx, query timeouts, and empty-result-when-results-expected (a silent relevance failure that a pure HTTP error rate misses).
Traffic — queries per second, segmented by query class (autocomplete vs. full search vs. faceted), because their latency profiles differ by an order of magnitude.
Saturation — the resource closest to its ceiling: engine search thread-pool queue depth, JVM heap pressure, or indexing-thread saturation. On the write path, saturation also shows up as indexing throughput collapsing and indexing lag climbing.

The diagram below shows both halves: the left-to-right distributed trace across the query path, and the feedback loop where SLIs feed an error budget that gates relevance deploys.

A worked example: a regression where p99 query latency jumps from 90ms to 400ms. The trace shows engine.query flat at 35ms but parse + rerank ballooning — pointing at a reranker, not the engine. Without the cross-hop trace, the engine dashboard would look healthy and the investigation would stall on the wrong team. This is exactly the kind of attribution that instrumenting the query path with OpenTelemetry makes a two-minute lookup instead of a two-hour war room.

Step-by-Step Implementation

1. Define SLIs and an SLO

Pick SLIs that map to user pain. For search, the canonical pair is availability (fraction of queries returning a non-error result) and latency (fraction of queries served under a threshold). Express the SLO as a target over a rolling window.

# slo.yaml — search query SLO definition (Sloth / OpenSLO style)
service: search-query-api
slos:
  - name: query-latency
    objective: 99.0            # 99% of queries under threshold
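    description: "99% of queries under 150ms over 28d"
    sli:
      events:
        # error_query must count bad events: all queries minus those served within 150ms
        # (with Sloth, write each range as [{{.window}}] instead of a fixed [5m])
        error_query: 'sum(rate(search_query_latency_ms_count[5m])) - sum(rate(search_query_latency_ms_bucket{le="150"}[5m]))'
        total_query: 'sum(rate(search_query_latency_ms_count[5m]))'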
    alerting:
      page_alert:
        labels: { severity: page }
  - name: query-availability
    objective: 99.9
    description: "non-error responses over 28d"

Verify: confirm the recording rules compile and emit a burn-rate series.

curl -s 'localhost:9090/api/v1/query?query=slo:sli_error:ratio_rate5m' | jq '.data.result | length'
# expect a non-zero number of series
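To make the error budget concrete, here is a rough sketch of the arithmetic behind a 99.0% objective. The function names and the example traffic figures are illustrative assumptions, not part of any SLO tool.

```python
def allowed_bad_events(objective_pct: float, total_events: int) -> float:
    """Events that may miss the SLO in the window before the budget is spent."""
    return total_events * (1.0 - objective_pct / 100.0)


def budget_remaining(objective_pct: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed = allowed_bad_events(objective_pct, total_events)
    if allowed <= 0:
        raise ValueError("a 100% objective leaves no error budget")
    return 1.0 - bad_events / allowed


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 queries, 2,500 of them slower than 150ms.
    print(round(allowed_bad_events(99.0, 1_000_000)))           # about 10,000 allowed
    print(round(budget_remaining(99.0, 1_000_000, 2_500), 3))   # about 0.75 left
```

A deploy gate can then be as simple as refusing relevance rollouts while the remaining fraction is below a floor your team chooses.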

2. Emit the four golden signals from the query path

Record a latency histogram keyed by query class, plus error and saturation counters. The histogram is what makes percentiles cheap to query.

# metrics.py — golden-signal instrumentation for the query API
from opentelemetry import metrics

meter = metrics.get_meter("search.query")

query_latency = meter.create_histogram(
    "search.query.latency", unit="ms",
    description="end-to-end query latency by class")
query_errors = meter.create_counter("search.query.errors")
engine_queue = meter.create_up_down_counter("search.engine.queue_depth")

def record_query(class_: str, latency_ms: float, ok: bool):
    attrs = {"query.class": class_}        # autocomplete | search | facet
    query_latency.record(latency_ms, attrs)
    if not ok:
        query_errors.add(1, attrs)

Verify: scrape the collector’s Prometheus exporter and confirm the histogram buckets appear.

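# depending on exporter settings the unit suffix may appear as _ms or _milliseconds
curl -s localhost:8889/metrics | grep -E 'search_query_latency_(ms|milliseconds)_bucket' | head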
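The reason a histogram makes percentiles cheap is that the backend never needs the raw samples: it interpolates inside cumulative buckets. The sketch below imitates that linear interpolation in simplified form (it is not Prometheus's implementation and ignores the +Inf bucket); the bucket counts are made-up illustration data.

```python
def estimate_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate the q-quantile from (upper_bound_ms, cumulative_count) pairs.

    Buckets must be sorted by upper bound. The estimate is a linear
    interpolation inside the bucket that contains the q-th observation.
    """
    total = buckets[-1][1]
    if total == 0:
        raise ValueError("no observations")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            in_bucket = cumulative - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return buckets[-1][0]


if __name__ == "__main__":
    fine = [(100, 900), (150, 960), (250, 1000)]   # boundary at the 150ms threshold
    coarse = [(100, 900), (500, 1000)]             # same traffic, no 150ms boundary
    print(round(estimate_quantile(0.95, fine), 1))    # roughly 141.7
    print(round(estimate_quantile(0.95, coarse), 1))  # roughly 300.0
```

The same traffic yields very different p95 estimates depending on where the boundaries sit, which is why a boundary at the SLO threshold matters.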

3. Alert on indexing lag, not just query health

A healthy query path serving stale data is still an outage. Track a per-stream lag watermark and alert when it exceeds the freshness budget. This couples directly to your data ingestion and synchronization pipelines — the lag metric is the freshness SLI for that area.

# prometheus-rules.yaml — indexing lag alert
groups:
  - name: search-ingestion
    rules:
      - alert: IndexingLagHigh
        expr: search_index_lag_seconds > 120
        for: 5m
        labels: { severity: page }
        annotations:
          summary: "Index lag {{ $value }}s exceeds 120s freshness budget"
      - alert: IndexingThroughputCollapse
        expr: rate(search_index_docs_total[5m]) < 1
        for: 10m
        labels: { severity: ticket }

Verify: fire a synthetic lag value and confirm the rule transitions to firing.

curl -s 'localhost:9090/api/v1/rules' | jq '.data.groups[].rules[] | select(.name=="IndexingLagHigh") | .state'
# "firing" once lag crosses the threshold
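One way to produce the lag value is from a last_indexed_at watermark per stream, as listed in the prerequisites. The sketch below is an assumption about how such a metric might be computed, with hypothetical stream names. Note that a wall-clock watermark also reads as lag when a stream is simply idle, so a source-event timestamp is usually the better watermark where one is available.

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_BUDGET_S = 120.0  # same value as the IndexingLagHigh threshold


def index_lag_seconds(last_indexed_at: dict[str, datetime],
                      now: datetime) -> dict[str, float]:
    """Lag per stream: how far the newest indexed document trails `now`."""
    return {
        stream: max(0.0, (now - watermark).total_seconds())
        for stream, watermark in last_indexed_at.items()
    }


def streams_over_budget(lags: dict[str, float],
                        budget_s: float = FRESHNESS_BUDGET_S) -> list[str]:
    """Streams whose lag exceeds the freshness budget, sorted by name."""
    return sorted(name for name, lag in lags.items() if lag > budget_s)


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    watermarks = {
        "products": now - timedelta(seconds=30),   # hypothetical stream
        "reviews": now - timedelta(seconds=400),   # hypothetical stream
    }
    lags = index_lag_seconds(watermarks, now)
    print(lags, streams_over_budget(lags))
```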

4. Canary-deploy relevance changes behind the error budget

Relevance changes have no compile-time safety net — a reranker tweak can tank conversion while every HTTP check stays green. Route a small traffic slice to the candidate, compare SLIs and engagement signals, and auto-rollback on budget burn. The mechanics are detailed in the guide on canary-deploying relevance models.

# canary.py — split traffic and compare candidate vs control SLIs
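import hashlib
import time

from metrics import query_latency  # histogram defined in metrics.py above

def variant_for(session_id: str, canary_pct: int) -> str:
    # stable hash, so a session always lands in the same bucket (0-99)
    bucket = int(hashlib.sha1(session_id.encode()).hexdigest(), 16) % 100
    return "candidate" if bucket < canary_pct else "control"

# emit ranking model version as a span/metric attribute so SLIs split by variant
def query(session_id, q, canary_pct=1):
    variant = variant_for(session_id, canary_pct)
    start = time.perf_counter()
    results = run_search(q, ranker=variant)  # run_search: the service's existing search call
    elapsed_ms = (time.perf_counter() - start) * 1000
    # record end-to-end latency rather than engine took_ms, so a slow reranker shows up
    query_latency.record(elapsed_ms, {"variant": variant})
    return results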

Verify: confirm both variants are emitting latency series before ramping traffic.

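# -G with --data-urlencode keeps the spaces and parentheses in the PromQL intact
curl -sG 'localhost:9090/api/v1/query' \
  --data-urlencode 'query=count by (variant) (search_query_latency_ms_count)' \
  | jq '.data.result[].metric.variant'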
# expect: "control" and "candidate"
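The auto-rollback decision can be sketched as a comparison of per-variant SLIs. Everything below is an illustrative assumption: the thresholds, the VariantStats shape, and the choice of error ratio plus p99 as the only inputs. A production gate would also weigh the engagement signals mentioned above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VariantStats:
    queries: int
    errors: int
    p99_ms: float


def should_rollback(control: VariantStats, candidate: VariantStats,
                    min_queries: int = 1000,
                    max_error_ratio_delta: float = 0.005,
                    max_p99_regression: float = 1.2) -> Optional[bool]:
    """True = roll back, False = safe to keep ramping, None = not enough data."""
    if control.queries < min_queries or candidate.queries < min_queries:
        return None  # too little traffic to judge either way
    error_delta = (candidate.errors / candidate.queries
                   - control.errors / control.queries)
    if error_delta > max_error_ratio_delta:
        return True
    # Roll back if candidate p99 exceeds control p99 by more than the allowed factor.
    return candidate.p99_ms > control.p99_ms * max_p99_regression


if __name__ == "__main__":
    control = VariantStats(queries=100_000, errors=100, p99_ms=90.0)
    print(should_rollback(control, VariantStats(1_000, 1, 95.0)))    # healthy candidate
    print(should_rollback(control, VariantStats(1_000, 1, 400.0)))   # latency regression
```

Returning None for low traffic keeps a 1% canary from being judged on a handful of requests.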

Configuration Reference

Name	Default	Type	Effect
`OTEL_EXPORTER_OTLP_ENDPOINT`	`http://localhost:4318`	string (URL)	Collector endpoint for spans and metrics; point at `otel-collector:4317` for gRPC in-cluster.
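`OTEL_TRACES_SAMPLER_ARG`	`1.0`	float 0–1	Head-sampling ratio; takes effect with a ratio-based `OTEL_TRACES_SAMPLER` such as `parentbased_traceidratio`. Drop to `0.1` at high QPS to bound trace volume, or keep `1.0` and let collector-side tail sampling retain every error and slow trace.
`search.query.latency` buckets	SDK default	list[ms]	Explicit histogram boundaries (e.g. `[5,10,25,50,100,150,250,500]`); must straddle the SLO threshold for accurate percentiles. The SDK default boundaries do not include 150, so set them explicitly.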
`slo.query-latency.objective`	`99.0`	float (%)	Target fraction of queries under threshold; sets the error-budget size for the window.
`search_index_lag_seconds` alert	`120`	int (seconds)	Freshness budget; lag above this pages. Tighten for real-time catalogs, loosen for batch corpora.
`canary_pct`	`1`	int (%)	Share of sessions routed to the candidate ranker; ramp `1 → 5 → 25 → 100` on clean budgets.
`OTEL_METRIC_EXPORT_INTERVAL`	`60000`	int (ms)	Metric flush cadence; lower to `10000` for tighter alert latency at higher egress cost.

Failure Modes & Debugging

Symptom: p99 latency spikes but the engine dashboard is flat

Root cause: the slow time is in your own service (parse, rerank, serialization) or in the network hop, not in engine execution. A flat engine view hides it because engine.query took_ms excludes everything outside the engine. Remediation: pull a slow trace and compare span durations.

# find slow traces and inspect span breakdown (Tempo/Jaeger query)
curl -s 'localhost:3200/api/search?tags=service.name%3Dsearch-query-api&minDuration=300ms' | jq '.traces[0].traceID'

The span with the largest self-time is the culprit; if it is parse + rerank, profile the reranker rather than touching the cluster.
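Self-time is a span's duration minus the time spent in its direct children. A minimal sketch of that calculation follows, using a hypothetical Span shape and made-up durations loosely based on the worked example earlier in this guide. It assumes children run sequentially; overlapping children would be over-subtracted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float


def self_times(spans: list[Span]) -> dict[str, float]:
    """Map span_id -> duration minus the summed duration of direct children."""
    child_total: dict[str, float] = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0.0) + span.duration_ms
            )
    return {
        span.span_id: max(0.0, span.duration_ms - child_total.get(span.span_id, 0.0))
        for span in spans
    }


def culprit(spans: list[Span]) -> Span:
    """The span with the largest self-time."""
    times = self_times(spans)
    return max(spans, key=lambda span: times[span.span_id])


if __name__ == "__main__":
    trace = [
        Span("a", None, "search.request", 400.0),
        Span("b", "a", "parse + rerank", 350.0),
        Span("c", "a", "engine.query", 35.0),
        Span("d", "a", "render", 10.0),
    ]
    print(culprit(trace).name)
```

The root span has the longest duration but almost no self-time, which is why sorting by raw duration points at the wrong span.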

Symptom: error rate is 0% but users report "no results"

Root cause: empty-but-200 responses. A pure HTTP error SLI counts a zero-hit page as success. This is usually stale data (ingestion lag) or an over-aggressive filter. Remediation: add a dedicated SLI for unexpected-empty responses and correlate with lag.

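# -G with --data-urlencode keeps curl from choking on the spaces and [5m] brackets
curl -sG 'localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(search_query_zero_hits_total[5m]) / rate(search_query_total[5m])' \
  | jq '.data.result[0].value[1]'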
# a rising ratio alongside high index lag points at staleness, not relevance
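A sketch of the unexpected-empty SLI and the lag correlation is shown below. How you decide that a query was expected to match is up to you; here it is assumed to be a set of known-good normalized queries, and the event keys (q, hits) and thresholds are hypothetical.

```python
def zero_hit_ratio(events: list[dict], expected_queries: set[str]) -> float:
    """Share of expected-to-match queries that came back with zero hits."""
    expected = [e for e in events if e["q"].strip().lower() in expected_queries]
    if not expected:
        return 0.0
    empty = sum(1 for e in expected if e["hits"] == 0)
    return empty / len(expected)


def triage(zero_ratio: float, lag_s: float,
           ratio_alert: float = 0.02, lag_budget_s: float = 120.0) -> str:
    """Rough first guess at the cause of unexpected-empty responses."""
    if zero_ratio < ratio_alert:
        return "ok"
    # High lag alongside empty pages suggests stale data; otherwise look at filters.
    return "staleness" if lag_s > lag_budget_s else "filter-or-relevance"


if __name__ == "__main__":
    events = [
        {"q": "Red Shoes", "hits": 0},
        {"q": "red shoes", "hits": 12},
        {"q": "asdfgh", "hits": 0},      # not expected to match, so ignored
        {"q": "blue hat", "hits": 0},
    ]
    ratio = zero_hit_ratio(events, {"red shoes", "blue hat"})
    print(round(ratio, 2), triage(ratio, lag_s=300.0))
```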

Symptom: traces stop at the query API; no engine or render spans

Root cause: broken context propagation — the traceparent header is not injected into the engine HTTP client or forwarded to the frontend. Remediation: confirm the propagator is configured and the header survives the hop.

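# the outbound engine request must carry traceparent; a manual curl cannot show that,
# so capture the query API's own traffic to the engine (plain-HTTP port assumed)
# while issuing a search, from the query-API host:
sudo tcpdump -l -A -i any 'tcp dst port 9200' | grep -i --line-buffered traceparent
# no match => instrument the HTTP client / enable W3C TraceContext propagator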

Symptom: alert storm during a routine reindex

Root cause: indexing-throughput and saturation alerts fire on expected load from a full rebuild. Remediation: gate write-path alerts behind a maintenance label or silence them for the duration of the reindex, and alert on lag (a user-facing symptom) rather than raw throughput.

# create a scoped silence for the reindex window
amtool silence add alertname=IndexingThroughputCollapse --duration=2h --comment="planned reindex"

Performance & Scale Notes

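Instrumentation is not free, but the cost is bounded and predictable. At 5,000 QPS with full (100%) head sampling, expect roughly 5,000 traces/s; at ~1–2 KB/span and 4 spans/trace that is 20–40 MB/s of export, which is usually the first thing to optimize. Tail sampling can only keep traces it sees, so a 10% head sample would discard 90% of slow and error traces before the decision is made. To retain all of them, send every span to the collector and sample there: keep all error traces, all traces over the SLO threshold (e.g. >150ms), and about 10% of the rest. As long as slow and error traces are a small share of traffic, that cuts stored volume ~10x while retaining the diagnostically valuable 100% of slow/error traces. If service-to-collector egress is the bottleneck instead, lower the head-sampling ratio and accept that the same fraction of slow and error traces is lost. Histogram metrics are far cheaper than traces: a fixed cardinality of buckets × query.class × variant (keep this product under ~10,000 series per instance) costs single-digit MB regardless of QPS.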
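The volume arithmetic and a tail-sampling keep decision can be sketched as follows. This is not the collector's implementation; the function names and the trace-id-based baseline sample are assumptions for illustration, and the keep decision presumes the sampler sees complete traces.

```python
def export_mb_per_s(qps: float, sample_ratio: float,
                    spans_per_trace: int, kb_per_span: float) -> float:
    """Approximate span export bandwidth, taking 1 MB as 1000 KB."""
    return qps * sample_ratio * spans_per_trace * kb_per_span / 1000.0


def keep_trace(has_error: bool, duration_ms: float, trace_id: str,
               slo_ms: float = 150.0, baseline_ratio: float = 0.1) -> bool:
    """Tail-sampling decision made after the whole trace is available."""
    if has_error or duration_ms > slo_ms:
        return True  # always keep the diagnostically valuable traces
    # Deterministic baseline sample: map the last 8 hex chars of the id to [0, 1).
    return int(trace_id[-8:], 16) / 0x1_0000_0000 < baseline_ratio


if __name__ == "__main__":
    print(export_mb_per_s(5000, 1.0, 4, 1.0))   # lower bound of the 20-40 MB/s range
    print(export_mb_per_s(5000, 1.0, 4, 2.0))   # upper bound
    print(keep_trace(False, 420.0, "4bf92f3577b34da6a3ce929d0e0e4736"))
```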

For percentile accuracy, choose bucket boundaries that bracket your SLO threshold tightly; a 150ms SLO with buckets at [100, 150, 250] gives a usable p95 estimate, whereas [100, 500] does not. Benchmark the overhead the same way you benchmark the engine: run the k6 latency harness from the architecture overview with instrumentation on and off, and confirm the added p95 stays under ~2ms. Measure error-budget burn on a 28-day rolling window with multi-window multi-burn-rate alerts (fast: 1h@14.4x, slow: 6h@6x) so a sharp incident pages immediately while a slow leak still surfaces before the budget is exhausted.
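Burn rate is the observed error ratio divided by the error budget, so a burn rate of 1x spends exactly the budget over the full window. The sketch below works through the two alert tiers mentioned above; the function names are hypothetical. The 14.4x and 6x factors are the conventional values derived for a 30-day window (2% and 5% of budget), so on 28 days they correspond to slightly more, as the code shows.

```python
def burn_rate(error_ratio: float, objective_pct: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1.0 - objective_pct / 100.0)


def budget_consumed(rate: float, alert_window_h: float,
                    slo_window_days: int = 28) -> float:
    """Fraction of the whole error budget spent if `rate` holds for the alert window."""
    return rate * alert_window_h / (slo_window_days * 24)


def should_page(error_ratio_1h: float, error_ratio_6h: float,
                objective_pct: float = 99.0) -> bool:
    """Fast tier: 1h at 14.4x. Slow tier: 6h at 6x."""
    fast = burn_rate(error_ratio_1h, objective_pct) >= 14.4
    slow = burn_rate(error_ratio_6h, objective_pct) >= 6.0
    return fast or slow


if __name__ == "__main__":
    print(round(budget_consumed(14.4, 1), 4))   # about 2.1% of a 28-day budget
    print(round(budget_consumed(6.0, 6), 4))    # about 5.4% of a 28-day budget
    print(should_page(error_ratio_1h=0.20, error_ratio_6h=0.03))
```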

Instrumenting search with OpenTelemetry — span-by-span SDK setup for the query path and OTLP export.
Alerting on indexing lag with SLOs — turn freshness budgets into burn-rate alerts.
Canary-deploying relevance models — ship ranking changes behind an error-budget gate.
Ranking Algorithms & Relevance Tuning — the read-path changes whose blast radius this telemetry contains.
Data Ingestion & Synchronization Pipelines — the write path whose lag is the freshness SLI here.