Configuring Highlight Fragments in Elasticsearch

Problem Statement

Your result list shows clean snippets for some hits and nothing for others — a matched document returns an empty highlight array, or a long article returns a 100-character stub that cuts mid-word and omits the strongest passage. Both are fragment-configuration problems: the defaults (fragment_size: 100, number_of_fragments: 5, no_match_size: 0) were never tuned for your field lengths or your UI. This guide fixes fragment sizing and guarantees a snippet on every row. It builds on the result highlighting and snippet generation guide within the broader Search Frontend & UX Patterns area.

Prerequisites

Elasticsearch 8.x at localhost:9200 with an index you can reindex.
A text field you already highlight (this guide uses body).
Ability to reindex if you switch the large field to the fvh highlighter, which needs term-vector offsets.
curl and jq for the verification steps.

Diagnosis / Context

fragment_size is the target length, in characters, of each passage the highlighter cuts from the field. number_of_fragments caps how many of those passages are returned. By default they come back in the order they appear in the field; set order: score to sort them by passage score instead. The trap is that number_of_fragments: 0 is a special value: it disables fragmentation and returns the entire field highlighted, which is fine for a short title but ships kilobytes for a long body.

no_match_size is the separate fix for empty snippets: when the highlighter finds no match in the field, it returns nothing by default, so the row renders blank. Setting no_match_size to a positive integer returns that many leading characters of the field as a fallback snippet. The fallback is taken from the start of the field, re-segmented to the nearest boundary, so a value of 160 yields roughly the opening sentence rather than an exact 160-character cut.

The relationship between the two sizing knobs is easy to get backwards. fragment_size is per passage; number_of_fragments is how many passages. The total snippet payload for a field is therefore at most roughly fragment_size × number_of_fragments characters, before markup, and less when the field yields fewer passages. A result card that wants one tidy two-line preview wants number_of_fragments: 1 and a fragment_size matched to two lines of your card width; a “show context” expander that wants several jump-to passages wants number_of_fragments: 3 and keeps the same per-passage size.
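The arithmetic is easy to sanity-check before touching the cluster. The sketch below is an illustration only: the helper name and the expected_matches_per_fragment parameter are assumptions made for this example, and the result is a rough estimate, since real fragments can run slightly over or under fragment_size.

```python
def estimate_snippet_chars(fragment_size: int,
                           number_of_fragments: int,
                           pre_tag: str = "<em>",
                           post_tag: str = "</em>",
                           expected_matches_per_fragment: int = 2) -> int:
    """Rough per-field, per-hit estimate of highlight characters sent to the browser."""
    if number_of_fragments == 0:
        # 0 disables fragmentation, so the size depends on the field, not the settings.
        raise ValueError("number_of_fragments=0 returns the whole field")
    text_chars = fragment_size * number_of_fragments
    # Each highlighted term adds one opening and one closing tag.
    tag_chars = (len(pre_tag) + len(post_tag)) * expected_matches_per_fragment * number_of_fragments
    return text_chars + tag_chars


print(estimate_snippet_chars(160, 1))  # result-card preview: 178
print(estimate_snippet_chars(160, 3))  # "show context" expander: 534
```

Multiply the result by your page size to see what a full page of results costs in highlight text alone.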

Here is the empty-highlight case in a response — the hit matched on another field, so body highlighting produced nothing:

{
  "hits": { "hits": [
    { "_id": "42", "_score": 6.1, "_source": { "title": "vector search internals" } }
  ] }
}

No highlight key at all means the UI row has no preview text, which is exactly what no_match_size resolves. The opposite failure, an over-long snippet, comes from the other special value: number_of_fragments: 0 returns the whole field highlighted rather than a fragment. On a long body that is kilobytes of unnecessary transfer per hit, multiplied across every row on the page. Both failures are configuration, not bugs: the defaults simply assume a generic five-fragment, hundred-character shape that rarely matches a real result card.
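To find out how often the empty-highlight case occurs, you can scan a search response for hits that lack a highlight on the field you render. This is a minimal sketch that assumes the response has already been parsed into a dict; the function name is hypothetical and the sample response is made up to mirror the shape shown above.

```python
from typing import List


def hits_missing_highlight(response: dict, field: str) -> List[str]:
    """Return the _id of every hit that has no highlight fragments for `field`."""
    missing = []
    for hit in response.get("hits", {}).get("hits", []):
        # The highlight key is absent entirely when no field produced a fragment.
        fragments = hit.get("highlight", {}).get(field, [])
        if not fragments:
            missing.append(hit["_id"])
    return missing


# Illustrative response: hit 42 matched on title only, hit 7 matched in body.
response = {"hits": {"hits": [
    {"_id": "42", "_score": 6.1, "_source": {"title": "vector search internals"}},
    {"_id": "7", "_score": 5.0, "highlight": {"body": ["a <em>vector</em> index"]}},
]}}

print(hits_missing_highlight(response, "body"))  # ['42']
```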

The choice of highlighter engine also shapes fragment behavior. The fvh highlighter, which you adopt for large fields, treats fragment_size as a soft target and snaps fragment edges to token boundaries; the unified highlighter segments on sentence boundaries first and so tends to return slightly longer, more readable passages. If you migrate a large field from the unified highlighter to fvh purely for speed, expect the snippet text to shift subtly even though the configuration numbers are unchanged.

A useful mental model: the highlighter is a passage retriever running inside a single document. fragment_size sets the granularity of the passages it considers, and number_of_fragments sets how many it returns. Too small a fragment_size and passages lack context — a two-word window around the match reads like a fortune cookie. Too large and the snippet sprawls, burying the matched terms in surrounding prose. For most product result lists the sweet spot is one passage of roughly two display lines, sized to the actual pixel width of your result card rather than to a round number.

Solution Steps

1. Size fragments to your UI, not the default

Two result-card lines at ~80 chars each imply fragment_size: 160. Return one strong passage, not five.

POST /articles/_search
{
  "query": { "match": { "body": "vector search" } },
  "highlight": {
    "fields": {
      "body": {
        "fragment_size": 160,        // target chars per passage; match your card width
        "number_of_fragments": 1     // one best passage per result row
      }
    }
  }
}

Start with number_of_fragments: 1 even if you later raise it. A single passage forces the passage scorer to commit to the strongest window, which is the right default for a scannable result list. Raise it only for a dedicated “matches in this document” view where the user has already chosen to dig in.
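If the card layout is defined in code, the highlight settings can be derived from it rather than hard-coded. The sketch below assumes a hypothetical helper, highlight_for_card, and the same two-line, 80-character card used in this step; it only builds the request body and leaves sending it to whatever client you already use.

```python
import json


def highlight_for_card(field: str, lines: int = 2, chars_per_line: int = 80,
                       fragments: int = 1) -> dict:
    """Build a highlight block sized to a result card's visible text area."""
    if fragments < 1:
        # 0 would return the whole field, which is never what a card wants.
        raise ValueError("fragments must be at least 1 for a result card")
    return {
        "fields": {
            field: {
                "fragment_size": lines * chars_per_line,
                "number_of_fragments": fragments,
            }
        }
    }


body = {
    "query": {"match": {"body": "vector search"}},
    "highlight": highlight_for_card("body"),
}
print(json.dumps(body, indent=2))
```

Keeping the card dimensions as the inputs means a layout change updates fragment_size in one place.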

2. Use fvh + term-vector offsets for large fields

For multi-kilobyte bodies, re-analysis per hit is the latency cost. Store offsets once, then read them with the fvh highlighter. This requires a reindex because term_vector is a mapping-time setting — you cannot add term vectors to an existing field in place, so the standard path is to create a new index with the offset mapping and reindex into it, then swap an alias. Doing this on the large field only (not every field) keeps the storage overhead contained to where it earns its cost.

# Create the index with offsets stored on the large field.
curl -s -X PUT "localhost:9200/articles_v2" -H 'Content-Type: application/json' \
  -d '{
    "mappings": {
      "properties": {
        "body": { "type": "text", "term_vector": "with_positions_offsets" }
      }
    }
  }'

# Reindex existing data into the offset-backed index.
curl -s -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' \
  -d '{"source":{"index":"articles"},"dest":{"index":"articles_v2"}}'

POST /articles_v2/_search
{
  "query": { "match_phrase": { "body": "vector search" } },
  "highlight": {
    "fields": {
      "body": {
        "type": "fvh",              // reads stored offsets, skips per-hit re-analysis
        "fragment_size": 160,
        "number_of_fragments": 1
      }
    }
  }
}

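After the reindex, point your read traffic at the new index (directly or via an alias) and confirm the fvh highlighter is actually being used. If you omit type: fvh on the highlight request, Elasticsearch falls back to the default unified highlighter. The unified highlighter can also read with_positions_offsets term vectors, so you may still avoid re-analysis, but fragments are then cut on its sentence boundaries rather than by fvh, and the snippets will not match what you tuned.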
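For the alias route, the body sent to POST /_aliases can be generated so that the remove and add happen in one request, which Elasticsearch applies atomically. This is a sketch under assumptions: the alias name articles_read is hypothetical, and an alias cannot share its name with an existing index, so it cannot simply be called articles while that index still exists.

```python
import json
from typing import Optional


def alias_swap_actions(alias: str, new_index: str,
                       old_index: Optional[str] = None) -> dict:
    """Build the body for POST /_aliases that moves `alias` to `new_index`."""
    actions = []
    if old_index is not None:
        # Only remove when the alias already points at the old index.
        actions.append({"remove": {"index": old_index, "alias": alias}})
    actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}


# "articles_read" is an assumed alias name used by read traffic.
print(json.dumps(alias_swap_actions("articles_read", "articles_v2", "articles"), indent=2))
```

Pass old_index=None the first time, when the alias does not exist yet and there is nothing to remove.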

3. Guarantee a snippet with no_match_size

To make every result row show preview text even when the match was on another field, fall back to the field’s opening. This matters most for multi_match and cross-field queries, where a hit can legitimately match on title alone; without no_match_size those rows render with a blank preview and look broken to the user even though ranking is correct.

POST /articles_v2/_search
{
  "query": { "multi_match": { "query": "vector", "fields": ["title", "body"] } },
  "highlight": {
    "fields": {
      "body": {
        "type": "fvh",
        "fragment_size": 160,
        "number_of_fragments": 1,
        "no_match_size": 160        // return 160 leading chars when body has no match
      }
    }
  }
}

Verification

Confirm fragment length and that the no-match fallback fires. Index a document that matches only on title, then query.

curl -s -X POST "localhost:9200/articles_v2/_doc/99?refresh=true" \
  -H 'Content-Type: application/json' \
  -d '{"title":"vector search guide","body":"This article opens with an unrelated sentence about indexing pipelines and batching."}'

curl -s "localhost:9200/articles_v2/_search" -H 'Content-Type: application/json' \
  -d '{"query":{"multi_match":{"query":"vector","fields":["title","body"]}},
       "highlight":{"fields":{"body":{"type":"fvh","no_match_size":160,"number_of_fragments":1}}}}' \
  | jq -r '.hits.hits[0].highlight.body[0]'

Expected output — the leading characters of body, even though the match was on title:

This article opens with an unrelated sentence about indexing pipelines and batching.

Common Pitfalls

Setting number_of_fragments to 0 dumps the whole field

number_of_fragments: 0 does not mean “no fragments” — it disables fragmentation and returns the entire field highlighted. On a long body this ships the full document to the browser. Use 1 for a single passage; reserve 0 for short fields like title where returning the whole value is intended.
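A small pre-flight check on the highlight block can catch this before a request ships. The sketch below is an assumption-laden illustration: lint_highlight and the whole_field_ok allowlist are hypothetical names, and it inspects only per-field options, not settings placed at the top level of the highlight object.

```python
from typing import List, Tuple


def lint_highlight(highlight: dict,
                   whole_field_ok: Tuple[str, ...] = ("title",)) -> List[str]:
    """Return warnings for per-field highlight settings that tend to surprise."""
    warnings = []
    for field, opts in highlight.get("fields", {}).items():
        # Default is 5 fragments; an explicit 0 means "return the whole field".
        if opts.get("number_of_fragments", 5) == 0 and field not in whole_field_ok:
            warnings.append(f"{field}: number_of_fragments=0 returns the whole field")
        # Default no_match_size is 0, which yields no fallback snippet.
        if opts.get("no_match_size", 0) == 0:
            warnings.append(f"{field}: no_match_size is 0, so rows without a match get no snippet")
    return warnings


risky = {"fields": {"body": {"number_of_fragments": 0}}}
for line in lint_highlight(risky):
    print(line)
```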

no_match_size ignored because the field text is not retrievable

no_match_size returns leading characters of the field’s original text, which the highlighter reads from _source or from a stored field. If the highlighted field is indexed but its content is not retrievable for the highlighter (for example a pure keyword sub-field), the fallback yields nothing. Apply no_match_size to the analyzed text field, not its keyword multi-field.

fvh fragment_size off-by-a-few versus plain highlighter

The fvh highlighter treats fragment_size as approximate and aligns fragment edges to token boundaries, so returned passages run slightly shorter or longer than the exact number. Do not assert exact length in tests; assert an inclusive range (for example 120–200 chars for fragment_size: 160).
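A range assertion of that kind might look like the sketch below. It assumes the default <em> and </em> tags and a hypothetical 25% tolerance, which reproduces the 120 to 200 range for fragment_size: 160; both helper names are made up for this example.

```python
def visible_length(fragment: str, pre_tag: str = "<em>", post_tag: str = "</em>") -> int:
    """Length of the fragment as the user sees it, with highlight tags removed."""
    return len(fragment.replace(pre_tag, "").replace(post_tag, ""))


def within_tolerance(fragment: str, fragment_size: int, tolerance: float = 0.25) -> bool:
    """True when the visible length falls in an inclusive band around fragment_size."""
    low = int(fragment_size * (1 - tolerance))
    high = int(fragment_size * (1 + tolerance))
    return low <= visible_length(fragment) <= high


# 70 + 6 + 70 = 146 visible characters, inside 120..200.
sample = "a" * 70 + "<em>vector</em>" + "b" * 70
print(visible_length(sample))          # 146
print(within_tolerance(sample, 160))   # True
```

Apply the lower bound only to documents whose field is longer than fragment_size; a short field legitimately produces a short fragment.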

Related Guides

Result highlighting and snippet generation: the full guide to highlighters, matched_fields, and XSS-safe rendering.
Building faceted filters with aggregations: the sibling result-UI concern that runs alongside snippet rendering.
BM25 tuning and field weights: the scoring that determines which passage a fragment selector ranks first.