Choosing an Embedding Model for Semantic Search

The model you pick fixes your vector dimension, your storage and memory budget per million documents, your query latency, and your ceiling on retrieval recall — and changing it later means re-embedding the entire corpus and rebuilding every index. This is a one-way door, so the decision deserves more than “use whatever the tutorial used.” This guide weighs dimensions against recall and cost, treats MTEB as a starting filter rather than gospel, and contrasts open models you self-host against hosted API models. It sits within the broader vector search integration strategies and the engine-selection frameworks under Search Engine Selection & Architecture. Once a model is chosen, the pgvector implementation guide covers wiring the vectors into PostgreSQL.

Prerequisites

  1. A representative sample of your corpus (≥1,000 documents) and a labeled query set for offline recall evaluation.
  2. A target latency budget for embedding a query at request time (typically <50 ms).
  3. A storage/memory budget per million vectors (dimension × 4 bytes for float32).
  4. Python 3.10+ with sentence-transformers, plus an OpenAI API key if you evaluate hosted models.

Diagnosis: Why Dimensions, Recall, and Cost Trade Off

Higher-dimension embeddings encode more semantic nuance and usually score higher recall, but the cost is linear in storage and roughly linear in distance-computation time. A 1536-dim float32 vector is 6 KB; ten million of them is 60 GB before index overhead, and an HNSW graph can double that. A 384-dim model stores the same corpus in 15 GB and computes distances four times faster. The question is whether the extra dimensions buy recall that your reranker or BM25 fusion would otherwise have to recover.
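The storage arithmetic above can be reproduced with a few lines. This is a minimal sketch that counts raw float32 payload only, in decimal gigabytes; the helper name `vector_storage_gb` is illustrative, and index overhead is deliberately left out.

```python
def vector_storage_gb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw storage for dense vectors in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_vectors * dim * bytes_per_value / 1e9


# Compare the dimensions discussed in this guide for a 10M-vector corpus.
for dim in (384, 768, 1024, 1536):
    gb = vector_storage_gb(10_000_000, dim)
    print(f"{dim:>5} dims: {gb:6.2f} GB per 10M float32 vectors")
```

Multiply the result by whatever overhead factor you measure for your own index to get a realistic memory budget.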

MTEB (Massive Text Embedding Benchmark) ranks models on retrieval, clustering, and classification, but its retrieval splits are generic — a model topping MTEB on Wikipedia-style QA can underperform a smaller domain-tuned model on legal or product-catalog text. Treat the leaderboard as a shortlist generator, then run your own recall@k on your labeled set. A typical disappointing baseline looks like this:

# recall@10 on 500 labeled queries, generic top-MTEB model, before domain check
recall@10 = 0.61   # acceptable on open-domain QA, weak on our SKU descriptions
mean query embed latency = 38ms (API) / 9ms (local MiniLM)
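A latency figure like the one in that baseline is easy to measure yourself. The sketch below times single-query calls against any callable; `measure_latency_ms` and `fake_embed` are hypothetical names, and the stand-in embedder exists only so the example runs without a model. Swap in your real query-embedding function.

```python
import statistics
import time
from typing import Callable, Sequence


def measure_latency_ms(embed: Callable[[str], object],
                       queries: Sequence[str],
                       warmup: int = 3) -> dict:
    """Time one embedding call per query; return mean and p95 in milliseconds."""
    for q in queries[:warmup]:
        embed(q)  # warm caches and lazy model loading before timing
    samples = []
    for q in queries:
        start = time.perf_counter()
        embed(q)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {"mean_ms": statistics.mean(samples), "p95_ms": samples[p95_index]}


def fake_embed(text: str) -> list:
    """Stand-in embedder (assumption): replace with a real model or API call."""
    return [float(len(text))] * 8


stats = measure_latency_ms(fake_embed, ["red shirt", "usb-c cable", "office chair"] * 10)
print(stats)
```

Report p95 alongside the mean: for an API model the tail is usually what breaks an interactive latency budget.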

The other silent failure is sequence length. Most open encoders truncate at 512 tokens or fewer (all-MiniLM-L6-v2 at 256); feed a 2,000-token document and the tail is silently dropped, so the vector represents only the opening. That forces a chunking decision, and chunking changes what “a result” even means.

Cost and latency split sharply along the open-vs-API line. A self-hosted MiniLM model embeds a query in roughly 9 ms on a modest CPU and costs nothing per call, but you own the GPU or CPU fleet, the autoscaling, and the model lifecycle. A hosted model like text-embedding-3-small removes all of that infrastructure but bills per token and adds a network round-trip (30 to 50 ms is typical), which matters when the embedding is on the critical path of an interactive query. An API model therefore suits workloads where throughput, not per-call latency, dominates, such as offline batch embedding of a large corpus; a small self-hosted model suits query-time embedding where every millisecond is user-visible. It is tempting to combine the two, with an API model for ingestion and local MiniLM for the live query, but that does not work: vectors from different models live in different spaces and usually have different dimensions, so similarity scores between them are meaningless. Pick one model and use it for both documents and queries.
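One cheap way to enforce the one-model rule is to store a descriptor of the embedding space next to the index and check it before every query path is deployed. The sketch below is an illustration only; `EmbeddingSpace` and `assert_same_space` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpace:
    """Describes how a set of vectors was produced (illustrative schema)."""
    model_name: str
    dim: int
    normalized: bool


def assert_same_space(index_space: EmbeddingSpace, query_space: EmbeddingSpace) -> None:
    """Raise if query vectors would be compared against another model's index."""
    if index_space != query_space:
        raise ValueError(
            f"embedding space mismatch: index={index_space}, query={query_space}"
        )


index_space = EmbeddingSpace("intfloat/e5-large-v2", 1024, True)
# Same model, dimension, and normalization on both sides: no error.
assert_same_space(index_space, EmbeddingSpace("intfloat/e5-large-v2", 1024, True))
```

Persisting the descriptor with the index also makes a later model migration explicit instead of silent.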

Domain fit is the factor MTEB hides. A model trained on web text and QA pairs has never seen your SKU descriptions, ICD codes, or contract clauses, and its notion of “similar” is calibrated to the training distribution. The only reliable signal is recall@k measured on your own labeled queries against your own corpus; a model two places lower on the leaderboard frequently wins on domain text because its training data overlapped yours. Newer API models add a Matryoshka property: text-embedding-3-small can be truncated from 1536 to 512 dimensions, then re-normalized, with only a small recall loss. That lets you trade storage for recall after the fact without re-embedding, a flexibility most open models lack.
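Truncating a Matryoshka-style embedding amounts to keeping the leading dimensions and re-normalizing to unit length. The sketch below shows that operation on a toy vector; `truncate_and_renormalize` is a hypothetical helper, and the technique is only meaningful for models trained to support it. The OpenAI embeddings endpoint also accepts a `dimensions` request parameter that shortens the vector server-side, which may be simpler if you know the target size up front.

```python
import math
from typing import Sequence


def truncate_and_renormalize(vec: Sequence[float], target_dim: int) -> list:
    """Keep the first target_dim values and rescale the result to unit length."""
    if not 0 < target_dim <= len(vec):
        raise ValueError("target_dim must be between 1 and len(vec)")
    head = list(vec[:target_dim])
    norm = math.sqrt(sum(x * x for x in head))
    # A zero vector cannot be normalized; return it unchanged.
    return [x / norm for x in head] if norm > 0 else head


full = [0.5, 0.5, 0.5, 0.5]          # toy unit-length vector
short = truncate_and_renormalize(full, 2)
print(len(short), round(math.sqrt(sum(x * x for x in short)), 4))
```

Measure recall@k at each candidate dimension on your own labeled set before committing; the size of the loss depends on the model and the domain.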

Model Comparison

Model | Dimensions and context window | Hosting | Notes
--- | --- | --- | ---
all-MiniLM-L6-v2 | 384 dims, 256-token window | Open, self-hosted | Fastest and smallest; ~9 ms local embeds; solid baseline recall, weak on long or domain text
bge-base-en-v1.5 | 768 dims, 512-token window | Open, self-hosted | Strong general retrieval near top of MTEB; needs a query instruction prefix for best recall
e5-large-v2 | 1024 dims, 512-token window | Open, self-hosted | Higher recall than base models; requires query:/passage: prefixes; ~2.7x storage of MiniLM
text-embedding-3-small | 1536 dims (Matryoshka-truncatable), 8191-token window | API, hosted | Long context, no infra; per-token billing; dimension reducible to 512 with minor recall loss

Solution Steps

1. Generate the shortlist from MTEB, then narrow by constraints

Filter the leaderboard to models whose dimension fits your storage budget and whose context window covers your median document length. Keep two open candidates and one API candidate.
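The constraint filter can be sketched as below, using the dimensions and context windows from the comparison table. The `Candidate` class, the `shortlist` function, and the example budget of 45 GB for 10 million vectors with a 300-token median document are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    dim: int
    max_tokens: int
    hosted: bool


CANDIDATES = [
    Candidate("all-MiniLM-L6-v2", 384, 256, False),
    Candidate("bge-base-en-v1.5", 768, 512, False),
    Candidate("e5-large-v2", 1024, 512, False),
    Candidate("text-embedding-3-small", 1536, 8191, True),
]


def shortlist(candidates, num_vectors: int, budget_gb: float, median_doc_tokens: int):
    """Keep models whose float32 storage fits the budget and whose window covers the median doc."""
    keep = []
    for c in candidates:
        storage_gb = num_vectors * c.dim * 4 / 1e9
        if storage_gb <= budget_gb and c.max_tokens >= median_doc_tokens:
            keep.append(c)
    return keep


picked = shortlist(CANDIDATES, num_vectors=10_000_000, budget_gb=45.0, median_doc_tokens=300)
print([c.name for c in picked])
```

If you plan to chunk anyway, the context-window check can be relaxed; if you plan to truncate a Matryoshka model, apply the storage check to the truncated dimension instead.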

2. Apply the model’s required prefixes

e5 and bge models are trained with asymmetric instructions; omitting them costs measurable recall. Encode queries and passages differently.

# e5 expects "query: " / "passage: " role prefixes; the two sides are NOT symmetric.
# (bge uses a query-side instruction instead, so do not reuse these strings for it.)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-large-v2")

def embed_query(text: str):
    return model.encode(f"query: {text}", normalize_embeddings=True)

def embed_passage(text: str):
    return model.encode(f"passage: {text}", normalize_embeddings=True)
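Because each model family has its own convention, it can help to keep the prefixes in one table rather than scattering string literals. The sketch below is an assumption-laden illustration: the `PREFIXES` mapping and `with_prefix` helper are hypothetical, and the bge instruction string is quoted from memory of the bge v1.5 model card, so verify it against the card for the exact checkpoint you use.

```python
# (query_prefix, passage_prefix) per model family. Strings are assumptions to
# verify against each model card before use.
PREFIXES = {
    "e5": ("query: ", "passage: "),
    "bge": ("Represent this sentence for searching relevant passages: ", ""),
    "minilm": ("", ""),  # symmetric model: no prefixes
}


def with_prefix(family: str, text: str, is_query: bool) -> str:
    """Return text with the role prefix the given model family expects."""
    query_prefix, passage_prefix = PREFIXES[family]
    return (query_prefix if is_query else passage_prefix) + text


print(with_prefix("e5", "slim fit oxford shirt", is_query=True))
print(with_prefix("bge", "slim fit oxford shirt", is_query=False))
```

The returned string is what you would pass to the encoder; an unknown family raises `KeyError`, which is preferable to silently embedding without a prefix.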

3. Chunk documents to fit the sequence window

Split long documents into overlapping windows so nothing is truncated, and embed each chunk as its own vector keyed back to the parent document.

# Token-aware chunking with overlap; 480 leaves room under a 512 limit
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("intfloat/e5-large-v2")

def chunk(text: str, max_tokens=480, overlap=64):
    ids = tok.encode(text, add_special_tokens=False)
    step = max_tokens - overlap
    for start in range(0, len(ids), step):
        window = ids[start:start + max_tokens]
        yield tok.decode(window)
        if start + max_tokens >= len(ids):
            break

4. Normalize vectors so cosine reduces to a dot product

Normalize to unit length at embedding time. Cosine similarity then equals the inner product, so you can use the faster vector_ip_ops operator class in pgvector without changing the ranking.

import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
# With pre-normalized vectors, `<#>` (inner product) ranks identically to cosine.
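A toy example makes the equivalence concrete. In this sketch, with made-up two-dimensional vectors, raw inner product favors the long vector `b`, while cosine and normalized inner product agree with each other.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n > 0 else list(v)


query = [3.0, 4.0]
docs = {"a": [6.0, 8.0], "b": [10.0, 0.0], "c": [0.0, 1.0]}

# Rank by raw inner product, by cosine, and by inner product after normalizing.
raw_ip_rank = sorted(docs, key=lambda d: dot(query, docs[d]), reverse=True)
cos_rank = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
q = normalize(query)
norm_ip_rank = sorted(docs, key=lambda d: dot(q, normalize(docs[d])), reverse=True)

print("raw inner product:", raw_ip_rank)
print("cosine:           ", cos_rank)
print("normalized ip:    ", norm_ip_rank)
```

Note that pgvector's `<#>` operator returns the negative inner product, so ordering ascending by it puts the most similar rows first.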

5. Measure recall@k on your own labeled set

Compute recall against ground-truth relevant document IDs for each candidate model. This is the number that overrides the leaderboard.

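# Offline recall@k harness: run once per candidate model.
# retrieved_ids: {query_id: ranked list of doc IDs}
# relevant_ids:  {query_id: set of ground-truth relevant doc IDs}
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    # Per query: fraction of relevant docs found in the top k; then average.
    scores = [len(rel & set(retrieved_ids[q][:k])) / len(rel)
              for q, rel in relevant_ids.items() if rel]
    return sum(scores) / len(scores)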

for name in ["minilm", "bge", "e5", "openai-3-small"]:
    print(name, round(recall_at_k(RETRIEVED[name], RELEVANT, k=10), 3))
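The ranked lists fed to the harness are best produced with exact search, so that approximate-index error is not mistaken for a model weakness. The sketch below shows one way to build them by brute force over unit-normalized vectors; `top_k`, `DOCS`, and `QUERIES` are hypothetical names with toy data.

```python
import heapq


def top_k(query_vec, doc_vecs, k=10):
    """Exact top-k by inner product; doc_vecs maps doc_id -> unit-normalized vector."""
    scored = (
        (sum(q * d for q, d in zip(query_vec, vec)), doc_id)
        for doc_id, vec in doc_vecs.items()
    )
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


# Toy unit-length vectors standing in for real embeddings.
DOCS = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.6, 0.8]}
QUERIES = {"q1": [0.8, 0.6], "q2": [0.0, 1.0]}

retrieved = {qid: top_k(vec, DOCS, k=2) for qid, vec in QUERIES.items()}
print(retrieved)
```

Build one such mapping per candidate model, using that model for both the documents and the queries, and pass each to the recall harness.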

Verification

Confirm dimensions and normalization before committing the model to a production index:

# Sanity check: dimension matches your schema, vectors are unit length
import numpy as np
v = embed_passage("oxford cotton shirt, slim fit")
print("dim:", len(v))                       # expect 1024 for e5-large-v2
print("norm:", round(float(np.linalg.norm(v)), 4))  # expect ~1.0

Expected output:

dim: 1024
norm: 1.0

Then verify the winning candidate beats your baseline on your data, not MTEB:

minilm 0.612
bge    0.731
e5     0.758
openai-3-small 0.744
# e5-large-v2 wins recall; if storage is tight, bge-base at 768d is the value pick.

Common Pitfalls

Mixing query and passage encodings on asymmetric models

e5 is trained with query: and passage: prefixes, and bge with a query-side instruction; embedding both sides identically (or omitting the prefixes) silently drops several points of recall@10. Wrap encoding in embed_query / embed_passage helpers so the asymmetry is never bypassed, and re-embed the corpus if you change the prefix convention.

Truncation at the model's max sequence length

Most open encoders cap at 512 tokens or fewer and truncate without warning, so a long document is represented only by its opening. Chunk with token-aware windows and overlap, store one vector per chunk, and aggregate at retrieval time; never assume the full document made it into one vector.
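Aggregating at retrieval time can be as simple as scoring each parent document by its best-matching chunk. The sketch below assumes chunk hits arrive as `(chunk_id, score)` pairs with higher scores meaning more similar; `aggregate_by_parent` and the ID scheme are hypothetical, and max-pooling is one choice among several.

```python
def aggregate_by_parent(chunk_hits, chunk_to_parent, k=10):
    """Collapse chunk-level hits to documents, scoring each by its best chunk."""
    best = {}
    for chunk_id, score in chunk_hits:
        parent = chunk_to_parent[chunk_id]
        if parent not in best or score > best[parent]:
            best[parent] = score
    # Highest-scoring documents first, truncated to k results.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:k]


chunk_to_parent = {"doc1#0": "doc1", "doc1#1": "doc1", "doc2#0": "doc2"}
hits = [("doc1#1", 0.82), ("doc2#0", 0.79), ("doc1#0", 0.41)]
print(aggregate_by_parent(hits, chunk_to_parent))
```

Fetch more chunk hits than the number of documents you need, since several top chunks often belong to the same parent.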

Comparing un-normalized vectors with the wrong operator

If vectors are not unit-normalized, inner product and cosine diverge, and an index built on vector_ip_ops ranks differently from your offline cosine eval. Normalize at embedding time and keep operator class, distance metric, and eval metric consistent end to end.