Embeddings and Vector Search

An embedding maps text to a fixed-length vector so that semantically similar text lands nearby; vector search then finds the nearest vectors to a query.

Why it matters

This is the retrieval engine under agent memory and RAG. Keyword search misses “car” when the doc says “automobile”; embeddings match by meaning, so an agent can pull the relevant chunk regardless of wording. It powers rag-and-vector-databases, long-term-memory, deduplication, and clustering — anywhere an agent must find “things like this” in a large corpus.

How it works

An embedding model turns a string into a dense vector (e.g. 1536 dims). Similarity is cosine (angle); on normalized vectors that’s just a dot product. At scale, an ANN index (HNSW, IVF) trades exactness for speed.

  • Chunk → embed → index. Split docs to ~200–500 tokens, embed each chunk, store vector + metadata + source text.
  • Query → embed → top-k. Embed the query with the same model, return the k nearest chunks, feed them to the LLM.
  • ANN graphs (HNSW) give sub-millisecond search over millions of vectors at ~95–99% recall — exact brute force is O(N) and too slow.
  • Hybrid retrieval fuses vector + keyword (BM25), often re-ranked by a cross-encoder, to beat either alone.
ConceptMeaning
Dimensionsvector length (384–3072 typical)
Cosine similarity1.0 identical, ~0 unrelated
top-khow many neighbors to retrieve
ANN (HNSW/IVF)approximate, fast nearest-neighbor

Example

Semantic search over a docs corpus:

"how do I reset my password?"  → embed → [0.01, -0.23, ...] (1536d)
ANN top-3 from 50k chunks:
  0.91  "Resetting your account credentials"
  0.84  "Forgot password flow"
  0.79  "Login troubleshooting"
→ pass these 3 chunks as context to the LLM

The exact words “reset” and “password” never appear in the top hit — meaning, not lexical overlap, ranked it first.

Pitfalls

  • Mismatched models. Querying with a different embedding model (or version) than you indexed with silently destroys recall — pin one model.
  • Bad chunking. Chunks too big dilute the vector; too small lose context. Tune size and overlap to the content.
  • Cosine ≠ truth. High similarity isn’t relevance; re-rank and set a score threshold or you’ll retrieve confident garbage.
  • Stale index. Embeddings don’t auto-update; re-embed when source docs or the model change, or retrieval drifts.

See also