Inverted index
The inverted index is the core data structure behind full-text search: a sorted dictionary mapping each term to the list of documents that contain it (its postings list).
Why it matters
It flips the question. A database asks “for this doc, what terms?”; search asks “for this term, which docs?” — and the inverted index answers in roughly O(matches) instead of scanning every row. This is why Elasticsearch returns hits over billions of documents in milliseconds, and why analyzers (not the raw string) decide what is searchable.
How it works
At index time, each text field is run through an analyzer and the resulting terms are added to a per-field dictionary.
docs: 1 "the quick fox" 2 "quick brown dogs"
analyze → 1:[quick,fox] 2:[quick,brown,dog]
inverted index (per field):
brown → [2]
fox → [1]
quick → [1, 2]
- Postings carry more than IDs — term frequency (for scoring), positions (for
match_phrase), and offsets (for highlighting). - Dictionary is sorted — terms are stored as an FST (finite-state transducer), giving fast prefix/
wildcardwalks, much like a tries. - IDF is cheap — the postings list length is the document frequency, so BM25 gets IDF for free.
- Immutable per segment — each segment has its own inverted index; a query merges results across all segments of a shard.
Example
Query match: "quick fox" analyzes to [quick, fox], then:
postings(quick) = [1,2]
postings(fox) = [1]
union → candidate docs {1,2}; score each by TF×IDF
Doc 2 (no “fox”) still matches a should/OR query but scores lower — the engine never touches docs lacking either term.
Pitfalls
- Query/index analyzer mismatch — indexing stemmed (
running→run) but searching akeywordreturns nothing; the terms must be produced the same way both sides. - High-cardinality
text(UUIDs, long codes) bloats the dictionary for no search value — map askeywordinstead. - Phrase queries need positions — if
index_optionsstrips positions to save space,match_phrasesilently breaks.