Search engine fundamentals

The concepts every full-text engine is built on: analysis, the inverted index, and relevance ranking — the machinery that turns “find documents about X” into a fast, ordered result list.

Why it matters

A database LIKE '%term%' scans every row and treats all matches as equal. A search engine inverts the problem at index time so lookups are O(matches), and it ranks results so the best ones come first. Understanding analysis → indexing → scoring is what lets you debug “why didn’t this match?” and “why is the wrong doc first?“.

How it works

Two pipelines that must agree: text is analyzed the same way at index time and query time.

  • Analysis — a character stream is split by a tokenizer into terms, then token filters lowercase, stem, and drop stopwords. "Running Shoes"[run, shoe].
  • Inverted index — maps each term → a postings list of doc IDs (plus positions/frequencies). This is the same idea as a tries/dictionary keyed by term (see inverted-index).
  • Scoring — for a query, intersect/union postings, then rank by BM25 using term frequency (TF) and inverse document frequency (IDF).
StepInputOutput
Tokenize”Quick Foxes!”[Quick, Foxes]
Filter (lowercase, stem)[Quick, Foxes][quick, fox]
Index[quick, fox]postings: quick→{1}, fox→{1}

Example

Querying "fox" against three docs:

term "fox" → postings [doc1, doc4, doc9]
score(doc) = TF(fox, doc) × IDF(fox)
IDF rises when "fox" is rare across the corpus → rarer terms weigh more

The engine never reads doc2 or doc3 — it jumps straight to the postings list, which is the entire performance win over a row scan.

Pitfalls

  • Mismatched analyzers — indexing with stemming but querying a keyword (or vice versa) silently returns nothing.
  • Treating relevance as binary — full-text returns a ranked list; if you only need yes/no, use a filter and skip scoring.
  • Forgetting precision/recall trade-offs — stemming and fuzzy matching boost recall but can hurt precision; tune per field.

See also