Apache Lucene foundations

Lucene is the Java full-text library Elasticsearch is built on; every shard is a Lucene index, so Lucene’s behavior dictates Elasticsearch’s.

Why it matters

The traits engineers attribute to Elasticsearch (near real-time search, immutable segments, deletes that only mark tombstones, force merge) are all Lucene mechanics. Understanding the layer below explains why newly indexed documents need a refresh before they become searchable and why updates are expensive.

How it works

A Lucene index is a set of segments: self-contained, immutable mini-indexes. Writes go to an in-memory buffer, then flush to a new segment.

  • Inverted index — term → postings, the core lookup structure.
  • Immutable segments — once written, never modified; new docs create new segments, which are periodically merged.
  • Deletes as tombstones — a delete marks the doc as not live in the segment's .liv file; the doc is filtered out of results but reclaimed only on merge.
  • Specialized formats — doc values (columnar, for sorting/aggregations), stored fields (the _source), term vectors (for highlighting).
  • Commit point — fsync of segments for durability; Elasticsearch layers a translog on top for crash recovery between commits.

Lucene concept           Elasticsearch surface
Index                    A single shard
Segment                  Immutable file set, merged over time
Refresh (open searcher)  Makes new docs searchable (~1s)
Doc values               Powers aggregations / sorting
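
The segment mechanics above can be sketched with a toy model. This is not Lucene's API: the Segment class and the search and merge helpers are hypothetical names invented for illustration, and the "analysis" is just lowercasing and splitting on whitespace. It shows postings that never change after creation, a delete that only flips liveness, and a merge that finally drops the deleted doc.

```python
class Segment:
    """Toy immutable segment: postings are fixed at creation, only `live` changes."""

    def __init__(self, docs):
        self.docs = dict(docs)                 # doc_id -> text
        postings = {}
        for doc_id, text in self.docs.items():
            for term in set(text.lower().split()):
                postings.setdefault(term, []).append(doc_id)
        # Inverted index: term -> sorted list of doc ids containing it.
        self.postings = {t: sorted(ids) for t, ids in postings.items()}
        self.live = set(self.docs)             # doc ids not marked deleted

    def delete(self, doc_id):
        # A delete only flips liveness; the postings still list the doc.
        self.live.discard(doc_id)

    def search(self, term):
        return [d for d in self.postings.get(term.lower(), []) if d in self.live]


def search(segments, term):
    # Every query visits every segment and concatenates the hits.
    hits = []
    for seg in segments:
        hits.extend(seg.search(term))
    return hits


def merge(segments):
    # Merging rewrites the live docs into one new segment; deleted docs are dropped.
    survivors = {}
    for seg in segments:
        for doc_id in seg.live:
            survivors[doc_id] = seg.docs[doc_id]
    return Segment(survivors)


seg_a = Segment({1: "lucene inverted index", 2: "elasticsearch shard"})
seg_b = Segment({3: "lucene segment merge"})
seg_a.delete(1)                                   # tombstone doc 1
hits_before = search([seg_a, seg_b], "lucene")    # doc 1 is filtered out
merged = merge([seg_a, seg_b])
hits_after = search([merged], "lucene")
```

Before the merge, doc 1 is still present in seg_a's postings and only hidden by the liveness check; after the merge it is gone from the data entirely.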

Example

The lifecycle of one indexed document:

index doc ─▶ in-memory buffer + translog
   refresh (1s) ─▶ new segment, now searchable (not yet fsync'd)
   flush       ─▶ segments fsync'd, translog truncated (durable commit)
   merge       ─▶ small segments combined, deletes purged

This is why a freshly indexed doc is searchable in ~1s but part of a durable Lucene commit only after a flush; until then, crash recovery depends on replaying the translog.
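
A minimal sketch of that lifecycle, under simplifying assumptions: ToyShard and its methods are hypothetical names, not Elasticsearch or Lucene code, the translog is assumed to survive a crash, and there is no merging. It models when a document becomes visible and what survives a crash.

```python
class ToyShard:
    """Toy model of the buffer / translog / segment lifecycle."""

    def __init__(self):
        self.buffer = []      # indexed but not yet searchable
        self.translog = []    # operations since the last commit (assumed durable)
        self.segments = []    # searchable segments, each a tuple of docs
        self.committed = 0    # number of segments covered by the last commit

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # The buffer becomes a new searchable segment; nothing is fsync'd yet.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        # Commit: all segments become durable and the translog is truncated.
        self.refresh()
        self.committed = len(self.segments)
        self.translog = []

    def search(self):
        return [doc for seg in self.segments for doc in seg]

    def crash_and_recover(self):
        # Uncommitted segments and the buffer are lost; the translog is replayed.
        self.segments = self.segments[:self.committed]
        self.buffer = []
        replay, self.translog = self.translog, []
        for doc in replay:
            self.index(doc)
        self.refresh()


shard = ToyShard()
shard.index("doc-1")
visible_before_refresh = shard.search()   # nothing searchable yet
shard.refresh()
visible_after_refresh = shard.search()    # doc-1 is now searchable
shard.flush()                             # doc-1 is now in a durable commit
shard.index("doc-2")
shard.refresh()                           # doc-2 searchable but uncommitted
shard.crash_and_recover()
visible_after_recovery = shard.search()
```

In this model doc-2's segment is discarded by the crash, and the document comes back only because its operation was still in the translog.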

Pitfalls

  • Expecting in-place updates — segments are immutable, so an “update” indexes a new doc and tombstones the old one; churny data fragments the index.
  • Too many small segments — every search hits all of them; refreshing too often (or never merging) hurts query latency.
  • Forcing merges on hot indices — force merge is heavy I/O; only run it on read-only/old indices.
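
The first two pitfalls can be made concrete with a small simulation. It is a toy under stated assumptions: simulate_updates is a hypothetical helper, each refresh produces exactly one segment, and no background merging runs (real merge policies would reduce both numbers over time).

```python
def simulate_updates(n_versions, refresh_every):
    """Return (segment_count, dead_docs) after writing n_versions of one document.

    Toy assumptions: one segment per refresh, no background merging.
    """
    segments = []     # each segment is a list of version numbers
    buffer = []
    dead = set()      # superseded versions that still occupy space
    for version in range(1, n_versions + 1):
        if version > 1:
            dead.add(version - 1)     # an update tombstones the previous version
        buffer.append(version)
        if version % refresh_every == 0:
            segments.append(buffer)   # refresh: buffer becomes a segment
            buffer = []
    if buffer:
        segments.append(buffer)       # final refresh for any leftover docs
    return len(segments), len(dead)


every_write = simulate_updates(100, refresh_every=1)
every_tenth = simulate_updates(100, refresh_every=10)
```

Refreshing less often cuts the number of segments a search must visit, but the count of dead versions is the same in both runs: only a merge reclaims them.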

See also