Tokenizers
The component inside an analyzer that splits a character stream into individual tokens (terms) — and emits their offsets and positions for highlighting and phrase queries.
Why it matters
The tokenizer decides what counts as a “word”. Choose standard and "wi-fi" becomes [wi, fi]; choose whitespace and it stays "wi-fi". Tokenization drives partial-match behavior: edge_ngram is the standard way to build search-as-you-type autocomplete, while the wrong tokenizer makes hyphenated codes or URLs unsearchable.
How it works
Exactly one tokenizer runs per analyzer, after char filters and before token filters. Each token carries a position and start/end offset.
standard— Unicode segmentation (UAX #29); drops most punctuation. Default, good for prose.whitespace— splits only on whitespace; keeps punctuation and case.keyword— emits the entire input as a single token (combine with filters to lowercase a whole string).pattern— splits on a regex (default\W+).ngram— every N-length window; matches substrings anywhere (heavy index growth).edge_ngram— prefixes only ("quick"→q, qu, qui, quic, quick); ideal for autocomplete, far smaller thanngram.path_hierarchy—"/a/b/c"→[/a, /a/b, /a/b/c]; for filesystem-style facets.
| Tokenizer | "Quick-Fox 2" → | Use |
|---|---|---|
standard | [quick, fox, 2] | General text |
whitespace | [Quick-Fox, 2] | Pre-tokenized input |
keyword | [Quick-Fox 2] | Whole-value tokens |
edge_ngram (min2,max5) | [Qu, Qui, Quic, Quick, ...] | Autocomplete |
Example
PUT /ac
{ "settings": { "analysis": {
"tokenizer": { "edge": { "type": "edge_ngram", "min_gram": 2, "max_gram": 10,
"token_chars": ["letter", "digit"] } },
"analyzer": { "ac": { "tokenizer": "edge", "filter": ["lowercase"] } } } },
"mappings": { "properties": {
"title": { "type": "text", "analyzer": "ac", "search_analyzer": "standard" } } } }
Indexing "Laptop" stores [la, lap, lapt, lapto, laptop]; a search for "lap" (analyzed by standard) hits the lap term — instant prefix match.
Pitfalls
ngramindex bloat — full n-grams multiply term count; preferedge_ngramunless you truly need infix matching.- No
search_analyzer— analyzing the query withedge_ngramtoo means"lap"becomes[la, lap]and over-matches; search withstandard. - Unbounded
max_gram— large windows explode the index and slow indexing; cap it (often ≤10–15). - Losing offsets — custom regex tokenizers can break highlighting if offsets aren’t preserved.