Token filters & character filters
The two filtering stages of an analyzer: character filters transform the raw text before the tokenizer; token filters add, remove, or rewrite tokens after it.
Why it matters
These are where most analysis tuning happens. Token filters give you lowercasing, accent folding, stemming, stopwords, synonyms, and n-grams; they decide, for example, whether “café” matches “cafe”. Character filters fix the input before tokenization, e.g. stripping HTML so <b> tags don’t pollute terms. Order matters: a synonym filter placed before lowercase can miss mixed-case input such as “TV”.
How it works
Pipeline order is fixed: char filters → tokenizer → token filters, each chain applied in array order.
- Character filters: html_strip removes markup; mapping does literal char swaps (& → and); pattern_replace rewrites via regex.
- Token filters: operate on the token stream:
  - lowercase / asciifolding: normalize case and strip diacritics.
  - stop: drop stopwords (the, is); language-specific lists.
  - stemmers: porter_stem, kstem, snowball reduce to roots.
  - synonym / synonym_graph: expand "tv" ⇒ "television".
  - edge_ngram: generate prefixes at the token level for autocomplete.
- Position-aware: filters preserve positions so phrase/match_phrase queries still work; multi-word synonyms need synonym_graph.
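As a quick illustration of edge_ngram from the list above, here is a minimal sketch that runs an ad-hoc chain through the _analyze API. The inline filter definition, the min_gram/max_gram values of 1 and 3, and the sample text are illustrative choices, not settings taken from this page.

```json
POST /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "edge_ngram", "min_gram": 1, "max_gram": 3 }
  ],
  "text": "Quick Fox"
}
// → [q, qu, qui, f, fo, fox]
```

Raising max_gram lengthens the longest prefix that gets emitted, at the cost of more terms per token.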
| Stage | Runs | Examples |
|---|---|---|
| Char filter | Before tokenizing | html_strip, mapping, pattern_replace |
| Token filter | After tokenizing | lowercase, stop, porter_stem, synonym |
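To see the stage order for yourself, one option is the explain flag of the _analyze API, which reports the output of each stage separately. The request below is only a sketch, and the sample text is made up for illustration.

```json
POST /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding"],
  "text": "<b>Café</b> Déjà Vu",
  "explain": true
}
// The response reports charfilters, then tokenizer, then tokenfilters.
// The final token filter entry should show: cafe, deja, vu
```

Reading the stages in sequence makes it easier to spot which filter introduced or removed a given token.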
Example
PUT /blog
{
  "settings": {
    "analysis": {
      "char_filter": { "amp": { "type": "mapping", "mappings": ["& => and"] } },
      "filter": { "stops": { "type": "stop", "stopwords": "_english_" } },
      "analyzer": {
        "clean": {
          "char_filter": ["html_strip", "amp"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stops", "porter_stem"]
        }
      }
    }
  }
}
POST /blog/_analyze
{ "analyzer": "clean", "text": "<p>Cats & the Running Foxes</p>" }
// → [cat, run, fox] ("the" dropped, "&" → "and" then stopped, tags stripped)
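An analyzer only affects documents once a field references it. As a hedged sketch, the request below attaches the clean analyzer to a text field; the field name body is hypothetical and not part of the example above.

```json
// Assumes the /blog index from the example already exists.
// "body" is a hypothetical field name chosen for illustration.
PUT /blog/_mapping
{
  "properties": {
    "body": { "type": "text", "analyzer": "clean" }
  }
}
```

Unless a separate search analyzer is configured, the same analyzer is applied to query text against that field, so index-time and search-time terms line up.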
Pitfalls
- Wrong filter order: synonym before lowercase misses mixed-case input; stop before synonym can delete a synonym’s trigger word.
- Multi-word synonyms: plain synonym mishandles them across positions; use synonym_graph (and place it last).
- asciifolding only: folds accents but not case; pair with lowercase.
- Char filters break offsets: a pattern_replace that changes length can misalign highlighting offsets.
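The filter-order pitfall can be reproduced with two ad-hoc _analyze requests that differ only in where lowercase sits. This is a sketch under assumptions: it presumes your version accepts inline synonym definitions in _analyze, the rule reuses the tv ⇒ television example from earlier, and the sample text is invented.

```json
// Synonym first: the filter sees "TV", which the lowercase rule does not match.
POST /_analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "synonym", "synonyms": ["tv => television"] },
    "lowercase"
  ],
  "text": "TV stand"
}

// Lowercase first: the filter sees "tv", so the rule can fire.
POST /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "synonym", "synonyms": ["tv => television"] }
  ],
  "text": "TV stand"
}
```

Comparing the two token lists shows whether the rewrite to television happened in each ordering.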