Token filters & character filters

The two filtering stages of an analyzer: character filters transform the raw text before the tokenizer; token filters add, remove, or rewrite tokens after it.

Why it matters

These are where most analysis tuning happens. Token filters give you lowercasing, accent folding, stemming, stopwords, synonyms, and n-grams — the difference between “café” matching “cafe” and not. Character filters fix the input before tokenization, e.g. stripping HTML so <b> tags don’t pollute terms. Order matters: a synonym filter placed before lowercase may never match.

How it works

Pipeline order is fixed: char filters → tokenizer → token filters, each chain applied in array order.

  • Character filtershtml_strip removes markup; mapping does literal char swaps (& → and); pattern_replace rewrites via regex.
  • Token filters — operate on the token stream:
    • lowercase / asciifolding — normalize case and strip diacritics.
    • stop — drop stopwords (the, is); language-specific lists.
    • stemmers — porter_stem, kstem, snowball reduce to roots.
    • synonym / synonym_graph — expand "tv" ⇒ "television".
    • edge_ngram — generate prefixes at the token level for autocomplete.
  • Position-aware — filters preserve positions so phrase/match_phrase queries still work; multi-word synonyms need synonym_graph.
StageRunsExamples
Char filterBefore tokenizinghtml_strip, mapping, pattern_replace
Token filterAfter tokenizinglowercase, stop, porter_stem, synonym

Example

PUT /blog
{ "settings": { "analysis": {
    "char_filter": { "amp": { "type": "mapping", "mappings": ["& => and"] } },
    "filter":      { "stops": { "type": "stop", "stopwords": "_english_" } },
    "analyzer":    { "clean": { "char_filter": ["html_strip", "amp"],
                                "tokenizer": "standard",
                                "filter": ["lowercase", "stops", "porter_stem"] } } } } }

POST /blog/_analyze
{ "analyzer": "clean", "text": "<p>Cats &amp; the Running Foxes</p>" }
// → [cat, run, fox]   ("the" dropped, "&" → "and" then stopped, tags stripped)

Pitfalls

  • Wrong filter ordersynonym before lowercase misses mixed-case input; stop before synonym can delete a synonym’s trigger word.
  • Multi-word synonyms — plain synonym mishandles them across positions; use synonym_graph (and place it last).
  • asciifolding only — folds accents but not case; pair with lowercase.
  • Char filters break offsetspattern_replace that changes length can misalign highlighting offsets.

See also