Automata & formal languages

A formal language is a set of strings over an alphabet, and an automaton is an abstract machine that decides which strings belong — the foundation for parsing, regex, and what “computable” even means.

Why it matters

This is where computation gets a precise definition. Regular expressions, the lexers in every compiler, protocol validators, and search engines all rest on automata theory. The Chomsky hierarchy tells you, before you write a line of code, whether a problem is even solvable with the tool you reached for — e.g. why you cannot match balanced parentheses with a true regular expression.

How it works

The classic ladder of language classes, each strictly more powerful, with a matching machine:

  • Regular — recognized by finite automata (DFA/NFA); exactly what regular expressions describe. Finite memory only, so no counting.
  • Context-free — recognized by pushdown automata (a finite machine plus a stack); generated by grammars like S -> aSb | ε. The stack lets them balance nesting.
  • Context-sensitive and recursively enumerable — top of the ladder, recognized by the Turing machine, the model of general computation.

A DFA is a directed graph of states: an edge per input symbol, one start state, a set of accepting states. An NFA allows multiple/ε transitions but recognizes the same languages — you can always determinize it via subset construction. The pumping lemma is the standard tool for proving a language is not regular (or not context-free).

Example

A DFA over {a, b} accepting strings with an even number of as — two states, even and odd:

state even:  on a -> odd;   on b -> even   (accepting)
state odd:   on a -> even;  on b -> odd
start = even

Feed abba: even -> odd -> odd -> odd -> even accept. A trie is a closely related structure: a DFA-shaped tree that recognizes a finite set of words.

Pitfalls

  • Asking regex to count or balance — matching a^n b^n or nested brackets needs a stack; that is context-free, beyond true regular languages. (Many “regex” engines cheat with extensions.)
  • Confusing NFA power with DFA power — nondeterminism adds convenience, not expressiveness; both recognize exactly the regular languages.
  • Forgetting the pumping lemma proves a negative only — passing it does not prove a language is regular.

See also