Tree of Thought (ToT)

A reasoning architecture that explores multiple candidate reasoning paths as a search tree — generating several next thoughts at each step, scoring them, and backtracking from dead ends.

Why it matters

CoT commits to one linear chain; if an early step is wrong, the whole answer is wrong with no recovery. ToT treats reasoning as search over a tree of partial solutions, so the model can branch, evaluate, prune bad branches, and backtracking — turning a one-shot guess into a deliberate look-ahead. On tasks needing exploration (Game of 24, puzzles, planning), the original paper reports ToT solving 74% of Game-of-24 cases versus ~4% for CoT. The cost is many more LLM calls, so it’s reserved for hard problems where a single chain reliably fails.

How it works

Four pluggable components, run as a tree search:

ComponentJob
Thought generatorpropose k next steps from a node
State evaluatorscore each node (LLM rates “sure/maybe/no” or a value)
Search strategyBFS or DFS over the frontier
Pruningdrop low-scored branches; keep top-b (beam)

The loop, in prose: from the current partial solution, generate k candidate next thoughts; evaluate each (have the model judge how promising it is); keep the best b (the beam width); expand those next. DFS dives deep and backtracking on a dead end; BFS keeps a frontier of the b best states per depth. Search ends at a complete solution or a depth limit. Total cost scales with branching: roughly b × depth × (generate + evaluate) calls — easily 50–100× a single CoT pass.

Example

Game of 24 with inputs 4, 9, 10, 13 (combine to make 24):

root: {4,9,10,13}
 ├─ 13-9=4  -> {4,4,10}   eval: maybe
 │   └─ 10-4=6 -> {4,6}   -> 4*6=24  ✓ SOLVED
 ├─ 10-4=6  -> {6,9,13}   eval: no   (pruned)
 └─ 4+9=13  -> {10,13,13} eval: no   (pruned)

The evaluator kills the two weak branches early; the search spends its budget on the promising {4,4,10} line and finds (13-9) then (10-4)·4 = 24.

Pitfalls

  • Cost explosion. Branching factor × depth multiplies calls fast; cap beam width and depth or a single question costs dollars and minutes.
  • Weak evaluator = blind search. If the LLM can’t reliably score partial states, pruning is random and ToT degrades to expensive guessing — validate the evaluator first.
  • Wrong tool for the job. ToT shines only when problems decompose into checkable intermediate states; for retrieval-style tasks, ReAct grounding beats deeper internal search.
  • Confusing ToT with multi-sample. Self-consistency (sample N chains, majority-vote) is not ToT — it has no per-state evaluation or backtracking.

See also