Summarization / Compression

Summarization compresses old conversation history into a shorter representation so a long-running agent stays under its context window without forgetting the gist of what happened.

Why it matters

Agent loops are append-only — every tool result and turn grows short-term memory until it overflows. The naive fix (drop old turns) loses facts; the better fix is to compress them: a rolling summary keeps a 50-turn chat coherent in 2K tokens instead of 40K, cutting per-turn cost and latency while preserving decisions, constraints, and entities the model still needs.

How it works

Replace a span of verbose messages with a dense summary, then keep appending — the summary itself rolls forward as new turns arrive.

  • Rolling / recursive summary. When history crosses a threshold, summarize the oldest turns into a running synopsis; new turns append after it. The summary is re-summarized as it grows.
  • Buffer + summary (hybrid). Keep the last K turns verbatim (recency matters) plus a summary of everything older — LangChain’s ConversationSummaryBufferMemory pattern.
  • Structured compression. Extract a typed scratchpad (decisions, open_questions, entities) instead of prose — denser and less lossy than free text.
  • Tool-output trimming. Summarize or truncate large observations before they enter history, not after they’ve already bloated it.
TechniqueKeepsTradeoff
Truncate (drop)recent onlycheap, lossy
Rolling summarygist of allLLM call cost, lossy
Buffer + summaryrecent + gistbest recall/cost balance

Example

A 200K-window agent that summarizes every 10 turns:

turns 1–10 (≈18K tokens of back-and-forth)
  → summary call → 600 tokens:
    "User U7 debugging a 502 on /checkout. Ruled out DNS and
     the LB. Suspect the payment service. Has prod log access."
history now = [summary 600t] + turns 11–14 verbatim
per-turn input: ~190K → ~14K

The 502 context, the ruled-out causes, and the current hypothesis all survive in 600 tokens; the verbose transcript is gone.

Pitfalls

  • Lossy on specifics. Summaries drop exact IDs, numbers, and code; extract those to long-term-memory before compressing, don’t trust prose to keep them.
  • Error compounding. Recursively summarizing a summary amplifies omissions and hallucinations — anchor on the original where you can.
  • Summarizing too eagerly. Compressing the last few turns hurts recency; keep a verbatim recent buffer.
  • Cache invalidation. Rewriting history to insert a summary busts the prompt cache for that prefix — summarize at stable boundaries.

See also