Fine-tuning vs Prompt Engineering

Two ways to steer an LLM without training from scratch: change the input at inference time (prompting) or change the weights on your data (fine-tuning).

Why it matters

It’s a build-vs-buy decision for behavior. Prompting is instant, free of training infra, and easy to iterate — the right first move for almost every agent. Fine-tuning earns its keep only when prompting plateaus: you need a consistent style/format, a narrow skill, lower latency, or shorter prompts at scale. Reaching for a fine-tune to add facts is the classic mistake — that’s a retrieval problem, not a weights problem.

How it works

Order of escalation, cheapest first: prompt → few-shot → rag-and-vector-databases → fine-tune. Climb only when the rung below stops improving.

  • Prompt engineering shapes one request: clear instructions, provide-additional-context, few-shot examples, output schema. Zero training cost; changes ship instantly.
  • RAG injects fresh, source-grounded facts at query time — the right tool for knowledge that changes.
  • Fine-tuning (usually LoRA/PEFT — train a small adapter, not all weights) bakes in behavior: tone, a domain’s jargon, reliable JSON, a tool-call dialect. Needs a labeled dataset and an eval set.
ConcernPromptingFine-tuning
Adds knowledgeweak (use RAG)weak, goes stale
Adds behavior/formatmoderatestrong
Iteration speedsecondshours–days
Up-front cost~0data + GPU time
Per-call costhigher (long prompts)lower (short prompts)

Example

A support agent must always answer in a strict JSON triage schema.

v1 prompt + 3 examples  → ~88% schema-valid, ~1.5K-token prompt
v2 RAG for policy facts → accurate, still 88% format
v3 LoRA on 2K labeled   → ~99% valid, prompt shrinks to ~200 tokens
                          (lower latency + cost per call)

Facts came from RAG; format reliability came from the fine-tune — each fixed what the other couldn’t.

Pitfalls

  • Fine-tuning for facts. Weights freeze at train time and go stale; use RAG for knowledge, fine-tuning for behavior.
  • Skipping prompt work first. Many “we need a fine-tune” problems vanish with a clear instruction and two examples.
  • No eval set. Without a held-out benchmark you can’t tell a fine-tune helped or quietly regressed other tasks (catastrophic forgetting).
  • Tiny or dirty datasets. A few hundred noisy examples overfit; quality and coverage beat volume.

See also