Fine-tuning vs Prompt Engineering

Two ways to steer an LLM without training from scratch: change the input at inference time (prompting) or change the weights on your data (fine-tuning).

Why it matters

It’s a build-vs-buy decision for behavior. Prompting is instant, free of training infra, and easy to iterate — the right first move for almost every agent. Fine-tuning earns its keep only when prompting plateaus: you need a consistent style/format, a narrow skill, lower latency, or shorter prompts at scale. Reaching for a fine-tune to add facts is the classic mistake — that’s a retrieval problem, not a weights problem.

How it works

Order of escalation, cheapest first: prompt → few-shot → rag-and-vector-databases → fine-tune. Climb only when the rung below stops improving.

Prompt engineering shapes one request: clear instructions, provide-additional-context, few-shot examples, output schema. Zero training cost; changes ship instantly.
RAG injects fresh, source-grounded facts at query time — the right tool for knowledge that changes.
Fine-tuning (usually LoRA/PEFT — train a small adapter, not all weights) bakes in behavior: tone, a domain’s jargon, reliable JSON, a tool-call dialect. Needs a labeled dataset and an eval set.

Concern	Prompting	Fine-tuning
Adds knowledge	weak (use RAG)	weak, goes stale
Adds behavior/format	moderate	strong
Iteration speed	seconds	hours–days
Up-front cost	~0	data + GPU time
Per-call cost	higher (long prompts)	lower (short prompts)

Example

A support agent must always answer in a strict JSON triage schema.

v1 prompt + 3 examples  → ~88% schema-valid, ~1.5K-token prompt
v2 RAG for policy facts → accurate, still 88% format
v3 LoRA on 2K labeled   → ~99% valid, prompt shrinks to ~200 tokens
                          (lower latency + cost per call)

Facts came from RAG; format reliability came from the fine-tune — each fixed what the other couldn’t.

Pitfalls

Fine-tuning for facts. Weights freeze at train time and go stale; use RAG for knowledge, fine-tuning for behavior.
Skipping prompt work first. Many “we need a fine-tune” problems vanish with a clear instruction and two examples.
No eval set. Without a held-out benchmark you can’t tell a fine-tune helped or quietly regressed other tasks (catastrophic forgetting).
Tiny or dirty datasets. A few hundred noisy examples overfit; quality and coverage beat volume.

tech-studies

Explorer

Fine-tuning vs Prompt Engineering

Fine-tuning vs Prompt Engineering

Why it matters

How it works

Example

Pitfalls

See also

Graph View

Table of Contents

Backlinks