LangSmith

LangChain’s hosted platform for tracing, evaluating, and monitoring LLM apps — framework-agnostic despite the name, but tightest with langchain and langgraph.

Why it matters

When an agent-loop makes 12 nested model and tool calls, a flat log is useless; you need the trace tree to see which step hallucinated or stalled. LangSmith captures that tree, lets you add any captured run to a dataset, and runs evaluators over that dataset in CI, closing the loop from “saw a bad prod run” to “added it to the regression set”. It’s the default choice if your stack already speaks LangChain.

How it works

The core object is a run (one unit of work); nested runs form a trace. You attach evaluators that score runs against a reference dataset.

Concept     Meaning
Run         one LLM/tool/chain call, nestable
Trace       the full tree for one request
Dataset     input→reference examples for eval
Evaluator   fn or LLM judge scoring a run
Experiment  one eval pass over a dataset

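
To make the run/trace relationship concrete, here is a minimal sketch of a trace as a tree of nested runs. The Run class, its fields, and the helper names are hypothetical stand-ins for illustration, not the LangSmith SDK's own types.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Run:
    """Hypothetical stand-in for one unit of work (not the LangSmith class)."""
    name: str
    run_type: str                       # e.g. "chain", "llm", "tool"
    latency_ms: float = 0.0
    error: Optional[str] = None
    children: list[Run] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[tuple[int, Run]]:
        """Yield (depth, run) pairs in depth-first order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


def render(trace: Run) -> str:
    """Format the trace as an indented tree, flagging runs that errored."""
    lines = []
    for depth, run in trace.walk():
        flag = f"  !! {run.error}" if run.error else ""
        lines.append(
            f"{'  ' * depth}{run.name} [{run.run_type}] {run.latency_ms:.0f}ms{flag}"
        )
    return "\n".join(lines)


def failed_runs(trace: Run) -> list[str]:
    """Names of every run in the trace that recorded an error."""
    return [run.name for _, run in trace.walk() if run.error]


# One request: the root run plus its nested children form the trace.
trace = Run("agent", "chain", 950, children=[
    Run("plan", "llm", 400),
    Run("search", "tool", 500, error="timeout"),
])
print(render(trace))
print(failed_runs(trace))
```

The indented rendering is what a flat log loses: the failing tool call is visibly a child of the agent run rather than an unrelated line.
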
  • Auto-instrumentation. Set LANGCHAIN_TRACING_V2=true (newer SDK versions also accept LANGSMITH_TRACING=true) and provide an API key, and LangChain/LangGraph runs trace themselves; outside the framework, wrap functions with the @traceable decorator.
  • Evaluators. Built-in (exact match, embedding distance, LLM-as-judge for correctness/helpfulness) or custom Python; run them via evaluate() over a dataset.
  • Datasets from prod. Click a bad trace → add to dataset → it becomes a permanent test case (see integration-testing-for-flows).
  • Monitoring. Dashboards for latency, cost, error rate, and feedback (metrics-to-track); supports online evals on sampled live traffic.
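
As a rough illustration of the @traceable route for code outside LangChain, the sketch below wraps two plain functions. The function names and return values are made up; the run_type and name arguments are used as I understand the decorator to accept them. The import falls back to a no-op decorator so the sketch still runs where the SDK is not installed.

```python
try:
    from langsmith import traceable
except ImportError:
    # Fallback so the sketch runs without the SDK: a no-op decorator that
    # supports both @traceable and @traceable(...) call styles.
    def traceable(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda fn: fn


@traceable(run_type="tool", name="lookup_order")
def lookup_order(order_id: str) -> dict:
    # Hypothetical tool: a real one would query a database or API.
    return {"order_id": order_id, "status": "shipped"}


@traceable(run_type="chain", name="support_agent")
def support_agent(question: str, order_id: str) -> dict:
    # A traced function called inside another traced function is recorded
    # as a child run, which is how the trace tree gets its nesting.
    order = lookup_order(order_id)
    return {"answer": f"Order {order['order_id']} is {order['status']}."}


result = support_agent("Where is my order?", "A-17")
print(result)
```

With tracing switched off the decorated functions behave like ordinary Python, so the decorator can stay in place across environments.
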

Example

Regression-testing a prompt change:

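from langsmith.evaluation import evaluate, LangChainStringEvaluator

def correct(run, example):                 # custom evaluator: exact match on "answer"
    return {"key": "correct",
            "score": run.outputs["answer"] == example.outputs["answer"]}

# my_agent: your target callable; takes an example's inputs, returns {"answer": ...}
evaluate(my_agent, data="qa-goldset-v3",
         evaluators=[correct, LangChainStringEvaluator("qa")])  # "qa" = built-in LLM judge
# → experiment: 91% correct, vs 88% on previous prompt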

Every run is browsable in the trace UI; the experiment diff shows which examples flipped.
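
The "which examples flipped" comparison can be sketched in plain Python. This is only an illustration of the idea, assuming per-example pass/fail scores keyed by example ID; the scores below are made up, and the function names are hypothetical rather than part of LangSmith.

```python
from __future__ import annotations


def flipped(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """Compare per-example pass/fail between two experiments.

    Keys are example IDs; only examples present in both runs are compared.
    """
    shared = sorted(before.keys() & after.keys())
    return {
        "fixed": [ex for ex in shared if not before[ex] and after[ex]],
        "regressed": [ex for ex in shared if before[ex] and not after[ex]],
    }


def accuracy(scores: dict[str, bool]) -> float:
    """Fraction of examples that passed."""
    return sum(scores.values()) / len(scores)


# Made-up scores for illustration, not real experiment output.
old_prompt = {"ex1": True, "ex2": False, "ex3": True, "ex4": False}
new_prompt = {"ex1": True, "ex2": True, "ex3": False, "ex4": True}

diff = flipped(old_prompt, new_prompt)
print(diff)
print(accuracy(old_prompt), accuracy(new_prompt))
```

An aggregate score can rise while individual examples regress, which is why the per-example diff is worth reading before shipping a prompt change.
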

Pitfalls

  • Sending PII to a hosted backend. Traces include raw prompts/outputs — redact or self-host before logging user data (data-privacy-pii-redaction).
  • Trusting the LLM judge blindly. Calibrate its verdicts against human labels before gating on them.
  • Tracing in the hot path. Default export is async/batched; a synchronous custom wrapper can add latency — keep it off the critical path.
  • Assuming LangChain-only. It traces any code via @traceable; don’t rewrite your agent just to use it.
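
One way to act on the judge-calibration pitfall is to measure how often the LLM judge agrees with human labels on a small hand-labeled sample before using it as a gate. The sketch below computes raw agreement and Cohen's kappa for binary verdicts; the labels and the MIN_KAPPA threshold are hypothetical, and the right threshold depends on your task.

```python
from __future__ import annotations


def agreement(judge: list[bool], human: list[bool]) -> float:
    """Fraction of examples where the judge matches the human label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[bool], human: list[bool]) -> float:
    """Agreement corrected for chance (1.0 = perfect, 0.0 = chance level)."""
    n = len(judge)
    p_observed = agreement(judge, human)
    p_judge_yes = sum(judge) / n
    p_human_yes = sum(human) / n
    # Probability that both raters agree by chance alone.
    p_chance = p_judge_yes * p_human_yes + (1 - p_judge_yes) * (1 - p_human_yes)
    if p_chance == 1.0:
        return 1.0  # convention: both raters gave the same constant label
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical verdicts on the same eight examples.
judge_labels = [True, True, True, False, False, True, False, True]
human_labels = [True, True, False, False, False, True, True, True]

MIN_KAPPA = 0.6  # hypothetical bar; choose one that suits the task
kappa = cohens_kappa(judge_labels, human_labels)
print(f"agreement={agreement(judge_labels, human_labels):.2f} kappa={kappa:.2f}")
print("judge trusted for gating:", kappa >= MIN_KAPPA)
```

Kappa is reported alongside raw agreement because a judge that says "correct" almost every time can score high agreement on a mostly-correct dataset without being informative.
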

See also