Evaluation / Observability

Lessons in this group, roughly in build order:

metrics-to-track — The small set of quality, cost, latency, and reliability numbers you instrument on every agent run so you…
human-in-the-loop-evaluation — Using human judgment — ratings, pairwise preferences, or approvals — to score agent output on the…
langsmith — LangChain’s hosted platform for tracing, evaluating, and monitoring LLM apps — framework-agnostic despite…
langfuse — An open-source LLM observability and evaluation platform — traces, prompt management, evals, and cost…
helicone — An open-source LLM observability layer that works as a proxy/gateway — you change the base URL, and it…
deepeval — An open-source LLM evaluation framework that feels like Pytest for LLM output — you write assert-style…
ragas — An open-source framework for evaluating RAG pipelines specifically, scoring retrieval and generation…
openllmetry — An open-source set of OpenTelemetry extensions (by Traceloop) that auto-instrument LLM and vector-DB…
structured-logging-tracing — Emitting machine-readable, queryable records of what an agent did — structured key-value logs for events…
integration-testing-for-flows — Testing a whole agent run end-to-end — perception → reasoning → tools → final answer — against a frozen…
unit-testing — Testing the deterministic pieces around the model in isolation — tool functions, schema parsing, prompt…
