Evaluation / Observability
Lessons in this group, roughly in build order:
- metrics-to-track — The small set of quality, cost, latency, and reliability numbers you instrument on every agent run so you…
- human-in-the-loop-evaluation — Using human judgment — ratings, pairwise preferences, or approvals — to score agent output on the…
- langsmith — LangChain’s hosted platform for tracing, evaluating, and monitoring LLM apps — framework-agnostic despite…
- langfuse — An open-source LLM observability and evaluation platform — traces, prompt management, evals, and cost…
- helicone — An open-source LLM observability layer that works as a proxy/gateway — you change the base URL, and it…
- deepeval — An open-source LLM evaluation framework that feels like pytest for LLM output — you write assert-style…
- ragas — An open-source framework for evaluating RAG pipelines specifically, scoring retrieval and generation…
- openllmetry — An open-source set of OpenTelemetry extensions (by Traceloop) that auto-instrument LLM and vector-DB…
- structured-logging-tracing — Emitting machine-readable, queryable records of what an agent did — structured key-value logs for events…
- integration-testing-for-flows — Testing a whole agent run end-to-end — perception → reasoning → tools → final answer — against a frozen…
- unit-testing — Testing the deterministic pieces around the model in isolation — tool functions, schema parsing, prompt…