Evaluation / Observability

Lessons in this group, roughly in build order:

  • metrics-to-track — The small set of quality, cost, latency, and reliability numbers you instrument on every agent run so you…
  • human-in-the-loop-evaluation — Using human judgment — ratings, pairwise preferences, or approvals — to score agent output on the…
  • langsmith — LangChain’s hosted platform for tracing, evaluating, and monitoring LLM apps — framework-agnostic despite…
  • langfuse — An open-source LLM observability and evaluation platform — traces, prompt management, evals, and cost…
  • helicone — An open-source LLM observability layer that works as a proxy/gateway — you change the base URL, and it…
  • deepeval — An open-source LLM evaluation framework that feels like Pytest for LLM output — you write assert-style…
  • ragas — An open-source framework for evaluating RAG pipelines specifically, scoring retrieval and generation…
  • openllmetry — An open-source set of OpenTelemetry extensions (by Traceloop) that auto-instrument LLM and vector-DB…
  • structured-logging-tracing — Emitting machine-readable, queryable records of what an agent did — structured key-value logs for events…
  • integration-testing-for-flows — Testing a whole agent run end-to-end — perception → reason → tools → final answer — against a frozen…
  • unit-testing — Testing the deterministic pieces around the model in isolation — tool functions, schema parsing, prompt…
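
As a rough illustration of how the metrics-to-track and structured-logging-tracing entries fit together, the sketch below emits one JSON record per agent run that covers latency, cost, reliability, and an optional quality score. It is a minimal sketch and not taken from any of the lessons: the `run_record` helper, its field names, and the per-token price are all hypothetical.

```python
import json
import time
import uuid


def run_record(run_id, started, ended, prompt_tokens, completion_tokens,
               usd_per_1k_tokens, succeeded, quality_score=None):
    """Build one machine-readable record for a single agent run.

    Hypothetical helper: the field names are illustrative, not a standard schema.
    """
    total_tokens = prompt_tokens + completion_tokens
    return {
        "event": "agent_run",
        "run_id": run_id,
        "latency_s": round(ended - started, 3),                          # latency
        "total_tokens": total_tokens,
        "cost_usd": round(total_tokens / 1000 * usd_per_1k_tokens, 6),   # cost
        "succeeded": succeeded,                                          # reliability
        "quality_score": quality_score,                                  # quality, if scored
    }


if __name__ == "__main__":
    start = time.monotonic()
    # ... the agent would run here ...
    record = run_record(
        run_id=str(uuid.uuid4()),
        started=start,
        ended=time.monotonic(),
        prompt_tokens=1200,
        completion_tokens=300,
        usd_per_1k_tokens=0.002,  # placeholder price, not a real rate
        succeeded=True,
    )
    # One JSON object per line keeps the log queryable by downstream tools.
    print(json.dumps(record, sort_keys=True))
```

Writing each run as a single flat JSON line is one common choice because most log stores can filter and aggregate on keys such as `cost_usd` or `succeeded` without custom parsing.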