v0.2.3 · MIT · 2 dependencies

The diff tool for AI agent runs.

Kalibra is the CLI that catches what the dashboard misses. Compare two trace files. Get a statistical verdict. Exit 1 when something regressed.

$ pip install kalibra
View on GitHub →
baseline.jsonl (main)
traces: 100
success rate: 50.0%
cost (median): $0.036
duration (median): 7.6s

current.jsonl (PR #482)
traces: 100
success rate: 75.0%
cost (median): $0.021
duration (median): 15.2s
$ kalibra compare baseline.jsonl current.jsonl
Kalibra Compare
──────────────────────────────────────────────────────────
Success rate   50.0% → 75.0%     +25.0 pp
Cost           $0.036 → $0.021   −40.5%
Duration       7.6s → 15.2s      +99.1%

Trace breakdown
20 matched — ✓ 10 improved, ✗ 5 regressed

Quality gates
[ OK ] success_rate_delta >= −5
[FAIL] duration_delta_pct <= 30
──────────────────────────────────────────────────────────
FAILED — quality gate violation (exit code 1)
What just happened

The aggregate looked great. Five tasks completely broke.

+25 pp

Dashboard says ship it

Success rate up. Cost down. Every aggregate metric green-lit this deploy.

5/5 → 0/5

Five tasks died silently

Per-task breakdown surfaced them. New easy tasks lifted the average and hid the failures.

+99%

Quality gate held

duration_delta_pct ≤ 30 violated. PR blocked. Exit code 1. Merge stopped.
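
The three observations above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Kalibra's implementation: the trace data is made up to match the headline numbers on this page (100 traces per run, 20 tasks with 5 runs each, 50% → 75% success), and the names `overall_rate`, `fully_broken`, and `gate_exit_code` are hypothetical.

```python
from collections import defaultdict


def runs(task_ids, ok, n=5):
    """Build n (task_id, ok) pairs per task. Hypothetical helper for demo data."""
    return [(t, ok) for t in task_ids for _ in range(n)]


def overall_rate(traces):
    """Aggregate success rate across all traces."""
    return sum(ok for _, ok in traces) / len(traces)


def success_by_task(traces):
    """Map task id -> (successes, total)."""
    counts = defaultdict(lambda: [0, 0])
    for task_id, ok in traces:
        counts[task_id][0] += int(ok)
        counts[task_id][1] += 1
    return {t: tuple(c) for t, c in counts.items()}


def fully_broken(baseline, current):
    """Tasks present in both runs that always passed before and never pass now."""
    base, cur = success_by_task(baseline), success_by_task(current)
    return sorted(
        t for t in base.keys() & cur.keys()
        if base[t][0] == base[t][1] and cur[t][0] == 0
    )


def gate_exit_code(duration_before, duration_after, max_increase_pct=30.0):
    """Return 1 if the relative duration increase exceeds the gate, else 0."""
    delta_pct = (duration_after - duration_before) / duration_before * 100.0
    return 1 if delta_pct > max_increase_pct else 0


# Made-up data: 5 tasks break, 5 stay green, 10 get fixed.
broken = [f"task-{i}" for i in range(5)]
stable = [f"task-{i}" for i in range(5, 10)]
fixed = [f"task-{i}" for i in range(10, 20)]

baseline = runs(broken, True) + runs(stable, True) + runs(fixed, False)
current = runs(broken, False) + runs(stable, True) + runs(fixed, True)

print(f"success rate: {overall_rate(baseline):.1%} -> {overall_rate(current):.1%}")
print("fully broken tasks:", fully_broken(baseline, current))
print("gate exit code:", gate_exit_code(7.6, 15.2))
```

In this toy data the aggregate rises by 25 percentage points while five tasks go from 5/5 to 0/5, which is why a per-task view is needed alongside the average.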

"Unsuccessful AI products almost always share a common root cause: a failure to create robust evaluation systems."

Statistically transparent

Two-proportion z-test: success / error rate
Percentile bootstrap (n=1000): cost / tokens / duration
Per-task z-test grid: finds hidden regressions
BH FDR correction: on the roadmap
BCa bootstrap: on the roadmap
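
For readers who want to see what the first two methods in the list compute, here is a minimal standard-library sketch. It is not Kalibra's code: the function names are hypothetical, the bootstrap is assumed to target the difference of medians (the page only says "percentile bootstrap"), and the duration samples are synthetic.

```python
import math
import random
import statistics


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: both runs share one success rate."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:  # all successes or all failures in both runs
        return 0.0, 1.0
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value


def bootstrap_median_diff_ci(baseline, current, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for median(current) - median(baseline)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        b = rng.choices(baseline, k=len(baseline))  # resample with replacement
        c = rng.choices(current, k=len(current))
        diffs.append(statistics.median(c) - statistics.median(b))
    diffs.sort()
    lo = diffs[round((alpha / 2) * (n_resamples - 1))]
    hi = diffs[round((1 - alpha / 2) * (n_resamples - 1))]
    return lo, hi


# Success counts taken from the example on this page: 50/100 vs 75/100.
z, p = two_proportion_z_test(50, 100, 75, 100)
print(f"z = {z:.2f}, p = {p:.5f}")

# Synthetic durations centred on the page's medians (7.6s and 15.2s).
rng = random.Random(42)
base_durations = [rng.gauss(7.6, 1.0) for _ in range(100)]
cur_durations = [rng.gauss(15.2, 2.0) for _ in range(100)]
lo, hi = bootstrap_median_diff_ci(base_durations, cur_durations)
print(f"95% CI for median duration change: [{lo:.2f}s, {hi:.2f}s]")
```

A confidence interval that excludes zero suggests the shift in the median is unlikely to be resampling noise. The percentile method is the simplest bootstrap interval, which is presumably why BCa is listed as a later refinement.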
Audience

Built for the layer below the eval.

For you if

  • You run agent evals in CI and want a real regression gate
  • You've been burned by averages hiding regressions
  • You prefer a CLI and a config file over another UI to log into
  • You already use Phoenix, OTel GenAI, or Langfuse and just want to diff two runs

Reach for something else if

  • You need to judge whether outputs are correct. Kalibra measures change between two runs, not quality; pair it with an LLM judge or rule-based scorer.
  • You need real-time monitoring or alerts on live traffic. Kalibra is offline: it compares two trace files captured at a point in time.
  • You need trace storage, search, or a hosted team UI. Phoenix, Langfuse, and Braintrust do this; Kalibra reads what they export.

Reads the trace format you already have

Auto-detected. No field mapping for standard formats.