Kalibra is the CLI that catches what the dashboard misses. Compare two trace files. Get a statistical verdict. Exit 1 when something regressed.
Success rate up. Cost down. Every aggregate metric green-lit this deploy.
A per-task breakdown surfaced the failures. New easy tasks had lifted the average and hidden them.
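The masking effect is easy to reproduce. The sketch below is not Kalibra's implementation. It uses invented task names and outcomes (`refund`, `greeting`) and hypothetical helper functions to show how adding easy tasks can raise the overall success rate while an existing task gets worse.

```python
from collections import defaultdict

def success_rate(rows):
    """Overall fraction of successful runs across all tasks."""
    return sum(ok for _, ok in rows) / len(rows)

def per_task_rates(rows):
    """Success rate grouped by task id."""
    buckets = defaultdict(list)
    for task, ok in rows:
        buckets[task].append(ok)
    return {task: sum(v) / len(v) for task, v in buckets.items()}

def regressed_tasks(before, after):
    """Tasks present in both runs whose success rate dropped."""
    return sorted(t for t in before if t in after and after[t] < before[t])

# Hypothetical data: (task_id, succeeded)
baseline = [("refund", True)] * 8 + [("refund", False)] * 2
current = (
    [("refund", True)] * 5
    + [("refund", False)] * 5
    + [("greeting", True)] * 20  # new, easy task added in this deploy
)

overall_before = success_rate(baseline)
overall_after = success_rate(current)
regressions = regressed_tasks(per_task_rates(baseline), per_task_rates(current))
```

Here the aggregate goes up because the new `greeting` runs all succeed, yet `refund` falls from 8/10 to 5/10. Only the per-task comparison flags it.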
duration_delta_pct ≤ 30 violated. PR blocked. Exit code 1. Merge stopped.
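A threshold gate of this kind can be sketched in a few lines. This is an illustration under stated assumptions, not Kalibra's code: the function names are hypothetical, and comparing medians is an assumption about how the delta might be computed.

```python
import statistics

MAX_DURATION_DELTA_PCT = 30.0  # mirrors the `duration_delta_pct ≤ 30` rule

def duration_delta_pct(baseline, current):
    """Percent change in median duration (median is an assumption here)."""
    base = statistics.median(baseline)
    return (statistics.median(current) - base) / base * 100.0

def gate(baseline, current, limit=MAX_DURATION_DELTA_PCT):
    """Return a process exit code: 0 if within the limit, 1 if violated."""
    return 0 if duration_delta_pct(baseline, current) <= limit else 1

# Hypothetical durations in seconds
baseline_s = [1.0, 1.2, 1.1, 0.9, 1.0]
current_s = [1.5, 1.6, 1.4, 1.7, 1.5]

exit_code = gate(baseline_s, current_s)
# A CI wrapper would finish with sys.exit(exit_code) so that a
# non-zero code fails the job and blocks the merge.
```

Returning the exit code from a plain function, rather than exiting inside it, keeps the check testable on its own.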
"Unsuccessful AI products almost always share a common root cause: a failure to create robust evaluation systems."
Span tree aggregation. Token + cost extraction from LLM spans only.
gen_ai.*: Validated with opentelemetry-instrumentation-openai-v2. Finish-reason aware.
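As a rough illustration of span tree aggregation, the sketch below walks a parent/child span tree and sums token usage from LLM spans only. The span record shape and the `is_llm_span` rule (any attribute key starting with `gen_ai.`) are assumptions made for this example. The `gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens` attribute names come from the OpenTelemetry GenAI semantic conventions. Cost is left out because it would need a price table that the source does not describe.

```python
from collections import defaultdict

IN_KEY = "gen_ai.usage.input_tokens"
OUT_KEY = "gen_ai.usage.output_tokens"

# Hypothetical, simplified span records for one trace
spans = [
    {"span_id": "a", "parent_id": None, "attributes": {}},
    {"span_id": "b", "parent_id": "a", "attributes": {IN_KEY: 120, OUT_KEY: 30}},
    {"span_id": "c", "parent_id": "a", "attributes": {"tool.name": "search"}},
    {"span_id": "d", "parent_id": "c", "attributes": {IN_KEY: 80, OUT_KEY: 20}},
]

def is_llm_span(span):
    """Assumption: a span counts as an LLM span if it has any gen_ai.* attribute."""
    return any(key.startswith("gen_ai.") for key in span["attributes"])

def aggregate_tokens(spans):
    """Sum token usage per root span, counting LLM spans only."""
    children = defaultdict(list)
    roots = []
    for span in spans:
        if span["parent_id"] is None:
            roots.append(span)
        else:
            children[span["parent_id"]].append(span)

    def walk(span):
        attrs = span["attributes"]
        total_in = attrs.get(IN_KEY, 0) if is_llm_span(span) else 0
        total_out = attrs.get(OUT_KEY, 0) if is_llm_span(span) else 0
        for child in children[span["span_id"]]:
            child_in, child_out = walk(child)
            total_in += child_in
            total_out += child_out
        return total_in, total_out

    result = {}
    for root in roots:
        total_in, total_out = walk(root)
        result[root["span_id"]] = {"input_tokens": total_in, "output_tokens": total_out}
    return result

totals = aggregate_tokens(spans)
```

The nested LLM call under the tool span (`d`) is still counted, because the walk follows the whole tree rather than only direct children of the root.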
Flat JSONL: One trace per line. Map fields once in kalibra.yml.
kalibra inspect --suggest reads your data and prints field mappings.
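To make the idea of a one-time field mapping concrete, here is a minimal Python sketch. It does not show `kalibra.yml` syntax, which the text above does not specify. The canonical names, the source keys, and the `load_traces` helper are all hypothetical.

```python
import json

# Hypothetical mapping: canonical name -> key used in your JSONL records.
FIELD_MAP = {
    "trace_id": "id",
    "success": "passed",
    "duration_s": "latency",
    "cost_usd": "usd",
}

def load_traces(lines, field_map=FIELD_MAP):
    """Parse one JSON object per line and rename fields via the mapping."""
    traces = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        traces.append({canon: record.get(src) for canon, src in field_map.items()})
    return traces

# Hypothetical flat JSONL content, one trace per line
sample = [
    '{"id": "t1", "passed": true, "latency": 1.2, "usd": 0.004}',
    '{"id": "t2", "passed": false, "latency": 3.4, "usd": 0.011}',
]

traces = load_traces(sample)
```

Defining the mapping once means every later comparison reads the same canonical fields, whatever the original log format called them.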