Getting Started
If you're coming from…
- Phoenix / Arize: keep using Phoenix for tracing. Export spans with client.spans.get_spans() to JSONL, then run kalibra compare on the two exports. The format is auto-detected, so no field mapping is needed. See Phoenix integration.
- LangSmith: export traces as JSONL (any shape). Use kalibra inspect --suggest to print field mappings, then add them to kalibra.yml.
- Braintrust: Braintrust gives you scores; Kalibra gives you a statistical diff between two runs. Map scores.<name> to outcome and metrics.cost to cost in kalibra.yml.
- Langfuse: gen_ai.* attributes are auto-detected. Cost attributes vary by exporter; see OTel GenAI for the mapping.
- Ad-hoc scripts and notebooks: your existing JSONL is probably enough. Run kalibra inspect your-file.jsonl --suggest and paste the suggested command.
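For the ad-hoc case, JSONL simply means one JSON object per line. Below is a minimal sketch of dumping trace records from a script or notebook. The helper name write_jsonl and the field names (trace_id, outcome, cost) are illustrative assumptions, not a schema Kalibra requires.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write an iterable of dicts to `path`, one JSON object per line."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


# Hypothetical trace records; the field names are placeholders.
traces = [
    {"trace_id": "t1", "outcome": "success", "cost": 0.012},
    {"trace_id": "t2", "outcome": "failure", "cost": 0.020},
]
```

Calling write_jsonl(traces, "current.jsonl") produces a file you can then inspect for field mappings.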
Install
Try the demo
This creates a kalibra-demo/ directory with sample traces and runs an interactive comparison. Afterwards:
Interactive tutorials
Each notebook runs without an API key by using pre-recorded traces. The Kalibra analysis is identical to what you would run on live traces.
Compare your own data
If your JSONL uses non-standard field names, let Kalibra figure it out:
This scans your data and prints metric readiness, field coverage, and a copy-pasteable compare command:
Metric readiness
✓ Outcome 200/200 traces
✗ Cost 0/200 traces
✓ Tokens 200/200 traces
✓ Duration 200/200 traces
Suggested field mappings
★ gen_ai.usage.input_tokens
★ gen_ai.usage.output_tokens
Option 1 — quick compare with flags:
kalibra compare traces.jsonl <current.jsonl> \
  --input-tokens gen_ai.usage.input_tokens \
  --output-tokens gen_ai.usage.output_tokens
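To make the "200/200 traces" readiness counts concrete, here is a rough sketch of how field coverage over a JSONL file could be computed. It is an illustration only, not Kalibra's implementation. The helpers get_path and field_coverage are hypothetical, and the sketch assumes a field may appear either as a flat dotted key or as nested objects.

```python
import json


def get_path(record, dotted):
    """Look up a field by exact key first, then as a nested dotted path."""
    if isinstance(record, dict) and dotted in record:
        return record[dotted]
    node = record
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def field_coverage(lines, fields):
    """Count, per field, how many JSONL lines carry a non-null value."""
    counts = {f: 0 for f in fields}
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        total += 1
        record = json.loads(line)
        for f in fields:
            if get_path(record, f) is not None:
                counts[f] += 1
    return total, counts
```

Passing an open file handle as lines works, since iterating a text file yields one line at a time.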
Set up quality gates
Create a kalibra.yml to make comparisons repeatable and add CI gates:
Or write one manually:
baseline:
  path: ./baselines/production.jsonl
current:
  path: ./eval-output/canary.jsonl
require:
  - success_rate_delta >= -2
  - regressions <= 5
  - cost_delta_pct <= 20
Run kalibra compare --metrics to see all available gate fields — token_delta_pct, duration_delta_pct, span_regressions, and more.
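Each require entry reads as a field name, a comparison operator, and a numeric threshold. The sketch below shows one way such expressions could be checked against computed metrics. It is an assumption-laden illustration, not Kalibra's parser: check_gates is a hypothetical helper, only >= and <= appear in the examples here (the other operators are included for completeness), and the metric values are made up.

```python
import operator
import re

_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}
# field name, operator, numeric threshold
_GATE = re.compile(r"^\s*(\w+)\s*(>=|<=|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def check_gates(gates, metrics):
    """Return the gate expressions that do not hold for `metrics`."""
    failed = []
    for gate in gates:
        match = _GATE.match(gate)
        if match is None:
            raise ValueError(f"cannot parse gate: {gate!r}")
        name, op, threshold = match.groups()
        if name not in metrics:
            raise KeyError(f"unknown gate field: {name}")
        if not _OPS[op](metrics[name], float(threshold)):
            failed.append(gate)
    return failed


# Made-up metric values for illustration.
metrics = {"success_rate_delta": -1.5, "regressions": 7, "cost_delta_pct": 12.0}
gates = ["success_rate_delta >= -2", "regressions <= 5", "cost_delta_pct <= 20"]
failed = check_gates(gates, metrics)
```

With these values only the regressions gate fails, so a wrapper script would exit non-zero when failed is non-empty.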
Then:
Filtering from a single file
If your baseline and current traces are in the same file (tagged by a field like variant), use where to split them:
sources:
  baseline:
    path: ./all-traces.jsonl
    where:
      - variant == baseline
  current:
    path: ./all-traces.jsonl
    where:
      - variant == current
require:
  - success_rate_delta >= -2
  - regressions <= 5
Operators: == (equal), != (not equal), =~ (regex match), !~ (regex not match). Multiple matchers are ANDed. Traces missing the field are excluded.
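The matcher semantics above can be sketched in a few lines. This is an illustration of the stated rules (four operators, AND across matchers, traces missing the field excluded), not Kalibra's code. The helper matches is hypothetical, and the sketch assumes top-level fields, string comparison, and unanchored regex search.

```python
import re

# field name, operator, value
_MATCHER = re.compile(r"^\s*([\w.]+)\s*(==|!=|=~|!~)\s*(.+?)\s*$")


def matches(trace, matchers):
    """True if `trace` satisfies every matcher (AND semantics)."""
    for expr in matchers:
        m = _MATCHER.match(expr)
        if m is None:
            raise ValueError(f"cannot parse matcher: {expr!r}")
        field, op, value = m.groups()
        if field not in trace:
            return False  # traces missing the field are excluded
        actual = str(trace[field])  # assumption: compare as strings
        if op == "==":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.search(value, actual) is not None
        else:  # "!~"
            ok = re.search(value, actual) is None
        if not ok:
            return False
    return True


# Made-up traces tagged by a variant field.
traces = [
    {"variant": "baseline", "id": 1},
    {"variant": "current", "id": 2},
    {"id": 3},  # no variant field
]
baseline = [t for t in traces if matches(t, ["variant == baseline"])]
```

Note that the untagged trace is dropped even by a != matcher, because a missing field excludes the trace before any comparison happens.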
Add to CI
# .github/workflows/quality-gate.yml
name: Agent Quality Gate
on: [pull_request]
jobs:
  kalibra:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v5
      - run: python eval.py --output current.jsonl
      - uses: khan5v/kalibra-action@v1
        with:
          baseline: baselines/production.jsonl
          current: current.jsonl
          config: kalibra.yml
The kalibra-action posts a markdown report as a PR comment and exits 1 if any gate fails.