Getting Started

If you're coming from…

  • Phoenix / Arize — keep using Phoenix for tracing. Export spans with client.spans.get_spans() to JSONL, then run kalibra compare on two exports. The format is auto-detected, so no field mapping is needed. See Phoenix integration.
  • LangSmith — export traces as JSONL (any shape). Use kalibra inspect --suggest to print field mappings, then add them to kalibra.yml.
  • Braintrust — Braintrust gives you scores; Kalibra gives you a statistical diff between two runs. Map scores.<name> to outcome and metrics.cost to cost in kalibra.yml.
  • Langfuse — gen_ai.* attributes are auto-detected. Cost attributes vary by exporter — see OTel GenAI for the mapping.
  • Ad-hoc scripts and notebooks — your existing JSONL is probably enough. Run kalibra inspect your-file.jsonl --suggest and paste the suggested command.
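For ad-hoc scripts, the only requirement implied above is a JSONL file: one JSON object per line. A minimal sketch of writing and reading such a file follows. The helper names and the trace fields (trace_id, outcome, cost) are illustrative assumptions, not a schema Kalibra requires.

```python
import json
from pathlib import Path


def write_jsonl(path, traces):
    """Write each trace dict as one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for trace in traces:
            f.write(json.dumps(trace, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical records; the field names are placeholders for whatever
# your script already logs.
example_traces = [
    {"trace_id": "t1", "outcome": "success", "cost": 0.012},
    {"trace_id": "t2", "outcome": "failure", "cost": 0.034},
]
```

Calling write_jsonl("your-file.jsonl", example_traces) produces a file you can pass to kalibra inspect with --suggest to see which fields it recognizes.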

Install

pip install kalibra

Try the demo

kalibra demo

This creates a kalibra-demo/ directory with sample traces and runs an interactive comparison. Afterwards, rerun the comparison yourself with verbose output:

kalibra compare kalibra-demo/baseline.jsonl kalibra-demo/current.jsonl -v

Interactive tutorials

Each notebook works without an API key because it uses pre-recorded traces. The Kalibra analysis itself runs the same way it would on live traces.

Integration | Trace format | Demo scenario | Tutorial
Phoenix / OpenInference | llm.*, openinference.* | Multi-step agent with span tree aggregation | Open in Colab
OTel GenAI | gen_ai.* | Truncation regression hidden by aggregate improvement | Open in Colab
CrewAI | Flat JSONL | Failure redistribution and cost explosion | Open in Colab

Compare your own data

kalibra compare baseline.jsonl current.jsonl

If your JSONL uses non-standard field names, let Kalibra figure it out:

kalibra inspect your-traces.jsonl --suggest

This scans your data and prints metric readiness, field coverage, and a copy-pasteable compare command:

  Metric readiness
    ✓ Outcome          200/200 traces
    ✗ Cost               0/200 traces
    ✓ Tokens           200/200 traces
    ✓ Duration         200/200 traces

  Suggested field mappings
    ★ gen_ai.usage.input_tokens
    ★ gen_ai.usage.output_tokens

  Option 1 — quick compare with flags:
      kalibra compare your-traces.jsonl <current.jsonl> \
      --input-tokens gen_ai.usage.input_tokens \
      --output-tokens gen_ai.usage.output_tokens
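To make the "N/200 traces" coverage figures concrete, here is a minimal sketch of counting how many records in a JSONL file carry each field. It illustrates the idea only and is not how kalibra inspect is implemented. The function names and the dotted-path flattening of nested objects are assumptions.

```python
import json


def flatten(obj, prefix=""):
    """Yield (dotted_path, value) for every leaf in a nested dict."""
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def field_coverage(lines):
    """Return (total_records, {field_path: records_with_non_null_value})."""
    total = 0
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        total += 1
        record = json.loads(line)
        # Use a set so a field is counted at most once per record.
        present = {path for path, value in flatten(record) if value is not None}
        for path in present:
            counts[path] = counts.get(path, 0) + 1
    return total, counts
```

Passing an open file handle for your-traces.jsonl works, since iterating a file yields lines. A field whose count equals the total is present in every trace.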

Set up quality gates

Create a kalibra.yml to make comparisons repeatable and add CI gates:

kalibra init

Or write one manually:

sources:
  baseline:
    path: ./baselines/production.jsonl
  current:
    path: ./eval-output/canary.jsonl

require:
  - success_rate_delta >= -2
  - regressions <= 5
  - cost_delta_pct <= 20

Run kalibra compare --metrics to see all available gate fields — token_delta_pct, duration_delta_pct, span_regressions, and more.

Then:

kalibra compare    # reads kalibra.yml, exits 1 on failure
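As a rough illustration of what a require list expresses, the sketch below checks "field operator threshold" expressions against a dictionary of computed metrics and reports the ones that fail. It is not Kalibra's implementation. The check_gates helper is hypothetical, the metric values in the usage note are made up, and support for > and < is an assumption (only >= and <= appear in the config above).

```python
import operator
import re

# Comparison operators this sketch understands.
OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}

# field name, operator, then a signed integer or decimal threshold
GATE_RE = re.compile(r"^\s*(\w+)\s*(>=|<=|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def check_gates(metrics, gates):
    """Return the gate expressions that do not hold for the given metrics."""
    failures = []
    for gate in gates:
        match = GATE_RE.match(gate)
        if match is None:
            raise ValueError(f"cannot parse gate: {gate!r}")
        field, op, threshold = match.groups()
        # A metric missing from the dict raises KeyError in this sketch.
        if not OPS[op](metrics[field], float(threshold)):
            failures.append(gate)
    return failures
```

With made-up metrics {"success_rate_delta": -3.5, "regressions": 2, "cost_delta_pct": 20.0} and the three gates from the config above, only the success_rate_delta gate is returned as failing. A wrapper script could mirror the documented behavior by exiting with status 1 whenever the returned list is non-empty.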

Filtering from a single file

If your baseline and current traces are in the same file (tagged by a field like variant), use where to split them:

sources:
  baseline:
    path: ./all-traces.jsonl
    where:
      - variant == baseline
  current:
    path: ./all-traces.jsonl
    where:
      - variant == current

require:
  - success_rate_delta >= -2
  - regressions <= 5

Operators: == (equal), != (not equal), =~ (regex match), !~ (regex not match). Multiple matchers are ANDed. Traces missing the field are excluded.
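The matcher rules above can be sketched in a few lines. This is an illustration of the described semantics, not Kalibra's code. It assumes top-level fields only, compares values as strings, and treats =~ as an unanchored regex search, none of which the text above confirms.

```python
import re

# field name, one of the four operators, then the rest of the line as the value
MATCHER_RE = re.compile(r"^\s*([\w.]+)\s*(==|!=|=~|!~)\s*(.*?)\s*$")


def parse_matcher(expr):
    """Split 'field OP value' into (field, op, value)."""
    match = MATCHER_RE.match(expr)
    if match is None:
        raise ValueError(f"cannot parse matcher: {expr!r}")
    return match.groups()


def matches(trace, matchers):
    """True only if the trace satisfies every matcher (logical AND)."""
    for expr in matchers:
        field, op, value = parse_matcher(expr)
        if field not in trace:
            return False  # traces missing the field are excluded
        actual = str(trace[field])
        if op == "==":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.search(value, actual) is not None
        else:  # "!~"
            ok = re.search(value, actual) is None
        if not ok:
            return False
    return True
```

Note that a trace without the variant field fails both "variant == baseline" and "variant != baseline", which is what "traces missing the field are excluded" implies.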

Add to CI

# .github/workflows/quality-gate.yml
name: Agent Quality Gate
on: [pull_request]

jobs:
  kalibra:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v5
      - run: python eval.py --output current.jsonl
      - uses: khan5v/kalibra-action@v1
        with:
          baseline: baselines/production.jsonl
          current: current.jsonl
          config: kalibra.yml

The kalibra-action posts a markdown report as a PR comment and exits 1 if any gate fails.