Skip to content

Getting Started

Install

pip install kalibra

Try the demo

kalibra demo

This creates a kalibra-demo/ directory with sample traces and runs an interactive comparison. Afterwards:

kalibra compare kalibra-demo/baseline.jsonl kalibra-demo/current.jsonl -v

Interactive tutorials

Each notebook works without an API key using pre-recorded traces. All Kalibra analysis runs identically.

Integration Trace format Demo scenario Tutorial
Phoenix / OpenInference llm.*, openinference.* Multi-step agent with span tree aggregation Open in Colab
OTel GenAI gen_ai.* Truncation regression hidden by aggregate improvement Open in Colab
CrewAI Flat JSONL Failure redistribution and cost explosion Open in Colab

Compare your own data

kalibra compare baseline.jsonl current.jsonl

If your JSONL uses non-standard field names, let Kalibra figure it out:

kalibra inspect your-traces.jsonl --suggest

This scans your data and prints metric readiness, field coverage, and a copy-pasteable compare command:

  Metric readiness
    ✓ Outcome          200/200 traces
    ✗ Cost               0/200 traces
    ✓ Tokens           200/200 traces
    ✓ Duration         200/200 traces

  Suggested field mappings
    ★ gen_ai.usage.input_tokens
    ★ gen_ai.usage.output_tokens

  Option 1 — quick compare with flags:
      kalibra compare traces.jsonl <current.jsonl> \
      --input-tokens gen_ai.usage.input_tokens \
      --output-tokens gen_ai.usage.output_tokens

Set up quality gates

Create a kalibra.yml to make comparisons repeatable and add CI gates:

kalibra init

Or write one manually:

baseline:
  path: ./baselines/production.jsonl
current:
  path: ./eval-output/canary.jsonl

require:
  - success_rate_delta >= -2
  - regressions <= 5
  - cost_delta_pct <= 20

Run kalibra compare --metrics to see all available gate fields — token_delta_pct, duration_delta_pct, span_regressions, and more.

Then:

kalibra compare    # reads kalibra.yml, exits 1 on failure

Filtering from a single file

If your baseline and current traces are in the same file (tagged by a field like variant), use where to split them:

sources:
  baseline:
    path: ./all-traces.jsonl
    where:
      - variant == baseline
  current:
    path: ./all-traces.jsonl
    where:
      - variant == current

require:
  - success_rate_delta >= -2
  - regressions <= 5

Operators: == (equal), != (not equal), =~ (regex match), !~ (regex not match). Multiple matchers are ANDed. Traces missing the field are excluded.

Add to CI

# .github/workflows/quality-gate.yml
name: Agent Quality Gate
on: [pull_request]

jobs:
  kalibra:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v5
      - run: python eval.py --output current.jsonl
      - uses: khan5v/kalibra-action@v1
        with:
          baseline: baselines/production.jsonl
          current: current.jsonl
          config: kalibra.yml

The kalibra-action posts a markdown report as a PR comment and exits 1 if any gate fails.