Agent Eval Pipelines: What Operators Actually Need to Know (2026)

Michael Isaac

I built my first agent eval pipeline because a model swap silently regressed a tool-use agent in production, and our smoke tests caught nothing. The agent still answered. It still called tools. It just answered worse, on the cases that mattered. That is the failure mode evals exist to catch, and it is the one I keep watching teams underbuild for, on exactly the dimensions that decide whether the pipeline holds: versioned datasets, trace replay, scorer drift, production-trace sampling, PII redaction, and CI regression gates. This page is the architecture review I wish I had read before picking tools.

Do not use this page for procurement without rechecking pricing and plan terms against the linked vendor pages. Several rows below are volatile claims with the dated sources noted inline.

The mental model

An agent eval pipeline is three loops stacked on top of each other.

The inner loop is offline evaluation. A versioned dataset of inputs, expected outputs (or rubrics), and a scorer. The candidate model, prompt, or agent graph runs against the dataset, every row gets scored, and the row scores aggregate into a run-level result. Runs on every PR.

The middle loop is regression detection. The same dataset runs against production traffic samples, or a captured trace replays through a new candidate. Scores compare against a baseline experiment. This loop catches the silent regression I described above.

The outer loop is online evaluation. Real production traces get sampled, scored with LLM-as-judge or deterministic checks, and surface drift over time. Interesting traces, especially failures, get promoted into dataset items so the inner loop gets stronger. Most teams skip this loop.

A pipeline is the dataset, the scorer suite, the experiment runner, the comparison UI, and the job that promotes selected production traces into dataset items. Cut any one of those and three failure modes show up: comparisons across runs become unversioned (model drift vs. dataset drift, indistinguishable); experiments stop being reproducible because dataset version, scorer version, and judge prompt are not pinned; and there is no traceability from a production failure back to a regression case, so the same failure ships twice.

The landscape

Three choices I can defend from hands-on use as of 2026-04-26: Langfuse self-host or Cloud, Braintrust hosted, and a roll-your-own pytest plus JSONL pipeline. Other vendors exist (LangSmith, Arize Phoenix, OpenAI Evals) and matter for many operators, but I have not run them at depth.

A heuristic for reading product surfaces: tools cluster by which surface they were built around first. Langfuse, LangSmith, and Phoenix feel trace-first to me, since the trace tree and production-traffic views are most polished and dataset/experiment shape was added on top. Braintrust feels experiment-first, since the experiment diff and scorer abstractions are the most invested surfaces. The lineage shows up in which views feel polished versus grafted on.

| Dimension | Langfuse (self-host or Cloud) | Braintrust (hosted) | Roll-your-own (pytest + JSONL) |
|---|---|---|---|
| Trace ingest | OTel + native SDK; OSS self-host (Postgres + ClickHouse + Redis + S3) per self-hosting docs, verified 2026-04-26 | OTel + native SDK; hosted only (self-host gated to Enterprise) per pricing, verified 2026-04-26 | Whatever the operator wires |
| Dataset versioning | Versioned on every add/update/delete/archive, re-runnable against historical versions per datasets docs, verified 2026-04-26 | Versioned datasets, comparison view binds runs to versions per Braintrust docs, verified 2026-04-26 | Git history of the JSONL file |
| Experiment runner | dataset.run_experiment() SDK method per experiment-runner changelog 2025-09-17; v4 SDK rewrite reshaped client construction per PyPI, verified 2026-04-26 | First-class experiments and side-by-side diff UI per Braintrust Foundations, verified 2026-04-26 | pytest plugin, scores to SQLite |
| Scorer library shape | Custom evaluator functions plus a managed evaluator library (LLM-as-judge templates and deterministic checks) per evaluation methods docs, verified 2026-04-26 | Bundled deterministic + LLM-as-judge scorers as composable units per Braintrust Foundations, verified 2026-04-26 | Plain Python functions |
| Trace-to-dataset round-trip | source_trace_id field on dataset items, native | "Logs to evals" workflow, native per Braintrust docs, verified 2026-04-26 | Manual copy from log to JSONL |
| Self-host | Yes, OSS, four stateful stores, operator-owned backup/HA | Enterprise tier only, custom pricing | Trivially, on the operator's CI |
| Export portability | OSS schema, full DB export possible | Export scoped to retention window on Starter (14d) and Pro (30d) per pricing, verified 2026-04-26 | Files on disk |
| Pricing entry point (checked 2026-04-26) | Hobby $0/mo, Core $29/mo per Langfuse pricing | Starter $0/mo, Pro $249/mo per Braintrust pricing | $0 vendor fee, real CI and judge-call costs |

A fourth shape: rolling a pipeline with pytest, a JSONL dataset, and an LLM-as-judge function. I have run production agents on hand-rolled pipelines for months. The wall is the comparison UI: once there are ten experiments and the question becomes whether scorer drift or model drift caused the delta, the hosted tools earn their price.

What actually matters operationally

Vendor pages emphasize the wrong dimensions. Here is the dimension list I use.

Dataset versioning semantics. When a dataset item gets edited, can an old experiment re-run against the old version? Langfuse explicitly versions on every add, update, delete, or archive per the datasets docs, verified 2026-04-26. Without this, comparing experiments across time is unreliable because the underlying ground truth shifted underneath.

Scorer composability. Real eval suites mix deterministic checks (did the function call use the right tool name?), LLM-as-judge (did the response satisfy the rubric?), and embedding-similarity scorers. The question is whether the platform treats scorers as first-class composable units or makes the operator wire each one in as bespoke code. Braintrust covers both deterministic and LLM-as-judge scorers via its AutoEvals library and a function-based scorer API (Braintrust Foundations, verified 2026-04-26). Langfuse covers the same surface differently: managed evaluators alongside custom score functions (Langfuse evaluation docs, verified 2026-04-26). Daily ergonomics differ: Braintrust pushes operators toward a small set of well-shaped scorers visible side-by-side per row; Langfuse exposes more knobs and leaves more composition to operator code.
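
For a feel of the deterministic end of that mix, here is a minimal tool-name check written as a plain function. The signature, the trace shape, and the return dict are illustrative only, not either vendor's scorer API.

# Illustrative deterministic scorer: did the agent call the expected tool?
# The tool_calls shape (a list of {"tool_name": ...} dicts) is an assumption,
# not any specific platform's trace schema.
def expected_tool_called(*, tool_calls: list[dict], expected_tool: str) -> dict:
    called = [c.get("tool_name") for c in tool_calls]
    return {
        "name": "expected_tool_called",
        "value": 1 if expected_tool in called else 0,
        "comment": f"called={called}",
    }

# Passes only if the agent actually invoked "search_orders" somewhere in the run.
score = expected_tool_called(
    tool_calls=[{"tool_name": "search_orders"}, {"tool_name": "format_reply"}],
    expected_tool="search_orders",
)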

Trace-to-dataset round-trip. Click a production trace, mark it as a regression, have it land as a dataset item. Langfuse supports this via source_trace_id on dataset items. Braintrust calls this "logs to evals." The absence of this round-trip is the failure mode I keep watching teams repeat: the dataset stays frozen at synthetic seeds, real production failures never feed back, and three months in the eval still passes on every run while the agent regresses on cases that ship to users.

Cost shape at the actual workload. Eval pipelines emit data in two places: tracing volume from production, and scorer execution cost (LLM-as-judge calls are not free). Vendors price the first. The missed line item is judge-call spend.

Data ownership and exit cost. Self-host vs. hosted. License terms. Whether export is real. Eval datasets become institutional knowledge over 12 to 24 months and they should not get trapped.

Concurrency under experiment runs. Work the arithmetic per row. A 5,000-row dataset with one candidate call plus three LLM-as-judge scorers per row is 5,000 + (5,000 × 3) = 20,000 model calls per sweep. Swap an LLM judge for a deterministic scorer and that scorer drops out of the model-call count entirely. Serialized: hours. Bad parallelism: rate-limit failure halfway through with no checkpoint.
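
The same arithmetic as a throwaway helper, useful for sanity-checking a sweep before kicking it off; the numbers are the example from this paragraph, not constants.

# Model calls per sweep: one candidate call per row plus one call per
# LLM-as-judge scorer per row. Deterministic scorers cost zero model calls.
def calls_per_sweep(rows: int, llm_judges: int) -> int:
    return rows + rows * llm_judges

assert calls_per_sweep(rows=5_000, llm_judges=3) == 20_000
# Swapping one LLM judge for a deterministic check drops 5,000 calls per sweep:
assert calls_per_sweep(rows=5_000, llm_judges=2) == 15_000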

Detailed teardowns

Langfuse: the open self-host default

Position: trace-first observability platform, OSS, with an evaluation product on top. Threshold for a real eval product: versioned datasets, experiment runs bound to versions, side-by-side comparison UI, source-trace linkage on dataset items. Langfuse hits all four.

Architecture, when self-hosted, is four stores: Postgres for transactional metadata, ClickHouse for OLAP traces and scores, Redis or Valkey for queue and cache, S3-compatible object storage for events and large exports per the self-hosting docs, verified 2026-04-26. Production targets Kubernetes via Helm, or Terraform on AWS, Azure, GCP, Railway.

Production gotchas Langfuse documents but most write-ups skip:

  • Docker Compose is low-scale only. Production = Kubernetes/Helm or Terraform.
  • All containers run UTC. Non-UTC breaks time-bucketed ClickHouse queries silently. Set TZ=UTC everywhere.
  • Redis/Valkey must be configured with maxmemory-policy noeviction. Langfuse uses Redis as a durable queue, and eviction silently drops queued events.
  • ClickHouse sizing. Single-node fine for low volume; production wants a replicated cluster with sized disks and a TTL policy.
  • Backup and HA are operator-owned, bucket policy varies by role. Postgres needs PITR. ClickHouse needs scheduled backups. The bucket holding raw events benefits from versioning plus a lifecycle rule. The bucket fronting ClickHouse as object-storage disk is the opposite: docs warn against versioning there because ClickHouse manages object lifecycles itself.

Pricing on Cloud, checked 2026-04-26 against langfuse.com/pricing: Hobby $0/month (50k units), Core $29/month, Pro $199/month, Enterprise $2,499/month. Overage: $8.00 per 100k units (100k to 1M), tiered down to $6.00 (50M+).

A billable unit is any tracing data point: a trace, an observation (span, event, generation), or a score. The trap is that one agent run with five tool calls and three sub-spans easily emits 20+ units. The 50k Hobby allowance covers roughly 2,500 production runs at that shape, not 50,000.
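
A back-of-envelope version of that unit math, with the per-run shape stated explicitly as an assumption to replace with a measured trace from the actual agent.

# Assumed shape of one agent run: a trace, a root generation, five tool-call
# observations, three sub-spans, plus events and scores -- call it 20 units.
UNITS_PER_RUN = 20          # assumption; measure a real trace and adjust
HOBBY_ALLOWANCE = 50_000    # Langfuse Hobby units/month as checked 2026-04-26

print(HOBBY_ALLOWANCE // UNITS_PER_RUN)  # 2500 runs covered, not 50,000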

The eval surface is built around versioned datasets and dataset.run_experiment(). Scoring has shifted: Langfuse now ships managed evaluators alongside scorer functions in operator code. The honest tradeoff is that the batteries do not absolve operators of the operational work: calibration against a human-labeled gold set, judge-prompt versioning, model-cost forecasting on judge calls, and evaluator governance all stay on the team.

Right call when: self-host is required, OSS is wanted as a hedge, the team is comfortable operating ClickHouse and a multi-store deployment.

Wrong call when: turnkey hosted with built-in scorer libraries and zero ops is the goal.

Braintrust: the eval-first hosted SaaS

Position: closed-source hosted platform, eval and experiment workflow at the center, observability around it.

Pricing tiers, checked 2026-04-26 against braintrust.dev/pricing:

  • Starter: $0/month, 1 GB processed data ($4/GB overage), 10k scores ($2.50/1k overage), 14 days retention
  • Pro: $249/month, 5 GB processed data ($3/GB overage), 50k scores ($1.50/1k overage), 30 days retention
  • Enterprise: custom pricing, custom retention and export, RBAC, on-prem or hosted

The two-meter shape has a direct consequence: forecast processed-data and score volume separately, because they scale on different axes. Tracing volume tracks production traffic; score volume tracks experiment cadence times dataset size times scorer count. A nightly sweep on a 5,000-row dataset with three scorers burns 450k score-meter units a month before a single production trace lands.
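
The score-meter arithmetic spelled out, using the cadence and dataset size from this paragraph:

# Score volume scales with experiment cadence x dataset size x scorer count,
# independent of production traffic.
rows, scorers, sweeps_per_month = 5_000, 3, 30   # nightly sweep
print(rows * scorers * sweeps_per_month)  # 450,000 scores/month vs 50k included on Pro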

Self-hosting is gated to Enterprise. No Docker Compose, no public Helm chart. If on-prem is required and the budget is below Enterprise, hard no.

Right call when: the bottleneck is "did the new prompt regress?" and the team needs side-by-side experiment diffs as a daily workflow, with first-class scorer abstractions covering both deterministic checks and LLM-as-judge per Braintrust Foundations. Where the UI gets thin: bulk annotation across hundreds of rows is slower than a CSV round-trip, and exported experiment artifacts on Starter and Pro are scoped to the retention window.

On regression gating: Braintrust documents CI gates through its GitHub Action and SDK Eval runner per write-experiments docs and compare-experiments docs. The real operational work is deciding what the gate enforces: which baseline does a PR run compare against? What threshold counts as a regression on a noisy LLM-as-judge scorer? Calibrate scorer variance on a fixed candidate first, set the threshold above the noise floor, pin the baseline experiment ID rather than chasing a moving "latest main" target.
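
A minimal sketch of what that gate can look like in CI, assuming per-row scores for the pinned baseline and the PR candidate have already been fetched into plain Python lists. This is not the Braintrust GitHub Action; the experiment ID, noise floor, and score shape are placeholders to adapt.

# Minimal regression gate: compare a candidate run's mean score against a
# pinned baseline and fail only when the drop exceeds a calibrated noise floor.
import sys
from statistics import mean

BASELINE_EXPERIMENT_ID = "exp_2026_04_01_main"  # pinned, not a moving "latest main"
NOISE_FLOOR = 0.03  # measured by re-running the same candidate and judge twice

def gate(baseline_scores: list[float], candidate_scores: list[float]) -> None:
    delta = mean(candidate_scores) - mean(baseline_scores)
    if delta < -NOISE_FLOOR:
        sys.exit(f"regression vs {BASELINE_EXPERIMENT_ID}: mean delta {delta:+.3f}")
    print(f"gate passed: mean delta {delta:+.3f} (noise floor {NOISE_FLOOR})")

# In CI, baseline_scores come from the pinned experiment ID and
# candidate_scores from the PR run.
gate(baseline_scores=[0.82, 0.79, 0.84], candidate_scores=[0.81, 0.80, 0.83])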

Wrong call when: data residency, OSS preference, or budget below Pro with steady volume.

LangSmith: the LangChain-native option

This teardown is intentionally incomplete. I have used LangSmith in passing but not at the depth I have on Langfuse and Braintrust, and as of 2026-04-26 I have not re-verified current pricing tiers, score-meter shape, or self-host posture. Operators evaluating it should treat smith.langchain.com as the source of truth and run the dimension list above against it directly.

The architectural shape resembles Langfuse: trace capture first, datasets and experiments second, with the trace tree shaped around LangChain's run-tree abstraction. That last point is load-bearing. If the agent runtime is already LangChain or LangGraph, integration tax is near zero. If the runtime is anything else, the spans get coerced into a shape built for someone else's framework.

Roll-your-own with pytest + JSONL

Position: not a vendor, but a real architectural choice.

The shape: dataset as JSONL in the repo, scorers as Python functions, runner as a pytest plugin recording scores to SQLite, comparison via a small Streamlit dashboard or just a CSV diff. No vendor fee, but costs do not go to zero: CI minutes for nightly runs, storage for SQLite history, hosting for the dashboard if it leaves localhost, reviewer time on diffs no UI is helping read, and maintenance tax on the runner plugin when the SDK or judge prompt shape changes.
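
A minimal sketch of that shape, assuming a JSONL file of {"input": ..., "expected": ...} rows and a single SQLite table for score history. The paths, schema, and the run_agent and scorer hooks are placeholders, not a published plugin.

# Roll-your-own runner: JSONL dataset in the repo, scores appended to SQLite.
import json
import sqlite3
from datetime import datetime, timezone
from pathlib import Path

def load_dataset(path: str = "evals/regression.jsonl") -> list[dict]:
    return [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]

def run_sweep(run_name: str, run_agent, scorer, db_path: str = "evals/scores.db") -> float:
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS scores (run TEXT, row_id INTEGER, score REAL, ts TEXT)")
    rows = load_dataset()
    total = 0.0
    for i, row in enumerate(rows):
        output = run_agent(row["input"])
        score = scorer(output, row["expected"])
        total += score
        conn.execute(
            "INSERT INTO scores VALUES (?, ?, ?, ?)",
            (run_name, i, score, datetime.now(timezone.utc).isoformat()),
        )
    conn.commit()
    conn.close()
    return total / len(rows)

# Comparison is then a SQL query, a CSV export, or the Streamlit dashboard.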

On wall-clock cost: a 2,000-row sweep with one candidate model call per row plus one cheaper judge call per row, roughly 1,800 input and 400 output tokens per candidate and 1,200 input and 80 output per judge, on a ubuntu-latest GitHub Actions runner at concurrency 16 with tenacity-based exponential-backoff retry, landed between 22 and 38 minutes per sweep across a week of runs in my own setup. Variance dominated by 429 retries against the upstream API.
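
The concurrency piece of that setup, sketched with bounded parallelism and tenacity's exponential backoff; call_judge is a stub standing in for the real provider call.

# Bounded concurrency (16 workers) with exponential backoff on rate-limit errors.
from concurrent.futures import ThreadPoolExecutor
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(wait=wait_exponential(multiplier=1, max=60), stop=stop_after_attempt(6))
def call_judge(row: dict) -> dict:
    # Placeholder: the real implementation issues the judge API call and lets
    # 429/5xx exceptions propagate so tenacity backs off and retries.
    return {"row": row, "score": None}

def score_rows(rows: list[dict], concurrency: int = 16) -> list[dict]:
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(call_judge, rows))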

Right call when: a full sweep finishes inside the CI step timeout the team can tolerate (for me, 30 minutes on the workload above), per-run judge-call bill stays inside the experiment budget without rate-limit retries, scorer concurrency fits inside the model provider's per-key request ceiling, and the number of humans who need to read comparison output is small (one or two engineers).

Wrong call when those constraints flip: sweep duration crosses the CI timeout, judge-call cost pushes into checkpoint-and-resume territory, the scorer mix grows heterogeneous, or a third consumer (PM, domain expert, another engineer) needs to see comparisons without setting up a local Python environment.

The migration to a hosted tool has never been triggered by the rolled version breaking. It happens when explaining "is this experiment better than that one" to a non-engineer becomes the dominant time cost, which is a team-shape signal, not a tooling-failure signal.

The standards layer

OpenTelemetry has a GenAI semantic conventions working group hardening conventions for LLM and agent telemetry. As of 2026-04-26 the conventions are still marked Development, not Stable, per opentelemetry.io/docs/specs/semconv/gen-ai/. Span attribute names ship in the SDK but may rename before stabilizing. Pin a version and expect to revisit it.

What this gives operationally is narrow: trace ingestion. If an agent runtime emits OTel-compliant spans, most observability platforms can pull them off the wire with no platform-specific instrumentation. Langfuse, Arize Phoenix, several APMs, and Braintrust ingest them.

What "ingestion" does not equal: useful rendering. Two platforms can both accept the same OTel GenAI spans and produce radically different trace views, because one parses gen_ai.request.model while another keys off llm.model. Tool-call spans, retrieval spans, and nested agent-graph spans render even more inconsistently. The wire format is converging. The UI semantics are not. Test the round trip with a real agent before assuming a clean port.

Eval portability is a separate question. I found no stable published standard or vendor-neutral schema for datasets, experiment runs, scorer definitions, or scorer outputs as of 2026-04-26. Trace-layer lock-in is shrinking. Eval-layer lock-in is not.

Things nobody talks about

LLM-as-judge cost dominates the eval bill at scale, and the math is parameterized. One sweep cost = rows × judges_per_row × (input_tokens × input_price + output_tokens × output_price). A 5,000-row dataset with three judge calls per row at roughly 1,200 input and 80 output tokens is 15,000 judge calls per sweep. The trap: operators forecast tracing volume against vendor pricing pages and forget judge calls are billed by the model provider on a separate axis.

| Variable | Where it comes from |
|---|---|
| rows | Dataset size for the sweep |
| judges_per_row | Count of LLM-as-judge scorers; deterministic scorers do not count |
| input_tokens | Prompt template + question + expected + response, measured per row |
| output_tokens | Verdict JSON, typically 50 to 150 |
| input_price, output_price | Provider's current price page, dated |
| runs_per_month | Inner-loop cadence (per PR + nightly) plus middle-loop sweeps |
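
The formula as a function, with the example numbers from this section plugged in. Token counts and per-token prices are assumptions; pull current prices from the provider's price page before trusting the output.

# Monthly judge-call spend for the sweep described above.
def sweep_cost(rows, judges_per_row, input_tokens, output_tokens, input_price, output_price):
    per_call = input_tokens * input_price + output_tokens * output_price
    return rows * judges_per_row * per_call

monthly = 30 * sweep_cost(
    rows=5_000, judges_per_row=3,
    input_tokens=1_200, output_tokens=80,
    input_price=3e-6, output_price=15e-6,  # assumed $3 / $15 per million tokens
)
print(f"${monthly:,.0f}/month")  # roughly $2,160/month at these assumed prices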

Scorer drift is real and silent. If the judge prompt changes, all historical scores become incomparable. Treat the judge prompt as part of the experiment definition. Version it alongside the dataset. Never edit a scorer in place during an active eval campaign without re-running historical baselines.
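
One cheap way to enforce that: record a content hash of the judge prompt on every experiment, so incomparable runs are at least detectable after the fact. The metadata field names here are illustrative.

# Pin the judge prompt to each experiment by content hash. Two runs with
# different hashes are not score-comparable.
import hashlib

def judge_prompt_version(prompt_template: str) -> str:
    return hashlib.sha256(prompt_template.encode("utf-8")).hexdigest()[:12]

experiment_metadata = {
    "dataset_version": "agent-regression-suite-v3",
    "judge_prompt_sha": judge_prompt_version("Score the agent response on a 1-5 rubric..."),
    "judge_model": "<judge-model-id>",
}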

Calibration needs a stated method, and a model-vs-model check is not the whole method. Comparing a cheap judge against a frontier judge on a held-out set is a cost-equivalence check, not a quality validation. It tells me whether the cheap judge tracks the frontier judge closely enough to swap in. It does not tell me whether either judge is correct on rows where both are systematically wrong.

The way I close that gap: a separate human-labeled gold set, 100 to 200 rows, scored by two reviewers independently with a third adjudicating disagreements. Cohen's kappa between reviewers gets reported alongside the score. A rubric where humans only agree 60% of the time is a rubric problem, not a judge problem. The frontier judge gets evaluated against adjudicated human labels first; if it agrees on fewer than 80% of gold rows, the rubric or judge prompt gets reworked before any cheap-judge calibration is worth running.
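
A minimal Cohen's kappa implementation for that two-reviewer agreement check; it assumes both reviewers scored the same rows on the same label set.

# Cohen's kappa for two reviewers labeling the same rows.
from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[k] / n) * (freq_b[k] / n) for k in set(labels_a) | set(labels_b))
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Two reviewers on the 1-5 rubric; kappa near chance signals a rubric problem,
# not a judge problem.
print(cohens_kappa([5, 4, 4, 2, 1, 3], [5, 4, 3, 2, 1, 3]))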

Once the frontier judge clears the gold-set bar, the cost-equivalence calibration runs on a separate 200-row held-out set: agreement-within-1 on the 1-5 rubric, mean signed bias, and a 95% bootstrap confidence interval on the agreement rate (1,000 resamples). If the lower bound of the CI sits above 0.90 and mean bias is inside ±0.2, the cheap judge is safe for the regression sweep with the frontier judge reserved for contested rows. Outside that envelope, the cheap judge becomes a coarse pre-filter, not a primary scorer.

Production-trace sampling rate determines whether the outer loop has signal. Probability of catching at least one failure in a month is 1 - (1 - p)^n, where p is the failure rate and n is sampled traces in the window.

| Traffic (runs/day) | Sample rate | Sampled/mo | P(≥1) @ 0.1% | P(≥1) @ 1% | P(≥1) @ 5% |
|---|---|---|---|---|---|
| 200 | 1% | 60 | 6% | 45% | 95% |
| 200 | 100% | 6,000 | 99.7% | ~100% | ~100% |
| 2,000 | 1% | 600 | 45% | 99.7% | ~100% |
| 2,000 | 5% | 3,000 | 95% | ~100% | ~100% |
| 20,000 | 1% | 6,000 | 99.7% | ~100% | ~100% |

The 1%-sampled, low-traffic agent has a 6% chance of even seeing a single 0.1%-rate failure in a month. That is noise. The fix is conditional sampling: 100% on failures and uncertain calls (latching on an is_uncertain heuristic, an explicit error, or low-confidence model output), 1% on successes. Of the products I checked on 2026-04-26 (Langfuse Cloud, Braintrust, OpenLLMetry's OTel sampler), none surfaced a default conditional-sampling preset; all required a custom sampler at the tracing layer.
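
A sketch of that conditional sampler at the tracing layer. The run-summary fields (error, is_uncertain, confidence) are assumptions about what the agent runtime can surface at the end of a run; no vendor ships this as a preset.

# Conditional sampling: keep 100% of failed or uncertain runs, 1% of successes.
import random

def should_trace(run: dict, success_rate: float = 0.01) -> bool:
    if run.get("error") or run.get("is_uncertain"):
        return True
    if run.get("confidence", 1.0) < 0.5:  # low-confidence model output
        return True
    return random.random() < success_rate

# Wire this decision in ahead of the exporter: drop or mark the trace
# unsampled when should_trace() returns False.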

Treat raw traces as PII-bearing until proven otherwise, and wire redaction at trace-ingest time. Production traces contain user inputs, retrieved documents, tool-call payloads. Of the products I checked on 2026-04-26 (Langfuse Cloud, Langfuse self-host, Braintrust hosted), redaction was a configurable processor on the SDK or gateway path, not an opt-out default. The defensive rule: assume raw traces persist unless a redaction processor is configured before ingest, and put that processor between the agent runtime and the tracing endpoint. For HIPAA, GDPR, or SOC 2, redaction at dataset-item-creation time is too late.
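
What "a processor before ingest" can look like: a masking function applied to every payload on its way to the tracing endpoint, wired into whatever masking or processor hook the tracing SDK exposes. The patterns below are illustrative and nowhere near a complete PII ruleset.

# Recursive redaction over trace payloads before they leave the process.
import re

PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "<EMAIL>"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "<CARD>"),
]

def redact(payload):
    if isinstance(payload, str):
        for pattern, token in PATTERNS:
            payload = pattern.sub(token, payload)
        return payload
    if isinstance(payload, dict):
        return {k: redact(v) for k, v in payload.items()}
    if isinstance(payload, list):
        return [redact(v) for v in payload]
    return payload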

The "we support OpenTelemetry" claim hides which span attributes are actually consumed. Two platforms can both ingest OTel GenAI spans and still display radically different views, because one parses gen_ai.request.model and the other parses llm.model. As of 2026-04-26 the GenAI semconv is still experimental. Test the round trip with a real agent before committing.

Implementation patterns

The companion repo at github.com/MPIsaac-Per/agentinfra-examples carries working versions. Each pattern below is illustrative pseudocode. The companion repo is the source of truth for pinned versions, provider model IDs, and the test commands that verify the calls compile against live SDKs.

Pattern 1: Langfuse experiment with versioned dataset

# Illustrative shape only. The companion repo's tests/test_langfuse_experiment.py
# carries the pinned langfuse and provider-SDK versions and the exact model ID
# used at the time of that commit. Verify against
# https://langfuse.com/docs/evaluation/dataset-runs/run-via-sdk before deploying.
from langfuse import Langfuse
from anthropic import Anthropic  # or the provider SDK of choice

langfuse = Langfuse()
client = Anthropic()

CANDIDATE_MODEL = "<provider-model-id-from-current-docs>"

def my_agent(*, item, **kwargs):
    response = client.messages.create(
        model=CANDIDATE_MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": item.input["question"]}],
    )
    return response.content[0].text

def exact_match(*, output, expected_output, **kwargs):
    return {"name": "exact_match", "value": 1 if output.strip() == expected_output.strip() else 0}

dataset = langfuse.get_dataset(name="agent-regression-suite-v3")
dataset.run_experiment(
    name=f"baseline-{CANDIDATE_MODEL}-2026-04-26",
    task=my_agent,
    evaluators=[exact_match],
)

Two version boundaries matter. The dataset.run_experiment() method shipped on 2025-09-17 inside the 3.x SDK line per the experiment runner changelog. Snippets predating that changelog use the older dataset.get_items() plus manual loop pattern and will break on this call. The separate v4 SDK rewrite landed in March 2026 and reshaped client construction, observation payloads, and the trace/observation split. Confirm the installed version with pip show langfuse and cross-check against the current run-via-SDK docs. The call binds to the current dataset version. Re-running after editing items produces a new run against a new version. Old runs stay anchored to old versions, which is the property worth having.

Pattern 2: LLM-as-judge with calibration check (non-production pseudocode)

The snippet below is non-production pseudocode that omits structured-output validation, schema enforcement, score clamping, retries, refusal handling, and per-row error capture. Production judge code wraps client.messages.create in a structured-output schema, validates parsed scores against the rubric range, retries on parse failures with bounded attempts, and records per-row errors as a separate scorer dimension. The companion repo carries the production version.

import json
import random
from statistics import mean
from anthropic import Anthropic

client = Anthropic()

JUDGE_PROMPT_TEMPLATE = """Score the agent response on a 1-5 rubric for factual accuracy.
Return JSON only, shaped as {{"score": <int>, "reason": <str>}}.
Question: {question}
Expected: {expected}
Response: {response}"""


def judge(question: str, expected: str, response: str, model: str) -> dict:
    msg = client.messages.create(
        model=model,
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT_TEMPLATE.format(
                question=question, expected=expected, response=response
            ),
        }],
    )
    return json.loads(msg.content[0].text)


def _bootstrap_ci(values: list[int], iters: int = 1000, alpha: float = 0.05) -> tuple[float, float]:
    n = len(values)
    if n == 0:
        return (float("nan"), float("nan"))
    rng = random.Random(0)
    means = []
    for _ in range(iters):
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * iters)]
    hi = means[int((1 - alpha / 2) * iters) - 1]
    return (lo, hi)


def calibrate(rows: list[dict], cheap_model: str, reference_model: str) -> dict:
    """Compare a candidate cheap judge against a reference judge on a held-out set.

    Reference judge is treated as a relative anchor, not ground truth.
    Each row is shaped {"question": str, "expected": str, "response": str}.
    """
    if not rows:
        return {"n": 0, "agreement_within_1": float("nan"), "mean_bias": float("nan"), "ci_95": (float("nan"), float("nan"))}
    deltas = []
    agreements = []
    for row in rows:
        small = judge(row["question"], row["expected"], row["response"], cheap_model)
        large = judge(row["question"], row["expected"], row["response"], reference_model)
        delta = small["score"] - large["score"]
        deltas.append(delta)
        agreements.append(1 if abs(delta) <= 1 else 0)
    return {
        "n": len(rows),
        "agreement_within_1": mean(agreements),
        "mean_bias": mean(deltas),
        "ci_95": _bootstrap_ci(agreements),
    }

The trap is treating an unvalidated cheap judge as a drop-in replacement. The three numbers calibrate() returns are agreement-within-1, mean signed bias, and the bootstrap 95% CI on the agreement rate. If the lower bound clears 0.90 and mean bias is inside ±0.2, the cheap judge runs the regression sweep and the frontier judge handles contested rows. The 200-row figure is the smallest set where a 95% bootstrap CI gets tight enough (roughly ±0.04) to distinguish a 0.90 judge from a 0.95 judge. At 50 rows the CI is wide enough that "good" calibration is indistinguishable from luck.
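
The same gate as code, consuming the dict calibrate() returns above; the thresholds are the ones stated in this paragraph.

def cheap_judge_is_safe(calibration: dict) -> bool:
    # Lower bound of the 95% CI on agreement must clear 0.90 and the mean
    # signed bias must sit inside +/-0.2 rubric points.
    ci_lower, _ = calibration["ci_95"]
    return ci_lower > 0.90 and abs(calibration["mean_bias"]) <= 0.2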

Error analysis closes the loop. Every row where the cheap and frontier judges disagree by more than 1, and every row where the frontier judge disagreed with the human gold label, gets read by a human and tagged with a failure category. Recurring categories signal the rubric or judge prompt needs revision.

Pattern 3: Outer-loop trace-to-dataset capture

# Illustrative shape. Verify the v4 SDK call signatures against
# https://langfuse.com/docs/v4 and
# https://langfuse.com/docs/api-and-data-platform/features/query-via-sdk/
from langfuse import Langfuse

langfuse = Langfuse()

def promote_trace_to_dataset(
    trace_id: str,
    observation_id: str,
    dataset_name: str,
    expected_output: str,
):
    observation = langfuse.fetch_observation(observation_id)
    langfuse.create_dataset_item(
        dataset_name=dataset_name,
        input=observation.input,
        expected_output=expected_output,
        source_trace_id=trace_id,
    )

The v4 SDK line moved IO fields off the trace object and onto observations. In v4 a trace is a thin grouping construct, and the input/output payloads worth promoting live on the observation per the v4 migration docs. Older snippets that read trace.input are stale against v3 and broken against v4. The pattern that survives: pick the observation that matters for eval (typically the root generation, or the specific sub-span whose output the regression turns on), fetch it by ID, map its payload into the dataset item.

The source_trace_id link is what makes the outer loop close. Six months later, the production failure that motivated this dataset item is still recoverable.

Tested with

The companion repo pins exact package versions, provider model IDs, and the test commands that verify each pattern against live SDKs. Treat the snippets above as shape, not as line-for-line copy targets.

Decision framework

| Constraint that bites first | Choice | Why |
|---|---|---|
| Self-host required (compliance, residency, cost ceiling) | Langfuse self-host, with roll-your-own as fallback if the team cannot operate four stateful stores | Only OSS option here with versioned datasets and source-trace linkage |
| Side-by-side experiment diff is the daily workflow surface | Braintrust hosted, Pro tier or above | Comparison UI is the product investment; AutoEvals plus first-class scorer abstractions |
| Monthly tooling budget under $250 | Langfuse Hobby or Core, or roll-your-own | Braintrust Starter caps on score volume fast; Langfuse Core covers small workloads |
| Agent runtime is LangChain or LangGraph | Re-verify LangSmith plan and self-host posture against smith.langchain.com, then run the dimension list against it | Native run-tree integration reduces instrumentation; current pricing out of scope here |
| Comparison complexity still low (one or two engineers, < ~10 active experiments) | Roll-your-own with pytest and JSONL | Hosted UI cost exceeds CSV-diff-plus-Slack cost until a third reader joins |
| Ten-plus active experiments and a non-engineer reads results | Migrate to Langfuse or Braintrust depending on the self-host axis | Comparison-UI wall is the actual migration trigger I have observed |

The bet I would make: OpenTelemetry GenAI semconv stabilizes within 12 months and trace-layer lock-in becomes a non-issue. Eval-layer lock-in (datasets, scorers, experiment metadata) does not have an equivalent standard on the horizon as of 2026-04-26. Pick the eval vendor with the assumption that migration will be painful and the dataset is the asset really being protected.