
Langfuse vs Braintrust: Which Wins for Agent Observability (2026)
I run two production agent systems. One traces to Langfuse. The other traces to Braintrust. Picking between them was harder than I expected, because the surface-level pitch is nearly identical: capture LLM traces, run evals, ship better agents. The real differences only show up after you wire one up, scale it past a few thousand traces a day, and try to extract data the vendor would rather you didn't.
This is the comparison I wish I'd had before I made my first pick.
TL;DR verdict
If self-hosting on infrastructure you control is a hard requirement, Langfuse wins by default, because Braintrust gates self-hosting behind Enterprise (verified against https://www.braintrust.dev/pricing on 2026-04-25). If the priority is a packaged eval-and-experiment loop, specifically the Instrument → Observe → Annotate → Evaluate → Deploy stages Braintrust documents on https://www.braintrust.dev/docs (verified 2026-04-25), with deterministic and LLM-as-judge scorers, an experiments comparison surface (module 04 of Foundations, verified against https://www.braintrust.dev/foundations on 2026-04-25), and a playground tied to the same dataset, Braintrust covers more of that loop out of the box. The biggest tradeoff is lock-in: Langfuse is MIT-licensed except the ee/ folders (verified against https://github.com/langfuse/langfuse on 2026-04-25), Braintrust is closed-source SaaS with an Enterprise on-prem option that requires a sales call.
Side-by-side feature table
| Dimension | Langfuse | Braintrust |
|---|---|---|
| License (core) | MIT (except ee/ folders) | Closed-source SaaS |
| Latest release (verified 2026-04-25) | v3.170.0, released Apr 23, 2026 | No public version stream |
| GitHub stars | 26.1k | Not applicable (closed source) |
| Free tier | Hobby: $0/month | Starter: $0/month |
| First paid tier | Core: $29/month | Pro: $249/month |
| Top published tier | Enterprise: $2,499/month | Enterprise: custom pricing |
| Self-host | Open-source self-host available; some add-on features require a license key | Enterprise tier only (on-prem or hosted deployment) |
| Self-host deployment targets | Low-scale: VM or local Docker Compose. Production-grade: Kubernetes (Helm), AWS (Terraform), Azure (Terraform), GCP (Terraform), Railway | Vendor-managed; on-prem at Enterprise |
| Storage architecture (self-host) | Postgres (transactional) + ClickHouse (OLAP traces/observations/scores) + Redis/Valkey (queue/cache) + S3/Blob (events, multi-modal inputs, large exports) | Vendor-managed |
| Billable unit | Tracing data points: traces, observations (spans, events, generations), scores | Processed data (GB) + scores |
| Datasets / experiments | dataset.run_experiment(), versioned datasets, custom task functions | Built-in experiments, deterministic and LLM-as-judge scorers, comparison module |
| Lock-in | Lower, but not zero. Database is yours, but SDK semantics, schema coupling, hosted-only EE features, and eval-history migration still couple your code to Langfuse | Medium-high (closed source, vendor-managed) |
| Compliance posture | Self-hosted = your controls; Cloud = vendor sub-processor | Vendor sub-processor; on-prem at Enterprise tier |
(Pricing and features verified against langfuse.com/pricing, langfuse.com/self-hosting, github.com/langfuse/langfuse, and braintrust.dev/pricing on 2026-04-25.)
Pricing reality
Both vendors publish a Hobby/Starter free tier and stair-step up from there. They diverge on what gets metered: Langfuse bills tracing units (traces, observations, scores), while Braintrust bills processed data in GB plus scores. Braintrust's Starter retention is also capped at 14 days.
Langfuse pricing (2026-04-25):
- Hobby: $0/month
- Core: $29/month
- Pro: $199/month
- Enterprise: $2,499/month
- Optional Teams add-on: $300/month
- Usage overage rates: $8.00 / 100k units (100k–1M), $7.00 / 100k units (1M–10M), $6.50 / 100k units (10M–50M), $6.00 / 100k units (50M+)
A "billable unit" in Langfuse is any tracing data point sent to the platform: a trace, an observation (span, event, generation), or a score (verified against https://langfuse.com/pricing on 2026-04-25). That matters because a traced tool loop can emit one observation per tool call plus generation and scoring events, which multiplies unit count quickly. Math out expected workload before assuming Hobby will hold.
Braintrust pricing (2026-04-25):
- Starter: $0/month, 1 GB processed data + $4/GB, 10k scores + $2.50/1k, 14 days retention, unlimited users
- Pro: $249/month, 5 GB processed data + $3/GB, 50k scores + $1.50/1k, 30 days retention, unlimited users
- Enterprise: custom pricing, custom data retention and export, RBAC, premium support, on-prem or hosted
Two structural differences worth flagging:
- Braintrust's free retention is 14 days. Langfuse's pricing page doesn't publish a retention cap for its free tier. For retroactive debugging or quarterly trend analysis, 14 days is short.
- Braintrust meters processed data in GB, Langfuse meters in units. GB scales with payload size (long agent transcripts get expensive); units scale with span count (deep tool-call trees get expensive). Neither is "better." They penalize different shapes of agent.
The pricing-page gap that matters most: Braintrust does not publish self-host pricing. The pricing page only mentions on-prem or hosted deployment as an Enterprise feature with custom pricing. If self-host is a requirement, that is a sales conversation. Langfuse takes a different posture: per https://langfuse.com/self-hosting (verified 2026-04-25), "Langfuse is open source and can be self-hosted using Docker," with the caveat from the same page that "some add-on features require a license key." So the core repo is self-hostable without a tier gate, while certain admin and customization add-ons sit behind a separate license key.
Where Langfuse wins
1. Self-hosting is a first-class path. The self-host docs list five production-grade deployment targets: Kubernetes (Helm), AWS (Terraform), Azure (Terraform), GCP (Terraform), and Railway, plus a VM or local Docker Compose path for low-scale use (verified against https://langfuse.com/self-hosting on 2026-04-25). Langfuse states that self-hosted runs the same infrastructure that powers Langfuse Cloud, so architectural parity is a documented commitment rather than an inference.
2. The license is MIT, with documented exceptions. The repo is MIT-licensed except for the ee/ folders, which contain Enterprise Edition code (verified against https://github.com/langfuse/langfuse on 2026-04-25). Some self-host add-on features also require a separate license key (verified against https://langfuse.com/self-hosting on 2026-04-25), so "open source" here means the core is MIT and certain admin or customization features are gated. As of 2026-04-25 the repo carries 26.1k stars, and the latest release v3.170.0 shipped on Apr 23, 2026, so the open-source line is actively maintained.
3. Unit-based metering can favor specific workload shapes, but only a worked scenario tells you which. Langfuse meters in tracing data points: traces, observations (spans, events, generations), and scores (verified against https://langfuse.com/pricing on 2026-04-25). Braintrust meters processed data in GB plus scores (verified against https://www.braintrust.dev/pricing on 2026-04-25). A toy comparison: an agent that emits 1M traces per month, each with 20 observations and 2 scores, generates roughly 23M Langfuse units (one for the trace, 20 for observations, 2 for scores). Under Langfuse's overage table that lands in the 10M–50M tier at $6.50 / 100k units, so the variable cost is on the order of 230 × $6.50 ≈ $1,495/month on top of the $199/month Pro sticker. The same workload on Braintrust depends on payload bytes, not span count: if the average trace payload is 2 KB, that is roughly 2 GB of processed data per month plus 2M scores, which on Pro ($249/month base, 5 GB included + $3/GB after, 50k scores included + $1.50/1k after) lands the score line alone at roughly (2,000,000 − 50,000) / 1,000 × $1.50 = $2,925/month. Flip the shape (fewer spans but multi-KB payloads per call) and the ranking inverts. The honest version: model your own span count, average payload size, and score count against both pricing pages before assuming either tool is cheaper. The first sketch after this list turns this scenario into runnable arithmetic.
4. The dataset and experiment model is straightforward and versioned. Items have input and optionally expected_output, link back to source traces via source_trace_id, and every add, update, delete, or archive produces a new dataset version (verified against https://langfuse.com/docs/datasets/overview on 2026-04-25). Experiments re-run against historical dataset versions via dataset.run_experiment() with a custom task function, which is the surface I want when attributing a regression to a prompt change versus a dataset change. The second sketch after this list shows the shape of that loop.
5. Compliance teams get a precise data-flow story. When the entire stack runs in your VPC, Postgres, ClickHouse, Redis/Valkey, and S3-compatible object storage all sit under your account (verified against https://langfuse.com/self-hosting on 2026-04-25). Trace data, prompts, and outputs stay on infrastructure you control. The model provider remains a sub-processor for the LLM call itself, which is true of any observability tool that sits in front of a hosted model.
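The worked scenario from point 3, as runnable arithmetic. Every number here (span counts, payload size, tier rates) comes from the scenario and the pricing figures quoted in this post, and applying the overage rate flat is a simplification; treat the output as a modeling aid, not a quote.

```python
# Cost sketch for the 1M-traces/month scenario in point 3. Rates are the ones
# quoted in this post (verified 2026-04-25); re-check both pricing pages yourself.

TRACES = 1_000_000
OBS_PER_TRACE = 20
SCORES_PER_TRACE = 2
AVG_PAYLOAD_KB = 2  # assumption: small payloads, deep span trees

# Langfuse: unit-metered. Simplification: apply the 10M-50M overage rate flat.
units = TRACES * (1 + OBS_PER_TRACE + SCORES_PER_TRACE)        # 23,000,000 units
langfuse_overage = units / 100_000 * 6.50                      # ~$1,495
langfuse_total = 199 + langfuse_overage                        # Pro sticker + overage

# Braintrust Pro: GB-metered plus scores. 5 GB and 50k scores included.
gb = TRACES * AVG_PAYLOAD_KB / 1_000_000                       # ~2 GB processed data
scores = TRACES * SCORES_PER_TRACE                             # 2,000,000 scores
braintrust_total = (
    249
    + max(0, gb - 5) * 3.00                                    # $3/GB past 5 GB
    + max(0, scores - 50_000) / 1_000 * 1.50                   # $1.50/1k past 50k -> ~$2,925
)

print(f"Langfuse   ~${langfuse_total:,.0f}/month")
print(f"Braintrust ~${braintrust_total:,.0f}/month")
```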
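And the dataset-to-experiment loop from point 4, sketched against the dataset.run_experiment() surface named above. I have not pinned this against the SDK reference, so the parameter names, the dataset name, and the my_agent helper are all illustrative; confirm the exact signature in the Langfuse datasets docs before copying it.

```python
# Hedged sketch of the Langfuse dataset/experiment loop described in point 4.
# Parameter names and the dataset name are illustrative, not verified SDK signatures.
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST

def my_agent(user_input: str) -> str:
    # Placeholder for your real agent call; not part of the Langfuse API.
    return f"echo: {user_input}"

def task(*, item, **kwargs):
    # Custom task function: run the agent on each dataset item's input.
    return my_agent(item.input)

dataset = langfuse.get_dataset("checkout-agent-regressions")  # hypothetical dataset name
dataset.run_experiment(
    name="prompt-v12-vs-dataset-v3",  # illustrative experiment name
    task=task,
)
```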
Where Braintrust wins
1. The eval surface I actually exercised end-to-end. On Braintrust I created a dataset, wired a deterministic scorer plus an LLM-as-judge scorer, ran two experiments against the same dataset, and used the experiments surface to compare runs; a minimal sketch of that wiring follows this list. The Foundations course (verified against https://www.braintrust.dev/foundations on 2026-04-25) frames this as 14 modules across Learn, Build, Refine, with "Comparing experiments" as module 04 and "How to analyze your eval results" as module 09. What I did not exercise: the annotation queue at team scale, RBAC handoff to a separate deploy role (Enterprise tier), or any custom scorer beyond the two I wrote. The rough edge I hit: scorer iteration is faster in the SDK than in the UI, so the "playground first, code later" pattern the docs imply was inverted in my workflow.
2. The course covers deterministic and LLM-as-judge scorers as first-class concepts. Verified against https://www.braintrust.dev/foundations on 2026-04-25, Foundations covers deterministic and LLM-as-judge scorers. I will not claim a richer built-in scorer taxonomy than that without product-doc verification; the landing page does not split built-in versus custom. What I can say from billing: scoring is a metered resource (10k scores at Starter, 50k at Pro, verified against https://www.braintrust.dev/pricing on 2026-04-25), which signals that scorers are a product surface the vendor expects operators to lean on, not a side feature.
3. Comparing experiments is a documented module, not a UI claim I can make from the ledger. Module 04 of Foundations is "Comparing experiments" (verified against https://www.braintrust.dev/foundations on 2026-04-25). I will not describe the exact comparison surface here because the Foundations landing page does not document the UI shape; for the canonical view, fetch /foundations/comparing-experiments before relying on a specific layout. What I can say from my own runs: comparing two experiments against one dataset version was the workflow I reached for most often, and it was accessible without writing custom plotting code.
4. You don't want to operate a database. Langfuse self-host requires Postgres, ClickHouse, Redis/Valkey, and S3 (verified against https://langfuse.com/self-hosting on 2026-04-25). Even on managed Cloud, four storage systems sit behind the vendor's API. Braintrust Pro at $249/month (verified against https://www.braintrust.dev/pricing on 2026-04-25) means writing traces, running experiments, and never reasoning about ClickHouse compaction.
5. Unlimited users from day one. Both Starter ($0/month) and Pro ($249/month) advertise "Unlimited users, projects, datasets, playgrounds, and experiments" (verified against https://www.braintrust.dev/pricing on 2026-04-25). Langfuse charges $300/month for the Teams add-on (verified against https://langfuse.com/pricing on 2026-04-25). For a 10-person ML team, Braintrust Pro at $249 lines up against Langfuse Pro ($199) plus Teams ($300) at $499.
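The wiring from point 1, sketched with the braintrust and autoevals packages. The project name, dataset rows, and my_agent stand-in are assumptions, not my production setup; Levenshtein is the deterministic scorer, Factuality is the LLM-as-judge scorer, and the judge call needs model-provider credentials in the environment.

```python
# Sketch of the dataset + two-scorer experiment described in point 1.
# Project name, rows, and the agent stub are illustrative.
from braintrust import Eval
from autoevals import Levenshtein, Factuality  # deterministic scorer, LLM-as-judge scorer

def my_agent(user_input: str) -> str:
    # Placeholder for the real agent; swap in your own call.
    return "It shipped on Tuesday."

Eval(
    "agent-replies",  # hypothetical Braintrust project name
    data=lambda: [
        {"input": "Where is order #1234?", "expected": "It shipped on Tuesday."},
        {"input": "Cancel my subscription.", "expected": "Your subscription is cancelled."},
    ],
    task=my_agent,
    scores=[Levenshtein, Factuality],  # Factuality calls a judge model, so it meters scores too
)
```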
Where neither wins
If you primarily need distributed tracing for non-LLM systems (HTTP, DB queries, gRPC), use a real APM. Datadog, Honeycomb, or self-hosted Grafana Tempo + OpenTelemetry will serve you better than either of these. Both Langfuse and Braintrust are LLM-shaped tools first.
If you're at the earliest prototype stage with a single prompt and no eval discipline yet, both tools are overkill. Log to a Postgres table, look at it with a query, and graduate to a real observability tool when you're shipping.
If you need deeply integrated guardrails / PII filtering / prompt firewalling as the primary use case, look at purpose-built tools (Lakera, Protect AI). Observability platforms can ingest evaluation results from these but aren't the right primary surface.
Things nobody talks about
1. The "self-hosted = data stays in your infra" claim is partially false for both. When the LLM call goes to Anthropic, OpenAI, or any other provider, the prompt and completion leave your infrastructure as part of the inference call itself. Langfuse self-host means trace metadata stays on your infra, but the model provider is still a sub-processor. Use precise language with compliance: "trace data is self-hosted; model inference is sub-processed by [provider] under their DPA." If you need ZDR (Zero Data Retention) on the inference side, that's a separate negotiation with the model provider, independent of which observability tool you pick.
2. Self-hosting Langfuse is not zero-ops. The official architecture (verified against https://langfuse.com/self-hosting on 2026-04-25) lists Postgres for transactional data, ClickHouse for OLAP traces/observations/scores, Redis/Valkey for queue and cache, and S3-compatible storage for events and large payloads. That's four storage systems with four sets of operational concerns: backups, version upgrades, capacity planning, monitoring. The "open source = free" framing masks meaningful operational cost. If your team is two engineers, this is a real tax.
3. Braintrust's 14-day retention on the free tier is a hard cliff. A regression that surfaces 18 days after deploy is invisible. Pro extends to 30 days. Enterprise is custom. Plan retention as a first-class requirement when picking a tier; don't discover it during an incident.
4. The pricing pages don't tell you what counts as a "billable unit" until you read carefully. Langfuse counts every span, event, generation, and score as a unit (verified against https://langfuse.com/pricing on 2026-04-25). A naive integration that traces every tool call inside an agent loop can 20x your unit count without changing user-facing behavior. Braintrust meters processed data in GB, which penalizes long context windows and large tool outputs. Run a one-week pilot before committing to a tier; estimating from the docs alone is unreliable.
5. Enterprise license keys (Langfuse) gate features you might assume are free. Some self-host features require a license key, including organization creators, instance management API, and UI customization (verified against https://langfuse.com/self-hosting on 2026-04-25). The pricing page does not list the license key cost. If your security team needs SSO at the self-host level, expect a sales conversation even though the core repo is MIT.
6. Migration off either tool is a procurement-time question, not a runtime one. I have not personally exported a multi-month experiment history out of Braintrust into Langfuse (or the reverse) and benchmarked the fidelity loss, so treat any blanket "you can/can't migrate cleanly" claim with suspicion, including this one. What I do during procurement: ask each vendor for the exact export API surface, the schema of what comes out, and whether scorer definitions, dataset versions, and experiment-to-trace links survive the round-trip. With Langfuse, the underlying Postgres and ClickHouse are yours, so the worst-case escape hatch is a database dump; with Braintrust, the escape hatch is whatever the export API gives you. Validate this before you have six months of history riding on the answer, not after.
Decision tree
- Does your security team require self-hosting on day one? Langfuse. Braintrust requires Enterprise for that. Move on.
- Are you a 1–3 person team that doesn't want to operate four storage systems? Braintrust. The operational cost of Langfuse self-host eats your eng time. (Or pick Langfuse Cloud at Pro: $199/month. That's the third option.)
- Is the primary use case "ship better prompts faster" with comparison-driven workflow? Braintrust. The Instrument → Observe → Annotate → Evaluate → Deploy loop is the product's spine.
- Is the primary use case "trace agent execution at depth" with custom span semantics? Langfuse. Unit-based metering and the dataset/experiment SDK reward this shape.
- Do you need data on infrastructure you control AND a polished eval UI AND a 1-engineer ops budget? Across Langfuse and Braintrust as verified on 2026-04-25 against https://langfuse.com/pricing, https://langfuse.com/self-hosting, and https://www.braintrust.dev/pricing, I did not find all three in one published tier. Pick two and live with the third.
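The same branching as a toy routing function, so it can live in a runbook or an architecture doc. It encodes only the heuristics above; the thresholds are my judgment calls, not vendor guidance.

```python
# Toy encoding of the decision tree above; the post's heuristics, nothing vendor-official.

def pick_tool(*, must_self_host: bool, team_size: int, wants_to_run_infra: bool,
              primary_use: str) -> str:
    if must_self_host:
        return "Langfuse (self-hosted)"
    if team_size <= 3 and not wants_to_run_infra:
        return "Braintrust, or Langfuse Cloud Pro if you want the open-source escape hatch"
    if primary_use == "prompt-iteration":
        return "Braintrust"
    if primary_use == "deep-agent-tracing":
        return "Langfuse"
    return "Pilot both on a week of real traffic and compare the bills"

print(pick_tool(must_self_host=False, team_size=2, wants_to_run_infra=False,
                primary_use="prompt-iteration"))
```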
Conclusion
If forced to pick today, I'd start on Langfuse (affiliate link, see disclosure above). The MIT license (except ee/ folders), the published self-host story across Docker, Kubernetes, and Terraform on AWS/Azure/GCP, and the documented architectural parity with Langfuse Cloud are the things I weigh most. The tier ladder is also legible end-to-end: Hobby at $0/month, Core at $29/month, Pro at $199/month, Enterprise at $2,499/month, with some self-host add-ons gated behind a license key (verified against https://langfuse.com/pricing and https://langfuse.com/self-hosting on 2026-04-25). The cost is a real operational tax if I take the self-host path (four storage systems to babysit), but I'd rather pay that tax than be locked into closed-source SaaS for an observability layer I expect to keep for years.
I'd reverse that recommendation in two cases. First: if my team treats prompt iteration as the central engineering loop, Braintrust (affiliate link, see disclosure above) packages the eval workflow more cleanly, reducing the process scaffolding I would otherwise build around prompts, scorers, and comparisons. The tool will not build eval discipline for a team that doesn't have it; what it does is shorten the distance between "we agreed to evaluate this" and "we have a comparison view in front of us." Second: if I'm at a 1–2 person startup with no platform engineer, the "don't make me operate Postgres + ClickHouse + Redis/Valkey + S3" argument is decisive, and Braintrust's vendor-managed posture wins until I have the team to support self-host.
I haven't tested either past low-millions of traces per month. At 100M+ trace volume, I would treat published pricing as insufficient and validate Enterprise terms directly with each vendor, since support tier behavior, custom retention, and contractual terms drive the real comparison at that scale and none of those are on the public pricing pages. Run a pilot with the actual workload; the docs lie about cost.
Companion code with working integrations for both platforms lives at github.com/MPIsaac-Per/agentinfra-examples.