
Langfuse vs Braintrust: Which Wins for Agent Observability (2026)
I run two production agent systems. One traces to Langfuse. The other traces to Braintrust. Picking between them was harder than I expected, because the surface-level pitch is nearly identical: capture LLM traces, run evals, ship better agents. The real differences only show up after you wire one up, scale it past a few thousand traces a day, and try to extract data the vendor would rather you didn't.
This is the comparison I wish I'd had before I made my first pick.
TL;DR verdict
If self-hosting on infrastructure you control is a hard requirement, Langfuse wins by default, because Braintrust gates self-hosting behind Enterprise (verified against https://www.braintrust.dev/pricing on 2026-04-25). If the priority is a packaged eval-and-experiment loop, specifically the Instrument → Observe → Annotate → Evaluate → Deploy stages Braintrust documents on https://www.braintrust.dev/docs (verified 2026-04-25), with deterministic and LLM-as-judge scorers, an experiments comparison surface (module 04 of Foundations, verified against https://www.braintrust.dev/foundations on 2026-04-25), and a playground tied to the same dataset, Braintrust covers more of that loop out of the box. The biggest tradeoff is lock-in: Langfuse is MIT-licensed except the ee/ folders (verified against https://github.com/langfuse/langfuse on 2026-04-25), Braintrust is closed-source SaaS with an Enterprise on-prem option that requires a sales call.
Side-by-side feature table
| Dimension | Langfuse | Braintrust |
|---|---|---|
| License (core) | MIT (except ee/ folders) | Closed-source SaaS |
| Latest release (verified 2026-04-25) | v3.170.0, released Apr 23, 2026 | No public version stream |
| GitHub stars | 26.1k | Not applicable (closed source) |
| Free tier | Hobby: $0/month | Starter: $0/month |
| First paid tier | Core: $29/month | Pro: $249/month |
| Top published tier | Enterprise: $2,499/month | Enterprise: custom pricing |
| Self-host | Open-source self-host available; some add-on features require a license key | Enterprise tier only (on-prem or hosted deployment) |
| Self-host deployment targets | Low-scale: VM or local Docker Compose. Production-grade: Kubernetes (Helm), AWS (Terraform), Azure (Terraform), GCP (Terraform), Railway | Vendor-managed; on-prem at Enterprise |
| Storage architecture (self-host) | Postgres (transactional) + ClickHouse (OLAP traces/observations/scores) + Redis/Valkey (queue/cache) + S3/Blob (events, multi-modal inputs, large exports) | Vendor-managed |
| Billable unit | Tracing data points: traces, observations (spans, events, generations), scores | Processed data (GB) + scores |
| Datasets / experiments | dataset.run_experiment(), versioned datasets, custom task functions | Built-in experiments, deterministic and LLM-as-judge scorers, comparison module |
| Lock-in | Lower, but not zero. Database is yours, but SDK semantics, schema coupling, hosted-only EE features, and eval-history migration still couple your code to Langfuse | Medium-high (closed source, vendor-managed) |
| Compliance posture | Self-hosted = your controls; Cloud = vendor sub-processor | Vendor sub-processor; on-prem at Enterprise tier |
(Pricing and features verified against langfuse.com/pricing, langfuse.com/self-hosting, github.com/langfuse/langfuse, and braintrust.dev/pricing on 2026-04-25.)
Pricing reality
Both vendors publish a Hobby/Starter free tier and stair-step up from there. They diverge on what gets metered: Langfuse bills tracing units (traces, observations, scores), while Braintrust bills processed data in GB plus scores. Braintrust's Starter retention is also capped at 14 days.
Langfuse pricing (2026-04-25):
- Hobby: $0/month
- Core: $29/month
- Pro: $199/month
- Enterprise: $2,499/month
- Optional Teams add-on: $300/month
- Usage overage rates: $8.00 / 100k units (100k–1M), $7.00 / 100k units (1M–10M), $6.50 / 100k units (10M–50M), $6.00 / 100k units (50M+)
A "billable unit" in Langfuse is any tracing data point sent to the platform: a trace, an observation (span, event, generation), or a score (verified against https://langfuse.com/pricing on 2026-04-25). That matters because a traced tool loop can emit one observation per tool call plus generation and scoring events, which multiplies unit count quickly. Math out expected workload before assuming Hobby will hold.
Braintrust pricing (2026-04-25):
- Starter: $0/month, 1 GB processed data + $4/GB, 10k scores + $2.50/1k, 14 days retention, unlimited users
- Pro: $249/month, 5 GB processed data + $3/GB, 50k scores + $1.50/1k, 30 days retention, unlimited users
- Enterprise: custom pricing, custom data retention and export, RBAC, premium support, on-prem or hosted
Two structural differences worth flagging:
- Braintrust's free retention is 14 days. Langfuse's pricing page doesn't publish a retention cap for its free tier. For retroactive debugging or quarterly trend analysis, 14 days is short.
- Braintrust meters processed data in GB, Langfuse meters in units. GB scales with payload size (long agent transcripts get expensive); units scale with span count (deep tool-call trees get expensive). Neither is "better." They penalize different shapes of agent.
The pricing-page gap that matters most: Braintrust does not publish self-host pricing. The pricing page only mentions on-prem or hosted deployment as an Enterprise feature with custom pricing. If self-host is a requirement, that is a sales conversation. Langfuse takes a different posture: per https://langfuse.com/self-hosting (verified 2026-04-25), "Langfuse is open source and can be self-hosted using Docker," with the caveat from the same page that "some add-on features require a license key." So the core repo is self-hostable without a tier gate, while certain admin and customization add-ons sit behind a separate license key.
Where Langfuse wins
1. Self-hosting is a first-class path. The self-host docs list five production-grade deployment targets: Kubernetes (Helm), AWS (Terraform), Azure (Terraform), GCP (Terraform), and Railway, plus a VM or local Docker Compose path for low-scale use (verified against https://langfuse.com/self-hosting on 2026-04-25). Langfuse states that self-hosted runs the same infrastructure that powers Langfuse Cloud, so architectural parity is a documented commitment rather than an inference.
2. The license is MIT, with documented exceptions. The repo is MIT-licensed except for the ee/ folders, which contain Enterprise Edition code (verified against https://github.com/langfuse/langfuse on 2026-04-25). Some self-host add-on features also require a separate license key (verified against https://langfuse.com/self-hosting on 2026-04-25), so "open source" here means the core is MIT and certain admin or customization features are gated. As of 2026-04-25 the repo carries 26.1k stars, and the latest release v3.170.0 shipped on Apr 23, 2026, so the open-source line is actively maintained.
3. Unit-based metering can favor specific workload shapes, but only a worked scenario tells you which. Langfuse meters in tracing data points: traces, observations (spans, events, generations), and scores (verified against https://langfuse.com/pricing on 2026-04-25). Braintrust meters processed data in GB plus scores (verified against https://www.braintrust.dev/pricing on 2026-04-25). A toy comparison: an agent that emits 1M traces per month, each with 20 observations and 2 scores, generates roughly 23M Langfuse units (one for the trace, 20 for observations, 2 for scores). Under Langfuse's overage table that lands in the 10M–50M tier at $6.50 / 100k units, so the variable cost is on the order of 230 × $6.50 ≈ $1,495/month on top of the $199/month Pro sticker. The same workload on Braintrust depends on payload bytes, not span count: if the average trace payload is 2 KB, that is roughly 2 GB of processed data per month plus 2M scores, which on Pro ($249/month base, 5 GB included + $3/GB after, 50k scores included + $1.50/1k after) lands the score line alone at roughly (2,000,000 − 50,000) / 1,000 × $1.50 = $2,925/month. Flip the shape (fewer spans but multi-KB payloads per call) and the ranking inverts. The honest version: model your own span count, average payload size, and score count against both pricing pages before assuming either tool is cheaper. The first sketch after this list turns this scenario into runnable arithmetic.
4. The dataset and experiment model is straightforward and versioned. Items have input and optionally expected_output, link back to source traces via source_trace_id, and every add, update, delete, or archive produces a new dataset version (verified against https://langfuse.com/docs/datasets/overview on 2026-04-25). Experiments re-run against historical dataset versions via dataset.run_experiment() with a custom task function, which is the surface I want when attributing a regression to a prompt change versus a dataset change. The second sketch after this list shows the shape of that loop.
5. Compliance teams get a precise data-flow story. When the entire stack runs in your VPC, Postgres, ClickHouse, Redis/Valkey, and S3-compatible object storage all sit under your account (verified against https://langfuse.com/self-hosting on 2026-04-25). Trace data, prompts, and outputs stay on infrastructure you control. The model provider remains a sub-processor for the LLM call itself, which is true of any observability tool that sits in front of a hosted model.
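The worked scenario from point 3, as runnable arithmetic. Every number here (span counts, payload size, tier rates) comes from the scenario and the pricing figures quoted in this post, and applying the overage rate flat is a simplification; treat the output as a modeling aid, not a quote.

```python
# Cost sketch for the 1M-traces/month scenario in point 3. Rates are the ones
# quoted in this post (verified 2026-04-25); re-check both pricing pages yourself.

TRACES = 1_000_000
OBS_PER_TRACE = 20
SCORES_PER_TRACE = 2
AVG_PAYLOAD_KB = 2  # assumption: small payloads, deep span trees

# Langfuse: unit-metered. Simplification: apply the 10M-50M overage rate flat.
units = TRACES * (1 + OBS_PER_TRACE + SCORES_PER_TRACE)        # 23,000,000 units
langfuse_overage = units / 100_000 * 6.50                      # ~$1,495
langfuse_total = 199 + langfuse_overage                        # Pro sticker + overage

# Braintrust Pro: GB-metered plus scores. 5 GB and 50k scores included.
gb = TRACES * AVG_PAYLOAD_KB / 1_000_000                       # ~2 GB processed data
scores = TRACES * SCORES_PER_TRACE                             # 2,000,000 scores
braintrust_total = (
    249
    + max(0, gb - 5) * 3.00                                    # $3/GB past 5 GB
    + max(0, scores - 50_000) / 1_000 * 1.50                   # $1.50/1k past 50k -> ~$2,925
)

print(f"Langfuse   ~${langfuse_total:,.0f}/month")
print(f"Braintrust ~${braintrust_total:,.0f}/month")
```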
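And the dataset-to-experiment loop from point 4, sketched against the dataset.run_experiment() surface named above. I have not pinned this against the SDK reference, so the parameter names, the dataset name, and the my_agent helper are all illustrative; confirm the exact signature in the Langfuse datasets docs before copying it.

```python
# Hedged sketch of the Langfuse dataset/experiment loop described in point 4.
# Parameter names and the dataset name are illustrative, not verified SDK signatures.
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST

def my_agent(user_input: str) -> str:
    # Placeholder for your real agent call; not part of the Langfuse API.
    return f"echo: {user_input}"

def task(*, item, **kwargs):
    # Custom task function: run the agent on each dataset item's input.
    return my_agent(item.input)

dataset = langfuse.get_dataset("checkout-agent-regressions")  # hypothetical dataset name
dataset.run_experiment(
    name="prompt-v12-vs-dataset-v3",  # illustrative experiment name
    task=task,
)
```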
Where Braintrust wins
1. The eval surface I actually exercised end-to-end. On Braintrust I created a dataset, wired a deterministic scorer plus an LLM-as-judge scorer, ran two experiments against the same dataset, and used the experiments surface to compare runs; a minimal sketch of that wiring follows this list. The Foundations course (verified against https://www.braintrust.dev/foundations on 2026-04-25) frames this as 14 modules across Learn, Build, Refine, with "Comparing experiments" as module 04 and "How to analyze your eval results" as module 09. What I did not exercise: the annotation queue at team scale, RBAC handoff to a separate deploy role (Enterprise tier), or any custom scorer beyond the two I wrote. The rough edge I hit: scorer iteration is faster in the SDK than in the UI, so the "playground first, code later" pattern the docs imply was inverted in my workflow.
2. The course covers deterministic and LLM-as-judge scorers as first-class concepts. Verified against https://www.braintrust.dev/foundations on 2026-04-25, Foundations covers deterministic and LLM-as-judge scorers. I will not claim a richer built-in scorer taxonomy than that without product-doc verification; the landing page does not split built-in versus custom. What I can say from billing: scoring is a metered resource (10k scores at Starter, 50k at Pro, verified against https://www.braintrust.dev/pricing on 2026-04-25), which signals that scorers are a product surface the vendor expects operators to lean on, not a side feature.
3. Comparing experiments is a documented module, not a UI claim I can make from the ledger. Module 04 of Foundations is "Comparing experiments" (verified against https://www.braintrust.dev/foundations on 2026-04-25). I will not describe the exact comparison surface here because the Foundations landing page does not document the UI shape; for the canonical view, fetch /foundations/comparing-experiments before relying on a specific layout. What I can say from my own runs: comparing two experiments against one dataset version was the workflow I reached for most often, and it was accessible without writing custom plotting code.
4. You don't want to operate a database. Langfuse self-host requires Postgres, ClickHouse, Redis/Valkey, and S3 (verified against https://langfuse.com/self-hosting on 2026-04-25). Even on managed Cloud, four storage systems sit behind the vendor's API. Braintrust Pro at $249/month (verified against https://www.braintrust.dev/pricing on 2026-04-25) means writing traces, running experiments, and never reasoning about ClickHouse compaction.
5. Unlimited users from day one. Both Starter ($0/month) and Pro ($249/month) advertise "Unlimited users, projects, datasets, playgrounds, and experiments" (verified against https://www.braintrust.dev/pricing on 2026-04-25). Langfuse charges $300/month for the Teams add-on (verified against https://langfuse.com/pricing on 2026-04-25). For a 10-person ML team, Braintrust Pro at $249 lines up against Langfuse Pro ($199) plus Teams ($300) at $499.
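The wiring from point 1, sketched with the braintrust and autoevals packages. The project name, dataset rows, and my_agent stand-in are assumptions, not my production setup; Levenshtein is the deterministic scorer, Factuality is the LLM-as-judge scorer, and the judge call needs model-provider credentials in the environment.

```python
# Sketch of the dataset + two-scorer experiment described in point 1.
# Project name, rows, and the agent stub are illustrative.
from braintrust import Eval
from autoevals import Levenshtein, Factuality  # deterministic scorer, LLM-as-judge scorer

def my_agent(user_input: str) -> str:
    # Placeholder for the real agent; swap in your own call.
    return "It shipped on Tuesday."

Eval(
    "agent-replies",  # hypothetical Braintrust project name
    data=lambda: [
        {"input": "Where is order #1234?", "expected": "It shipped on Tuesday."},
        {"input": "Cancel my subscription.", "expected": "Your subscription is cancelled."},
    ],
    task=my_agent,
    scores=[Levenshtein, Factuality],  # Factuality calls a judge model, so it meters scores too
)
```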
Where neither wins
If you primarily need distributed tracing for non-LLM systems (HTTP, DB queries, gRPC), use a real APM. Datadog, Honeycomb, or self-hosted Grafana Tempo + OpenTelemetry will serve you better than either of these. Both Langfuse and Braintrust are LLM-shaped tools first.
If you're at the earliest prototype stage with a single prompt and no eval discipline yet, both tools are overkill. Log to a Postgres table, look at it with a query, and graduate to a real observability tool when you're shipping.
If you need deeply integrated guardrails / PII filtering / prompt firewalling as the primary use case, look at purpose-built tools (Lakera, Protect AI). Observability platforms can ingest evaluation results from these but aren't the right primary surface.
Things nobody talks about
1. The "self-hosted = data stays in your infra" claim is partially false for both. When the LLM call goes to Anthropic, OpenAI, or any other provider, the prompt and completion leave your infrastructure as part of the inference call itself. Langfuse self-host means trace metadata stays on your infra, but the model provider is still a sub-processor. Use precise language with compliance: "trace data is self-hosted; model inference is sub-processed by [provider] under their DPA." If you need ZDR (Zero Data Retention) on the inference side, that's a separate negotiation with the model provider, independent of which observability tool you pick.
2. Self-hosting Langfuse is not zero-ops. The official architecture (verified against https://langfuse.com/self-hosting on 2026-04-25) lists Postgres for transactional data, ClickHouse for OLAP traces/observations/scores, Redis/Valkey for queue and cache, and S3-compatible storage for events and large payloads. That's four storage systems with four sets of operational concerns: backups, version upgrades, capacity planning, monitoring. The "open source = free" framing masks meaningful operational cost. If your team is two engineers, this is a real tax.
3. Braintrust's 14-day retention on the free tier is a hard cliff. A regression that surfaces 18 days after deploy is invisible. Pro extends to 30 days. Enterprise is custom. Plan retention as a first-class requirement when picking a tier; don't discover it during an incident.
4. The pricing pages don't tell you what counts as a "billable unit" until you read carefully. Langfuse counts every span, event, generation, and score as a unit (verified against https://langfuse.com/pricing on 2026-04-25). A naive integration that traces every tool call inside an agent loop can 20x your unit count without changing user-facing behavior. Braintrust meters processed data in GB, which penalizes long context windows and large tool outputs. Run a one-week pilot before committing to a tier; estimating from the docs alone is unreliable.
5. Enterprise license keys (Langfuse) gate features you might assume are free. Some self-host features require a license key, including organization creators, instance management API, and UI customization (verified against https://langfuse.com/self-hosting on 2026-04-25). The pricing page does not list the license key cost. If your security team needs SSO at the self-host level, expect a sales conversation even though the core repo is MIT.
6. Migration off either tool is a procurement-time question, not a runtime one. I have not personally exported a multi-month experiment history out of Braintrust into Langfuse (or the reverse) and benchmarked the fidelity loss, so treat any blanket "you can/can't migrate cleanly" claim with suspicion, including this one. What I do during procurement: ask each vendor for the exact export API surface, the schema of what comes out, and whether scorer definitions, dataset versions, and experiment-to-trace links survive the round-trip. With Langfuse, the underlying Postgres and ClickHouse are yours, so the worst-case escape hatch is a database dump; with Braintrust, the escape hatch is whatever the export API gives you. Validate this before you have six months of history riding on the answer, not after.
Decision tree
- Does your security team require self-hosting on day one? Langfuse. Braintrust requires Enterprise for that. Move on.
- Are you a 1–3 person team that doesn't want to operate four storage systems? Braintrust. The operational cost of Langfuse self-host eats your eng time. (Or pick Langfuse Cloud at Pro: $199/month. That's the third option.)
- Is the primary use case "ship better prompts faster" with comparison-driven workflow? Braintrust. The Instrument → Observe → Annotate → Evaluate → Deploy loop is the product's spine.
- Is the primary use case "trace agent execution at depth" with custom span semantics? Langfuse. Unit-based metering and the dataset/experiment SDK reward this shape.
- Do you need data on infrastructure you control AND a polished eval UI AND a 1-engineer ops budget? Across Langfuse and Braintrust as verified on 2026-04-25 against https://langfuse.com/pricing, https://langfuse.com/self-hosting, and https://www.braintrust.dev/pricing, I did not find all three in one published tier. Pick two and live with the third.
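The same branching as a toy routing function, so it can live in a runbook or an architecture doc. It encodes only the heuristics above; the thresholds are my judgment calls, not vendor guidance.

```python
# Toy encoding of the decision tree above; the post's heuristics, nothing vendor-official.

def pick_tool(*, must_self_host: bool, team_size: int, wants_to_run_infra: bool,
              primary_use: str) -> str:
    if must_self_host:
        return "Langfuse (self-hosted)"
    if team_size <= 3 and not wants_to_run_infra:
        return "Braintrust, or Langfuse Cloud Pro if you want the open-source escape hatch"
    if primary_use == "prompt-iteration":
        return "Braintrust"
    if primary_use == "deep-agent-tracing":
        return "Langfuse"
    return "Pilot both on a week of real traffic and compare the bills"

print(pick_tool(must_self_host=False, team_size=2, wants_to_run_infra=False,
                primary_use="prompt-iteration"))
```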
Conclusion
If forced to pick today, I'd start on Langfuse (affiliate link, see disclosure above). The MIT license (except ee/ folders), the published self-host story across Docker, Kubernetes, and Terraform on AWS/Azure/GCP, and the documented architectural parity with Langfuse Cloud are the things I weigh most. The tier ladder is also legible end-to-end: Hobby at $0/month, Core at $29/month, Pro at $199/month, Enterprise at $2,499/month, with some self-host add-ons gated behind a license key (verified against https://langfuse.com/pricing and https://langfuse.com/self-hosting on 2026-04-25). The cost is a real operational tax if I take the self-host path (four storage systems to babysit), but I'd rather pay that tax than be locked into closed-source SaaS for an observability layer I expect to keep for years.
I'd reverse that recommendation in two cases. First: if my team treats prompt iteration as the central engineering loop, Braintrust (affiliate link, see disclosure above) packages the eval workflow more cleanly, reducing the process scaffolding I would otherwise build around prompts, scorers, and comparisons. The tool will not build eval discipline for a team that doesn't have it; what it does is shorten the distance between "we agreed to evaluate this" and "we have a comparison view in front of us." Second: if I'm at a 1–2 person startup with no platform engineer, the "don't make me operate Postgres + ClickHouse + Redis/Valkey + S3" argument is decisive, and Braintrust's vendor-managed posture wins until I have the team to support self-host.
I haven't tested either past low-millions of traces per month. At 100M+ trace volume, I would treat published pricing as insufficient and validate Enterprise terms directly with each vendor, since support tier behavior, custom retention, and contractual terms drive the real comparison at that scale and none of those are on the public pricing pages. Run a pilot with the actual workload; the docs lie about cost.
Companion code with working integrations for both platforms lives at github.com/MPIsaac-Per/agentinfra-examples.