
Production Agent Observability: What Operators Actually Need to Know (2026)
I shipped my first multi-agent system to real users in late 2024. The intent classifier worked, the tool router worked, the model gateway worked. Then a customer asked why a single request had cost them $4.17 in tokens and taken 38 seconds. I had no answer. I had logs of HTTP calls and bills from three providers, but no way to reconstruct what happened inside the agent loop. That hole is what production agent observability fills. This page is the writeup I wish I had had: the mental model, the landscape, operator-grade tradeoffs, and the gotchas vendor pages skip.
The mental model
An agent observability stack records what happens inside an LLM application well enough to debug it later, prove it worked, and improve it over time. The unit of work is the trace. Per Langfuse's definition, each trace captures every operation, including LLM calls, retrieval steps, tool executions, and custom logic, along with timing, inputs, and outputs (verified 2026-04-25 against https://langfuse.com/docs/observability/overview). A trace is composed of nested observations. Langfuse describes a typical example as "an initial model call, multiple tool executions, and a final summarization step."
Traces group into sessions for multi-turn applications. The core data-model primitives across most platforms are traces, sessions, and observations (verified 2026-04-25 against https://langfuse.com/docs/observability/overview). Evaluation scores sit alongside that core, attached as metadata to traces, not as a fourth primitive. This is not the same shape as APM tooling for web services. A request span in Datadog has a clean parent-child tree and a clear root. An agent trace has a planning loop, model calls that retry, tool executions that fan out, retrieval that decorates the context, and a final response synthesis step. Span counts per request can run from 5 to 50+ depending on agent depth.
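To make that shape concrete, here is a rough sketch of how a short tool-calling session decomposes into sessions, traces, and nested observations, with an eval score riding along as metadata. The field names are mine for illustration, not any vendor's wire format, and the model slug is deliberately a placeholder.
# Illustrative shape only; field names are not a specific vendor's schema.
example_session = {
    "session_id": "sess_7f3a",                      # groups multi-turn traces
    "traces": [
        {
            "trace_id": "tr_019c",
            "user_id": "user_42",
            "scores": {"answer_correctness": 0.8},  # eval scores attach as metadata, not a fourth primitive
            "observations": [                       # nested: plan, model calls, tool runs, synthesis
                {"type": "span", "name": "plan", "latency_ms": 210},
                {"type": "generation", "name": "chat <model-slug>",
                 "input_tokens": 812, "output_tokens": 304, "latency_ms": 1900},
                {"type": "span", "name": "execute_tool search_orders",
                 "input": {"query": "order 1123"}, "output": {"status": "ok"}},
                {"type": "generation", "name": "chat <model-slug>",
                 "input_tokens": 1410, "output_tokens": 220, "latency_ms": 1400},
            ],
        },
    ],
}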
Three layers compose the stack:
- Application tracing. Per Langfuse's overview page, application tracing produces structured logs of every request that capture the exact prompt sent, the model's response, token usage, and latency (verified 2026-04-25 against https://langfuse.com/docs/observability/overview). Concretely, the minimum viable capture set is: prompt, completion, model, provider, input/output token counts, latency, errors, tool-call inputs and outputs, user/session identifiers, and redaction status. Without those fields per call, root-cause analysis on agent failures collapses into guessing. A sketch of that capture set follows below.
- Evaluation. Scoring outputs with code, LLM-as-judge, or humans, against datasets built from real production traces.
- Monitoring and alerting. Latency, cost, error rate, quality scores, with thresholds and routing.
A platform that does only #1 is a tracer. A platform that does all three is an observability stack. The distinction matters when costs are compared.
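For layer #1, here is the per-call capture set sketched as a dataclass, which is roughly the record I would insist on before picking any vendor. Field names are mine, not a platform schema.
from dataclasses import dataclass, field

# Minimum viable per-call capture set described above; illustrative field names.
@dataclass
class LLMCallRecord:
    prompt: str
    completion: str
    model: str
    provider: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    error: str | None = None
    tool_calls: list[dict] = field(default_factory=list)  # tool-call inputs and outputs
    user_id: str | None = None
    session_id: str | None = None
    redaction_applied: bool = False                        # redaction status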
The landscape
The space splits along two axes: hosted vs self-hostable, and standards-aligned vs proprietary. Three vendors come up most often in the operator shortlists I see, drawn from my direct conversations with infra teams I have worked with over the last 12 months. That is not a market survey; it is an anecdotal field of view. A formal market map would require structured signal across GitHub stars, customer disclosures, and community surveys, none of which I have run.
Langfuse is open source and can be self-hosted. The hosted Hobby tier is $0/month with 50k units / month included and 30 days data access (verified 2026-04-25 against https://langfuse.com/pricing). The product covers tracing, evaluation, prompt management, and dataset workflows. Anecdotally, I see it most often in teams that want to run the platform inside their own VPC, but I have not measured share.
LangSmith is LangChain's commercial observability product. Per the product page, it works with any LLM framework, not just LangChain (verified 2026-04-25 against https://www.langchain.com/langsmith/observability), with native tracing for OpenAI SDK, Anthropic SDK, Vercel AI SDK, LlamaIndex, and custom implementations. Client SDKs ship for Python, TypeScript, Go, and Java.
Braintrust positions itself as evals-first, with observability on top of a custom storage layer named Brainstore. The pitch: AI traces are large and nested, traditional databases struggle, and Brainstore is designed specifically for AI observability (verified 2026-04-25 against https://www.braintrust.dev). The product emphasizes turning production traces into eval datasets with one click.
A separate axis is the open standard: OpenTelemetry GenAI semantic conventions, currently at Development (experimental) stability per https://opentelemetry.io/docs/specs/semconv/gen-ai/. This matters more than vendor pages let on. Instrumentation written against OTel GenAI conventions reduces the rewrite cost of swapping backends, though backend semantics, dashboards, redaction logic, eval datasets, and vendor extensions still require migration work. Instrumentation written against a vendor SDK locks all of that in.
The rest of the field, including Datadog LLM Observability, Arize, Helicone, PromptLayer, and rolling your own with OTel + Grafana Tempo, is out of scope for this page. I have not pressure-tested those stacks under production agent workloads, and I do not want to rank tools I have not run. This is not a market map.
What actually matters operationally
Vendor landing pages benchmark themselves on feature checklists. That is the wrong frame for production. The dimensions I weight, in order:
Data ownership and residency. Where do prompts and completions live? For HIPAA, GDPR, or financial workloads, this is not a footnote. Langfuse Pro at $199/month includes SOC2 & ISO27001 reports and BAA available (HIPAA), and Enterprise at $2,499/month includes HIPAA compliance and SOC2 Type II & ISO27001 reports (verified 2026-04-25 against https://langfuse.com/pricing). LangSmith puts hybrid and self-hosted so data doesn't leave your VPC behind Enterprise tier (verified 2026-04-25 against https://www.langchain.com/pricing). Braintrust is HIPAA compliant, GDPR compliant, and SOC 2 Type II audited annually, with SSO/SAML, granular permissions, and hybrid deployment (verified 2026-04-25 against https://www.braintrust.dev). Important precision: even self-hosted, prompts still leave the operator's infrastructure to whatever model provider is called. Self-hosting changes who stores the trace, not whether the prompt was processed by Anthropic or OpenAI as a sub-processor. Compliance reviews must list the model provider as a sub-processor on the DPA, with a separate BAA where PHI is in scope.
Retention. Trace retention windows vary widely. Langfuse Hobby gives 30 days, Core 90 days, Pro and Enterprise 3 years (verified 2026-04-25 against https://langfuse.com/pricing). LangSmith splits traces into base (14 days) and extended (400 days) tiers, with extended traces costing $5.00 per 1k traces vs $2.50 per 1k traces for base, and a $2.50 per 1k base-to-extended upgrade path (verified 2026-04-25 against https://www.langchain.com/pricing). Braintrust Starter retains 14 days, Pro 30 days, Enterprise custom (verified 2026-04-25 against https://www.braintrust.dev/pricing). Default retention drives compliance posture and post-incident debug capability. A 14-day default is too short for most production teams.
Cost units and metering shape. This is where comparisons get messy. Langfuse meters by units, with included quotas of 50k (Hobby) and 100k (Core/Pro/Enterprise) per month, then graduated overages: $8.00 / 100k units from 100k to 1M, $7.00 / 100k units from 1M to 10M, $6.50 / 100k units from 10M to 50M, $6.00 / 100k units above 50M (verified 2026-04-25 against https://langfuse.com/pricing). LangSmith meters by traces with a base/extended split. Braintrust meters by GB of processed data and number of scores: Starter is $0/month with 1 GB processed data and 10k scores, Pro is $249/month with 5 GB processed data and 50k scores; overages are + $4/GB on Starter, + $3/GB on Pro, with score overages at + $2.50/1k and + $1.50/1k respectively (verified 2026-04-25 against https://www.braintrust.dev/pricing). The unit shapes are not directly comparable. Estimating costs requires running representative production load against each, not arithmetic on the pricing pages. A workload model worth piloting captures: requests/month, spans/request, average trace size in KB, eval score count, retention class, and seat count.
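Here is what that workload model looks like as a back-of-envelope sketch, using the overage rates quoted above only to shape a pilot estimate. The request volume, span count, trace size, and the assumption that one Langfuse unit maps to one ingested observation are placeholders to replace with measured numbers; none of this substitutes for running representative load.
# Pilot-sizing arithmetic only; real costs require representative load against each vendor.
requests_per_month = 2_000_000      # placeholder workload numbers
spans_per_request = 12
avg_trace_kb = 40
eval_scores_per_month = 150_000

# Langfuse-style metering: assume 1 unit ~= 1 ingested observation (an assumption --
# confirm the unit definition before trusting this). Graduated overage rates from above.
units = requests_per_month * spans_per_request
billable = max(0, units - 100_000)  # Core/Pro/Enterprise include 100k units
langfuse_overage = (
    (min(billable, 900_000) / 100_000) * 8.00
    + (max(0, min(billable - 900_000, 9_000_000)) / 100_000) * 7.00
    + (max(0, billable - 9_900_000) / 100_000) * 6.50  # ignores the >50M tier for brevity
)

# Braintrust-style metering: GB processed plus scores, Pro overage rates from above.
gb_processed = requests_per_month * avg_trace_kb / 1_000_000
braintrust_overage = (
    max(0, gb_processed - 5) * 3.00
    + max(0, (eval_scores_per_month - 50_000) / 1_000) * 1.50
)

print(f"Langfuse-style unit overage:   ~${langfuse_overage:,.0f}/month on top of the plan fee")
print(f"Braintrust-style data overage: ~${braintrust_overage:,.0f}/month on top of the plan fee")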
Integration friction. Three patterns: vendor SDK, OTel GenAI exporter, framework auto-instrumentation. Vendor SDKs ship the most features fastest but lock instrumentation in. OTel GenAI exports are partially portable; the conventions are still at Development stability, which means the spec ships breaking changes between releases. Auto-instrumentation for OpenAI SDK, Anthropic SDK, LangChain, and LlamaIndex covers many common code paths but misses custom tool calls and any logic that does not flow through the wrapped client. I have not measured a coverage percentage; the gaps depend on agent shape.
Evaluation workflow. This is the dimension most operators underweight. Tracing without an evaluation loop is just expensive logging. Braintrust ships one-click conversion of production traces into eval datasets (verified 2026-04-25 against https://www.braintrust.dev), and based on my own use during Q1 2026 against a tool-calling agent with ~12k traces/day, the production-to-dataset flow had the lowest friction of the three for my specific workflow (no dataset export benchmark, no scorer reproducibility test across vendors, no head-to-head CI latency measurement; treat as field experience, not ranking). LangSmith provides Online LLM-as-judge and code evals plus unsupervised topic clustering (verified 2026-04-25 against https://www.langchain.com/langsmith/observability). Langfuse supports evaluation scores attached to traces.
Lock-in. Instrumenting with a vendor SDK means migrating away requires rewriting every traced call site. Instrumenting with OTel GenAI semantic conventions and shipping to a vendor that ingests OTel reduces span instrumentation rewrite to a config change, but does not migrate dashboards, redaction rules, eval datasets, or vendor-specific extension attributes. LangSmith advertises bidirectional OTel: send LangSmith trace data out to other tools, ingest OTel data in (verified 2026-04-25 against https://www.langchain.com/langsmith/observability). Braintrust positions itself as framework agnostic, no framework lock-in, no rewrites. Langfuse's overview page names OpenAI, LangChain, LlamaIndex, and more, without naming OTel directly.
Debugging UX. Step-by-step view of agent execution, full prompt and response visibility, message threading for multi-turn chat, and end-to-end execution traces. LangSmith specifically calls out tool and agent trajectory monitoring, message threading for multi-turn chat interactions, and complete end-to-end execution traces for debugging hallucinations (verified 2026-04-25 against https://www.langchain.com/langsmith/observability). Without these, operators read JSON in a list view, which is fine for 100 traces and unworkable at 100k.
The standards layer
For teams betting on agent infrastructure for the next 3 years, OTel GenAI semantic conventions are the standard worth tracking. As of 2026-04-25, the spec is at Development (experimental) stability per https://opentelemetry.io/docs/specs/semconv/gen-ai/. That is OTel's way of saying "subject to breaking changes, do not pin production tooling to this without expecting churn." The signal types in scope are events, metrics, model spans, and agent spans, with technology-specific conventions defined for Anthropic, Azure AI Inference, AWS Bedrock, and OpenAI.
The attribute namespace is gen_ai.*. The relevant attributes break down by requirement level:
| Attribute | Requirement | Notes |
|---|---|---|
| gen_ai.operation.name | Required | Required on inference, embeddings, retrieval, and execute_tool spans |
| gen_ai.provider.name | Required | Conditionally Required on retrieval spans, "when applicable" |
| gen_ai.request.model | Conditionally Required | "If available" |
| gen_ai.usage.input_tokens | Recommended | Token usage on inputs |
| gen_ai.usage.output_tokens | Recommended | Token usage on outputs |
(All verified 2026-04-25 against https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/.)
Span kinds: inference spans SHOULD be CLIENT and MAY be INTERNAL for in-process model calls. Embeddings and retrieval spans SHOULD be CLIENT. execute_tool spans SHOULD be INTERNAL. Span name formats: {gen_ai.operation.name} {gen_ai.request.model} for inference and embeddings, {gen_ai.operation.name} {gen_ai.data_source.id} for retrieval, execute_tool {gen_ai.tool.name} for tool execution.
The currently defined gen_ai.operation.name values are chat, generate_content, text_completion, embeddings, retrieval, execute_tool, create_agent, and invoke_agent. All values are at Development stability, meaning the vocabulary is subject to change (verified 2026-04-25 against https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/).
What the spec does not yet solve: a stable schema for tool-call arguments and results, prompt template versioning attributes, eval score primitives, and a consistent shape for streaming/partial spans. Vendor instrumentation diverges in these gaps. For applications that cross those gaps, plan for vendor-specific extension attributes alongside the standard.
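To make the span-kind and naming rules concrete, here is a sketch of an execute_tool span that follows the conventions above and carries its argument payload in an explicitly non-standard extension attribute, which is exactly the gap the previous paragraph describes. The com.example.* keys and the TOOL_REGISTRY lookup are hypothetical stand-ins, not part of the spec or any SDK.
import json
from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer("agent.runtime")
TOOL_REGISTRY = {"echo": lambda **kwargs: kwargs}  # stand-in registry for the sketch

def run_tool(tool_name: str, args: dict) -> dict:
    # Per the conventions: execute_tool spans SHOULD be INTERNAL,
    # and the span name is "execute_tool {gen_ai.tool.name}".
    with tracer.start_as_current_span(f"execute_tool {tool_name}", kind=SpanKind.INTERNAL) as span:
        span.set_attribute("gen_ai.operation.name", "execute_tool")
        span.set_attribute("gen_ai.tool.name", tool_name)
        # No stable schema for tool arguments/results exists yet, so these are
        # app-specific extension attributes, not GenAI convention attributes.
        span.set_attribute("com.example.tool.arguments", json.dumps(args))
        result = TOOL_REGISTRY[tool_name](**args)
        span.set_attribute("com.example.tool.result_size_bytes", len(json.dumps(result)))
        return result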
Operational failure modes
1. PII redaction is opt-in, not opt-out, on every platform I have configured. Default capture and redaction behavior varies by SDK version, integration path, and sampling configuration; verify per-vendor before assuming anything. Production consequence (anonymized): on a 2025 engagement at a healthcare-adjacent SaaS, an integration shipped to a HIPAA-eligible tier without the redaction layer enabled. Inbound prompts containing PHI sat in plaintext for several weeks before the gap was caught during a vendor security review. Lesson: deploy redaction at the SDK boundary before the first production trace, with a synthetic PII test suite as a deploy gate, and verify default capture/redaction docs against the SDK version actually deployed. A redaction sketch follows after this list.
2. Trace sampling at high volume is not optional, and platforms do not make it easy. At 1k QPS with 30 spans per request, a system generates 30k spans/sec. Default public pricing on each of the three platforms becomes material at that volume well before any meaningful retention class is selected. Sampling can happen in four places: at the SDK, at the OTel Collector, at backend ingestion, or at query/retention time. Each has different fidelity and cost implications. The OTel Collector route (tail-sampling processor with error and latency rules) is the most backend-neutral option. The vendor-SDK route varies: LangSmith exposes sampling on the trace client; Langfuse exposes sampling and masking through the SDK and the self-hosted collector; Braintrust exposes filters at the SDK level. Recommendation: deploy an OTel Collector with a tail-sampling processor and rehearse a 10x volume spike before launch. The exact bill jump on under-sampled rollouts varies with vendor, retention class, and trace shape; I have seen monthly bills move by 10-15x in a single launch week when sampling was assumed but not configured. An SDK-level head-sampling sketch also follows after this list.
3. Cost-per-trace metering encourages an anti-pattern: stop tracing the failures. When traces cost real money, the easy budget cut is "let's only trace 10% of production." For debugging rare agent failures, uniform low-rate sampling is usually the wrong default. Failures cluster in the long tail, and a uniform 10% sample misses 90% of the rare modes. Anonymized example: a team I worked with reduced trace volume aggressively to fit budget and then could not reproduce a customer-reported failure for over a week because none of the impacted requests were sampled. Recommendation: 100% sample errors and tail-latency outliers, head-sample successes. This is configurable in the OTel Collector tail_sampling processor today; vendor-SDK support for the same shape is uneven.
4. Self-hosted does not mean "data doesn't leave your infrastructure." Self-hosting Langfuse means traces are stored in your VPC. The prompts and completions inside those traces still left the VPC the moment they were sent to Anthropic, OpenAI, or any other model provider as a data sub-processor. For HIPAA, the model provider also needs a BAA, not just the observability vendor. For data residency, the model provider's regional routing matters as much as the trace store's. I have audited compliance setups that satisfied the observability vendor BAA but had no BAA with the model provider, leaving a hole that a downstream auditor flagged. Action: every entity that touches the prompt is a sub-processor, and the DPA list must reflect that.
5. Lock-in is not in the SDK, it is in the eval datasets. Migrating instrumentation between platforms varies with call-site count and instrumentation pattern, anywhere from a few days for a small thin-wrapper agent to multiple weeks for a heavily decorated codebase. Migrating eval datasets, golden traces, scoring rubrics, and accumulated human feedback labels is materially harder, often a quarter or more depending on label volume and rubric complexity. The eval workflow is where the vendor's gravity actually lives. When piloting, instrument with OTel GenAI conventions, then evaluate the eval workflow on production-shaped data. The platform that wins the eval workflow is the platform a team will run for years.
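For gotcha #1, a minimal sketch of redaction at the SDK boundary plus a synthetic-PII deploy gate. The regexes are deliberately simplistic placeholders, not a complete PII/PHI ruleset; the point is the shape: scrub before anything is attached to a span or trace, and fail the deploy when a known-bad sample survives.
import re

# Placeholder patterns; production systems need a vetted detection library and a much larger suite.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[.-]\d{3}[.-]\d{4}\b"),
}

def redact(text: str) -> tuple[str, bool]:
    """Replace matches before the text is attached to any span, trace, or observation."""
    redacted = False
    for label, pattern in PII_PATTERNS.items():
        text, hits = pattern.subn(f"<{label}_redacted>", text)
        redacted = redacted or hits > 0
    return text, redacted

# Synthetic PII suite as a deploy gate: fail CI if any known-bad sample survives redaction.
SYNTHETIC_PII = ["reach me at jane.doe@example.com", "SSN 123-45-6789", "call 555-867-5309"]

def test_redaction_gate():
    for sample in SYNTHETIC_PII:
        cleaned, hit = redact(sample)
        assert hit and cleaned != sample, f"redaction missed: {sample}"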
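For gotcha #2, the SDK-level head-sampling knob in the OTel Python SDK looks like this. The 10% ratio is an assumption to tune per workload, and head sampling alone cannot promise "all errors kept", which is why the Collector tail-sampling route shown in Pattern 3 below is the recommendation.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head sampling: the keep/drop decision is made at the root span and children follow the parent.
trace.set_tracer_provider(
    TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.10)))
)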
Implementation patterns
The companion repo at https://github.com/MPIsaac-Per/agentinfra-examples carries runnable versions of these patterns. The release tag matching this page is v2026.04.25; future SDK upgrades will land on later tags. Verify pinned versions in the tagged requirements.txt against PyPI before deploying.
Pattern 1: OTel GenAI portable instrumentation (vendor-neutral, minimal instrumentation).
import os
from anthropic import Anthropic
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint=os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"]))
)
tracer = trace.get_tracer("agent.runtime")
anthropic_client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
# Pin the model slug from the operator's verified model registry.
# Slug intentionally omitted here: see companion repo tag v2026.04.25 for the verified value. [unverified]
DEFAULT_MODEL = os.environ["AGENT_MODEL_SLUG"]
def call_model(prompt: str, model: str = DEFAULT_MODEL):
    with tracer.start_as_current_span(f"chat {model}") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.provider.name", "anthropic")
        span.set_attribute("gen_ai.request.model", model)
        try:
            response = anthropic_client.messages.create(
                model=model,
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
            )
            span.set_attribute("gen_ai.response.id", response.id)
            span.set_attribute("gen_ai.usage.input_tokens", response.usage.input_tokens)
            span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)
            span.set_status(Status(StatusCode.OK))
            return response
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise
This snippet is minimal instrumentation only: it covers required (gen_ai.operation.name, gen_ai.provider.name), conditionally required (gen_ai.request.model), and recommended (token usage) attributes per the OTel GenAI spans spec, plus exception recording, status, and provider response ID. Production instrumentation also needs retry counters, request IDs from the HTTP layer, sampling configuration, and prompt/completion redaction at the boundary; see the companion repo for the full version.
Pattern 2: Langfuse SDK direct integration (vendor-native).
import os
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context
# Explicit client used for low-level calls and flushing. The @observe decorator manages its own
# client, configured from the same LANGFUSE_SECRET_KEY / LANGFUSE_PUBLIC_KEY / LANGFUSE_HOST env vars.
langfuse = Langfuse(
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    host=os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com"),
)
@observe()
def execute_tool(step):
    langfuse_context.update_current_observation(
        name=f"tool.{step.tool_name}",
        input=step.args,
        metadata={"tool_version": step.tool_version},
    )
    result = step.run()
    langfuse_context.update_current_observation(output=result)
    return result

@observe()
def agent_turn(user_input: str, user_id: str, session_id: str):
    langfuse_context.update_current_trace(user_id=user_id, session_id=session_id)
    # make_plan and synthesize_response are application-level functions, not part of the SDK.
    plan = make_plan(user_input)
    tool_results = [execute_tool(step) for step in plan.steps]
    response = synthesize_response(plan, tool_results)
    return response
def shutdown():
    # Required for serverless / short-lived workers: flush buffered data before the process exits.
    langfuse_context.flush()  # drains the decorator-managed client
    langfuse.flush()          # drains the explicitly instantiated client
The decorator wires nesting automatically when decorated child functions are called inside a decorated parent. User and session identifiers are attached at the trace level; tool call inputs, outputs, and metadata are attached per observation. Flushing at the end of any short-lived process (Lambda, Cloud Run cold start, batch job) is required to avoid losing buffered spans, and with the decorator API it is langfuse_context.flush() that drains the decorator-managed client. Trade-off: zero portability. Migrating to LangSmith or Braintrust later means refactoring every @observe-decorated call site.
Pattern 3: Hybrid (OTel for spans, vendor SDK for evals, with correlation).
Instrument application traces with OTel GenAI conventions, ship to a vendor that ingests OTel, and use the vendor SDK only for the eval and dataset workflow. LangSmith documents bidirectional OTel ingest. The correlation requirement most teams miss: the OTel span must carry the same trace ID the vendor SDK would otherwise mint, and any eval dataset rows derived from a trace must carry the OTel trace/span ID, the prompt version ID, and the dataset row ID as foreign keys. When OTel timestamps and vendor SDK timestamps disagree (clock drift across collector and SDK process, batch flush latency), traces can appear out of order in the vendor UI; treat OTel-ingested timestamps as authoritative and configure the vendor SDK to defer to them where the option exists.
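Sketched as a dataclass, that foreign-key discipline looks like this; field names are mine, not a vendor schema.
from dataclasses import dataclass

# Keys an eval dataset row should carry to stay joinable to the OTel trace and portable across vendors.
@dataclass(frozen=True)
class EvalDatasetRow:
    dataset_row_id: str        # primary key in the eval store
    otel_trace_id: str         # OTel trace the row was derived from
    otel_span_id: str          # span that produced the scored output
    prompt_version_id: str     # prompt template version in effect at capture time
    input: dict
    expected_output: dict | None
    scores: dict[str, float]   # e.g. {"faithfulness": 0.7}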
A representative OTel Collector tail-sampling and redaction snippet for this pattern:
processors:
  attributes/redact:
    actions:
      - key: gen_ai.prompt
        action: hash
      - key: gen_ai.completion
        action: hash
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: tail_latency
        type: latency
        latency: { threshold_ms: 5000 }
      - name: head_sample_success
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
This keeps eval gravity at the vendor while keeping span instrumentation portable, and addresses gotchas #1 and #2 in one place.
A decision framework
There is no universal winner. The right call depends on three operator dimensions:
| If the team... | Lean toward |
|---|---|
| Needs self-host for compliance and has ClickHouse operational capacity (merge-tree tuning, disk sizing, backup/restore, TTL management, ingestion-pressure monitoring) | Langfuse self-hosted |
| Wants hosted with predictable per-unit cost graduation | Langfuse Cloud |
| Already runs LangChain/LangGraph and wants bidirectional OTel ingest/export | LangSmith |
| Treats evals as the primary observability use case | Braintrust |
| Is at Series A or earlier and has not shipped production yet (pre-production evaluation only; Hobby includes 50k units / month and 30 days data access, not a production retention class) | OTel GenAI + Langfuse Hobby, switch later |
| Operates regulated workloads (HIPAA/PCI/financial) | Langfuse Pro/Enterprise (BAA available on Pro, full HIPAA on Enterprise), LangSmith Enterprise (custom hosting and BAA via custom terms), or Braintrust at a tier that includes BAA (HIPAA compliance is on the homepage; confirm BAA availability per plan with sales before relying on it) |
What I am betting on over a 3-year horizon, framed as a bet, not a forecast: the OTel GenAI conventions move from Development (experimental) toward Stable as adoption grows, and instrumentation becomes more commoditized. The current canonical fact is that the spec is at Development (experimental) stability as of 2026-04-25 per https://opentelemetry.io/docs/specs/semconv/gen-ai/. I could be wrong. If the working group changes scope or a vendor consortium produces a competing standard, the bet does not pay out.
What I would avoid: pinning a production app to a vendor SDK with no OTel migration path before having at least 6 months of production-scale data on hand. The cost of being wrong about an observability vendor at 100 QPS is small. The cost at 10k QPS, with a year of eval datasets and golden traces, is most of a quarter.
If I were starting a new agent product this week, I would instrument with OTel GenAI conventions, ship traces to Langfuse Cloud Core for the first 90 days, and run a parallel pilot of Braintrust against the same trace stream to evaluate the eval workflow on production-shaped data. Concrete pilot exit criteria after 90 days: cost per 1M agent requests, redaction coverage measured against a synthetic PII suite, p95 trace query latency under load, eval workflow throughput (datasets created per week, scorers run per dataset), exportability test (full dataset round-trip to a flat file and re-import), and on-call usefulness (mean time to root cause on a curated set of injected failures). Real pilots usually surface tradeoffs, not an obvious winner; the criteria are what make the decision defensible either way.
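For the exportability criterion, the test is a flat-file round trip with a content hash on both sides. In this sketch, export_rows and import_rows are stand-ins for whatever vendor API or export path the pilot uses; the vendor-neutral part is the JSONL dump and the hash comparison.
import hashlib
import json
from pathlib import Path

def content_hash(rows: list[dict]) -> str:
    # Order-insensitive content hash over canonically serialized rows.
    canonical = "\n".join(sorted(json.dumps(r, sort_keys=True) for r in rows))
    return hashlib.sha256(canonical.encode()).hexdigest()

def round_trip_check(export_rows, import_rows, path: Path = Path("dataset_export.jsonl")) -> bool:
    original = export_rows()                      # vendor-specific: pull every dataset row
    path.write_text("\n".join(json.dumps(r, sort_keys=True) for r in original))
    reimported = import_rows(
        [json.loads(line) for line in path.read_text().splitlines()]
    )                                             # vendor-specific: push rows back, read back what landed
    return content_hash(original) == content_hash(reimported)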