OpenRouter Alternatives: What Operators Actually Need to Know (2026)
From late 2024 through early 2026 I operated OpenRouter, self-hosted LiteLLM, Vercel AI Gateway, and the direct Anthropic and OpenAI SDKs as the first-hand on-call across two agent stacks running roughly 8M to 25M tokens per day combined, with monthly spend in the $4k to $12k band. I watched provider 5xx storms, silent prompt-cache misses on Anthropic, and tool-call schema drift each take a stack down at 3am. This page is the writeup I wish had existed when I picked my first gateway. Scope: production agent infra, not casual chat apps.
TL;DR
Pick OpenRouter for stacks that need one API key and instant multi-provider access, and that can accept the gateway as a sub-processor for prompt data. Pick LiteLLM (self-hosted) for stacks that already operate a Python service, want full data control, and burn enough tokens that the gateway margin matters. Pick Vercel AI Gateway only if you are already on Vercel and want platform-bundled failover. Pick direct provider SDKs (Anthropic, OpenAI, Google) for single-provider apps where simplicity beats abstraction. The rest of this page lays out the tradeoffs and the assumptions behind that recommendation.
★ Insight ─────────────────────────────────────
Before adding a gateway to the data path, run this checklist: (1) Am I calling more than one provider in production today, or within the next quarter? (2) Do I need server-enforced per-key budgets or rate limits across teams? (3) Will I lose money or break compliance if prompt-cache headers, tool-call schemas, or streaming deltas get re-serialized in transit? (4) Can my customer DPAs absorb one more sub-processor? If the answer to (1) and (2) is no, a gateway is overhead. If (3) is yes, test cache_read_input_tokens and tool-call round-trips against the candidate gateway before committing.
─────────────────────────────────────────────────
The mental model: what an LLM gateway actually is
An LLM gateway is a translation and routing layer that sits between your application and one or more model providers. In production, three jobs matter:
- Protocol translation. Map a single client API shape (usually OpenAI Chat Completions or, increasingly, OpenAI Responses) onto the native APIs of multiple providers. Anthropic Messages, Google Generative Language, Bedrock Converse, Mistral, Cohere, and self-hosted vLLM each speak their own dialect.
- Routing and policy. Pick which provider to call based on rules (model alias, cost, latency, fallback chain, regional preference, key-level rate limit).
- Cross-cutting concerns. Auth, rate limiting, retry, logging, spend tracking, sometimes caching and sometimes guardrails.
Those three are the core abstraction. Enterprise deployments often layer on first-order concerns the abstraction does not name: billing reconciliation across business units, quota brokerage across teams sharing a provider contract, policy enforcement (PII redaction, jurisdictional routing), and auditability for SOC 2 or HIPAA controls. Those are real and sometimes material, but they are add-ons sitting on top of the three primitives. Bundled extras like an evals UI, a prompt registry, or a tracing dashboard are separable from the gateway role itself. Gateways differ on which of those jobs they execute well, and on the deployment model (hosted SaaS, self-hosted, or library-only).
The category is not new. Reverse proxies have routed HTTP traffic for decades. What is new is that the destinations have wildly different semantics: token pricing varies 30x across providers, p99 latency varies 5x, capability surface varies (tool calling shapes, image inputs, prompt caching, structured output). A gateway that treats providers as interchangeable will silently lose money, latency, or features. The operator-grade question is not "which gateway" but "which gateway preserves the semantics I depend on."
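To make the translation job concrete, here is a minimal sketch of the mapping a gateway performs on every request, from the OpenAI Chat Completions shape to Anthropic Messages, where the system prompt lives in its own field and max_tokens is mandatory. The function name and the defaults are illustrative, not any gateway's actual code:

def openai_to_anthropic(payload: dict) -> dict:
    # Illustrative only: the per-request translation job a gateway owns.
    system = "\n".join(m["content"] for m in payload["messages"] if m["role"] == "system")
    body = {
        "model": payload["model"].split("/", 1)[-1],    # strip a gateway prefix like "anthropic/"
        "max_tokens": payload.get("max_tokens", 1024),  # required by Anthropic, optional in the OpenAI shape
        "messages": [m for m in payload["messages"] if m["role"] != "system"],
    }
    if system:
        body["system"] = system  # Anthropic keeps system out of the messages list
    return body

Multiply that by tool-call schemas, streaming deltas, image blocks, and cache headers, and you have the real surface area a gateway owns.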
The landscape
Group the options by deployment shape, because the deployment shape drives the data-flow and lock-in profile more than any feature does.
Hosted multi-provider gateways. OpenRouter, Vercel AI Gateway, Portkey (hosted), Helicone (hosted proxy mode). One API key, dozens of providers, gateway terms apply. Operationally simplest. Compliance-wise the gateway is a sub-processor for every prompt and response.
Self-hosted gateways. LiteLLM (Python proxy), Portkey (self-hosted), Helicone (self-hosted). You own the data plane. You also own the deployment, the database, the rate-limit-bucket Redis, and the on-call when it crashes at 2am.
Library-only abstractions. Vercel AI SDK, LangChain, LlamaIndex. No gateway. Application-layer translation. Direct calls go to provider SDKs from your process. No third-party sub-processor, but no centralized observability either.
Direct provider SDKs. Anthropic Python/TS SDK, OpenAI SDK, Google genai SDK, Bedrock boto3. No abstraction. Lowest possible friction for one provider.
OpenRouter is a hosted multi-provider gateway with one of the largest model catalogs. Its per-model rates are typically the provider's published rate plus whatever margin OpenRouter has negotiated, with payment-rail surcharges layered on top at credit top-up time (5% on Stripe credits, 5.5% with a $0.80 minimum on credit-card top-ups, per openrouter.ai/docs/api-reference/limits verified 2026-05-05). Whether that nets out cheaper or more expensive than calling a provider directly depends on the specific model and the top-up cadence; spot-check the model slugs you actually use. Its alternatives all live in one of the four buckets above.
What actually matters operationally
Vendor pages emphasize provider count and "unified API." Those matter, but they are not the dimensions that decide whether a stack survives contact with production traffic. What decides that in production: the data path, cost structure, failover semantics, rate-limit accounting, observability surface, prompt-cache passthrough, and lock-in profile. Walk them in that order.
Data path and processor relationships. When a prompt leaves my process, where does it go and who has a contractual relationship with it? A hosted gateway like OpenRouter adds a processor relationship in front of the model provider: the gateway terms govern the request before the provider terms apply. Self-hosted LiteLLM behind direct provider keys removes the gateway-as-vendor relationship; the prompt still leaves my VPC to the provider, who is a processor (or sub-processor, depending on my customer chain) either way. Direct SDKs are the same shape as self-hosted LiteLLM on this dimension. The exact count of processors and sub-processors depends on the contractual chain my customers signed, not on network hops, so the operative question is "how many vendors have I added a DPA with by introducing this layer," and the answer for a hosted gateway is one more than for self-hosted or direct.
Cost: separate payment fees from per-token markup from infra. OpenRouter publishes its fee structure at openrouter.ai/docs/api-reference/limits (verified 2026-05-05): a 5% surcharge on credits purchased via Stripe, and a 5.5% surcharge with a $0.80 minimum on credit-card top-ups. That is a payment-rail fee on the credit purchase, not a per-token markup on inference itself; OpenRouter's published pricing per model is the rate the provider sees plus whatever margin OpenRouter has negotiated, and the credit fee is layered on top at top-up time. LiteLLM self-hosted has no payment-rail fee and no per-token markup, but the infrastructure is real. A realistic always-on deployment in AWS us-east-1, costed against on-demand list prices in the AWS pricing calculator on 2026-05-05, looks roughly like this:
| Component | Spec | Monthly (us-east-1, on-demand list) |
|---|---|---|
| App tier (LiteLLM proxy) | 2 x t3.small, behind ALB | ~$30 compute + ~$22 ALB |
| Postgres | db.t3.small, single-AZ, 20 GB gp3 | ~$30 |
| Redis | cache.t3.micro | ~$13 |
| Egress to providers | 100 GB/mo outbound | ~$9 |
| Subtotal (single-AZ, no HA) | | ~$104/mo |
| Multi-AZ HA upgrade | RDS Multi-AZ, ElastiCache replica, second app AZ | adds ~$70-90/mo |
| Operator labor | 2-4 hr/mo upgrades, on-call, schema bumps | not in the table; price it at the team's loaded hourly rate |
Reserved instances and savings plans cut compute 30-50% if I commit. The crossover where LiteLLM beats OpenRouter is not a single number; it is a function of (a) the OpenRouter credit-purchase fee on my top-up cadence, plus (b) any per-model margin OpenRouter takes on my workload mix, against (c) the ~$100-200/mo infra floor and (d) the labor cost. For my workloads, the breakeven sits in the low thousands of dollars per month of token spend, but the right move is to model my own mix, not adopt a number from this page.
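A minimal breakeven sketch under labeled assumptions; every number is a knob to replace with the real workload, not a finding:

# Hedged breakeven sketch: hosted-gateway overhead vs self-hosted infra plus labor.
# All inputs are assumptions; the credit-fee figure is the documented 5%, the
# per-model margin is a placeholder to measure against your own model mix.
def gateway_overhead(token_spend, credit_fee_pct, per_model_margin_pct):
    return token_spend * (credit_fee_pct + per_model_margin_pct)

def selfhost_cost(infra_floor, ops_hours, loaded_hourly_rate):
    return infra_floor + ops_hours * loaded_hourly_rate

spend = 6_000  # $/month of provider-rate token spend (assumption)
hosted = gateway_overhead(spend, credit_fee_pct=0.05, per_model_margin_pct=0.05)
selfhosted = selfhost_cost(infra_floor=120, ops_hours=2, loaded_hourly_rate=100)
print(f"hosted overhead ${hosted:.0f}/mo vs self-hosted ${selfhosted:.0f}/mo")
# With these particular knobs the crossover lands near $3.2k/mo of token spend;
# change any input and it moves, which is the point of modeling your own mix.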
Failover semantics. OpenRouter's fallback is configured per request via the models array, an ordered list, and retries share the client's HTTP timeout budget. LiteLLM uses fallbacks configured in proxy_config.yaml, with retries owned by the Router class server-side. Vercel AI Gateway documents failover behavior in its docs at vercel.com/docs/ai-gateway, but the exact provider-options API shape has shifted across AI SDK versions. Operators evaluating Vercel AI Gateway failover should pin their ai and @ai-sdk/gateway package versions and verify the current failover-config shape against the live docs before depending on it. The semantic difference across the three is what matters: client-budget retries versus server-owned retries versus platform-edge retries each have a different failure mode under partial provider outage, and the right pick depends on whether the client can tolerate a longer effective timeout.
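The client-budget shape is easiest to see in code. A sketch of the per-request fallback against an OpenRouter-style endpoint, assuming the documented models array; the one knob the client truly owns is the timeout that every retry in the chain must fit inside:

# Client-budget failover sketch: the gateway walks the models list in order, but
# every attempt shares this single HTTP timeout. Assumes the OpenAI-compatible
# endpoint honors a per-request "models" fallback array.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
    timeout=45.0,  # the whole fallback chain has to finish inside this budget
)
resp = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.5",
    messages=[{"role": "user", "content": "..."}],
    extra_body={"models": ["anthropic/claude-sonnet-4.5", "openai/gpt-4o"]},
)

With LiteLLM the equivalent budget lives server-side in num_retries and timeout (Pattern 2 below); the client only sees a slower response.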
Rate limit accounting. With multi-provider gateways, the rate limit is the gateway's, not the provider's. OpenRouter's free-tier limits are documented; their paid-tier limits scale with credit balance. LiteLLM enforces per-key and per-team limits server-side. Direct SDKs surface raw provider limits. The right answer depends on whether one logical bucket across providers is desirable or each provider's quota should be exhausted independently.
Observability surface. OpenRouter ships an activity log scoped to request metadata. Vercel AI Gateway ships per-request telemetry that integrates with Vercel Observability. LiteLLM emits to Langfuse, Helicone, S3, or any OTel sink. Direct SDKs emit nothing without explicit instrumentation. Where I have personally felt the difference is agent fan-out debugging: a single user turn that produces 30+ model calls is hard to inspect when the gateway log only exposes per-request rows without a parent trace ID, and the LiteLLM-to-Langfuse OTel path lets me pivot from a span to its siblings inside one tree. That is a narrow, lived-in observation, not a benchmark; operators with strict requirements on trace cardinality, payload redaction, sampling, retention, or per-request overhead should test the specific dimension on their own workload.
Prompt caching passthrough. Anthropic's prompt caching is cache-key-sensitive in subtle ways: the cache_control blocks must reach the provider intact, and cache identity is per-organization-per-model. Some gateways re-serialize the request and break caching; others preserve it. I have lost real money to this. Verify before committing by reading the cache_read_input_tokens field on the response.
Lock-in profile. OpenAI-shape gateways (OpenRouter, LiteLLM, Vercel AI Gateway) are portable in the limited sense that the base URL is a one-line change. Base URL swaps are only safe for the lowest common subset of the API. Surfaces that almost always need an adapter layer when migrating: tool-call request and response schemas (parallel-call shapes diverge), streaming delta formats (SSE field names, chunk boundaries, finish-reason semantics), structured output mechanisms (JSON-schema headers versus tool-call coercion versus response-format flags), provider-specific cache and routing headers, model slug formats (anthropic/claude-sonnet-4-5 versus claude-sonnet-4-5 versus the provider's own slug), auth header conventions, and error-body shapes for retry classification. Plan for an adapter module from day one. A library-only abstraction (LangChain, Vercel AI SDK) abstracts at the call site, which is harder to remove than a base URL change.
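What "plan for an adapter module" looks like in practice is one chokepoint that owns the base URL, the slug map, and the gateway-specific extensions, so a migration is an edit to one file rather than every call site. A sketch with illustrative names, not a library API:

# Illustrative adapter chokepoint: the only module that knows which gateway is live.
import os
from dataclasses import dataclass, field

@dataclass
class GatewayAdapter:
    base_url: str
    api_key: str
    slug_map: dict                                  # canonical name -> gateway-specific slug
    extra_headers: dict = field(default_factory=dict)
    extra_body: dict = field(default_factory=dict)  # e.g. a fallback "models" array

OPENROUTER = GatewayAdapter(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ.get("OPENROUTER_API_KEY", ""),
    slug_map={"sonnet": "anthropic/claude-sonnet-4.5"},
    extra_body={"models": ["anthropic/claude-sonnet-4.5", "openai/gpt-4o"]},
)
LITELLM_PROXY = GatewayAdapter(
    base_url="http://localhost:4000/v1",            # assumed local proxy address
    api_key=os.environ.get("LITELLM_KEY", ""),
    slug_map={"sonnet": "agent-primary"},           # proxy alias; fallback handled server-side
)

ACTIVE = OPENROUTER  # the one-line change a migration should cost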
Detailed teardowns
LiteLLM (self-hosted proxy)
LiteLLM is a Python project from BerriAI that ships as both a library and a proxy server. The proxy mode is the deployment shape that matters when a team needs central policy enforcement, per-key management, server-side budgets, audit logs, and shared observability across applications. It is a FastAPI service that exposes an OpenAI-compatible endpoint and translates to the provider catalog documented at docs.litellm.ai/docs/providers (verified 2026-05-05), which lists Anthropic, OpenAI, Google, Bedrock, Mistral, Cohere, Groq, Together, Fireworks, vLLM, Ollama, and dozens more. Configuration lives in proxy_config.yaml (model aliases, fallbacks, guardrails, budgets) backed by Postgres for spend tracking and key management.
Architecture: stateless FastAPI workers behind a load balancer, Postgres for persistent state (keys, spend, audit log), Redis for rate-limit accounting. Run it as a Docker container, or pip install it and launch it with litellm --config proxy_config.yaml. The deployment footprint is small but real: Postgres is a dependency you operate.
Tradeoffs. The good: zero gateway margin, full data control, OpenAI-shape API for any backend, server-enforced budgets, native Langfuse integration, per-key rate limits. The bad: it is a Python proxy with active development, and it ships with the bug surface that comes with that. As of 2026-05-05, the BerriAI/litellm tracker has open bugs around Bedrock Converse payload injection (github.com/BerriAI/litellm/issues/27138) and OpenRouter+Whisper interop (github.com/BerriAI/litellm/issues/27083). Treat the proxy as production infra, pin a specific version, and read the changelog at github.com/BerriAI/litellm/releases before bumping.
When LiteLLM is right: you already operate Python services, you spend more than ~$2k/month on tokens, and you want one logical key surface for your team with budget enforcement.
When it is wrong: you do not have a Python on-call rotation, your spend is small, or you do not want another stateful service.
Vercel AI Gateway
Vercel AI Gateway is a managed routing layer integrated into the Vercel platform. You call it via the Vercel AI SDK (@ai-sdk/gateway or the gateway provider in ai) or via OpenAI-compatible HTTP. It handles failover, observability, and (for Vercel customers) bundles into one bill.
Architecture: edge-routed managed service. No self-hosting option. Prompts traverse Vercel infrastructure, then route to the provider you pick. On pricing: I am intentionally not quoting a per-token rate here. Vercel has rotated the AI Gateway billing model more than once since the product left preview, and a quoted number will rot inside a quarter. The shape that has held: a margin on top of provider rates, billed through the Vercel account alongside compute. Treat that as the planning assumption, then check the live rate at vercel.com/docs/ai-gateway before committing to a stack decision.
Tradeoffs. The good: zero ops, tight integration with Vercel deployments, observability that lines up with the rest of your platform, sane defaults for streaming and tool calls. The bad: platform dependency. Off Vercel, the integration value collapses and the margin is pure overhead. Migrating off requires reworking client code to point at a different endpoint.
When it is right: shipping on Vercel, the team already uses the AI SDK, and the bundled observability is enough.
When it is wrong: running on AWS/GCP/Azure with no Vercel surface, or the use case demands self-hosted control.
Direct provider SDKs (Anthropic, OpenAI, Google)
The least exotic option and often the right one. The official Anthropic Python SDK (anthropic), OpenAI SDK (openai), and Google google-genai SDK are first-party, well-maintained, and ship the freshest feature support (Anthropic prompt caching, OpenAI Responses API, Gemini implicit caching land in the official SDKs first).
Architecture: in-process HTTP clients. No gateway. Your application is the data plane. Failover, retry, and observability are your responsibility (or the responsibility of whatever observability layer you wrap it in: Langfuse, Logfire, OTel directly).
Tradeoffs. The good: zero abstraction tax, fastest access to new provider features, lowest possible latency, no extra sub-processor, no extra service to operate. The bad: no built-in multi-provider failover. If your primary goes down, you handle it in application code.
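"You handle it in application code" looks roughly like the sketch below. It is deliberately minimal; real failover needs error classification (429 versus 5xx versus timeouts), backoff, and a mapping layer for prompts and tool calls, and the model choices here are just examples:

# Minimal application-level failover across two direct SDKs. Real code needs
# error classification, backoff, and prompt-shape mapping.
from anthropic import Anthropic, APIStatusError, APIConnectionError
from openai import OpenAI

anthropic_client = Anthropic()
openai_client = OpenAI()

def complete(prompt: str) -> str:
    try:
        msg = anthropic_client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text
    except (APIStatusError, APIConnectionError):
        # fallback provider: different SDK, different response shape
        resp = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content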
When it is right: single-provider apps, latency-sensitive paths where one fewer hop matters, apps where the data-control story has to be tight.
When it is wrong: you genuinely need to call three providers and want central spend tracking.
Portkey (hosted or self-hosted)
Caveat up front: of the four named options on this page, Portkey is the one I have spent the least production time on. I ran the hosted version on a side project for about six weeks at low volume (under 5M tokens/month) and read the self-hosted deployment docs without standing up a long-lived instance. Treat this section as a documentation-grade summary with limited operator depth.
What it is: a routing layer with both a hosted SaaS and a self-hostable Docker deployment, positioned against OpenRouter on the hosted side and LiteLLM on the self-hosted side. The config surface is meaningfully richer than OpenRouter for routing rules (virtual keys per team, per-key budgets, weighted load balancing across providers, conditional routing based on metadata), and the admin UI for spend and key management is more polished than LiteLLM's first-party UI. Pricing, deployment instructions, and SDK references live at portkey.ai/docs; verify against the docs at decision time.
Where I would actively pick Portkey over the alternatives: a team that wants LiteLLM-style routing rules but does not want to operate a Python proxy and Postgres themselves, and is willing to accept a hosted sub-processor for that convenience. The UI workflow for issuing virtual keys to engineers and watching their spend in real time is the strongest single feature.
Where it does not pencil. At scale the hosted margin is a recurring cost on top of provider fees, same shape as OpenRouter. The self-hosted version removes that margin but reintroduces the stateful-service operational burden, with a smaller open-source community to absorb the bug load. As of 2026-05-05, the github.com/Portkey-AI/gateway repo has materially fewer stars, contributors, and open-issue throughput than BerriAI/litellm, which matters when a regression lands and nobody has filed it yet.
When it is right: small-to-mid teams that want polished spend and key management out of the box, are comfortable with a hosted sub-processor, and value UI over OSS-community size.
When it is wrong: cost-sensitive deployments past roughly $5k/month token spend, teams that explicitly want the larger LiteLLM contributor base for long-term bet safety, or teams whose compliance posture rules out additional hosted sub-processors.
The standards layer: OpenAI-compatible endpoints
There is no formal standard for LLM gateway interop. There is a de-facto standard: the OpenAI Chat Completions and (now) Responses API shape. Every gateway in this space implements it. Provider-native APIs (Anthropic Messages, Google Generative Language, Bedrock Converse) are translated to and from this shape.
That has two consequences operators should internalize. First, portability is real: if you write your client against the OpenAI shape, you can swap base URLs across OpenRouter, LiteLLM, Vercel AI Gateway, Portkey, vLLM, Ollama, and the OpenAI API itself with no code change. Second, the shape is leaky: features that exist only on one provider (Anthropic prompt caching, OpenAI's parallel tool calls, Google's grounded search, structured-output JSON schemas) are passed through as provider-specific extensions or extra headers. Gateways differ in how cleanly they preserve those extensions. Verify the features you depend on against the gateway you pick.
OpenTelemetry GenAI semantic conventions are the closest thing to an observability standard, and they are still stabilizing. LiteLLM emits OTel spans; Langfuse consumes them; the schema is converging but not frozen. If you build today, instrument with OTel and accept that attribute names may shift in the next 12 months.
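A minimal sketch of hand-rolling that instrumentation around a direct SDK call. The gen_ai.* attribute names follow the draft conventions and are exactly the part that may shift; treat them as an assumption to pin in one helper:

# Hand-rolled OTel span around a direct Anthropic call. Attribute names follow
# the draft GenAI semantic conventions and may change; keep them in one place.
from anthropic import Anthropic
from opentelemetry import trace

tracer = trace.get_tracer("agent-infra")
client = Anthropic()

def traced_call(prompt: str):
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute("gen_ai.system", "anthropic")
        span.set_attribute("gen_ai.request.model", "claude-sonnet-4-5")
        msg = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        span.set_attribute("gen_ai.usage.input_tokens", msg.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", msg.usage.output_tokens)
        return msg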
Things nobody talks about
Gateway margin compounds with output token bias. Output tokens cost 4-5x input tokens with most providers. A 5% gateway surcharge on a workload that is 80% output-token cost is, in effect, a tax on the expensive half of your bill. At 100M output tokens/month on Claude Sonnet 4.6 at $15/M output (per anthropic.com/pricing, verified 2026-05-05), that is $1500/month in output cost; a 5% gateway surcharge adds $75 just on the output side. Most cost-modeling spreadsheets I see ignore the input/output split and underestimate.
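The spreadsheet row most people skip, as a sketch; the token split, rates, and surcharge are all knobs to replace:

# Effective surcharge under an output-heavy split. All inputs are assumptions;
# check live per-model pricing before reusing the rates.
input_tokens_m, output_tokens_m = 125, 100   # millions of tokens per month
input_rate, output_rate = 3.0, 15.0          # $/M tokens
surcharge = 0.05

input_cost = input_tokens_m * input_rate      # $375
output_cost = output_tokens_m * output_rate   # $1500
fee = surcharge * (input_cost + output_cost)  # ~$94, most of it riding on output
print(f"output share of bill: {output_cost / (input_cost + output_cost):.0%}, fee: ${fee:.2f}")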
Prompt caching breaks silently across some gateways. Anthropic's cache_control blocks must reach the provider in the exact request shape. I have hit cases where a gateway re-serialized the request and produced a 0% cache hit rate while the client thought caching was on. The only reliable verification is to read the cache_read_input_tokens field on the response and confirm it is non-zero. Add a CI check that issues two identical requests and asserts the second has non-zero cache reads. Without it, you will pay the full uncached rate and not know.
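Pattern 1 below shows the same two-call assertion against the direct SDK. The CI-shaped variant I run against a candidate gateway looks roughly like this, assuming the gateway exposes an Anthropic-native endpoint I can point the SDK at (self-hosted proxies commonly do; OpenAI-shape hosted gateways may need the check rewritten against their response shape). The env vars and prompt file are illustrative:

# Hypothetical CI check: two identical requests through the candidate gateway,
# assert the second reads from cache. GATEWAY_ANTHROPIC_BASE_URL is illustrative;
# the assertion on cache_read_input_tokens is the point.
import os
from anthropic import Anthropic

client = Anthropic(
    base_url=os.environ["GATEWAY_ANTHROPIC_BASE_URL"],
    api_key=os.environ["GATEWAY_API_KEY"],
)
SYSTEM_PROMPT = open("system_prompt.txt").read()  # must exceed the minimum cacheable prefix

def cached_call():
    return client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=64,
        system=[{"type": "text", "text": SYSTEM_PROMPT,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": "cache probe"}],
    )

def test_gateway_preserves_prompt_cache():
    cached_call()           # first call writes the cache entry
    second = cached_call()  # identical prefix should now hit the cache
    assert second.usage.cache_read_input_tokens > 0, \
        "gateway re-serialized the request and broke prompt caching"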
"Self-hosted" does not mean "data does not leave your infrastructure." Self-hosted LiteLLM still calls the model provider. The prompt leaves your VPC the moment LiteLLM forwards it. The data-control benefit is that you remove the third-party gateway as a sub-processor, not that you eliminate provider exposure. If your compliance posture requires no model-provider exposure, your only options are self-hosted models (vLLM, Ollama, llama.cpp) plus a self-hosted gateway. Anything that calls Anthropic or OpenAI at all means those providers see the prompt, regardless of how you reach them.
Free-tier rate limits are designed to push you to paid. OpenRouter's free-tier credit-rate limits are deliberately tight; LiteLLM Cloud has a free tier with low budget caps; Vercel AI Gateway requires a Vercel account. None of them will let you sustain a real production workload for free. Plan for the upgrade path before you build.
Provider deprecations leak through gateways unevenly. When OpenAI deprecates a model alias, OpenRouter, LiteLLM, and Vercel AI Gateway each notify you on different cadences. LiteLLM's model_prices_and_context_window.json is the source of truth for model metadata in that project; if a model is missing or mispriced there, your spend tracking lies. As of 2026-05-04, an issue is open requesting that Mistral Medium 3.5 be added to that file (github.com/BerriAI/litellm/issues/27129). Subscribe to the relevant changelog or you will find out when a request 404s.
Implementation patterns
Three patterns cover ~90% of production deployments. Reference code lives in the agentinfra-examples companion repo.
Pattern 1: Direct Anthropic SDK with prompt caching (single-provider path).
# requires: anthropic==0.69.0
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder: the cached prefix must exceed the model's minimum cacheable
# length (commonly 1024 tokens for Sonnet-class models; check the provider docs)
# or no cache entry is written.
LARGE_SYSTEM_PROMPT = open("system_prompt.txt").read()

def call(user_input: str):
    return client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": LARGE_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": user_input}],
    )

# First call writes the cache entry. cache_read_input_tokens will be 0 here,
# but cache_creation_input_tokens should be non-zero.
first = call("warm the cache")
assert first.usage.cache_creation_input_tokens > 0, "cache write missing"

# Second call with the same cached system prefix should read from cache.
second = call("real user turn")
assert second.usage.cache_read_input_tokens > 0, "cache miss on second call"
When to choose: single-provider apps where you want first-party feature access and the lowest hop count. Verify cache reads in CI by issuing two identical-prefix requests and asserting the second has non-zero cache_read_input_tokens. Asserting on the first call always fails, since the cache entry does not exist until the first request creates it.
Pattern 2: LiteLLM proxy with fallback routing (multi-provider, self-hosted).
# proxy_config.yaml for litellm-proxy 1.83.x
model_list:
  - model_name: agent-primary
    litellm_params:
      model: anthropic/claude-sonnet-4-5
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: agent-fallback
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

router_settings:
  routing_strategy: usage-based-routing-v2
  fallbacks:
    - agent-primary: [agent-fallback]
  num_retries: 2
  timeout: 30

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL
Run with litellm --config proxy_config.yaml. Pin the version (do not run latest); read the BerriAI/litellm release notes before bumping. The two model entries are deliberately distinct aliases: clients call agent-primary, and on retry exhaustion against Anthropic the router fails over to agent-fallback (OpenAI). Reusing the same model_name for both upstreams turns this into a load-balanced pool, not deterministic failover. Pick one shape and document which.
When to choose: server-side fallback across providers with central spend tracking. Pair with Langfuse for observability.
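For completeness, the client side of Pattern 2 is just an OpenAI-shape call at the proxy's address using the alias, with the fallback chain invisible to the caller (local address and key name assumed):

# Client against a locally running LiteLLM proxy (default port 4000 assumed).
from openai import OpenAI

proxy = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-your-virtual-key")
resp = proxy.chat.completions.create(
    model="agent-primary",  # alias from proxy_config.yaml; the router owns failover
    messages=[{"role": "user", "content": "..."}],
)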
Pattern 3: OpenAI-shape client pointed at any gateway.
# requires: openai==1.55.0
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # or LiteLLM, or Vercel AI Gateway
    api_key=os.environ["GATEWAY_API_KEY"],
)
resp = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.5",  # gateway-specific slug format
    messages=[{"role": "user", "content": "..."}],
    extra_body={"models": ["anthropic/claude-sonnet-4.5", "openai/gpt-4o"]},
)
When to choose: gateway portability matters. Keep gateway-specific extensions (extra_body, headers) in one adapter module so swapping is a single-file change.
Conclusion: the decision framework
The honest decision tree, after running this in production:
- Spend under ~$500/month, single provider, ship-it speed: use the direct provider SDK. Add a gateway when you have a real reason, not before.
- Spend under ~$2k/month, multi-provider, hobbyist or early-stage: OpenRouter. The operational simplicity beats the gateway margin at this scale. Accept the sub-processor.
- Spend $2k-50k/month, multi-provider, you operate Python: LiteLLM self-hosted. The margin savings pay for the ops cost, and you own the data plane.
- Already on Vercel, want bundled platform observability: Vercel AI Gateway. Outside that profile the platform tax does not pencil.
- Compliance posture requires minimal sub-processors: direct provider SDKs with ZDR contracts where available, or self-hosted models behind self-hosted LiteLLM.
What I would bet on for the next 24 months: the OpenAI Responses API shape ossifies as the de-facto standard, and gateway portability becomes table stakes. Self-hosted gateways improve observability faster than hosted gateways improve compliance posture. Direct provider SDKs continue to ship features 3-6 months ahead of any gateway, so latency-sensitive teams will keep one foot in each camp.
What I would avoid: building a custom gateway in-house. The translation surface across providers is wider than it looks (tool-call schema variants alone are a multi-quarter project), and the existing options are good enough that the build-vs-buy math rarely favors build below 100M tokens/month.
Companion code with working configs for all three patterns: agentinfra-examples.