
OpenRouter Pricing: What Operators Actually Pay in 2026

Michael Isaac
Operator. 30 yrs in enterprise AI. 17 min read

I started routing production agent traffic through OpenRouter in late 2024 because I was tired of juggling four provider keys, four billing portals, and four sets of rate-limit math. Eighteen months in, I have opinions. The pricing page covers per-token passthrough rates and lists the credit-purchase fee, but four line items move the actual monthly bill and rarely sit on one page together: the credit-purchase platform fee taken at top-up, the BYOK percentage charged on reference cost after a free-request allowance, free-tier behavior under sponsor-budget caps, and the latency tax that compounds across sequential agent calls. This page is the teardown I wish I had read before I committed.

TL;DR

Pick OpenRouter when one API key, instant multi-provider access, and a single-digit-percent skim on inference spend are an acceptable trade for skipping the operational work of running a gateway. Pick a self-hosted alternative like LiteLLM once the credit-purchase margin on monthly spend exceeds the loaded cost of operating a Python proxy with observability and on-call coverage. The crossover is the output of a formula in the Decision framework section: blended passthrough spend times the credit-purchase fee percentage, compared against engineering hours, container compute, and observability stack costs. With the assumptions I see most often (roughly 5% credit-purchase fee, a few engineering hours per month at a $150/hour loaded rate, modest compute and observability, no dedicated on-call premium), the band lands somewhere between $5,000 and $15,000 per month of blended inference spend. Run the formula on the team's actual numbers before treating any threshold as real.

OpenRouter's per-token rates are passthrough from the upstream provider. Their margin lives in two places. First, a credit-purchase platform fee taken when credits are loaded (verify the current percentage on https://openrouter.ai/pricing). Second, for BYOK traffic, a percentage fee on the inference cost that would have been billed at OpenRouter's published per-token rate, applied after a small monthly free allowance; the figure confirmed in the BYOK docs at https://openrouter.ai/docs/use-cases/byok is roughly 5% of that reference cost. There is no flat per-request toll on BYOK. The break-even on BYOK pivots on the size of the operator's negotiated upstream discount relative to that roughly-5% reference-cost fee, not on request count: a discount larger than the fee is what makes BYOK pay off against list price. Free models are real but throttled and route opaquely; fine for prototypes, do not put production paths on them.

The mental model

OpenRouter is a unified inference router. I send an OpenAI-compatible request to one endpoint with one model slug, they pick a provider that serves that model (often there are several with different prices and latencies), and they return the completion. They handle authentication to the upstream provider, retries, fallbacks, and the billing reconciliation. I get one invoice and one set of logs.

The pricing has three discrete cost layers, and treating them as a single number is how teams get surprised:

  1. Per-token cost. Passthrough from the upstream provider for paid models. If Anthropic charges $X per million input tokens for claude-sonnet-4.5, OpenRouter charges $X per million input tokens.
  2. Credit-purchase platform fee. When credits are loaded via the dashboard, OpenRouter takes a platform fee. As of 2026-05-05, the published pay-as-you-go fee on https://openrouter.ai/pricing is 5.5% on credit purchases (verify before committing; the number has moved during 2025-2026). This is the actual revenue line for paid traffic.
  3. BYOK percentage fee after included request allowance. When an operator brings their own provider key, OpenRouter charges a percentage of what the same call would have cost at OpenRouter's normal rate, after a small monthly allowance. The figure confirmed in the BYOK documentation is roughly 5% of that reference cost; verify against https://openrouter.ai/docs/use-cases/byok.

Free models are a separate category. They cost $0 per token because a sponsor is paying. They have hard rate limits, they can disappear, and OpenRouter does not guarantee which physical provider serves the request. Excellent for prototypes, dangerous for any path that has to work tomorrow.

The cleanest way to reason about the bill is layered, not flat:

paid_traffic_cost = provider_passthrough * (1 + credit_purchase_fee_pct)
byok_traffic_cost = direct_provider_spend
                  + max(0, byok_reference_cost_above_allowance) * byok_fee_pct
total = paid_traffic_cost + byok_traffic_cost + taxes_and_payment_processing

byok_reference_cost_above_allowance is the portion of BYOK calls that exceeds the included monthly request allowance, priced at OpenRouter's normal rate (not the operator's negotiated direct-provider rate). Taxes and any payment-processing fees on credit top-ups stack on top, depending on jurisdiction.
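The layered model is small enough to keep as a script. A minimal sketch in TypeScript, with both fee percentages as placeholders to swap for the current published numbers:

// Sketch of the layered cost model above. Both fee percentages are assumptions;
// verify the current credit-purchase and BYOK numbers before trusting the output.
interface MonthlyUsage {
  paidPassthroughUsd: number;            // paid-model spend at published per-token rates
  byokDirectProviderUsd: number;         // what goes straight to the provider on the operator's own key
  byokRefCostAboveAllowanceUsd: number;  // BYOK calls beyond the free allowance, priced at OpenRouter's rates
}

function monthlyBill(
  u: MonthlyUsage,
  creditPurchaseFeePct = 0.055,          // placeholder; see openrouter.ai/pricing
  byokFeePct = 0.05                      // placeholder; see the BYOK docs
): number {
  const paid = u.paidPassthroughUsd * (1 + creditPurchaseFeePct);
  const byok = u.byokDirectProviderUsd + Math.max(0, u.byokRefCostAboveAllowanceUsd) * byokFeePct;
  return paid + byok;                    // taxes and card-processing fees stack on top
}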

The landscape

The unified-router space has consolidated into a handful of credible options. The major axes are hosted-versus-self-hosted and proprietary-versus-open-protocol.

Hosted unified routers. OpenRouter is the most-used independent option and has the broadest model catalog. Vercel AI Gateway is the newer entrant from Vercel, tightly integrated into the AI SDK. Cloud-vendor offerings (AWS Bedrock, Azure AI Foundry, Google Vertex) also serve as multi-model gateways but lock you to one cloud's catalog and pricing.

Self-hosted gateways. LiteLLM is the dominant open-source proxy, runs as a Python service, and gives you the same unified-API shape with no markup on tokens. You pay in operational overhead instead of margin.

Direct provider calls plus your own routing. Call Anthropic, OpenAI, Google directly with separate keys, write a small router in your application, and accept the per-provider operational toll. For teams running heavy traffic on one or two providers, this is often the right answer despite the engineering work.

What actually matters operationally

Vendor pages emphasize model count, provider count, and a low-anchor price example. Operators care about a different list.

Effective per-million-token cost at the actual traffic mix. Take the last 30 days of model usage from the dashboard and compute the blended rate. Then add the credit-margin percentage. The number to compare against direct-provider calls is this blended figure, not the headline rate of the cheapest model in the catalog.
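A sketch of that blended-rate arithmetic in TypeScript. The row shape below is illustrative, not the dashboard's actual export schema; the point is to weight every model by its real token volume before comparing against direct-provider rates:

// Illustrative blended-rate calculation over a 30-day usage export.
// The UsageRow shape is an assumption, not OpenRouter's export format.
interface UsageRow {
  model: string;
  inputTokens: number;
  outputTokens: number;
  inputUsdPerM: number;   // published per-million input rate for this model
  outputUsdPerM: number;  // published per-million output rate for this model
}

function blendedUsdPerMillion(rows: UsageRow[], creditFeePct = 0.055): number {
  let usd = 0;
  let tokens = 0;
  for (const r of rows) {
    usd += (r.inputTokens / 1e6) * r.inputUsdPerM + (r.outputTokens / 1e6) * r.outputUsdPerM;
    tokens += r.inputTokens + r.outputTokens;
  }
  // Blended passthrough rate per million tokens, with the credit-purchase margin layered on top.
  return (usd / (tokens / 1e6)) * (1 + creditFeePct);
}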

Latency overhead per call. Each routing hop adds time. For chat UIs, the overhead often disappears inside other latency. For agent loops with 20 to 100 sequential model calls, even a small per-call overhead compounds visibly. The directional claim from my own use is that OpenRouter adds noticeable but small overhead on median versus calling the upstream provider directly, with a larger fraction at the tail when provider selection lands on a slower upstream. There is no single number worth quoting here; results drift with provider selection, time of day, streaming mode, and region. The setup I run for any decision that depends on the answer: identical prompts and token budgets across both arms, sequential calls (concurrency 1), 100+ samples per arm to stabilize p95, both arms in the same client region, and a recorded distribution of which physical provider OpenRouter actually picked per request. Re-run before committing a stack to the routing tax.
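A compact version of that harness, reusing the OpenAI-SDK client from the implementation section below. The prompt, token budget, and sample count are arbitrary; what matters is that both arms run them identically:

// Sequential latency benchmark: run once against OpenRouter, once against the
// provider's own endpoint with the equivalent model id, same region, same prompt.
import OpenAI from "openai";

async function benchArm(client: OpenAI, model: string, samples = 100): Promise<number[]> {
  const latenciesMs: number[] = [];
  for (let i = 0; i < samples; i++) {   // concurrency 1, on purpose
    const start = performance.now();
    await client.chat.completions.create({
      model,
      max_tokens: 64,                   // fixed output budget across both arms
      messages: [{ role: "user", content: "Define p95 latency in one sentence." }],
    });
    latenciesMs.push(performance.now() - start);
  }
  return latenciesMs.sort((a, b) => a - b);
}

const p95 = (sorted: number[]) =>
  sorted[Math.min(sorted.length - 1, Math.floor(sorted.length * 0.95))];

Compare the two p95 values and, for the OpenRouter arm, record which upstream provider served each request before trusting the delta.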

Fallback and retry behavior. OpenRouter exposes two distinct fallback layers, and conflating them is the source of most surprises. Provider fallback (the default) routes around a single upstream provider when it rate-limits or errors, while keeping the same model slug. Model fallback is a separate feature, configured via the models array in the request body, documented at openrouter.ai/docs/guides/routing/model-fallbacks. The models array can include cross-family entries, so a request can list anthropic/claude-sonnet-4.5 as primary and openai/gpt-4o as next attempt. There is no separate per-feature fee; OpenRouter bills the successful run at whichever model and provider actually served it. The gotcha: a fallback hop can change the per-token rate, the upstream data-handling posture (ZDR commitments, regional routing, BAA scope), the latency profile, and the structured-output behavior. Where LiteLLM still gives more control: deterministic routing rules per tenant or key, pre-flight content checks before falling back, custom retry budgets with exponential backoff per provider, and per-attempt observability hooks.
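A hedged sketch of the model-fallback request shape, written as a plain fetch so the extra field is visible without SDK typings in the way. The models field follows the routing docs linked above; verify the exact shape against current documentation before shipping it:

// Cross-family model fallback via the models array. Field names follow the
// routing docs referenced above; confirm against current docs before relying on them.
const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    models: ["anthropic/claude-sonnet-4.5", "openai/gpt-4o"],  // ordered: primary, then next attempt
    messages: [{ role: "user", content: "..." }],
  }),
});
const completion = await res.json();
// Billing follows whichever model and provider actually answered, so log the
// served model alongside the requested one for every call on this path.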

Data-flow and compliance posture. Prompts pass through OpenRouter on the way to the upstream model provider, and the role each party plays (controller, processor, sub-processor) depends on the contract chain and the deployment shape. A B2B SaaS reselling inference may sit as a processor to its customer-as-controller, with OpenRouter as a sub-processor and the upstream as a sub-sub-processor; a different deployment distributes the roles differently. BYOK adds another wrinkle: the upstream provider relationship is between the operator and the provider, with OpenRouter still touching the prompt in transit. The procedure I run before any production traffic: obtain OpenRouter's DPA and current sub-processor list, map every party that touches a prompt or completion, confirm ZDR commitments and regional routing controls survive each hop, verify BAA availability where HIPAA applies, and review the upstream provider's terms for any clause that gets weakened when accessed via a router rather than directly.

Observability. The OpenRouter dashboard shows per-model spend, per-key spend, request counts, and a per-request activity log including which provider served each call. It does not give the depth of trace data a self-hosted proxy can emit into Langfuse, Braintrust, or an internal OTel collector with custom span attributes. Agent debugging that depends on full prompt-and-response capture with arbitrary metadata runs OpenRouter plus a separate tracing layer in the application code.
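The shape of that separate tracing layer is a thin wrapper around the completion call. A minimal sketch, with emitTrace standing in for whichever collector the stack already uses (Langfuse, Braintrust, an OTel exporter); it is not a specific vendor SDK:

// Thin tracing wrapper. `emitTrace` is a placeholder for the team's collector.
import OpenAI from "openai";

type TraceSink = (event: Record<string, unknown>) => void;

async function tracedCompletion(
  client: OpenAI,
  params: OpenAI.Chat.ChatCompletionCreateParamsNonStreaming,
  emitTrace: TraceSink,
  metadata: Record<string, unknown> = {}
) {
  const start = Date.now();
  const completion = await client.chat.completions.create(params);
  emitTrace({
    ...metadata,
    requestedModel: params.model,
    servedModel: completion.model,              // the model that actually answered
    latencyMs: Date.now() - start,
    usage: completion.usage,                    // token counts as reported by the gateway
    prompt: params.messages,
    response: completion.choices[0]?.message,
  });
  return completion;
}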

Lock-in risk. The wire format (OpenAI Chat Completions) is shared across OpenRouter, LiteLLM, Vercel AI Gateway, and most direct provider SDKs, so the base URL and key swap is genuinely small. The portability traps live one layer down. Migration checklist: remap every model slug (OpenRouter's anthropic/claude-sonnet-4.5 versus LiteLLM's dated anthropic/claude-sonnet-4-5-20250929); replay tool-call round trips end-to-end, since tool_calls id formats and tool role message shapes drift between providers; diff streaming event shapes; re-test structured outputs (response_format: json_schema is translated unevenly across providers); audit error taxonomy; inventory provider-specific headers; and re-wire tracing metadata. The wire-format compatibility is the easy 80%; the test surface above is the 20% that catches teams mid-migration.

Detailed teardowns

OpenRouter

Position: hosted unified router, broadest model catalog, lowest operational overhead.

OpenRouter routes an OpenAI-compatible request to one of N upstream providers. They handle authentication to the upstream, the per-provider rate limits, the failover, and the billing rollup. The dashboard exposes per-model spend, per-key spend, and credit balance. Operators load credits in advance (the platform fee is taken at credit-purchase time) and burn them down as calls execute.

Cost shape, stated precisely: layer one, paid traffic, per-token rates passthrough plus a credit-purchase platform fee taken when credits are loaded (single-digit percent, verify against openrouter.ai/docs). Layer two, BYOK, a percentage of the inference cost that would have been billed at OpenRouter's published rate, after a small monthly free-allowance threshold. The published BYOK fee confirmed in the docs is roughly 5% of that reference cost. There is no separate flat per-request fee on BYOK.

A concrete $1,000/month example. Workload A pays through OpenRouter credits at published rates: token cost $1,000, credit-purchase fee at roughly 5% adds $50, total $1,050. Workload B brings its own Anthropic key with a 10% negotiated enterprise discount: direct Anthropic spend $900, OpenRouter BYOK fee at roughly 5% of the $1,000 reference adds $50, total $950. Shrink the discount to 5% and Workload B climbs to $1,000: a wash against paying Anthropic list price directly, and only a marginal win over Workload A's $1,050. With no negotiated discount at all, BYOK and straight credits land within a few dollars of each other, and the extra key management buys nothing.

Right call when: the team has fewer than five engineers, no appetite for operating a Python proxy, real value in the broad model catalog, genuine use of cross-provider fallback at the same model slug, and monthly spend small enough that the platform fee is not the dominant line. Also right when prototyping against models the team has not committed to yet.

Wrong call when: monthly token volume is high and concentrated on one or two model slugs; deep custom observability is required beyond what the dashboard exposes; the workload is in a heavily regulated industry where adding another sub-processor is a compliance burden; or the latency budget is tight enough that the routing hop matters.

Try it via my OpenRouter referral link (affiliate, see disclosure above).

LiteLLM (self-hosted)

Position: self-hosted Python proxy, no token markup, full control over routing logic.

LiteLLM gives you the same OpenAI-compatible unified API but runs as a service you operate. You configure providers in proxy_config.yaml, deploy the container, and your application calls https://your-litellm.example.com/v1/chat/completions. Per-token costs are exactly what the upstream provider charges; there is no margin layer.

The cost shape inverts. You pay for the underlying provider tokens (no markup) plus the operational cost of running the proxy: container compute, observability, on-call burden when it breaks. For a team already operating Python services in production, the marginal cost is small. For a team that is not, you are spinning up a new operational surface.

Recent issue volume on the project is meaningful. Per the issue-tracker data I keep an eye on, the BerriAI/litellm repo has open bugs around Bedrock tool injection (issue 27138, opened 2026-05-04), header duplication on Anthropic-wrapped routes (27119), and end-user spend logging that ignores its own gating flag (27038). None are dealbreakers, but they confirm LiteLLM is a real piece of software you operate, with the bug-volume profile of a real piece of software.

What LiteLLM uniquely buys you is not cross-model-family fallback. OpenRouter handles that now via its provider-routing controls, documented at openrouter.ai/docs/guides/routing/provider-selection. The actual differentiators are structural: the proxy runs in your VPC so prompts never leave your network boundary before hitting the upstream provider; you write your own policy logic (per-tenant budgets, hard spend caps, per-key model allowlists, prompt-content guards); you wire traces directly into your existing observability stack with whatever custom attributes you want; and you control auth (mTLS, internal SSO, your own key-issuance flow).

Right call when: compliance posture requires the gateway to live in your VPC; you need policy controls that a hosted dashboard does not expose; you want first-class integration with an in-house tracing stack; or your monthly inference spend is high enough that the OpenRouter margin exceeds your operational cost.

Wrong call when: the team does not already operate Python services, the team is small, or there is no one on-call for a new piece of infrastructure.

Vercel AI Gateway

Position: hosted gateway tightly integrated with the Vercel AI SDK.

If you are already on Vercel and using the AI SDK, the AI Gateway is the lowest-friction option. It handles provider routing, fallback, and observability inside the Vercel surface. The pricing model and exact margin behavior have evolved during 2025 and 2026; verify the current rates against the Vercel pricing page before committing.

Right call when: you are already a Vercel customer, you are building with the AI SDK, and you want one platform for hosting plus inference routing.

Wrong call when: you are not on Vercel. The lock-in is real because the value comes from the integration, not from the gateway in isolation.

Direct provider calls

Position: the no-gateway option.

Call Anthropic, OpenAI, Google directly with separate SDKs and separate keys. Write a small router in your application if you need cross-provider fallback. You pay exactly the provider rate, you get exactly the provider's observability, and you accept exactly the provider's rate limits.

Right call when: you have stabilized on one or two model providers, your traffic is heavy enough that the gateway margin would be the largest line item, and your team can absorb the per-provider operational toll. Most production agent platforms over a certain scale end up here for at least their hot path.

Wrong call when: you are still experimenting with which models work for your workload, you need rapid access to new models as they ship, or you do not want to write fallback logic yourself.

The standards layer

The unified-API shape across all of these gateways is the OpenAI Chat Completions wire format. It is not a formal standard with a governance body, but in practice it has become the lingua franca of inference routing. OpenRouter, LiteLLM, Vercel AI Gateway, and most self-hosted proxies all speak it. Anthropic's native API has a different shape but is widely supported via translation layers.

What this means for portability: switching gateways is a configuration change, not a code rewrite, as long as you stay on the OpenAI-compatible surface. Tool-use semantics are mostly compatible across providers but have subtle differences in how function-call results round-trip; test this part specifically when you switch.

What it does not solve: observability schemas, tracing semantics (OpenTelemetry GenAI semconv is the live standard there but is still stabilizing), and the fact that provider-specific features (Anthropic's prompt caching, OpenAI's structured outputs, Gemini's context caching) leak through the unified API in different ways.

Things nobody talks about

The credit-purchase margin grows with spend in a way teams notice late. A small percentage skim is invisible at $200/month. At $20,000/month, a roughly 5% skim is about $1,000 a month, $12,000 a year, enough to cover most of the self-hosting cost in the break-even math below. Most teams do not notice the crossover because the OpenRouter dashboard shows token spend, not "what you would have paid the provider directly." Action: set a calendar reminder to compute blended rate versus upstream rate every quarter.

Free-model rate limits change without notice and route opaquely. A model that worked yesterday may rate-limit aggressively today because the sponsor capped the budget. There is no SLA. I have had production prototypes break overnight when a free Llama variant got reclassified. Action: tag any code path that hits a free model and audit it before promoting to a real workload.

BYOK does not always save what operators think. As of 2026-05-05, OpenRouter's BYOK model charges a percentage fee on the inference cost that would have been billed at their published rate, after a small monthly free allowance. The number confirmed in the docs is roughly 5% of the equivalent passthrough cost. Concrete example: a workload that would cost $10,000/month at OpenRouter's published claude-sonnet-4.5 passthrough rates, routed via BYOK against a negotiated Anthropic enterprise contract priced 20% under list, runs the operator $8,000 in direct Anthropic spend plus roughly $500 in OpenRouter BYOK fees, totalling $8,500. That beats both straight OpenRouter ($10,000 plus credit-purchase margin) and direct Anthropic at list ($10,000), but only because the negotiated discount exceeds the BYOK fee. With a 5% enterprise discount and a 5% BYOK fee, the math washes out against paying list price directly. Action: model the specific contract before flipping the BYOK switch.
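The arithmetic behind that action item fits in a few lines. The fee defaults below are placeholders to verify; the three totals are straight credits, BYOK on the negotiated key, and direct calls at the negotiated rate:

// BYOK break-even sketch. referenceUsd is the workload priced at OpenRouter's
// published rates; discountPct is the negotiated provider discount.
// Fee defaults are placeholders; it also ignores the small monthly free allowance.
function byokComparison(referenceUsd: number, discountPct: number,
                        creditFeePct = 0.055, byokFeePct = 0.05) {
  const straightCredits = referenceUsd * (1 + creditFeePct);
  const directAtDiscount = referenceUsd * (1 - discountPct);
  const byok = directAtDiscount + referenceUsd * byokFeePct;
  return { straightCredits, byok, directAtDiscount };
}

console.log(byokComparison(10_000, 0.20)); // { straightCredits: 10550, byok: 8500, directAtDiscount: 8000 }
console.log(byokComparison(10_000, 0.05)); // byok lands at 10000, a wash against list-price direct calls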

Provider selection is sometimes opaque and affects effective price. When multiple providers serve the same model slug at different per-token rates, OpenRouter's routing logic picks one. The picks are documented at a high level but the per-request decision is not always visible without digging into the response metadata. Action: check the response headers or the activity log to see which provider actually served each call, especially for new slugs.
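A quick way to make the per-request pick visible without leaving the code path, assuming the completion body carries a provider field next to the standard Chat Completions fields; confirm the exact field name against the current API reference, or fall back to the dashboard activity log:

// Log which upstream served the call. The `provider` field name is an assumption;
// confirm it against the current API reference or use the activity log instead.
const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "anthropic/claude-sonnet-4.5",
    messages: [{ role: "user", content: "ping" }],
  }),
});
const body = await res.json();
console.log(body.model, body.provider ?? "provider not present in response body");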

Compliance scope expands when a router is added. Adding OpenRouter to the data flow adds a sub-processor. The DPA-with-customers, the privacy policy, and the security questionnaires all need to reflect this. For B2B sales into regulated industries, this is not free. Action: get OpenRouter's DPA and security documentation before writing any production traffic, not when the first enterprise prospect asks.

Implementation patterns

The following patterns assume Node.js with the OpenAI SDK pointed at OpenRouter. Same shape works in Python.

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENROUTER_API_KEY,
  baseURL: "https://openrouter.ai/api/v1",
  defaultHeaders: {
    "HTTP-Referer": "https://your-app.example.com",
    "X-Title": "your-app",
  },
});

const completion = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4.5",
  messages: [{ role: "user", content: "..." }],
});

When to choose this: anything experimental, anything where you want one key and one bill, any new agent stack where you have not yet stabilized on a model.

Note the model slug difference between the two routers. OpenRouter uses the dotted form anthropic/claude-sonnet-4.5. LiteLLM expects the dated Anthropic model id (e.g. anthropic/claude-sonnet-4-5-20250929), matching Anthropic's native API surface. If you copy-paste a slug from one config into the other you will get a 404 or a silent fallback.

# proxy_config.yaml — LiteLLM equivalent for the same call surface
model_list:
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-5-20250929
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

router_settings:
  fallbacks:
    - claude-sonnet: ["gpt-4o"]

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY

When to choose this: monthly spend is high, you need custom fallback semantics across model families, or you need the gateway in your VPC for compliance.

For a working version of both patterns including request logging and a small benchmark harness, see the companion repo at agentinfra-examples.

Decision framework

The order of operations matters. Spend is the last filter, not the first.

  1. Compliance and data-flow first. If the workload carries PHI, regulated PII, or contractual ZDR requirements, the question is whether adding a sub-processor is acceptable at all. If the answer is no, default to LiteLLM in the VPC or direct provider calls with DPAs and BAAs already in hand. Spend thresholds do not override this.

  2. Latency budget second. If the agent loop has a hard p95 budget (voice UX, real-time tools, sub-second turns), measure the routing-hop tax against that budget before anything else.

  3. Existing contracts third. A negotiated enterprise discount with Anthropic, OpenAI, or a hyperscaler changes the arithmetic. BYOK through OpenRouter still incurs the percentage fee on the reference cost, and when the negotiated discount is smaller than that fee, the fee erases the discount's value. Direct calls or a self-hosted proxy with the negotiated key is often the right answer regardless of total spend.

  4. Then the break-even math. One framework, applied consistently:

    monthly_margin_cost   = blended_passthrough_spend * credit_margin_pct
    monthly_self_host_cost = engineer_hours_per_month * loaded_hourly_rate
                           + container_compute
                           + observability_stack
                           + on_call_premium
    

    The crossover is where monthly_margin_cost exceeds monthly_self_host_cost. For the assumptions I see most often (a roughly 5% credit-purchase margin, two to four engineering hours per month of LiteLLM toil at a $150/hour loaded rate, $50 to $150/month of compute and observability, no dedicated on-call premium), the crossover lands somewhere between $5,000 and $15,000 per month of blended inference spend. A team with $300/hour loaded engineers and a real on-call premium pushes the crossover above $20,000. A team that already runs Python services and has Langfuse wired up pushes it below $5,000. A worked version of this arithmetic appears as a sketch after this list.

  5. Spend-based defaults, given the formula. Compliance and latency clear, contracts modeled, math run:

    • Below the team's computed crossover and outside any compliance constraint: OpenRouter. The operational savings are real.
    • Above the computed crossover: migrate the hot path to direct provider calls or self-hosted LiteLLM. Keep OpenRouter for experimental traffic and long-tail models.
    • Straddling the crossover: split the traffic. Hot path direct, experimental and tail through the router. The wire format is the same, so the split costs config, not code.
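A worked version of the step-4 arithmetic, with the assumptions from that step as defaults. Every number below is a placeholder for the team's own figures:

// Break-even sketch for step 4. All defaults are assumptions from the prose above;
// replace them with the team's actual numbers before trusting the output.
interface SelfHostAssumptions {
  engineerHoursPerMonth: number;
  loadedHourlyRate: number;
  containerComputeUsd: number;
  observabilityUsd: number;
  onCallPremiumUsd: number;
}

function crossoverSpendUsd(
  creditMarginPct = 0.05,
  a: SelfHostAssumptions = {
    engineerHoursPerMonth: 3,
    loadedHourlyRate: 150,
    containerComputeUsd: 100,
    observabilityUsd: 50,
    onCallPremiumUsd: 0,
  }
): number {
  const monthlySelfHostCost =
    a.engineerHoursPerMonth * a.loadedHourlyRate +
    a.containerComputeUsd +
    a.observabilityUsd +
    a.onCallPremiumUsd;
  // monthly_margin_cost = blended_passthrough_spend * credit_margin_pct, so the
  // crossover spend is the self-host cost divided by the margin percentage.
  return monthlySelfHostCost / creditMarginPct;
}

console.log(crossoverSpendUsd()); // 12000 with these defaults, inside the $5k-$15k band above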

Revisit the formula every quarter. Provider rates move, blended mix drifts as new models ship, and engineering costs change. Anyone presenting a single dollar threshold without showing their assumptions is selling something.