
How to Set Up LiteLLM Proxy for Production AI Agents (2026)
I run a LiteLLM proxy in front of every agent fleet I ship. This page is the single-host Compose baseline I lay down before any production hardening, not the hardened topology itself. Once more than two agents call more than one provider, the alternative is duplicated retry policy across each app's SDK client, key rotation drift the day a provider key leaks, and no per-app spend-log join key when finance asks where the bill went. The baseline covers Postgres-backed virtual keys, budgets, retries, and a named fallback. HA, managed Postgres, Redis cache, and an auth boundary in front of the admin surface are called out inline as follow-ups.
Prerequisites
- Docker 24+ and Docker Compose v2
- Postgres 14+ for virtual keys, spend tracking, and the admin UI
- At least one provider API key. This walkthrough uses Anthropic and OpenAI, plus an OpenRouter key as a fallback target
- A master key for the proxy admin surface. The LiteLLM virtual-keys flow at https://docs.litellm.ai/docs/proxy/virtual_keys (fetched 2026-04-25) treats sk--prefixed admin keys as the convention. Generate with openssl rand -hex 32 and prepend sk-
- Optional: Redis 7+ for shared rate-limit and cache state across replicas
The Compose file in Step 3 uses a mutable tag (ghcr.io/berriai/litellm-database:v1.55.10) so the page reads as a copy-paste smoke-test baseline. That tag is fine for a local laptop and not fine for anything a paying customer touches: a routine container restart on a mutable tag is how a working proxy turns into a 3 a.m. CrashLoopBackOff after a Prisma schema change landed overnight. Before promoting beyond a smoke test, resolve the tag to an immutable digest on deploy day with docker pull ghcr.io/berriai/litellm-database:v1.55.10 followed by docker inspect --format='{{index .RepoDigests 0}}' ghcr.io/berriai/litellm-database:v1.55.10, then replace the tag in image: with the resulting @sha256: reference.
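The same resolution end to end, so the digest lands somewhere you can paste straight into image:; nothing here beyond the two docker commands already named above.
TAG=ghcr.io/berriai/litellm-database:v1.55.10
docker pull "$TAG"
# RepoDigests index 0 is the registry digest for the tag just pulled
DIGEST=$(docker inspect --format='{{index .RepoDigests 0}}' "$TAG")
echo "$DIGEST"   # ghcr.io/berriai/litellm-database@sha256:...
# paste the printed @sha256: reference into the image: line of docker-compose.yml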
Step 1, Lay out the project
litellm-proxy/
docker-compose.yml
config.yaml
.env
.gitignore
.gitignore:
.env
.env (do not commit):
LITELLM_MASTER_KEY=sk-litellm-...replace-with-openssl-rand-hex-32...
LITELLM_SALT_KEY=...replace-with-another-openssl-rand-hex-32...
POSTGRES_PASSWORD=...replace-with-openssl-rand-hex-32...
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
OPENROUTER_API_KEY=sk-or-v1-...
DATABASE_URL=postgresql://litellm:${POSTGRES_PASSWORD}@db:5432/litellm
Two of these values break the proxy in different ways when they are wrong. The master-key prefix uses sk- per the documented convention. The Postgres password in DATABASE_URL has to match POSTGRES_PASSWORD, because Compose initializes the db service with the latter and the proxy authenticates with the former. The throwaway litellm:litellm credential pair from quickstarts is a smoke-test value, not a deployment default.
The salt key, per the LiteLLM production guide (fetched 2026-04-25), encrypts DB-stored credentials, the upstream provider keys written to Postgres when store_model_in_db is on, and secrets entered through the admin UI. It is not the encryption key for the virtual-key strings the proxy issues; those are stored as hashed verification tokens and validated by hash comparison. On the baseline in this page (store_model_in_db: false, provider keys resolved from os.environ/...), the practical blast radius of a lost salt is limited. Once store_model_in_db is flipped on, or once provider credentials are entered through the UI, the salt becomes load-bearing and a fresh value silently breaks every credential it previously protected. Store the salt in a secrets manager the day it is generated.
Backing up pgdata before any migration
The pgdata volume holds every issued virtual key, every spend record, and any model state stored when store_model_in_db is on. Losing it loses all three.
DUMP_FILE="litellm-$(date -u +%Y%m%dT%H%M%SZ).dump"
docker compose exec -T db pg_dump -U litellm -Fc litellm > "$DUMP_FILE"
test -s "$DUMP_FILE" || { echo "dump is empty: $DUMP_FILE" >&2; exit 1; }
docker compose exec -T db pg_restore -l < "$DUMP_FILE" \
| grep -E 'LiteLLM_(VerificationToken|SpendLogs)'
Restore is pg_restore --clean --if-exists -U litellm -d litellm against the same image version the dump was taken on, with the proxy stopped. On managed Postgres, the equivalent is whatever point-in-time recovery the provider offers. Run a periodic restore drill against a throwaway database; a backup that has never been restored is a hope, not a backup.
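A restore drill in the shape described above, sketched against a throwaway database created and dropped inside the existing db container; it reuses $DUMP_FILE from the backup snippet and assumes nothing beyond the two tables the grep above already names.
# scratch database for the drill; never touches the live litellm database
docker compose exec -T db createdb -U litellm litellm_drill
docker compose exec -T db pg_restore --clean --if-exists -U litellm -d litellm_drill < "$DUMP_FILE"
# the restored copy should contain keys and spend rows, not just empty tables
docker compose exec -T db psql -U litellm -d litellm_drill -c \
  'SELECT count(*) AS keys FROM "LiteLLM_VerificationToken";'
docker compose exec -T db psql -U litellm -d litellm_drill -c \
  'SELECT count(*) AS spend_rows FROM "LiteLLM_SpendLogs";'
docker compose exec -T db dropdb -U litellm litellm_drill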
Step 2, Write the proxy config
config.yaml:
model_list:
- model_name: claude-sonnet
litellm_params:
model: anthropic/claude-sonnet-4-5-20250929
api_key: os.environ/ANTHROPIC_API_KEY
# rpm/tpm: derive from your provider account, see below
- model_name: claude-sonnet-fallback
litellm_params:
model: openrouter/anthropic/claude-sonnet-4.5
api_key: os.environ/OPENROUTER_API_KEY
- model_name: gpt-4o
litellm_params:
model: openai/gpt-4o
api_key: os.environ/OPENAI_API_KEY
router_settings:
routing_strategy: simple-shuffle
num_retries: 2
timeout: 30
fallbacks:
- claude-sonnet: ["claude-sonnet-fallback"]
litellm_settings:
drop_params: false
set_verbose: false
json_logs: true
disable_error_logs: true
general_settings:
master_key: os.environ/LITELLM_MASTER_KEY
database_url: os.environ/DATABASE_URL
store_model_in_db: false
ui_access_mode: admin_only
database_connection_pool_limit: 20
Why the Sonnet 4.5 dated slug
The dated slug anthropic/claude-sonnet-4-5-20250929 is what this page is verified against. A tutorial that pins an undated alias rots silently the moment Anthropic ships the next build, because the context-window arithmetic and fallback compatibility are tied to a specific documented input window. As of 2026-04-25 the canonical Anthropic format is anthropic/claude-sonnet-4-5-20250929 and OpenRouter exposes the same model as anthropic/claude-sonnet-4.5. A wrong slug returns BadRequestError: model not found.
Deriving rpm and tpm
Derive rpm and tpm from the provider account's limits page and the response headers returned for the exact key and tier the proxy authenticates as. Anthropic, per the rate-limit reference, enforces ITPM and OTPM as separate ceilings; LiteLLM's tpm field is a single number per route and cannot encode the split, so set it to the stricter observed bucket. OpenAI surfaces its ceiling on x-ratelimit-limit-tokens. Until that derivation happens, leave the fields commented out and let the upstream do the rate-limiting.
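The derivation itself is two header reads. A sketch, with header names taken from the providers' rate-limit docs as of the same fetch date (anthropic-ratelimit-* on Anthropic, x-ratelimit-* on OpenAI); confirm them against your own key and tier before copying numbers into config.yaml.
# Anthropic: ceilings come back as anthropic-ratelimit-* response headers
curl -sS -D - -o /dev/null https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-sonnet-4-5-20250929","max_tokens":1,"messages":[{"role":"user","content":"ok"}]}' \
  | grep -i 'anthropic-ratelimit'
# OpenAI: the token ceiling is on x-ratelimit-limit-tokens
curl -sS -D - -o /dev/null https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","max_tokens":1,"messages":[{"role":"user","content":"ok"}]}' \
  | grep -i 'x-ratelimit'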
OpenRouter as a named fallback
The version above gives the OpenRouter route its own model_name (claude-sonnet-fallback) and references it in the fallbacks block, so callers asking for claude-sonnet hit Anthropic directly on the happy path and only see OpenRouter when the direct Anthropic route returns a retryable error after num_retries.
A failover to OpenRouter changes the data processor and adds a sub-processor: OpenRouter sees every prompt and response in plaintext. For PHI, PII under GDPR Art. 28, or anything covered by a customer DPA that names sub-processors, fallback through an aggregator without prior written approval is a compliance incident. I drop the fallback entry on regulated paths.
Why context_window_fallbacks is intentionally absent
claude-sonnet-4-5-20250929 documents a 200K-token input window; gpt-4o documents 128K. A context-window fallback has to route to a model with a verified larger or compatible context window. Routing to a smaller window means a request that overflows Sonnet also overflows the fallback and fails the same way, which is decorative rather than useful. Either remove the fallback or add a target whose documented context window is verified larger, proven with a per-route conformance test.
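Before trusting any window arithmetic, check what the proxy itself believes about each route once the stack from Step 3 is up. The /model/info admin endpoint can be queried for this; the max_input_tokens field name is an assumption from LiteLLM's model cost map, so inspect the raw JSON if the jq filter comes back empty.
curl -sS http://localhost:4000/model/info \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  | jq '.data[] | {route: .model_name, max_input_tokens: .model_info.max_input_tokens}'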
Production mode and structured logs
The defaults that ship with the proxy image are tuned for development: verbose payload logging, plaintext log lines, and an error path that emits request and response bodies. Step 3's Compose file sets:
environment:
LITELLM_MODE: "PRODUCTION"
LITELLM_LOG: "ERROR"
json_logs: true makes every line a structured JSON object. disable_error_logs: true strips the proxy's own error-log lines that embed request bodies; the error itself still propagates as an HTTP response and as a single ERROR log record. Ship the proxy's stdout to a log pipeline that applies redaction at ingest, restrict read access to LiteLLM_SpendLogs, and treat the proxy host's local log files as transient buffers.
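A cheap check that the structured-log claim holds on a running stack: every sampled stdout line should parse as JSON. This asserts parseability only, not any particular field names.
# run after the proxy has served at least one request; startup banner lines, if any, will trip it
docker compose logs --no-log-prefix --tail 50 litellm \
  | jq -R 'fromjson? // error("non-JSON log line")' > /dev/null \
  && echo "sampled log lines parse as JSON"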
drop_params: false by default
The flag silently strips request parameters the target provider does not accept. With it on, a single agent using a non-portable parameter starts succeeding with different semantics than the calling code asked for. Leave drop_params: false so the proxy returns a 400 instead of a silent semantic shift. Allowlisting a single parameter on a specific route requires a passing route-level conformance test first.
| Parameter class | Why dropping is unsafe |
|---|---|
| seed | Invalidates regression-test fixtures |
| temperature, top_p, top_k | Returns measurably different prose than the calibrated baseline |
| tools, tool_choice | Returns plain completion instead of structured tool_calls; dispatcher silently no-ops |
| response_format | Returns free-form prose where caller expected schema-conformant JSON |
| logprobs | Silently zeros the signal scorers depend on |
| user, metadata, custom headers | Breaks attribution, billing reconciliation, routing policy |
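A route-level probe for the refusal behavior drop_params: false buys: a parameter the upstream cannot honor should come back as a 400, not vanish. That frequency_penalty is the parameter that trips it on the Anthropic route is an assumption; substitute whichever non-portable parameter your agents actually send.
# expect HTTP 400 naming the unsupported parameter, not a 200 with silently changed semantics
curl -sS -o /tmp/drop_params_probe.json -w 'HTTP %{http_code}\n' \
  http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "claude-sonnet", "frequency_penalty": 0.5,
       "messages": [{"role": "user", "content": "ok"}]}'
cat /tmp/drop_params_probe.json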
database_connection_pool_limit is sized against actual concurrency
Per the production guide, this limit applies per worker process. The formula is database_connection_pool_limit <= max_connections / (instances * workers), with headroom for the admin UI, migration jobs, and psql sessions. Postgres 16 ships max_connections=100, so a single Compose instance with one worker can take a 20-connection pool. Sizing wrong cascades: too-large pools exhaust Postgres connections under load, every replica refuses requests at the database layer, and the symptom looks like "the proxy died."
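Two psql one-liners check the arithmetic against the live database rather than the formula: the configured ceiling, and how much of it is actually held right now.
# configured ceiling (Postgres 16 default: 100)
docker compose exec db psql -U litellm -d litellm -c 'SHOW max_connections;'
# connections currently held, grouped by application, to see the proxy's real share
docker compose exec db psql -U litellm -d litellm -c \
  "SELECT application_name, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"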
store_model_in_db: false
The DB-backed model store lets the model list be edited from the admin UI without a redeploy, which is a split-brain trap: the Postgres copy and the config.yaml copy diverge silently the first time someone clicks through the UI. config.yaml is the only canonical source, lives in version control, and the proxy reloads on commit.
ui_access_mode: admin_only
The admin UI ships with broader access modes that surface key generation and spend data to non-admin viewers. Locking it to admin-only is the conservative default. The UI itself still belongs behind a separate auth boundary.
Step 3, Compose it together
docker-compose.yml:
services:
litellm:
image: ghcr.io/berriai/litellm-database:v1.55.10
restart: unless-stopped
ports:
- "127.0.0.1:4000:4000"
env_file: .env
environment:
DISABLE_SCHEMA_UPDATE: "false"
LITELLM_MODE: "PRODUCTION"
LITELLM_LOG: "ERROR"
volumes:
- ./config.yaml:/app/config.yaml:ro
command: ["--config", "/app/config.yaml", "--port", "4000"]
depends_on:
db:
condition: service_healthy
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:4000/health/liveliness"]
interval: 30s
timeout: 5s
retries: 3
db:
image: postgres:16
restart: unless-stopped
environment:
POSTGRES_USER: litellm
POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
POSTGRES_DB: litellm
volumes:
- pgdata:/var/lib/postgresql/data
healthcheck:
test: ["CMD", "pg_isready", "-U", "litellm"]
interval: 10s
timeout: 5s
retries: 5
volumes:
pgdata:
The litellm-database image variant ships with the Prisma client, Postgres client, and curl baked in, which the slim runtime image does not. Substituting a stripped image leaves the container marked unhealthy even when the proxy is serving traffic.
The port binding is 127.0.0.1:4000:4000, not 0.0.0.0:4000:4000. A real deployment puts the proxy behind a separate layer with five properties: TLS termination, an auth boundary in front of master-key-protected admin routes, a private listener, source-CIDR allowlisting, and access logging.
The healthcheck uses /health/liveliness because liveliness is the right semantics for a Docker restart probe. For load-balancer gates and Kubernetes readinessProbe the right endpoint is /health/readiness, which additionally validates the DB connection.
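Both endpoints are probeable from the host through the loopback binding; they are typically unauthenticated on this image, but treat that as an assumption and add the Authorization header if either returns a 401.
# restart probe: the process is alive
curl -sf http://127.0.0.1:4000/health/liveliness && echo " liveliness ok"
# traffic gate: the DB connection is validated too
curl -sf http://127.0.0.1:4000/health/readiness | jq .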
Run schema migrations from exactly one process. On v1.55.10 the proxy honors DISABLE_SCHEMA_UPDATE=true. The migration owner replica boots with that variable unset and runs Prisma migrations; every other replica boots with it set to true. A dedicated one-shot job is cleaner than picking a long-running replica as owner. Without this discipline, a rolling deploy across N replicas produces concurrent migration contention.
Bring it up:
docker compose up -d
docker compose logs -f litellm
Watch for the line saying the database schema has been migrated. Prisma errors at this stage almost always mean DATABASE_URL resolved to localhost instead of the db service.
Step 4, Issue your first virtual key
Each app gets its own key per environment. That is the unit of isolation: per-app max_budget, per-app models allowlist, per-app rpm_limit, and per-app revocation that does not touch any other app or any upstream provider key.
umask 077
curl -sS -X POST http://localhost:4000/key/generate \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{
"models": ["claude-sonnet"],
"duration": "30d",
"max_budget": 25.00,
"budget_duration": "30d",
"rpm_limit": 60,
"metadata": {"app": "agent-research", "owner": "michael"}
}' | jq '{key: .key, expires: .expires}'
The response includes a key starting with sk- and an expires timestamp the proxy computed from duration. Write the key directly into the secrets manager rather than to a file.
duration sets the key's lifetime; budget_duration is the rolling window over which max_budget accumulates and resets. A key with duration: "30d" and budget_duration: "7d" would refill its $25 cap weekly and stop authenticating after a month.
On the OSS image, no built-in scheduler reissues keys before they expire. The OSS pattern is overlapping manual issuance: on day 23 of a 30-day window, issue a second key with the same payload, write the new value into the secrets manager, and trigger a rolling restart on every agent. Once spend logs for the old key show zero requests for one full traffic cycle, retire it.
# Disable the old key after the overlap window. Reversible.
curl -sS -X POST http://localhost:4000/key/block \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{"key": "sk-...the-old-key..."}'
/key/block preserves spend history and allows /key/unblock if the cutover is incomplete. /key/delete is permanent; use only after a blocked key has shown zero traffic for a full retention window.
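One way to confirm the blocked key has actually gone quiet before reaching for /key/delete is to read its tracked spend twice, a full traffic cycle apart, via /key/info; the exact response shape (.info.spend) is an assumption, so eyeball the raw JSON first.
# spend should be identical across two reads separated by a full traffic cycle
curl -sS "http://localhost:4000/key/info?key=sk-...the-old-key..." \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq '.info.spend'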
The max_budget field is intended to make the proxy reject requests once tracked spend on the key crosses $25. On the v1.55.10 digest, the budget-exceeded path is an HTTP 400 with a JSON body whose error.type is budget_exceeded. 429 is reserved for rate-limit semantics. Treating a budget rejection as a 429 means the calling app retries a request that will keep failing. Spend aggregation is also eventually consistent. Treat the budget as a circuit breaker, not a hard guarantee.
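Client-side, that distinction is worth a few explicit lines rather than a blanket retry-on-4xx/5xx policy. A sketch of the branching, not a retry library:
HTTP_CODE=$(curl -sS -o /tmp/resp.json -w '%{http_code}' \
  http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-...the-issued-key..." \
  -H "Content-Type: application/json" \
  -d '{"model": "claude-sonnet", "messages": [{"role": "user", "content": "ok"}]}')
ERROR_TYPE=$(jq -r '.error.type // empty' /tmp/resp.json)
if [ "$HTTP_CODE" = "400" ] && [ "$ERROR_TYPE" = "budget_exceeded" ]; then
  echo "budget exhausted: stop retrying, alert the key owner" >&2
elif [ "$HTTP_CODE" = "429" ]; then
  echo "rate limited: retry with backoff" >&2
fi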
Smoke test:
curl http://localhost:4000/v1/chat/completions \
-H "Authorization: Bearer sk-...the-issued-key..." \
-H "Content-Type: application/json" \
-d '{
"model": "claude-sonnet",
"messages": [{"role": "user", "content": "Reply with the single word: ok"}]
}'
Expected: a 200 response with a choices array containing an assistant message and a usage block with token counts populated. The exact assistant text is advisory; tightening to exact prose belongs in the conformance suite.
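The same expectation as a single assertion, suitable for a CI job; it checks structure only, in line with the advisory note about exact prose.
curl -sS http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-...the-issued-key..." \
  -H "Content-Type: application/json" \
  -d '{"model": "claude-sonnet", "messages": [{"role": "user", "content": "Reply with the single word: ok"}]}' \
  | jq -e '.choices[0].message.content != null and .usage.total_tokens > 0' > /dev/null \
  && echo "smoke test passed"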
Step 5, Wire your agent code
# pip install "openai>=1.55,<2"
from openai import OpenAI
client = OpenAI(
base_url="http://litellm-proxy:4000/v1",
api_key="sk-...the-issued-key...",
)
resp = client.chat.completions.create(
model="claude-sonnet",
messages=[{"role": "user", "content": "What's 2+2? One word."}],
extra_body={
"metadata": {"generation_name": "smoke-test"},
"user": "feature=test",
},
)
print(resp.choices[0].message.content)
The agent code knows nothing about Anthropic, OpenAI, or OpenRouter. It speaks OpenAI-compatible HTTP. When migrating from Sonnet to a cheaper model, the change is one line in config.yaml plus a proxy reload. A config-only swap is safe only when the replacement model preserves the contract the app depends on: context window, tool-use format, structured-output behavior, latency profile, per-token price.
Two attribution paths are available on the OSS image without crossing the Enterprise line. First, the user field on the chat-completions request, persisted on the spend-log row. Second, finer-grained app boundaries via more keys. Request-body metadata.tags for feature-grain spend breakdown is an Enterprise feature per the cost-tracking docs.
Client-side attribution only works as a trust signal when callers are trusted. It is unsafe the moment a virtual key reaches an untrusted caller, because anything in user or metadata lands verbatim in the spend ledger. The two safe sources are the metadata block on the virtual key itself (set at issuance) and server-side injection at a trusted hop in front of the proxy.
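When finance does ask where the bill went, the join happens in Postgres. A sketch of the grouping; the LiteLLM_SpendLogs column names used here ("user", model, spend, "startTime") are assumptions about this image version's schema, so inspect the table first and adjust.
docker compose exec -T db psql -U litellm -d litellm <<'SQL'
-- verify column names with: \d "LiteLLM_SpendLogs"
SELECT "user", model, count(*) AS requests, round(sum(spend)::numeric, 4) AS usd
FROM "LiteLLM_SpendLogs"
WHERE "startTime" > now() - interval '30 days'
GROUP BY 1, 2
ORDER BY usd DESC;
SQL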
Common errors
litellm.AuthenticationError: ... missing api_key — The proxy started before .env loaded, or the key reference in config.yaml is misspelled. Confirm os.environ/ANTHROPIC_API_KEY matches the variable name in .env exactly. The os.environ/ prefix is required.
Prisma Client could not connect to the database — Run docker compose ps to confirm db is healthy. Exec into the proxy container and run nc -zv db 5432. If the network looks intact, recreate without touching the volume: docker compose up -d --force-recreate. Reserve docker compose down -v for local sandboxes you are willing to wipe; the -v flag deletes pgdata and every issued key with it.
BadRequestError: model 'claude-3.5-sonnet' not found — The slug does not match any model_name in config.yaml. model_name is the alias your apps use; litellm_params.model is the upstream slug.
Things nobody talks about
The proxy host becomes a single point of failure for every agent in the fleet. When a routine apt upgrade reboots that host, every agent fails. I run two replicas behind a load balancer with shared Redis state and Postgres on a managed instance.
Spend tracking is eventually consistent. The budget-exceeded path is an HTTP 400 with error.type: budget_exceeded, not a 429. Wiring client retry to the wrong one produces silent breakage. Under a single-replica deployment with low double-digit RPS and no Redis transaction buffering, the largest overrun I saw on a $25/30d key was on the order of a few dollars. That is one setup, not a benchmark. A multi-replica deployment without Redis buffering aggregates spend per-replica with periodic flushes, so concurrency bursts can overshoot further. The provider's bill is the only authoritative number.
store_model_in_db: true creates a split-brain problem. After three months of UI tweaks I once had a config that nobody could reconstruct from the file. Pick one source of truth. For config-as-code, set this to false.
Vendor lock-in: migrating off LiteLLM is medium-painful. The OpenAI-compatible request format is portable to any other gateway (OpenRouter, Vercel AI Gateway, Cloudflare AI Gateway). What is not portable is the virtual-key store, the spend ledger, and the routing config. A clean migration means re-issuing every app's key, replicating budgets, and translating routing rules.
When this approach wins, when it does not
The decision turns on four variables, not on agent count alone. First, duplicated retry and timeout code: the proxy centralizes it. Second, key rotation burden: virtual keys collapse a coordinated rotation to a single revoke. Third, spend attribution: the tagged spend ledger is the cheapest place to get per-app breakdowns. Fourth, provider failover: cross-provider fallback is straightforward in router_settings.fallbacks.
The heuristic I use is three-or-more agents against two-or-more providers, because that is the point where the four variables usually all cross threshold at once. A two-agent fleet against a single provider can still justify the proxy if the spend-attribution requirement is sharp or if a leaked key would force a fleet-wide rotation tonight.
It loses on the inverse shape. One agent against one provider with no second consumer on the horizon: run the SDK directly. The proxy adds a hop, a Postgres dependency, an upgrade discipline, and a 3 a.m. failure domain that did not exist before, in exchange for routing flexibility nobody is using. The compliance boundary matters in the same way: if every workload has to stay on a single named processor with a signed DPA and no aggregator allowed in the data path, the fallback story is off the table.
The proxy concentrates blast radius by design. When healthy, a routing change ships to every agent at once. When unhealthy, every agent 502s at once. Operators willing to run two replicas behind a load balancer with shared Redis state and managed Postgres get the upside. Operators who run a single VM and hope get a single point of failure for the whole fleet.
I run LiteLLM in production on the multi-agent, multi-provider, attribution-required shape. I run direct SDK calls on the one-off shape. The decision is the variables above, not a default.