How to Set Up LiteLLM Proxy for Production AI Agents (2026)

Michael Isaac

I run a LiteLLM proxy in front of every agent fleet I ship. This page is the single-host Compose baseline I lay down before any production hardening, not the hardened topology itself. Once more than two agents call more than one provider, the alternative is duplicated retry policy across each app's SDK client, key rotation drift the day a provider key leaks, and no per-app spend-log join key when finance asks where the bill went. The baseline covers Postgres-backed virtual keys, budgets, retries, and a named fallback. HA, managed Postgres, Redis cache, and an auth boundary in front of the admin surface are called out inline as follow-ups.

Prerequisites

  • Docker 24+ and Docker Compose v2
  • Postgres 14+ for virtual keys, spend tracking, and the admin UI
  • At least one provider API key. This walkthrough uses Anthropic and OpenAI, plus an OpenRouter key as a fallback target
  • A master key for the proxy admin surface. The LiteLLM virtual-keys flow at https://docs.litellm.ai/docs/proxy/virtual_keys (fetched 2026-04-25) treats admin keys prefixed with sk- as the convention. Generate with openssl rand -hex 32 and prepend sk-
  • Optional: Redis 7+ for shared rate-limit and cache state across replicas

The Compose file in Step 3 uses a mutable tag (ghcr.io/berriai/litellm-database:v1.55.10) so the page reads as a copy-paste smoke-test baseline. That tag is fine for a local laptop and not fine for anything a paying customer touches: a routine container restart on a mutable tag is how a working proxy turns into a 3 a.m. CrashLoopBackOff after a Prisma schema change landed overnight. Before promoting beyond a smoke test, resolve the tag to an immutable digest on deploy day with docker pull ghcr.io/berriai/litellm-database:v1.55.10 followed by docker inspect --format='{{index .RepoDigests 0}}' ghcr.io/berriai/litellm-database:v1.55.10, then replace the tag in image: with the resulting @sha256: reference.

Step 1, Lay out the project

litellm-proxy/
  docker-compose.yml
  config.yaml
  .env
  .gitignore

.gitignore:

.env

.env (do not commit):

LITELLM_MASTER_KEY=sk-litellm-...replace-with-openssl-rand-hex-32...
LITELLM_SALT_KEY=...replace-with-another-openssl-rand-hex-32...
POSTGRES_PASSWORD=...replace-with-openssl-rand-hex-32...
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
OPENROUTER_API_KEY=sk-or-v1-...
DATABASE_URL=postgresql://litellm:${POSTGRES_PASSWORD}@db:5432/litellm

Two requirements break the proxy in different ways. The master-key prefix uses sk- per the documented convention. The Postgres password in DATABASE_URL has to match POSTGRES_PASSWORD, because Compose initializes the db service with the latter and the proxy authenticates against the former. The throwaway litellm:litellm credential pair from quickstarts is a smoke-test value, not a deployment default.

The salt key, per the LiteLLM production guide (fetched 2026-04-25), encrypts DB-stored credentials, the upstream provider keys written to Postgres when store_model_in_db is on, and secrets entered through the admin UI. It is not the encryption key for the virtual-key strings the proxy issues; those are stored as hashed verification tokens and validated by hash comparison. On the baseline in this page (store_model_in_db: false, provider keys resolved from os.environ/...), the practical blast radius of a lost salt is limited. Once store_model_in_db is flipped on, or once provider credentials are entered through the UI, the salt becomes load-bearing and a fresh value silently breaks every credential it previously protected. Store the salt in a secrets manager the day it is generated.
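Both secrets from the prerequisites can be generated without reaching for openssl. A stdlib-only Python sketch equivalent to openssl rand -hex 32 (the function names are mine; only the sk- admin prefix comes from the documented convention):

```python
import secrets

def generate_master_key() -> str:
    # 32 random bytes, hex-encoded, with the documented sk- admin prefix
    return "sk-" + secrets.token_hex(32)

def generate_salt_key() -> str:
    # The salt has no prefix convention; only the proxy ever reads it
    return secrets.token_hex(32)

print(generate_master_key())  # sk- followed by 64 hex characters
print(generate_salt_key())    # 64 hex characters
```

Generate both once, write them straight into the secrets manager, and paste into .env from there rather than the other way around.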

Backing up pgdata before any migration

The pgdata volume holds every issued virtual key, every spend record, and any model state stored when store_model_in_db is on. Losing it loses all three.

DUMP_FILE="litellm-$(date -u +%Y%m%dT%H%M%SZ).dump"
docker compose exec -T db pg_dump -U litellm -Fc litellm > "$DUMP_FILE"
test -s "$DUMP_FILE" || { echo "dump is empty: $DUMP_FILE" >&2; exit 1; }
docker compose exec -T db pg_restore -l < "$DUMP_FILE" \
  | grep -E 'LiteLLM_(VerificationToken|SpendLogs)'

Restore is pg_restore --clean --if-exists -U litellm -d litellm against the same image version the dump was taken on, with the proxy stopped. On managed Postgres, the equivalent is whatever point-in-time recovery the provider offers. Run a periodic restore drill against a throwaway database; a backup that has never been restored is a hope, not a backup.

Step 2, Write the proxy config

config.yaml:

model_list:
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-5-20250929
      api_key: os.environ/ANTHROPIC_API_KEY
      # rpm/tpm: derive from your provider account, see below

  - model_name: claude-sonnet-fallback
    litellm_params:
      model: openrouter/anthropic/claude-sonnet-4.5
      api_key: os.environ/OPENROUTER_API_KEY

  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

router_settings:
  routing_strategy: simple-shuffle
  num_retries: 2
  timeout: 30
  fallbacks:
    - claude-sonnet: ["claude-sonnet-fallback"]

litellm_settings:
  drop_params: false
  set_verbose: false
  json_logs: true
  disable_error_logs: true

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL
  store_model_in_db: false
  ui_access_mode: admin_only
  database_connection_pool_limit: 20

Why the Sonnet 4.5 dated slug

The dated slug anthropic/claude-sonnet-4-5-20250929 is what this page is verified against. A tutorial that pins an undated alias rots silently the moment Anthropic ships the next build, because the context-window arithmetic and fallback compatibility are tied to a specific documented input window. As of 2026-04-25 the canonical Anthropic format is anthropic/claude-sonnet-4-5-20250929 and OpenRouter exposes the same model as anthropic/claude-sonnet-4.5. A wrong slug returns BadRequestError: model not found.

Deriving rpm and tpm

Derive rpm and tpm from the provider account's limits page and the response headers returned for the exact key and tier the proxy authenticates as. Anthropic, per the rate-limit reference, enforces ITPM and OTPM as separate ceilings; LiteLLM's tpm field is a single number per route and cannot encode the split, so set it to the stricter observed bucket. OpenAI surfaces its ceiling on x-ratelimit-limit-tokens. Until that derivation happens, leave the fields commented out and let the upstream do the rate-limiting.
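The "stricter observed bucket" rule reduces to a one-liner once the headers are in hand. A sketch under two assumptions: the Anthropic header names below match what the rate-limit reference documents for your tier (verify against a real response from the exact key the proxy uses), and the example numbers are illustrative, not real account limits:

```python
def derive_tpm(headers: dict) -> int:
    """Collapse Anthropic's separate ITPM/OTPM ceilings into the single
    tpm number LiteLLM accepts per route: take the stricter bucket."""
    itpm = int(headers["anthropic-ratelimit-input-tokens-limit"])
    otpm = int(headers["anthropic-ratelimit-output-tokens-limit"])
    return min(itpm, otpm)

# Illustrative values only; read yours off a live response
print(derive_tpm({
    "anthropic-ratelimit-input-tokens-limit": "80000",
    "anthropic-ratelimit-output-tokens-limit": "16000",
}))  # 16000
```

For OpenAI routes the same function does not apply; x-ratelimit-limit-tokens is already a single bucket and can be copied into tpm directly.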

OpenRouter as a named fallback

The version above gives the OpenRouter route its own model_name (claude-sonnet-fallback) and references it in the fallbacks block, so callers asking for claude-sonnet hit Anthropic directly on the happy path and only see OpenRouter when the direct route returns a retryable error after num_retries is exhausted.

A failover to OpenRouter changes the data processor and adds a sub-processor: OpenRouter sees every prompt and response in plaintext. For PHI, PII under GDPR Art. 28, or anything covered by a customer DPA that names sub-processors, fallback through an aggregator without prior written approval is a compliance incident. I drop the fallback entry on regulated paths.

Why context_window_fallbacks is intentionally absent

claude-sonnet-4-5-20250929 documents a 200K-token input window; gpt-4o documents 128K. A context-window fallback has to route to a model with a verified larger or compatible context window. Routing to a smaller window means a request that overflows Sonnet also overflows the fallback and fails the same way, which is decorative rather than useful. Either remove the fallback or add a target whose documented context window is verified larger, proven with a per-route conformance test.
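The per-route conformance test that proves the window ordering is a few lines. The window numbers below are hand-maintained assumptions taken from the providers' documentation, and the route names mirror this page's config; the check exists precisely so a wrong assumption fails loudly in CI instead of at request time:

```python
# Documented input windows (tokens), maintained by hand alongside config.yaml.
CONTEXT_WINDOWS = {
    "claude-sonnet": 200_000,           # anthropic/claude-sonnet-4-5-20250929
    "claude-sonnet-fallback": 200_000,  # same model via OpenRouter
    "gpt-4o": 128_000,
}

FALLBACKS = {"claude-sonnet": ["claude-sonnet-fallback"]}

def check_context_fallbacks(windows: dict, fallbacks: dict) -> list:
    """Return a violation per fallback target whose documented window is
    smaller than its primary's; an overflow would fail both routes."""
    violations = []
    for primary, targets in fallbacks.items():
        for target in targets:
            if windows[target] < windows[primary]:
                violations.append(
                    f"{target} ({windows[target]}) < {primary} ({windows[primary]})"
                )
    return violations

assert check_context_fallbacks(CONTEXT_WINDOWS, FALLBACKS) == []
```

Run it in the same CI job that validates config.yaml, and add an entry to CONTEXT_WINDOWS whenever a route is added.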

Production mode and structured logs

The defaults that ship with the proxy image are tuned for development: verbose payload logging, plaintext log lines, and an error path that emits request and response bodies. Step 3's Compose file sets:

environment:
  LITELLM_MODE: "PRODUCTION"
  LITELLM_LOG: "ERROR"

json_logs: true makes every line a structured JSON object. disable_error_logs: true strips the proxy's own error-log lines that embed request bodies; the error itself still propagates as an HTTP response and as a single ERROR log record. Ship the proxy's stdout to a log pipeline that applies redaction at ingest, restrict read access to LiteLLM_SpendLogs, and treat the proxy host's local log files as transient buffers.

drop_params: false by default

The flag silently strips request parameters the target provider does not accept. With it on, a single agent using a non-portable parameter starts succeeding with different semantics than the calling code asked for. Leave drop_params: false so the proxy returns a 400 instead of a silent semantic shift. Allowlisting a single parameter on a specific route requires a passing route-level conformance test first.

Parameter class, and why dropping it is unsafe:

  • seed: invalidates regression-test fixtures
  • temperature, top_p, top_k: returns measurably different prose than the calibrated baseline
  • tools, tool_choice: returns a plain completion instead of structured tool_calls; the dispatcher silently no-ops
  • response_format: returns free-form prose where the caller expected schema-conformant JSON
  • logprobs: silently zeros the signal scorers depend on
  • user, metadata, custom headers: breaks attribution, billing reconciliation, routing policy

database_connection_pool_limit is sized against actual concurrency

Per the production guide, this limit applies per worker process. The formula is database_connection_pool_limit <= max_connections / (instances * workers), with headroom for the admin UI, migration jobs, and psql sessions. Postgres 16 ships max_connections=100, so a single Compose instance with one worker can take a 20-connection pool. Sizing wrong cascades: too-large pools exhaust Postgres connections under load, every replica refuses requests at the database layer, and the symptom looks like "the proxy died."
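The arithmetic is worth automating so a topology change forces a recalculation. A sketch of the formula above; the 10-connection headroom is my own convention for the admin UI, migrations, and psql sessions, not a number from the guide:

```python
def pool_limit(max_connections: int, instances: int, workers: int,
               headroom: int = 10) -> int:
    """Largest per-worker pool that keeps total demand under Postgres's
    max_connections, after reserving headroom for non-proxy sessions."""
    usable = max_connections - headroom
    limit = usable // (instances * workers)
    if limit < 1:
        raise ValueError("no connections left after headroom; scale Postgres")
    return limit

# Postgres 16 default max_connections=100, one Compose instance, one worker:
# the ceiling is 90, so the config's conservative 20 sits well under it.
print(pool_limit(100, instances=1, workers=1))  # 90
```

Rerun it before adding a replica or raising worker count; three instances with four workers each drop the per-worker ceiling to 7 against the same Postgres.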

store_model_in_db: false

The DB-backed model store lets the model list be edited from the admin UI without a redeploy, which is a split-brain trap: the Postgres copy and the config.yaml copy diverge silently the first time someone clicks through the UI. config.yaml is the only canonical source, lives in version control, and the proxy reloads on commit.

ui_access_mode: admin_only

The admin UI ships with broader access modes that surface key generation and spend data to non-admin viewers. Locking it to admin-only is the conservative default. The UI itself still belongs behind a separate auth boundary.

Step 3, Compose it together

docker-compose.yml:

services:
  litellm:
    image: ghcr.io/berriai/litellm-database:v1.55.10
    restart: unless-stopped
    ports:
      - "127.0.0.1:4000:4000"
    env_file: .env
    environment:
      DISABLE_SCHEMA_UPDATE: "false"
      LITELLM_MODE: "PRODUCTION"
      LITELLM_LOG: "ERROR"
    volumes:
      - ./config.yaml:/app/config.yaml:ro
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    depends_on:
      db:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4000/health/liveliness"]
      interval: 30s
      timeout: 5s
      retries: 3

  db:
    image: postgres:16
    restart: unless-stopped
    environment:
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: litellm
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "litellm"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  pgdata:

The litellm-database image variant ships with the Prisma client, Postgres client, and curl baked in, which the slim runtime image does not. Substituting a stripped image leaves the container marked unhealthy even when the proxy is serving traffic.

The port binding is 127.0.0.1:4000:4000, not 0.0.0.0:4000:4000. A real deployment puts the proxy behind a separate layer with five properties: TLS termination, an auth boundary in front of master-key-protected admin routes, a private listener, source-CIDR allowlisting, and access logging.

The healthcheck uses /health/liveliness because liveliness is the right semantics for a Docker restart probe. For load-balancer gates and Kubernetes readinessProbe the right endpoint is /health/readiness, which additionally validates the DB connection.

Run schema migrations from exactly one process. On v1.55.10 the proxy honors DISABLE_SCHEMA_UPDATE=true. The migration owner replica boots with that variable unset and runs Prisma migrations; every other replica boots with it set to true. A dedicated one-shot job is cleaner than picking a long-running replica as owner. Without this discipline, a rolling deploy across N replicas produces concurrent migration contention.

Bring it up:

docker compose up -d
docker compose logs -f litellm

Watch for the line saying the database schema has been migrated. Prisma errors at this stage almost always mean DATABASE_URL resolved to localhost instead of the db service.

Step 4, Issue your first virtual key

Each app gets its own key per environment. That is the unit of isolation: per-app max_budget, per-app models allowlist, per-app rpm_limit, and per-app revocation that does not touch any other app or any upstream provider key.

umask 077
curl -sS -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["claude-sonnet"],
    "duration": "30d",
    "max_budget": 25.00,
    "budget_duration": "30d",
    "rpm_limit": 60,
    "metadata": {"app": "agent-research", "owner": "michael"}
  }' | jq '{key: .key, expires: .expires}'

The response includes a key starting with sk- and an expires timestamp the proxy computed from duration. Write the key directly into the secrets manager rather than to a file.

duration sets the key's lifetime; budget_duration is the rolling window over which max_budget accumulates and resets. A key with duration: "30d" and budget_duration: "7d" would refill its $25 cap weekly and stop authenticating after a month.

On the OSS image, no built-in scheduler reissues keys before they expire. The OSS pattern is overlapping manual issuance: on day 23 of a 30-day window, issue a second key with the same payload, write the new value into the secrets manager, and trigger a rolling restart on every agent. Once spend logs for the old key show zero requests for one full traffic cycle, retire it.
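The overlap arithmetic is easy to fumble under deadline pressure, so I compute the dates rather than count them. A sketch of the schedule for the 30-day window above (the function name and dict shape are mine; the day-23 issue point is the pattern described in this section):

```python
from datetime import date, timedelta

def rotation_schedule(issued: date, lifetime_days: int = 30,
                      overlap_days: int = 7) -> dict:
    """Dates for overlapping manual reissuance on the OSS image: issue the
    replacement overlap_days before the old key stops authenticating."""
    expires = issued + timedelta(days=lifetime_days)
    return {
        "issue_replacement": expires - timedelta(days=overlap_days),
        "old_key_expires": expires,
    }

s = rotation_schedule(date(2026, 4, 1))
print(s["issue_replacement"])  # 2026-04-24 (day 23 of the window)
print(s["old_key_expires"])    # 2026-05-01
```

Put the issue_replacement date in the team calendar the day the key is generated; discovering it from an agent's 401 is the failure mode this avoids.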

# Disable the old key after the overlap window. Reversible.
curl -sS -X POST http://localhost:4000/key/block \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"key": "sk-...the-old-key..."}'

/key/block preserves spend history and allows /key/unblock if the cutover is incomplete. /key/delete is permanent; use only after a blocked key has shown zero traffic for a full retention window.

The max_budget field is intended to make the proxy reject requests once tracked spend on the key crosses $25. On the v1.55.10 digest, the budget-exceeded path is an HTTP 400 with a JSON body whose error.type is budget_exceeded. 429 is reserved for rate-limit semantics. Treating a budget rejection as a 429 means the calling app retries a request that will keep failing. Spend aggregation is also eventually consistent. Treat the budget as a circuit breaker, not a hard guarantee.
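Client-side, the distinction maps directly onto retry policy. A hedged sketch of the dispatch (the error.type path matches what the v1.55.10 digest returned in my testing; verify the body shape against your pinned version before relying on it):

```python
import json

def should_retry(status_code: int, body: str) -> bool:
    """429 means rate limit: transient, back off and retry.
    400 with error.type budget_exceeded means the key is capped:
    retrying will keep failing until the budget window resets."""
    if status_code == 429:
        return True
    if status_code == 400:
        try:
            err = json.loads(body).get("error", {})
        except json.JSONDecodeError:
            return False
        if err.get("type") == "budget_exceeded":
            return False  # capped key: alert, do not retry
    return False

assert should_retry(429, "") is True
assert should_retry(400, '{"error": {"type": "budget_exceeded"}}') is False
```

Route the budget_exceeded branch to an alert rather than a retry queue; a capped key looping through retries is invisible spend-wise and noisy everywhere else.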

Smoke test:

curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-...the-issued-key..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet",
    "messages": [{"role": "user", "content": "Reply with the single word: ok"}]
  }'

Expected: a 200 response with a choices array containing an assistant message and a usage block with token counts populated. The exact assistant text is advisory; tightening to exact prose belongs in the conformance suite.

Step 5, Wire your agent code

# pip install "openai>=1.55,<2"
from openai import OpenAI

client = OpenAI(
    base_url="http://litellm-proxy:4000/v1",
    api_key="sk-...the-issued-key...",
)

resp = client.chat.completions.create(
    model="claude-sonnet",
    messages=[{"role": "user", "content": "What's 2+2? One word."}],
    extra_body={
        "metadata": {"generation_name": "smoke-test"},
        "user": "feature=test",
    },
)
print(resp.choices[0].message.content)

The agent code knows nothing about Anthropic, OpenAI, or OpenRouter. It speaks OpenAI-compatible HTTP. When migrating from Sonnet to a cheaper model, the change is one line in config.yaml plus a proxy reload. A config-only swap is safe only when the replacement model preserves the contract the app depends on: context window, tool-use format, structured-output behavior, latency profile, per-token price.

Two attribution paths are available on the OSS image without crossing the Enterprise line. First, the user field on the chat-completions request, persisted on the spend-log row. Second, finer-grained app boundaries via more keys. Request-body metadata.tags for feature-grain spend breakdown is an Enterprise feature per the cost-tracking docs.

Client-side attribution only works as a trust signal when callers are trusted. It is unsafe the moment a virtual key reaches an untrusted caller, because anything in user or metadata lands verbatim in the spend ledger. The two safe sources are the metadata block on the virtual key itself (set at issuance) and server-side injection at a trusted hop in front of the proxy.
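The server-side injection is one function at whatever trusted hop already fronts the proxy (reverse proxy, sidecar, internal gateway). A sketch of just the stamping step; the user field is the spend-log field discussed above, while the app=… convention and function name are mine:

```python
def inject_attribution(request_body: dict, app_id: str) -> dict:
    """Overwrite caller-supplied attribution at a trusted hop so an
    untrusted client cannot forge rows in the spend ledger."""
    body = dict(request_body)
    body["user"] = f"app={app_id}"  # persisted on the spend-log row
    body.pop("metadata", None)      # discard anything the caller sent
    return body

tampered = {"model": "claude-sonnet",
            "user": "app=someone-else",
            "metadata": {"team": "forged"}}
clean = inject_attribution(tampered, "agent-research")
print(clean["user"])        # app=agent-research
print("metadata" in clean)  # False
```

The hop derives app_id from its own authentication of the caller, never from the request body, which is the whole point.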

Common errors

litellm.AuthenticationError: ... missing api_key — The proxy started before .env loaded, or the key reference in config.yaml is misspelled. Confirm os.environ/ANTHROPIC_API_KEY matches the variable name in .env exactly. The os.environ/ prefix is required.

Prisma Client could not connect to the database — Run docker compose ps to confirm db is healthy. Exec into the proxy container and run nc -zv db 5432. If the network looks intact, recreate without touching the volume: docker compose up -d --force-recreate. Reserve docker compose down -v for local sandboxes you are willing to wipe; the -v flag deletes pgdata and every issued key with it.

BadRequestError: model 'claude-3.5-sonnet' not found — The slug does not match any model_name in config.yaml. model_name is the alias your apps use; litellm_params.model is the upstream slug.

Things nobody talks about

The proxy host becomes a single point of failure for every agent in the fleet. When a routine apt upgrade reboots that host, every agent fails. I run two replicas behind a load balancer with shared Redis state and Postgres on a managed instance.

Spend tracking is eventually consistent. The budget-exceeded path is an HTTP 400 with error.type: budget_exceeded, not a 429. Wiring client retry to the wrong one produces silent breakage. Under a single-replica deployment with low double-digit RPS and no Redis transaction buffering, the largest overrun I saw on a $25/30d key was on the order of a few dollars. That is one setup, not a benchmark. A multi-replica deployment without Redis buffering aggregates spend per-replica with periodic flushes, so concurrency bursts can overshoot further. The provider's bill is the only authoritative number.

store_model_in_db: true creates a split-brain problem. After three months of UI tweaks I once had a config that nobody could reconstruct from the file. Pick one source of truth. For config-as-code, set this to false.

Vendor lock-in: migrating off LiteLLM is medium-painful. The OpenAI-compatible request format is portable to any other gateway (OpenRouter, Vercel AI Gateway, Cloudflare AI Gateway). What is not portable is the virtual-key store, the spend ledger, and the routing config. A clean migration means re-issuing every app's key, replicating budgets, and translating routing rules.

When this approach wins, when it does not

The decision turns on four variables, not on agent count alone. First, duplicated retry and timeout code: the proxy centralizes it. Second, key rotation burden: virtual keys collapse a coordinated rotation to a single revoke. Third, spend attribution: the tagged spend ledger is the cheapest place to get per-app breakdowns. Fourth, provider failover: cross-provider fallback is straightforward in router_settings.fallbacks.

The heuristic I use is three-or-more agents against two-or-more providers, because that is the point where the four variables usually all cross threshold at once. A two-agent fleet against a single provider can still justify the proxy if the spend-attribution requirement is sharp or if a leaked key would force a fleet-wide rotation tonight.

It loses on the inverse shape. One agent against one provider with no second consumer on the horizon: run the SDK directly. The proxy adds a hop, a Postgres dependency, an upgrade discipline, and a 3 a.m. failure domain that did not exist before, in exchange for routing flexibility nobody is using. The compliance boundary matters in the same way: if every workload has to stay on a single named processor with a signed DPA and no aggregator allowed in the data path, the fallback story is off the table.

The proxy concentrates blast radius by design. When healthy, a routing change ships to every agent at once. When unhealthy, every agent 502s at once. Operators willing to run two replicas behind a load balancer with shared Redis state and managed Postgres get the upside. Operators who run a single VM and hope get a single point of failure for the whole fleet.

I run LiteLLM in production on the multi-agent, multi-provider, attribution-required shape. I run direct SDK calls on the one-off shape. The decision is the variables above, not a default.