
02 — Gateway and providers

How we reach LLM providers. Platform comparison and recommended two-phase architecture.

Two audiences, two gateways

The organisation has two distinct LLM access patterns with different requirements:

Employee chat
  Users: Staff — email drafting, ad-hoc text analysis, general productivity
  Auth model: Entra SSO (per-user identity, per-user audit)
  Interface: Browser UI (chat surface)
  What matters: Zero-config onboarding, content filtering, per-user controls
  Recommended: Azure AI Foundry

Application layer
  Users: Services — agentic workflows, tool-calling, automated decisions
  Auth model: API keys in secrets vault (service-to-service, per-workflow audit)
  Interface: HTTP API (programmatic, tool-use loops)
  What matters: Per-execution cost caps, tool dispatch, multi-model routing, spend tracking by workflow/review ID
  Recommended: LiteLLM → Anthropic direct

These don't overlap. Foundry's value propositions (Entra SSO, browser UI, per-user content filtering) are irrelevant to service-to-service agent calls. LiteLLM's value propositions (tool-call dispatch, spend tracking by tag, multi-provider routing) are irrelevant to a staff member asking Claude to reword an email.

The rest of this document covers the application layer only. Employee chat via Foundry is an IT/procurement decision, not an architecture one — deploy it alongside, not instead of, the application gateway.

Platform comparison

Direct Anthropic API

Deployment: SaaS; no infrastructure to operate
Claude support: Full — prompt caching, tool use, extended thinking, streaming, batches
Cost overhead: Zero markup; published per-token pricing
Operational burden: None. One API key, one base URL
Audit: Client-side only; you log what you send/receive
Multi-model: Anthropic models only

Best for: bootstrapping (Phases 1–2). Worst for: multi-provider routing, centralised cost control, local model access.

LiteLLM

Deployment: Self-hosted; single Docker container + Postgres for spend tracking
Claude support: Full pass-through — tool use, prompt caching, streaming, extended thinking all forwarded natively to the Anthropic API
Cost overhead: Zero per-call markup; you bring your own API keys. Infra cost: one container + one Postgres instance (can share the existing one)
Operational burden: Low. Stateless proxy; Postgres for spend logs. YAML config for model routing. Runs on a single VM or as a compose service
Audit: Built-in request/response logging to Postgres, or callbacks to Langfuse/OpenTelemetry. Captures full payloads, token counts, latency, cost
Multi-model: 100+ providers: Anthropic, OpenAI, Azure, Bedrock, Ollama, vLLM, local models

LiteLLM presents an OpenAI-compatible API surface. Application code sends OpenAI-format requests; LiteLLM translates to the target provider's native format. For Anthropic, it can also operate in pass-through mode — forward Anthropic-native requests unchanged.

Best for: unified API surface, multi-provider routing, spend tracking, local model access via Ollama/vLLM behind the same endpoint. This is the recommended production gateway.
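The translation step can be pictured as follows — a simplified sketch of what the proxy does per request, not LiteLLM's actual code. It relies on two facts about Anthropic's Messages API: the system prompt is a top-level field rather than an inline message, and `max_tokens` is required.

```python
def openai_to_anthropic(req: dict) -> dict:
    """Sketch of the OpenAI -> Anthropic request translation LiteLLM
    performs (simplified). System turns are lifted out of the message
    list into Anthropic's top-level `system` field, and the required
    `max_tokens` gets a default if the caller omitted it."""
    system = "\n".join(
        m["content"] for m in req["messages"] if m["role"] == "system"
    )
    out = {
        # strip the provider prefix used in LiteLLM's model_list
        "model": req["model"].removeprefix("anthropic/"),
        "max_tokens": req.get("max_tokens", 1024),
        "messages": [m for m in req["messages"] if m["role"] != "system"],
    }
    if system:
        out["system"] = system
    return out
```

Application code never sees this translation; it only ever speaks the OpenAI chat format.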

Azure AI Foundry

Deployment: SaaS (Azure-hosted); tied to Azure subscription
Claude support: Claude available via Models-as-a-Service (MaaS). Tool use and caching supported; feature parity lags direct API by weeks
Cost overhead: ~15–25% markup over direct Anthropic pricing
Operational burden: Medium. Azure RBAC + Entra integration — relevant since we already use Entra for staff auth
Audit: Azure Monitor / Log Analytics integration. Enterprise-grade but Azure-locked
Multi-model: Azure model catalog (OpenAI, Meta, Mistral, Claude via MaaS)

Right for the employee chat surface. Entra SSO gives every staff member browser-based access to Claude gated on their existing company identity — no API keys, no onboarding friction, per-user audit out of the box. The 15–25% markup is the price of zero-config access for non-technical users. Content filtering and per-user spend controls come standard.

Wrong for the application layer. Service-to-service calls don't benefit from Entra SSO. The markup compounds across high-volume agent workloads. Feature parity lags the direct API (tool use, prompt caching, extended thinking land on direct first). And the operational needs — per-execution cost caps, spend tracking by workflow ID, multi-provider routing to local models — aren't in Foundry's design surface.

If compliance requires that application-layer LLM calls also route through Azure infrastructure, LiteLLM can target Foundry's OpenAI-compatible endpoint as a backend — preserving all the application-layer tooling while satisfying the compliance gate.
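A sketch of what that could look like in the same config format — the `azure_ai/` prefix follows LiteLLM's provider naming for Azure AI (Foundry) serverless endpoints, and the deployment name and env var names are placeholders:

```yaml
model_list:
  - model_name: claude-sonnet            # same alias the services already use
    litellm_params:
      model: azure_ai/claude-sonnet      # placeholder Foundry deployment name
      api_base: os.environ/AZURE_AI_API_BASE
      api_key: os.environ/AZURE_AI_API_KEY
```

Services keep requesting `claude-sonnet`; only the backend behind the alias changes.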

Amazon Bedrock

Deployment: SaaS (AWS-hosted)
Claude support: Full — prompt caching ("context caching"), tool use, streaming. Cross-region inference for capacity
Cost overhead: ~15–25% markup. Provisioned Throughput option for predictable cost
Operational burden: Medium. IAM policies, CloudWatch billing alarms, VPC endpoints for private access
Audit: CloudWatch logging built in
Multi-model: AWS model garden (Anthropic, Meta, Mistral, Cohere, Stability)

Not compelling unless the production stack moves to AWS. The IAM overhead and markup are not justified at our scale.

Phase 1 — Direct Anthropic API (bootstrap)

  BFF (Python)  ──httpx──▶  api.anthropic.com
  rev-sci (Go)  ──net/http──▶  api.anthropic.com

Each service has its own thin wrapper (see 03-client-wrappers.md). Env-gated on ANTHROPIC_API_KEY; no-op stub if unset. Covers Layers A + D (Haiku, no tools) and Layer B (Sonnet, sandbox-only tools).

Ship this. It works tomorrow.
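The wrapper shape, sketched for the Python side — function name, stub return value, and defaults are illustrative; the real wrappers live in 03-client-wrappers.md:

```python
import os


def complete(prompt: str,
             model: str = "claude-haiku-4-5-20251001",
             max_tokens: int = 512) -> str:
    """Env-gated Claude call: no-op stub when ANTHROPIC_API_KEY is unset."""
    api_key = os.environ.get("ANTHROPIC_API_KEY")
    if not api_key:
        return ""  # stub: the feature stays dark without a key
    import httpx  # deferred so the stub path has no hard dependency
    resp = httpx.post(
        "https://api.anthropic.com/v1/messages",
        headers={"x-api-key": api_key, "anthropic-version": "2023-06-01"},
        json={
            "model": model,
            "max_tokens": max_tokens,
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60.0,
    )
    resp.raise_for_status()
    # concatenate the text blocks from the Messages API response
    return "".join(b["text"] for b in resp.json()["content"]
                   if b["type"] == "text")
```

The Go wrapper in rev-sci follows the same pattern over net/http.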

Phase 2 — LiteLLM gateway (production)

  BFF (Python)  ──httpx──┐
  rev-sci (Go)  ──net/http──┤
  future svc    ──────────┤
                     litellm-proxy  ─┬─▶  api.anthropic.com (Claude)
                       (compose)    ├─▶  ollama (local models)
                          │         ├─▶  vllm (self-hosted)
                          │         └─▶  azure/bedrock (optional)
                     litellm-db (Postgres)
                       spend tracking
                       request logs
                       audit trail

The migration from Phase 1 to Phase 2 is a one-line change per service: swap the base URL from api.anthropic.com to litellm:4000. The client wrapper abstracts this.
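Assuming the wrappers read their base URL from an environment variable (name illustrative), the per-service diff is:

```shell
# Phase 1 — direct
ANTHROPIC_BASE_URL=https://api.anthropic.com
# Phase 2 — via the gateway (compose service name + port)
ANTHROPIC_BASE_URL=http://litellm:4000
```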

LiteLLM config

model_list:
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: claude-haiku
    litellm_params:
      model: anthropic/claude-haiku-4-5-20251001
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: local-small
    litellm_params:
      model: ollama/llama3.2
      api_base: http://ollama:11434

  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY_FALLBACK
    model_info:
      priority: 2  # fallback if primary key is rate-limited

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/LITELLM_DATABASE_URL
  max_budget: 50.0        # $50/day hard cap
  budget_duration: 24h

litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: shared-redis
    port: 6379
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]
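Spend tracking by workflow/review ID hangs off request metadata. A sketch of the payload shape, assuming LiteLLM's tag-based spend tracking (`metadata.tags`); the `workflow:` tag format is our own convention:

```python
def gateway_payload(model: str, prompt: str, workflow_id: str) -> dict:
    """OpenAI-format request body for the LiteLLM proxy, tagged so spend
    aggregates per workflow/review ID in the Postgres spend logs."""
    return {
        "model": model,  # an alias from model_list, e.g. "claude-haiku"
        "messages": [{"role": "user", "content": prompt}],
        "metadata": {"tags": [f"workflow:{workflow_id}"]},
    }
```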

Compose service

litellm:
  image: ghcr.io/berriai/litellm:main-latest
  ports:
    - "4000:4000"
  volumes:
    - ./litellm_config.yaml:/app/config.yaml
  environment:
    - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
    - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
    - LITELLM_DATABASE_URL=postgresql://litellm:litellm@litellm-db:5432/litellm
  command: ["--config", "/app/config.yaml"]
  depends_on:
    - litellm-db
    - shared-redis
  networks:
    - backoffice

litellm-db:
  image: postgres:16-alpine
  environment:
    POSTGRES_USER: litellm
    POSTGRES_PASSWORD: litellm
    POSTGRES_DB: litellm
  volumes:
    - litellm-pgdata:/var/lib/postgresql/data
  networks:
    - backoffice

Open questions

  1. LiteLLM from day one? Spend tracking and unified logging are immediately useful. Is the extra container justified before multi-provider routing is needed?
  2. Foundry for employee chat — who owns it? IT procurement or engineering? The application layer doesn't depend on it, but the deployment timeline may affect how staff perceive "we have AI."
  3. Compliance routing. Does compliance require that application-layer LLM calls also go through Azure infrastructure? If yes, LiteLLM routes to Foundry's endpoint as a backend — architecture doesn't change, but we need the Foundry deployment stood up first.
  4. Local model use cases. What justifies the GPU cost for Ollama/vLLM? Candidate: high-volume intake parsing if Haiku costs become material at scale.