02 — Gateway and providers¶
How we reach LLM providers. Platform comparison and recommended two-phase architecture.
Two audiences, two gateways¶
The organisation has two distinct LLM access patterns with different requirements:
| | Employee chat | Application layer |
|---|---|---|
| Users | Staff — email drafting, ad-hoc text analysis, general productivity | Services — agentic workflows, tool-calling, automated decisions |
| Auth model | Entra SSO (per-user identity, per-user audit) | API keys in secrets vault (service-to-service, per-workflow audit) |
| Interface | Browser UI (chat surface) | HTTP API (programmatic, tool-use loops) |
| What matters | Zero-config onboarding, content filtering, per-user controls | Per-execution cost caps, tool dispatch, multi-model routing, spend tracking by workflow/review ID |
| Recommended | Azure AI Foundry | LiteLLM → Anthropic direct |
These don't overlap. Foundry's value propositions (Entra SSO, browser UI, per-user content filtering) are irrelevant to service-to-service agent calls. LiteLLM's value propositions (tool-call dispatch, spend tracking by tag, multi-provider routing) are irrelevant to a staff member asking Claude to reword an email.
The rest of this document covers the application layer only. Employee chat via Foundry is an IT/procurement decision, not an architecture one — deploy it alongside, not instead of, the application gateway.
Platform comparison¶
Direct Anthropic API¶
| | Direct Anthropic API |
|---|---|
| Deployment | SaaS; no infrastructure to operate |
| Claude support | Full — prompt caching, tool use, extended thinking, streaming, batches |
| Cost overhead | Zero markup; published per-token pricing |
| Operational burden | None. One API key, one base URL |
| Audit | Client-side only; you log what you send/receive |
| Multi-model | Anthropic models only |
Best for: bootstrapping (Phases 1–2). Worst for: multi-provider routing, centralised cost control, local model access.
LiteLLM¶
| | LiteLLM |
|---|---|
| Deployment | Self-hosted; single Docker container + Postgres for spend tracking |
| Claude support | Full pass-through — tool use, prompt caching, streaming, extended thinking all forwarded natively to the Anthropic API |
| Cost overhead | Zero per-call markup; you bring your own API keys. Infra cost: one container + one Postgres instance (shared with existing) |
| Operational burden | Low. Stateless proxy; Postgres for spend logs. YAML config for model routing. Runs on a single VM or as a compose service |
| Audit | Built-in request/response logging to Postgres, or callbacks to Langfuse/OpenTelemetry. Captures full payloads, token counts, latency, cost |
| Multi-model | 100+ providers: Anthropic, OpenAI, Azure, Bedrock, Ollama, vLLM, local models |
LiteLLM presents an OpenAI-compatible API surface. Application code sends OpenAI-format requests; LiteLLM translates to the target provider's native format. For Anthropic, it can also operate in pass-through mode — forward Anthropic-native requests unchanged.
Best for: unified API surface, multi-provider routing, spend tracking, local model access via Ollama/vLLM behind the same endpoint. This is the recommended production gateway.
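Concretely, "OpenAI-compatible surface" means every service builds the same OpenAI-format chat payload and only the `model` alias changes; LiteLLM handles per-provider translation. A minimal sketch — the `chat_request` helper is illustrative, not from our codebase, and the model aliases assume the LiteLLM config later in this doc:

```python
# Sketch: one OpenAI-format request shape for every backend behind LiteLLM.
# The proxy translates this payload to the target provider's native format.

def chat_request(model_name: str, user_text: str) -> dict:
    """Build an OpenAI chat-completions payload for the LiteLLM proxy."""
    return {
        "model": model_name,  # LiteLLM model_name alias, e.g. "claude-sonnet"
        "messages": [{"role": "user", "content": user_text}],
    }

# Same shape whether the proxy routes to Anthropic or a local Ollama model:
anthropic_req = chat_request("claude-sonnet", "Draft a reply.")
local_req = chat_request("local-small", "Draft a reply.")
```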
Azure AI Foundry¶
| | Azure AI Foundry |
|---|---|
| Deployment | SaaS (Azure-hosted); tied to Azure subscription |
| Claude support | Claude available via Models-as-a-Service (MaaS). Tool use and caching supported; feature parity lags direct API by weeks |
| Cost overhead | ~15–25% markup over direct Anthropic pricing |
| Operational burden | Medium. Azure RBAC + Entra integration — relevant since we already use Entra for staff auth |
| Audit | Azure Monitor / Log Analytics integration. Enterprise-grade but Azure-locked |
| Multi-model | Azure model catalog (OpenAI, Meta, Mistral, Claude via MaaS) |
Right for the employee chat surface. Entra SSO gives every staff member browser-based access to Claude gated on their existing company identity — no API keys, no onboarding friction, per-user audit out of the box. The 15–25% markup is the price of zero-config access for non-technical users. Content filtering and per-user spend controls come standard.
Wrong for the application layer. Service-to-service calls don't benefit from Entra SSO. The markup compounds across high-volume agent workloads. Feature parity lags the direct API (tool use, prompt caching, extended thinking land on direct first). And the operational needs — per-execution cost caps, spend tracking by workflow ID, multi-provider routing to local models — aren't in Foundry's design surface.
If compliance requires that application-layer LLM calls also route through Azure infrastructure, LiteLLM can target Foundry's OpenAI-compatible endpoint as a backend — preserving all the application-layer tooling while satisfying the compliance gate.
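A sketch of what that backend swap looks like in LiteLLM config — the deployment name is a placeholder, and the `azure/` prefix plus `api_base`/`api_version` parameters follow LiteLLM's standard Azure provider convention (verify against the Foundry deployment once it exists):

```yaml
model_list:
  - model_name: claude-sonnet               # same alias; callers don't change
    litellm_params:
      model: azure/<foundry-deployment>     # placeholder deployment name
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: os.environ/AZURE_API_VERSION
```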
Amazon Bedrock¶
| | Amazon Bedrock |
|---|---|
| Deployment | SaaS (AWS-hosted) |
| Claude support | Full — prompt caching ("context caching"), tool use, streaming. Cross-region inference for capacity |
| Cost overhead | ~15–25% markup. Provisioned Throughput option for predictable cost |
| Operational burden | Medium. IAM policies, CloudWatch billing alarms, VPC endpoints for private access |
| Audit | CloudWatch logging built in |
| Multi-model | AWS model garden (Anthropic, Meta, Mistral, Cohere, Stability) |
Not compelling unless the production stack moves to AWS. The IAM overhead and markup are not justified at our scale.
Recommended architecture¶
Phase 1 — Direct Anthropic API (bootstrap)¶
Each service has its own thin wrapper (see 03-client-wrappers.md), env-gated on ANTHROPIC_API_KEY with a no-op stub if unset. Covers Layers A + D (Haiku, no tools) and Layer B (Sonnet, sandbox-only tools).

Ship this. It works tomorrow.
Phase 2 — LiteLLM gateway (production)¶
```
BFF (Python)  ──httpx────┐
rev-sci (Go)  ──net/http─┤
future svc    ───────────┤
                         ▼
                litellm-proxy ─┬─▶ api.anthropic.com (Claude)
                  (compose)    ├─▶ ollama (local models)
                      │        ├─▶ vllm (self-hosted)
                      │        └─▶ azure/bedrock (optional)
                      ▼
                litellm-db (Postgres)
                  spend tracking
                  request logs
                  audit trail
```
The migration from Phase 1 to Phase 2 is a one-line change per service: swap the base URL from api.anthropic.com to litellm:4000. The client wrapper abstracts this.
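For instance, the wrapper can read its base URL from an env var, making the migration a deploy-time config change rather than a code change. `LLM_BASE_URL` is an assumed variable name, not from the source:

```python
# Sketch: env-driven base URL. Phase 1 uses the direct Anthropic default;
# Phase 2 sets LLM_BASE_URL=http://litellm:4000 — no code change per service.
import os


def llm_base_url() -> str:
    return os.environ.get("LLM_BASE_URL", "https://api.anthropic.com")
```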
LiteLLM config¶
```yaml
model_list:
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-haiku
    litellm_params:
      model: anthropic/claude-haiku-4-5-20251001
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: local-small
    litellm_params:
      model: ollama/llama3.2
      api_base: http://ollama:11434
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY_FALLBACK
    model_info:
      priority: 2  # fallback if primary key is rate-limited

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/LITELLM_DATABASE_URL
  max_budget: 50.0  # $50/day hard cap
  budget_duration: 24h

litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: shared-redis
    port: 6379
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]
```
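For the "spend tracking by workflow ID" requirement, LiteLLM accepts request metadata tags and attributes logged spend to them. A sketch — the `workflow:` tag scheme and the helper are assumptions, not from the source:

```python
# Sketch: tagging each request with a workflow ID so LiteLLM's spend logs
# can be grouped per workflow/review rather than per API key.

def tagged_request(model_name: str, user_text: str, workflow_id: str) -> dict:
    """OpenAI-format payload with LiteLLM metadata tags for spend tracking."""
    return {
        "model": model_name,
        "messages": [{"role": "user", "content": user_text}],
        "metadata": {"tags": [f"workflow:{workflow_id}"]},
    }

req = tagged_request("claude-haiku", "Parse this intake form.", "intake-1234")
```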
Compose service¶
```yaml
litellm:
  image: ghcr.io/berriai/litellm:main-latest
  ports:
    - "4000:4000"
  volumes:
    - ./litellm_config.yaml:/app/config.yaml
  environment:
    - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
    - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
    - LITELLM_DATABASE_URL=postgresql://litellm:litellm@litellm-db:5432/litellm
  command: ["--config", "/app/config.yaml"]
  depends_on:
    - litellm-db
    - shared-redis
  networks:
    - backoffice

litellm-db:
  image: postgres:16-alpine
  environment:
    POSTGRES_USER: litellm
    POSTGRES_PASSWORD: litellm
    POSTGRES_DB: litellm
  volumes:
    - litellm-pgdata:/var/lib/postgresql/data
  networks:
    - backoffice
```
Open questions¶
- LiteLLM from day one? Spend tracking and unified logging are immediately useful. Is the extra container justified before multi-provider routing is needed?
- Foundry for employee chat — who owns it? IT procurement or engineering? The application layer doesn't depend on it, but the deployment timeline may affect how staff perceive "we have AI."
- Compliance routing. Does compliance require that application-layer LLM calls also go through Azure infrastructure? If yes, LiteLLM routes to Foundry's endpoint as a backend — architecture doesn't change, but we need the Foundry deployment stood up first.
- Local model use cases. What justifies the GPU cost for Ollama/vLLM? Candidate: high-volume intake parsing if Haiku costs become material at scale.