02 — Gateway and providers¶
How we reach LLM providers. Platform comparison and recommended two-phase architecture.
Two audiences, two gateways¶
The organisation has two distinct LLM access patterns with different requirements:
| | Employee chat | Application layer |
|---|---|---|
| Users | Staff — email drafting, ad-hoc text analysis, general productivity | Services — agentic workflows, tool-calling, automated decisions |
| Auth model | Entra SSO (per-user identity, per-user audit) | API keys in secrets vault (service-to-service, per-workflow audit) |
| Interface | Browser UI (chat surface) | HTTP API (programmatic, tool-use loops) |
| What matters | Zero-config onboarding, content filtering, per-user controls | Per-execution cost caps, tool dispatch, multi-model routing, spend tracking by workflow/review ID |
| Recommended | Azure AI Foundry | LiteLLM → Anthropic direct |
These don't overlap. Foundry's value propositions (Entra SSO, browser UI, per-user content filtering) are irrelevant to service-to-service agent calls. LiteLLM's value propositions (tool-call dispatch, spend tracking by tag, multi-provider routing) are irrelevant to a staff member asking Claude to reword an email.
The rest of this document covers the application layer only. Employee chat via Foundry is an IT/procurement decision, not an architecture one — deploy it alongside, not instead of, the application gateway.
Platform comparison¶
Direct Anthropic API¶
| | Direct Anthropic API |
|---|---|
| Deployment | SaaS; no infrastructure to operate |
| Claude support | Full — prompt caching, tool use, extended thinking, streaming, batches |
| Cost overhead | Zero markup; published per-token pricing |
| Operational burden | None. One API key, one base URL |
| Audit | Client-side only; you log what you send/receive |
| Multi-model | Anthropic models only |
Best for: bootstrapping (Phases 1–2). Worst for: multi-provider routing, centralised cost control, local model access.
LiteLLM¶
| | LiteLLM |
|---|---|
| Deployment | Self-hosted; single Docker container + Postgres for spend tracking |
| Claude support | Full pass-through — tool use, prompt caching, streaming, extended thinking all forwarded natively to the Anthropic API |
| Cost overhead | Zero per-call markup; you bring your own API keys. Infra cost: one container + one Postgres instance (shared with existing) |
| Operational burden | Low. Stateless proxy; Postgres for spend logs. YAML config for model routing. Runs on a single VM or as a compose service |
| Audit | Built-in request/response logging to Postgres, or callbacks to Langfuse/OpenTelemetry. Captures full payloads, token counts, latency, cost |
| Multi-model | 100+ providers: Anthropic, OpenAI, Azure, Bedrock, Ollama, vLLM, local models |
LiteLLM presents an OpenAI-compatible API surface. Application code sends OpenAI-format requests; LiteLLM translates to the target provider's native format. For Anthropic, it can also operate in pass-through mode — forward Anthropic-native requests unchanged.
Best for: unified API surface, multi-provider routing, spend tracking, local model access via Ollama/vLLM behind the same endpoint. This is the recommended production gateway.
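Concretely, "OpenAI-compatible surface" means every service builds the same OpenAI-format chat payload and only the `model` alias changes; LiteLLM handles per-provider translation. A minimal sketch — the `chat_request` helper is illustrative, not from our codebase, and the model aliases assume the LiteLLM config later in this doc:

```python
# Sketch: one OpenAI-format request shape for every backend behind LiteLLM.
# The proxy translates this payload to the target provider's native format.

def chat_request(model_name: str, user_text: str) -> dict:
    """Build an OpenAI chat-completions payload for the LiteLLM proxy."""
    return {
        "model": model_name,  # LiteLLM model_name alias, e.g. "claude-sonnet"
        "messages": [{"role": "user", "content": user_text}],
    }

# Same shape whether the proxy routes to Anthropic or a local Ollama model:
anthropic_req = chat_request("claude-sonnet", "Draft a reply.")
local_req = chat_request("local-small", "Draft a reply.")
```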
Azure AI Foundry¶
| | Azure AI Foundry |
|---|---|
| Deployment | SaaS (Azure-hosted); tied to Azure subscription |
| Claude support | Claude available via Models-as-a-Service (MaaS). Tool use and caching supported; feature parity lags direct API by weeks |
| Cost overhead | ~15–25% markup over direct Anthropic pricing |
| Operational burden | Medium. Azure RBAC + Entra integration — relevant since we already use Entra for staff auth |
| Audit | Azure Monitor / Log Analytics integration. Enterprise-grade but Azure-locked |
| Multi-model | Azure model catalog (OpenAI, Meta, Mistral, Claude via MaaS) |
Right for the employee chat surface. Entra SSO gives every staff member browser-based access to Claude gated on their existing company identity — no API keys, no onboarding friction, per-user audit out of the box. The 15–25% markup is the price of zero-config access for non-technical users. Content filtering and per-user spend controls come standard.
Wrong for the application layer. Service-to-service calls don't benefit from Entra SSO. The markup compounds across high-volume agent workloads. Feature parity lags the direct API (tool use, prompt caching, extended thinking land on direct first). And the operational needs — per-execution cost caps, spend tracking by workflow ID, multi-provider routing to local models — aren't in Foundry's design surface.
If compliance requires that application-layer LLM calls also route through Azure infrastructure, LiteLLM can target Foundry's OpenAI-compatible endpoint as a backend — preserving all the application-layer tooling while satisfying the compliance gate.
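A sketch of what that backend swap looks like in LiteLLM config — the deployment name is a placeholder, and the `azure/` prefix plus `api_base`/`api_version` parameters follow LiteLLM's standard Azure provider convention (verify against the Foundry deployment once it exists):

```yaml
model_list:
  - model_name: claude-sonnet               # same alias; callers don't change
    litellm_params:
      model: azure/<foundry-deployment>     # placeholder deployment name
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
      api_version: os.environ/AZURE_API_VERSION
```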
Amazon Bedrock¶
| | Amazon Bedrock |
|---|---|
| Deployment | SaaS (AWS-hosted) |
| Claude support | Full — prompt caching ("context caching"), tool use, streaming. Cross-region inference for capacity |
| Cost overhead | ~15–25% markup. Provisioned Throughput option for predictable cost |
| Operational burden | Medium. IAM policies, CloudWatch billing alarms, VPC endpoints for private access |
| Audit | CloudWatch logging built in |
| Multi-model | AWS model garden (Anthropic, Meta, Mistral, Cohere, Stability) |
Not compelling unless the production stack moves to AWS. The IAM overhead and markup are not justified at our scale.
Recommended architecture¶
Phase 1 — Direct Anthropic API (bootstrap)¶
Each service has its own thin wrapper (see 03-client-wrappers.md), env-gated on ANTHROPIC_API_KEY with a no-op stub if unset. Covers Layers A + D (Haiku, no tools) and Layer B (Sonnet, sandbox-only tools).

Ship this. It works tomorrow.
Phase 2 — LiteLLM gateway (production)¶
```
BFF (Python)  ──httpx────┐
rev-sci (Go)  ──net/http─┤
future svc    ───────────┤
                         ▼
                litellm-proxy ─┬─▶ api.anthropic.com (Claude)
                  (compose)    ├─▶ ollama (local models)
                      │        ├─▶ vllm (self-hosted)
                      │        └─▶ azure/bedrock (optional)
                      ▼
                litellm-db (Postgres)
                  spend tracking
                  request logs
                  audit trail
```
The migration from Phase 1 to Phase 2 is a one-line change per service: swap the base URL from api.anthropic.com to litellm:4000. The client wrapper abstracts this.
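For instance, the wrapper can read its base URL from an env var, making the migration a deploy-time config change rather than a code change. `LLM_BASE_URL` is an assumed variable name, not from the source:

```python
# Sketch: env-driven base URL. Phase 1 uses the direct Anthropic default;
# Phase 2 sets LLM_BASE_URL=http://litellm:4000 — no code change per service.
import os


def llm_base_url() -> str:
    return os.environ.get("LLM_BASE_URL", "https://api.anthropic.com")
```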
LiteLLM config¶
```yaml
model_list:
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-haiku
    litellm_params:
      model: anthropic/claude-haiku-4-5-20251001
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: local-small
    litellm_params:
      model: ollama/llama3.2
      api_base: http://ollama:11434
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY_FALLBACK
    model_info:
      priority: 2  # fallback if primary key is rate-limited

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/LITELLM_DATABASE_URL
  max_budget: 50.0  # $50/day hard cap
  budget_duration: 24h

litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: shared-redis
    port: 6379
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]
```
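For the "spend tracking by workflow ID" requirement, LiteLLM accepts request metadata tags and attributes logged spend to them. A sketch — the `workflow:` tag scheme and the helper are assumptions, not from the source:

```python
# Sketch: tagging each request with a workflow ID so LiteLLM's spend logs
# can be grouped per workflow/review rather than per API key.

def tagged_request(model_name: str, user_text: str, workflow_id: str) -> dict:
    """OpenAI-format payload with LiteLLM metadata tags for spend tracking."""
    return {
        "model": model_name,
        "messages": [{"role": "user", "content": user_text}],
        "metadata": {"tags": [f"workflow:{workflow_id}"]},
    }

req = tagged_request("claude-haiku", "Parse this intake form.", "intake-1234")
```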
Compose service¶
```yaml
litellm:
  image: ghcr.io/berriai/litellm:main-latest
  ports:
    - "4000:4000"
  volumes:
    - ./litellm_config.yaml:/app/config.yaml
  environment:
    - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
    - LITELLM_MASTER_KEY=${LITELLM_MASTER_KEY}
    - LITELLM_DATABASE_URL=postgresql://litellm:litellm@litellm-db:5432/litellm
  command: ["--config", "/app/config.yaml"]
  depends_on:
    - litellm-db
    - shared-redis
  networks:
    - backoffice

litellm-db:
  image: postgres:16-alpine
  environment:
    POSTGRES_USER: litellm
    POSTGRES_PASSWORD: litellm
    POSTGRES_DB: litellm
  volumes:
    - litellm-pgdata:/var/lib/postgresql/data
  networks:
    - backoffice
```
Open questions¶
- LiteLLM from day one? Spend tracking and unified logging are immediately useful. Is the extra container justified before multi-provider routing is needed?
- Foundry for employee chat — who owns it? IT procurement or engineering? The application layer doesn't depend on it, but the deployment timeline may affect how staff perceive "we have AI."
- Compliance routing. Does compliance require that application-layer LLM calls also go through Azure infrastructure? If yes, LiteLLM routes to Foundry's endpoint as a backend — architecture doesn't change, but we need the Foundry deployment stood up first.
- Local model use cases. What justifies the GPU cost for Ollama/vLLM? Candidate: high-volume intake parsing if Haiku costs become material at scale.