Skip to content

07 — Cost controls, safety, and audit

Per-execution budgets, spend tracking, observability, API key management, prompt injection defence, and audit trail.

Cost envelope

At 100 reviews/day running all four layers:

Layer Volume Tokens/call Model Daily cost
A (intake parse) 100 ~1.5k Haiku ~$0.30
D (narrative) 100 ~2.5k Haiku ~$0.50 (cached)
B (advisory) ~10 (bursty) ~4k Sonnet ~$0.40
C (shadow) 100 ~8k Sonnet ~$8.00
Total (shadow) ~$10/day
Full swarm 100 × 5 agents ~$50/day

Prompt caching is load-bearing. System prompts + tool schemas are stable across calls — >80% cache hit rate realistic for Layers A/D.

Per-execution budgets

Each swarm execution has a MaxBudgetCents cap. The tool dispatcher tracks cumulative token usage. If budget is exhausted mid-swarm, the synthesiser receives a "budget exhausted" signal and must decide with available information.

LiteLLM spend tracking

LiteLLM's Postgres database records every call: - model, provider, tokens (in/out/cached), cost - API key used, request tags (review_id, workflow_id) - full request/response payloads (optional; toggle per model)

Exposed via admin API: - GET /spend/logs — per-call log - GET /spend/tags — spend by tag - GET /budget/info — current budget consumption

Langfuse integration (optional)

LiteLLM supports Langfuse as a callback: - Trace visualisation (full tool-loop chains) - Cost dashboards by model, user, feature - Prompt versioning and A/B testing - Evaluation pipelines (concordance scoring for shadow mode)

Not required for v1; valuable for Phase 3a when concordance measurement becomes load-bearing.

Structured logging

Every LLM call logged with business context:

{
  "event": "llm_call",
  "agent": "margin-analyst",
  "workflow_id": "bespoke-review-abc123",
  "review_id": "rev_01JXYZ",
  "model": "claude-sonnet",
  "input_tokens": 3200,
  "output_tokens": 450,
  "cache_read_tokens": 1800,
  "tool_calls": ["simulate_allocation"],
  "latency_ms": 2340,
  "cost_cents": 4.2,
  "stop_reason": "end_turn"
}

Emitted from the tool dispatcher, not the HTTP client — captures which agent and which review alongside the technical metrics.

API key management

  • ANTHROPIC_API_KEY — stored in secrets manager (Vault / Azure Key Vault). Never in compose env files for production.
  • LiteLLM master key — rotated independently; scoped to internal traffic only.
  • Per-service virtual keys (LiteLLM feature) — each service gets its own key with its own budget. BFF cannot exhaust rev-sci's budget.

Prompt injection defence

LLM agents that process user-supplied text (Layer A intake parsing, Layer C reviewing partner justifications) are prompt injection surfaces. Mitigations:

  • User-supplied text is always in the user message, never interpolated into the system prompt.
  • Tool results are returned as tool_result content blocks, not concatenated into message text.
  • Agents that process untrusted text have no write-effect tools. Layer A has no tools at all; Layer C's write-effect tools (approve_review, decline_review) require the synthesiser to act — a specialist cannot approve directly.

Audit trail

Every LLM-influenced decision captures: - Full prompt (system + messages + tool schemas) - All tool calls and results - Final response - Token counts and cost - Workflow ID + review ID for correlation

Stored in llm_audit_log (ClickHouse, reading parquet on NFS / ADLS) for long-term retention, not in the transactional database. The audit parquet is a Dagster asset — materialised from the LiteLLM Postgres spend log, partitioned by date. Retention: 2 years minimum for compliance.

Open questions

  1. Monthly spend budget and approval authority. Who approves?
  2. Prompt caching validation. Prototype Phase 1 and measure actual cache hit rates before projecting Phase 3 costs?
  3. Concordance threshold. What % agreement graduates shadow → autonomous? Who sets it?
  4. Prompt governance. Who can edit prompt templates? PR review process?
  5. Rate limiting LLM calls. Per-day or per-month cost cap? Does the cap interact with the narrow-band threshold?