
Agentic platform — design review

Purpose: Shareable overview of our LLM agent platform design. Read this first; it pulls together the proposals from the research docs into one reviewable surface. Add your comments inline or raise them in discussion — the "Input needed" sections are the decisions that need more than one brain.

Detail docs (read if you want the full picture):

  • 01-insertion-points.md — four LLM layers, rollout path
  • 02-gateway-and-providers.md — platform comparison, LiteLLM config
  • 03-client-wrappers.md — Go + Python HTTP wrappers
  • 04-tool-use.md — schema generation, dispatch loops, ML model tools
  • 05-data-layer.md — ClickHouse over parquet, NFS/ADLS
  • 06-agentic-swarms.md — multi-agent orchestration via Temporal
  • 07-cost-safety-audit.md — cost, safety, audit trail
  • key-decisions.md § D18 — the guardrails ADR

Supporting context (read if a section references them):

  • bespoke-telemetry.md — the data layers that feed agent tools
  • bespoke-allocator.md — the deterministic math engine agents must not replace
  • uwe-queue-temporal-blueprint.md — underwriting queue where agents also plug in


1. Core principle

LLMs earn their keep on free-text parsing and tool-call reasoning, NOT on hard mathematics. The deterministic allocator owns the math. Every LLM-proposed allocation is validated by the deterministic simulate endpoint — the LLM cannot approve something that violates floor or budget constraints without the violation being surfaced mechanically. This is ADR D18.
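As a shape, the D18 gate is a hard mechanical check sitting between the LLM's proposal and any approval path. A minimal sketch — the `SimulationResult` shape and `gate_llm_allocation` name are hypothetical; the real check is the PyICE simulate endpoint:

```python
from dataclasses import dataclass

@dataclass
class SimulationResult:
    # Illustrative shape; the real response comes from /vos/bespoke/simulate
    floor_violations: list[str]
    budget_violations: list[str]

    @property
    def ok(self) -> bool:
        return not (self.floor_violations or self.budget_violations)

def gate_llm_allocation(proposal: dict, simulate) -> dict:
    """Accept an LLM-proposed allocation only if the deterministic
    simulator reports no floor/budget violations (ADR D18)."""
    result: SimulationResult = simulate(proposal)
    if not result.ok:
        # Violations surface mechanically; the LLM never gets to override them
        return {
            "status": "rejected",
            "violations": result.floor_violations + result.budget_violations,
        }
    return {"status": "validated", "allocation": proposal}
```

The point of the shape: the approval path consumes the gate's output, never the LLM's raw proposal.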

Input needed

Is this constraint the right one? Are there cases where we'd want the LLM to do numerical reasoning (e.g. "this partner's giveaway pattern is trending up 15% quarter-on-quarter") that go beyond parsing and tool-calling? Or is that always a ClickHouse query?


2. Four insertion points

Each "layer" is an independent LLM integration point. They ship separately, in order of risk.

→ Full detail: 01-insertion-points.md

| Layer | What it does | Where | Model | Tools | Risk |
|---|---|---|---|---|---|
| A — Intake parsing | Extract structured tags from free-text justifications | BFF endpoint | Haiku | None | Low |
| B — Advisory allocator | Propose a giveaway vector in the sandbox | BFF endpoint | Sonnet | get_proforma, get_partner_history | Medium-low |
| C — Decision reviewer | Review real bespoke requests; approve/decline/escalate | rev-sci Temporal activity | Sonnet | 7 tools (simulate, history, approve, decline, escalate...) | High |
| D — Outcome narrative | Summarise a completed review in plain English | BFF endpoint, cached | Haiku | None | Very low |

Rollout order

Phase 1:  Layer A + D      ← cheap, advisory-only, no tools, ships first
Phase 2:  Layer B          ← sandbox-only, exercises tool-calling safely
Phase 3a: Layer C shadow   ← writes shadow_decision; human still decides
Phase 3b: Layer C auto     ← narrow-band autonomous (low-stakes only)

Input needed

  1. Are these the right four? Is there a fifth insertion point we're missing? (e.g. proactive partner outreach, portfolio-level risk scoring, quote-complexity triage before it reaches a human?)

  2. Phase 1 priority. Layer A (intake tags) and Layer D (narrative) are bundled because they're both Haiku / no tools. Would you ship one before the other? Does the narrative add enough value on day one, or is intake parsing the only priority?

  3. Layer C scope. The seven tools for the decision reviewer — are they the right set? Too many? Missing one?


3. LLM gateway — how we call models

→ Full detail: 02-gateway-and-providers.md

Two audiences, two gateways

The organisation has two distinct LLM access needs:

Employee chat (Azure AI Foundry). Staff need browser-based access to Claude for email wording, ad-hoc text analysis, general productivity. Foundry gives them Entra SSO, per-user audit, content filtering — zero-config onboarding gated on their existing company identity. The 15–25% markup is the price of that zero-friction access. This is an IT/procurement decision.

Application layer (LiteLLM → Anthropic direct). Our agentic workflows are service-to-service — API keys in a secrets vault, not user identities. What we need: per-execution cost caps, spend tracking by workflow/review ID, tool-call dispatch, multi-model routing. None of that is in Foundry's design surface. LiteLLM provides all of it at zero per-call markup.

These don't overlap. Deploy them alongside each other, not instead of each other.

Application layer: two-phase approach

Phase 1 (bootstrap): Direct Anthropic API. Thin httpx (Python) / net/http (Go) wrappers in each service. Env-gated — no ANTHROPIC_API_KEY, no LLM features, no errors.
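The env-gating can live in a single client factory. A sketch using httpx as the wrappers doc proposes, with the import deferred so the service runs without the dependency; `LLM_BASE_URL` is an assumed variable that later lets Phase 2 point the same wrapper at LiteLLM:

```python
import os

def make_llm_client():
    """Return an Anthropic-backed HTTP client, or None when LLM features
    are disabled. No key → no client → callers skip the LLM path."""
    api_key = os.environ.get("ANTHROPIC_API_KEY")
    if not api_key:
        return None  # env-gated: workflows degrade, never break
    import httpx  # deferred so services without LLM features never need it
    return httpx.Client(
        # Swap LLM_BASE_URL to http://litellm:4000 in Phase 2; nothing else changes
        base_url=os.environ.get("LLM_BASE_URL", "https://api.anthropic.com"),
        headers={"x-api-key": api_key, "anthropic-version": "2023-06-01"},
        timeout=30.0,
    )
```

Callers treat a `None` client as "LLM features off" and take the non-LLM branch.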

Phase 2 (production): LiteLLM proxy. Single Docker container in compose. All services point at http://litellm:4000 instead of api.anthropic.com. Gives us:

  • Unified API surface (OpenAI-compatible) across providers
  • Per-service spend caps and rate limits
  • Request/response logging to Postgres (full audit trail)
  • Multi-provider routing (Anthropic, Ollama for local models)
  • Zero per-call markup — we bring our own API keys

If compliance requires that application-layer calls also route through Azure, LiteLLM can target Foundry's endpoint as a backend — our architecture doesn't change.

Input needed

  1. LiteLLM from day one? It adds spend tracking and unified logging immediately. Is that worth the extra container, or do we bootstrap with direct API and add LiteLLM when we need multi-provider?

  2. Foundry for employee chat — who owns it? IT procurement or engineering? The application layer doesn't depend on it, but the deployment timeline may affect how staff perceive "we have AI."

  3. Compliance routing. Does compliance require that application-layer LLM calls also go through Azure infrastructure? If yes, LiteLLM routes to Foundry as a backend — architecture unchanged, but we need the Foundry deployment stood up first.


4. Tools — ML models and analytics as agent capabilities

→ Full detail: 04-tool-use.md

The core insight: existing service clients are already the right shape for LLM tools. The fraud scoring model, the PyICE allocator, the rev-sci partner history endpoint — each is a typed function with a request struct, a response struct, and an HTTP call. Wrapping them as tools means generating a JSON Schema from the existing type and writing a dispatch case.
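The schema-from-type step can be sketched with a plain dataclass. The `PartnerHistoryRequest` struct and the stub handler below are hypothetical; in practice the schema would be derived from the existing typed clients (pydantic models in Python, struct tags in Go):

```python
from dataclasses import dataclass, fields
import typing

_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}

@dataclass
class PartnerHistoryRequest:
    # Hypothetical request struct; the real one lives in the rev-sci client
    partner_id: str
    lookback_days: int

def schema_from_dataclass(cls) -> dict:
    """Derive a JSON Schema for tool registration from an existing request type."""
    hints = typing.get_type_hints(cls)
    props = {f.name: {"type": _JSON_TYPES[hints[f.name]]} for f in fields(cls)}
    return {"type": "object", "properties": props,
            "required": [f.name for f in fields(cls)]}

TOOLS = {
    "get_partner_history": {
        "description": "Partner's bespoke review history",
        "input_schema": schema_from_dataclass(PartnerHistoryRequest),
        # Dispatch case: unpack validated arguments into the typed client call
        "handler": lambda args: {"partner_id": args["partner_id"], "reviews": []},  # stub client
    },
}

def dispatch(tool_name: str, args: dict):
    """One dispatch case per tool; the agent loop calls this per tool_use block."""
    return TOOLS[tool_name]["handler"](args)
```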

Tool catalogue (proposed)

| Tool | Backing service | What it does | Used by |
|---|---|---|---|
| score_fraud_risk | UW fraud-scoring ML model | Score fraud risk per quote line; returns risk bands + probability | Layer C, UW swarm |
| simulate_allocation | PyICE /vos/bespoke/simulate | Validate a proposed giveaway vector against floor + budget constraints | Layer B, Layer C |
| get_partner_history | rev-sci / ClickHouse | Partner's bespoke review history: connection rates by decision branch | Layer B, Layer C |
| explain_clamps | PyICE | Explain why specific packages were clamped in an allocation | Layer C |
| query_analytics | ClickHouse (read-only SQL) | Ad-hoc analytical queries over the parquet data lake | Layer C, UW swarm |
| approve_review | rev-sci Temporal | Approve a bespoke review with a specific allocation | Layer C only |
| decline_review | rev-sci Temporal | Decline a bespoke review with rationale | Layer C only |
| escalate_to_human | rev-sci Temporal | Escalate to a human reviewer | Layer C only |

Input needed

  1. Tool granularity. Are approve_review / decline_review / escalate_to_human the right decision tools, or should there be a single submit_decision(action, ...) tool? Fewer tools = simpler prompt; more tools = tighter validation per action.

  2. query_analytics guardrails. The ClickHouse tool lets agents write SQL. Mitigations: read-only user, allowed-table list, row limits, query timeout, DML regex block. Is that enough? Should we constrain further (e.g. pre-approved query templates only)?

  3. Missing tools? What about get_quote_snapshot, get_similar_quotes, get_partner_tier? Are there things an analyst looks up during a review that aren't in this list?
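As defence in depth on top of the server-side READONLY user and query timeout, the client-side mitigations might compose into a pre-flight check like this sketch (table names, limits, and the regex are placeholders, not a vetted SQL parser):

```python
import re

ALLOWED_TABLES = {"partner_metrics", "quote_outcomes", "commission_history"}
DML_BLOCK = re.compile(r"\b(insert|update|delete|alter|drop|truncate|create|grant)\b", re.I)
MAX_ROWS = 10_000

def check_agent_sql(sql: str) -> str:
    """Reject DML/DDL and unknown tables; append a row limit. The
    server-side READONLY user and timeout remain the real enforcement."""
    if DML_BLOCK.search(sql):
        raise ValueError("DML/DDL is blocked for agent queries")
    referenced = set(re.findall(r"\b(?:from|join)\s+([a-z_][a-z0-9_]*)", sql, re.I))
    unknown = referenced - ALLOWED_TABLES
    if unknown:
        raise ValueError(f"tables not on the allow-list: {sorted(unknown)}")
    if not re.search(r"\blimit\b", sql, re.I):
        sql = f"{sql.rstrip().rstrip(';')} LIMIT {MAX_ROWS}"
    return sql
```

A regex check is bypassable in principle, which is why the read-only user is the layer that actually matters; this only shrinks the error surface.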


5. ClickHouse + parquet data layer

→ Full detail: 05-data-layer.md

Agents need analytical data — partner history, quote outcomes, commission patterns. ClickHouse queries parquet files from the data lake.

Data flow

MySQL (rev_sci, commercial_engine, V3/VOS)
    ↓
Dagster assets (scheduled / sensor-triggered)
    │ extract → transform → partition → write parquet
    ↓
NFS mount (dev/on-prem)  or  ADLS Gen2 (production)
    ↓
ClickHouse reads parquet at query time via file() / azureBlobStorage()
    ↓
query_analytics tool → agent

Dev: NFS mount at /data/lake/, bind-mounted into ClickHouse container. Zero cost, zero latency.

Production: ADLS Gen2, Dagster-orchestrated. Partitioned assets with freshness sensors, lineage tracking, automatic backfills.
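A sketch of the dev-shape query path, assuming a compose service named `clickhouse` and ClickHouse's standard HTTP interface on port 8123; `file()` resolves paths relative to the server's user_files_path, where the NFS lake would be bind-mounted:

```python
import urllib.parse
import urllib.request

CLICKHOUSE_URL = "http://clickhouse:8123"  # hypothetical compose service name

def build_parquet_query(relpath: str, where: str) -> str:
    """Build a query over lake parquet files. `where` is assumed to have
    passed the query_analytics guardrails upstream. Production swaps
    file() for azureBlobStorage()."""
    return f"SELECT * FROM file('{relpath}', Parquet) WHERE {where}"

def run_query(sql: str) -> bytes:
    # ClickHouse HTTP interface; read-only enforcement lives in the user profile
    params = urllib.parse.urlencode({"query": sql, "default_format": "JSONEachRow"})
    with urllib.request.urlopen(f"{CLICKHOUSE_URL}/?{params}", timeout=30) as resp:
        return resp.read()
```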

Input needed

  1. Freshness SLA. How stale can the data be for agent queries? Nightly materialisation is simplest. Hourly is achievable. Near-real-time (CDC sensors) adds pipeline complexity. What does the review workflow actually need?

  2. Dagster ownership. Who writes the Dagster asset definitions? Is this the data team's remit, or does the back-office team own the pipeline end-to-end?

  3. What tables? partner_metrics, quote_outcomes, commission_history are proposed. What's missing? What does the commercial team actually query today that should be in the lake?


6. Agentic swarms — multi-agent orchestration

→ Full detail: 06-agentic-swarms.md

Single tool-use loops (one agent, one prompt, a few tool calls) work for Layers A–D. The broader vision — underwriting triage, cross-domain deal analysis, portfolio risk — needs specialist agents coordinated by an orchestrator.

Pattern: Temporal as swarm backbone

We already use Temporal for BespokeReviewWorkflow. Each specialist agent becomes a Temporal activity. The workflow dispatches them, handles timeouts, retries failures.

Bespoke review swarm
  ├── IntakeParser        (Haiku, no tools)        → advisory tags
  ├── MarginAnalyst       (Sonnet, simulate + clamps)  ─┐
  ├── PartnerAnalyst      (Sonnet, history + analytics) ─┤ parallel
  ├── FraudScreener       (Sonnet, fraud ML + analytics) ─┘
  ├── Synthesiser         (Sonnet, reads specialist outputs)
  │                         → approve / counter / decline / escalate
  ├── [if escalate] wait for human signal
  └── NarrativeWriter     (Haiku, no tools)        → summary
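The fan-out/synthesise shape of the swarm, with `asyncio.gather` standing in for parallel Temporal activities — the real workflow adds per-activity retries, timeouts, and the wait-for-human-signal branch; all agent bodies here are stubs:

```python
import asyncio

# Each stub stands in for a Temporal activity wrapping one specialist agent
async def margin_analyst(review: dict) -> dict:
    return {"agent": "margin", "ok": True}

async def partner_analyst(review: dict) -> dict:
    return {"agent": "partner", "ok": True}

async def fraud_screener(review: dict) -> dict:
    return {"agent": "fraud", "ok": True}

async def synthesiser(findings: list[dict]) -> str:
    # Real version: Sonnet reads the specialist outputs and picks a branch
    return "approve" if all(f["ok"] for f in findings) else "escalate"

async def bespoke_review_swarm(review: dict) -> str:
    # Fan out the specialists in parallel, then synthesise their findings
    findings = await asyncio.gather(
        margin_analyst(review), partner_analyst(review), fraud_screener(review)
    )
    return await synthesiser(list(findings))
```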

Second swarm: underwriting triage

Same pattern, different specialists:

UW triage swarm
  ├── DocumentParser       (Haiku)          → structured fields
  ├── FraudAnalyst         (Sonnet, fraud ML + analytics)
  ├── CreditAnalyst        (Sonnet, analytics)
  ├── ComplianceChecker    (Haiku, sanctions + PEP)
  ├── Synthesiser          (Sonnet)         → approve / refer / decline
  └── DecisionWriter       (Haiku)          → audit narrative

Input needed

  1. Swarm vs single agent for Layer C. The insertion-points doc describes Layer C as a single agent with seven tools. The swarms doc proposes a multi-agent version with specialists. Which ships first? Single agent is simpler; swarm is more capable. Proposal: single agent for Phase 3a (shadow), swarm for Phase 3b (autonomous).

  2. UW triage swarm — real or aspirational? Is the underwriting team ready for agentic review, or is this a 12-month horizon? Should we design the platform with UW in mind but ship bespoke first?

  3. Per-execution cost cap. The swarm config proposes $0.50 per review execution. Is that the right order of magnitude? At 100 reviews/day that's $50/day for the swarm alone.


7. Cost envelope

→ Full detail: 07-cost-safety-audit.md

| Scenario | Daily volume | Daily cost | Notes |
|---|---|---|---|
| Phase 1 (A + D only) | 100 reviews | ~$0.80 | Haiku, no tools |
| Phase 2 (+ Layer B) | +10 sandbox uses | ~$1.20 | Sonnet, bursty |
| Phase 3a (+ shadow) | +100 shadow reviews | ~$9.20 | Sonnet, 8k tokens each |
| Full swarm | 100 reviews × 5 agents | ~$50 | Parallel specialists |
| Scale (1000/day) | 10× above | ~$500 | Prompt caching load-bearing |
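For re-deriving these figures once real token counts are measured, a rough estimator; the per-million-token prices and the ~90% discount on cached input are assumptions to verify against current Anthropic pricing:

```python
# Assumed list prices, USD per million tokens — verify before relying on them
PRICE = {
    "haiku":  {"input": 0.80, "output": 4.00},
    "sonnet": {"input": 3.00, "output": 15.00},
}

def daily_cost(executions: int, in_tokens: int, out_tokens: int,
               model: str, cache_hit: float = 0.0) -> float:
    """Rough daily spend. Cached input tokens are assumed to bill at
    ~10% of list price; output tokens never cache."""
    p = PRICE[model]
    effective_in = in_tokens * ((1 - cache_hit) + cache_hit * 0.10)
    return executions * (effective_in * p["input"] + out_tokens * p["output"]) / 1e6
```

Plugging in measured Phase 1 token counts gives an early check on whether the table's cache-hit assumption holds.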

Input needed

  1. Budget authority. Who approves the spend? Is there a monthly cap? Does it need sign-off per phase?

  2. Prompt caching. The cost estimates assume >80% cache hit rate on system prompts. This is realistic for repetitive tasks but unvalidated. Should we prototype Phase 1 first and measure actual cache hit rates before committing to Phase 3 cost projections?


8. Guardrails and safety

| Guardrail | How it works |
|---|---|
| Env-gated | No ANTHROPIC_API_KEY → no LLM features → no errors. Workflows degrade, never break |
| Deterministic validation | Every LLM-proposed allocation runs through the simulate endpoint. Floor/budget violations are caught mechanically |
| Shadow-first | Phase 3a: LLM writes what it would decide; human decides. No downstream effect |
| Narrow-band autonomous | Phase 3b: auto-approve only below £X, low variance, high confidence. Everything else escalates |
| Escalate-to-human | Always available; the default when confidence is low |
| Read-only ClickHouse | READONLY user, DML regex block, allowed tables, row limits, query timeout |
| Prompt injection defence | User text in user role only; no write-effect tools on untrusted-text agents |
| Audit trail | Full prompt + tool calls + response logged per LLM call, correlated by workflow/review ID |
| Per-execution cost cap | Swarm aborts if budget exhausted; synthesiser decides with available info |
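The narrow-band gate reduces to a few comparisons; the thresholds below are placeholders pending the commercial-ops decision, and the `ShadowDecision` shape is hypothetical:

```python
from dataclasses import dataclass

# Placeholder thresholds — the real values are a commercial-ops decision
AUTO_APPROVE_LIMIT_GBP = 500.0
MIN_CONFIDENCE = 0.9
MAX_VARIANCE = 0.05

@dataclass
class ShadowDecision:
    action: str          # "approve" | "decline" | "escalate"
    amount_gbp: float
    confidence: float
    variance: float

def route_decision(d: ShadowDecision) -> str:
    """Phase 3b gate: only low-stakes, high-confidence approvals run
    autonomously; everything else falls back to a human reviewer."""
    if (d.action == "approve"
            and d.amount_gbp < AUTO_APPROVE_LIMIT_GBP
            and d.confidence >= MIN_CONFIDENCE
            and d.variance <= MAX_VARIANCE):
        return "auto_approve"
    return "escalate_to_human"
```

Note the asymmetry: declines never run autonomously in this sketch, only cheap approvals, which matches the shadow-first posture.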

Input needed

  1. Concordance threshold. What % agreement between shadow and human decisions is "good enough" to graduate to autonomous? Who sets it — commercial ops, engineering, both?

  2. Prompt governance. Who can edit prompt templates? Is there a review process? Should prompt changes go through the same PR process as code?


9. What we are NOT doing

  • No Anthropic SDK dependency in Phase 8 (current phase) — design only
  • No prompt text — describe shape, not prose
  • No MCP orchestration — native tool-use API, single-loop agents
  • No multi-turn conversations — each layer is one tool-use loop
  • No model fine-tuning — prompt engineering only
  • No self-hosted LLM training — inference access via wrappers
  • No production deployment topology — compose is the dev shape

10. Summary of decisions needed

For quick reference — every "Input needed" section above, condensed:

| # | Decision | Who should weigh in |
|---|---|---|
| 1 | LLMs parse + reason only, or also numerical trend analysis? | Engineering, commercial |
| 2 | Are four insertion points the right set? Missing a fifth? | Commercial ops, engineering |
| 3 | Phase 1 priority: both A+D, or intake parsing (A) first? | Product |
| 4 | Layer C tool set — right size? Missing tools? | Bespoke reviewers |
| 5 | LiteLLM from day one or after bootstrap? | Engineering |
| 6 | Foundry for employee chat — who owns it? Timeline? | IT, management |
| 6b | Compliance routing — must app-layer calls also go through Azure? | Compliance, IT |
| 7 | query_analytics SQL guardrails — sufficient? | Security, engineering |
| 8 | ClickHouse freshness SLA (nightly / hourly / CDC)? | Commercial ops, data |
| 9 | Dagster pipeline ownership — back-office or data team? | Management |
| 10 | What tables belong in the analytical lake? | Commercial, data |
| 11 | Swarm vs single agent — which ships first? | Engineering |
| 12 | UW triage swarm — real timeline or aspirational? | UW team, product |
| 13 | Per-execution cost cap — $0.50 right? | Commercial, finance |
| 14 | Monthly spend budget and approval authority? | Finance, management |
| 15 | Concordance threshold for shadow → autonomous? | Commercial ops, engineering |
| 16 | Prompt governance process? | Engineering, compliance |