Agentic platform — design review¶
Purpose: Shareable overview of our LLM agent platform design. Read this first; it pulls together the proposals from the research docs into one reviewable surface. Add your comments inline or raise them in discussion — the "Input needed" sections are the decisions that need more than one brain.
Detail docs (read if you want the full picture): - 01-insertion-points.md — four LLM layers, rollout path - 02-gateway-and-providers.md — platform comparison, LiteLLM config - 03-client-wrappers.md — Go + Python HTTP wrappers - 04-tool-use.md — schema generation, dispatch loops, ML model tools - 05-data-layer.md — ClickHouse over parquet, NFS/ADLS - 06-agentic-swarms.md — multi-agent orchestration via Temporal - 07-cost-safety-audit.md — cost, safety, audit trail - key-decisions.md § D18 — the guardrails ADR
Supporting context (read if a section references them): - bespoke-telemetry.md — the data layers that feed agent tools - bespoke-allocator.md — the deterministic math engine agents must not replace - uwe-queue-temporal-blueprint.md — underwriting queue where agents also plug in
1. Core principle¶
LLMs earn their keep on free-text parsing and tool-call reasoning, NOT on hard mathematics. The deterministic allocator owns the math. Every LLM-proposed allocation is validated by the deterministic simulate endpoint — the LLM cannot approve something that violates floor or budget constraints without the violation being surfaced mechanically. This is ADR D18.
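A minimal sketch of the D18 gate, under illustrative assumptions (the real simulate endpoint lives in PyICE; the constraint shapes and function names here are stand-ins):

```python
from dataclasses import dataclass

@dataclass
class SimulateResult:
    feasible: bool
    violations: list[str]

def simulate(allocation: list[float], floors: list[float], budget: float) -> SimulateResult:
    """Stand-in for the deterministic simulate endpoint: checks floors and budget."""
    violations = [
        f"package {i}: {a:.2f} below floor {f:.2f}"
        for i, (a, f) in enumerate(zip(allocation, floors))
        if a < f
    ]
    if sum(allocation) > budget:
        violations.append(f"total {sum(allocation):.2f} exceeds budget {budget:.2f}")
    return SimulateResult(feasible=not violations, violations=violations)

def accept_llm_proposal(allocation: list[float], floors: list[float], budget: float) -> bool:
    """An LLM-proposed allocation proceeds only if the deterministic check passes."""
    result = simulate(allocation, floors, budget)
    if not result.feasible:
        # Violations are surfaced mechanically; the model cannot talk its way past them.
        print("rejected:", "; ".join(result.violations))
    return result.feasible
```

The point of the shape: the LLM's output is an *argument* to a deterministic function, never the final word.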
Input needed¶
Is this constraint the right one? Are there cases where we'd want the LLM to do numerical reasoning (e.g. "this partner's giveaway pattern is trending up 15% quarter-on-quarter") that go beyond parsing and tool-calling? Or is that always a ClickHouse query?
2. Four insertion points¶
Each "layer" is an independent LLM integration point. They ship separately, in order of risk.
→ Full detail: 01-insertion-points.md
| Layer | What it does | Where | Model | Tools | Risk |
|---|---|---|---|---|---|
| A — Intake parsing | Extract structured tags from free-text justifications | BFF endpoint | Haiku | None | Low |
| B — Advisory allocator | Propose a giveaway vector in the sandbox | BFF endpoint | Sonnet | get_proforma, get_partner_history | Medium-low |
| C — Decision reviewer | Review real bespoke requests; approve/decline/escalate | rev-sci Temporal activity | Sonnet | 7 tools (simulate, history, approve, decline, escalate...) | High |
| D — Outcome narrative | Summarise a completed review in plain English | BFF endpoint, cached | Haiku | None | Very low |
Rollout order¶
Phase 1: Layer A + D ← cheap, advisory-only, no tools, ships first
Phase 2: Layer B ← sandbox-only, exercises tool-calling safely
Phase 3a: Layer C shadow ← writes shadow_decision; human still decides
Phase 3b: Layer C auto ← narrow-band autonomous (low-stakes only)
Input needed¶
Are these the right four? Is there a fifth insertion point we're missing? (e.g. proactive partner outreach, portfolio-level risk scoring, quote-complexity triage before it reaches a human?)
Phase 1 priority. Layer A (intake tags) and Layer D (narrative) are bundled because they're both Haiku / no tools. Would you ship one before the other? Does the narrative add enough value on day one, or is intake parsing the only priority?
Layer C scope. The seven tools for the decision reviewer — are they the right set? Too many? Missing one?
3. LLM gateway — how we call models¶
→ Full detail: 02-gateway-and-providers.md
Two audiences, two gateways¶
The organisation has two distinct LLM access needs:
Employee chat (Azure AI Foundry). Staff need browser-based access to Claude for email wording, ad-hoc text analysis, general productivity. Foundry gives them Entra SSO, per-user audit, content filtering — zero-config onboarding gated on their existing company identity. The 15–25% markup is the price of that zero-friction access. This is an IT/procurement decision.
Application layer (LiteLLM → Anthropic direct). Our agentic workflows are service-to-service — API keys in a secrets vault, not user identities. What we need: per-execution cost caps, spend tracking by workflow/review ID, tool-call dispatch, multi-model routing. None of that is in Foundry's design surface. LiteLLM provides all of it at zero per-call markup.
These don't overlap. Deploy them alongside each other, not instead of each other.
Application layer: two-phase approach¶
Phase 1 (bootstrap): Direct Anthropic API. Thin httpx (Python) / net/http (Go) wrappers in each service. Env-gated — no ANTHROPIC_API_KEY, no LLM features, no errors.
Phase 2 (production): LiteLLM proxy. Single Docker container in compose. All services point at http://litellm:4000 instead of api.anthropic.com. Gives us:
- Unified API surface (OpenAI-compatible) across providers
- Per-service spend caps and rate limits
- Request/response logging to Postgres (full audit trail)
- Multi-provider routing (Anthropic, Ollama for local models)
- Zero per-call markup — we bring our own API keys
If compliance requires that application-layer calls also route through Azure, LiteLLM can target Foundry's endpoint as a backend — our architecture doesn't change.
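For illustration, a sketch that builds the OpenAI-compatible request; the only client-side difference between phases is the base URL. The `metadata` spend-tracking key is an assumption about LiteLLM config, not confirmed:

```python
import json

def chat_request(model: str, user_text: str, base_url: str = "http://litellm:4000"):
    """Build an OpenAI-compatible chat request for the LiteLLM proxy.
    Swapping base_url is the only change between bootstrap and production."""
    url = f"{base_url}/v1/chat/completions"
    body = {
        "model": model,  # LiteLLM routes by model name
        "messages": [{"role": "user", "content": user_text}],
        "metadata": {"workflow_id": "demo"},  # illustrative: key for spend tracking
    }
    return url, json.dumps(body)
```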
Input needed¶
LiteLLM from day one? It adds spend tracking and unified logging immediately. Is that worth the extra container, or do we bootstrap with direct API and add LiteLLM when we need multi-provider?
Foundry for employee chat — who owns it? IT procurement or engineering? The application layer doesn't depend on it, but the deployment timeline may affect how staff perceive "we have AI."
Compliance routing. Does compliance require that application-layer LLM calls also go through Azure infrastructure? If yes, LiteLLM routes to Foundry as a backend — architecture unchanged, but we need the Foundry deployment stood up first.
4. Tools — ML models and analytics as agent capabilities¶
→ Full detail: 04-tool-use.md
The core insight: existing service clients are already the right shape for LLM tools. The fraud scoring model, the PyICE allocator, the rev-sci partner history endpoint — each is a typed function with a request struct, a response struct, and an HTTP call. Wrapping them as tools means generating a JSON Schema from the existing type and writing a dispatch case.
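A sketch of schema generation from an existing request type. This uses a plain dataclass for illustration; services on Pydantic could emit richer schemas, and the request struct here is hypothetical:

```python
from dataclasses import dataclass, fields

TYPE_MAP = {str: "string", int: "integer", float: "number", bool: "boolean"}

@dataclass
class PartnerHistoryRequest:
    """Illustrative request struct for the get_partner_history tool."""
    partner_id: str
    lookback_days: int

def tool_schema(name: str, description: str, request_cls) -> dict:
    """Derive an LLM tool definition from an existing request dataclass."""
    props = {f.name: {"type": TYPE_MAP[f.type]} for f in fields(request_cls)}
    return {
        "name": name,
        "description": description,
        "input_schema": {
            "type": "object",
            "properties": props,
            "required": [f.name for f in fields(request_cls)],
        },
    }
```

The existing typed client is the single source of truth; the tool definition is derived, never hand-maintained.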
Tool catalogue (proposed)¶
| Tool | Backing service | What it does | Used by |
|---|---|---|---|
| score_fraud_risk | UW fraud-scoring ML model | Score fraud risk per quote line; returns risk bands + probability | Layer C, UW swarm |
| simulate_allocation | PyICE /vos/bespoke/simulate | Validate a proposed giveaway vector against floor + budget constraints | Layer B, Layer C |
| get_partner_history | rev-sci / ClickHouse | Partner's bespoke review history: connection rates by decision branch | Layer B, Layer C |
| explain_clamps | PyICE | Explain why specific packages were clamped in an allocation | Layer C |
| query_analytics | ClickHouse (read-only SQL) | Ad-hoc analytical queries over parquet data lake | Layer C, UW swarm |
| approve_review | rev-sci Temporal | Approve a bespoke review with a specific allocation | Layer C only |
| decline_review | rev-sci Temporal | Decline a bespoke review with rationale | Layer C only |
| escalate_to_human | rev-sci Temporal | Escalate to human reviewer | Layer C only |
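The dispatch side is then a name-to-handler table plus one case per tool. A sketch with stub handlers (real handlers call the backing services):

```python
# Illustrative dispatch table: tool name -> handler. These stubs just echo
# their inputs; real handlers wrap the existing typed service clients.
HANDLERS = {
    "simulate_allocation": lambda args: {"feasible": True, "clamped": []},
    "get_partner_history": lambda args: {"partner_id": args["partner_id"], "reviews": []},
    "escalate_to_human": lambda args: {"status": "escalated", "reason": args.get("reason")},
}

def dispatch(tool_name: str, args: dict) -> dict:
    """One dispatch case per tool. Unknown names become a tool-error result
    that goes back to the model, rather than crashing the loop."""
    handler = HANDLERS.get(tool_name)
    if handler is None:
        return {"error": f"unknown tool: {tool_name}"}
    return handler(args)
```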
Input needed¶
Tool granularity. Are approve_review / decline_review / escalate_to_human the right decision tools, or should there be a single submit_decision(action, ...) tool? Fewer tools = simpler prompt; more tools = tighter validation per action.
query_analytics guardrails. The ClickHouse tool lets agents write SQL. Mitigations: read-only user, allowed-table list, row limits, query timeout, DML regex block. Is that enough? Should we constrain further (e.g. pre-approved query templates only)?
Missing tools? What about get_quote_snapshot, get_similar_quotes, get_partner_tier? Are there things an analyst looks up during a review that aren't in this list?
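The proposed mitigations can be sketched as a pre-flight check layered on top of the READONLY user (defence in depth, not a replacement for it). Table names are the proposed lake tables; the regex set is illustrative, not exhaustive:

```python
import re

ALLOWED_TABLES = {"partner_metrics", "quote_outcomes", "commission_history"}  # proposed lake tables
DML = re.compile(r"\b(insert|update|delete|alter|drop|truncate|create|grant)\b", re.I)

def check_query(sql: str, row_limit: int = 1000) -> tuple[bool, str]:
    """Reject DML and off-list tables; force a row limit onto unbounded queries."""
    if DML.search(sql):
        return False, "DML keyword blocked"
    tables = set(re.findall(r"\bfrom\s+(\w+)", sql, re.I))
    if not tables <= ALLOWED_TABLES:
        return False, f"table not in allowlist: {tables - ALLOWED_TABLES}"
    if not re.search(r"\blimit\s+\d+\b", sql, re.I):
        sql = f"{sql.rstrip(';')} LIMIT {row_limit}"
    return True, sql
```

Regex screening alone is famously leaky, which is why the READONLY user and query timeout remain the hard guarantees.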
5. ClickHouse + parquet data layer¶
→ Full detail: 05-data-layer.md
Agents need analytical data — partner history, quote outcomes, commission patterns. ClickHouse queries parquet files from the data lake.
Data flow¶
MySQL (rev_sci, commercial_engine, V3/VOS)
│
▼
Dagster assets (scheduled / sensor-triggered)
│ extract → transform → partition → write parquet
▼
NFS mount (dev/on-prem) or ADLS Gen2 (production)
│
▼
ClickHouse reads parquet on query via File() / azureBlobStorage()
│
▼
query_analytics tool → agent
Dev: NFS mount at /data/lake/, bind-mounted into the ClickHouse container. Zero cost, zero latency.
Production: ADLS Gen2, Dagster-orchestrated. Partitioned assets with freshness sensors, lineage tracking, automatic backfills.
Input needed¶
Freshness SLA. How stale can the data be for agent queries? Nightly materialisation is simplest. Hourly is achievable. Near-real-time (CDC sensors) adds pipeline complexity. What does the review workflow actually need?
Dagster ownership. Who writes the Dagster asset definitions? Is this the data team's remit, or does the back-office team own the pipeline end-to-end?
What tables? partner_metrics, quote_outcomes, commission_history are proposed. What's missing? What does the commercial team actually query today that should be in the lake?
6. Agentic swarms — multi-agent orchestration¶
→ Full detail: 06-agentic-swarms.md
Single tool-use loops (one agent, one prompt, a few tool calls) work for Layers A–D. The broader vision — underwriting triage, cross-domain deal analysis, portfolio risk — needs specialist agents coordinated by an orchestrator.
Pattern: Temporal as swarm backbone¶
We already use Temporal for BespokeReviewWorkflow. Each specialist
agent becomes a Temporal activity. The workflow dispatches them,
handles timeouts, retries failures.
Bespoke review swarm
│
├── IntakeParser (Haiku, no tools) → advisory tags
│
├── MarginAnalyst (Sonnet, simulate + clamps) ─┐
├── PartnerAnalyst (Sonnet, history + analytics) ─┤ parallel
├── FraudScreener (Sonnet, fraud ML + analytics) ─┘
│
├── Synthesiser (Sonnet, reads specialist outputs)
│ → approve / counter / decline / escalate
│
├── [if escalate] wait for human signal
│
└── NarrativeWriter (Haiku, no tools) → summary
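The fan-out/synthesise shape above can be sketched with plain asyncio; in the real design each specialist is a Temporal activity (an LLM call with its own tools), and Temporal owns timeouts and retries:

```python
import asyncio

# Stand-ins for the specialist agents. In the real swarm each coroutine is a
# Temporal activity invoking a model with its own tool set.
async def margin_analyst(review: dict) -> dict:
    return {"specialist": "margin", "verdict": "ok"}

async def partner_analyst(review: dict) -> dict:
    return {"specialist": "partner", "verdict": "ok"}

async def fraud_screener(review: dict) -> dict:
    return {"specialist": "fraud", "verdict": "ok"}

async def bespoke_review_swarm(review: dict) -> str:
    # Fan the three specialists out in parallel, then synthesise.
    reports = await asyncio.gather(
        margin_analyst(review), partner_analyst(review), fraud_screener(review)
    )
    # Synthesiser policy (illustrative): escalate unless every specialist is happy.
    return "approve" if all(r["verdict"] == "ok" for r in reports) else "escalate"
```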
Second swarm: underwriting triage¶
Same pattern, different specialists:
UW triage swarm
│
├── DocumentParser (Haiku) → structured fields
├── FraudAnalyst (Sonnet, fraud ML + analytics)
├── CreditAnalyst (Sonnet, analytics)
├── ComplianceChecker (Haiku, sanctions + PEP)
├── Synthesiser (Sonnet) → approve / refer / decline
└── DecisionWriter (Haiku) → audit narrative
Input needed¶
Swarm vs single agent for Layer C. The insertion-points doc describes Layer C as a single agent with seven tools. The swarms doc proposes a multi-agent version with specialists. Which ships first? Single agent is simpler; swarm is more capable. Proposal: single agent for Phase 3a (shadow), swarm for Phase 3b (autonomous).
UW triage swarm — real or aspirational? Is the underwriting team ready for agentic review, or is this a 12-month horizon? Should we design the platform with UW in mind but ship bespoke first?
Per-execution cost cap. The swarm config proposes $0.50 per review execution. Is that the right order of magnitude? At 100 reviews/day that's $50/day for the swarm alone.
7. Cost envelope¶
→ Full detail: 07-cost-safety-audit.md
| Scenario | Daily volume | Daily cost | Notes |
|---|---|---|---|
| Phase 1 (A + D only) | 100 reviews | ~$0.80 | Haiku, no tools |
| Phase 2 (+ Layer B) | +10 sandbox uses | ~$1.20 | Sonnet, bursty |
| Phase 3a (+ shadow) | +100 shadow reviews | ~$9.20 | Sonnet, 8k tokens each |
| Full swarm | 100 reviews × 5 agents | ~$50 | Parallel specialists |
| Scale (1000/day) | 10× above | ~$500 | Prompt caching load-bearing |
Input needed¶
Budget authority. Who approves the spend? Is there a monthly cap? Does it need sign-off per phase?
Prompt caching. The cost estimates assume >80% cache hit rate on system prompts. This is realistic for repetitive tasks but unvalidated. Should we prototype Phase 1 first and measure actual cache hit rates before committing to Phase 3 cost projections?
8. Guardrails and safety¶
| Guardrail | How it works |
|---|---|
| Env-gated | No ANTHROPIC_API_KEY → no LLM features → no errors. Workflows degrade, never break |
| Deterministic validation | Every LLM-proposed allocation runs through the simulate endpoint. Floor/budget violations are caught mechanically |
| Shadow-first | Phase 3a: LLM writes what it would decide; human decides. No downstream effect |
| Narrow-band autonomous | Phase 3b: auto-approve only below £X, low variance, high confidence. Everything else escalates |
| Escalate-to-human always available | The default when confidence is low |
| Read-only ClickHouse | READONLY user, DML regex block, allowed tables, row limits, query timeout |
| Prompt injection defence | User text in user role only; no write-effect tools on untrusted-text agents |
| Audit trail | Full prompt + tool calls + response logged per LLM call, correlated by workflow/review ID |
| Per-execution cost cap | Swarm aborts if budget exhausted; synthesiser decides with available info |
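A sketch of the per-execution cost cap; the $0.50 figure comes from the proposed swarm config, everything else here is illustrative:

```python
class BudgetExhausted(Exception):
    pass

class ExecutionBudget:
    """Per-execution cost cap: each LLM call charges against it. When it is
    exhausted the swarm stops calling models, and the synthesiser decides
    with whatever specialist output it already has."""
    def __init__(self, cap_usd: float = 0.50):
        self.cap = cap_usd
        self.spent = 0.0

    def charge(self, usd: float) -> None:
        if self.spent + usd > self.cap:
            raise BudgetExhausted(f"cap ${self.cap:.2f} reached at ${self.spent:.2f}")
        self.spent += usd

budget = ExecutionBudget(cap_usd=0.50)
budget.charge(0.30)      # first specialist call fits
try:
    budget.charge(0.30)  # second would exceed the $0.50 cap
except BudgetExhausted:
    aborted = True       # swarm falls back to synthesising with available info
```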
Input needed¶
Concordance threshold. What % agreement between shadow and human decisions is "good enough" to graduate to autonomous? Who sets it — commercial ops, engineering, both?
Prompt governance. Who can edit prompt templates? Is there a review process? Should prompt changes go through the same PR process as code?
9. What we are NOT doing¶
- No Anthropic SDK dependency in Phase 8 (current phase) — design only
- No prompt text — describe shape, not prose
- No MCP orchestration — native tool-use API, single-loop agents
- No multi-turn conversations — each layer is one tool-use loop
- No model fine-tuning — prompt engineering only
- No self-hosted LLM training — inference access via wrappers
- No production deployment topology — compose is the dev shape
10. Summary of decisions needed¶
For quick reference — every "Input needed" section above, condensed:
| # | Decision | Who should weigh in |
|---|---|---|
| 1 | LLMs parse + reason only, or also numerical trend analysis? | Engineering, commercial |
| 2 | Are four insertion points the right set? Missing a fifth? | Commercial ops, engineering |
| 3 | Phase 1 priority: both A+D, or intake parsing (A) first? | Product |
| 4 | Layer C tool set — right size? Missing tools? | Bespoke reviewers |
| 5 | LiteLLM from day one or after bootstrap? | Engineering |
| 6 | Foundry for employee chat — who owns it? Timeline? | IT, management |
| 6b | Compliance routing — must app-layer calls also go through Azure? | Compliance, IT |
| 7 | query_analytics SQL guardrails — sufficient? | Security, engineering |
| 8 | ClickHouse freshness SLA (nightly / hourly / CDC)? | Commercial ops, data |
| 9 | Dagster pipeline ownership — back-office or data team? | Management |
| 10 | What tables belong in the analytical lake? | Commercial, data |
| 11 | Swarm vs single agent — which ships first? | Engineering |
| 12 | UW triage swarm — real timeline or aspirational? | UW team, product |
| 13 | Per-execution cost cap — $0.50 right? | Commercial, finance |
| 14 | Monthly spend budget and approval authority? | Finance, management |
| 15 | Concordance threshold for shadow → autonomous? | Commercial ops, engineering |
| 16 | Prompt governance process? | Engineering, compliance |