Agentic platform — design review¶
Purpose: Shareable overview of our LLM agent platform design. Read this first; it pulls together the proposals from the research docs into one reviewable surface. Add your comments inline or raise them in discussion — the "Input needed" sections are the decisions that need more than one brain.
Detail docs (read if you want the full picture): - 01-insertion-points.md — four LLM layers, rollout path - 02-gateway-and-providers.md — platform comparison, LiteLLM config - 03-client-wrappers.md — Go + Python HTTP wrappers - 04-tool-use.md — schema generation, dispatch loops, ML model tools - 05-data-layer.md — ClickHouse over parquet, NFS/ADLS - 06-agentic-swarms.md — multi-agent orchestration via Temporal - 07-cost-safety-audit.md — cost, safety, audit trail - key-decisions.md § D18 — the guardrails ADR
Supporting context (read if a section references them): - bespoke-telemetry.md — the data layers that feed agent tools - bespoke-allocator.md — the deterministic math engine agents must not replace - uwe-queue-temporal-blueprint.md — underwriting queue where agents also plug in
1. Core principle¶
LLMs earn their keep on free-text parsing and tool-call reasoning, NOT on hard mathematics. The deterministic allocator owns the math. Every LLM-proposed allocation is validated by the deterministic simulate endpoint — the LLM cannot approve something that violates floor or budget constraints without the violation being surfaced mechanically. This is ADR D18.
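A minimal sketch of the D18 gate, under illustrative assumptions (the real simulate endpoint lives in PyICE; the constraint shapes and function names here are stand-ins):

```python
from dataclasses import dataclass

@dataclass
class SimulateResult:
    feasible: bool
    violations: list[str]

def simulate(allocation: list[float], floors: list[float], budget: float) -> SimulateResult:
    """Stand-in for the deterministic simulate endpoint: checks floors and budget."""
    violations = [
        f"package {i}: {a:.2f} below floor {f:.2f}"
        for i, (a, f) in enumerate(zip(allocation, floors))
        if a < f
    ]
    if sum(allocation) > budget:
        violations.append(f"total {sum(allocation):.2f} exceeds budget {budget:.2f}")
    return SimulateResult(feasible=not violations, violations=violations)

def accept_llm_proposal(allocation: list[float], floors: list[float], budget: float) -> bool:
    """An LLM-proposed allocation proceeds only if the deterministic check passes."""
    result = simulate(allocation, floors, budget)
    if not result.feasible:
        # Violations are surfaced mechanically; the model cannot talk its way past them.
        print("rejected:", "; ".join(result.violations))
    return result.feasible
```

The point of the shape: the LLM's output is an *argument* to a deterministic function, never the final word.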
Input needed¶
Is this constraint the right one? Are there cases where we'd want the LLM to do numerical reasoning (e.g. "this partner's giveaway pattern is trending up 15% quarter-on-quarter") that go beyond parsing and tool-calling? Or is that always a ClickHouse query?
2. Four insertion points¶
Each "layer" is an independent LLM integration point. They ship separately, in order of risk.
→ Full detail: 01-insertion-points.md
| Layer | What it does | Where | Model | Tools | Risk |
|---|---|---|---|---|---|
| A — Intake parsing | Extract structured tags from free-text justifications | BFF endpoint | Haiku | None | Low |
| B — Advisory allocator | Propose a giveaway vector in the sandbox | BFF endpoint | Sonnet | get_proforma, get_partner_history | Medium-low |
| C — Decision reviewer | Review real bespoke requests; approve/decline/escalate | rev-sci Temporal activity | Sonnet | 7 tools (simulate, history, approve, decline, escalate...) | High |
| D — Outcome narrative | Summarise a completed review in plain English | BFF endpoint, cached | Haiku | None | Very low |
Rollout order¶
Phase 1: Layer A + D ← cheap, advisory-only, no tools, ships first
Phase 2: Layer B ← sandbox-only, exercises tool-calling safely
Phase 3a: Layer C shadow ← writes shadow_decision; human still decides
Phase 3b: Layer C auto ← narrow-band autonomous (low-stakes only)
Input needed¶
Are these the right four? Is there a fifth insertion point we're missing? (e.g. proactive partner outreach, portfolio-level risk scoring, quote-complexity triage before it reaches a human?)
Phase 1 priority. Layer A (intake tags) and Layer D (narrative) are bundled because they're both Haiku / no tools. Would you ship one before the other? Does the narrative add enough value on day one, or is intake parsing the only priority?
Layer C scope. The seven tools for the decision reviewer — are they the right set? Too many? Missing one?
3. LLM gateway — how we call models¶
→ Full detail: 02-gateway-and-providers.md
Two audiences, two gateways¶
The organisation has two distinct LLM access needs:
Employee chat (Azure AI Foundry). Staff need browser-based access to Claude for email wording, ad-hoc text analysis, general productivity. Foundry gives them Entra SSO, per-user audit, content filtering — zero-config onboarding gated on their existing company identity. The 15–25% markup is the price of that zero-friction access. This is an IT/procurement decision.
Application layer (LiteLLM → Anthropic direct). Our agentic workflows are service-to-service — API keys in a secrets vault, not user identities. What we need: per-execution cost caps, spend tracking by workflow/review ID, tool-call dispatch, multi-model routing. None of that is in Foundry's design surface. LiteLLM provides all of it at zero per-call markup.
These don't overlap. Deploy them alongside each other, not instead of each other.
Application layer: two-phase approach¶
Phase 1 (bootstrap): Direct Anthropic API. Thin httpx (Python) / net/http (Go) wrappers in each service. Env-gated — no ANTHROPIC_API_KEY, no LLM features, no errors.
Phase 2 (production): LiteLLM proxy. Single Docker container in compose. All services point at http://litellm:4000 instead of api.anthropic.com. Gives us:
- Unified API surface (OpenAI-compatible) across providers
- Per-service spend caps and rate limits
- Request/response logging to Postgres (full audit trail)
- Multi-provider routing (Anthropic, Ollama for local models)
- Zero per-call markup — we bring our own API keys
If compliance requires that application-layer calls also route through Azure, LiteLLM can target Foundry's endpoint as a backend — our architecture doesn't change.
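For illustration, a sketch that builds the OpenAI-compatible request; the only client-side difference between phases is the base URL. The `metadata` spend-tracking key is an assumption about LiteLLM config, not confirmed:

```python
import json

def chat_request(model: str, user_text: str, base_url: str = "http://litellm:4000"):
    """Build an OpenAI-compatible chat request for the LiteLLM proxy.
    Swapping base_url is the only change between bootstrap and production."""
    url = f"{base_url}/v1/chat/completions"
    body = {
        "model": model,  # LiteLLM routes by model name
        "messages": [{"role": "user", "content": user_text}],
        "metadata": {"workflow_id": "demo"},  # illustrative: key for spend tracking
    }
    return url, json.dumps(body)
```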
Input needed¶
LiteLLM from day one? It adds spend tracking and unified logging immediately. Is that worth the extra container, or do we bootstrap with direct API and add LiteLLM when we need multi-provider?
Foundry for employee chat — who owns it? IT procurement or engineering? The application layer doesn't depend on it, but the deployment timeline may affect how staff perceive "we have AI."
Compliance routing. Does compliance require that application-layer LLM calls also go through Azure infrastructure? If yes, LiteLLM routes to Foundry as a backend — architecture unchanged, but we need the Foundry deployment stood up first.
4. Tools — ML models and analytics as agent capabilities¶
→ Full detail: 04-tool-use.md
The core insight: existing service clients are already the right shape for LLM tools. The fraud scoring model, the PyICE allocator, the rev-sci partner history endpoint — each is a typed function with a request struct, a response struct, and an HTTP call. Wrapping them as tools means generating a JSON Schema from the existing type and writing a dispatch case.
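A sketch of schema generation from an existing request type. This uses a plain dataclass for illustration; services on Pydantic could emit richer schemas, and the request struct here is hypothetical:

```python
from dataclasses import dataclass, fields

TYPE_MAP = {str: "string", int: "integer", float: "number", bool: "boolean"}

@dataclass
class PartnerHistoryRequest:
    """Illustrative request struct for the get_partner_history tool."""
    partner_id: str
    lookback_days: int

def tool_schema(name: str, description: str, request_cls) -> dict:
    """Derive an LLM tool definition from an existing request dataclass."""
    props = {f.name: {"type": TYPE_MAP[f.type]} for f in fields(request_cls)}
    return {
        "name": name,
        "description": description,
        "input_schema": {
            "type": "object",
            "properties": props,
            "required": [f.name for f in fields(request_cls)],
        },
    }
```

The existing typed client is the single source of truth; the tool definition is derived, never hand-maintained.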
Tool catalogue (proposed)¶
| Tool | Backing service | What it does | Used by |
|---|---|---|---|
| score_fraud_risk | UW fraud-scoring ML model | Score fraud risk per quote line; returns risk bands + probability | Layer C, UW swarm |
| simulate_allocation | PyICE /vos/bespoke/simulate | Validate a proposed giveaway vector against floor + budget constraints | Layer B, Layer C |
| get_partner_history | rev-sci / ClickHouse | Partner's bespoke review history: connection rates by decision branch | Layer B, Layer C |
| explain_clamps | PyICE | Explain why specific packages were clamped in an allocation | Layer C |
| query_analytics | ClickHouse (read-only SQL) | Ad-hoc analytical queries over parquet data lake | Layer C, UW swarm |
| approve_review | rev-sci Temporal | Approve a bespoke review with a specific allocation | Layer C only |
| decline_review | rev-sci Temporal | Decline a bespoke review with rationale | Layer C only |
| escalate_to_human | rev-sci Temporal | Escalate to human reviewer | Layer C only |
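The dispatch side is then a name-to-handler table plus one case per tool. A sketch with stub handlers (real handlers call the backing services):

```python
# Illustrative dispatch table: tool name -> handler. These stubs just echo
# their inputs; real handlers wrap the existing typed service clients.
HANDLERS = {
    "simulate_allocation": lambda args: {"feasible": True, "clamped": []},
    "get_partner_history": lambda args: {"partner_id": args["partner_id"], "reviews": []},
    "escalate_to_human": lambda args: {"status": "escalated", "reason": args.get("reason")},
}

def dispatch(tool_name: str, args: dict) -> dict:
    """One dispatch case per tool. Unknown names become a tool-error result
    that goes back to the model, rather than crashing the loop."""
    handler = HANDLERS.get(tool_name)
    if handler is None:
        return {"error": f"unknown tool: {tool_name}"}
    return handler(args)
```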
Input needed¶
Tool granularity. Are approve_review / decline_review / escalate_to_human the right decision tools, or should there be a single submit_decision(action, ...) tool? Fewer tools = simpler prompt; more tools = tighter validation per action.
query_analytics guardrails. The ClickHouse tool lets agents write SQL. Mitigations: read-only user, allowed-table list, row limits, query timeout, DML regex block. Is that enough? Should we constrain further (e.g. pre-approved query templates only)?
Missing tools? What about get_quote_snapshot, get_similar_quotes, get_partner_tier? Are there things an analyst looks up during a review that aren't in this list?
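The proposed mitigations can be sketched as a pre-flight check layered on top of the READONLY user (defence in depth, not a replacement for it). Table names are the proposed lake tables; the regex set is illustrative, not exhaustive:

```python
import re

ALLOWED_TABLES = {"partner_metrics", "quote_outcomes", "commission_history"}  # proposed lake tables
DML = re.compile(r"\b(insert|update|delete|alter|drop|truncate|create|grant)\b", re.I)

def check_query(sql: str, row_limit: int = 1000) -> tuple[bool, str]:
    """Reject DML and off-list tables; force a row limit onto unbounded queries."""
    if DML.search(sql):
        return False, "DML keyword blocked"
    tables = set(re.findall(r"\bfrom\s+(\w+)", sql, re.I))
    if not tables <= ALLOWED_TABLES:
        return False, f"table not in allowlist: {tables - ALLOWED_TABLES}"
    if not re.search(r"\blimit\s+\d+\b", sql, re.I):
        sql = f"{sql.rstrip(';')} LIMIT {row_limit}"
    return True, sql
```

Regex screening alone is famously leaky, which is why the READONLY user and query timeout remain the hard guarantees.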
5. ClickHouse + parquet data layer¶
→ Full detail: 05-data-layer.md
Agents need analytical data — partner history, quote outcomes, commission patterns. ClickHouse queries parquet files from the data lake.
Data flow¶
MySQL (rev_sci, commercial_engine, V3/VOS)
│
▼
Dagster assets (scheduled / sensor-triggered)
│ extract → transform → partition → write parquet
▼
NFS mount (dev/on-prem) or ADLS Gen2 (production)
│
▼
ClickHouse reads parquet on query via File() / azureBlobStorage()
│
▼
query_analytics tool → agent
Dev: NFS mount at /data/lake/, bind-mounted into the ClickHouse container. Zero cost, zero latency.
Production: ADLS Gen2, Dagster-orchestrated. Partitioned assets with freshness sensors, lineage tracking, automatic backfills.
Input needed¶
Freshness SLA. How stale can the data be for agent queries? Nightly materialisation is simplest. Hourly is achievable. Near-real-time (CDC sensors) adds pipeline complexity. What does the review workflow actually need?
Dagster ownership. Who writes the Dagster asset definitions? Is this the data team's remit, or does the back-office team own the pipeline end-to-end?
What tables? partner_metrics, quote_outcomes, commission_history are proposed. What's missing? What does the commercial team actually query today that should be in the lake?
6. Agentic swarms — multi-agent orchestration¶
→ Full detail: 06-agentic-swarms.md
Single tool-use loops (one agent, one prompt, a few tool calls) work for Layers A–D. The broader vision — underwriting triage, cross-domain deal analysis, portfolio risk — needs specialist agents coordinated by an orchestrator.
Pattern: Temporal as swarm backbone¶
We already use Temporal for BespokeReviewWorkflow. Each specialist
agent becomes a Temporal activity. The workflow dispatches them,
handles timeouts, retries failures.
Bespoke review swarm
│
├── IntakeParser (Haiku, no tools) → advisory tags
│
├── MarginAnalyst (Sonnet, simulate + clamps) ─┐
├── PartnerAnalyst (Sonnet, history + analytics) ─┤ parallel
├── FraudScreener (Sonnet, fraud ML + analytics) ─┘
│
├── Synthesiser (Sonnet, reads specialist outputs)
│ → approve / counter / decline / escalate
│
├── [if escalate] wait for human signal
│
└── NarrativeWriter (Haiku, no tools) → summary
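The fan-out/synthesise shape above can be sketched with plain asyncio; in the real design each specialist is a Temporal activity (an LLM call with its own tools), and Temporal owns timeouts and retries:

```python
import asyncio

# Stand-ins for the specialist agents. In the real swarm each coroutine is a
# Temporal activity invoking a model with its own tool set.
async def margin_analyst(review: dict) -> dict:
    return {"specialist": "margin", "verdict": "ok"}

async def partner_analyst(review: dict) -> dict:
    return {"specialist": "partner", "verdict": "ok"}

async def fraud_screener(review: dict) -> dict:
    return {"specialist": "fraud", "verdict": "ok"}

async def bespoke_review_swarm(review: dict) -> str:
    # Fan the three specialists out in parallel, then synthesise.
    reports = await asyncio.gather(
        margin_analyst(review), partner_analyst(review), fraud_screener(review)
    )
    # Synthesiser policy (illustrative): escalate unless every specialist is happy.
    return "approve" if all(r["verdict"] == "ok" for r in reports) else "escalate"
```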
Second swarm: underwriting triage¶
Same pattern, different specialists:
UW triage swarm
│
├── DocumentParser (Haiku) → structured fields
├── FraudAnalyst (Sonnet, fraud ML + analytics)
├── CreditAnalyst (Sonnet, analytics)
├── ComplianceChecker (Haiku, sanctions + PEP)
├── Synthesiser (Sonnet) → approve / refer / decline
└── DecisionWriter (Haiku) → audit narrative
Input needed¶
Swarm vs single agent for Layer C. The insertion-points doc describes Layer C as a single agent with seven tools. The swarms doc proposes a multi-agent version with specialists. Which ships first? Single agent is simpler; swarm is more capable. Proposal: single agent for Phase 3a (shadow), swarm for Phase 3b (autonomous).
UW triage swarm — real or aspirational? Is the underwriting team ready for agentic review, or is this a 12-month horizon? Should we design the platform with UW in mind but ship bespoke first?
Per-execution cost cap. The swarm config proposes $0.50 per review execution. Is that the right order of magnitude? At 100 reviews/day that's $50/day for the swarm alone.
7. Cost envelope¶
→ Full detail: 07-cost-safety-audit.md
| Scenario | Daily volume | Daily cost | Notes |
|---|---|---|---|
| Phase 1 (A + D only) | 100 reviews | ~$0.80 | Haiku, no tools |
| Phase 2 (+ Layer B) | +10 sandbox uses | ~$1.20 | Sonnet, bursty |
| Phase 3a (+ shadow) | +100 shadow reviews | ~$9.20 | Sonnet, 8k tokens each |
| Full swarm | 100 reviews × 5 agents | ~$50 | Parallel specialists |
| Scale (1000/day) | 10× above | ~$500 | Prompt caching load-bearing |
Input needed¶
Budget authority. Who approves the spend? Is there a monthly cap? Does it need sign-off per phase?
Prompt caching. The cost estimates assume >80% cache hit rate on system prompts. This is realistic for repetitive tasks but unvalidated. Should we prototype Phase 1 first and measure actual cache hit rates before committing to Phase 3 cost projections?
8. Guardrails and safety¶
| Guardrail | How it works |
|---|---|
| Env-gated | No ANTHROPIC_API_KEY → no LLM features → no errors. Workflows degrade, never break |
| Deterministic validation | Every LLM-proposed allocation runs through the simulate endpoint. Floor/budget violations are caught mechanically |
| Shadow-first | Phase 3a: LLM writes what it would decide; human decides. No downstream effect |
| Narrow-band autonomous | Phase 3b: auto-approve only below £X, low variance, high confidence. Everything else escalates |
| Escalate-to-human always available | The default when confidence is low |
| Read-only ClickHouse | READONLY user, DML regex block, allowed tables, row limits, query timeout |
| Prompt injection defence | User text in user role only; no write-effect tools on untrusted-text agents |
| Audit trail | Full prompt + tool calls + response logged per LLM call, correlated by workflow/review ID |
| Per-execution cost cap | Swarm aborts if budget exhausted; synthesiser decides with available info |
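A sketch of the per-execution cost cap; the $0.50 figure comes from the proposed swarm config, everything else here is illustrative:

```python
class BudgetExhausted(Exception):
    pass

class ExecutionBudget:
    """Per-execution cost cap: each LLM call charges against it. When it is
    exhausted the swarm stops calling models, and the synthesiser decides
    with whatever specialist output it already has."""
    def __init__(self, cap_usd: float = 0.50):
        self.cap = cap_usd
        self.spent = 0.0

    def charge(self, usd: float) -> None:
        if self.spent + usd > self.cap:
            raise BudgetExhausted(f"cap ${self.cap:.2f} reached at ${self.spent:.2f}")
        self.spent += usd

budget = ExecutionBudget(cap_usd=0.50)
budget.charge(0.30)      # first specialist call fits
try:
    budget.charge(0.30)  # second would exceed the $0.50 cap
except BudgetExhausted:
    aborted = True       # swarm falls back to synthesising with available info
```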
Input needed¶
Concordance threshold. What % agreement between shadow and human decisions is "good enough" to graduate to autonomous? Who sets it — commercial ops, engineering, both?
Prompt governance. Who can edit prompt templates? Is there a review process? Should prompt changes go through the same PR process as code?
9. What we are NOT doing¶
- No Anthropic SDK dependency in Phase 8 (current phase) — design only
- No prompt text — describe shape, not prose
- No MCP orchestration — native tool-use API, single-loop agents
- No multi-turn conversations — each layer is one tool-use loop
- No model fine-tuning — prompt engineering only
- No self-hosted LLM training — inference access via wrappers
- No production deployment topology — compose is the dev shape
10. Summary of decisions needed¶
For quick reference — every "Input needed" section above, condensed:
| # | Decision | Who should weigh in |
|---|---|---|
| 1 | LLMs parse + reason only, or also numerical trend analysis? | Engineering, commercial |
| 2 | Are four insertion points the right set? Missing a fifth? | Commercial ops, engineering |
| 3 | Phase 1 priority: both A+D, or intake parsing (A) first? | Product |
| 4 | Layer C tool set — right size? Missing tools? | Bespoke reviewers |
| 5 | LiteLLM from day one or after bootstrap? | Engineering |
| 6 | Foundry for employee chat — who owns it? Timeline? | IT, management |
| 6b | Compliance routing — must app-layer calls also go through Azure? | Compliance, IT |
| 7 | query_analytics SQL guardrails — sufficient? | Security, engineering |
| 8 | ClickHouse freshness SLA (nightly / hourly / CDC)? | Commercial ops, data |
| 9 | Dagster pipeline ownership — back-office or data team? | Management |
| 10 | What tables belong in the analytical lake? | Commercial, data |
| 11 | Swarm vs single agent — which ships first? | Engineering |
| 12 | UW triage swarm — real timeline or aspirational? | UW team, product |
| 13 | Per-execution cost cap — $0.50 right? | Commercial, finance |
| 14 | Monthly spend budget and approval authority? | Finance, management |
| 15 | Concordance threshold for shadow → autonomous? | Commercial ops, engineering |
| 16 | Prompt governance process? | Engineering, compliance |