01 — LLM insertion points¶
Four independent LLM integration points across the bespoke review flow. Each ships separately, in ascending order of risk.
Layer A — Intake parsing¶
| Where | BFF — new endpoint POST /api/bespoke/intake-parse |
| What the LLM adds | Extracts structured signals from free-text bespoke justifications: deal urgency, competitive threat, partner relationship signals, hardware/buyout context (D17 — context amounts stay as context). Output is a set of advisory UI tags (COMPETITIVE_THREAT, REPEAT_PATTERN, HARDWARE_SUBSIDY, etc.) displayed alongside the request on the review page. |
| Tool surface | None. Single prompt, single structured response. |
| Model class | Haiku |
| Cost / latency | ~1.5k input tokens, ~200 output tokens. <1s wall-time. |
| Risk + mitigation | Low. Tags are additive UI hints shown to a human reviewer; no downstream action depends on them. Original justification text is always visible alongside tags. Worst case: a misleading tag — reviewer overrides by reading the source text. |
| Blockers | None. Can ship on any bespoke request that carries free-text justification. |
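The extract-then-allowlist shape of Layer A can be sketched as below. The tag names come from the table above; the JSON field name `tags`, the function name, and the parsing details are illustrative assumptions, not the implemented contract.

```python
import json

# Advisory tags the review UI understands. The first three names come from the
# layer description above; treat the exact set as configuration, not a spec.
KNOWN_TAGS = {"COMPETITIVE_THREAT", "REPEAT_PATTERN", "HARDWARE_SUBSIDY"}

def parse_intake_tags(llm_response_text: str) -> list[str]:
    """Parse the model's structured response into advisory UI tags.

    Unknown tags are dropped rather than displayed: tags are additive hints,
    so discarding anything outside the allowlist keeps the worst case at
    "a tag is missing", never "a fabricated tag is shown".
    """
    try:
        payload = json.loads(llm_response_text)
    except json.JSONDecodeError:
        return []  # degrade to no tags; the justification text is always visible anyway
    return [t for t in payload.get("tags", []) if t in KNOWN_TAGS]
```

The allowlist is what makes the "worst case: a misleading tag" claim hold — the model can mislabel, but it cannot invent a tag the UI has never seen.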
Layer B — Advisory allocator (sandbox-only)¶
| Where | BFF — new endpoint POST /api/bespoke/llm-allocate, wired behind a "get LLM opinion" button on the sandbox page (see bespoke-calculator-sandbox.md §Open questions #3) |
| What the LLM adds | Proposes a g_i giveaway vector for a given quote + bespoke ask, reasoning about partner history and deal shape. The proposed vector is validated by the deterministic simulate endpoint (POST /vos/bespoke/simulate) — floor violations and budget shortfalls are flagged mechanically, not by the LLM. Shown side-by-side with the margin-equalising recommendation. |
| Tool surface | get_proforma(quoteId) → ProformaResponse · get_partner_history(partnerId) → PartnerBespokeHistory |
| Model class | Sonnet |
| Cost / latency | ~4k input tokens (proforma + history + system prompt), ~500 output tokens. 2–4s wall-time. |
| Risk + mitigation | Medium-low. Output is a proposed allocation displayed in the sandbox alongside the deterministic recommendation. Never applied without human confirmation. The deterministic simulate endpoint validates every LLM-proposed vector — floor and budget violations are always surfaced. Sandbox-only; no production review path. |
| Blockers | Sandbox UI (Session 06). Simulate endpoint (Session 04). |
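The propose-then-validate contract — the LLM proposes a vector, deterministic code judges it — can be sketched as follows. In the real flow the judging is done by POST /vos/bespoke/simulate; the `LineCheck` fields and violation labels here are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class LineCheck:
    line_id: str
    proposed_giveaway: float  # g_i proposed by the LLM for this line
    floor_headroom: float     # max giveaway before this line breaches its floor

def validate_allocation(lines: list[LineCheck], budget: float) -> list[str]:
    """Mechanically flag floor and budget violations in an LLM-proposed vector.

    The point of the sketch: violations are detected deterministically by
    arithmetic the LLM cannot influence, never by asking the model to
    self-check its own allocation.
    """
    violations = [f"FLOOR:{line.line_id}"
                  for line in lines
                  if line.proposed_giveaway > line.floor_headroom]
    if sum(line.proposed_giveaway for line in lines) > budget:
        violations.append("BUDGET")
    return violations
```

Any non-empty result is surfaced next to the proposal in the sandbox, alongside the margin-equalising recommendation.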
Layer C — Decision reviewer (LLMReviewerStrategy)¶
| Where | rev-sci-vanguard — new file internal/temporal/bespoke/llm_strategy.go, implementing the existing ReviewerStrategy interface (Primer 05) |
| What the LLM adds | Reviews a real bespoke recommendation in the context of partner history, deal shape, and margin position. Can call tools to simulate alternative allocations, inspect history, and explain clamp decisions. Outputs a structured decision: approve (with allocation), counter-offer (with alternative vector), decline (with rationale), or escalate to human. |
| Tool surface | simulate_allocation · get_partner_history · explain_clamps · approve_review · offer_review · decline_review · escalate_to_human |
| Model class | Sonnet |
| Cost / latency | ~8k input tokens (recommendation + history + system prompt + tool results), ~1k output tokens. 5–10s wall-time including tool calls. |
| Risk + mitigation | High if autonomous. An incorrect auto-approve writes a commission_override (D2) and fires the VanguardClient callback — real money. Mitigated by phased rollout: Phase 3a is shadow-only (LLM writes shadow_decision column; human decides; no downstream effect). Phase 3b is narrow-band autonomous (auto-approve only below £X threshold with low partner history variance). escalate_to_human is always available and is the default when confidence is low. |
| Blockers | Telemetry Layer 2 (outcome stamp) + Layer 3 (partner rollups) from bespoke-telemetry.md must land first — get_partner_history is useless without outcome data. Rule-based strategy must be running in production with enough reviews to calibrate concordance. |
Layer D — Outcome narrative (post-hoc)¶
| Where | BFF — new endpoint GET /api/bespoke/reviews/{id}/narrative, cached against review_id + decision_hash |
| What the LLM adds | Generates a human-readable summary of a completed review: what was asked, what the allocator recommended, what the reviewer decided, how the decision compares to the partner's historical pattern. Displayed on the review detail page and in summary emails. |
| Tool surface | None. Single prompt, single text response. Input is the review record + allocation + decision metadata. |
| Model class | Haiku |
| Cost / latency | ~2.5k input tokens, ~300 output tokens. <1s wall-time. Response cached; subsequent reads are free. |
| Risk + mitigation | Very low. Read-only summary of an already-decided review; no downstream action. Cache keyed on review_id + decision_hash — narrative regenerates only if the decision changes. Original data always accessible alongside the narrative. |
| Blockers | Review detail page must exist. Layer 1 telemetry columns populated. |
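The review_id + decision_hash cache key can be sketched as below. The decision record's field names are illustrative; the real shape comes from the review tables, and the hashing scheme here is an assumption, not the implemented one.

```python
import hashlib
import json

def decision_hash(decision: dict) -> str:
    """Stable hash of the decision metadata that feeds the narrative.

    Canonical JSON (sorted keys, fixed separators) makes the hash independent
    of field ordering, so re-serialising the same decision never busts the cache.
    """
    canonical = json.dumps(decision, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def narrative_cache_key(review_id: str, decision: dict) -> str:
    # Same review + same decision -> same key -> cache hit, no model call.
    # The narrative regenerates only when the decision actually changes.
    return f"narrative:{review_id}:{decision_hash(decision)}"
```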
Rollout path¶
Ordered by risk (ascending) and value-today (descending).
| Phase | Layer | What ships | Prerequisites | Risk |
|---|---|---|---|---|
| 1 | A + D | Intake tags + outcome narrative | Review page exists; Layer 1 columns populated | Low — advisory-only, no tools |
| 2 | B | "Get LLM opinion" button in sandbox | Sandbox UI (Session 06); simulate endpoint | Medium-low — sandbox-only, deterministic validation |
| 3a | C shadow | LLMReviewerStrategy writes shadow_decision; human still decides | Layer 2/3 telemetry; enough reviews to calibrate | None — no downstream effect |
| 3b | C narrow-band | Auto-approve under £X with low partner history variance | Observed concordance ≥ threshold between shadow and human decisions | High — mitigated by narrow band + escalate-to-human default |
Phase 1 ships first because it is cheap (Haiku-class, no tools), immediately useful (reviewers see structured tags and readable summaries), and builds organisational muscle with LLM outputs before any tool-calling or autonomous decision-making is introduced.
Phase 2 exercises the tool-calling pattern in the safest possible environment — the sandbox. No real money, no real reviews.
Phase 3a is the critical gate. Shadow mode means the LLM sees every real review and writes what it would have decided, but the human reviewer's decision is what actually fires. Concordance is the only metric that matters for graduating to 3b.
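A deliberately naive sketch of the concordance metric, assuming shadow and human decisions reduce to comparable labels. The real definition — exact match vs. match within allocation tolerance, and over which subset of reviews — is deliberately left to the Open questions section.

```python
def concordance(shadow_decisions: list[str], human_decisions: list[str]) -> float:
    """Fraction of reviews where the shadow LLM decision matched the human one.

    Inputs are paired per review (e.g. "approve", "decline", "escalate").
    Returns 0.0 for an empty sample rather than dividing by zero — an empty
    sample should never graduate anything to Phase 3b.
    """
    assert len(shadow_decisions) == len(human_decisions)
    if not shadow_decisions:
        return 0.0
    agree = sum(s == h for s, h in zip(shadow_decisions, human_decisions))
    return agree / len(shadow_decisions)
```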
Phase 3b is narrow-band autonomous: auto-approve only when the giveaway is below a configurable threshold, the partner's historical connection-rate variance is low, and the LLM's confidence is above a calibrated floor. Everything else escalates to human.
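The narrow-band gate reduces to a three-way conjunction, sketched below. The parameter names are assumed stand-ins for the configurable knobs described above (the £X ceiling, the variance bound, the calibrated confidence floor); a `False` result means escalate to human, never decline.

```python
def may_auto_approve(giveaway: float,
                     history_variance: float,
                     confidence: float,
                     *,
                     max_giveaway: float,
                     max_variance: float,
                     min_confidence: float) -> bool:
    """Phase 3b gate: auto-approve only when ALL three conditions hold.

    giveaway         — total value of the proposed allocation (under £X)
    history_variance — partner's historical connection-rate variance (low)
    confidence       — LLM's self-reported confidence (above calibrated floor)
    """
    return (giveaway < max_giveaway
            and history_variance < max_variance
            and confidence >= min_confidence)
```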
File-level landing map¶
Proposed file locations. None implemented in Phase 8.
- rev-sci-vanguard/
  - internal/clients/anthropic/client.go — see 03-client-wrappers.md
  - internal/temporal/bespoke/llm_strategy.go — LLMReviewerStrategy (Layer C)
  - internal/prompts/*.md — prompt templates (Markdown, not Go)
- vos-backoffice/
  - bff/app/clients/anthropic_client.py — see 03-client-wrappers.md
  - bff/app/prompts/*.md — prompt templates (Markdown, not Python)
Each client is env-gated on ANTHROPIC_API_KEY. If the key is unset,
the client returns a no-op stub. Prompt templates are Markdown files
with {{placeholder}} interpolation, colocated with the service.
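The env-gate and template conventions might look like this on the BFF side (Python shown; the Go side mirrors it). `NoopAnthropicClient`, `make_client`, and `render_prompt` are assumed names for illustration; the real client wrapper is specified in 03-client-wrappers.md.

```python
import os
import re

class NoopAnthropicClient:
    """Stub returned when ANTHROPIC_API_KEY is unset: every call is a no-op."""
    def complete(self, prompt: str) -> str:
        return ""

def render_prompt(template: str, values: dict[str, str]) -> str:
    """Fill {{placeholder}} slots in a Markdown prompt template."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: values[m.group(1)], template)

def make_client(real_factory=None):
    """Env gate: construct the real client only when the key is present and a
    factory for it is supplied; otherwise fall back to the no-op stub."""
    key = os.environ.get("ANTHROPIC_API_KEY")
    if key and real_factory is not None:
        return real_factory(key)
    return NoopAnthropicClient()
```

The stub means every call site can be written against one interface, with no `if llm_enabled` branching scattered through the layers.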
The infrastructure layer (LiteLLM gateway, client wrappers, tool dispatch loops, ClickHouse analytics) is designed in the sibling docs in this directory.
What this spike explicitly does NOT do¶
- No Anthropic SDK dependency added in Phase 8.
- No prompt text drafted — describe shape, not prose.
- No MCP scaffolding — native tool-use API, single-loop layers.
- No multi-turn agent orchestration — each layer is one tool-use loop.
- No autonomous decisions — rollout stops at shadow mode (Phase 3a) until observed concordance graduates to narrow-band (Phase 3b).
Open questions¶
- Concordance threshold for 3a → 3b. What % agreement is sufficient? On what subset? Who sets it?
- Prompt governance. Who owns prompt templates? Engineering-only or commercial ops too?
- LLM trace capture. Where does the audit trail live?
- Partner history shape for get_partner_history. What subset of the Layer 3 rollup does the LLM need?
- Haiku vs Sonnet boundary. At what error rate do Layers A/D promote to Sonnet?
- Fallback when the API is down. Layers A/D degrade; Layer C must escalate to human.
- Sandbox → production promotion. Can a Layer B opinion become a real review, or must the flow always start fresh?