01 — LLM insertion points¶
Four independent LLM integration points across the bespoke review flow. Each ships separately, in ascending order of risk.
Layer A — Intake parsing¶
| Where | BFF — new endpoint POST /api/bespoke/intake-parse |
| What the LLM adds | Extracts structured signals from free-text bespoke justifications: deal urgency, competitive threat, partner relationship signals, hardware/buyout context (D17 — context amounts stay as context). Output is a set of advisory UI tags (COMPETITIVE_THREAT, REPEAT_PATTERN, HARDWARE_SUBSIDY, etc.) displayed alongside the request on the review page. |
| Tool surface | None. Single prompt, single structured response. |
| Model class | Haiku |
| Cost / latency | ~1.5k input tokens, ~200 output tokens. <1s wall-time. |
| Risk + mitigation | Low. Tags are additive UI hints shown to a human reviewer; no downstream action depends on them. Original justification text is always visible alongside tags. Worst case: a misleading tag — reviewer overrides by reading the source text. |
| Blockers | None. Can ship on any bespoke request that carries free-text justification. |
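The extract-then-allowlist shape of Layer A can be sketched as below. The tag names come from the table above; the JSON field name `tags`, the function name, and the parsing details are illustrative assumptions, not the implemented contract.

```python
import json

# Advisory tags the review UI understands. The first three names come from the
# layer description above; treat the exact set as configuration, not a spec.
KNOWN_TAGS = {"COMPETITIVE_THREAT", "REPEAT_PATTERN", "HARDWARE_SUBSIDY"}

def parse_intake_tags(llm_response_text: str) -> list[str]:
    """Parse the model's structured response into advisory UI tags.

    Unknown tags are dropped rather than displayed: tags are additive hints,
    so discarding anything outside the allowlist keeps the worst case at
    "a tag is missing", never "a fabricated tag is shown".
    """
    try:
        payload = json.loads(llm_response_text)
    except json.JSONDecodeError:
        return []  # degrade to no tags; the justification text is always visible anyway
    return [t for t in payload.get("tags", []) if t in KNOWN_TAGS]
```

The allowlist is what makes the "worst case: a misleading tag" claim hold — the model can mislabel, but it cannot invent a tag the UI has never seen.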
Layer B — Advisory allocator (sandbox-only)¶
| Where | BFF — new endpoint POST /api/bespoke/llm-allocate, wired behind a "get LLM opinion" button on the sandbox page (see bespoke-calculator-sandbox.md §Open questions #3) |
| What the LLM adds | Proposes a g_i giveaway vector for a given quote + bespoke ask, reasoning about partner history and deal shape. The proposed vector is validated by the deterministic simulate endpoint (POST /vos/bespoke/simulate) — floor violations and budget shortfalls are flagged mechanically, not by the LLM. Shown side-by-side with the margin-equalising recommendation. |
| Tool surface | get_proforma(quoteId) → ProformaResponse · get_partner_history(partnerId) → PartnerBespokeHistory |
| Model class | Sonnet |
| Cost / latency | ~4k input tokens (proforma + history + system prompt), ~500 output tokens. 2–4s wall-time. |
| Risk + mitigation | Medium-low. Output is a proposed allocation displayed in the sandbox alongside the deterministic recommendation. Never applied without human confirmation. The deterministic simulate endpoint validates every LLM-proposed vector — floor and budget violations are always surfaced. Sandbox-only; no production review path. |
| Blockers | Sandbox UI (Session 06). Simulate endpoint (Session 04). |
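The propose-then-validate contract — the LLM proposes a vector, deterministic code judges it — can be sketched as follows. In the real flow the judging is done by POST /vos/bespoke/simulate; the `LineCheck` fields and violation labels here are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class LineCheck:
    line_id: str
    proposed_giveaway: float  # g_i proposed by the LLM for this line
    floor_headroom: float     # max giveaway before this line breaches its floor

def validate_allocation(lines: list[LineCheck], budget: float) -> list[str]:
    """Mechanically flag floor and budget violations in an LLM-proposed vector.

    The point of the sketch: violations are detected deterministically by
    arithmetic the LLM cannot influence, never by asking the model to
    self-check its own allocation.
    """
    violations = [f"FLOOR:{line.line_id}"
                  for line in lines
                  if line.proposed_giveaway > line.floor_headroom]
    if sum(line.proposed_giveaway for line in lines) > budget:
        violations.append("BUDGET")
    return violations
```

Any non-empty result is surfaced next to the proposal in the sandbox, alongside the margin-equalising recommendation.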
Layer C — Decision reviewer (LLMReviewerStrategy)¶
| Where | rev-sci-vanguard — new file internal/temporal/bespoke/llm_strategy.go, implementing the existing ReviewerStrategy interface (Primer 05) |
| What the LLM adds | Reviews a real bespoke recommendation in the context of partner history, deal shape, and margin position. Can call tools to simulate alternative allocations, inspect history, and explain clamp decisions. Outputs a structured decision: approve (with allocation), counter-offer (with alternative vector), decline (with rationale), or escalate to human. |
| Tool surface | simulate_allocation · get_partner_history · explain_clamps · approve_review · offer_review · decline_review · escalate_to_human |
| Model class | Sonnet |
| Cost / latency | ~8k input tokens (recommendation + history + system prompt + tool results), ~1k output tokens. 5–10s wall-time including tool calls. |
| Risk + mitigation | High if autonomous. An incorrect auto-approve writes a commission_override (D2) and fires the VanguardClient callback — real money. Mitigated by phased rollout: Phase 3a is shadow-only (LLM writes shadow_decision column; human decides; no downstream effect). Phase 3b is narrow-band autonomous (auto-approve only below £X threshold with low partner history variance). escalate_to_human is always available and is the default when confidence is low. |
| Blockers | Telemetry Layer 2 (outcome stamp) + Layer 3 (partner rollups) from bespoke-telemetry.md must land first — get_partner_history is useless without outcome data. Rule-based strategy must be running in production with enough reviews to calibrate concordance. |
Layer D — Outcome narrative (post-hoc)¶
| Where | BFF — new endpoint GET /api/bespoke/reviews/{id}/narrative, cached against review_id + decision_hash |
| What the LLM adds | Generates a human-readable summary of a completed review: what was asked, what the allocator recommended, what the reviewer decided, how the decision compares to the partner's historical pattern. Displayed on the review detail page and in summary emails. |
| Tool surface | None. Single prompt, single text response. Input is the review record + allocation + decision metadata. |
| Model class | Haiku |
| Cost / latency | ~2.5k input tokens, ~300 output tokens. <1s wall-time. Response cached; subsequent reads are free. |
| Risk + mitigation | Very low. Read-only summary of an already-decided review; no downstream action. Cache keyed on review_id + decision_hash — narrative regenerates only if the decision changes. Original data always accessible alongside the narrative. |
| Blockers | Review detail page must exist. Layer 1 telemetry columns populated. |
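The review_id + decision_hash cache key can be sketched as below. The decision record's field names are illustrative; the real shape comes from the review tables, and the hashing scheme here is an assumption, not the implemented one.

```python
import hashlib
import json

def decision_hash(decision: dict) -> str:
    """Stable hash of the decision metadata that feeds the narrative.

    Canonical JSON (sorted keys, fixed separators) makes the hash independent
    of field ordering, so re-serialising the same decision never busts the cache.
    """
    canonical = json.dumps(decision, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def narrative_cache_key(review_id: str, decision: dict) -> str:
    # Same review + same decision -> same key -> cache hit, no model call.
    # The narrative regenerates only when the decision actually changes.
    return f"narrative:{review_id}:{decision_hash(decision)}"
```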
Rollout path¶
Ordered by risk (ascending) and value-today (descending).
| Phase | Layer | What ships | Prerequisites | Risk |
|---|---|---|---|---|
| 1 | A + D | Intake tags + outcome narrative | Review page exists; Layer 1 columns populated | Low — advisory-only, no tools |
| 2 | B | "Get LLM opinion" button in sandbox | Sandbox UI (Session 06); simulate endpoint | Medium-low — sandbox-only, deterministic validation |
| 3a | C shadow | LLMReviewerStrategy writes shadow_decision; human still decides | Layer 2/3 telemetry; enough reviews to calibrate | None — no downstream effect |
| 3b | C narrow-band | Auto-approve under £X with low partner history variance | Observed concordance ≥ threshold between shadow and human decisions | High — mitigated by narrow band + escalate-to-human default |
Phase 1 ships first because it is cheap (Haiku-class, no tools), immediately useful (reviewers see structured tags and readable summaries), and builds organisational muscle with LLM outputs before any tool-calling or autonomous decision-making is introduced.
Phase 2 exercises the tool-calling pattern in the safest possible environment — the sandbox. No real money, no real reviews.
Phase 3a is the critical gate. Shadow mode means the LLM sees every real review and writes what it would have decided, but the human reviewer's decision is what actually fires. Concordance is the only metric that matters for graduating to 3b.
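A deliberately naive sketch of the concordance metric, assuming shadow and human decisions reduce to comparable labels. The real definition — exact match vs. match within allocation tolerance, and over which subset of reviews — is deliberately left to the Open questions section.

```python
def concordance(shadow_decisions: list[str], human_decisions: list[str]) -> float:
    """Fraction of reviews where the shadow LLM decision matched the human one.

    Inputs are paired per review (e.g. "approve", "decline", "escalate").
    Returns 0.0 for an empty sample rather than dividing by zero — an empty
    sample should never graduate anything to Phase 3b.
    """
    assert len(shadow_decisions) == len(human_decisions)
    if not shadow_decisions:
        return 0.0
    agree = sum(s == h for s, h in zip(shadow_decisions, human_decisions))
    return agree / len(shadow_decisions)
```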
Phase 3b is narrow-band autonomous: auto-approve only when the giveaway is below a configurable threshold, the partner's historical connection-rate variance is low, and the LLM's confidence is above a calibrated floor. Everything else escalates to human.
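The narrow-band gate reduces to a three-way conjunction, sketched below. The parameter names are assumed stand-ins for the configurable knobs described above (the £X ceiling, the variance bound, the calibrated confidence floor); a `False` result means escalate to human, never decline.

```python
def may_auto_approve(giveaway: float,
                     history_variance: float,
                     confidence: float,
                     *,
                     max_giveaway: float,
                     max_variance: float,
                     min_confidence: float) -> bool:
    """Phase 3b gate: auto-approve only when ALL three conditions hold.

    giveaway         — total value of the proposed allocation (under £X)
    history_variance — partner's historical connection-rate variance (low)
    confidence       — LLM's self-reported confidence (above calibrated floor)
    """
    return (giveaway < max_giveaway
            and history_variance < max_variance
            and confidence >= min_confidence)
```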
File-level landing map¶
Proposed file locations. None implemented in Phase 8.
- rev-sci-vanguard/
  - internal/clients/anthropic/client.go — see 03-client-wrappers.md
  - internal/temporal/bespoke/llm_strategy.go — LLMReviewerStrategy (Layer C)
  - internal/prompts/*.md — prompt templates (Markdown, not Go)
- vos-backoffice/
  - bff/app/clients/anthropic_client.py — see 03-client-wrappers.md
  - bff/app/prompts/*.md — prompt templates (Markdown, not Python)
Each client is env-gated on ANTHROPIC_API_KEY. If the key is unset,
the client returns a no-op stub. Prompt templates are Markdown files
with {{placeholder}} interpolation, colocated with the service.
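The env-gate and template conventions might look like this on the BFF side (Python shown; the Go side mirrors it). `NoopAnthropicClient`, `make_client`, and `render_prompt` are assumed names for illustration; the real client wrapper is specified in 03-client-wrappers.md.

```python
import os
import re

class NoopAnthropicClient:
    """Stub returned when ANTHROPIC_API_KEY is unset: every call is a no-op."""
    def complete(self, prompt: str) -> str:
        return ""

def render_prompt(template: str, values: dict[str, str]) -> str:
    """Fill {{placeholder}} slots in a Markdown prompt template."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: values[m.group(1)], template)

def make_client(real_factory=None):
    """Env gate: construct the real client only when the key is present and a
    factory for it is supplied; otherwise fall back to the no-op stub."""
    key = os.environ.get("ANTHROPIC_API_KEY")
    if key and real_factory is not None:
        return real_factory(key)
    return NoopAnthropicClient()
```

The stub means every call site can be written against one interface, with no `if llm_enabled` branching scattered through the layers.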
The infrastructure layer (LiteLLM gateway, client wrappers, tool dispatch loops, ClickHouse analytics) is designed in the sibling docs in this directory.
What this spike explicitly does NOT do¶
- No Anthropic SDK dependency added in Phase 8.
- No prompt text drafted — describe shape, not prose.
- No MCP scaffolding — native tool-use API, single-loop layers.
- No multi-turn agent orchestration — each layer is one tool-use loop.
- No autonomous decisions — rollout stops at shadow mode (Phase 3a) until observed concordance graduates to narrow-band (Phase 3b).
Open questions¶
- Concordance threshold for 3a → 3b. What % agreement is sufficient? On what subset? Who sets it?
- Prompt governance. Who owns prompt templates? Engineering-only or commercial ops too?
- LLM trace capture. Where does the audit trail live?
- Partner history shape for get_partner_history. What subset of the Layer 3 rollup does the LLM need?
- Haiku vs Sonnet boundary. At what error rate do Layers A/D promote to Sonnet?
- Fallback when the API is down. Layers A/D degrade; Layer C must escalate to human.
- Sandbox → production promotion. Can a Layer B opinion become a real review, or must the flow always start fresh?