EBO Factory — proposed skill · context for your decisions

🟢 reuses an existing skill as-is 🟡 thin new adapter glue 🔴 genuinely new machinery ⚠️ a decision with big blast radius

1 · What it is, in one breath

A single global skill that takes scrambled input — a raw bug-report transcript, or your rambling about a feature — and runs it all the way to a merged, QA-proven build on a safe branch, with you signing off only twice: the EBO at the start and the result at the end. It turns the ad-hoc "4 parallel chats → manual merge → one QA agent" scramble from 2026-06-22 into a disciplined, reproducible pipeline.

Why it exists / how it differs from fleet-merge-qa. Your existing fleet-merge-qa design (agent-A) starts from N already-scoped fixes and its job is merge + QA. EBO Factory starts one step earlier and ends one step later: it derives the spec from raw input (EBO-first, Notion-gated), gives every builder its own paired QA before merge, and adds an orchestrator-run final QA with a bug→builder→fresh-QA fix loop. You wanted these two built so you can A/B them against each other — same safety spine, different ambition.

2 · The pipeline at a glance (4 macro-stages)

First step, always — for a bug or a brand-new feature: understand intent + the codebase + project state, then map every scenario on the touched UI. That map is the EBO.

① Understand & Specify

scrambled in→EBO→Notion ✅

Parse the transcript (incl. queued_command follow-ups) or grill the feature ramble → author the Expected-Behavior Oracle via /user-journeys → you sign it in Notion. Blocking gate.

② Decompose & Build

group by module→N builders

Orchestrator groups related EBO rows, picks Sonnet-or-Opus per task by size/risk, spawns a builder per group — one git worktree + branch each, TDD red→green.

③ Pair-QA each builder

builder+Sonnet QA

Every builder has a Sonnet QA twin running a visual-qa-ultra profile against its EBO slice — click-by-click, pixel + state. Reports compactly to the orchestrator.

④ Merge & Prove

merge→final QA→fix loop

Orchestrator merges all green branches, then runs its own final QA over the whole build; any bug routes back to the responsible builder → fresh QA. Ends at PENDING_REVIEW.

The same flow as a phase ledger (each phase checkpoints to an on-disk file-bus, so the orchestrator's own context stays disposable and crash-resumable):

#	Phase	What happens	Output	Human?
0	Intake	Adapter parses the bug transcript (`type:"user"` AND `queued_command` attachments — the §7 miss) or grills the feature ramble → normalised `intent_items[]`	`raw_asks.json`	—
1	EBO 🟢	Hand `intent_items[]` to `/user-journeys` unchanged → it authors the click-by-click oracle (You-do · see · element · state-delta · must-NOT) + asks every open question	`EBO.md` + `spec.json`	answers Qs
2	Confirm ⚠️	Mirror EBO to the existing Notion EBO database; build is blocked until Status = 🟢 Signed off	`ebo.signed`	signs in Notion
3	Decompose	Group EBO rows → builder tasks (same feature / file / flow); score each task → assign Sonnet or Opus; slice the EBO per task	`tasks/<id>.json`	heartbeat
4	Fan-out 🔴	Spawn N builders (worktree + branch + TDD) each paired with a Sonnet QA twin on its EBO slice	branch + `return.json` + `qa.json`	—
5	Converge 🟢	`git bundle` backup → SHA-pin main → sequential `--no-ff` onto a fresh integration branch → resolve conflicts	integration branch	—
6	Final QA 🔴	Deploy the merged build (badge-pinned) → orchestrator runs its own `visual-qa-ultra` over the whole EBO (cross-feature bugs first appear here)	`qa/final-report/`	—
7	Fix loop 🔴	Each bug → responsible builder → TDD fix → re-merge → fresh Sonnet QA re-verifies (round-capped)	`fix-log.json`	—
8	Review	Open the report on localhost; verdict table; stays `PENDING_REVIEW` until you accept. Never pushes main.	End-of-Exec report	accepts
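
A minimal sketch of how the file-bus could make the orchestrator crash-resumable: on restart it looks for the first phase whose output marker is missing. File names come from the Output column; the flat run-directory layout, the `PHASE_MARKERS` list and `next_phase()` are assumptions for illustration. Phases 3 to 5 are left out because their outputs are per-task files or branches.

```python
import tempfile
from pathlib import Path

# (phase, marker) pairs; markers follow the Output column of the ledger.
# The flat directory layout is an assumption, not the skill's real layout.
PHASE_MARKERS = [
    ("intake", "raw_asks.json"),
    ("ebo", "spec.json"),
    ("confirm", "ebo.signed"),
    ("final_qa", "qa/final-report"),
    ("fix_loop", "fix-log.json"),
]


def next_phase(run_dir: Path) -> str:
    """Return the first phase whose marker is missing, or 'review' if none is."""
    for phase, marker in PHASE_MARKERS:
        if not (run_dir / marker).exists():
            return phase
    return "review"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        run = Path(tmp)
        (run / "raw_asks.json").write_text("[]")
        (run / "spec.json").write_text("{}")
        # EBO exists but is not signed yet, so the run resumes at the gate.
        print(next_phase(run))
```

Because the resume point is derived from disk alone, a fresh orchestrator instance needs no memory of the previous one.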

3 · The cast: who does what, on which model (6 roles)

Role	Model	What it does · why that model
Orchestrator	Opus	Holds run-state, sequences phases, groups tasks, resolves merges, runs the final QA + fix loop. Judgment-heavy, multi-system, expensive-if-wrong.
Intake adapter	script + Sonnet	Deterministic transcript parser (queued-command-aware) + a Sonnet grilling pass for vague feature asks. No raw JSONL ever re-enters the orchestrator.
EBO author	`/user-journeys`	Reused whole — authors the human-signed oracle + Notion mirror. The single source of truth for "correct".
Builder (1 / task)	Sonnet or Opus	Writes the real product code, TDD, on its own branch. The orchestrator picks the model per task from a scored size/risk rule (your Q3).
Builder's QA twin	always Sonnet	Drives the live app against the builder's EBO slice, click-by-click. Can turn a row red freely; turns it green only when every script gate passes — anything ambiguous it escalates to the orchestrator.
Final QA + fixers	Opus + fresh Sonnet	The orchestrator's own whole-build QA (Opus adjudicates disputes); fixes route to the owning builder, re-verified by a fresh Sonnet twin (no prior-pass bias).
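
The deterministic half of the intake adapter might look like the sketch below. It collects both plain `type:"user"` turns and `queued_command` attachments, so queued follow-ups are not dropped. The JSONL field layout (`text`, `attachments`) and the function name `extract_intent_items()` are assumptions; the real transcript schema may differ.

```python
import json


def extract_intent_items(jsonl_text: str) -> list[dict]:
    """Collect user asks from a transcript, including queued follow-ups."""
    items = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # Plain user turn. Assumed shape: {"type": "user", "text": "..."}
        if entry.get("type") == "user" and entry.get("text"):
            items.append({"source": "user", "text": entry["text"]})
        # Follow-up queued mid-run. Assumed shape:
        # {"attachments": [{"type": "queued_command", "text": "..."}]}
        for att in entry.get("attachments", []):
            if att.get("type") == "queued_command" and att.get("text"):
                items.append({"source": "queued_command", "text": att["text"]})
    return items


if __name__ == "__main__":
    sample = "\n".join([
        json.dumps({"type": "user", "text": "card vanished after booking"}),
        json.dumps({"type": "system", "attachments": [
            {"type": "queued_command", "text": "also the popup on refresh"}]}),
    ])
    for item in extract_intent_items(sample):
        print(item["source"], "->", item["text"])
```

Only the compact item list is handed on, which is what keeps raw JSONL out of the orchestrator's context.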

4 · The two heuristics the orchestrator runs (scriptable)

Model pick (per builder task)
touches live-send / payment / proposal / gcal path	→ Opus
novel architecture (no prior pattern in graph)	→ Opus
≥ 4 files, or ≥ ~120 LOC, or ≥ 5 EBO rows	→ Opus
otherwise (isolated UI / dash-module tweak)	→ Sonnet
any QA twin	→ Sonnet
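
The model-pick rules above can be scripted roughly as follows. `BuilderTask` and its field names are hypothetical, and the approximate "~120 LOC" threshold is treated as a hard 120 for the sake of the sketch.

```python
from dataclasses import dataclass


@dataclass
class BuilderTask:
    files_touched: int
    loc_estimate: int
    ebo_rows: int
    touches_sensitive_path: bool = False  # live-send / payment / proposal / gcal
    novel_architecture: bool = False      # no prior pattern in the graph


def pick_builder_model(task: BuilderTask) -> str:
    """Apply the rules top to bottom; the first match wins."""
    if task.touches_sensitive_path or task.novel_architecture:
        return "opus"
    if task.files_touched >= 4 or task.loc_estimate >= 120 or task.ebo_rows >= 5:
        return "opus"
    return "sonnet"  # isolated UI / dash-module tweak


def pick_qa_model() -> str:
    """Every QA twin runs on Sonnet, regardless of the builder's model."""
    return "sonnet"


if __name__ == "__main__":
    print(pick_builder_model(BuilderTask(files_touched=1, loc_estimate=30, ebo_rows=2)))
    print(pick_builder_model(BuilderTask(1, 30, 2, touches_sensitive_path=True)))
```

The inputs are task facts rather than the model's own confidence, so the routing stays auditable.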

Task grouping (EBO rows → builder)
same feature tag	→ one builder
same target file / `dash_*` module (via graphify)	→ one builder
same `flows.py` flow (booking/proposal/payment)	→ one builder
both touch a god-module (`flows.py`/`dash.py`)	→ serialise ⚠️
cap	≤5 rows · ≤3 concurrent
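
One way to script the grouping rules above is a small union-find: rows that share a feature tag, target file, or flow land in one cluster, clusters are split at the 5-row cap, and any task touching a god-module is flagged for serialisation. The row keys (`id`, `feature`, `file`, `flow`) and the example file names are hypothetical; the cap of 3 concurrent builders is left to whatever schedules the tasks.

```python
from collections import defaultdict

GOD_MODULES = {"flows.py", "dash.py"}
MAX_ROWS_PER_BUILDER = 5


def group_rows(rows: list[dict]) -> list[dict]:
    """Cluster EBO rows into builder tasks by shared feature, file, or flow."""
    parent = list(range(len(rows)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen: dict[tuple, int] = {}
    for i, row in enumerate(rows):
        for key in ("feature", "file", "flow"):
            value = row.get(key)
            if value is None:
                continue
            j = first_seen.setdefault((key, value), i)
            parent[find(i)] = find(j)  # merge with the first row sharing this value

    clusters = defaultdict(list)
    for i, row in enumerate(rows):
        clusters[find(i)].append(row)

    tasks = []
    for cluster in clusters.values():
        for start in range(0, len(cluster), MAX_ROWS_PER_BUILDER):
            chunk = cluster[start:start + MAX_ROWS_PER_BUILDER]
            tasks.append({
                "rows": [r["id"] for r in chunk],
                "serialise": any(r.get("file") in GOD_MODULES for r in chunk),
            })
    return tasks


if __name__ == "__main__":
    rows = [
        {"id": "r1", "feature": "booking", "file": "dash_cards.py", "flow": None},
        {"id": "r2", "feature": "booking", "file": "flows.py", "flow": "booking"},
        {"id": "r3", "feature": "billing", "file": "dash_pay.py", "flow": "payment"},
    ]
    for task in group_rows(rows):
        print(task)
```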

5 · The canonical worked example — your booked-card bug

The skill's demo fixture is the real chat you just ran: "after a lead booked a call, its card disappeared from the pipeline instead of moving to Booked… refreshing, I got the popup." That scrambled report compiles to one EBO row:

You do	You should see	Element that changes	What changes underneath	Must NOT happen
A lead who earlier sent a negative reply self-books a call	The card moves to Booked / Sales Call Prep — live, no refresh	The kanban card (`deal_col()` routing)	`stage → appt_booked` AND `neg_reply` cleared so column-ranking no longer pins it	Card must NOT stay in 🚫 Negative Replies; must NOT vanish; must NOT need a refresh

A real constraint this example exposes: there is no state-injection endpoint to set neg_reply on a sentinel lead, so this row can't be purely live-clicked — it's verified at the pytest layer. That's why Q8 asks how to route rows that aren't live-QA-able (the honest alternative to a false green).
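
A pytest-layer check for this row could take the shape below. Both functions are toy stand-ins written for this sketch, not the project's real `deal_col()` or booking handler; the `on_call_booked()` name, the `replied` stage and the lead dict shape are assumptions. The point is that the test asserts the state-delta and the must-NOT, not just that a value exists.

```python
def on_call_booked(lead: dict) -> dict:
    """Toy booking handler: set the stage and clear the negative-reply flag."""
    return {**lead, "stage": "appt_booked", "neg_reply": False}


def deal_col(lead: dict) -> str:
    """Toy column routing in which a negative reply outranks the stage."""
    if lead.get("neg_reply"):
        return "Negative Replies"
    if lead.get("stage") == "appt_booked":
        return "Booked"
    return "Pipeline"


def test_negative_reply_lead_moves_to_booked():
    lead = {"name": "ZZ sentinel", "stage": "replied", "neg_reply": True}
    assert deal_col(lead) == "Negative Replies"  # precondition

    booked = on_call_booked(lead)

    # State-delta: stage advanced AND the flag that pinned the card is cleared.
    assert booked["stage"] == "appt_booked"
    assert booked["neg_reply"] is False
    # Must-NOT: the card may not stay in Negative Replies.
    assert deal_col(booked) == "Booked"
```

If the handler forgot to clear `neg_reply`, the last assertion would fail even though the stage changed, which is the failure the original report describes.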

6 · How it stands on your existing skills (reuse-first)

Skill	How used	Role in the factory
`user-journeys`	🟢 whole	The EBO author + Notion human-gate. Forking it would split "correct" into two drifting copies — the exact problem we're solving.
`visual-qa-ultra`	🟢 engine	Both the per-builder QA and the final QA. Its trust-core makes "green" a mechanical script conjunction (pixels ∧ state-delta ∧ no-collision ∧ coverage) — that's what lets a Sonnet twin be trusted to collect evidence + emit verdicts.
`plan-orchestrate`	🟢 pattern	The file-bus + re-spawn + 30-min heartbeat + answer-routing pattern, reused verbatim. EBO Factory is its parallel sibling (N fixes vs one feature).
`resolving-merge-conflicts`	🟢 inline	Invoked when a sequential merge conflicts (preserve both intents; never `--abort`).
`board-card-qa` · `ai-usage-qa`	🟢 fast gate	Cheap sentinel-gated DOM/API per-behavior checks during convergence, before the heavier visual pass.
`graphify` · `tdd` · `ponytail` · `document-changelog`	🟢 per-builder	Every builder orients via the graph, builds TDD + lazy, documents the changelog (the Stop-hook enforces it).
`skill-creator`	🟡 build-time	Scaffolds + registers the new skill (the next instance runs this).
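
To illustrate what "green as a mechanical conjunction" means for a QA twin, here is a minimal sketch. `RowEvidence` and `verdict()` are hypothetical names and do not reflect the actual `visual-qa-ultra` interface; the four gates mirror the conjunction named in the table, and the escalate branch follows the twin's rule for ambiguous rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowEvidence:
    pixels_ok: bool
    state_delta_ok: bool
    no_collision: bool
    coverage_ok: bool
    ambiguous: bool = False  # evidence the twin cannot interpret on its own


def verdict(e: RowEvidence) -> str:
    """Green only if every gate passes; ambiguity goes to the orchestrator."""
    if e.ambiguous:
        return "escalate"
    if e.pixels_ok and e.state_delta_ok and e.no_collision and e.coverage_ok:
        return "green"
    return "red"


if __name__ == "__main__":
    print(verdict(RowEvidence(True, True, True, True)))
    print(verdict(RowEvidence(True, False, True, True)))
```

No judgment call sits on the green path, which is why a cheaper model can be trusted with it.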

7 · What the 8 research agents validated — and the 4 things they changed

Finding (sourced)	What it changes in the design
Self-verification ≈ 0% reliable; independent verifier ≈ 100% (GitHub 2026)	The QA twin must have no shared context with its builder, and the builder may never sign its own oracle. Validates the Sonnet-twin split.
"TDD Prompting Paradox": verbose test-first can raise regressions; agents game their own tests (TDAD 2026)	Don't just tell the builder "write a failing test." Q4 asks whether the QA twin should audit each test's oracle strength, or whether a separate agent freezes the tests the builder can't edit.
80% of agent-written tests are "test theater" — null/exists checks, no real oracle (arXiv 2606.18168)	Every EBO row must carry a real state-delta + must-NOT, not just a "you see". Already baked into the EBO schema; the audit in Q4 enforces it.
Capability routing > category routing; never trust model self-confidence — use test pass/fail (RouteLLM, ICLR'25)	The model-pick rule (§4) scores task size/risk, and escalation signals are test results, not the agent saying "I'm confident". Feeds Q3.
Loading screens are the #1 false-judgment source in agent UI QA	Already handled — `visual-qa-ultra` hard-stops on a `Loading…` frame and never judges it. No change; reassuring.
Human gate must be blocking, not advisory (Kiro / NIST 2026)	The Notion sign-off blocks the build. Q7 confirms blocking vs a timeout-default.
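
The "test theater" finding implies a cheap mechanical pre-check: reject any EBO row that lacks a state-delta or a must-NOT before it reaches a builder. A minimal sketch, assuming rows are dicts keyed by the five oracle columns (the key names and `missing_oracle_fields()` are hypothetical):

```python
REQUIRED_FIELDS = ("you_do", "you_see", "element", "state_delta", "must_not")


def missing_oracle_fields(row: dict) -> list[str]:
    """Return the oracle columns that are absent or blank in this row."""
    return [f for f in REQUIRED_FIELDS if not (row.get(f) or "").strip()]


if __name__ == "__main__":
    weak_row = {"you_do": "Book a call", "you_see": "Card moves to Booked"}
    print(missing_oracle_fields(weak_row))
```

This only checks presence, not strength; judging whether a state-delta is a real oracle is still the audit that Q4 is about.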

8 · The decisions I need from you (10 questions, in chat)

Every one is about how a sub-process executes — never about what the system achieves. ⚠️ = big blast radius. My recommended pick is in the chat answer-block; this is just the map.

Q	Decides	The execution choice
1⚠️	EBO authoring	Reuse `/user-journeys` whole + thin adapters, or variant/absorb/lightweight-for-bugs?
2⚠️	Signed EBO → QA oracle	Deterministic compiler to `visual-qa-ultra` `ebo.json` slices, or author the QA oracle directly?
3⚠️	Builder model selection	Sonnet-default + scripted escalation triggers, or Opus-default, or free judgment?
4⚠️	TDD authorship / anti-gaming	Self-TDD, or frozen tests the builder can't edit, or self-TDD + QA oracle-strength audit?
5	God-module collisions	Serialise builders touching `flows.py`/`dash.py`, or parallel-then-resolve-at-merge?
6	Per-builder QA timing	After the builder commits, concurrent, or cheap-during + full-vqu-after?
7⚠️	Notion confirm gate	Blocking, heartbeat-then-proceed, or partial-sign?
8⚠️	Rows not live-clickable	Classify live/test-layer/blocked & route each, or build a state-injection QA endpoint?
9	Merge conflict authority	Doc-auto + app-code human-gated, full-auto, or auto + post-merge review agent?
10	Push / output	Never push, push feature branches (announce-first), or open a PR?

⚠️ Safety invariants — baked in as mechanical guards, never weakened. Human gate (nothing sent to a lead without an explicit action) · AUTOSEND stays ON and is never flipped in a deploy/test · sentinel-gate (only ZZ… fixtures are drivable) + send-blocklist · Notion = archive-not-delete · Instantly warmup mail invisible everywhere · never push main without your explicit trigger · git bundle backup before any merge · QA twins can never self-sign the oracle · sweep every fixture after.
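
The git-related invariants (bundle backup before any merge, sequential `--no-ff` onto a fresh integration branch pinned to a main SHA, never pushing main) can be expressed as a command plan that is built and inspected before anything runs. This is a sketch under assumptions: `converge_plan()`, the `integration` branch name and the bundle file name are hypothetical, and the plan is only returned, not executed.

```python
def converge_plan(
    main_sha: str,
    branches: list[str],
    bundle_path: str = "pre-merge.bundle",
) -> list[list[str]]:
    """Build the ordered git commands for the converge phase."""
    if "main" in branches:
        raise ValueError("main is the pinned base, not a branch to merge")
    plan = [
        # Backup first: a bundle of all refs, taken before any merge.
        ["git", "bundle", "create", bundle_path, "--all"],
        # Fresh integration branch, pinned to the recorded main SHA.
        ["git", "switch", "-c", "integration", main_sha],
    ]
    for branch in branches:
        # One merge at a time, each with its own merge commit.
        plan.append(["git", "merge", "--no-ff", "--no-edit", branch])
    return plan  # deliberately contains no push


if __name__ == "__main__":
    for cmd in converge_plan("abc1234", ["fix/booked-card", "fix/popup"]):
        print(" ".join(cmd))
```

Keeping the plan as data makes the invariants checkable: a guard can refuse any plan whose first command is not the bundle or that contains a push.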

EBO Factory — context report for skill design · generated 2026-06-23 from 4 Opus + 4 Sonnet research agents. Next: you answer the 10 questions in chat → a fresh Claude Code instance builds the skill from the handoff + prompt. Sibling design to fleet-merge-qa (agent-A) — intended to be A/B tested against it.