Skill design · context for your 10 decisions · ClientsFlow

EBO Factory — a proposed autonomous bug-fix / feature factory

Working name ebo-factory (provisional) · 2026-06-23 · synthesised from 4 Opus + 4 Sonnet research agents + your existing skills. The questions in chat decide how the sub-processes run, not what the factory achieves.

🟢 reuses an existing skill as-is · 🟡 thin new adapter glue · 🔴 genuinely new machinery · ⚠️ a decision with big blast radius

1 · What it is, in one breath

A single global skill that takes scrambled input — a raw bug-report transcript, or your rambling about a feature — and runs it all the way to a merged, QA-proven build on a safe branch, with you signing off only twice: the EBO at the start and the result at the end. It turns the ad-hoc "4 parallel chats → manual merge → one QA agent" scramble from 2026-06-22 into a disciplined, reproducible pipeline.

Why it exists / how it differs from fleet-merge-qa. Your existing fleet-merge-qa design (agent-A) starts from N already-scoped fixes and its job is merge + QA. EBO Factory starts one step earlier and ends one step later: it derives the spec from raw input (EBO-first, Notion-gated), gives every builder its own paired QA before merge, and adds an orchestrator-run final QA with a bug→builder→fresh-QA fix loop. You wanted these two built so you can A/B them against each other — same safety spine, different ambition.

2 · The pipeline at a glance · 4 macro-stages

First step, always — for a bug or a brand-new feature: understand intent + the codebase + project state, then map every scenario on the touched UI. That map is the EBO.

① Understand & Specify
scrambled in → EBO → Notion ✅
Parse the transcript (incl. queued_command follow-ups) or grill the feature ramble → author the Expected-Behavior Oracle via /user-journeys → you sign it in Notion. Blocking gate.
② Decompose & Build
group by module → N builders
Orchestrator groups related EBO rows, picks Sonnet-or-Opus per task by size/risk, spawns a builder per group — one git worktree + branch each, TDD red→green.
③ Pair-QA each builder
builder + Sonnet QA
Every builder has a Sonnet QA twin running a visual-qa-ultra profile against its EBO slice — click-by-click, pixel + state. Reports compactly to the orchestrator.
④ Merge & Prove
merge → final QA → fix loop
Orchestrator merges all green branches, then runs its own final QA over the whole build; any bug routes back to the responsible builder → fresh QA. Ends at PENDING_REVIEW.
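
A minimal sketch of how stage ④'s merge step could be laid out as data before anything runs, assuming the order the phase ledger gives for Converge (bundle backup, SHA-pinned main, sequential --no-ff onto a fresh integration branch). The function name converge_plan, the backup filename and the integration/<run_id> branch name are hypothetical; only the git subcommands themselves are real.

```python
from typing import List


def converge_plan(main_sha: str, green_branches: List[str], run_id: str) -> List[List[str]]:
    """Return the git commands for the merge step, in order, without running them."""
    if not green_branches:
        raise ValueError("nothing to merge: no green branches")
    integration = f"integration/{run_id}"  # hypothetical branch naming
    plan = [
        # Backup first, so every ref can be restored if a merge goes wrong.
        ["git", "bundle", "create", f"backup-{run_id}.bundle", "--all"],
        # Start the integration branch from the pinned SHA, not a moving main.
        ["git", "checkout", "-b", integration, main_sha],
    ]
    for branch in green_branches:
        # One merge commit per builder branch keeps each fix separately revertable.
        plan.append(["git", "merge", "--no-ff", "--no-edit", branch])
    return plan
```

Each entry can be passed to subprocess.run(cmd, check=True). On a non-zero exit from a merge, the orchestrator would hand off to resolving-merge-conflicts rather than abort. No entry in the plan ever pushes.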

The same flow as a phase ledger (each phase checkpoints to an on-disk file-bus, so the orchestrator's own context stays disposable and crash-resumable):

| # | Phase | What happens | Output | Human? |
| --- | --- | --- | --- | --- |
| 0 | Intake | Adapter parses the bug transcript (type:"user" AND queued_command attachments, the §7 miss) or grills the feature ramble → normalised intent_items[] | raw_asks.json | |
| 1 | EBO 🟢 | Hand intent_items[] to /user-journeys unchanged → it authors the click-by-click oracle (You-do · see · element · state-delta · must-NOT) + asks every open question | EBO.md + spec.json | answers Qs |
| 2 | Confirm ⚠️ | Mirror EBO to the existing Notion EBO database; build is blocked until Status = 🟢 Signed off | ebo.signed | signs in Notion |
| 3 | Decompose | Group EBO rows → builder tasks (same feature / file / flow); score each task → assign Sonnet or Opus; slice the EBO per task | tasks/<id>.json | heartbeat |
| 4 | Fan-out 🔴 | Spawn N builders (worktree + branch + TDD) each paired with a Sonnet QA twin on its EBO slice | branch + return.json + qa.json | |
| 5 | Converge 🟢 | git bundle backup → SHA-pin main → sequential --no-ff onto a fresh integration branch → resolve conflicts | integration branch | |
| 6 | Final QA 🔴 | Deploy the merged build (badge-pinned) → orchestrator runs its own visual-qa-ultra over the whole EBO (cross-feature bugs first appear here) | qa/final-report/ | |
| 7 | Fix loop 🔴 | Each bug → responsible builder → TDD fix → re-merge → fresh Sonnet QA re-verifies (round-capped) | fix-log.json | |
| 8 | Review | Open the report on localhost; verdict table; stays PENDING_REVIEW until you accept. Never pushes main. | End-of-Exec report | accepts |
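
A rough sketch of the "checkpoint to an on-disk file-bus" idea, assuming one marker file per completed phase. The phase keys, the <phase>.json marker layout and both function names are hypothetical (the real bus writes the outputs listed in the ledger, such as raw_asks.json); the point is only that a re-spawned orchestrator can find its place from disk alone.

```python
import json
import os
from pathlib import Path
from typing import Optional

# Phase order taken from the ledger; the key spellings are an assumption.
PHASES = ["intake", "ebo", "confirm", "decompose", "fan-out",
          "converge", "final-qa", "fix-loop", "review"]


def checkpoint(bus: Path, phase: str, payload: dict) -> None:
    """Record a finished phase; write-then-rename so a crash never leaves a half file."""
    bus.mkdir(parents=True, exist_ok=True)
    tmp = bus / f"{phase}.json.tmp"
    tmp.write_text(json.dumps(payload))
    os.replace(tmp, bus / f"{phase}.json")  # atomic on the same filesystem


def next_phase(bus: Path) -> Optional[str]:
    """Return the first phase with no marker on disk, or None when the run is done."""
    for phase in PHASES:
        if not (bus / f"{phase}.json").exists():
            return phase
    return None
```

A fresh orchestrator would call next_phase(bus) on start and resume from there, which is what keeps its own context disposable.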

3 · The cast: who does what, on which model · 6 roles

| Role | Model | What it does · why that model |
| --- | --- | --- |
| Orchestrator | Opus | Holds run-state, sequences phases, groups tasks, resolves merges, runs the final QA + fix loop. Judgment-heavy, multi-system, expensive-if-wrong. |
| Intake adapter | script + Sonnet | Deterministic transcript parser (queued-command-aware) + a Sonnet grilling pass for vague feature asks. No raw JSONL ever re-enters the orchestrator. |
| EBO author | /user-journeys | Reused whole: authors the human-signed oracle + Notion mirror. The single source of truth for "correct". |
| Builder (1 / task) | Sonnet or Opus | Writes the real product code, TDD, on its own branch. The orchestrator picks the model per task from a scored size/risk rule (your Q3). |
| Builder's QA twin | always Sonnet | Drives the live app against the builder's EBO slice, click-by-click. Can turn a row red freely; turns it green only when every script gate passes; anything ambiguous it escalates to the orchestrator. |
| Final QA + fixers | Opus + fresh Sonnet | The orchestrator's own whole-build QA (Opus adjudicates disputes); fixes route to the owning builder, re-verified by a fresh Sonnet twin (no prior-pass bias). |
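
The QA twin's verdict rule can be sketched as a three-way function. This is one reading of the row above, not the visual-qa-ultra implementation: the four gate names are borrowed from the trust-core conjunction described in §6, a gate result of None stands for "ambiguous", and treating any single failed gate as red is an assumption.

```python
from enum import Enum
from typing import Dict, Optional


class Verdict(Enum):
    GREEN = "green"
    RED = "red"
    ESCALATE = "escalate"


# Gate names as listed for visual-qa-ultra's trust-core; key spellings are hypothetical.
GATES = ("pixels", "state_delta", "no_collision", "coverage")


def twin_verdict(gates: Dict[str, Optional[bool]]) -> Verdict:
    """True = gate passed, False = gate failed, None or missing = ambiguous."""
    results = [gates.get(name) for name in GATES]
    if any(r is False for r in results):
        return Verdict.RED       # red is cheap: one failed gate is enough
    if all(r is True for r in results):
        return Verdict.GREEN     # green only as a full conjunction
    return Verdict.ESCALATE      # anything unclear goes to the orchestrator
```

Because green is a pure conjunction of script results, the twin's own confidence never enters the verdict.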

4 · The two heuristics the orchestrator runs · scriptable

Model pick (per builder task)
touches live-send / payment / proposal / gcal path → Opus
novel architecture (no prior pattern in graph) → Opus
≥ 4 files, or ≥ ~120 LOC, or ≥ 5 EBO rows → Opus
otherwise (isolated UI / dash-module tweak) → Sonnet
any QA twin → Sonnet
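
As a sketch, the model-pick rule above is a short first-match cascade. The BuilderTask fields and the pick_model name are hypothetical, and the "~120 LOC" threshold is treated as a hard 120 purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Set

# Path tags copied from the first rule; how a task gets tagged is not specified.
SENSITIVE_PATHS = {"live-send", "payment", "proposal", "gcal"}


@dataclass
class BuilderTask:
    paths: Set[str] = field(default_factory=set)  # sensitive path tags touched
    novel_architecture: bool = False              # no prior pattern in the graph
    files: int = 1
    loc: int = 0
    ebo_rows: int = 1


def pick_model(task: BuilderTask, role: str = "builder") -> str:
    if role == "qa_twin":
        return "sonnet"                 # any QA twin is Sonnet, whatever the task
    if task.paths & SENSITIVE_PATHS:
        return "opus"
    if task.novel_architecture:
        return "opus"
    if task.files >= 4 or task.loc >= 120 or task.ebo_rows >= 5:
        return "opus"
    return "sonnet"                     # isolated UI / dash-module tweak
```

The inputs are all countable facts about the task, so the pick never depends on a model's own confidence.
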
Task grouping (EBO rows → builder)
same feature tag → one builder
same target file / dash_* module (via graphify) → one builder
same flows.py flow (booking/proposal/payment) → one builder
both touch a god-module (flows.py/dash.py) → serialise ⚠️
cap → ≤5 rows · ≤3 concurrent
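
One way to script the grouping rule is a small union-find: rows that share a feature tag, a target file or a flow collapse into one cluster, and each cluster is then cut to the 5-row cap. The sketch below is a simplification under stated assumptions: each row names a single target file, the EboRow fields and any example filenames are hypothetical, "both touch a god-module" is read literally as "each task touches flows.py or dash.py", and the ≤3 concurrency cap is left to the scheduler.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

GOD_MODULES = {"flows.py", "dash.py"}
MAX_ROWS_PER_BUILDER = 5


@dataclass(frozen=True)
class EboRow:
    id: str
    feature: str
    target_file: str
    flow: Optional[str] = None


def group_rows(rows: List[EboRow]) -> List[List[EboRow]]:
    parent: Dict[str, str] = {r.id: r.id for r in rows}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen: Dict[Tuple[str, str], str] = {}
    for r in rows:
        keys = [("feature", r.feature), ("file", r.target_file)]
        if r.flow:
            keys.append(("flow", r.flow))
        for key in keys:
            if key in first_seen:
                # Shares a feature, file or flow with an earlier row: same builder.
                parent[find(r.id)] = find(first_seen[key])
            else:
                first_seen[key] = r.id

    clusters: Dict[str, List[EboRow]] = defaultdict(list)
    for r in rows:
        clusters[find(r.id)].append(r)

    tasks: List[List[EboRow]] = []
    for members in clusters.values():
        # Enforce the row cap by splitting oversized clusters in order.
        for i in range(0, len(members), MAX_ROWS_PER_BUILDER):
            tasks.append(members[i:i + MAX_ROWS_PER_BUILDER])
    return tasks


def must_serialise(a: List[EboRow], b: List[EboRow]) -> bool:
    """True when both tasks touch a god-module and so should not run in parallel."""
    def touches(task: List[EboRow]) -> bool:
        return any(r.target_file in GOD_MODULES for r in task)
    return touches(a) and touches(b)
```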

5 · The canonical worked example — your booked-card bug

The skill's demo fixture is the real chat you just ran: "after a lead booked a call, its card disappeared from the pipeline instead of moving to Booked… refreshing, I got the popup." That scrambled report compiles to one EBO row:

| You do | You should see | Element that changes | What changes underneath | Must NOT happen |
| --- | --- | --- | --- | --- |
| A lead who earlier sent a negative reply self-books a call | The card moves to Booked / Sales Call Prep, live, with no refresh | The kanban card (deal_col() routing) | stage → appt_booked AND neg_reply cleared so column-ranking no longer pins it | Card must NOT stay in 🚫 Negative Replies; must NOT vanish; must NOT need a refresh |

A real constraint this example exposes: there is no state-injection endpoint to set neg_reply on a sentinel lead, so this row can't be purely live-clicked — it's verified at the pytest layer. That's why Q8 asks how to route rows that aren't live-QA-able (the honest alternative to a false green).
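
To make the Q8 trade-off concrete, here is a sketch of the "classify and route" option only; it does not pre-empt the decision. The RowCheck fields and the route names are hypothetical, and how the two booleans get computed for a row is left open.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    LIVE = "live"              # the QA twin can click through it on the real app
    TEST_LAYER = "test-layer"  # verified by pytest because the state can't be staged live
    BLOCKED = "blocked"        # no honest way to verify yet; must be reported, never green


@dataclass
class RowCheck:
    row_id: str
    precondition_reachable_by_clicks: bool  # can a sentinel lead be put in the start state?
    has_pytest_coverage: bool               # does a test assert the state-delta and must-NOT?


def route(row: RowCheck) -> Route:
    if row.precondition_reachable_by_clicks:
        return Route.LIVE
    if row.has_pytest_coverage:
        return Route.TEST_LAYER
    return Route.BLOCKED
```

Under this reading the booked-card row lands in TEST_LAYER: neg_reply cannot be set on a sentinel lead by clicking, but the routing can be asserted in pytest.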

6 · How it stands on your existing skills · reuse-first

| Skill | How used | Role in the factory |
| --- | --- | --- |
| user-journeys | 🟢 whole | The EBO author + Notion human-gate. Forking it would split "correct" into two drifting copies, the exact problem we're solving. |
| visual-qa-ultra | 🟢 engine | Both the per-builder QA and the final QA. Its trust-core makes "green" a mechanical script conjunction (pixels ∧ state-delta ∧ no-collision ∧ coverage); that is what lets a Sonnet twin be trusted to collect evidence + emit verdicts. |
| plan-orchestrate | 🟢 pattern | The file-bus + re-spawn + 30-min heartbeat + answer-routing pattern, reused verbatim. EBO Factory is its parallel sibling (N fixes vs one feature). |
| resolving-merge-conflicts | 🟢 inline | Invoked when a sequential merge conflicts (preserve both intents; never --abort). |
| board-card-qa · ai-usage-qa | 🟢 fast gate | Cheap sentinel-gated DOM/API per-behavior checks during convergence, before the heavier visual pass. |
| graphify · tdd · ponytail · document-changelog | 🟢 per-builder | Every builder orients via the graph, builds TDD + lazy, documents the changelog (the Stop-hook enforces it). |
| skill-creator | 🟡 build-time | Scaffolds + registers the new skill (the next instance runs this). |

7 · What the 8 research agents validated — and the 4 things they changed

| Finding (sourced) | What it changes in the design |
| --- | --- |
| Self-verification ≈ 0% reliable; independent verifier ≈ 100% (GitHub 2026) | The QA twin must have no shared context with its builder, and the builder may never sign its own oracle. Validates the Sonnet-twin split. |
| "TDD Prompting Paradox": verbose test-first can raise regressions; agents game their own tests (TDAD 2026) | Don't just tell the builder "write a failing test." Q4 asks whether the QA twin should audit each test's oracle strength, or whether a separate agent freezes the tests the builder can't edit. |
| 80% of agent-written tests are "test theater": null/exists checks, no real oracle (arXiv 2606.18168) | Every EBO row must carry a real state-delta + must-NOT, not just a "you see". Already baked into the EBO schema; the audit in Q4 enforces it. |
| Capability routing > category routing; never trust model self-confidence, use test pass/fail (RouteLLM, ICLR'25) | The model-pick rule (§4) scores task size/risk, and escalation signals are test results, not the agent saying "I'm confident". Feeds Q3. |
| Loading screens are the #1 false-judgment source in agent UI QA | Already handled: visual-qa-ultra hard-stops on a Loading… frame and never judges it. No change; reassuring. |
| Human gate must be blocking, not advisory (Kiro / NIST 2026) | The Notion sign-off blocks the build. Q7 confirms blocking vs a timeout-default. |
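
The "real state-delta + must-NOT" requirement lends itself to a cheap structural check before any row reaches a builder. The sketch below assumes an EBO row is a dict keyed by the five oracle columns; the key spellings and the weak_fields name are hypothetical, and a non-empty field is only a floor, not proof of a strong oracle.

```python
from typing import Dict, List

# The five oracle columns: You-do · see · element · state-delta · must-NOT.
REQUIRED = ("you_do", "you_see", "element", "state_delta", "must_not")


def weak_fields(row: Dict[str, str]) -> List[str]:
    """Return the oracle columns that are missing or blank for this row."""
    return [name for name in REQUIRED if not str(row.get(name) or "").strip()]


booked_card = {
    "you_do": "A lead who earlier sent a negative reply self-books a call",
    "you_see": "The card moves to Booked / Sales Call Prep",
    "element": "The kanban card (deal_col() routing)",
    "state_delta": "stage -> appt_booked AND neg_reply cleared",
    "must_not": "Card must NOT stay in Negative Replies; must NOT vanish",
}
theater = {"you_do": "Open the board", "you_see": "The board exists"}
```

Here weak_fields(booked_card) is empty, while the "theater" row is rejected for lacking an element, a state-delta and a must-NOT.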

8 · The decisions I need from you · 10 questions, in chat

Every one is about how a sub-process executes — never about what the system achieves. ⚠️ = big blast radius. My recommended pick is in the chat answer-block; this is just the map.

| Q | Decides | The execution choice |
| --- | --- | --- |
| 1 ⚠️ | EBO authoring | Reuse /user-journeys whole + thin adapters, or variant/absorb/lightweight-for-bugs? |
| 2 ⚠️ | Signed EBO → QA oracle | Deterministic compiler to visual-qa-ultra ebo.json slices, or author the QA oracle directly? |
| 3 ⚠️ | Builder model selection | Sonnet-default + scripted escalation triggers, or Opus-default, or free judgment? |
| 4 ⚠️ | TDD authorship / anti-gaming | Self-TDD, or frozen tests the builder can't edit, or self-TDD + QA oracle-strength audit? |
| 5 | God-module collisions | Serialise builders touching flows.py/dash.py, or parallel-then-resolve-at-merge? |
| 6 | Per-builder QA timing | After the builder commits, concurrent, or cheap-during + full-vqu-after? |
| 7 ⚠️ | Notion confirm gate | Blocking, heartbeat-then-proceed, or partial-sign? |
| 8 ⚠️ | Rows not live-clickable | Classify live/test-layer/blocked & route each, or build a state-injection QA endpoint? |
| 9 | Merge conflict authority | Doc-auto + app-code human-gated, full-auto, or auto + post-merge review agent? |
| 10 | Push / output | Never push, push feature branches (announce-first), or open a PR? |

⚠️ Safety invariants: baked in as mechanical guards, never weakened. Human gate (nothing sent to a lead without an explicit action) · AUTOSEND stays ON and is never flipped in a deploy/test · sentinel-gate (only ZZ… fixtures are drivable) + send-blocklist · Notion = archive-not-delete · Instantly warmup mail invisible everywhere · never push main without your explicit trigger · git bundle backup before any merge · QA twins can never self-sign the oracle · sweep every fixture after.
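
Two of these invariants are simple enough to sketch as mechanical guards. Both functions are hypothetical illustrations: the brief gives only the ZZ prefix for sentinel fixtures (the full naming scheme is not stated), and the push check is a crude argv match that would not catch a bare git push run while on main.

```python
from typing import List

SENTINEL_PREFIX = "ZZ"  # assumption: the prefix is the only part the brief specifies


def is_drivable(fixture_name: str) -> bool:
    """Sentinel-gate: QA may only drive fixtures whose name carries the prefix."""
    return fixture_name.startswith(SENTINEL_PREFIX)


def assert_no_main_push(argv: List[str], explicit_trigger: bool = False) -> None:
    """Refuse a git push that names main unless the human explicitly triggered it."""
    if "push" not in argv[1:2]:
        return  # not a push command, nothing to guard
    targets_main = any(
        arg == "main" or arg.endswith(":main") or arg.endswith("/main")
        for arg in argv[2:]
    )
    if targets_main and not explicit_trigger:
        raise PermissionError("refusing to push main without an explicit trigger")
```

Running guards like these in the code path, rather than as prompt instructions, is what makes an invariant mechanical.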