Skill design · context for your 10 decisions · ClientsFlow
Working name ebo-factory (provisional) · 2026-06-23 · synthesised from 4 Opus + 4 Sonnet research agents + your existing skills. The questions in chat decide how the sub-processes run, not what the skill achieves.
A single global skill that takes scrambled input — a raw bug-report transcript, or your rambling about a feature — and runs it all the way to a merged, QA-proven build on a safe branch, with you signing off only twice: the EBO at the start and the result at the end. It turns the ad-hoc "4 parallel chats → manual merge → one QA agent" scramble from 2026-06-22 into a disciplined, reproducible pipeline.
fleet-merge-qa. Your existing fleet-merge-qa design (agent-A) starts from N already-scoped fixes and its job is merge + QA. EBO Factory starts one step earlier and ends one step later: it derives the spec from raw input (EBO-first, Notion-gated), gives every builder its own paired QA before merge, and adds an orchestrator-run final QA with a bug→builder→fresh-QA fix loop. You wanted these two built so you can A/B them against each other — same safety spine, different ambition.
First step, always — for a bug or a brand-new feature: understand intent + the codebase + project state, then map every scenario on the touched UI. That map is the EBO.
… queued_command follow-ups) or grill the feature ramble → author the Expected-Behavior Oracle via /user-journeys → you sign it in Notion. Blocking gate.
… visual-qa-ultra profile against its EBO slice: click-by-click, pixel + state. Reports compactly to the orchestrator.
… PENDING_REVIEW.

The same flow as a phase ledger (each phase checkpoints to an on-disk file-bus, so the orchestrator's own context stays disposable and crash-resumable):

| # | Phase | What happens | Output | Human? |
|---|---|---|---|---|
| 0 | Intake | Adapter parses the bug transcript (type:"user" AND queued_command attachments — the §7 miss) or grills the feature ramble → normalised intent_items[] | raw_asks.json | — |
| 1 | EBO 🟢 | Hand intent_items[] to /user-journeys unchanged → it authors the click-by-click oracle (You-do · see · element · state-delta · must-NOT) + asks every open question | EBO.md + spec.json | answers Qs |
| 2 | Confirm ⚠️ | Mirror EBO to the existing Notion EBO database; build is blocked until Status = 🟢 Signed off | ebo.signed | signs in Notion |
| 3 | Decompose | Group EBO rows → builder tasks (same feature / file / flow); score each task → assign Sonnet or Opus; slice the EBO per task | tasks/<id>.json | heartbeat |
| 4 | Fan-out 🔴 | Spawn N builders (worktree + branch + TDD) each paired with a Sonnet QA twin on its EBO slice | branch + return.json + qa.json | — |
| 5 | Converge 🟢 | git bundle backup → SHA-pin main → sequential --no-ff onto a fresh integration branch → resolve conflicts | integration branch | — |
| 6 | Final QA 🔴 | Deploy the merged build (badge-pinned) → orchestrator runs its own visual-qa-ultra over the whole EBO (cross-feature bugs first appear here) | qa/final-report/ | — |
| 7 | Fix loop 🔴 | Each bug → responsible builder → TDD fix → re-merge → fresh Sonnet QA re-verifies (round-capped) | fix-log.json | — |
| 8 | Review | Open the report on localhost; verdict table; stays PENDING_REVIEW until you accept. Never pushes main. | End-of-Exec report | accepts |
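
The checkpoint-per-phase idea behind the ledger above can be sketched as follows. This is a minimal illustration, not the skill's actual file-bus: the directory layout, the `.done.json` file names, and the helpers `checkpoint` and `next_phase` are all assumptions made for the example. Only the phase order comes from the ledger.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional

# Phase order taken from the ledger; the snake_case names are hypothetical.
PHASES = ["intake", "ebo", "confirm", "decompose", "fan_out",
          "converge", "final_qa", "fix_loop", "review"]


def _marker(bus: Path, phase: str) -> Path:
    """Path of the (hypothetical) completion marker for a phase."""
    return bus / f"{PHASES.index(phase)}-{phase}.done.json"


def checkpoint(bus: Path, phase: str, payload: dict) -> Path:
    """Atomically record that `phase` finished, with a summary of its output."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase}")
    bus.mkdir(parents=True, exist_ok=True)
    target = _marker(bus, phase)
    # Write to a temp file, then rename, so a crash never leaves a
    # half-written checkpoint on the bus.
    fd, tmp = tempfile.mkstemp(dir=bus, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
    os.replace(tmp, target)
    return target


def next_phase(bus: Path) -> Optional[str]:
    """Return the first phase without a checkpoint, or None when all are done."""
    for phase in PHASES:
        if not _marker(bus, phase).exists():
            return phase
    return None
```

A re-spawned orchestrator would call `next_phase(bus)` on start and resume from there, reading only the small JSON summaries rather than any earlier conversation context.
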
| Role | Model | What it does · why that model |
|---|---|---|
| Orchestrator | Opus | Holds run-state, sequences phases, groups tasks, resolves merges, runs the final QA + fix loop. Judgment-heavy, multi-system, expensive-if-wrong. |
| Intake adapter | script + Sonnet | Deterministic transcript parser (queued-command-aware) + a Sonnet grilling pass for vague feature asks. No raw JSONL ever re-enters the orchestrator. |
| EBO author | /user-journeys | Reused whole — authors the human-signed oracle + Notion mirror. The single source of truth for "correct". |
| Builder (1 / task) | Sonnet or Opus | Writes the real product code, TDD, on its own branch. The orchestrator picks the model per task from a scored size/risk rule (your Q3). |
| Builder's QA twin | always Sonnet | Drives the live app against the builder's EBO slice, click-by-click. Can turn a row red freely; turns it green only when every script gate passes — anything ambiguous it escalates to the orchestrator. |
| Final QA + fixers | Opus + fresh Sonnet | The orchestrator's own whole-build QA (Opus adjudicates disputes); fixes route to the owning builder, re-verified by a fresh Sonnet twin (no prior-pass bias). |

| Model pick (per builder task) | |
|---|---|
| touches live-send / payment / proposal / gcal path | → Opus |
| novel architecture (no prior pattern in graph) | → Opus |
| ≥ 4 files, or ≥ ~120 LOC, or ≥ 5 EBO rows | → Opus |
| otherwise (isolated UI / dash-module tweak) | → Sonnet |
| any QA twin | → Sonnet |
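
The model-pick table reads naturally as a first-match rule. The sketch below assumes each task already carries the counts the rule needs. The `BuilderTask` fields and the `pick_model` helper are hypothetical names, and the approximate "~120 LOC" threshold is treated as a hard 120 for illustration.

```python
from dataclasses import dataclass, field
from typing import Set

# Path tags from the table; how a task gets tagged is an assumption.
SENSITIVE_PATHS = {"live-send", "payment", "proposal", "gcal"}


@dataclass
class BuilderTask:
    touched_paths: Set[str] = field(default_factory=set)  # hypothetical tags
    novel_architecture: bool = False  # no prior pattern in the graph
    files: int = 0
    loc: int = 0
    ebo_rows: int = 0


def pick_model(task: BuilderTask, is_qa_twin: bool = False) -> str:
    """Return "opus" or "sonnet" using the table's rules, first match wins."""
    if is_qa_twin:
        return "sonnet"  # any QA twin is Sonnet, whatever the task looks like
    if task.touched_paths & SENSITIVE_PATHS:
        return "opus"
    if task.novel_architecture:
        return "opus"
    if task.files >= 4 or task.loc >= 120 or task.ebo_rows >= 5:
        return "opus"
    return "sonnet"  # isolated UI / dash-module tweak
```

Because the rule only reads counts and tags, the choice stays reproducible across runs; it does not depend on any agent reporting its own confidence.
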

| Task grouping (EBO rows → builder) | |
|---|---|
| same feature tag | → one builder |
| same target file / dash_* module (via graphify) | → one builder |
| same flows.py flow (booking/proposal/payment) | → one builder |
| both touch a god-module (flows.py/dash.py) | → serialise ⚠️ |
| cap | ≤5 rows · ≤3 concurrent |
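
One way to apply the grouping rules is a union-find pass over the EBO rows: any two rows sharing a feature tag, a target file, or a flow end up in the same builder task. The sketch below is an illustration under assumptions. The `EboRow` fields are hypothetical, splitting an oversized group into chunks of 5 is only one reading of the "≤5 rows" cap, and `must_serialise` takes "both touch a god-module" literally (either god-module counts). The "≤3 concurrent" cap is a scheduling concern and is not modelled here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

GOD_MODULES = {"flows.py", "dash.py"}
MAX_ROWS_PER_TASK = 5  # the "≤5 rows" cap


@dataclass(frozen=True)
class EboRow:
    row_id: str
    feature: str
    target_file: str
    flow: Optional[str] = None  # e.g. "booking", "proposal", "payment"


def group_rows(rows: List[EboRow]) -> List[List[EboRow]]:
    """Group rows sharing a feature tag, target file, or flow; cap group size."""
    parent = list(range(len(rows)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen: Dict[Tuple[str, str], int] = {}
    for index, row in enumerate(rows):
        keys = [("feature", row.feature), ("file", row.target_file)]
        if row.flow is not None:
            keys.append(("flow", row.flow))
        for key in keys:
            if key in first_seen:
                parent[find(index)] = find(first_seen[key])  # union
            else:
                first_seen[key] = index

    groups: Dict[int, List[EboRow]] = defaultdict(list)
    for index, row in enumerate(rows):
        groups[find(index)].append(row)

    tasks: List[List[EboRow]] = []
    for members in groups.values():
        # Assumption: an oversized group is split into chunks of at most 5 rows.
        for start in range(0, len(members), MAX_ROWS_PER_TASK):
            tasks.append(members[start:start + MAX_ROWS_PER_TASK])
    return tasks


def must_serialise(task_a: List[EboRow], task_b: List[EboRow]) -> bool:
    """True when both tasks touch a god-module and so should not run in parallel."""
    touches_a = any(r.target_file in GOD_MODULES for r in task_a)
    touches_b = any(r.target_file in GOD_MODULES for r in task_b)
    return touches_a and touches_b
```

The grouping is transitive on purpose: if row A shares a file with B and B shares a flow with C, all three go to one builder, which avoids two builders editing code that one EBO row ties together.
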
The skill's demo fixture is the real chat you just ran: "after a lead booked a call, its card disappeared from the pipeline instead of moving to Booked… refreshing, I got the popup." That scrambled report compiles to one EBO row:
| You do | You should see | Element that changes | What changes underneath | Must NOT happen |
|---|---|---|---|---|
| A lead who earlier sent a negative reply self-books a call | The card moves to Booked / Sales Call Prep, live, no refresh | The kanban card (deal_col() routing) | stage → appt_booked AND neg_reply cleared so column-ranking no longer pins it | Card must NOT stay in 🚫 Negative Replies; must NOT vanish; must NOT need a refresh |
A real constraint this example exposes: there is no state-injection endpoint to set neg_reply on a sentinel lead, so this row can't be purely live-clicked — it's verified at the pytest layer. That's why Q8 asks how to route rows that aren't live-QA-able (the honest alternative to a false green).
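
The routing that Q8 asks about could look roughly like this. It is a sketch under assumptions, not a decided design: the precondition names, the `LIVE_DRIVABLE_STATES` set, the `RowRequirements` type, and the exact meaning of "blocked" (no live path and no test cover) are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Hypothetical: states a QA twin can reach by clicking through the live app.
LIVE_DRIVABLE_STATES: FrozenSet[str] = frozenset({"lead_created", "call_booked"})


@dataclass(frozen=True)
class RowRequirements:
    row_id: str
    preconditions: FrozenSet[str]  # states the row needs before it can run
    has_pytest_cover: bool         # a pytest-layer check exists for the row


def route_row(row: RowRequirements,
              drivable: FrozenSet[str] = LIVE_DRIVABLE_STATES) -> str:
    """Classify a row as "live", "test-layer", or "blocked"."""
    missing = row.preconditions - drivable
    if not missing:
        return "live"        # every precondition can be set up by clicking
    if row.has_pytest_cover:
        return "test-layer"  # verified below the UI instead of a false green
    return "blocked"         # no honest way to verify yet; surface it


# The demo row: neg_reply cannot be injected on a sentinel lead.
demo = RowRequirements(
    row_id="neg-reply-self-book",
    preconditions=frozenset({"lead_created", "neg_reply_set", "call_booked"}),
    has_pytest_cover=True,
)
print(route_row(demo))  # test-layer
```

The point of the three-way split is that a row never silently turns green just because the live app could not reach its starting state.
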
| Skill | How used | Role in the factory |
|---|---|---|
| user-journeys | 🟢 whole | The EBO author + Notion human-gate. Forking it would split "correct" into two drifting copies, the exact problem we're solving. |
| visual-qa-ultra | 🟢 engine | Both the per-builder QA and the final QA. Its trust-core makes "green" a mechanical script conjunction (pixels ∧ state-delta ∧ no-collision ∧ coverage); that's what lets a Sonnet twin be trusted to collect evidence + emit verdicts. |
| plan-orchestrate | 🟢 pattern | The file-bus + re-spawn + 30-min heartbeat + answer-routing pattern, reused verbatim. EBO Factory is its parallel sibling (N fixes vs one feature). |
| resolving-merge-conflicts | 🟢 inline | Invoked when a sequential merge conflicts (preserve both intents; never --abort). |
| board-card-qa · ai-usage-qa | 🟢 fast gate | Cheap sentinel-gated DOM/API per-behavior checks during convergence, before the heavier visual pass. |
| graphify · tdd · ponytail · document-changelog | 🟢 per-builder | Every builder orients via the graph, builds TDD + lazy, documents the changelog (the Stop-hook enforces it). |
| skill-creator | 🟡 build-time | Scaffolds + registers the new skill (the next instance runs this). |
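
The "green is a mechanical conjunction" rule, combined with the QA twin's "red freely, green only when every gate passes, escalate anything ambiguous" behaviour, can be sketched as a three-valued verdict. This is not visual-qa-ultra's real interface: the gate keys, the use of `None` for "could not decide", and the `verdict` helper are assumptions for illustration.

```python
from typing import Dict, Optional

# The four gates named in the table; the key spellings are hypothetical.
GATES = ("pixels", "state_delta", "no_collision", "coverage")


def verdict(gate_results: Dict[str, Optional[bool]]) -> str:
    """Combine per-gate results into "green", "red", or "escalate".

    True = gate passed, False = gate failed,
    None or a missing key = the gate could not decide.
    """
    results = [gate_results.get(gate) for gate in GATES]
    if any(result is False for result in results):
        return "red"       # any failed gate is enough to turn the row red
    if all(result is True for result in results):
        return "green"     # green only as the conjunction of all four gates
    return "escalate"      # ambiguous: hand it to the orchestrator
```

A missing gate result is treated the same as an undecided one, so a twin that skips a gate can never produce a green by omission.
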
| Finding (sourced) | What it changes in the design |
|---|---|
| Self-verification ≈ 0% reliable; independent verifier ≈ 100% (GitHub 2026) | The QA twin must have no shared context with its builder, and the builder may never sign its own oracle. Validates the Sonnet-twin split. |
| "TDD Prompting Paradox" — verbose test-first can raise regressions; agents game their own tests (TDAD 2026) | Don't just tell the builder "write a failing test." Q4 asks whether the QA twin should audit each test's oracle strength, or whether a separate agent freezes the tests the builder can't edit. |
| 80% of agent-written tests are "test theater" — null/exists checks, no real oracle (arXiv 2606.18168) | Every EBO row must carry a real state-delta + must-NOT, not just a "you see". Already baked into the EBO schema; the audit in Q4 enforces it. |
| Capability routing > category routing; never trust model self-confidence — use test pass/fail (RouteLLM, ICLR'25) | The model-pick rule (§4) scores task size/risk, and escalation signals are test results, not the agent saying "I'm confident". Feeds Q3. |
| Loading screens are the #1 false-judgment source in agent UI QA | Already handled — visual-qa-ultra hard-stops on a Loading… frame and never judges it. No change; reassuring. |
| Human gate must be blocking, not advisory (Kiro / NIST 2026) | The Notion sign-off blocks the build. Q7 confirms blocking vs a timeout-default. |
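
The "test theater" finding above implies a cheap structural check before any row reaches a builder: reject rows that only say what you see. The sketch below checks presence only, not oracle strength (that judgment is what Q4's audit would add). The `OracleRow` type and `oracle_gaps` helper are hypothetical; the field names mirror the EBO columns.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OracleRow:
    you_do: str
    you_see: str
    element: str
    state_delta: str  # what changes underneath
    must_not: str     # what must NOT happen


def oracle_gaps(row: OracleRow) -> List[str]:
    """Return the names of oracle fields that are empty (presence check only)."""
    gaps = []
    if not row.state_delta.strip():
        gaps.append("state_delta")
    if not row.must_not.strip():
        gaps.append("must_not")
    return gaps


# The demo row from this document passes the presence check.
demo_row = OracleRow(
    you_do="A lead who earlier sent a negative reply self-books a call",
    you_see="The card moves to Booked / Sales Call Prep",
    element="The kanban card (deal_col() routing)",
    state_delta="stage -> appt_booked AND neg_reply cleared",
    must_not="Card must NOT stay in Negative Replies; must NOT vanish",
)
print(oracle_gaps(demo_row))  # []
```

A row that returns a non-empty list would go back to the EBO author rather than forward to a builder.
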
Every one of the 10 questions is about how a sub-process executes, never about what the system achieves. ⚠️ = big blast radius. My recommended pick is in the chat answer-block; this is just the map.
| Q | Decides | The execution choice |
|---|---|---|
| 1⚠️ | EBO authoring | Reuse /user-journeys whole + thin adapters, or variant/absorb/lightweight-for-bugs? |
| 2⚠️ | Signed EBO → QA oracle | Deterministic compiler to visual-qa-ultra ebo.json slices, or author the QA oracle directly? |
| 3⚠️ | Builder model selection | Sonnet-default + scripted escalation triggers, or Opus-default, or free judgment? |
| 4⚠️ | TDD authorship / anti-gaming | Self-TDD, or frozen tests the builder can't edit, or self-TDD + QA oracle-strength audit? |
| 5 | God-module collisions | Serialise builders touching flows.py/dash.py, or parallel-then-resolve-at-merge? |
| 6 | Per-builder QA timing | After the builder commits, concurrent, or cheap-during + full-vqu-after? |
| 7⚠️ | Notion confirm gate | Blocking, heartbeat-then-proceed, or partial-sign? |
| 8⚠️ | Rows not live-clickable | Classify live/test-layer/blocked & route each, or build a state-injection QA endpoint? |
| 9 | Merge conflict authority | Doc-auto + app-code human-gated, full-auto, or auto + post-merge review agent? |
| 10 | Push / output | Never push, push feature branches (announce-first), or open a PR? |
ZZ… fixtures are drivable) + send-blocklist · Notion = archive-not-delete · Instantly warmup mail invisible everywhere · never push main without your explicit trigger · git bundle backup before any merge · QA twins can never self-sign the oracle · sweep every fixture after.
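
The merge-related items in the list above (git bundle backup before any merge, never push main), together with the Converge phase, can be sketched as a command plan that is built first and only then executed. This is an illustration under assumptions: the `converge_plan` helper, the default branch name, and the bundle path are hypothetical. The git subcommands themselves (`bundle create`, `switch -c`, `merge --no-ff`) are standard, and the pinned SHA is expected to come from something like `git rev-parse main` beforehand.

```python
import re
from typing import List

_FULL_SHA = re.compile(r"[0-9a-f]{40}([0-9a-f]{24})?")  # SHA-1 or SHA-256


def converge_plan(main_sha: str,
                  branches: List[str],
                  integration_branch: str = "integration/ebo-run",
                  bundle_path: str = "backup.bundle") -> List[List[str]]:
    """Build the ordered git commands for the Converge phase, without running them."""
    if not _FULL_SHA.fullmatch(main_sha):
        # Pin to an immutable commit, never to a moving ref like "main".
        raise ValueError("main_sha must be a full commit SHA")
    plan = [
        # 1. Back up every ref before touching anything.
        ["git", "bundle", "create", bundle_path, "--all"],
        # 2. Start a fresh integration branch from the pinned commit.
        ["git", "switch", "-c", integration_branch, main_sha],
    ]
    # 3. Merge builder branches one at a time, always with a merge commit.
    for branch in branches:
        plan.append(["git", "merge", "--no-ff", "--no-edit", branch])
    return plan
```

Each command could then be run with `subprocess.run(cmd, check=True)`, stopping at the first conflict so it can be resolved inline. The plan contains no push step, so nothing leaves the machine without your explicit trigger.
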