tl;dr: I gutted everything from the original post. Same philosophy, completely different tooling. The biggest shift: stop telling agents how. Tell them what done looks like.
Six weeks ago I described an 8-step workflow using OpenCode, four different models, Beads for task management, and fire-and-forget sub-agents. The philosophy was right. The implementation was already stale by the time I published it. Here's what actually survived, what got replaced, and what I learned.
→ WHAT CHANGED
One model, not four
The original setup rotated Opus, GPT-5.2, Gemini 3 Pro, and GLM-4.7 for parallel critiques. The idea was that different models catch different things. That's true. But the coordination overhead of running a multi-model bench is real, and Opus 4.6 narrowed the gap enough that one model doing multiple passes beats several models doing one pass each.
Not a permanent conclusion. Just where the tradeoff lands today.
Claude Code, not OpenCode
Went all-in on Claude Code. Not because it's objectively better at everything, but because the ecosystem around it—native Agent Teams, native Tasks, Skills—means I'm not building plumbing anymore. I'm building workflows.
Agent Teams, not sub-agents
This is the biggest practical change. The old approach was fire-and-forget: spawn three sub-agents, wait for them to finish, synthesize. No coordination between them. If one agent's work conflicted with another's, you found out after the fact.
Agent Teams have a supervisor. The supervisor reads all the tasks, delegates them to teammates—generally one teammate per task. Each teammate does their slice: write the code, run the tests, type-check, lint, commit. Everything they learned or struggled with goes in the commit message. The supervisor reviews each one, either approves it or tells them to go fix their shit.
I can generally get through ~30 tasks on one supervisor context window. Compression in the supervisor doesn't actually cause problems the way it would in a worker, because all the important context is captured in the tasks themselves. The supervisor only needs to track one question: did this agent diverge from the plan?
Outcome-focused tasks, not imperative ones
This is the philosophical shift that made Agent Teams actually work. Tasks used to be imperative: "Go change this line of code. Go change that line of code." Now they're centered on user outcomes. Acceptance criteria. Feedback loops. What does done look like?
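Concretely, the difference looks something like this. A hypothetical task, every name in it invented for illustration:

```markdown
## Task: users can restore a deleted draft

Outcome: deleting a draft moves it to a trash view where it can be
restored for 30 days.

Acceptance criteria:
- Deleting a draft no longer hard-deletes the row
- A restored draft reappears in the drafts list with its content intact
- Trashed drafts older than 30 days get purged by the existing cleanup job

Feedback loop: the drafts test suite plus a trash-view smoke test, both green.
```

Nothing in there says which files to touch. The agent decides that; the task only defines done.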
Opus 4.6 can execute on that level of abstraction without producing garbage. Earlier models couldn't. That's the unlock.
I don't even read the tasks regularly anymore. The brainstorming and plan critique get my full attention. The tasks are for the agents.
Skills, not commands
Replaced all my old Claude Code commands with native Skills. Better discoverability, proper frontmatter, reference docs that get loaded contextually. The workflow is the same—brainstorm, critique, task, implement, polish, remember—but the machinery is first-class now instead of duct-taped.
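A Skill is just a directory with a SKILL.md whose frontmatter tells Claude when to load it. Here's a sketch of what my design-critique skill might look like, reconstructed from the behavior described in the workflow below; the frontmatter fields are the real ones Skills use, the body and file names are illustrative:

```markdown
---
name: design-critique
description: Reviews a plan in docs/plans/ for holes, scope creep, and
  ideas that will fall apart in implementation. Use when asked to
  critique a plan.
---

Read the plan, then check it against references/checklist.md.
Cap suggestions at 5 new ideas per pass.
Write the revised plan as a new version (v2, v3, ...) next to the original.
```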
→ TWO WORKFLOWS, NOT ONE
The flywheel from the original post works for greenfield features. Plan hard, implement once. But there's a whole class of work where upfront planning actually makes things worse.
The Ralph Loop
Migrations. Mechanical refactors. Anything where the unit of work is "do the same transformation to 200 files." Planning all of that ahead of time means the plan will miss things, and now you're debugging a stale plan instead of just letting the agent figure it out file by file.
For these, I use a bash loop. Pick a file that matches a pattern. Transform it. Run the checks. Green? Next file. Red? Fix it, then next file. I call it the Ralph Loop (after Ralph Wiggum). It's 20 minutes of fiddling to set up and then it runs overnight.
Example: migrating our test runner from Deno's built-in to Vitest. Tried the plan-and-task approach first. Didn't work well—too many edge cases the plan couldn't anticipate. Switched to a Ralph Loop: grep for Deno.test, swap to Vitest, run until green, next file. Shipped overnight.
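The loop itself is nothing fancy. A minimal sketch of the Vitest version, assuming Claude Code's non-interactive `-p` flag; the pattern, prompts, and check commands are illustrative, not my exact script:

```bash
#!/usr/bin/env bash
# Ralph Loop sketch: one file per iteration, verified green before moving on.

while file=$(grep -rl 'Deno\.test(' src --include='*.ts' | head -n 1); [ -n "$file" ]; do
  # Transform one file. `claude -p` runs Claude Code non-interactively.
  claude -p "Migrate $file from Deno.test to Vitest. Match conventions in existing Vitest files."

  # Green? Commit and move on. Red? The same file goes back to the agent.
  # (In practice you'd cap retries so one stuck file can't eat the whole night.)
  until npx vitest run "$file" && npx tsc --noEmit; do
    claude -p "Vitest or tsc is failing for $file after the migration. Fix it."
  done

  git add -A && git commit -m "ralph: migrate $file to Vitest"
done
```

The grep doubles as the progress tracker: once a file stops matching the old pattern, it never gets picked again.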
Knowing which workflow to reach for
- • Flywheel: Novel features, architectural changes, anything with design decisions. Plan hard, implement once.
- • Ralph Loop: Mechanical transformations, migrations, anything where the pattern is clear but the volume is high. Skip the plan, loop through it.
→ HARNESSES
The thing I underestimated in the original post. Harnesses—scripts and tools that let agents verify their own work—are the highest-leverage investment I make.
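The canonical harness is boring: one script that answers "is my work correct?" with an exit code. A sketch, assuming a TypeScript stack; swap in whatever commands your project uses:

```bash
#!/usr/bin/env bash
# check.sh -- the one command an agent runs to verify its own work.
# set -e makes any failing step exit nonzero before "ALL GREEN" prints.
set -euo pipefail

echo "== types ==" && npx tsc --noEmit
echo "== lint ==" && npx eslint . --max-warnings 0
echo "== tests ==" && npx vitest run
echo "ALL GREEN"
```

The exit code is the whole product. Agents course-correct reliably off red/green; they course-correct much less reliably off prose.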
My time splits roughly 50/50 between iterating on harnesses, skills, and workflow tooling, and actually shipping features. That sounds wrong. It isn't. The 50% I spend on tooling makes the other 50% radically more effective than spending 80% on features would be.
If I were building a CLI, I could probably double my output: everything about a CLI is easily and rapidly verifiable. The bottleneck right now is QA on the user experience side: visual stuff, interaction flows, the things that are hard to assert programmatically. Agent-controlled browsers are getting there. Not great yet. Fine.
→ THE UPDATED WORKFLOW
Same shape as before, different guts.
- 1. Brainstorm — Socratic back-and-forth. Agent asks clarifying questions, proposes approaches, pushes back on bad ideas. Uses deep research and sub-agents to build context. Output: a plan in docs/plans/.
- 2. Critique — Run 2-4 times. A design critique skill goes through the plan checking for holes, dumb ideas, things that will fall apart. Caps at 5 new ideas per pass to prevent scope creep. Outputs versioned plans (v2, v3...).
- 3. My review — Read through it myself. Get super nitpicky. Get it really, really tight. This is where my time goes.
- 4. Create tasks — Break the plan into outcome-focused tasks. Acceptance criteria, not instructions. Dependency graphs. Tracer bullets for risky work.
- 5. Implement — Spawn an Agent Team. Supervisor delegates tasks to teammates. Each teammate: TDD it if possible, type-check, lint, commit. Supervisor reviews each commit, approves or sends back.
- 6. Polish — Four specialized review agents in parallel: lint/types, slop/comments, test quality, design/correctness. Fixes get applied automatically.
- 7. QA — Put on my QA hat. Actually use the thing. Debug issues with extended thinking. File more tasks if needed, loop back through implement.
- 8. Remember — Mine learnings from the commits and reviews. Curate the good ones into CLAUDE.md so future sessions benefit. This is how the flywheel actually compounds.
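The output of that last step is just curated bullet points. Hypothetical examples of the kind of entry that earns a spot in CLAUDE.md:

```markdown
<!-- CLAUDE.md, learnings section (entries invented for illustration) -->
- Vitest: fake timers leak across parallel suites; always reset in afterEach.
- The API client already retries on 429; never add retry logic in callers.
- Plans that touch the auth module need a tracer-bullet task first.
```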
→ WHAT I LEARNED
- • It's easier to iterate in planning than in execution. This was the thesis of the original post and it's even more true now. A bad design caught in critique costs 30 seconds. Caught in an Agent Team running 8 workers in parallel, it costs an hour and a messy revert.
- • Invest in your tooling like it's product work. 50% of my time goes to skills, harnesses, and workflow. That ratio feels high. The output says otherwise.
- • You have to keep evolving. What's a pattern and what's an anti-pattern changes every couple of weeks. The workflow from six weeks ago was already accumulating debt when I published it. Stay uncomfortable.
- • Context window management still matters, just differently. Compression in a supervisor is fine because task context is self-contained. Compression in a worker is not. Design your information architecture around this.
- • Two workflows, not one. The flywheel for novel work. The Ralph Loop for mechanical work. Reaching for the wrong one wastes time in both directions.
→ THE TOOLS
- • Claude Code with Opus 4.6
- • Agent Teams + native Tasks (replaced Beads entirely)
- • Claude Code Skills (replaced commands)
- • github.com/Vpr99/agent-setup (public repo, updated)
→ WHAT I'M THINKING ABOUT NEXT
- • Agent-controlled browsers for UX QA. The bottleneck is visual verification. Zoomer Selenium is fine but not great.
- • Eval harnesses for agents—tools built for agents to diagnose and evaluate other agents' work.
- • Whether the multi-model bench comes back as model capabilities diverge again.