tl;dr: Iterate on plans, not code. Front-load the arguing, let multiple models poke holes, only implement once there's nothing left to debate.
After a year of pairing with agents daily, the biggest shift wasn't prompt engineering—it was workflow structure. The gap between folks who get consistent results and those who bounce off isn't model quality. It's knowing when to stop prompting and start planning.
→ KEY TAKEAWAYS
- • A bad design caught in a plan costs 30 seconds. Caught in code, it costs an hour.
- • One model is a perspective. Three is a jury. Parallel critiques surface contradictions.
- • Feedback loops are non-negotiable. Give agents a linter, type checker, and test runner.
- • Context windows aren't infinite. Keep sessions bounded or you'll get dumb decisions.
- • Tasks should be self-contained. Everything a "future agent" needs to know lives in the task itself.
→ THE WORKFLOW
Step 1: Brainstorm
Start with a high-level explanation of what you're trying to achieve. Work back-and-forth with the agent to design the solution.
- • Agent asks clarifying questions one at a time
- • Proposes 2-3 approaches with tradeoffs
- • You converge on a design
- • Output: a plan document in docs/plans/
Step 2: Design Review
Run 1-2x to iterate on the plan.
- • Spawns 3 sub-agents in parallel (different models)
- • Each reads the codebase independently, critiques the plan
- • A synthesizer consolidates feedback into a v2 (then v3, v4...)
- • Unless the review surfaces changes I disagree with, I don't read the plan myself until it's been through at least one review pass
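The fan-out shape of the review step can be sketched in a few lines. This is a sketch, not my actual setup: `critique` is a hypothetical stand-in for dispatching a sub-agent with a given model, and the model names are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for spawning a sub-agent: in practice this would
# invoke your agent runner with the given model and let it read the codebase.
def critique(model: str, plan: str) -> str:
    return f"[{model}] critique of: {plan[:40]}"

def design_review(plan: str, models: list[str]) -> list[str]:
    # Fan out: each model critiques the plan independently, in parallel.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = [pool.submit(critique, m, plan) for m in models]
        return [f.result() for f in futures]

critiques = design_review("Add rate limiting to the API", ["opus", "gpt", "gemini"])
# A synthesizer agent then consolidates these critiques into a v2 plan.
```

The point of the parallelism isn't speed; it's that each critique is formed without seeing the others, so contradictions between them are real signal.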
Step 3: Make Tasks
Convert the plan into atomic work units.
- • Each task is self-contained with context, rationale, and acceptance criteria
- • Everything a "future agent" needs to know is in the task itself
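One possible shape for a self-contained task, sketched as a dataclass. The field names and the example content are my own illustration, not a Beads schema.

```python
from dataclasses import dataclass, field

# Illustrative task shape: everything a future agent needs, in one place.
@dataclass
class Task:
    id: str
    title: str
    context: str                  # why this work exists, relevant files, prior decisions
    rationale: str                # why this approach was chosen over alternatives
    acceptance_criteria: list[str] = field(default_factory=list)
    done: bool = False

task = Task(
    id="rate-limit-01",
    title="Add a token-bucket limiter to the API middleware",
    context="See the plan in docs/plans/. Middleware lives in api/middleware/.",
    rationale="Token bucket chosen over fixed window to allow short bursts.",
    acceptance_criteria=[
        "Requests over the limit return 429",
        "Limits are configurable per route",
        "Unit tests cover burst and steady-state traffic",
    ],
)
```

The test of a good task: an agent with no memory of the planning conversation could pick it up cold and know what to build and why.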
Step 4: Improve Tasks
Run 1-2x until the tasks are high quality.
- • Single-agent version: one pass, returns unresolved questions
- • Multi-agent version: orchestrated loop—answers its own questions via codebase research, only surfaces questions that genuinely need human judgment
- • Stop when agent says "no further questions" or starts asking low-value questions
- • At this point, just skim for glaring mistakes
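The stopping condition is the important part of this step. Here's a rough sketch of the control flow, where `improve_pass` is a stand-in for one agent pass that refines the tasks and returns only the questions it couldn't answer from codebase research.

```python
# Stand-in for one improve pass: refines tasks, returns unanswered questions.
def improve_pass(tasks: list[str], round_no: int) -> tuple[list[str], list[str]]:
    refined = [f"{t} (refined x{round_no})" for t in tasks]
    questions = [] if round_no >= 2 else ["Should limits apply to internal traffic?"]
    return refined, questions

def improve_until_done(tasks: list[str], max_rounds: int = 3) -> list[str]:
    for round_no in range(1, max_rounds + 1):
        tasks, questions = improve_pass(tasks, round_no)
        if not questions:
            break  # "no further questions" -- stop iterating
        # Otherwise: surface the questions for human judgment, answer, loop.
    return tasks

final = improve_until_done(["Add token-bucket limiter"])
```

The `max_rounds` cap matters in practice: past a couple of passes, the questions that come back tend to be low-value, which is the other signal to stop.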
Step 5: Execute
Go for a walk.
- • Works through open tasks sequentially
- • Spawns sub-agents to complete each task
- • Each sub-agent: picks up the task, does the work, commits, marks complete
- • Reports back issues, bugs, anything notable
- • Because all the work is done in sub-agents, you can chew through a whole feature in ~80k tokens of main agent context
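The execute step is a plain sequential loop; the trick is that each task runs in a fresh sub-agent, so its context never enters the main session. A minimal sketch, where `run_subagent` is a hypothetical stand-in for spawning an agent:

```python
# Stand-in for spawning a sub-agent: it picks up the task, does the work,
# commits, and returns a short report of anything notable.
def run_subagent(task: dict) -> dict:
    return {"task": task["id"], "committed": True, "notes": []}

def execute(tasks: list[dict]) -> list[dict]:
    reports = []
    for task in tasks:
        if task.get("done"):
            continue
        report = run_subagent(task)  # heavy context stays inside the sub-agent
        task["done"] = True          # mark complete
        reports.append(report)       # only the short report reaches the main agent
    return reports

reports = execute([{"id": "rate-limit-01"}, {"id": "rate-limit-02", "done": True}])
```

Because the main agent only ever sees the task list and the reports, its context grows slowly no matter how large each individual task's work was.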
Step 6: Code Review & More Tasks
See what the swarm comes back with.
- • Spawns 3 sub-agents in parallel to review the branch
- • Synthesizes feedback: what to fix, what to ignore, priority order
- • Argue the points, figure out what you're actually unhappy with
- • File new tasks for issues worth fixing
- • Improve the new tasks
- • Execute again
Step 7: QA
Put on your QA hat. Actually use the feature.
- • Bonus points: give the LLM tools to do this autonomously (CLI, browser-use MCP, etc.)
- • Point out issues
- • Debug systematically with extended thinking for anything weird
- • File tasks liberally for anything that needs fixing
- • Loop back through improve → execute as needed
Step 8: Final Review
After all that, review the code once in GitHub.
- • By now you've had multiple model perspectives on design
- • Multiple model perspectives on implementation
- • Iterated on issues found during QA
- • This review is a sanity check, not the first line of defense
→ A COUPLE NOTES
Tasks are for the agents, not for you. We still use Linear to manage bugs, features, and other work. Tasks just let us break work into small, unambiguous pieces that an individual agent can pick up.
Watch the context window. For Opus, I keep individual sessions under 70-90k tokens. Above ~130k, you start getting dumb decisions. If I'm getting over that, I stop and ask for a summary. Then open a new session:
We were working on X—here's a summary. Use extended thinking and familiarize yourself. Let's continue...
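The budgeting logic amounts to two thresholds. A rough sketch, where the token count is assumed to come from whatever usage reporting your agent tool exposes:

```python
SOFT_LIMIT = 90_000    # upper end of the comfortable working range
HARD_LIMIT = 130_000   # decision quality degrades noticeably past this

def context_action(tokens_used: int) -> str:
    if tokens_used >= HARD_LIMIT:
        return "summarize-and-restart"  # ask for a summary, open a fresh session
    if tokens_used >= SOFT_LIMIT:
        return "wrap-up"                # finish the current task, don't start new ones
    return "continue"
```

The exact numbers are for Opus and will shift by model; the structure (a soft warning band and a hard restart line) is the transferable part.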
Feedback loops are critical. Give your agent a way to interact with your software. A linter, a type checker, a test runner. Results will be 2-5x better—especially in implementation. Give them the ability to know when they're making a mistake and self-correct.
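A feedback loop can be as simple as a check script the agent runs after every change. A minimal sketch; the specific commands (ruff, mypy, pytest) are placeholder examples, swap in your project's actual linter, type checker, and test runner.

```python
import subprocess

# Placeholder check commands -- substitute your project's real tooling.
CHECKS = [
    ["ruff", "check", "."],
    ["mypy", "src/"],
    ["pytest", "-q"],
]

def run_checks(checks=CHECKS) -> list[str]:
    failures = []
    for cmd in checks:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # Capture the tool's output so the agent can read it and self-correct.
            failures.append(f"{' '.join(cmd)} failed:\n{result.stdout}{result.stderr}")
    return failures
```

An empty return means the change is clean; a non-empty one hands the agent exactly the error text it needs to fix its own mistake.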
If an agent makes a bad choice, call it out. Tell it to write in claude/agents.md to not make this mistake again. Doesn't always stick, but it does help.
→ THE TOOLS
The principles above are tool-agnostic. But if you're curious:
- • OpenCode with Claude Opus as the main agent
- • Sub-agents for parallel work: Opus, GPT-5.2, Gemini 3 Pro, GLM-4.7 (rotating experimental slot)
- • Beads for agent task management
- • bv to visualize/triage
→ THE REPO
I keep my agent skills and workflow configuration in a public repo: github.com/Vpr99/agent-setup
→ THIS SEEMS LIKE A LOT OF STEPS...
- • Most steps are hands-off—you kick them off and wait
- • You're not reading anything until it's been through 1-2 refinement passes
- • For smaller stuff, skip steps—straightforward tasks go straight to implementation
→ WHAT I'M ALSO THINKING ABOUT
Making a "real" executor (aka a bash loop) and maybe replacing Beads with a simple prd.json file. Read: Tips for AI Coding with Ralph Wiggum
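The "real executor" idea fits in about a dozen lines. A sketch under stated assumptions: the file shape is an invented prd.json-style list of tasks, and `agent_cmd` is a placeholder for whatever CLI actually invokes your agent.

```python
import json
import subprocess

def run_loop(path: str, agent_cmd=("agent", "run")) -> None:
    # agent_cmd is a hypothetical CLI -- swap in your real agent invocation.
    with open(path) as f:
        tasks = json.load(f)
    for task in tasks:
        if task.get("done"):
            continue
        subprocess.run([*agent_cmd, task["prompt"]], check=True)
        task["done"] = True
        with open(path, "w") as f:
            json.dump(tasks, f, indent=2)  # persist progress after each task
```

Writing the file back after every task means the loop is resumable: kill it, restart it, and it picks up at the first task not yet marked done.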