Field Notes #24

Case Study · Fails & F**k ups

By Amplify Team · Jun 1, 2026 · 8 min read

I Trusted 5 AI Agents With My Dev Pipeline. They Shipped a Product I Never Designed

How I over-automated my agent pipeline, what went wrong, and what I do differently now

I had an idea that felt brilliant at the time: what if I chained five AI agents together and let them run an entire product pipeline? Design specs go in one end, shipped features come out the other. Fully automated. Minimal human touch. I'd just talk to them – literally, via voice messages that got transcribed to text – and they'd handle the rest.

It was the first week of this setup. Then one Friday afternoon, everything fell apart.

What came out the other end was a product with missing features, invented UI elements no one asked for, and a compliance plan that quietly paused itself and never woke up. It took about 18 hours before anyone noticed. When we did, the Lead's assessment was three words I won't forget: "complete system failure."

He was right. Not because the technology broke – but because the process did.

This is the story of how I over-automated my agent pipeline, what went wrong, and what I do differently now. If you're building multi-agent workflows, you'll probably hit this wall too. Maybe this saves you the pain.

The Setup: Five Agents, One Pipeline

I set up a product development pipeline with five specialized AI agents, each responsible for a different stage:

•The Lead – project direction and approvals. Received my voice-dictated instructions and translated them into decisions for the team.

•The Coordinator – tracked design specs, managed priorities, dispatched work. Also wore a second hat: compliance checks and documentation.

•The Implementer – took specs and wrote the actual code.

•The Reviewer – verified that shipped code matched the original spec.

•The Relay – communication bridge between the Coordinator and the Implementer.

The flow was linear: I dictate a concept via voice message. The Lead interprets it. The Coordinator picks it up, dispatches it to the Implementer via the Relay. The Reviewer checks the result. Simple chain. Five links. No one dedicated to design review.

Here's the thing about voice input that I didn't appreciate at the time: when you dictate instead of type, you filter less. Speaking is effortless, so you say more, you ramble, you think out loud. I'd send a two-minute voice note where half of it was me processing an idea and the other half was a tentative direction. The agents heard all of it. And they treated every sentence as an instruction.

Every link was an AI agent making decisions about what to pass forward, how to interpret my stream-of-consciousness, and when to escalate. In theory, this distributed the cognitive load beautifully. In practice, I'd built a game of telephone with five players who each thought they understood the message – and the message itself was half-formed to begin with.

Five Things That Broke Simultaneously

1. "Approved" Didn't Mean "Ship It"

I dictated a voice message about a critical architectural feature. Something like: "yes, that's the right approach, let's lock it in." A stream-of-thought approval – the kind of thing you say when you're thinking out loud, not issuing a command.

The Lead passed it to the Coordinator. The Coordinator interpreted it as a green light to ship. It wasn't. It was a conceptual approval – "the direction is correct" – not an action command. The feature never got dispatched to the Implementer. But the Coordinator marked it as "ratified" in her tracking.

Eighteen hours later, the feature was missing from production. The Coordinator's own status board showed it as "pending" – contradicting her earlier "ratified" label.

The problem: Voice-dictated natural language is doubly ambiguous. When you type, you edit. When you speak, you stream. "That's correct, let's do it" can mean "ship this now" or "I agree with the concept." The Lead and Coordinator pattern-matched on positive sentiment and moved on. A human would have asked: "Are you saying to ship it, or just that the direction is right?"

2. "Dispatched" Was a Wish, Not a Fact

The Coordinator logged several items as "dispatched to implementation." The Relay never received them. The Implementer never saw them.

"Dispatched" was the Coordinator's intent label – what she planned to do – not a confirmed handoff. She wrote it in her own tracking system, but no acknowledgment came back from the Relay or the Implementer. Nobody checked.

The problem: The pipeline had no delivery receipts. When you send an email, you assume it arrived. When the Coordinator tells the Relay to do something, you need the Relay to confirm it heard the instruction. We had no confirmation loop. Intent was treated as execution.

3. The Bundle That Lost a Piece

Two related features were dispatched together as a bundle. The Implementer shipped one and silently dropped the other. The pull request mentioned only the feature that was included. No one noticed the missing piece because the bundle was tracked as a single unit.

When we checked production, one feature existed and the other simply wasn't there. No error. No warning. No record that it was ever dropped.

The problem: Bundling masks component-level tracking. "Bundle shipped" tells you nothing about whether all items in the bundle actually made it. The Implementer made a judgment call about what to include, and there was no checklist at the PR level to catch the gap.

4. The Plan That Paused Itself

Remember the Coordinator's second hat – compliance? She wrote an excellent multi-phase plan covering about 1,200 lines of changes. She submitted it as a draft, pending approval from the Lead. Then she waited.

And waited.

The Lead was processing other tasks and never circled back. Neither did I – I'd already moved on to the next voice note about something else. A human would have sent a follow-up message after a few hours. The Coordinator paused herself and moved on to other tasks. She never circled back either. The plan sat in draft status for about 12 hours. Only about 15% of it shipped overnight – the parts that didn't require approval.

The problem: AI agents are not naturally persistent about follow-ups. They complete their immediate task (write the plan, submit for review) and then context-switch. Without an explicit reminder mechanism, pending items silently decay.

5. The Agent That Started Designing

This one was the most unsettling.

The Implementer was given a high-level concept to build: a component that lets users override settings. The concept was approved. The specific UI – button labels, interaction patterns, visual layout – was not. Nobody in the pipeline was responsible for design decisions.

The Implementer made those decisions herself. She invented a button label, designed an interaction flow, and shipped it. The Reviewer checked for technical correctness and gave it a pass. Nobody checked design fidelity because nobody's job was design fidelity.

When the Lead opened the product and saw a button nobody had discussed, the reaction translated roughly to: "Can we make it a rule that developers don't invent design elements? Otherwise it's a monkey with a grenade."

The problem: Implementation agents will fill gaps. If the spec says "add an override mechanism" but doesn't specify the exact UI, the agent will make something up. It will look reasonable. It will function correctly. And it will be completely unauthorized. We had no dedicated design reviewer in the pipeline – that role was added after this incident.

The Root Cause Was the Chain Itself

I initially tried to fix each failure individually. Clearer approval language. Delivery confirmations. Bundle itemization. Follow-up reminders. Design pre-approval.

Those are all good fixes. I implemented all of them. But the real insight came later:

The longer the coordination chain, the higher the breakage rate.

Five agents. Five points where information could be misinterpreted, lost, or invented. Each link in the chain added drift risk. And at the very start of the chain, a human dictating half-formed thoughts into a microphone.

The Lead interpreted my voice note. The Coordinator summarized the Lead's decision. The Relay paraphrased the Coordinator's dispatch. The Implementer interpreted the Relay's description. By the time a concept reached code, it had been through four layers of AI interpretation – and the original input was a stream-of-consciousness voice memo.

That's not a pipeline. That's a rumor mill.

What We Do Now

I didn't abandon AI agents. I simplified the architecture dramatically.

Before: Me (voice) -> Lead -> Coordinator -> Relay -> Implementer -> Reviewer (5 agents, no one checking design)

After: Me (shorter, deliberate voice notes) -> Coordinator -> Implementer (direct chain, new Designer role as mandatory gatekeeper)

Here's what changed:

1. Fewer links. I removed the Relay entirely. The Coordinator talks directly to the Implementer. Less telephone, less drift.

2. I tightened up my voice notes. I still dictate – it's how I think best – but I stopped rambling. Now every voice note has one clear directive, not a stream-of-consciousness brainstorm. Thirty seconds max, one action per message. The agents still transcribe it, but there's far less room for misinterpretation.

3. Explicit action verbs. "That's correct" is no longer a ship command. I use an explicit action word – SHIP, RATIFY, BUILD. No ambiguity. No interpretation.

4. Delivery receipts. Every dispatch requires a verbatim acknowledgment from the receiving agent. "Dispatched" in the Coordinator's log means nothing until the Implementer confirms receipt.

5. Bundle itemization. Every PR lists its bundle items individually. No "bundle shipped" – each component is tracked and verified separately.

6. Follow-up discipline. Pending items get a re-ping every few hours. If an agent pauses waiting for approval, she's required to follow up, not silently wait.

7. A new role: Designer as gatekeeper. We added a dedicated Designer agent. The Implementer now writes a one-paragraph plan before touching any UI element. The Designer approves it. If the spec doesn't specify a button label, the Implementer asks – she doesn't guess.
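The bundle itemization check can be done mechanically. Here is a minimal Python sketch, assuming the Coordinator's dispatch and the PR description can each be reduced to a list of item names. The function name `check_bundle` and the feature names are hypothetical, for illustration only, not taken from the actual pipeline.

```python
from typing import Iterable, Set, Tuple


def check_bundle(
    dispatched: Iterable[str], listed_in_pr: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Compare what was dispatched against what the PR says it contains.

    Returns (missing, unexpected): items that were dispatched but are absent
    from the PR, and items in the PR that nobody dispatched.
    """
    dispatched_set, pr_set = set(dispatched), set(listed_in_pr)
    return dispatched_set - pr_set, pr_set - dispatched_set


# Hypothetical bundle: two features dispatched, only one listed in the PR.
missing, unexpected = check_bundle(
    dispatched=["feature-a", "feature-b"],
    listed_in_pr=["feature-a"],
)
```

A non-empty `missing` set is the silently dropped feature. A non-empty `unexpected` set flags work that nobody asked for.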

The first day on the new architecture, we shipped nine pull requests with zero incidents. Less chaos, more work done.

Lessons for Anyone Building Multi-Agent Workflows

If you're coordinating AI agents in any pipeline – development, content, operations, anything – here's what I'd take from this:

Undisciplined voice input is a trap. Dictating feels faster, but you say more and filter less. Agents don't know which part of your rambling is an instruction and which is you thinking out loud. If you're directing AI agents by voice, keep it short: one directive per message, no thinking out loud.

Natural language approvals are dangerous. Agents pattern-match on sentiment, not intent. "Looks good" is not "ship it." Build explicit trigger words into your pipeline.
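One way to build in trigger words is to treat a message as a command only if it opens with an approved verb. This is a sketch, not the parser we run: the "verb must come first" convention and the function name `parse_action` are assumptions made for the example. Matching is case-insensitive because transcribed speech rarely preserves capitals.

```python
from typing import Optional

# Trigger vocabulary named in the article.
ACTION_VERBS = {"SHIP", "RATIFY", "BUILD"}


def parse_action(transcript: str) -> Optional[str]:
    """Return the action verb if the message opens with one, else None.

    Positive sentiment ("looks good", "let's lock it in") never counts as a
    command; only an approved verb in the first position does.
    """
    words = transcript.strip().split()
    if not words:
        return None
    # Drop punctuation a transcriber may attach, e.g. "Ship:" or "Build,".
    first = words[0].strip(".,:;!?").upper()
    return first if first in ACTION_VERBS else None
```

With this rule, "That's correct, let's lock it in" returns `None`, so the agent has to ask what was meant instead of guessing.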

Intent is not execution. An agent saying "I dispatched this" means it intended to. Verify it arrived. Every handoff needs a receipt.
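A receipt can be as simple as making the receiver echo the instruction back before the sender's log is allowed to say anything stronger than "sent". The sketch below assumes an in-memory log. The class and method names are hypothetical, and a real pipeline would persist this state somewhere.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Handoff:
    item: str
    instruction: str
    status: str = "sent"  # "sent" records intent only, not delivery


@dataclass
class DispatchLog:
    handoffs: Dict[str, Handoff] = field(default_factory=dict)

    def dispatch(self, item: str, instruction: str) -> None:
        """Record that an instruction was sent (not that it arrived)."""
        self.handoffs[item] = Handoff(item, instruction)

    def acknowledge(self, item: str, echoed_instruction: str) -> bool:
        """Confirm a handoff only if the receiver echoes it verbatim."""
        handoff = self.handoffs.get(item)
        if handoff is None or echoed_instruction != handoff.instruction:
            return False
        handoff.status = "confirmed"
        return True

    def unconfirmed(self) -> List[str]:
        """Items that were sent but never acknowledged."""
        return [h.item for h in self.handoffs.values() if h.status != "confirmed"]
```

Anything still in `unconfirmed()` is a wish, not a fact. A paraphrased echo fails the check on purpose, since paraphrase is where drift starts.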

Agents fill gaps creatively. If your spec has a hole, the agent will fill it with something plausible. That's helpful for drafts. It's dangerous for production. Be explicit about what's in scope and what requires human approval.
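Being explicit about scope can start with a list of details an implementer is never allowed to guess. In this sketch the field names (`button_label`, `interaction_flow`, `layout`) are hypothetical stand-ins for the kinds of UI decisions that went unreviewed in our incident.

```python
from typing import Dict, List, Optional

# Hypothetical UI details that must be specified before implementation starts.
REQUIRED_UI_FIELDS = ("button_label", "interaction_flow", "layout")


def open_design_questions(spec: Dict[str, Optional[str]]) -> List[str]:
    """Return the UI fields the spec leaves missing or empty.

    A non-empty result means the implementer should ask, not invent.
    """
    return [name for name in REQUIRED_UI_FIELDS if not spec.get(name)]
```

For a spec that only says "add an override mechanism", all three fields come back as open questions, and each one goes to a reviewer before any code is written.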

Self-paused work decays silently. Agents don't have anxiety about forgotten tasks. Build reminder loops. If something is waiting for approval for more than a few hours, the system should escalate.
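A reminder loop needs little more than a timestamp per pending item and a threshold. The three-hour value below is an assumption chosen for the example, and the names `PendingItem` and `items_to_escalate` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# Assumed threshold for illustration; tune it to your own pipeline.
ESCALATE_AFTER = timedelta(hours=3)


@dataclass
class PendingItem:
    name: str
    waiting_since: datetime


def items_to_escalate(
    pending: List[PendingItem],
    now: datetime,
    threshold: timedelta = ESCALATE_AFTER,
) -> List[str]:
    """Return names of items that have waited at least `threshold` for approval."""
    return [p.name for p in pending if now - p.waiting_since >= threshold]
```

Run on a schedule, a check like this would have surfaced a draft plan that had been waiting for 12 hours long before anyone stumbled on it.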

Chain length equals drift risk. Every additional agent in your pipeline is another point where meaning can shift. Use the minimum number of agents that gets the job done. Direct communication beats relayed communication every time.
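A back-of-the-envelope model shows why chain length matters. Assume, purely for illustration, that each layer of interpretation preserves the original meaning with the same probability and that layers fail independently. Neither assumption was measured in our pipeline.

```python
def chance_intact(per_layer_fidelity: float, layers: int) -> float:
    """Probability a message survives every layer unchanged.

    Assumes each layer independently preserves meaning with the same
    probability, which is a simplification rather than a measured value.
    """
    return per_layer_fidelity ** layers
```

With an illustrative fidelity of 0.9 per layer, four layers of interpretation leave about a 0.66 chance that the message arrives intact. Two layers leave 0.81.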

The technology works. The models are smart enough. What breaks is the space between agents – the handoffs, the interpretations, the assumptions. That's where your pipeline lives or dies.

I learned this the expensive way. Hopefully you don't have to.

Frequently Asked Questions

The article documents what happens when you give AI agents real development responsibilities. The short answer: they can handle many tasks, but guardrails and human review are essential. Trusting agents without oversight led to shipped features that were never designed.

Code review, PR summaries, test generation, documentation, and structured refactoring work well. Creative architecture decisions, nuanced UX choices, and complex debugging still benefit from human judgment.

We learned this the hard way. Key safeguards: mandatory human review for all PRs, clear scope boundaries for agent tasks, automated testing gates, and never letting agents merge their own code. The lesson is that agents are powerful contributors but need the same oversight as junior developers.

The article discusses using multiple models for different tasks. Stronger models handle architecture and complex logic. Faster models handle formatting, documentation, and routine refactoring. The specific model matters less than the guardrails around it.

Mike is co-founder of Amplify, where we build AI agents that handle real work – and occasionally learn hard lessons about coordination. These Field Notes document the journey, including the experiments that blow up.
