The Multi-Agent Debate Nobody Is Winning (And Why Your Computer Use Agent Is Probably Doing It Wrong)
In June 2025, the team behind Devin, the AI software engineer that was supposed to replace developers, published a blog post titled 'Don't Build Multi-Agents.' Cognition, a company whose entire product IS a multi-agent system, told the world to stop building multi-agent systems. Anthropic responded within weeks with their own post on how they built theirs. A Berkeley research team published a paper identifying 14 distinct failure modes in multi-agent LLM systems. And right now, at this exact moment, over 40% of knowledge workers are spending at least a quarter of their entire work week on manual, repetitive tasks that a properly orchestrated computer use agent could handle before lunch. The debate is loud, the stakes are real, and most of the people arguing loudest have never actually shipped a multi-agent system that works in production. Let's cut through it.
The Hot Take That Broke AI Twitter: 'Don't Build Multi-Agents'
Cognition's Walden Yan made a genuinely interesting argument. The core claim: most teams reach for multi-agent architectures out of habit or hype, when a single well-designed agent with good tools and context would do the job better, faster, and with fewer catastrophic failures. He's not entirely wrong. The problem is that the AI community treated it like a commandment instead of a warning label.
The real insight buried in that post is about coordination overhead. Every time you split a task across multiple agents, you introduce a communication boundary. Agents misinterpret each other's outputs. Context gets lost between handoffs. Errors from Agent A don't just stay with Agent A; they cascade into Agents B, C, and D. The Berkeley MAST paper (March 2025) put hard numbers on this, cataloguing 14 distinct failure modes in multi-agent LLM systems and clustering them into three brutal categories: specification failures, inter-agent coordination failures, and execution failures. The most common? Agents confidently completing the wrong task because the orchestrator handed off an ambiguous instruction. Sound familiar? That's not a reason to abandon multi-agent computer use. That's a reason to build it properly.
The 5 Orchestration Patterns That Actually Matter
- Planner-Worker-Judge: A planner agent breaks the task down, worker agents execute subtasks in parallel, and a judge agent evaluates outputs before anything gets committed. This pattern solved a problem that had stumped single-agent systems for 4 days straight, according to a March 2026 MindStudio case study. The key is the Judge; without it, you're just hoping the workers got it right (sketched in code after this list).
- Hierarchical Orchestration: A top-level orchestrator delegates to specialized sub-agents, each owning a domain. Works brilliantly for computer use tasks that span multiple applications: think pulling data from a legacy ERP, formatting it in Excel, and filing it in a web portal. One agent per surface, one orchestrator to rule them all.
- Parallel Swarm Execution: Multiple computer use agents running the same class of task simultaneously across different accounts, datasets, or workflows. The speed math is simple: if one agent takes 8 minutes per task and you need 50 done, that's 400 minutes. A swarm of 10 agents gets it done in 40 (also sketched after this list). This is where the real ROI lives for enterprise teams.
- Reflexive Self-Critique Loops: An agent that checks its own work before reporting success. Sounds obvious. Almost nobody implements it. The ScienceDirect AI agents paper from 2025 specifically flagged 'silent failures and error propagation' as the defining weakness of production agent systems. A reflexive loop catches the mistake before it becomes your problem.
- Event-Driven Handoff: Agents that sit idle until a trigger fires, then execute, then hand off to the next agent in the chain. No polling, no wasted compute, no race conditions. This is the pattern Microsoft's Azure Architecture Center highlighted in February 2026 as the most scalable for enterprise multi-agent deployments.
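To make the Planner-Worker-Judge pattern concrete, here's a minimal sketch in Python. The `plan`, `execute`, and `judge` functions are hypothetical placeholders for whatever model or computer use agent calls you'd actually wire in; the point is the control flow: workers run in parallel, and nothing counts as committed until the judge approves the output or retries are exhausted.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical stand-ins for real LLM / computer use agent calls.
def plan(task: str) -> list[str]:
    """Planner: decompose the task into independent subtasks."""
    return [f"{task} (part {i})" for i in range(1, 4)]

def execute(subtask: str) -> str:
    """Worker: one agent performs a subtask and returns its output."""
    return f"output for {subtask}"

def judge(subtask: str, output: str) -> bool:
    """Judge: a separate check that the output actually satisfies the subtask."""
    return subtask in output  # placeholder acceptance test

@dataclass
class SubtaskResult:
    subtask: str
    output: str
    approved: bool

def run(task: str, max_retries: int = 2) -> list[SubtaskResult]:
    subtasks = plan(task)
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(execute, subtasks))  # workers run in parallel

    results = []
    for subtask, output in zip(subtasks, outputs):
        approved = judge(subtask, output)
        attempts = 0
        # Rejected outputs go back to a worker instead of flowing downstream.
        while not approved and attempts < max_retries:
            output = execute(subtask)
            approved = judge(subtask, output)
            attempts += 1
        results.append(SubtaskResult(subtask, output, approved))
    return results

if __name__ == "__main__":
    for result in run("reconcile last month's invoices"):
        print(result)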
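The swarm math is just bounded concurrency. Here's a rough sketch, with a hypothetical `run_agent` coroutine standing in for a real agent session: a semaphore caps how many sessions run at once, and wall-clock time collapses from tasks times minutes-per-task to roughly (tasks divided by swarm size) times minutes-per-task, ignoring orchestration overhead.

```python
import asyncio

TASK_MINUTES = 8   # one agent's time per task, from the example above
TASK_COUNT = 50
SWARM_SIZE = 10    # agents running concurrently

async def run_agent(task_id: int, gate: asyncio.Semaphore) -> str:
    """Hypothetical wrapper around one computer use agent session."""
    async with gate:               # at most SWARM_SIZE sessions at a time
        await asyncio.sleep(0.01)  # stand-in for the real 8-minute task
        return f"task {task_id} done"

async def run_swarm() -> list[str]:
    gate = asyncio.Semaphore(SWARM_SIZE)
    return await asyncio.gather(*(run_agent(i, gate) for i in range(TASK_COUNT)))

if __name__ == "__main__":
    results = asyncio.run(run_swarm())
    sequential = TASK_COUNT * TASK_MINUTES          # 400 minutes
    swarm = TASK_COUNT / SWARM_SIZE * TASK_MINUTES  # 40 minutes
    print(f"{len(results)} tasks: ~{sequential} min solo vs ~{swarm:.0f} min as a swarm")
```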
UK workers waste an average of 15 hours per week on repetitive administrative tasks. That's nearly two full working days, every single week, per person. Multiply that by your headcount and try not to feel sick.
Why Most 'Multi-Agent' Implementations Are Just Expensive Pipelines With Anxiety
Here's what nobody in the breathless LinkedIn posts about multi-agent orchestration wants to admit: most implementations aren't actually orchestration. They're sequential pipelines with an LLM at each step and a prayer holding it together. A Reddit thread from June 2025 put it bluntly: 'Multi-Agent AI in n8n is a total scam. You're just building pipelines and calling them agents.' That's harsh but it's directionally correct. Real orchestration means agents have genuine autonomy to make decisions, retry on failure, escalate to a supervisor, and adapt to unexpected states. What most teams build is a chain of API calls where one bad output nukes the entire workflow and nobody finds out until a human checks the output three days later. The computer use dimension makes this even more unforgiving. When your agent is clicking through a real desktop, filling out real forms, and submitting real data, a cascading failure isn't just a bad log entry. It's a wrong invoice sent. It's a record updated with garbage data. It's a compliance filing that's subtly wrong. The failure modes for computer-using AI agents are physical in a way that pure API agents aren't. That's why the architecture decisions matter more, not less.
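Here's what that gap looks like in a minimal sketch, with hypothetical `run_step`, `is_valid`, and `notify_supervisor` stand-ins rather than any real agent API: the brittle pipeline threads each output straight into the next step, while the orchestrated version validates every step, retries on failure, and escalates to a supervisor instead of pushing garbage downstream.

```python
import random

# Hypothetical stand-ins: an agent action that sometimes fails, a validity
# check, and an escalation hook (a supervisor agent or a human review queue).
def run_step(step: str, data: str) -> str:
    return f"{data} -> {step}" if random.random() > 0.3 else ""

def is_valid(output: str) -> bool:
    return bool(output)

def notify_supervisor(step: str, data: str) -> None:
    print(f"escalating: {step!r} could not produce a valid output for {data!r}")

# Brittle pipeline: one bad output flows straight into the next step,
# and nobody finds out until a human checks the final result.
def pipeline(steps: list[str], data: str) -> str:
    for step in steps:
        data = run_step(step, data)
    return data

# Orchestrated version: validate each step, retry on failure,
# and escalate loudly instead of continuing with garbage.
def orchestrated(steps: list[str], data: str, max_retries: int = 2) -> str:
    for step in steps:
        for _ in range(max_retries + 1):
            output = run_step(step, data)
            if is_valid(output):
                data = output
                break
        else:
            notify_supervisor(step, data)
            raise RuntimeError(f"step {step!r} failed after {max_retries + 1} attempts")
    return data

if __name__ == "__main__":
    steps = ["extract from ERP", "format in Excel", "file in web portal"]
    print(pipeline(steps, "invoice batch"))      # may quietly return garbage
    print(orchestrated(steps, "invoice batch"))  # succeeds or fails loudly
```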
OpenAI and Anthropic Are Still Figuring This Out Too
Let's be honest about where the big labs are. OpenAI's Operator launched in January 2025 powered by their Computer-Using Agent model. By July 2025, one reviewer called it 'unfinished, unsuccessful, and unsafe.' That's not a fringe take, that's a headline from someone who actually tested it. Claude's computer use scored 61.4% on OSWorld as of the Sonnet 4.5 release in September 2025. Respectable, but not dominant. Both products are still in research preview territory for anything resembling real enterprise workloads. Anthropic's own engineering blog post about their multi-agent research system, published June 2025, reads like a candid admission of how hard this is. They talk about 'engineering challenges' and 'lessons learned' in a way that suggests even the people who invented Claude's computer use capabilities are still working out the orchestration kinks. This isn't a knock. It's context. The gap between 'demo that works' and 'production system that works reliably at scale' is enormous for computer use AI, and anyone telling you they've fully solved it is selling something.
Why Coasty Exists
I've tried most of the computer use agents that matter. The benchmark that cuts through the marketing noise is OSWorld, a real-world test of whether an AI can actually operate a computer to complete tasks a human would recognize as work. Coasty scores 82% on OSWorld. Claude Sonnet 4.5 scores 61.4%. That gap isn't a rounding error, it's the difference between an agent that works and an agent that works most of the time. Coasty is built specifically for the orchestration patterns that actually matter in production. It runs on real desktops and real browsers, not sandboxed simulations. The agent swarm capability means you get genuine parallel execution, not faked concurrency. You can run it as a desktop app, spin up cloud VMs, or deploy swarms for the kind of high-volume parallel workloads where the ROI math gets genuinely absurd. There's a free tier if you want to test it without a procurement conversation, and BYOK support if your security team has opinions about API keys. The reason Coasty exists is that the best computer use agent shouldn't be the one with the biggest marketing budget. It should be the one that actually completes the task. At 82% on OSWorld, it's not close.
Here's where I land after all of this. Cognition is right that most teams shouldn't build multi-agents, because most teams don't have the architecture discipline to build them correctly. Anthropic is right that multi-agent systems, done properly, are genuinely more capable than any single agent for complex real-world work. Both things are true and they don't contradict each other. The answer isn't to avoid multi-agent orchestration. The answer is to stop treating it as a vibe and start treating it as an engineering problem with known patterns, known failure modes, and known solutions. Planner-Worker-Judge. Hierarchical delegation. Swarm parallelism. Reflexive self-critique. These aren't buzzwords, they're the difference between an automation that runs for 6 months without breaking and one that quietly corrupts your data on a Tuesday. Your team is wasting 15 hours a week per person on tasks that should already be automated. The architecture debate is interesting but it's also a distraction from the fact that you could start fixing this today. If you want a computer use agent that's actually been benchmarked against reality and not just a demo reel, start at coasty.ai.