Guide

Multi-Agent Orchestration Is Eating Your Budget: The Computer Use Patterns That Actually Work in 2025

Sophia Martinez · 9 min read

In June 2025, Cognition published a blog post called 'Don't Build Multi-Agents.' The same month, Anthropic published a post explaining how they built their multi-agent research system. Two of the most credible AI labs in the world, directly contradicting each other, in the same week. If the people building this stuff can't agree on the fundamentals, what chance does your engineering team have? Meanwhile, Gartner quietly updated their numbers: over 40% of agentic AI projects will be canceled before they ever reach production. Not paused. Canceled. That's not an AI problem. That's an architecture problem. And it's fixable, if you pick the right patterns from the start.

The Cognition vs. Anthropic Fight You Should Be Paying Attention To

Here's the core of the debate, stripped of the jargon. Cognition, the team behind Devin, argues that multi-agent systems create a context nightmare. Every time you hand off work between agents, you lose information. Summaries aren't the same as full traces. The receiving agent is working with a compressed, lossy version of what actually happened. Their argument: a single well-designed agent with access to the right context will outperform a swarm of specialized agents coordinating badly. Anthropic pushed back. Their multi-agent research system uses multiple Claude instances to explore complex topics in parallel, and they published the engineering challenges honestly. They're not claiming it's easy. They're saying the ceiling is higher when you get it right. Both positions are correct, which is the maddening part. The pattern you choose depends entirely on your task type. Sequential, dependent workflows? A single computer use agent with good context management often wins. Massively parallelizable research or data processing? Agent swarms are genuinely faster. The mistake most teams make is picking one pattern and forcing every problem to fit it.

Why 40% of These Projects Die in Proof of Concept

  • Context collapse: Agents hand off summaries instead of full traces, and downstream agents make decisions on incomplete information. Galileo's research found formal orchestration frameworks reduce failure rates by 3.2x versus unorchestrated systems.
  • No error recovery: A single agent failure in a chain can cascade silently. The orchestrator reports 'done' and the output is garbage. Nobody finds out until a human checks the work three days later.
  • Over-engineering from day one: Teams build a Planner-Worker-Judge architecture for a task that a single well-prompted computer use agent could handle in 40 seconds.
  • Token cost blindness: Running 10 agents in parallel sounds fast. It is fast. It's also 10x the token spend. Teams hit budget ceilings mid-project and have to kill the whole thing.
  • No real desktop control: Most 'agents' are just API call wrappers. They can't navigate a real browser, interact with legacy desktop software, or handle the visual, unpredictable interfaces that live in every real enterprise environment.

Gartner predicts over 40% of agentic AI projects will be canceled by end of 2027. The leading cause isn't model quality. It's that teams build orchestration systems before they understand what their agents can actually do on a real computer screen.

The Three Patterns Worth Knowing (And One That's Mostly Hype)

Pattern one is the Orchestrator-Subagent model. One coordinator agent breaks down a task and dispatches specialized subagents to execute steps. This works well when subtasks are genuinely independent and the orchestrator has enough context to route intelligently. The failure mode is an orchestrator that's too dumb to know when a subagent has failed and just moves on anyway.

Pattern two is the Planner-Worker-Judge architecture. A planner generates a strategy, workers execute it, and a judge evaluates the output before it gets passed forward. MindStudio documented a case where this pattern solved a math problem in 4 days that stumped single-agent approaches. The overhead is real, but so are the results for complex, verifiable tasks.

Pattern three is the parallel swarm: multiple agents working independently on sub-tasks, with results aggregated at the end. Kimi's K2.5 model can direct a swarm of up to 100 sub-agents executing across 1,500 tool calls. That's not a demo. That's production infrastructure.

The hype pattern is 'agents negotiating with each other.' Yes, you can build a system where 17 AI agents debate the best way to optimize your data warehouse. One of them will convince the others of something wrong. Consensus among confused agents is not intelligence. It's expensive confusion.
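To make the Orchestrator-Subagent pattern concrete, here is a minimal sketch of the control flow, including the explicit failure check that the 'dumb orchestrator' failure mode skips. The plan_task and run_subagent functions are illustrative stubs standing in for real model calls, not any particular framework's API.

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    ok: bool
    output: str
    trace: str  # full execution trace, not a lossy summary


def plan_task(goal: str) -> list[str]:
    # Stub planner: in a real system this would be a model call that
    # decomposes the goal into subtasks the orchestrator can dispatch.
    return [f"gather inputs for: {goal}", f"produce deliverable for: {goal}"]


def run_subagent(subtask: str, context: list[StepResult]) -> StepResult:
    # Stub subagent: in a real system this would run a computer use agent
    # on the subtask, with the prior step traces passed in as context.
    return StepResult(ok=True, output=f"done: {subtask}", trace=f"executed {subtask}")


def orchestrate(goal: str) -> list[StepResult]:
    results: list[StepResult] = []
    for subtask in plan_task(goal):
        result = run_subagent(subtask, context=results)
        # The failure mode described above is skipping this check and marching
        # on. Halt (or retry) instead of reporting 'done' on garbage output.
        if not result.ok:
            raise RuntimeError(f"Subtask failed: {subtask}\n{result.trace}")
        results.append(result)
    return results
```

The stub logic isn't the point; the point is that the orchestrator checks every subagent result and carries full traces forward instead of summaries.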

The Part Everyone Glosses Over: Your Agents Need to Actually Use a Computer

Here's what kills me about most multi-agent architecture discussions. They treat 'tool use' as a solved problem. It's not. The real world is full of interfaces that weren't designed for APIs. Legacy ERP systems. PDFs with inconsistent formatting. Web apps that break when you look at them wrong. Desktop software from 2014 that your finance team refuses to replace. A beautifully designed orchestration pattern is worthless if your agents can't interact with the actual software your business runs on. This is where computer use becomes the deciding factor, not the model, not the framework, not the prompt engineering. Can your agent actually see a screen, understand what's on it, and take the right action? Most 'agents' can't. They're calling structured APIs against clean data. That's not automation. That's a script with extra steps. Real computer use means controlling a desktop, navigating real browsers, handling popups and login flows and broken UI states. The benchmark that actually measures this is OSWorld, and the scores are humbling across the board.

Why Coasty Exists

I'm going to be direct about this because I think it matters. Coasty was built specifically for the gap I just described. It's a computer use agent that scores 82% on OSWorld, which is the highest score of any agent right now. That's not marketing copy. That's a number you can verify on the leaderboard. The reason that score matters in the context of multi-agent orchestration is this: when you're running an agent swarm, each individual agent's reliability compounds. If each agent in a 5-agent pipeline has an 80% success rate on its individual task, your end-to-end success rate is around 33%. If each agent has a 90% success rate, you're at 59%. Push that to 95% per agent and you're at 77%. The math is brutal. You need agents that are genuinely good at the underlying task, not just good at following instructions on clean data. Coasty controls real desktops, real browsers, and real terminals. Not simulated environments. It supports agent swarms for parallel execution, so you can run the orchestration patterns that actually make sense for your workload. There's a free tier if you want to test it without a procurement conversation, and BYOK support if your security team has opinions about API keys. The reason I use it and recommend it isn't the benchmark score. It's that it actually works when the interface is messy, and messy is what real work looks like.
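The compounding figures above are just straight multiplication: if each of n sequential agents succeeds independently with probability p, the end-to-end success rate is roughly p raised to the power n. A quick sanity check on those numbers:

```python
# End-to-end success of a sequential n-agent pipeline, assuming each
# agent succeeds independently with probability p.
def pipeline_success(p: float, n: int = 5) -> float:
    return p ** n

for p in (0.80, 0.90, 0.95):
    print(f"per-agent {p:.0%} -> end-to-end {pipeline_success(p):.0%}")
# per-agent 80% -> end-to-end 33%
# per-agent 90% -> end-to-end 59%
# per-agent 95% -> end-to-end 77%
```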

The Context Engineering Shift That Changes Everything

The smartest thing Cognition said in their 'Don't Build Multi-Agents' post wasn't the title. It was the underlying argument about context engineering. The quality of what an agent can do is almost entirely determined by the quality of context it receives. This is true for single agents and it's even more true in multi-agent systems, because every handoff is a context event. If you're building an orchestration system right now, the question to ask at every handoff point is not 'did the previous agent complete its task?' It's 'does the next agent have everything it needs to not make a bad decision?' That means passing full traces, not summaries. It means building explicit verification steps before high-stakes actions. It means designing your orchestrator to handle failure states, not just success states. The teams shipping multi-agent systems to production aren't the ones with the fanciest architectures. They're the ones who got obsessive about what each agent actually knows at each step. That's the work. It's not glamorous. It doesn't make a good conference talk. But it's the difference between a demo and a product.
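As one illustration of what 'pass full traces, not summaries' and 'handle failure states' can look like at a handoff boundary, here is a hypothetical handoff payload with a verification gate before high-stakes actions. The field names and the verify_before_action helper are assumptions for the sketch, not a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass
class Handoff:
    task: str
    # Full trace of every action and observation so far, not a lossy summary.
    trace: list[str] = field(default_factory=list)
    # Raw artifacts the next agent needs: file paths, screenshots, tool output.
    artifacts: dict[str, str] = field(default_factory=dict)
    # Explicit record of anything that went wrong upstream.
    errors: list[str] = field(default_factory=list)


def verify_before_action(handoff: Handoff) -> bool:
    """Gate a high-stakes action: refuse to proceed on incomplete context."""
    if handoff.errors:
        return False   # an upstream failure should surface, not be acted on
    if not handoff.trace:
        return False   # no trace means the next agent would be guessing
    return True        # a real system would add a judge or human check here
```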

Here's my honest take after watching this space for the past year. Multi-agent orchestration is real, it's powerful, and most teams are doing it wrong. Not because they're dumb, but because they're copying patterns that look impressive without understanding the failure modes. The Cognition vs. Anthropic debate isn't a contradiction. It's a map. Use single agents when context continuity matters more than parallelism. Use swarms when tasks are genuinely independent and you need speed. Use Planner-Worker-Judge when you need verifiable outputs on complex problems. And in every single case, make sure your agents can actually operate the software your business depends on. A perfect orchestration pattern with weak computer use capability is a beautiful car with no engine. If you want to see what a real computer use agent looks like in a multi-agent setup, go to coasty.ai and run something. Don't read another benchmark post. Don't watch another demo. Run a real task on a real desktop and see what 82% on OSWorld actually feels like in practice. The 40% of teams whose projects get canceled this year will have one thing in common: they spent six months on architecture and skipped the part where they tested whether their agents could actually do the work.

Want to see this in action?

View Case Studies
Try Coasty Free