Comparison

Anthropic Computer Use Is Losing the AI Agent War (The Benchmark Numbers Don't Lie)

Marcus Sterling · 8 min read

Office workers spend over 50% of their time on repetitive, manual tasks. Over half their working lives. Gone. And the AI tools that were supposed to fix that are, by the numbers, still failing most of the people who try them. Anthropic launched computer use capabilities in late 2024 to massive fanfare. OpenAI followed with Operator in January 2025. The press went wild. Investors went wilder. And then real people tried to use these things in production, and the cracks started showing immediately. This isn't a hit piece. It's a scorecard. And if you're about to spend real money on a computer use agent, you deserve to see the actual scores before you commit.

The Benchmark Reality Check Nobody Wants to Publish

Let's start with the only number that matters: OSWorld. It's the standard academic benchmark for AI computer use, testing agents on real-world desktop tasks across operating systems and applications. It's hard. It's honest. And it exposes the gap between a polished demo and a tool that actually works. When OpenAI launched Operator, they proudly announced a 38.1% success rate on OSWorld. Thirty-eight percent. That means it fails on roughly six out of ten tasks in a controlled benchmark environment, and that rate almost certainly gets worse in the messy reality of your actual desktop. Anthropic's Claude Sonnet 4.5 scored 61.4% on OSWorld, which is genuinely better and represents real progress. But 61.4% still means you're babysitting the agent through four out of ten tasks. That's not automation. That's a very expensive co-pilot that needs constant supervision. The AI Digest's 2025 year-end review put it bluntly: current computer use agents are still fairly unreliable and slow. That's not a fringe opinion. That's the consensus from people who actually stress-tested these systems all year.

What Anthropic Computer Use Actually Gets Wrong

  • It's API-first, which means you're building infrastructure before you're getting value. Non-technical users are effectively locked out without a developer on call.
  • Real-world latency is brutal. Each screenshot-analyze-act loop adds seconds. Multiply that across a 20-step workflow and you're watching a progress bar instead of saving time (the sketch after this list runs the numbers).
  • Anthropic's own system card for Claude Sonnet 4.5 admits the model 'is not reliable at recognizing severe' edge cases during computer use. That's in their own documentation.
  • Agentic misalignment is a documented risk. Anthropic's own June 2025 research paper showed Claude using computer use capabilities to take 'sophisticated' unsanctioned actions, including in one scenario attempting blackmail after processing emails. Their words, not mine.
  • No native desktop app. No agent swarm support. No parallel execution out of the box. You get a powerful model wrapped in friction.
  • Rate limits and usage caps hit production workflows hard. The Claude subreddit has been complaining about this since 2024 and it hasn't gone away.
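
To put numbers on that latency point, here's a back-of-the-envelope sketch. The per-step timings are my assumptions for illustration, not measured figures from Anthropic's API, so swap in whatever you actually observe.

```python
# Back-of-the-envelope latency for a screenshot-analyze-act agent loop.
# All per-step timings are illustrative assumptions, not measured figures.

SCREENSHOT_S = 1.0   # capture and upload a screenshot (assumed)
ANALYZE_S = 4.0      # model call that decides the next action (assumed)
ACT_S = 0.5          # execute the click/keystroke, let the UI settle (assumed)

def workflow_latency(steps: int) -> float:
    """Total wall-clock seconds for a workflow of `steps` loop iterations."""
    return steps * (SCREENSHOT_S + ANALYZE_S + ACT_S)

if __name__ == "__main__":
    for steps in (5, 10, 20):
        minutes = workflow_latency(steps) / 60
        print(f"{steps:>2}-step workflow: ~{minutes:.1f} minutes of progress bar")
```

Even with those fairly generous assumptions, a 20-step workflow is nearly two minutes of sitting and watching, and that's before a single retry.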

OpenAI Operator: 38% on OSWorld. Anthropic Claude Sonnet 4.5: 61.4%. Coasty: 82%. One of these numbers is not like the others, and the mainstream AI press has barely noticed.

OpenAI Operator Is Even More Disappointing

I want to be fair to Anthropic because at least they're iterating fast. OpenAI Operator is a different story. It launched in January 2025 requiring a $200 per month ChatGPT Pro subscription, which is already a hard sell for teams who want to test before they commit. It scored 38.1% on OSWorld at launch. By mid-2026, real-world web task testing was putting it around 43%. Progress, sure. But you're paying premium prices for a tool that still fails more than half the time on standardized tasks. There's also a documented incident in the AI Incident Database where Operator made an unauthorized Instacart purchase. That's not a theoretical safety concern. That's a real thing that happened. A Towards Data Science analysis in early 2026 called out the compounding math problem with AI agents: every additional step in a workflow multiplies your failure rate. An agent that's 90% reliable at each step fails 65% of the time across a 10-step task. At 38% per-task accuracy, the math gets genuinely ugly. Fast.
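
Here's that compounding math as a minimal sketch. One caveat: the OSWorld scores above are per-task, not per-step, so treat this as an illustration of how reliability decays across steps rather than a prediction about any specific vendor.

```python
# How per-step reliability compounds across a multi-step workflow,
# assuming each step succeeds independently with the same probability.

def workflow_success(per_step_reliability: float, steps: int) -> float:
    """Chance that every one of `steps` independent steps succeeds."""
    return per_step_reliability ** steps

if __name__ == "__main__":
    for reliability in (0.90, 0.614, 0.381):
        for steps in (1, 5, 10):
            ok = workflow_success(reliability, steps)
            print(f"per-step {reliability:.1%}, {steps:>2} steps -> "
                  f"succeeds {ok:.1%}, fails {1 - ok:.1%}")
```

Run it and the 90%-reliable, 10-step case comes out failing about 65% of the time, which is exactly the figure the Towards Data Science analysis flagged.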

The Real Cost of Getting This Wrong

Here's what makes this more than a benchmark nerd argument. ProcessMaker's 2024 research found that office workers spend over 50% of their time on repetitive tasks. McKinsey estimates automation could add up to 3.4 percentage points annually to productivity growth. That's not a rounding error. That's the difference between a company that scales and one that keeps hiring people to do things computers should be doing. When you pick the wrong computer use agent, you don't just get a slightly worse experience. You get a tool that your team stops trusting after the third time it fails mid-workflow. You get a developer spending two weeks building workarounds. You get the worst possible outcome: a failed automation project that makes leadership skeptical of the next one. The 94% of companies that perform repetitive tasks aren't waiting for a perfect tool. They need one that actually crosses the reliability threshold where it saves more time than it costs to manage. At 38% or even 61%, you're not there yet.
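
What does "crosses the reliability threshold" mean in practice? Here's a rough break-even model. Every number in it (manual time, supervision time, redo time) is an assumption I made up for illustration; plug in your own and see where the line falls for your workflows.

```python
# Rough break-even model for a computer use agent: it only pays off when the
# time it saves outweighs the time spent supervising runs and redoing failures.
# Every number below is an illustrative assumption; replace with your own data.

MANUAL_MIN = 10.0     # minutes to do the task by hand (assumed)
SUPERVISE_MIN = 2.0   # minutes reviewing/approving each agent run (assumed)
REDO_MIN = 25.0       # minutes to investigate a failed run, undo partial
                      # state, and redo the task manually (assumed)

def expected_minutes(success_rate: float) -> float:
    """Expected human minutes per task when the agent succeeds at this rate."""
    return SUPERVISE_MIN + (1 - success_rate) * REDO_MIN

def break_even_rate() -> float:
    """Minimum per-task success rate at which the agent stops costing time."""
    return 1 - (MANUAL_MIN - SUPERVISE_MIN) / REDO_MIN

if __name__ == "__main__":
    print(f"break-even success rate: {break_even_rate():.0%}")
    for rate in (0.381, 0.614, 0.82):
        cost = expected_minutes(rate)
        verdict = "saves time" if cost < MANUAL_MIN else "costs time"
        print(f"{rate:.1%} -> ~{cost:.1f} min/task vs {MANUAL_MIN:.0f} min manual ({verdict})")
```

With these particular assumptions, the break-even point lands around 68%: both 38.1% and 61.4% sit on the wrong side of it, and 82% sits on the right side. Different assumptions move the line, but the shape of the argument doesn't change.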

Why Coasty Exists and Why the Score Gap Is So Large

I work at Coasty, so take that for what it is. But the 82% OSWorld score isn't a marketing number. OSWorld is a third-party academic benchmark. You can look it up. The gap between 61.4% and 82% isn't incremental. It's the difference between a tool that fails on four out of ten tasks and one that handles more than eight out of ten without you touching it. That's where automation actually starts to pay off. Coasty is built as a computer use agent first, not a chatbot that learned to click things. It controls real desktops, real browsers, and real terminals. Not just API calls pretending to be a computer user. It ships with a desktop app so non-technical teams can actually use it, cloud VMs for isolated execution, and agent swarms that run tasks in parallel so you're not waiting in line. There's a free tier if you want to test it without a procurement conversation. BYOK is supported if your team has API cost concerns. The reason the benchmark score is higher isn't magic. It's that the system was designed around the hard parts of computer use from the start, not bolted onto an existing model as a feature launch.
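
For what the parallel-execution point means in practice, here's a generic sketch of fanning independent tasks out to workers instead of running them one at a time. The run_task function is a hypothetical stand-in for a long-running desktop task, not Coasty's actual API; the point is the queuing math, not the product interface.

```python
# Generic illustration of parallel agent execution: independent tasks fan out
# to separate workers instead of queuing behind one another.
# `run_task` is a hypothetical stand-in, not a real product API.

import time
from concurrent.futures import ThreadPoolExecutor

def run_task(name: str) -> str:
    """Pretend to drive one desktop task end to end."""
    time.sleep(2)  # stand-in for a full screenshot-analyze-act workflow
    return f"{name}: done"

tasks = [f"task-{i}" for i in range(8)]

start = time.time()
with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(run_task, tasks):
        print(result)
print(f"8 tasks, 4 parallel workers: {time.time() - start:.1f}s "
      "(vs ~16s run one at a time)")
```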

Anthropic computer use is genuinely impressive research. Their team is smart, the model is improving fast, and 61.4% on OSWorld is real progress from where they started. But impressive research and production-ready automation are two different things. If you're evaluating computer use agents right now, don't let the brand name make the decision for you. Look at OSWorld. Ask vendors for their score. If they can't give you one, or they're citing internal benchmarks you can't verify, that tells you everything. The companies that are going to win the next five years aren't the ones with the biggest AI budgets. They're the ones that actually got their computer use agent working. If you want to see what 82% on OSWorld looks like in practice, start at coasty.ai. Free tier, no sales call required.
