
Anthropic Computer Use Scores 61%. So Why Are People Still Paying For It?

Lisa Chen · 7 min read

A simple web scrape using Claude's computer use API recently cost one developer $30 in a single session. Thirty dollars. For one task. And it still only works about 61% of the time on real-world benchmarks. That's not a beta limitation; that's a business model problem. The hype around Anthropic computer use has been enormous, and the reality has been quietly, consistently disappointing. OpenAI's Operator isn't much better. UiPath is still selling companies on 2018-era RPA with a fresh coat of AI paint. Meanwhile, workers are burning 15 hours a week on repetitive tasks they hate, companies are hemorrhaging money on manual processes, and the tools that were supposed to fix all of this are too slow, too expensive, too unreliable, or all three at once. Let's actually compare what's out there, with real numbers, and stop pretending every computer use agent is created equal.

The Benchmark Everyone Ignores Until It's Inconvenient

OSWorld is the gold standard for measuring how well an AI agent can actually use a computer. Not a toy demo. Not a cherry-picked press release. Real tasks, real desktop environments, real success or failure. So let's look at the scoreboard. Claude Sonnet 4.5, Anthropic's most hyped computer use model, scores 61.4% on OSWorld. Claude Opus 4.6 gets to 72.7%. OpenAI's computer-using agent sits in a similar range. Every major lab's models cluster in the 50s, 60s, and low 70s, and the labs are patting themselves on the back for barely passing a test where failure means a real employee's workflow breaks mid-task. Coasty sits at 82% on OSWorld. That's not a small gap. That's a different category of tool entirely. When you're automating something that runs 500 times a day, the difference between 61% and 82% is the difference between a tool that saves you time and a tool that creates a new full-time job babysitting it.
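
To put numbers on that, here's a quick back-of-the-envelope calculation in Python, using the OSWorld scores above and treating benchmark success rate as a stand-in for production reliability (which is a simplification, but a useful one):

    # Expected daily failures for an automation that runs 500 times a
    # day, at the OSWorld success rates quoted above.
    RUNS_PER_DAY = 500

    SCORES = {
        "Claude Sonnet 4.5": 0.614,
        "Claude Opus 4.6": 0.727,
        "Coasty": 0.82,
    }

    for name, success in SCORES.items():
        failures_per_day = RUNS_PER_DAY * (1 - success)
        print(f"{name}: ~{failures_per_day:.0f} failed runs per day")

    # Roughly 193 failures a day at 61.4%, about 137 at 72.7%, and 90
    # at 82%. The babysitting workload scales with that first number.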

What's Actually Wrong With Anthropic Computer Use

  • Still labeled 'beta' in the API docs, which means Anthropic is telling you upfront: don't depend on this yet.
  • Costs can spiral. One developer reported $30 for a single web-scrape session through Claude's computer use API; at scale, that math gets ugly fast (see the projection after this list).
  • A 61.4% task success rate on OSWorld means roughly 4 in 10 real-world tasks fail or require human intervention.
  • Rate limits and usage caps hit users constantly, especially on Pro plans, with Reddit threads full of complaints going back months.
  • Anthropic's own research flagged 'agentic misalignment' risks, where computer use agents take unexpected actions during routine tasks.
  • No native desktop app or cloud VM infrastructure: you're building your own execution environment on top of an API, which means more engineering overhead before you get any actual work done.
  • Slow. Current computer use agents from the major labs are widely described as 'too slow for production use' by the developers actually building with them.
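
The cost bullet deserves its own arithmetic. This sketch assumes the reported $30 session is representative, which is a big assumption; treat it as an upper-bound illustration rather than a pricing model:

    # What a $30 session costs at automation scale. The $30 figure is
    # the single anecdote cited above, not a published price; real
    # sessions vary with task length, screenshot count, and model rates.
    COST_PER_SESSION = 30.00  # USD, anecdotal

    for sessions_per_day in (10, 100, 500):
        daily = sessions_per_day * COST_PER_SESSION
        monthly = daily * 30
        print(f"{sessions_per_day:>3} sessions/day -> "
              f"${daily:,.0f}/day, ${monthly:,.0f}/month")

    # 10/day is $9,000 a month. 500/day is $450,000 a month. At that
    # point the agent costs more than the team it was meant to relieve.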

Office workers waste over 50% of their time on repetitive tasks. UK workers specifically lose 15 hours per week to manual admin work alone. That's not a productivity problem. That's a structural emergency, and 61%-accurate AI tools are making it worse, not better.

OpenAI Operator: The Hype Was Louder Than the Product

OpenAI launched Operator in January 2025 to enormous fanfare. It can book concert tickets. It can fill out grocery orders. That's the demo. In practice, Operator is a browser-only agent that can't touch your desktop, can't access local files, can't run terminal commands, and can't be deployed in a real enterprise workflow without significant workarounds. Early users who got access described it as impressive for simple web tasks and genuinely frustrating for anything more complex. The fundamental problem is the same one plaguing Anthropic computer use: these tools were designed to wow people in demos, not to run reliably at scale in production. And the OSWorld scores don't lie. When an independent benchmark puts your computer use agent in the low-to-mid 60s, you don't get to call it production-ready just because the press release said so.

RPA Is Not the Answer Either (And Never Was)

Some companies are still reaching for UiPath, Automation Anywhere, or Blue Prism when they need to automate computer tasks. This is like buying a fax machine in 2025 because you've always used fax machines. Traditional RPA is brittle. It breaks every time a UI changes. It requires dedicated engineers to maintain. It doesn't understand context. It can't adapt. A button moves three pixels to the left and your entire automation pipeline collapses. The 'AI-enhanced RPA' rebranding these companies are doing right now is mostly marketing. Bolting a language model onto a rules-based automation framework doesn't give you a real computer use agent. It gives you an expensive rules-based automation framework that occasionally hallucinates. The real computer use problem requires something that actually sees the screen, understands what it's looking at, and can make decisions the same way a human would. That's a fundamentally different architecture.
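
To make that architectural difference concrete, here's a side-by-side sketch. Every class and method below is invented for illustration, not UiPath's or any other vendor's real SDK; the point is where the fragility lives:

    # Illustrative contrast only: both "SDKs" here are invented
    # stand-ins, not anyone's real API.

    class RpaBot:
        """Stand-in for a traditional rules-based RPA runtime."""
        def click(self, x: int, y: int) -> None:
            print(f"clicking fixed pixel ({x}, {y})")
        def click_selector(self, selector: str) -> None:
            print(f"clicking element matching {selector!r}")

    class VisionAgent:
        """Stand-in for a vision-based computer use agent."""
        def run(self, goal: str) -> None:
            print(f"working toward goal: {goal!r}")

    # Traditional RPA: hard-coded coordinates and selectors. Move the
    # Submit button or rename the DOM id and this breaks silently.
    def rpa_submit_invoice(bot: RpaBot) -> None:
        bot.click(x=412, y=637)            # today's pixel position
        bot.click_selector("#confirm-ok")  # exact DOM id, brittle

    # Vision-based agent: the instruction is the goal. The agent
    # re-reads the screen each step, so a moved button or redesigned
    # dialog doesn't break the task.
    def agent_submit_invoice(agent: VisionAgent) -> None:
        agent.run("Open the pending invoice, click Submit, and "
                  "confirm the dialog that appears.")

    rpa_submit_invoice(RpaBot())
    agent_submit_invoice(VisionAgent())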

Why Coasty Exists

I'm not going to pretend I don't have a preference here. After looking at the benchmarks, the pricing complaints, the beta warnings, and the production horror stories, the case for Coasty isn't complicated. It scores 82% on OSWorld. That's the highest of any computer use agent, period. Not slightly higher. Meaningfully higher. The kind of higher that changes whether your automation actually runs or whether you're constantly cleaning up after it. Coasty controls real desktops, real browsers, and real terminals. Not just browser tabs. Not just API calls dressed up as computer use. It ships with a desktop app, cloud VMs, and agent swarms for parallel execution, so you can run multiple tasks simultaneously instead of waiting in line. There's a free tier, BYOK support if you want to bring your own API keys, and you're not locked into one underlying model. The people who built it clearly understood that the problem wasn't just 'can the AI click a button.' The problem was 'can this run reliably, at scale, without costing more than the employee it's replacing.' That's a harder problem. Coasty solved it. Check it out at coasty.ai.
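
To make 'agent swarms' less abstract: the idea is fan-out, running independent tasks concurrently instead of serially. A minimal sketch using Python's asyncio and a hypothetical run_task helper (this is the shape of the idea, not Coasty's actual SDK):

    import asyncio

    # Hypothetical stand-in for dispatching a task to one agent in its
    # own cloud VM. Not a real SDK call.
    async def run_task(instruction: str) -> str:
        await asyncio.sleep(1)  # placeholder for real agent work
        return f"done: {instruction}"

    async def main() -> None:
        tasks = [
            "Pull yesterday's sales report and export it as CSV",
            "Reconcile the three pending invoices in the ERP",
            "Update the onboarding tracker spreadsheet",
        ]
        # All three run concurrently instead of queuing behind each
        # other, which is the whole point of a swarm.
        results = await asyncio.gather(*(run_task(t) for t in tasks))
        for result in results:
            print(result)

    asyncio.run(main())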

Here's my honest take: Anthropic computer use is a research preview that got marketed as a production tool. OpenAI Operator is a polished demo with real limitations that most reviewers glossed over. RPA vendors are rebranding as fast as they can and hoping nobody notices the architecture underneath hasn't changed. None of that means the category is broken. It means most of the players in it are still catching up to what a real computer use agent needs to be. The gap between Anthropic's scores and Coasty's 82% on OSWorld (21 points against Sonnet 4.5, nine against Opus 4.6) isn't a minor version difference. It's the difference between a tool you demo and a tool you actually deploy. Workers are losing half their week to tasks that should already be automated. The technology to fix that exists right now. Stop settling for beta software with a famous logo on it. Go to coasty.ai and see what 82% actually looks like.

Want to see this in action?

View Case Studies
Try Coasty Free