
Anthropic Computer Use Is Losing the Benchmark War. Here's Who's Actually Winning.

Emily Watson · 7 min read

Manual repetitive tasks cost U.S. companies $28,500 per employee per year. Not per team. Per person. And the tools that were supposed to fix this (Anthropic's computer use, OpenAI Operator, the legacy RPA platforms) have spent the last 18 months overpromising and underdelivering in ways that should make every CTO genuinely angry. This isn't a nuanced take. The benchmarks are public. The failure modes are documented. And the gap between what these tools claimed and what they actually do is wide enough to drive a truck through. Let's go through it.

Anthropic Computer Use: Bold Launch, Ugly Numbers

When Anthropic announced computer use for Claude in October 2024, the demos looked genuinely impressive. Claude clicking around a desktop, filling forms, navigating browsers. The AI Twitter crowd lost its mind. Then the benchmarks came out. On OSWorld, the standard benchmark for testing how well a computer use agent handles real-world desktop tasks, Claude's early computer use capability landed around 22%. That means it failed roughly 78% of the tasks it attempted. To be fair, Anthropic has iterated since then. Claude Sonnet 4.5 pushed that number to 61.4% on OSWorld, which is real progress. But here's what Anthropic's own documentation quietly admits: the tool is still in beta, it's still error-prone, and they explicitly warn developers to watch out for prompt injection attacks where malicious content on screen can hijack what Claude does next. Their own docs say to 'isolate Claude from sensitive data' because the security model isn't solid. That's not a production-ready computer use agent. That's a very smart prototype with a long warning label.
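
If you do pilot the computer use beta, that isolation guidance is worth taking literally. Below is a minimal sketch of the idea, not Anthropic's own setup: run the agent's desktop in a throwaway container with no host directories mounted and as few privileges as possible, so a prompt-injected action has nothing sensitive to reach. The image name is a placeholder.

```python
# Illustrative isolation sketch (image name is hypothetical, flags are standard Docker):
# a disposable container with no host mounts, no extra privileges, and no capabilities,
# so a hijacked agent action stays inside the sandbox.
import subprocess

subprocess.run([
    "docker", "run", "--rm",                  # throwaway container, nothing persists
    "--security-opt", "no-new-privileges",    # block privilege escalation
    "--cap-drop", "ALL",                      # drop all Linux capabilities
    "computer-use-sandbox:latest",            # placeholder sandbox image
], check=True)
# Note: no -v mounts, so no host files or credentials are visible inside the container.
```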

OpenAI Operator: The New York Times Called It 'Brittle and Erratic.' That's Being Generous.

OpenAI launched Operator in January 2025 with the kind of fanfare usually reserved for product launches that actually work. The underlying Computer-Using Agent scored 38.1% on OSWorld at launch. Better than early Claude computer use, sure. But think about what 38% actually means in practice. More than six out of ten tasks fail. The New York Times reviewed it and used the words 'brittle and occasionally erratic.' A Reddit user who got early access wrote that Operator is 'too slow, expensive, and error-prone.' A writer at Where's Your Ed At ran the numbers on both Operator and Deep Research and concluded that both are expensive products that fail to deliver consistent value at scale. OpenAI's computer use story is essentially: we built something that works sometimes, costs a lot, and we're calling it a product. The AI hype machine did the rest.

What the RPA Crowd Got Wrong (And Is Still Getting Wrong)

  • UiPath's stock has been a disaster story. A LinkedIn post from November 2025 with thousands of reactions pointed out that UiPath lost 80% of its value while Anthropic, Google, and OpenAI all launched computer use from scratch, bypassing the entire RPA model.
  • Traditional RPA tools like UiPath require brittle, hand-coded scripts for every workflow. One UI change in a web app and the whole automation breaks (see the short sketch after this list). That's not intelligence, that's a very expensive macro.
  • Repetitive tasks cost businesses $1.8 trillion annually according to CIO Insight research. The RPA industry has existed for years and that number hasn't moved.
  • Workers globally spend roughly 69 days per year on admin tasks alone. That's more than three months of working days, gone. RPA was supposed to fix this. It didn't, not at scale.
  • UiPath now integrates Anthropic computer use and OpenAI Operator into its platform, which is basically an admission that their own core technology isn't enough. You don't bolt on your competitors' tools unless you're losing the technical argument.
  • The dirty secret of enterprise RPA is maintenance cost. Automation teams spend more time keeping old bots alive than building new ones. That's not automation. That's a different kind of manual work.
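
To see what that brittleness looks like in practice, here is a minimal, hypothetical example of the selector-driven style traditional RPA scripts rely on; the URL and field ids are invented for illustration. It works until someone renames one element in the front end, and then the whole workflow dies, which is exactly the maintenance treadmill described in the last bullet.

```python
# Hypothetical hard-coded RPA-style script (Selenium): every selector is a point of failure.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/invoice-entry")  # placeholder URL

# Breaks on any front-end change: a renamed id, a redesigned form, a new framework.
driver.find_element(By.ID, "invoice_amount_field").send_keys("1250.00")
driver.find_element(By.ID, "submit_button").click()
driver.quit()
```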

The #1 computer use AI agent today sits at 82% on OSWorld. OpenAI Operator launched at 38.1%. That's not a small gap. That's a completely different category of capability.

Why Every Computer Use Benchmark Before 2026 Was a Low Bar

OSWorld is the benchmark that actually matters for computer use agents. It tests real tasks on real operating systems, not toy problems. When OpenAI Operator launched at 38.1%, the tech press treated it like a moon landing. When Claude Sonnet 4.5 hit 61.4%, Anthropic called it 'a significant leap forward.' And both of those things are true in a narrow sense, because the starting point was so low that any improvement looked dramatic. But 61% still means four out of ten tasks fail. You wouldn't accept a 61% success rate from a human employee. You wouldn't accept it from a contractor. You definitely shouldn't accept it from a tool you're paying to automate your business. The conversation in AI circles right now is about who's actually crossed the threshold from 'interesting demo' to 'reliable production tool.' That threshold isn't 38%. It isn't even 61%. The question is who's actually pushing toward the 80s, and what that unlocks for real workflows.

Why Coasty Exists (And Why the Benchmark Gap Is the Whole Story)

Coasty sits at 82% on OSWorld. Not on an internal benchmark designed to make the numbers look good. On OSWorld, the same test that exposed OpenAI Operator at 38% and put Claude Sonnet 4.5 at 61.4%. That 20-point gap over Anthropic's best computer use model isn't a marketing claim. It's a reproducible result on a standardized test. What does that mean practically? It means Coasty's computer use agent completes tasks that Anthropic's and OpenAI's tools fail on. It controls real desktops, real browsers, and real terminals. Not API wrappers pretending to be computer use. Actual screen-level control, the way a human would operate a computer. And because it supports agent swarms, you can run parallel tasks simultaneously instead of waiting for one workflow to finish before starting the next. The free tier means you can test it without a procurement process. BYOK support means you're not locked into someone else's cost structure. The reason Coasty exists is exactly because the hype-to-performance ratio on Anthropic computer use and OpenAI Operator left a massive gap. Someone had to actually close it.
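
To make the parallelism point concrete, here is a generic sketch of dispatching several independent desktop workflows at once instead of queuing them; run_desktop_task and the task names are placeholders for illustration, not Coasty's actual API.

```python
# Generic fan-out sketch (placeholders, not Coasty's API): three independent
# workflows run in parallel rather than one after another.
from concurrent.futures import ThreadPoolExecutor

def run_desktop_task(name: str) -> str:
    # Stand-in for handing a task to a computer use agent instance.
    return f"{name}: done"

tasks = ["export monthly invoices", "reconcile CRM records", "file expense reports"]

with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
    for result in pool.map(run_desktop_task, tasks):
        print(result)
```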

The Real Cost of Picking the Wrong Computer Use Tool

Here's the math that should make you uncomfortable. $28,500 per employee per year in manual data entry costs, according to a 2025 Parseur report. Say you have 50 people doing repetitive computer tasks. That's $1.4 million a year in pure waste. Now you pick a computer use agent that works 38% of the time. You automate maybe a third of those tasks, the easy ones. You still have two-thirds of the problem, plus you're paying for the tool, plus your team is spending time managing failures and exceptions. You haven't solved the problem. You've added a new layer on top of it. This is exactly what happened to companies that adopted early RPA with the same optimism. The automation worked great in the demo. Then reality showed up. The difference with a computer use agent that actually performs, one sitting at 82% on a standardized benchmark, is that the failure rate drops to the point where real automation becomes real. Not theoretical automation. Not demo automation. The kind that actually gets 50 people's worth of repetitive work off the table.
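
If you want to sanity-check that math against your own headcount, here is a back-of-envelope sketch using the figures quoted above. The per-employee cost and headcount are assumptions to swap for your own numbers, and it deliberately simplifies by treating every task the agent fails on as work that stays manual, while ignoring tool fees and exception handling, both of which make a low-success-rate agent look even worse.

```python
# Back-of-envelope cost model (assumptions, not measured data).
COST_PER_EMPLOYEE = 28_500   # annual manual data entry cost per person (Parseur, 2025)
HEADCOUNT = 50               # people doing repetitive computer tasks

baseline = COST_PER_EMPLOYEE * HEADCOUNT  # ~$1.43M per year in manual work

def residual_cost(success_rate: float) -> float:
    """Manual work left over, assuming failed tasks stay manual and ignoring tool fees."""
    return baseline * (1 - success_rate)

for rate in (0.38, 0.61, 0.82):
    print(f"at {rate:.0%} task success: ${residual_cost(rate):,.0f} still done by hand")
```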

Anthropic computer use is genuinely impressive engineering. So is what OpenAI built with Operator. But impressive engineering and production-ready automation are two different things, and the benchmarks don't lie. A 22% launch score and a current 61% score tell you everything about where something is on the maturity curve. If you're evaluating computer use agents right now, stop reading press releases and go look at OSWorld scores. Then ask the vendor to show you the number. If they can't, or won't, that's your answer. The best computer use agent available today is at 82% on that benchmark, it's open-source, it runs on real desktops, and it has a free tier. There's no reason to settle for a tool that fails more than it succeeds when the alternative is sitting right there. Go test it at coasty.ai and see what computer use actually looks like when it works.

Want to see this in action?

View Case Studies
Try Coasty Free