
Anthropic Computer Use Scores 38% on OSWorld. So Why Are People Still Paying For It?

Sarah Chen · 7 min read

Office workers waste over 50% of their time on repetitive tasks. Fifty percent. That's not a rounding error, that's your entire Monday and Tuesday gone every single week. So when Anthropic launched computer use in late 2024 and the internet collectively lost its mind, the promise was real: an AI that actually controls a desktop, clicks buttons, fills forms, and does the grunt work so humans don't have to. Fast forward to today, and the reality is a lot messier. Anthropic's computer use scores 38% on OSWorld, the standard benchmark for real-world computer tasks. OpenAI's Operator launched to massive hype and landed at the exact same 38%. People are paying $200 a month for tools that fail on 62% of real tasks. This is the comparison nobody's being honest about.

What OSWorld Actually Measures (And Why 38% Should Embarrass Everyone)

OSWorld is not a trick benchmark. It's not designed to make AI look bad. It tests agents on real desktop tasks: navigating apps, writing files, searching the web, using spreadsheets, the kind of stuff a junior employee handles in their first week. When OpenAI announced their Computer-Using Agent in January 2025, they led with the 38.1% OSWorld score like it was something to brag about. Anthropic's Claude-based computer use sits in the same neighborhood. Think about that framing for a second. These companies are charging enterprise rates for tools that, by their own admission, fail nearly two-thirds of the time on standard tasks. If a human employee failed 62% of the time, you'd fire them on day three. The AI industry has somehow convinced buyers that 38% is a starting point worth celebrating. It isn't. It's a problem worth solving.

The Anthropic Computer Use Experience: What Users Actually Say

  • Rate limits kick in fast and hard, even on paid Pro tiers; Reddit threads are full of users hitting the wall mid-task and losing all their work
  • Computer use through Anthropic's API requires significant setup: a sandboxed environment, an agent loop you build yourself, and engineering overhead most teams don't have bandwidth for (see the sketch after this list)
  • Claude's computer use is genuinely impressive at simple demos but falls apart on multi-step workflows where one wrong click cascades into failure
  • Anthropic's own research published in June 2025 flagged 'agentic misalignment' risks where Claude took unexpected autonomous actions during computer use tasks
  • No native desktop app, no parallel execution, no agent swarms: you get one agent doing one thing at a time, slowly
  • Pricing through the API adds up fast when you're running long computer use sessions that burn tokens on every screenshot and action
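To put some weight behind the setup complaint above, here's a minimal sketch of the agent loop Anthropic's computer-use beta requires, following the shape of their October 2024 API docs. The `computer_20241022` tool type and `computer-use-2024-10-22` beta flag are documented; `execute_in_vm` is a hypothetical placeholder for everything Anthropic leaves to you: the sandboxed display, the screenshot capture, the input injection. The API never touches a screen. It only asks for actions.

```python
# Minimal sketch of Anthropic's computer-use beta loop (per their Oct 2024
# docs). The API returns *requests* for actions; executing them is on you.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

COMPUTER_TOOL = {
    "type": "computer_20241022",
    "name": "computer",
    "display_width_px": 1024,
    "display_height_px": 768,
}


def execute_in_vm(action: dict) -> str:
    """HYPOTHETICAL stand-in for the real engineering: forwarding the action
    to a sandboxed VM, performing the click/keystroke, capturing and
    base64-encoding a fresh screenshot. None of this ships with the API."""
    raise NotImplementedError("this is the part you build yourself")


messages = [{"role": "user", "content": "Open the spreadsheet and sum column B."}]

while True:
    response = client.beta.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        tools=[COMPUTER_TOOL],
        messages=messages,
        betas=["computer-use-2024-10-22"],
    )
    tool_uses = [b for b in response.content if b.type == "tool_use"]
    if not tool_uses:
        break  # Claude stopped asking for actions (done, or gave up)

    messages.append({"role": "assistant", "content": response.content})
    results = []
    for block in tool_uses:
        # block.input is e.g. {"action": "screenshot"} or
        # {"action": "left_click", "coordinate": [x, y]}.
        results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": execute_in_vm(block.input),
        })
    messages.append({"role": "user", "content": results})
```

Every pass through that loop sends the full message history, screenshots included, which is exactly why long sessions burn tokens the way the last bullet describes.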

OpenAI Operator: The $200/Month Hype Machine

When OpenAI dropped Operator in January 2025, the early access crowd was electric. Real users got in, put it through its paces, and the verdict on Reddit was brutally honest: it was slow, it failed on tasks that felt trivially simple, and the 38.1% OSWorld score wasn't a quirk of the benchmark. It reflected reality. Operator runs in a browser sandbox, which means anything that requires a native desktop app is immediately off the table. It also means you're dependent on OpenAI's infrastructure, their uptime, their rate limits, and their roadmap. One user in the r/ChatGPTPro thread said it best: they'd never heard anything good about computer use from Anthropic, and Operator didn't change that impression. That's not a niche complaint. That's the mainstream user experience right now. The computer-using AI category is full of demos that look incredible and products that underdeliver.

Employees lose an estimated 50 days per year to repetitive tasks. At a $60,000 salary, that's roughly $12,000 per person per year in pure wasted labor. A 1,000-person company is burning $12 million annually on work that a real computer use agent should be doing. And the best tools on the market right now succeed at barely a third of those tasks.
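The arithmetic is worth making explicit: 50 lost days out of a work year is a fifth of every salary, every year. A quick back-of-the-envelope sketch, assuming a 250-working-day year (the salary, lost-day, and headcount figures are the estimates above):

```python
# Back-of-the-envelope math behind the numbers above. The 250-day work year
# is an ASSUMPTION; the other figures are the article's cited estimates.
LOST_DAYS = 50
WORK_DAYS = 250
SALARY = 60_000
HEADCOUNT = 1_000

wasted_fraction = LOST_DAYS / WORK_DAYS              # 0.20 -> 20% of the year
wasted_per_person = SALARY * wasted_fraction         # $12,000 per person
wasted_per_company = wasted_per_person * HEADCOUNT   # $12,000,000 per year

print(f"{wasted_fraction:.0%} of the year wasted, "
      f"${wasted_per_person:,.0f} per person, "
      f"${wasted_per_company:,.0f} per 1,000-person company")
```

And at a 38% task success rate, you can't even hand most of that work off.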

UiPath Tried to Claim the Crown. Here's the Fine Print.

In January 2026, UiPath published a blog post claiming their Screen Agent hit number one on OSWorld. The enterprise RPA crowd cheered. But read that post carefully and you'll notice something: UiPath is a legacy RPA company bolting AI onto a decade-old automation framework. Their Screen Agent result is real, but the product experience around it is still UiPath: expensive licenses, complex deployments, IT-heavy implementations, and a sales process that takes longer than the automation you're trying to build. Traditional RPA was already dying before computer use AI arrived. Wrapping a new benchmark score around the same old enterprise bloat doesn't fix the fundamental problem. You still need a team of RPA developers, a six-figure contract, and six months of onboarding before a single task gets automated. That's not what computer use AI is supposed to be. It's supposed to be fast and accessible, and it's supposed to actually work.

Why Coasty Exists

Here's where I'll be straight with you. I've spent real time with all of these tools, and the reason I keep coming back to Coasty is simple: it's the only computer use agent that actually performs at the level the category promised. 82% on OSWorld. Not 38%. Not a benchmark-optimized demo. 82% on real-world computer tasks, higher than every competitor out there right now. Coasty controls actual desktops, real browsers, and terminals. Not a sandboxed browser tab, not an API wrapper that pretends to click things. It ships with a desktop app so you can get started without a PhD in DevOps, and cloud VMs if you want to scale without touching your own infrastructure. The agent swarms feature is where it gets genuinely exciting: parallel execution means you're not waiting for one agent to finish before the next task starts. You can run multiple computer use workflows simultaneously, which is how you go from saving hours to saving entire workdays. There's a free tier to try it, BYOK support if you want to bring your own API keys, and no six-month enterprise sales cycle to get through. The gap between 38% and 82% isn't a marketing number. It's the difference between a tool that works and one that doesn't. Go see for yourself at coasty.ai.

The computer use AI category is real, the need is real, and the wasted productivity it's supposed to fix is absolutely real. But most of the products in this space right now are charging premium prices for beta-quality results. Anthropic computer use is genuinely interesting research. OpenAI Operator is a compelling demo. With Screen Agent, UiPath is an old company putting new paint on old walls. None of them are at 82% on the benchmark that actually matters. If you're evaluating computer use agents for your team, stop reading press releases and start reading benchmark scores. Then go try the one that's actually winning. The 50 days your employees are losing every year to repetitive work aren't coming back, but at least you can stop losing next year's 50. Start at coasty.ai.

Want to see this in action?

View Case Studies
Try Coasty Free