Research

The OSWorld Benchmark Results Are In, and Most Computer Use Agents Should Be Embarrassed

Priya Patel · 7 min

Anthropic spent months telling the world that Claude was the computer use agent to beat. OpenAI launched Operator with a press blitz that would make a Hollywood studio jealous. UiPath has been selling 'intelligent automation' to Fortune 500 companies for years. And then OSWorld-Verified published its leaderboard, and the whole thing got very awkward, very fast. This is the benchmark that actually tests whether an AI can do real computer work: not toy demos, not cherry-picked screenshots, but real tasks across real operating systems. And it shows a performance gap so wide you could park a legacy RPA platform in it. Most agents are still scrambling to hit 70%. One is already at 82%. That one is Coasty.

What OSWorld Actually Tests (And Why Vendors Hate It)

OSWorld isn't a vibe check. It's 369 tasks across real desktop environments: LibreOffice, Chrome, VS Code, file management, multi-app workflows, the kind of work that actual humans do at actual jobs every day. The benchmark spins up a live virtual machine, gives the agent a task in plain English, and watches what happens. No hints. No guardrails. No pre-loaded state. The agent has to navigate the UI, make decisions, recover from mistakes, and get the job done.

The original OSWorld launched at NeurIPS 2024 and immediately became the gold standard for evaluating computer use AI. Then in July 2025, the team dropped OSWorld-Verified, a tightened version designed specifically to stop agents from gaming the eval. Because yes, that was already happening. Some teams were optimizing for the benchmark rather than for actual capability. OSWorld-Verified closed that loophole. The leaderboard got a lot more honest after that. Some scores dropped. Some teams went quiet. The ones who stayed loud are the ones who actually got better.
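
Here's what that observe-act-evaluate loop looks like in miniature. To be clear, this is a stub I wrote to illustrate the shape, not the real OSWorld harness: the names are mine, and the real benchmark's checker inspects actual VM state (files, app settings) instead of returning a constant.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    screenshot: bytes   # raw pixels of the live VM's screen
    a11y_tree: str      # accessibility tree, if the agent consumes one

class StubDesktopEnv:
    """Stand-in for the live VM: reset to a task, execute actions, re-observe."""
    def __init__(self, instruction: str, max_steps: int = 15):
        self.instruction = instruction
        self.max_steps = max_steps
        self.steps = 0

    def reset(self) -> Observation:
        self.steps = 0
        return Observation(screenshot=b"", a11y_tree="<desktop/>")

    def step(self, action: str) -> tuple[Observation, bool]:
        # A real harness executes the action (click, type, hotkey) inside the
        # VM and re-captures the screen; this stub just counts steps.
        self.steps += 1
        done = action == "DONE" or self.steps >= self.max_steps
        return Observation(screenshot=b"", a11y_tree="<desktop/>"), done

def run_episode(env: StubDesktopEnv, agent) -> bool:
    """One task: plain-English instruction in, pass/fail out. No hints, no
    pre-loaded state; the agent acts only on what it observes."""
    obs, done = env.reset(), False
    while not done:
        action = agent(env.instruction, obs)   # e.g. "click(312, 480)" or "DONE"
        obs, done = env.step(action)
    # OSWorld grades the resulting VM state, not the agent's transcript;
    # stubbed here as an always-fail checker.
    return False

print(run_episode(StubDesktopEnv("Rename report.odt to Q3-report.odt"),
                  lambda instruction, obs: "DONE"))
```

No partial credit, no "close enough": the checker either finds the job done in the VM or it doesn't. That's why this benchmark is hard to charm.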

The Scoreboard Nobody Wants to Talk About

  • Coasty: 82% on OSWorld. The only agent that clears human-level performance on the benchmark. Full stop.
  • Claude Sonnet 4.6: 72.5% on OSWorld-Verified. Anthropic's best Sonnet, and it's still 10 points behind. That's not close.
  • Claude Opus 4.6: 72.7% on OSWorld-Verified. Anthropic's flagship model barely edges its own mid-tier. Make that make sense.
  • OpenAI CUA (Computer-Using Agent): 38.1% on the original OSWorld benchmark as of its launch. That's the number OpenAI shipped with when they announced Operator to the world.
  • Human baseline on OSWorld tasks: roughly 72-74%. Coasty is already above it. Most competitors are still below it.
  • Claude Sonnet 4 (the version before 4.5): 42.2% on OSWorld. Anthropic went from 42% to 72% in a few months, which is genuinely impressive. But it still puts them 10 points behind Coasty.
  • The AI-2027 forecasting team predicted mid-2025 agents would hit 65% on OSWorld. They were right that it would happen. They just didn't predict how fast the gap between leaders and followers would open up.

The human baseline on OSWorld is around 72-74%. Claude Opus 4.6, Anthropic's most powerful model, scores 72.7%. It is, by the numbers, barely human-level at computer use. Coasty is at 82%. That 10-point gap is the difference between 'impressive demo' and 'actually replace the work.'

Why This Gap Is a Much Bigger Deal Than the AI Press Is Admitting

Ten percentage points sounds like a rounding error until you think about what it means in practice. OSWorld tasks aren't trivial. They involve multi-step reasoning, error recovery, and navigating interfaces that weren't designed for AI. Every percentage point on this benchmark represents a category of real work that an agent either can or can't complete reliably.

At 38%, OpenAI's CUA fails more than 6 out of 10 tasks. You wouldn't hire a human employee who failed 62% of their assignments. At 72%, Claude is roughly human-level, which means you've spent serious money on enterprise AI licensing to get performance that matches... hiring someone. At 82%, you're in a different category. You're above the human baseline, which means the agent is completing tasks that trip up the average person. That's where automation actually becomes leverage, not just a parlor trick.

The industry press keeps writing 'AI agents are getting better!' without asking the obvious follow-up: better than what, and better enough for what? OSWorld gives you a real answer. Most agents don't love that answer right now.
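
If you want the gap in task counts rather than percentages, the arithmetic is short. A quick back-of-envelope in Python, using the scores above and OSWorld's 369 tasks:

```python
# Back-of-envelope: what each headline score means across OSWorld's 369
# tasks. Rates are the leaderboard figures cited above.
TASKS = 369
scores = {
    "OpenAI CUA": 0.381,
    "Human baseline (approx.)": 0.72,
    "Claude Opus 4.6": 0.727,
    "Coasty": 0.82,
}
for name, rate in scores.items():
    done = round(rate * TASKS)
    print(f"{name:26s} ~{done:3d} tasks completed, ~{TASKS - done:3d} failed")
```

Run that and the gap stops being abstract: roughly 141 of 369 tasks for CUA, about 268 for Opus, about 303 for Coasty. The 10-point gap between Coasty and Opus is roughly 35 tasks on every full pass of the benchmark.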

RPA Had a 20-Year Head Start and Still Can't Do This

Before we pile on the AI agents, let's pour one out for the RPA crowd, because they deserve some of this too. UiPath, Automation Anywhere, Blue Prism: these platforms have been selling 'intelligent automation' to enterprises since before GPT-3 existed. An MIT NANDA analysis of 300+ corporate RPA deployments found failure rates that would get a human fired on the spot.

RPA works great when nothing changes. The moment a UI updates, a website restructures, or a process has an exception, the bot breaks and someone has to fix it manually. That's not automation. That's a fragile script wearing an automation costume. The entire value proposition of computer use AI, the reason OSWorld exists as a benchmark at all, is that a real computer use agent doesn't need a script. It sees the screen the way a human does and figures it out. That's a fundamentally different capability.

UiPath knows this, which is why they've been pivoting to 'agentic automation' as fast as they can. Their January 2025 report literally leads with the fact that enterprises are investing in AI agents to tackle complex workflows. Translation: the old RPA approach isn't cutting it, and they know it.
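
To make the script-versus-agent distinction concrete, here's an illustrative sketch. The bot and agent objects are hypothetical stand-ins, not UiPath's or anyone else's actual API:

```python
# Illustrative contrast, not any vendor's real API: why a scripted bot
# breaks where a computer use agent doesn't. Both client objects below
# are hypothetical.

# Classic RPA: a hardcoded locator recorded against one version of the UI.
# The next front-end refactor invalidates the selector and the bot dies.
def rpa_submit_invoice(bot) -> None:
    bot.click(selector="#app > div:nth-child(3) > form > button.submit-v2")

# Computer use agent: a goal in plain English. The agent reads the rendered
# screen the way a person would and finds the button wherever the redesign
# moved it, retrying if the first attempt misses.
def agent_submit_invoice(agent) -> None:
    agent.do("Open the invoice form and click Submit")
```

The first function encodes the UI. The second encodes the goal. Only one of those survives a Tuesday-morning redesign.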

Why Coasty Exists and Why the 82% Number Matters

I'm going to be straight with you. I work at Coasty. But I also looked at this benchmark honestly before I took the job, and the 82% on OSWorld is why I'm here.

Coasty isn't a chatbot with a browser plugin bolted on. It's a computer use agent built from the ground up to control real desktops, real browsers, and real terminals. Not API calls pretending to be computer use. Actual screen control, the same way a human uses a computer. The architecture supports desktop apps, cloud VMs, and agent swarms for parallel execution, so you're not waiting for one bot to finish task one before task two starts; workflows fan out across multiple agents simultaneously.

The OSWorld score isn't a marketing claim. It's a verified, third-party benchmark result on a test designed specifically to be hard to game. At 82%, Coasty is the only computer use AI that's demonstrably above human-level performance on the benchmark the research community actually trusts. And there's a free tier. BYOK is supported if you want to bring your own model keys. The barrier to trying it is genuinely low. The barrier to ignoring a 10-point performance gap over your current tooling is getting harder to justify every quarter.
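
As a rough sketch of what parallel execution buys you, here's the fan-out pattern in plain asyncio. The names are mine, not Coasty's published SDK; run_task stands in for handing one instruction to one agent on its own VM:

```python
import asyncio

# Hypothetical sketch of fanning work out to an agent swarm; none of these
# names come from Coasty's SDK.
async def run_task(instruction: str) -> str:
    await asyncio.sleep(0.1)            # pretend an agent is working its VM
    return f"done: {instruction}"

async def run_swarm(instructions: list[str]) -> list[str]:
    # Each instruction gets its own agent; no task waits on another.
    return await asyncio.gather(*(run_task(i) for i in instructions))

if __name__ == "__main__":
    for result in asyncio.run(run_swarm([
        "Reconcile March invoices in the ERP",
        "Export the weekly usage report from the admin console",
        "Update the renewal dates across the CRM",
    ])):
        print(result)
```

The shape is the point: task three never queues behind task one.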

Here's my actual opinion after digging through all of this: the OSWorld benchmark is the most honest thing to happen to the computer use AI space since it launched. It ended the era where every vendor could just say 'state of the art' and nobody could check. Now you can check. And when you check, the picture is clear. Most agents are still below or barely at human performance. One is meaningfully above it. If you're evaluating computer use agents for real work, the benchmark should be your first stop and the scores should be your first filter. Don't let a slick demo or a press release substitute for a number. The number is 82%. Everything else is catching up. Start at coasty.ai and see what above-human-level computer use actually looks like in practice.

Want to see this in action?

View Case Studies
Try Coasty Free