
The OSWorld Benchmark Results Are In and Most AI Computer Use Agents Should Be Embarrassed

Marcus Sterling · 7 min

The human baseline on OSWorld, the toughest real-world benchmark for AI computer use agents, is 72.36%. That's the score a regular person gets clicking around a real desktop doing real tasks. For most of 2024 and into 2025, every major AI lab was shipping 'computer use' products that couldn't clear 40%. OpenAI's Computer-Using Agent launched to massive fanfare in January 2025 and scored 38.1%. Anthropic's original Computer Use feature, which they announced like it was going to eat the automation industry alive, scored 22%. Twenty-two percent. That's failing nearly four out of every five tasks. The AI computer use race has been one of the most overhyped, underdelivered stories in recent tech history, and the OSWorld leaderboard is the receipts.

What OSWorld Actually Tests (And Why It's So Hard to Fake)

OSWorld isn't a vibe check. It's not 'write me an email' or 'summarize this PDF.' It's a benchmark that drops an AI agent into a real computer environment, running real software like LibreOffice, Chrome, VS Code, and GIMP, and tells it to complete actual tasks. No API shortcuts. No pre-built integrations. The agent has to look at the screen, figure out what to do, click things, type things, and get a result. It's the closest thing we have to measuring whether a computer use agent can actually do a job. The human baseline sits at 72.36%, which means a person with no special training successfully completes about 72 of every 100 tasks. That number is the real bar. Not some abstract leaderboard score. Not a cherry-picked demo. Can your AI agent do what a human can do, on a real computer, without hand-holding? For most of 2025, the honest answer from every major player was: not even close.
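To make that concrete, here's a minimal sketch of what 'look at the screen, decide, act' means in code. Every name in it is a placeholder I made up for illustration; OSWorld's actual harness and the model APIs behind these agents look different. But the shape of the loop is the point: pixels in, clicks and keystrokes out, nothing in between.

```python
# Minimal sketch of an OSWorld-style observe-act loop.
# All function names are illustrative placeholders, not the
# real OSWorld harness API.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def capture_screen() -> bytes:
    """Stand-in for grabbing a real screenshot of the desktop."""
    return b"<png bytes>"

def plan_next_action(task: str, screenshot: bytes) -> Action:
    """Stand-in for the model call: look at pixels, pick one action."""
    return Action(kind="done")  # placeholder decision

def execute(action: Action) -> None:
    """Stand-in for injecting a real mouse click or keystroke."""
    print(f"executing: {action.kind}")

def run_task(task: str, max_steps: int = 50) -> bool:
    # The agent gets nothing but pixels and a task description.
    # It acts one step at a time until it declares itself done
    # or runs out of steps, and a checker scores the end state.
    for _ in range(max_steps):
        action = plan_next_action(task, capture_screen())
        if action.kind == "done":
            return True
        execute(action)
    return False  # out of steps: scored as a failure

if __name__ == "__main__":
    run_task("Change the default font in LibreOffice Writer")
```

There's nowhere to hide in a loop like that, which is why the scores below are so hard to spin.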

The Scorecard Nobody Wants to Talk About

  • Anthropic Computer Use (original launch): 22% on OSWorld. They announced it like a revolution. It failed more than three-quarters of the tasks it was given.
  • OpenAI CUA at launch (January 2025): 38.1% on OSWorld. Better, but still more than 34 points below the human baseline. Operator shipped to the public at this level.
  • Claude Sonnet 4.5 (September 2025): 61.4%. Real progress, finally. But still 11 points short of what a regular human scores.
  • Simular Agent S2 (December 2025): 72.6%. First AI agent to technically clear the human baseline of 72.36%, by a razor-thin 0.24 points.
  • Claude Sonnet 4.6 (February 2026): 72.5% on OSWorld-Verified. Matches Simular almost exactly. Two agents, both just barely human-level.
  • Coasty: 82% on OSWorld. That's not barely human-level. That's 10 full points above the human baseline, and nearly 10 points above the next closest competitor.

OpenAI shipped Operator to the public at 38.1% on OSWorld. That means their 'computer use agent' was failing roughly 62% of real computer tasks when real people started paying for it. Anthropic's was failing 78% of them. These are products, not research previews.

The Gap Between 'Human-Level' and Actually Useful Is Bigger Than You Think

Here's the thing that the AI labs don't want you to think too hard about. Crossing the human baseline at 72.36% sounds impressive until you realize that the human in that comparison is doing tasks cold, with no context, no memory of previous runs, and no optimization. Your actual employees, the ones you're hoping to automate, are specialists. They know your systems. They have workflows. A general computer use agent that scores 72.5% is not replacing your ops team. It's replacing the least experienced temp you've ever hired, on a good day. Simular made a lot of noise about being the first to beat the human baseline in December 2025. Good for them. Coasty was already at 82% and climbing. The difference between 72% and 82% in real-world automation isn't a rounding error. It's the difference between an agent that gets through most of a task and one that actually finishes it reliably. At scale, across hundreds of parallel tasks, that 10-point gap compounds into an enormous difference in real output.
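The compounding claim is just arithmetic. Suppose you chain benchmark-level tasks into multi-step workflows and each step succeeds independently at the agent's single-task rate. The independence assumption is mine, not something OSWorld measures, and real failures tend to correlate, but it makes the shape of the problem obvious:

```python
# Back-of-the-envelope math for chained workflows, assuming each
# step succeeds independently at the agent's single-task rate.
# (That independence assumption is mine; it's a simplification.)
for steps in (1, 3, 5, 10):
    p_baseline = 0.725 ** steps  # agent at roughly the human baseline
    p_coasty = 0.82 ** steps     # agent at 82%
    print(f"{steps:>2} steps: 72.5% agent finishes {p_baseline:.0%}, "
          f"82% agent finishes {p_coasty:.0%}")
```

At five chained steps, the 82% agent finishes about 37% of workflows versus about 20% for the human-level agent. At ten steps it's roughly 14% versus 4%. The per-task gap looks like 10 points; the per-workflow gap is a multiple.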

Why Are Companies Still Shipping Half-Baked Computer Use Products?

This is the part that genuinely frustrates me. The AI labs know their OSWorld scores before they ship. Anthropic knew 22% was embarrassing. OpenAI knew 38.1% wasn't ready for production workflows. They shipped anyway, because the race for mindshare matters more to them than the race for actually working software. And enterprise buyers, desperate to show their boards that they're 'doing AI,' bought in. The result is a graveyard of failed computer use pilots. Teams that spent Q1 2025 trying to get Claude Computer Use to reliably handle invoice processing, only to watch it fail on the same task three times in a row. Developers who built workflows on top of Operator and had to babysit every run. The benchmark scores predicted exactly this. A 22% or 38% success rate on benchmark tasks translates directly to a tool that needs constant human supervision, which defeats most of the point. The labs moved fast and broke your automation budget.
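To put a number on 'constant human supervision': assume every failed run costs someone ten minutes to notice, diagnose, and redo. The ten minutes is my assumption, and your rescue cost may well be higher, but the scaling is what matters:

```python
# Expected human cleanup per 100 automated tasks, assuming each
# failure costs 10 minutes of human time (my assumption).
RESCUE_MINUTES = 10
agents = [
    ("Claude Computer Use at launch", 0.22),
    ("Operator at launch", 0.381),
    ("human baseline", 0.7236),
    ("Coasty", 0.82),
]
for name, success_rate in agents:
    failures = 100 * (1 - success_rate)
    hours = failures * RESCUE_MINUTES / 60
    print(f"{name}: ~{failures:.0f} failures, ~{hours:.1f} hours of cleanup")
```

Under those assumptions, launch-era Claude Computer Use generates about 13 hours of cleanup per 100 tasks and launch-era Operator about 10. That's not automation; that's a part-time job supervising the automation.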

Why Coasty Exists

I'm not going to pretend I stumbled onto Coasty by accident. I went looking for a computer use agent that I could actually trust to run unsupervised, and the OSWorld leaderboard pointed me there. 82% on OSWorld isn't just a marketing number. It means Coasty is completing tasks that every other agent on the market is still fumbling through. It controls real desktops, real browsers, real terminals, not a sanitized API layer that pretends to be a computer. It runs agent swarms for parallel execution, so if you've got 50 workflows to run simultaneously, you're not waiting in a queue. There's a desktop app, cloud VMs, BYOK support for teams that need to keep their data in-house, and a free tier so you can actually test it against your real workflows before committing. The thing that gets me is how much cleaner the experience is compared to the alternatives I've tried. When a computer use agent fails on 28% of tasks, you spend more time debugging the automation than you would have spent doing the work manually. When it fails on 18%, you're in a completely different category. You're actually delegating, not babysitting. That's the whole point of computer use AI, and it took until now for something to actually deliver it.

The OSWorld benchmark is the most honest thing in AI right now. It doesn't care about your press release or your demo video. It puts your agent in front of a real computer and watches what happens. And what's happened, over and over, is that the biggest names in AI have shipped computer use products that fail the majority of the time. Claude has come a long way, going from 22% to 72.5% in roughly 16 months. That's real progress. But 72.5% in production still means roughly 1 in 4 tasks ends in failure. If you're building serious automation, that's not acceptable. The OSWorld results tell a clear story: most computer use agents are still in the 'impressive demo' phase. Coasty is in the 'actually works' phase. If you're still evaluating or still burned from a failed automation pilot with one of the other tools, go test it yourself at coasty.ai. The benchmark scores are public. The free tier is real. There's no reason to keep paying for agents that can't beat a distracted human.

Want to see this in action?

View Case Studies
Try Coasty Free