Research

The OSWorld Benchmark Just Exposed Which Computer Use AI Agents Are Actually Worth Your Time

Michael Rodriguez · 7 min
Alt+Tab

OpenAI's Operator, the computer use agent that was supposed to change everything, scored 38% on OSWorld. A benchmark where humans score around 70%. That's not a rough launch. That's a product that succeeds at barely half the rate a person does. And yet the hype machine kept spinning. The OSWorld benchmark has quietly become the most important stress test in AI right now, and the results are separating the real computer use agents from the vaporware. Some companies are close to matching human performance. Others are charging enterprise prices to deliver a tool that fails more than it succeeds. Let's talk about what the numbers actually say.

What OSWorld Actually Tests (And Why It Matters More Than Marketing Slides)

OSWorld is a benchmark from NeurIPS 2024 that puts AI agents inside real operating systems and tells them to get things done. Not toy tasks. Not fill-in-the-blank prompts. We're talking about opening apps, navigating GUIs, writing files, managing browsers, and completing multi-step workflows across real software. The kind of work that costs companies thousands of hours a year. The human baseline sits at around 70%. That's the bar. Anything below that means the agent is slower, less reliable, and more error-prone than just hiring someone. OSWorld-Verified, launched in July 2025, tightened the evaluation further to stop agents from gaming the benchmark with brittle tricks. The leaderboard got harder. The weak agents got exposed. J.P. Morgan noted in their 2026 Outlook that leading models were improving on OSWorld by roughly 37 percentage points per year. That's an insane pace. But the current snapshot still shows a massive spread between the top performers and everyone else.
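
To make that concrete, here's a rough sketch of what an OSWorld-style evaluation loop looks like: give the agent a task in a fresh environment, let it act step by step from what it observes on screen, then score the final state rather than the path it took. This is a simplified illustration, not the benchmark's actual harness; Env, Agent, and Task below are stand-ins for a real VM, a real model, and a real task definition.

```python
# Illustrative sketch of an OSWorld-style evaluation loop, not the benchmark's
# real harness. Env, Agent, and Task are stand-ins for a real VM, a real model,
# and a real task definition.
from dataclasses import dataclass
from typing import Callable


class Env:
    """Stand-in for a real OS environment (a VM with a desktop, apps, and files)."""

    def __init__(self) -> None:
        self.state: dict = {}

    def screenshot(self) -> dict:
        # A real harness would return pixels and/or an accessibility tree.
        return dict(self.state)

    def step(self, action: str) -> None:
        # A real harness would click, type, scroll, or run a command.
        self.state["last_action"] = action


class Agent:
    """Stand-in for a model that maps (instruction, observation) -> next action."""

    def act(self, instruction: str, observation: dict) -> str:
        return "DONE"  # a real agent would emit mouse/keyboard actions until finished


@dataclass
class Task:
    instruction: str                 # natural-language goal, e.g. "export the sheet as CSV"
    verifier: Callable[[Env], bool]  # checks the final desktop state, not the action trace


def run_benchmark(tasks: list[Task], agent: Agent, max_steps: int = 15) -> float:
    successes = 0
    for task in tasks:
        env = Env()                                # fresh environment per task
        for _ in range(max_steps):
            action = agent.act(task.instruction, env.screenshot())
            if action == "DONE":
                break
            env.step(action)
        successes += task.verifier(env)            # scored on outcome, not on the path taken
    return successes / len(tasks)


if __name__ == "__main__":
    tasks = [Task("toy task", verifier=lambda env: True)]
    print(f"success rate: {run_benchmark(tasks, Agent()):.0%}")
```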

The Leaderboard Doesn't Lie: Here's Where Everyone Actually Stands

  • OpenAI Operator scored 38% on OSWorld as of late 2025, well below the 70% human baseline, meaning it fails on more than 6 out of 10 real computer tasks
  • Claude 3.5 Sonnet started at just 14.9% on OSWorld in late 2024, while most other models were stuck around 7.7%
  • Claude Sonnet 4.5 climbed to 61.4% by September 2025, a real improvement, but still below human-level performance
  • The AI-2027 research team predicted 80% on OSWorld would represent near-human-level computer use, a target only the very best agents are approaching
  • Coasty hits 82% on OSWorld, which puts it above the human baseline and above every publicly tracked competitor on the leaderboard
  • 30 to 50% of traditional RPA projects fail to meet their objectives according to EY's research, and more than half of RPA initiatives never scale beyond 10 bots
  • OSWorld-Verified was introduced specifically because older benchmark versions were being gamed, meaning any score on the old leaderboard should be treated with suspicion

OSWorld scores have improved by roughly 37 percentage points per year. And the spread matters as much as the pace: the 44-point gap between the best computer use agent here (82%) and the worst (38%) is now wider than the 38-point gap between that worst agent and having no agent at all. Picking the wrong tool isn't a minor inconvenience. It's a strategic mistake.
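
If you want to sanity-check that, the arithmetic is short. The numbers below are the ones cited in this article; the half-year projection at the end is a back-of-envelope extrapolation for illustration, not a forecast.

```python
# Back-of-envelope arithmetic using the numbers cited in this article.
human_baseline = 70        # approximate human task-success rate on OSWorld, in %
operator = 38              # OpenAI Operator
sonnet_4_5 = 61.4          # Claude Sonnet 4.5, September 2025
coasty = 82                # Coasty
rate_per_year = 37         # percentage points of improvement per year (J.P. Morgan figure above)

best_to_worst = coasty - operator        # 44 points between the best and worst tracked agents
worst_to_nothing = operator - 0          # 38 points between the worst agent and no agent at all
print(best_to_worst > worst_to_nothing)  # True: choosing badly costs more than not choosing

# At the cited pace, a 61.4% agent needs roughly half a year to reach the 80%
# "near-human" threshold described by the AI-2027 team.
print((80 - sonnet_4_5) / rate_per_year)  # ~0.5
```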

Why Most Companies Are Still Getting Burned By Bad Automation

Here's what nobody in the enterprise software space wants to admit. Most automation tools sold to businesses right now are still RPA with a ChatGPT wrapper slapped on top. They break when a UI changes. They choke on anything that requires judgment. They need a dedicated team to babysit them. EY found that 30 to 50% of initial automation projects fail outright. Over half of RPA programs never grow beyond 10 bots. Companies spend six figures on implementation, training, and maintenance, and then end up with a bot that handles three specific workflows and falls apart the moment someone updates the software it's reading. The reason OSWorld matters so much is that it tests exactly this. Can the agent actually navigate a real desktop, inside real applications, without a human holding its hand? A 38% score means no, not reliably. A 61% score means sometimes. An 82% score means yes, consistently, and better than most people would expect.
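
The difference shows up clearly if you sketch the two approaches side by side. Everything below is an illustrative stub, not any vendor's actual API: locate() and click_at() stand in for a vision model and an input controller. The point is that one approach hard-codes where to click, while the other re-finds the target from the live screen.

```python
# Simplified contrast between a brittle RPA step and a computer use agent step.
# Everything here is an illustrative stub, not any vendor's actual API.

def rpa_click_export(click_at) -> None:
    # Classic RPA: the button's position (or selector) is baked in at build time.
    # The moment the toolbar moves or the app updates, this clicks the wrong thing.
    click_at(x=412, y=87)


def agent_click_export(screenshot, locate, click_at) -> bool:
    # A computer use agent re-finds the control on every run from what is actually
    # on screen, so a UI refresh doesn't break the workflow.
    target = locate(screenshot, "the Export button")  # locate() stands in for a vision model
    if target is None:
        return False  # the agent can report failure instead of clicking blindly
    click_at(x=target[0], y=target[1])
    return True
```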

The Benchmark Arms Race Is Getting Messy

One thing the OSWorld team figured out fast is that AI labs will optimize for the test if you let them. That's why OSWorld-Verified exists. The July 2025 update changed the evaluation protocol to catch agents that were essentially memorizing patterns rather than solving tasks. Scores dropped across the board for agents that had been gaming the old framework. Anthropic's Claude models, to their credit, have shown genuine improvement on the verified version, not just benchmark hacking. The trajectory from 14.9% in late 2024 to 61.4% by September 2025 is real progress. But there's a difference between improving fast and being the best. The AI-2027 forecasters put 80% as the threshold for near-human computer use. Coasty is already past that. The conversation in AI research circles right now isn't whether computer use agents will surpass humans on OSWorld. It's about who got there first and who's still catching up.
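
Here's a toy illustration of why outcome-based scoring is harder to game than trajectory matching. It is not OSWorld-Verified's actual evaluator; it just contrasts checking the action sequence against checking whether the work product actually exists afterwards.

```python
# Toy contrast between trajectory matching and final-state checking
# (a simplified illustration, not OSWorld-Verified's actual evaluator).
import os
import tempfile


def trajectory_match(actions: list[str], reference: list[str]) -> bool:
    # Scoring by comparing actions to a canned reference rewards memorized click
    # sequences, even when the sequence no longer accomplishes anything.
    return actions == reference


def final_state_check(workdir: str) -> bool:
    # Scoring by inspecting the environment afterwards only passes if the work
    # actually got done, regardless of which path the agent took.
    report = os.path.join(workdir, "report.csv")
    return os.path.exists(report) and os.path.getsize(report) > 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        # A memorized trajectory "passes" the weak check while producing nothing.
        print(trajectory_match(["open app", "click export"], ["open app", "click export"]))  # True
        print(final_state_check(workdir))  # False: no report was ever written
```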

Why Coasty Exists

I'm not going to pretend I don't have a dog in this fight. But the reason Coasty sits at 82% on OSWorld isn't marketing spin. It's because the team built something that actually controls real desktops, real browsers, and real terminals. Not API calls dressed up as computer use. Not a browser extension that breaks when you change tabs. An agent that can see your screen, move your cursor, type into applications, and complete multi-step workflows the same way a competent person would, except it doesn't get tired, it doesn't miss Slack messages, and it can run as parallel swarms across multiple VMs. That's why the benchmark score is what it is. You can try it for free, bring your own keys if you want, and see what a computer use agent looks like when it's actually been built to pass the hardest evaluation in the field. The OSWorld leaderboard is public. The score is verifiable. That's the whole point.
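
For a sense of what "parallel swarms across multiple VMs" means in practice, here's a purely hypothetical sketch: one batch of tasks fanned out so each runs in its own environment at once. run_task_on_vm() and run_swarm() are invented for illustration; this is not Coasty's SDK.

```python
# Purely hypothetical sketch of fanning a batch of tasks out across several VMs
# in parallel. run_task_on_vm() and run_swarm() are invented for illustration;
# this is not Coasty's SDK.
import asyncio


async def run_task_on_vm(vm_id: int, task: str) -> str:
    await asyncio.sleep(0.1)  # stands in for a full desktop workflow inside one VM
    return f"vm-{vm_id}: finished '{task}'"


async def run_swarm(tasks: list[str]) -> list[str]:
    # One VM per task, so a ten-task backlog takes roughly as long as one task.
    return await asyncio.gather(*(run_task_on_vm(i, t) for i, t in enumerate(tasks)))


if __name__ == "__main__":
    results = asyncio.run(run_swarm(["reconcile invoices", "update the CRM", "pull the weekly report"]))
    for line in results:
        print(line)
```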

The OSWorld benchmark is the closest thing we have to an honest answer to the question: can this AI agent actually do real work? And right now, the honest answer for most tools is no, not at human level, not even close. Operator is at 38%. That's not a v1 rough edge. That's a fundamental capability gap. The agents that are genuinely useful in 2026 are the ones that score above the human baseline and keep improving on verified evaluations, not ones riding press releases. If you're evaluating computer use agents for your team or your company, stop reading blog posts from the vendors and go look at the OSWorld leaderboard yourself. Then go try Coasty at coasty.ai. It's free to start, it runs on real desktops, and 82% on the hardest computer use benchmark around speaks for itself.

Want to see this in action?

View Case Studies
Try Coasty Free