
The OSWorld Benchmark Is Exposing Who's Actually Building Real Computer Use AI (And Who's Just Talking)

Lisa Chen · 7 min

Eighteen months ago, the best AI agent in the world could complete roughly 12% of real computer tasks on its own. Twelve percent. A distracted intern with no training could beat that. Today, the number one computer use AI agent sits at 82%. That jump is one of the most dramatic capability leaps in modern AI history, and almost nobody outside the research community is paying attention to it. They should be, because the OSWorld benchmark is the clearest signal we have for which AI agents are actually ready to do your job, and which ones are dressed-up chatbots with a screenshot feature bolted on.

What OSWorld Actually Tests (And Why It's So Hard to Fake)

OSWorld isn't a quiz. It's not a coding challenge. It's a real computer, running real software, given real tasks. We're talking about things like 'open this spreadsheet, find the discrepancy, fix it, and save the file' or 'navigate to this web app, fill out this form, and confirm the submission.' The agent controls an actual desktop environment. It sees the screen. It moves the mouse. It types. It fails just like a human would fail if they didn't know what they were doing. That's what makes it the closest thing we have to a legitimate stress test for computer use AI. You can't prompt-engineer your way to a good score. You either control the computer or you don't. The benchmark was introduced at NeurIPS 2024 and it immediately became the standard because it exposed something uncomfortable: almost every AI agent being marketed as 'autonomous' was, in practice, barely functional on real-world desktop tasks.
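If you want a mental model for what 'controls the computer' means here, the sketch below shows the bare-bones perceive-reason-act loop these agents run: capture the screen, ask a model what to do, execute the action, repeat. It's an illustrative Python sketch only, not OSWorld's actual harness or any vendor's API; the propose_action stub is a hypothetical stand-in for whatever model decides the next move.

```python
# Minimal sketch of the perceive-reason-act loop an OSWorld-style agent runs.
# propose_action is a hypothetical placeholder, not OSWorld's real interface.
import pyautogui  # library for taking screenshots and driving mouse/keyboard

def propose_action(screenshot):
    """Placeholder: a real agent would send the screenshot (plus the task
    instruction) to a vision-language model and parse the returned action.
    Here we simply declare the task finished."""
    return {"type": "done"}

def run_task(max_steps=15):
    for _ in range(max_steps):
        screenshot = pyautogui.screenshot()      # perceive: capture the screen
        action = propose_action(screenshot)      # reason: ask the model what to do next
        if action["type"] == "click":
            pyautogui.click(action["x"], action["y"])  # act: move the mouse and click
        elif action["type"] == "type":
            pyautogui.write(action["text"])            # act: type text into the focused field
        elif action["type"] == "done":
            break                                       # the agent believes the task is complete

run_task()
```

Every step in that loop is a chance to misread the screen, pick the wrong coordinates, or get stuck, which is exactly why scores on this benchmark stay stubbornly low.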

The Scoreboard Is Brutal and the Gap Is Widening

  • Early 2024 baselines: most agents scored under 15% on OSWorld. That's not a typo.
  • Anthropic's Claude 3.5 Sonnet launched in late 2024 with computer use capabilities. OSWorld score: roughly 22%. Anthropic's own system card called it 'error-prone.'
  • OpenAI's Computer-Using Agent (CUA), the thing powering Operator, scored 38.1% on OSWorld general tasks. Better, but still failing on more than 6 out of every 10 tasks.
  • Claude Sonnet 4.5 in September 2025 hit 61.4%. Real progress. Anthropic deserves credit for the improvement arc.
  • UiPath Screen Agent claimed the #1 spot on OSWorld-Verified in December 2025, triggering a wave of scrutiny about what 'verified' actually means.
  • A 72.6% result posted in December 2025 on 369 tasks sparked a Reddit debate about whether benchmark conditions were being manipulated to inflate scores.
  • Coasty sits at 82% on OSWorld. That's not a rounding error above the competition. That's a different category entirely.

OpenAI's computer use agent fails on more than 6 out of every 10 real desktop tasks. Anthropic's was 'error-prone' at launch. Meanwhile companies are paying enterprise licenses for these tools right now.

The Benchmark Gaming Problem Nobody Wants to Talk About

Here's where it gets messy. OSWorld-Verified was introduced specifically to address concerns that teams were overfitting to the original benchmark, cherry-picking tasks, or running under conditions that don't reflect real deployment. When a 72.6% result dropped in December 2025 on a subset of 369 tasks, the r/singularity thread lit up. One comment that got heavily upvoted called it 'a highly atypical practice for non-live benchmarks' because the benchmark maintainers had quietly corrected errors in the task set, which retroactively changed what counts as a passing score. That kind of thing matters. A lot. When companies are marketing their computer use AI based on leaderboard position, and the leaderboard itself is being revised mid-race, you have a trust problem. UiPath, to their credit, published a detailed blog post explaining their #1 OSWorld-Verified ranking. But the broader pattern in AI benchmarking right now is companies optimizing for the test rather than the real-world task. The researchers behind OSWorld know this. That's why they keep updating the evaluation protocol. It's an arms race between benchmark integrity and marketing departments.

Why Most 'Computer Use' Products Are Still Vaporware in Practice

Let's be honest about what most computer use AI tools actually do in production. They handle the demo well. The carefully scripted, low-complexity, single-step task that the sales engineer practiced thirty times. Then you hand it a real workflow, with a legacy app, an unexpected pop-up, a login timeout, and a PDF that wasn't formatted the way the training data expected, and it falls apart. That's not a harsh take. That's what a 38% OSWorld score means in plain English. The agent fails on more than six out of every ten standardized tasks. On your actual messy production environment, the failure rate is almost certainly higher. This is why enterprises that bought into the first wave of AI computer use agents in 2024 are quietly frustrated. The promise was autonomous task completion. The reality was a lot of human babysitting of an agent that kept getting stuck on modal dialogs. The benchmark scores don't lie. The marketing copy does.

Why Coasty Exists and Why 82% on OSWorld Actually Means Something

I'm not going to pretend I don't work at Coasty. But I'm also not going to pretend the 82% OSWorld score is just a number on a slide deck. It's the result of building a computer use agent that actually controls real desktops, real browsers, and real terminals, not a sandboxed simulation, not an API wrapper that calls browser automation under the hood. Coasty runs on a desktop app and on cloud VMs, and it supports agent swarms for parallel execution, meaning you can run multiple computer-using AI instances simultaneously on different tasks. That's the thing that actually moves the needle for businesses. Not one agent completing one task. Multiple agents completing dozens of tasks in parallel, reliably, without falling over when the screen looks slightly different than expected. The 82% score means Coasty succeeds on 82 out of 100 standardized real-world computer tasks. Compare that to OpenAI's CUA at 38.1% or early Claude computer use at 22%. The gap isn't about one clever trick. It's about the entire approach to how the agent perceives the screen, reasons about state, and recovers from failure. There's a free tier if you want to see it yourself. BYOK (bring your own key) is supported if you're already paying for model access. The benchmark score is the starting point, not the pitch.
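To make 'agent swarm' concrete, here's a generic Python sketch of fanning independent agents out across a task list. It illustrates the pattern, not Coasty's actual API; run_agent_on_task is a hypothetical stand-in for a full agent session, and in practice each worker would drive its own desktop or cloud VM.

```python
# Generic sketch of running several computer-use agents in parallel.
# run_agent_on_task is a hypothetical placeholder, not Coasty's real interface.
from concurrent.futures import ThreadPoolExecutor

def run_agent_on_task(task: str) -> bool:
    """Placeholder: pretend an agent session attempted the task and report success."""
    print(f"agent finished: {task}")
    return True

tasks = [
    "reconcile the March spreadsheet",
    "submit the vendor onboarding form",
    "export last week's support tickets to CSV",
]

# One worker per task: parallelism is what turns a single agent into a swarm
# that clears a backlog instead of completing one item at a time.
with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
    results = list(pool.map(run_agent_on_task, tasks))

print(f"{sum(results)}/{len(results)} tasks completed")
```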

OSWorld is the most important benchmark in AI right now that most people haven't heard of. It's the one test that cuts through the noise and asks a simple question: can your AI agent actually use a computer? The answer, for most of the tools being marketed aggressively right now, is 'kind of, sometimes, under ideal conditions.' That's not good enough for anyone trying to automate real work. The scores are public. The methodology is documented. The gap between 38% and 82% is not a rounding error; it's the difference between a tool that works and a tool that needs a babysitter. If you're evaluating computer use AI for anything serious, start with the OSWorld leaderboard and work backwards. Or skip the research and go straight to the thing that's actually winning it: coasty.ai.

Want to see this in action?

View Case Studies
Try Coasty Free