
The OSWorld Benchmark Results Are In and Most AI Computer Use Agents Should Be Embarrassed

Lisa Chen | 7 min read

Manual data entry alone costs U.S. companies $28,500 per employee every single year. Not over a career. Per year. And yet, in 2026, most companies are still watching their AI vendors brag about benchmark scores that wouldn't impress a distracted intern. OSWorld is the benchmark that actually matters for computer use AI. It tests agents on real tasks in real computer environments, no shortcuts, no API tricks, no cherry-picked demos. And when you look at the honest results, a very uncomfortable truth emerges: most of the tools getting the loudest press coverage are scoring well below human performance. The gap between the marketing and the leaderboard is almost funny. Almost.

What OSWorld Actually Tests (And Why Other Benchmarks Are Mostly Theater)

OSWorld was introduced at NeurIPS 2024 and it's become the gold standard for evaluating computer use agents because it doesn't let anyone cheat. It throws agents into real operating system environments and asks them to complete 369 open-ended tasks across apps like Chrome, LibreOffice, VS Code, and system-level utilities. No training on the test set. No simplified web forms. Real screenshots, real mouse clicks, real consequences when you get it wrong. Compare that to WebArena or WebVoyager, which only test browser-based tasks, and you start to see why OpenAI was happy to tout Operator's WebArena score at launch while quietly not leading with OSWorld. Browser tasks are a subset of what a real computer use agent needs to handle. OSWorld tests the whole thing. Human performance on OSWorld sits at roughly 72%. That's the bar the agents are all chasing. Keep that number in mind as we go through the scores.

The Leaderboard Is Brutal. Here's Who's Struggling.

  • OpenAI's Computer-Using Agent (CUA), the model that powers Operator, launched in January 2025 with strong WebArena numbers but still fell short of human performance on OSWorld, per InfoQ's independent analysis at launch.
  • Anthropic's Claude models have been on a steep upward trajectory. Claude Sonnet 4.6 (February 2026) shows a 'steep upward trend in computer use' on OSWorld-Verified, per Anthropic's own system card. They're improving fast. They're not at the top.
  • OpenAGI's Lux agent claimed 83.6% on a computer use benchmark in December 2025, but that's on their own internal evaluation setup. Independent OSWorld-Verified scores are a different beast entirely.
  • Simular's Agent S2 made waves on OSWorld in late 2025, but 'making waves' in AI research and being production-ready for real enterprise workflows are two very different things.
  • Human performance on OSWorld is approximately 72%. Any agent not clearing that bar is still asking you to babysit it.
  • The AI Digest's May 2025 forecast explicitly noted that the gap between AI performance and best observed human performance on OSWorld was expected to close by end of 2025. It's still being contested in early 2026.
  • 56% of employees report burnout from repetitive computer tasks, per Parseur's 2025 report. The agents meant to fix this are still failing nearly a third of the tasks they attempt.

Human performance on OSWorld is ~72%. Most well-funded, heavily marketed computer use agents are still below that. You are, in many cases, literally better at using your own computer than the AI you're paying for.

Why Everyone Is Gaming the Benchmark (And What That Costs You)

Here's the dirty secret about the AI agent benchmark wars: companies pick the leaderboard where they look best and bury the ones where they don't. OpenAI launched Operator with WebArena and WebVoyager front and center. Anthropic leads with SWE-bench for coding tasks. Everyone has a benchmark they like. OSWorld is harder to spin because it tests real, messy, multi-step computer use across the full desktop environment. You can't just fine-tune on browser interactions and call yourself a computer use agent. The 2025-2026 AI Computer-Use Benchmarks Guide from o-mega.ai puts it plainly: OSWorld is 'a dynamic leaderboard' where new results regularly shuffle the rankings. That dynamism is good for research. It's confusing as hell for buyers. When a vendor tells you their agent is 'state of the art,' ask them specifically: what's your OSWorld-Verified score? Watch how fast the conversation changes. Meanwhile, the cost of getting this wrong is not abstract. $10.9 trillion is lost on unproductive tasks in the U.S. annually, per Clockify's 2025 research. Companies are bleeding money while waiting for their chosen AI vendor to catch up to a benchmark that a human can already clear.
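If you want to make that question a routine part of vendor evaluation, here is a minimal Python sketch of the screening logic. Everything in it is an assumption made for illustration: the VendorClaim fields, the screen_claim function, and the placeholder vendors are hypothetical, and the 0.72 baseline is just the approximate human figure quoted above.

```python
from dataclasses import dataclass

# Approximate human success rate on OSWorld, as quoted in this article.
HUMAN_BASELINE = 0.72


@dataclass
class VendorClaim:
    """A benchmark claim as a vendor might state it (hypothetical structure)."""
    vendor: str
    benchmark: str                 # name of the benchmark the score comes from
    score: float                   # success rate as a fraction between 0 and 1
    independently_verified: bool   # False for internal evaluation setups


def screen_claim(claim: VendorClaim) -> str:
    """Return the follow-up a buyer should take for a given claim."""
    if claim.benchmark != "OSWorld-Verified":
        return "ask for the OSWorld-Verified score"
    if not claim.independently_verified:
        return "ask for independent verification"
    if claim.score < HUMAN_BASELINE:
        return "below human baseline: expect to supervise"
    return "at or above human baseline"


# Placeholder claims, not real vendor data.
claims = [
    VendorClaim("Vendor A", "WebArena", 0.58, True),
    VendorClaim("Vendor B", "OSWorld-Verified", 0.80, False),
    VendorClaim("Vendor C", "OSWorld-Verified", 0.60, True),
]
for claim in claims:
    print(f"{claim.vendor}: {screen_claim(claim)}")
```

The order of the checks mirrors the argument here: first the benchmark itself, then who ran the evaluation, and only then the number.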

The Anthropic Situation Is Interesting and Telling

Anthropic deserves credit for being more transparent than most. Their system cards for Claude Sonnet 4.5, Haiku 4.5, Sonnet 4.6, and Opus 4.5 all publish OSWorld scores openly and acknowledge the benchmark's role as 'the standard benchmark for AI computer use.' That's actually rare in this industry. What's also telling is the trajectory. Each model release shows improvement. The Sonnet 4.6 system card describes 'a steep upward trend.' That's good. But steep upward trends from a low base are still a low base. And Anthropic's own documentation for computer use explicitly warns developers to 'carefully review any actions Claude proposes.' That's not a knock on Anthropic specifically. It's a knock on the entire category of agents that still need a human supervisor hovering over them. If you need to babysit the agent, you haven't automated anything. You've just added a step.

Why Coasty Exists and Why 82% on OSWorld Is Not a Coincidence

I'm going to be straight with you. Coasty was built specifically for this problem. Not browser automation. Not chatbot wrappers. Actual computer use, controlling real desktops, real browsers, and real terminals the way a human operator would. The 82% OSWorld score isn't a press release number. It's the verified result on the same benchmark everyone else is being measured against, and it's higher than every competitor currently on the leaderboard. That gap matters in practice. An agent that completes 82% of real computer tasks autonomously versus one clearing 55-65% means the difference between a workflow that actually runs unattended and one that jams up and waits for you to intervene every third task. Coasty runs as a desktop app, in cloud VMs, and in agent swarms for parallel execution when you need to scale. There's a free tier if you want to test it without a procurement conversation. BYOK is supported if your company has API key policies. It scores where it does because the team optimized for OSWorld-style real-world task completion from day one, not for whichever benchmark made the press release look good. If you want to actually close that $28,500-per-employee gap, the score on the hard benchmark is the only number that matters.
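To put rough numbers on that gap, here is a back-of-the-envelope sketch in Python. It assumes every task fails independently at a rate of one minus the benchmark score, which real workflows will not match exactly, and both function names are made up for illustration.

```python
def interventions_per_100(success_rate: float) -> float:
    """Expected number of tasks out of 100 that stall and need a human.

    Assumes each task fails independently with probability 1 - success_rate.
    """
    return (1.0 - success_rate) * 100


def tasks_between_interventions(success_rate: float) -> float:
    """Average number of tasks attempted per failure (1 / failure rate)."""
    return 1.0 / (1.0 - success_rate)


# Success rates taken from the comparison in the text: 82% versus 55-65%.
for rate in (0.82, 0.65, 0.55):
    print(
        f"{rate:.0%} success: "
        f"{interventions_per_100(rate):.0f} interventions per 100 tasks, "
        f"about one every {tasks_between_interventions(rate):.1f} tasks"
    )
```

Under those assumptions, an 82% agent stalls about once every 5.6 tasks, while a 65% agent stalls about once every 2.9 and a 55% agent about once every 2.2, which is where "every third task" comes from.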

The OSWorld leaderboard is the most honest document in the AI agent industry right now. It doesn't care about your funding round. It doesn't care about your launch blog post. It asks one question: can your agent actually use a computer? Most of them can't, not reliably, not at human level, not without supervision. The companies still pitching you on vibes and cherry-picked demos are hoping you don't know what OSWorld is. Now you do. Go look up the score of whatever tool you're currently evaluating. If they can't tell you their OSWorld-Verified number, that's your answer. If you want the agent that actually tops the leaderboard, start at coasty.ai. The free tier is right there. No babysitting required.
