The OSWorld Benchmark Results Are In and Most AI Computer Use Agents Should Be Embarrassed

Michael Rodriguez | 7 min read

The human baseline on OSWorld is 72.36%. That's the score a real person gets when you sit them down and ask them to complete 369 real desktop tasks: file management, web browsing, multi-app workflows, spreadsheet work, the whole messy reality of modern computer use. For most of 2024 and into 2025, every single AI agent on earth was losing to that human. Not close. Not almost there. Losing badly. OpenAI's Computer-Using Agent launched in January 2025 with a 38.1% score and called it 'state of the art.' That's not a flex. That's a confession. The race to build a genuinely useful computer use agent has been brutal, embarrassing for most players, and absolutely fascinating to watch. Now the 2026 numbers are in. The field has split hard between agents that actually work and agents that are basically expensive demos. Here's the full picture.

What OSWorld Actually Tests (And Why Most Agents Fail It)

OSWorld isn't a toy benchmark. It's 369 real tasks across real operating systems: Ubuntu, Windows, macOS. These tasks require actual computer use, not just answering questions or generating text. We're talking things like 'open this spreadsheet, find the outlier, move it to a new tab, and email it.' Multi-step. Multi-app. No hand-holding. The benchmark was designed specifically because older tests were too narrow. They'd let agents call APIs, use shortcuts that real users can't access, or work in sandboxed environments that look nothing like an actual desktop. OSWorld forces agents to see the screen like a human does and act accordingly. That's why the scores are so much lower than on easier benchmarks. There's no faking it here. You either control the computer or you don't. And for most of 2025, the honest answer for most agents was that they didn't. Claude Sonnet 4.5 hit 61.4% in September 2025, which was genuinely impressive progress from Anthropic, but still about 11 points below the human baseline. Simular's Agent S2 reached 72.6% in December 2025, technically clearing the human bar by 0.24 points. It was a razor-thin margin that the AI community celebrated like a moon landing, which tells you everything about how hard this problem actually is.

The 2026 Leaderboard: A Timeline of Humiliation and Progress

  • January 2025: OpenAI CUA launches at 38.1% on OSWorld. Celebrated as 'state of the art.' Human baseline is 72.36%. Do the math.
  • September 2025: Claude Sonnet 4.5 reaches 61.4%. Real progress, but still about 11 points below a human with a mouse.
  • October 2025: Agent S3 scaling research pushes the open-source frontier to ~69.9%, approaching but not clearing the human line.
  • December 2025: Simular Agent S2 hits 72.6%. First agent to technically beat the human baseline. By 0.24 points. The celebration was enormous. The margin was microscopic.
  • 2026: Coasty hits 82% on OSWorld. That's not barely human-level. That's 9.64 points above the human baseline, and 9.4 points clear of the next best agent.
  • Separately, OSWorld-Verified was introduced in July 2025, adding independent validation after concerns that self-reported scores were inflating leaderboard numbers.
  • Stanford's 2026 AI Index confirms OSWorld as the definitive standard for measuring real-world computer use agent capability.
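The margins in the timeline above are simple subtraction against the 72.36% human baseline. As a rough sketch, using only the scores quoted in this post (the Agent S3 figure is approximate, and the function names are just illustrative), the gaps can be checked like this:

```python
# Scores as quoted in this post, in percent of OSWorld tasks completed.
# The Agent S3 value is given as "~69.9%", so treat it as approximate.
HUMAN_BASELINE = 72.36

SCORES = {
    "OpenAI CUA (Jan 2025)": 38.1,
    "Claude Sonnet 4.5": 61.4,
    "Agent S3 (approx.)": 69.9,
    "Simular Agent S2": 72.6,
    "Coasty": 82.0,
}


def gap_to_human(score, baseline=HUMAN_BASELINE):
    """Signed gap to the human baseline, in percentage points."""
    return round(score - baseline, 2)


def ratio_to_human(score, baseline=HUMAN_BASELINE):
    """Score expressed as a fraction of the human baseline."""
    return round(score / baseline, 2)


for name, score in SCORES.items():
    print(
        f"{name:24s} {score:5.1f}%  "
        f"gap {gap_to_human(score):+6.2f} pts  "
        f"ratio {ratio_to_human(score):.2f}"
    )
```

A positive gap means the agent is above the human baseline; a negative gap means it is below.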

OpenAI launched its computer use agent at 38.1% on OSWorld. The human baseline is 72.36%. They were barely half as capable as a person with a mouse, and they announced it like a triumph. That was the state of the industry in early 2025.

The Benchmark Gaming Problem Nobody Wants to Talk About

Here's the uncomfortable part. As OSWorld scores climbed, researchers started asking harder questions about how those scores were achieved. Berkeley's RDI lab published work on how top AI agent benchmarks get broken, and the Holistic Agent Leaderboard paper from October 2025 specifically called out leaderboard gaming as a systemic problem: agents optimized for benchmark tasks rather than general capability, self-reported scores without independent verification, and evaluation setups that don't reflect real deployment conditions. That's why OSWorld-Verified matters. The XLANG Lab introduced it specifically to require independent confirmation before scores go on the official board. Some previously celebrated results didn't survive the scrutiny. This is the part of the benchmark conversation that most AI companies would rather skip past. When you're selling enterprise automation contracts on the back of a leaderboard number, the last thing you want is someone actually checking. The agents that score well on OSWorld-Verified, under real conditions, with independent eyes on the process, are a much shorter list than the agents that claim to score well. That distinction matters enormously if you're actually trying to automate real work.

Why the Gap Between 72% and 82% Matters More Than It Looks

Ten percentage points sounds modest. It isn't. Think about what OSWorld is actually measuring: 369 distinct real-world computer tasks. The difference between 72% and 82% is roughly 37 additional tasks completed correctly. But it's not evenly distributed. The tasks that separate the top agents from the middle of the pack are the hard ones: multi-app workflows, tasks that require recovering from errors mid-sequence, tasks that require understanding context across multiple windows. The easy tasks, the ones every agent gets right, are already saturated at the top of the leaderboard. Every point above 72% is earned on genuinely difficult computer use scenarios. That's also why Smartsheet's research finding that workers waste a full quarter of their work week on manual, repetitive computer tasks is so relevant here. The tasks that eat people's time aren't the simple ones. They're the multi-step, cross-application workflows that require sustained attention and don't tolerate mistakes. An agent at 72% handles the easy stuff and stumbles on exactly the work that costs companies the most. An agent at 82% is operating in a different category entirely. That's why Citrix published a piece in July 2025 asking what happens when agents hit 100% on benchmarks, because the enterprise implications of truly reliable computer use are enormous and the industry is only starting to price that in.
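For anyone who wants to check the "roughly 37 additional tasks" figure, here is a minimal sketch of the conversion from percentage scores to task counts. It assumes every one of the 369 tasks carries equal weight, which is a simplification, and the helper names are made up for illustration:

```python
TOTAL_TASKS = 369  # size of the OSWorld task set, as cited in this post


def tasks_solved(score_percent, total=TOTAL_TASKS):
    """Approximate number of tasks completed at a given percentage score."""
    return round(total * score_percent / 100)


def extra_tasks(low_percent, high_percent, total=TOTAL_TASKS):
    """Approximate number of additional tasks between two scores."""
    return round(total * (high_percent - low_percent) / 100)


print("Tasks at 72%:", tasks_solved(72))
print("Tasks at 82%:", tasks_solved(82))
print("Additional tasks:", extra_tasks(72, 82))
```

Under that equal-weight assumption, ten percentage points is 36.9 tasks, which is where the rounded figure of 37 comes from.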

Why Coasty Exists and Why the Score Isn't Marketing

I'll be straight with you. I work at Coasty. But the 82% on OSWorld isn't a number we invented. It's a verified score on the hardest real-world computer use benchmark that exists. And the reason I'm writing this post is that I genuinely think most people evaluating computer use agents are getting tricked by demos and self-reported numbers that don't hold up under scrutiny. Coasty controls real desktops, real browsers, and real terminals. Not API wrappers. Not browser extensions that only work on three websites. Actual computer use, the way a human does it, seeing the screen and acting on what it sees. The desktop app runs locally. Cloud VMs are available for parallel execution. Agent swarms let you run multiple tasks simultaneously instead of waiting in a queue. There's a free tier so you can actually test it before committing. BYOK (bring your own key) is supported if you want to use your own API keys. Those details matter because a benchmark score only means something if the underlying system works in real conditions. OSWorld-Verified exists precisely because the industry needed a way to separate agents that perform from agents that just test well. Coasty's 82% is the former. If you're comparing computer use agents right now, the OSWorld-Verified leaderboard is the only honest starting point. Everything else is a vendor telling you what they want you to believe.

The OSWorld benchmark has done something genuinely valuable: it gave the industry a shared, hard-to-fake standard for what computer use actually means. And the results have been clarifying. Most agents, including ones from very well-funded, very well-marketed companies, are still not reliably better than a human at real desktop work. Some are getting close. One is clearly ahead. The next time someone pitches you on an AI agent for computer automation, ask them their OSWorld-Verified score. If they don't have one, or if they pivot to talking about a different benchmark, you have your answer. The agents that can do real work aren't hiding from this test. They're leading it. If you want to see what 82% on OSWorld looks like in practice, go try Coasty at coasty.ai. Free tier, no credit card required. See the difference between a computer use agent that actually works and the demos you've been watching on LinkedIn.
