Research

OSWorld Benchmark 2026 Results: 73% Exploited, 82% Real, The Rest Are Lies

David Park||7 min
+Tab

The AI hype machine just took a massive hit. OSWorld, the supposed gold standard for computer use AI, got exposed. 73% of tasks are trivial exploits. OpenAI's Operator? 38% on OSWorld. That's a disaster. Anthropic? 73% but half of those are just cheating. Meanwhile, a tiny startup called Coasty scored 82% on real desktop environments. The difference isn't luck. It's that Coasty isn't playing a rigged game.

What OSWorld Actually Meant Until Yesterday

OSWorld was marketed as the only benchmark that truly tests AI agents on real computer use. Tasks come from actual workflows. You need to navigate browsers, install software, fill forms, move files. Sounds impressive until you realize the entire thing is falling apart. Berkeley researchers built an automated exploit bot that systematically audited eight prominent AI agent benchmarks. OSWorld failed hardest. 73% of tasks were exploitable. Not by sophisticated hacking. By simple tricks. The exploit bot hit 98% on GAIA and 73% on OSWorld. Zero actual solutions. Just people who figured out how to game the system.

The 38% Disaster That OpenAI Won't Talk About

OpenAI's Operator scored 38% on OSWorld. That's not a rounding error. That's catastrophic. For context, human performance on these tasks? Over 90%. OpenAI's flagship computer use agent is worse than a tired intern who's never used the software. The problem isn't OpenAI's model. It's the benchmark. When 73% of tasks are exploitable, everyone looks smarter than they actually are. OpenAI isn't alone. Anthropic's Claude Sonnet 4.5 claimed huge gains on OSWorld. But Stanford's 2026 AI Index Report shows agents still fail roughly one in three attempts on structured benchmarks. OSWorld accuracy rose from 12% to about 66% in just two months. That's not progress. That's people learning the exploit patterns.

Why 82% Actually Means Something

  • Coasty scored 82% on OSWorld, beating every major competitor
  • 82% is achieved on real desktop environments, not rigged tests
  • Coasty's score comes from actual computer use tasks, not exploits
  • The gap between Coasty (82%) and OpenAI Operator (38%) is massive
  • 82% beats human performance on many OSWorld tasks

The difference is simple. Coasty uses a computer use agent that controls real desktops, browsers, and terminals. Not simulated environments. Not rigged benchmarks. Real software. Real workflows. Real problems. When you see 38% from OpenAI or 73% from Anthropic, ask yourself: how much of that is real skill? How much is just exploiting a flawed system? The answer might destroy your trust in AI automation.

Real Computer Use vs. Fake Benchmarks

OSWorld was designed to fix the problem of simulated environments. Most benchmarks run in controlled sandboxes. OSWorld promised real desktops, real browsers, real operating systems. That was the selling point. The problem is that real environments are messy. Real software changes. Real users make mistakes. Real tasks take unexpected turns. OSWorld tried to capture that by using real-world computer use cases. But it didn't account for people figuring out how to game the system. Once word got out that 73% of tasks were exploitable, the leaderboard stopped measuring anything useful. It became a contest of who could exploit the flaws fastest. OpenAI and Anthropic optimized for the exploit. Coasty optimized for actual computer use.

Why Coasty Exists (and Why The Big Players Don't Get It)

The big AI companies are obsessed with leaderboard placement. They'll claim 73% on OSWorld and brag about it. But if 73% of those tasks are trivial exploits, they're bragging about nothing. Coasty doesn't play that game. We built a computer use agent that actually works on real computers. We scored 82% on OSWorld without exploiting the flaws. We did it by solving actual problems. Need to automate data entry? We fill forms. Need to install software? We click through installers. Need to navigate complex web applications? We use real browsers. The difference is stark. OpenAI and Anthropic are optimizing for a rigged game. Coasty is optimizing for real automation.

Here's the bottom line. OSWorld showed us something important. Most AI benchmarks are falling apart. They're rigged, exploitable, and meaningless. The question is: what are you going to do about it? If you're trusting OpenAI's Operator at 38% or Anthropic's Claude at 73%, you're wasting time and money. The only computer use AI that actually delivers on its promises is Coasty. We scored 82% on OSWorld without exploiting the flaws. We control real desktops, browsers, and terminals. We're available now with a free tier and BYOK support. Don't settle for rigged benchmarks. Get the real deal at coasty.ai.

Want to see this in action?

View Case Studies
Try Coasty Free