OSWorld 2026 Results Are Brutal: 82% vs 72.5% vs 38%. Why Your AI Agent Is Wasting Money

Rachel Kim | 5 min

The OSWorld benchmark results just dropped, and they're brutal. OpenAI's computer-use agent scored 38%. Anthropic's Claude Sonnet 4.6 reached 72.5%, roughly level with the human baseline of 72.36%. Coasty hit 82%, almost ten points above that baseline and the highest score on the leaderboard.

OSWorld isn't a toy. It's 369 real desktop tasks.

OSWorld isn't a made-up leaderboard with cute screenshots. It's a real environment with 369 tasks on Ubuntu Linux. File management, web browsing, multi-app workflows. The exact stuff your team does every day. The OSWorld-Verified upgrade in July 2025 tightened the evaluation rules, so no more cheating with hidden APIs or easy tasks. If you want real computer use capability, this is the yardstick.

The math is insulting. A 44-percentage-point gap.

  • OpenAI's GPT-5.4 computer-use agent: 38% on OSWorld. That means it fails roughly 62% of the 369 tasks.
  • Anthropic's Claude Sonnet 4.6: 72.5%. Close to human performance, but still roughly 100 failed tasks out of 369 per run.
  • Coasty: 82%. The roughly 10-point gap over Claude Sonnet 4.6 isn't marketing fluff. It's a different tier of reliability.
  • Every percentage point gained on OSWorld means fewer failed tasks, which in practice should mean fewer retries, less human intervention, and lower cost.
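
The percentages above are easier to compare as task counts. Here is a minimal Python sketch, assuming the 369-task total and the success rates quoted in this article, and assuming a single pass over the benchmark. The name `failed_tasks` is illustrative and not part of any OSWorld tooling.

```python
TOTAL_TASKS = 369  # OSWorld task count cited in this article

# Success rates as quoted in this article (not independently verified here).
scores = {"GPT-5.4": 0.38, "Claude Sonnet 4.6": 0.725, "Coasty": 0.82}


def failed_tasks(success_rate: float, total: int = TOTAL_TASKS) -> int:
    """Approximate number of failed tasks in one pass over the benchmark."""
    return round(total * (1 - success_rate))


failures = {name: failed_tasks(rate) for name, rate in scores.items()}
for name, count in failures.items():
    print(f"{name}: about {count} of {TOTAL_TASKS} tasks failed")
```

Under these assumptions the counts come out to about 229, 101, and 66 failed tasks respectively.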

Coasty's 82% on OSWorld is the highest score on the leaderboard, beating OpenAI, Anthropic, and every other computer-use agent. That's not just data. It's evidence that harness design, meaning how the agent actually interacts with the desktop, matters more than the model alone.

Why most agents fail at computer use

The problem isn't the model. It's the harness. Anthropic's Computer Use and OpenAI's Operator both drive the desktop through a screenshot-and-action loop: the model receives a screenshot, returns a click or keystroke, and a wrapper executes it. That loop is where agents struggle with nuanced tasks like finding a button that's partially hidden or recovering from a crash. The harness creates a brittle layer between the AI and the real computer. When that layer fails, the whole task fails.

How Coasty actually controls desktops

Coasty doesn't use a wrapper. It controls real desktops, browsers, and terminals directly. You can run it on your own machine or in cloud VMs. Need parallel execution? Coasty supports agent swarms. Need to bring your own keys? That's supported too. The 82% OSWorld score comes from a harness that truly interacts with the OS, not a simulation. That's why Coasty actually works in production, while other agents are still stuck in the lab.

The business case is obvious

Let's say you have 100 employees doing repetitive desktop work. If you deploy a GPT-5.4 wrapper that succeeds 38% of the time, roughly 62 of every 100 attempts fail and end in a retry or a human handoff. A Coasty agent that hits 82% fails about 18 of every 100, which cuts failures by roughly 71%. Fewer retries, less supervision, faster throughput. The question isn't whether AI automation can save money. It's which agent actually delivers on that promise.
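
The retry cost implied by a 38% versus an 82% success rate can be estimated with a simple model. The sketch below assumes each attempt succeeds independently at the benchmark success rate. That is a simplification: a task that fails once often fails again, and benchmark rates may not carry over to your own workload. The function names are illustrative.

```python
def expected_attempts(success_rate: float) -> float:
    """Mean attempts until first success, assuming independent retries."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return 1 / success_rate


def failure_reduction(old_rate: float, new_rate: float) -> float:
    """Relative drop in failure rate when moving from old_rate to new_rate."""
    if not 0 <= old_rate < 1:
        raise ValueError("old_rate must be in [0, 1)")
    return ((1 - old_rate) - (1 - new_rate)) / (1 - old_rate)


print(f"{expected_attempts(0.38):.2f}")        # 2.63 attempts per completed task
print(f"{expected_attempts(0.82):.2f}")        # 1.22 attempts per completed task
print(f"{failure_reduction(0.38, 0.82):.0%}")  # 71% fewer failures
```

Under this model an agent at 38% needs more than twice as many attempts per completed task as one at 82%, before counting any time spent on human handoffs.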

The OSWorld 2026 results are a wake-up call. The gap between 38% and 82% isn't a technical curiosity. At enterprise scale, it decides how much work still lands back on humans. If you're still using agents that claim to do computer use but can't prove it on OSWorld, you're rolling the dice with your budget. Coasty holds the top score on the leaderboard and clears the human baseline by almost ten points. Try it for free at coasty.ai and see how much faster your team can work.
