OpenAI's 38% OSWorld Score Is a Joke. Here's Why Computer Use Agents Still Suck in 2026

Marcus Sterling | 6 min

OpenAI's Operator scored 38% on OSWorld. That's not a typo. When the 2026 results dropped, the gap between OpenAI, Anthropic, and the rest became impossible to ignore. A 38% success rate is worse than a coin flip. If you're paying for enterprise automation and getting 38% success, you're getting ripped off.

The OSWorld Benchmark Is the Only Honest Test

Most AI benchmarks measure how well a model answers questions. OSWorld is different. It tests real computer tasks. Your agent has to open apps, click buttons, fill forms, navigate websites, and complete workflows on a real desktop. If the task isn't actually completed, it doesn't count. OSWorld-Verified has 369 execution-verified desktop tasks. That's not a toy benchmark. It's the closest thing we have to a real-world stress test for computer use agents. And the results are brutal.
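To make "execution-verified" concrete, here is a minimal sketch of the idea: a task counts only if a check on the final state of the machine passes, no matter what the agent claims it did. The `Task` class, the dictionary-based `State`, and the two toy tasks are hypothetical stand-ins for illustration, not OSWorld's actual harness.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for the desktop's final state after the agent finishes.
State = Dict[str, str]


@dataclass
class Task:
    name: str
    check: Callable[[State], bool]  # inspects the final state, not the agent's transcript


def success_rate(tasks: List[Task], final_states: Dict[str, State]) -> float:
    """Fraction of tasks whose final state passes its check."""
    if not tasks:
        return 0.0
    passed = sum(1 for t in tasks if t.check(final_states.get(t.name, {})))
    return passed / len(tasks)


# Two toy tasks: one finished correctly, one left the desktop unchanged.
tasks = [
    Task("rename_file", lambda s: s.get("file") == "report_final.txt"),
    Task("set_wallpaper", lambda s: s.get("wallpaper") == "blue.png"),
]
final_states = {
    "rename_file": {"file": "report_final.txt"},
    "set_wallpaper": {"wallpaper": "default.png"},
}

rate = success_rate(tasks, final_states)
print(f"{rate:.0%}")  # prints 50%
```

The point of the design is that a confident "done" from the agent is worth nothing. Only the state left on the machine is scored.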

Why 38% Is a Catastrophic Failure

  • A 38% success rate means 62% of tasks fail.
  • When an agent fails, it wastes time, money, and human attention.
  • Enterprise teams can't trust systems that fail more than half the time.
  • OpenAI's Operator is a great chatbot. It's not a computer use agent.

OSWorld-Verified shows Claude Mythos 5 leading at 85%. OpenAI's Operator sits at 38%. The gap is 47 percentage points. That's not progress. That's a disaster.
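For a rough sense of scale, the percentages can be turned back into task counts. This is a back-of-the-envelope sketch that assumes each score was computed over all 369 OSWorld-Verified tasks, which the leaderboard figures quoted here don't confirm, so treat the counts as approximations.

```python
TOTAL_TASKS = 369  # OSWorld-Verified task count quoted above

# Scores as quoted in this article.
scores = {"Claude Mythos 5": 0.85, "OpenAI Operator": 0.38}


def approx_tasks_passed(rate: float, total: int = TOTAL_TASKS) -> int:
    """Approximate number of tasks passed, assuming the rate covers all tasks."""
    return round(rate * total)


gap_points = round((scores["Claude Mythos 5"] - scores["OpenAI Operator"]) * 100)
passed = {name: approx_tasks_passed(rate) for name, rate in scores.items()}

print(gap_points)  # 47
print(passed)      # {'Claude Mythos 5': 314, 'OpenAI Operator': 140}
```

Under that assumption, the 47-point gap works out to roughly 174 tasks that the leader completes and Operator doesn't.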

The Real Problem With Computer Use Agents

The OSWorld leaderboard reveals a deeper issue. Most AI companies are building wrappers around their chatbots. They're not building computer use agents. A computer use agent needs to control a desktop, a browser, and a terminal. It needs to understand visual layouts, handle errors, recover from failures, and persist state. Most models can't do that. They hallucinate buttons they can't click. They get stuck in infinite loops. They forget what they were supposed to do five seconds ago. OSWorld exposes this brutally.
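What "handle errors, recover from failures, and persist state" means in practice can be sketched in a few lines. The example below is illustrative only: `run_with_recovery`, the step names, and the JSON checkpoint file are all hypothetical, and a real agent would drive a desktop instead of calling a stub. It shows bounded retries (so the agent can't loop forever) and a checkpoint written after every completed step (so it doesn't forget where it was).

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def run_with_recovery(
    steps: List[str],
    perform: Callable[[str], bool],
    state_path: Path,
    max_retries: int = 2,
) -> bool:
    """Run steps in order, retrying failures and checkpointing progress."""
    # Resume from a previous checkpoint if one exists.
    done = 0
    if state_path.exists():
        done = json.loads(state_path.read_text())["done"]

    for i in range(done, len(steps)):
        for _attempt in range(max_retries + 1):
            if perform(steps[i]):
                break
        else:
            # Retries exhausted: give up instead of looping forever.
            return False
        # Persist progress so a crash or restart doesn't lose the plan.
        state_path.write_text(json.dumps({"done": i + 1}))
    return True


# Demo with a stub action that fails once on "click_save", then succeeds.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "state.json"
    calls = {"click_save": 0}

    def flaky(step: str) -> bool:
        if step == "click_save":
            calls["click_save"] += 1
            return calls["click_save"] > 1
        return True

    ok = run_with_recovery(["open_app", "click_save", "close_app"], flaky, path)
    checkpoint = json.loads(path.read_text())

print(ok, checkpoint)  # True {'done': 3}
```

The retry cap and the checkpoint are the two pieces a thin chatbot wrapper tends to lack. Without them, one failed click either ends the run or repeats indefinitely.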

Why Coasty Is the Only Computer Use Agent That Matters

Coasty.ai is different. We built a computer use agent from the ground up. It controls real desktops, browsers, and terminals, not just API calls. Our agents swarm across multiple sessions to parallelize work. They handle errors, recover from failures, and persist state. That's why Coasty scored 82% on OSWorld. We're not wrapping a chatbot. We're building actual computer use agents.

The OSWorld benchmark is painful because it forces us to face the truth. Most AI agents are overhyped toys. If you're paying for enterprise automation and getting 38% success, you're wasting money. The gap between the best computer use agents and OpenAI's Operator is massive. Don't trust hype. Trust results. Coasty.ai scored 82% on OSWorld, more than double Operator's 38%. It's time to stop paying for hype and start using tools that actually work. Go to coasty.ai and see what real computer use looks like.
