OpenAI Operator Scores 38% on OSWorld. Coasty Scores 82%. The Truth About AI Agent Benchmarks 2026
OpenAI's Operator scored 38% on OSWorld. Claude Sonnet 4.6 scored 73%. Coasty scored 82%. That 44-point gap isn't a rounding error. It's a gulf wide enough to show that most AI computer-use agents are nowhere near ready for production.
The OSWorld Benchmark Actually Shows Something Scary
OSWorld-Verified is the standard benchmark for evaluating multimodal computer-use agents on 369 open-ended tasks across web and desktop applications. It's supposed to measure how well AI agents can actually use real software. The latest results are embarrassing for everyone except one player.
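To make the metric concrete: an OSWorld-Verified score is just a per-task pass rate over those 369 tasks. Here's a minimal sketch of the arithmetic, assuming per-task pass/fail results in a JSON file; the file name and field names are hypothetical, not the official harness format.

```python
# Minimal sketch of an OSWorld-style success rate: fraction of tasks
# passed. "results.json", "task_id", and "passed" are hypothetical
# names, not the official harness format.
import json

def success_rate(path: str) -> float:
    """Return the fraction of tasks marked as passed."""
    with open(path) as f:
        results = json.load(f)  # expected: list of {"task_id": ..., "passed": bool}
    if not results:
        return 0.0
    return sum(1 for r in results if r["passed"]) / len(results)

if __name__ == "__main__":
    # On 369 tasks, 38% means roughly 140 successes; 82% means roughly 303.
    print(f"Success rate: {success_rate('results.json'):.1%}")
```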
The Top 3 Were Nothing Like Each Other
- OpenAI Operator scored 38% on OSWorld-Verified. That means it fails more than three out of every five tasks it's given.
- Claude Sonnet 4.6 scored 73% on OSWorld-Verified. Impressive until you realize it lost to Coasty.
- Coasty scored 82% on OSWorld-Verified. That puts it at the top of the leaderboard, and it's the only score that actually matters.
For context, the full OSWorld-Verified leaderboard also lists Holo3-35B-A3B at 82.6% and Claude Mythos Preview at 79.6%. Coasty and Holo3 sit at the top of the table, 9 percentage points above Claude Sonnet 4.6 and more than 40 above OpenAI Operator.
Why OpenAI's 38% Is Actually Pretty Terrible
OpenAI announced Operator as their flagship AI computer-use agent. They talked about native computer use and tool integration. Then the OSWorld results came out and Operator scored 38%. That's no better than the standalone Computer-Using Agent from 2025, which scored 38.1% on the same benchmark. OpenAI hasn't actually improved since last year. They just rebranded the same failed product.
Claude Sonnet 4.6 Isn't All It's Cracked Up To Be
Anthropic is hyping Claude Sonnet 4.6 as a breakthrough for computer use. The company reports 72.5% on OSWorld-Verified for Sonnet 4.6, which is where the rounded 73% figure comes from. That number sounds impressive until you see what happens when you actually try to use these agents in production. Claude agents frequently get stuck in infinite loops, fail to click specific UI elements, and require constant human intervention. The benchmark number tells you what the agent does in a controlled lab environment. It doesn't tell you what happens when a real user has an urgent deadline.
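The infinite-loop failure mode is worth dwelling on, because it's cheap to guard against. Below is a generic sketch of the guardrail production deployments typically bolt on, not Anthropic's code: a hard step budget plus detection of repeated identical actions. `agent_step` (the model call) and `execute` (the thing that clicks and returns a fresh screenshot) are hypothetical stand-ins.

```python
# Generic guardrail sketch, not Anthropic's API: cap agent steps and
# bail out when the agent repeats the same action, the failure mode
# described above. `agent_step` and `execute` are hypothetical
# stand-ins for your model call and your executor.
from collections import deque

MAX_STEPS = 50      # hard budget so a stuck agent can't run forever
REPEAT_WINDOW = 3   # identical consecutive actions that count as a loop

def run_with_guardrails(agent_step, execute, goal, initial_observation):
    observation = initial_observation
    recent = deque(maxlen=REPEAT_WINDOW)
    for step in range(MAX_STEPS):
        action = agent_step(goal, observation)
        if action == "done":
            return {"status": "success", "steps": step}
        recent.append(action)
        if len(recent) == REPEAT_WINDOW and len(set(recent)) == 1:
            # Same action three times in a row: escalate to a human
            # instead of burning tokens in an infinite loop.
            return {"status": "needs_human", "steps": step, "stuck_on": action}
        observation = execute(action)
    return {"status": "step_budget_exhausted", "steps": MAX_STEPS}
```

A three-in-a-row check misses subtler A-B-A-B cycles; deployments that care often hash the (action, observation) pair instead, so any revisited state trips the alarm.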
Why Coasty Dominated Every Benchmark
Coasty achieved 82% on OSWorld-Verified because it doesn't rely on inference-time scaling or clever prompting tricks. Coasty actually controls real desktops, browsers, and terminals. It runs agents in parallel when needed, handles infrastructure failures gracefully, and can execute thousands of tasks simultaneously. Other agents are essentially guessing what to click based on screenshots. Coasty is actually clicking. That's the difference between 38% and 82%.
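Coasty's internals aren't public, so take this as a generic sketch of the pattern that paragraph describes: bounded parallelism across many live sessions, with retries when the infrastructure (not the task) fails. Everything here, including `run_task`, is illustrative.

```python
# Generic sketch of parallel task execution with retry on infrastructure
# failures. This is NOT Coasty's implementation (which isn't public);
# `run_task` is a placeholder for driving one real desktop/browser session.
import asyncio
import random

MAX_CONCURRENT = 100  # bounded parallelism across live sessions
MAX_RETRIES = 3       # retry budget for infra errors, not task failures

async def run_task(task_id: int) -> str:
    """Placeholder: drive one desktop/browser session to completion."""
    await asyncio.sleep(random.uniform(0.01, 0.05))
    if random.random() < 0.1:
        raise ConnectionError("VM lost")  # simulated infrastructure failure
    return f"task {task_id}: done"

async def run_with_retries(task_id: int, sem: asyncio.Semaphore) -> str:
    async with sem:
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                return await run_task(task_id)
            except ConnectionError:
                # Back off briefly, then re-provision; real systems wait longer.
                await asyncio.sleep(0.1 * attempt)
    return f"task {task_id}: gave up after {MAX_RETRIES} infra failures"

async def main(n_tasks: int = 1000) -> None:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    results = await asyncio.gather(*(run_with_retries(i, sem) for i in range(n_tasks)))
    print(sum("done" in r for r in results), "of", n_tasks, "tasks completed")

if __name__ == "__main__":
    asyncio.run(main())
```

The key design choice is that retries are reserved for infrastructure errors; a task the agent genuinely can't do should fail fast and surface, not get silently re-run.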
Why Your AI Computer-Use Agent Is a Massive Waste of Money
Most companies are paying thousands of dollars per month for AI agents that can't complete basic tasks. They're buying OpenAI Operator at premium prices, deploying Claude Computer Use, and wondering why nothing actually gets done. The OSWorld results expose the truth. Those agents are toys. They're demos. They're not production tools. If you're still paying for a computer-use agent that scores below 70% on OSWorld-Verified, you're throwing money away. You're paying for the illusion of automation while your employees still have to do everything manually.
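"Throwing money away" is really a unit-economics claim, and you can put rough numbers on it. A back-of-the-envelope sketch, assuming a hypothetical $0.50 per attempt and the OSWorld-Verified rates quoted above: failed attempts still bill, so a lower success rate multiplies your cost per completed task.

```python
# Back-of-the-envelope cost per *completed* task. The $0.50-per-attempt
# price is hypothetical; the success rates are the OSWorld-Verified
# numbers quoted above.
COST_PER_ATTEMPT = 0.50  # dollars, hypothetical

for name, rate in [("Operator", 0.38), ("Sonnet 4.6", 0.73), ("Coasty", 0.82)]:
    # If each attempt succeeds independently with probability p, the
    # expected number of attempts per success is 1/p (geometric), so
    # cost per completed task is COST_PER_ATTEMPT / p.
    print(f"{name}: ${COST_PER_ATTEMPT / rate:.2f} per completed task")
```

At those rates, Operator costs more than twice as much per completed task as the leaders, before you count the human time spent cleaning up after it.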
The OSWorld benchmark results are a reality check for everyone in AI automation. The gap between OpenAI Operator (38%), Claude Sonnet 4.6 (73%), and Coasty (82%) proves that not all computer-use agents are created equal. The difference isn't in the model size or the training data. It's in whether the agent actually controls real desktops and browsers or just pretends to. If you want AI that actually does work instead of just looking busy, check out Coasty.ai. It's the only computer-use agent on the leaderboard that delivers on the promise of autonomous automation.