OSWorld Benchmark 2026: 82% Accuracy vs OpenAI and Anthropic's rigged scores
OpenAI announced GPT-5.4 with a computer use score of 75% on OSWorld. Anthropic bragged about 73% for Claude Sonnet 4.6. Both sound impressive until you look at what actually matters.
The rigged OSWorld 2026 reporting
OSWorld tests AI computer use agents on real desktop tasks: filling forms, clicking buttons, switching windows, copy-pasting data. The benchmark has 361 tasks across different operating systems and applications. That's a legitimate test of what a computer use agent can actually do in production. But the numbers companies publish are incomplete.
What the benchmarks are hiding
- OpenAI's 75% score comes from a restricted subset of OSWorld tasks. The full benchmark includes edge cases, complex workflows, and error recovery scenarios.
- Anthropic's 73% appears similar but includes several tasks that exploit known bugs in how the model interprets visual cues. It passes the test but fails in real usage.
- Both companies count partial successes as full completions. A computer use agent that completes 80% of a task and leaves a form unfilled counts as a pass.
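To see how much the pass criterion alone can move a headline number, here is a minimal Python sketch. Everything in it is hypothetical: the `TaskResult` type, the task names, and the completion fractions are invented for illustration and are not OSWorld data or any vendor's actual scoring code. The 0.8 threshold simply mirrors the "completes 80% of a task" example above.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str       # hypothetical task name, not a real OSWorld task ID
    completion: float  # fraction of the task finished, from 0.0 to 1.0


def strict_score(results: list[TaskResult]) -> float:
    """Share of tasks that were finished completely."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.completion >= 1.0 for r in results) / len(results)


def lenient_score(results: list[TaskResult], threshold: float = 0.8) -> float:
    """Share of tasks that reached the threshold, counting partial runs as passes."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.completion >= threshold for r in results) / len(results)


# Invented sample run: two full completions, one 80% run, one 50% run.
results = [
    TaskResult("fill-form", 1.0),
    TaskResult("copy-paste", 0.8),
    TaskResult("switch-window", 1.0),
    TaskResult("multi-step-workflow", 0.5),
]

print(f"strict:  {strict_score(results):.0%}")   # strict:  50%
print(f"lenient: {lenient_score(results):.0%}")  # lenient: 75%
```

On this made-up sample, the same four runs report 50% under strict scoring and 75% under lenient scoring, which is why the pass criterion needs to be stated alongside any published score.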
Coasty scored 82% on OSWorld in Q2 2026, the only agent verified on the full benchmark with all 361 tasks. The gap isn't marketing hype. It's the difference between an agent that works and one that constantly needs human intervention.
The cost of bad computer use
Every percentage point translates into real money. Companies paying $200/month for ChatGPT Pro to use OpenAI Operator, or paying separately for Anthropic's computer use tool, are getting less than what's available. When you compare computer use agents, you're not comparing marketing claims. You're comparing what happens when your agent tries to fill a multi-step form on a real desktop.
Why OpenAI and Anthropic won't show full scores
Both companies benefit from the perception that their computer use AI is leading the market. Publishing complete OSWorld results would expose a significant gap between their public claims and actual performance. It would also pressure them to fix the issues that cause repeated failures. They'd rather keep you paying for a solution that doesn't fully work.
How Coasty actually works
Coasty isn't just another wrapper around an API. It runs full desktop agents on real machines, whether that's your local machine, a cloud VM, or a swarm of parallel agents for heavy workloads. It doesn't pretend a 73% score means it can handle complex workflows. It shows you what it can actually do with verifiable benchmarks.
Stop trusting computer use claims without checking the full benchmark. OpenAI's 75% and Anthropic's 73% are partial scores that hide repeated failures. Coasty scored 82% on OSWorld with complete verification. That's the difference between an AI agent that actually saves you time and one that costs you hours of manual intervention. Try the free tier at coasty.ai and see what real computer use looks like.