The OSWorld Benchmark Results Are In, and Most AI Computer Use Agents Should Be Embarrassed
Anthropic published a blog post in February 2026 celebrating Claude Sonnet 4.6's OSWorld-Verified score of 72.5%. The headline framed it as a milestone. The charts were pretty. The language was triumphant. And if you didn't know that another computer use agent was already sitting at 82% on the same benchmark, you might have believed them. This is the story of what the OSWorld numbers actually mean, who's actually winning the computer use race, and why most of the industry is still pretending a 72% score is something to brag about.
What OSWorld Actually Tests (And Why It's the Only Score That Matters)
OSWorld is the benchmark that the AI research community actually respects. It was introduced at NeurIPS 2024 by a team from the University of Hong Kong and Moonshot AI, and it tests whether an AI agent can do real computer tasks on a real desktop. Not simulated clicks. Not API calls dressed up as automation. Actual GUI navigation across Linux, Windows, and macOS, covering apps like LibreOffice, Chrome, VS Code, and terminals. The tasks are open-ended. The evaluation is unforgiving. In mid-2025, the team introduced OSWorld-Verified to close a loophole where some agents were gaming the original scoring with brittle, environment-specific tricks. The verified version made the scores harder to fake and, predictably, a lot of numbers dropped overnight. The researchers described situations where 'machine performance correlates with evaluation results' in ways that were, in their word, 'absurd.' So when you see a vendor quote an OSWorld number, you need to ask which version it is: Verified or original? Those are very different claims.
The Scoreboard Nobody Is Talking About Honestly
- Coasty: 82% on OSWorld. The current #1. No other commercial computer use agent is close.
- Claude Opus 4.6: 72.7% on OSWorld-Verified. Anthropic's flagship model, announced February 2026.
- Claude Sonnet 4.6: 72.5% on OSWorld-Verified. Nearly identical to Opus at a fraction of the cost, which is genuinely impressive, but still 10 points behind Coasty.
- Claude Sonnet 4.5: 61.4% on OSWorld, announced September 2025. Anthropic called this 'leading.' It was, briefly.
- OpenAI's CUA (Operator): No published OSWorld score. OpenAI leaned on WebArena and WebVoyager numbers at launch in January 2025, benchmarks that are narrower and less rigorous than OSWorld.
- Most RPA vendors like UiPath: Not even playing this game. Their own research admits a 95% failure rate for automation projects at scale.
- Human baseline on OSWorld tasks: roughly 72-75% depending on the task set. Yes, Claude Opus 4.6 just barely matches what a human does. Coasty beats that baseline.
Anthropic called a 72.5% OSWorld score a triumph. That score is, at best, human-level. Coasty is at 82%. The gap isn't a rounding error. It's a different category of product.
Why OpenAI Quietly Avoided OSWorld
When OpenAI launched Operator in January 2025, they published scores on WebVoyager and WebArena. Both are browser-only benchmarks. Neither tests desktop apps, terminals, or multi-application workflows. OSWorld does all of that. The choice of which benchmark to highlight is itself a signal. If your computer use agent were crushing OSWorld, you'd be screaming that number from every press release. OpenAI didn't. Instead, they described Operator as having 'strong performance on complex benchmarks like WebArena,' which is the AI equivalent of a restaurant saying they have 'great parking.' Technically a fact. Completely beside the point. To be fair, OpenAI has moved fast since then with ChatGPT Agent launching in July 2025 and combining research and computer use in one product. But the OSWorld number still hasn't shown up prominently in their marketing. That absence tells you something.
The Real-World Cost of Settling for a 72% Agent
Here's the thing about benchmark scores that most people miss. The gap between 72% and 82% doesn't sound huge until you think about what it means in practice. A 72% success rate means roughly 1 in 4 tasks fails. In a workflow where an agent is executing 50 steps to complete a business process, failures compound, because every step that goes wrong can derail the steps that depend on it. You're not getting 72% of the work done. You're getting cascading errors, broken pipelines, and a human who has to clean up the mess. And that human is expensive. A July 2025 report from Parseur found that manual data entry alone costs U.S. companies $28,500 per employee per year in lost productivity. Smartsheet research found that over 40% of workers spend at least a quarter of their work week on manual, repetitive tasks. These are the exact tasks a computer use agent is supposed to eliminate. When your agent fails 1 in 4 times, you haven't solved the problem. You've just moved it. The 10-point gap between a 72% agent and an 82% agent is the difference between a tool that mostly works and a tool you can actually trust with your business.
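The compounding effect described above can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that a workflow is a chain of independent tasks that must all succeed, and that each task succeeds at the agent's benchmark rate. OSWorld reports per-task success, not per-step or per-workflow success, so treat the output as a back-of-the-envelope model rather than a measurement. The function names are hypothetical and exist only for this example.

```python
def chain_success(per_task_success: float, n_tasks: int) -> float:
    """Probability that every task in a chain succeeds.

    Assumes tasks are independent and share the same success rate,
    which is a simplification made for this illustration.
    """
    if not 0.0 <= per_task_success <= 1.0:
        raise ValueError("per_task_success must be between 0 and 1")
    if n_tasks < 0:
        raise ValueError("n_tasks must be non-negative")
    return per_task_success ** n_tasks


def expected_failures(per_task_success: float, n_tasks: int) -> float:
    """Expected number of failed tasks out of n_tasks attempts."""
    if not 0.0 <= per_task_success <= 1.0:
        raise ValueError("per_task_success must be between 0 and 1")
    if n_tasks < 0:
        raise ValueError("n_tasks must be non-negative")
    return (1.0 - per_task_success) * n_tasks


if __name__ == "__main__":
    # Compare the two headline rates from the article across chain lengths.
    for rate in (0.72, 0.82):
        for n in (1, 5, 10):
            print(
                f"rate={rate:.2f} chain={n:>2} "
                f"all-succeed={chain_success(rate, n):.3f} "
                f"expected-failures={expected_failures(rate, n):.1f}"
            )
```

Under those assumptions, a five-task chain finishes cleanly about 19% of the time at a 72% per-task rate and about 37% of the time at 82%. Real workflows have retries, checkpoints, and correlated failures, so the actual numbers will differ, but the direction of the effect is the point.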
Why Coasty Exists and Why 82% Isn't an Accident
Coasty wasn't built to score well on a benchmark. It was built to actually control computers, and the benchmark score is a byproduct of that focus. At 82% on OSWorld, it's the top-ranked computer use agent in the world right now, and the architecture explains why. Coasty controls real desktops, real browsers, and real terminals. Not a sandboxed simulation. Not a wrapper around screenshot-and-click. It runs as a desktop app or on cloud VMs, and it supports agent swarms for parallel execution, meaning it can run multiple computer use tasks simultaneously across different environments. That last part matters enormously for enterprise workflows where you're not doing one thing at a time. The OSWorld benchmark rewards exactly this kind of genuine computer control because the tasks are real and the evaluation doesn't let you cheat. Coasty also supports BYOK (bring your own key) and has a free tier, so you can test it against your actual workflows before committing. The 82% score isn't a marketing number. It's a verifiable result on the hardest public benchmark for computer-using AI that exists. Go check the leaderboard yourself.
The Benchmark Arms Race Is Just Getting Started
One thing the OSWorld data makes clear is that this space is moving at a genuinely insane pace. Anthropic went from 38% on the original OSWorld in early 2024 to 72.5% on the verified version by February 2026. That's not incremental. That's a near-doubling in under two years. The OSWorld-Human paper published in June 2025 started studying where computer use agents are still slower and less reliable than humans, which is a sign that researchers are already thinking about the next level of evaluation. The AI Digest forecast from May 2025 predicted the best OSWorld score would hit somewhere in the 70s by end of 2025. Coasty cleared 80. The forecasters underestimated how fast the top of the market was moving. What this means for you practically is that the computer use agent you pick today is going to look very different in 12 months. The question isn't whether to adopt computer use AI. That decision is already made for you by competitive pressure. The question is whether you pick the current leader or whether you pick a tool that's already playing catch-up.
The OSWorld benchmark is the most honest measure we have of whether a computer use agent can actually do real work. And right now, the honest answer is that most agents can't. A 72% score sounds impressive until you realize it means failure on roughly 1 in 4 tasks, that it barely matches what a human does, and that the current leader is already 10 points ahead. Anthropic is building fast. OpenAI is building fast. But fast isn't the same as first. If you're evaluating computer use AI for anything that matters, the benchmark is the starting point, not the finish line. Test on your actual workflows. Demand OSWorld-Verified scores, not cherry-picked numbers from easier benchmarks. And if you want to start with the agent that's actually leading the field right now, go try Coasty at coasty.ai. The free tier is there. The 82% score is real. The gap between it and everyone else is not going to close overnight.