The OSWorld Benchmark Results Are In and Most Computer Use Agents Are Embarrassingly Bad
Everyone on AI Twitter is popping champagne over OSWorld scores right now. GPT-5.4 hit 75%. Claude Opus 4.6 hit 72.7%. Sonnet 4.6 crept in at 72.5%. And the breathless press releases are rolling in as if these numbers represented some kind of species-defining leap forward. Here's the thing nobody wants to say out loud: a human doing these same tasks scores somewhere between 72% and 84% on OSWorld. So after billions of dollars in compute, years of research, and enough hype to fill a stadium, the best computer use agents from the two best-funded AI companies on Earth are just now, barely, approaching what your average office worker does before their second cup of coffee. That's not a celebration. That's a confession. And one company, quietly, has already lapped the field.
What OSWorld Actually Tests (And Why the Scores Are Damning)
OSWorld is the benchmark that actually matters for computer use AI. Not coding puzzles. Not trivia. Not reasoning about hypotheticals. OSWorld drops an AI agent into a real desktop environment, with real applications, and asks it to complete real tasks. Spreadsheets. File management. Browser workflows. Terminal commands. Cross-app operations. The kind of stuff that eats 15 hours of each employee's week, every single week, according to productivity research across thousands of businesses. The benchmark was introduced at NeurIPS 2024 and has since become the gold standard for evaluating whether a computer-using AI can actually function in the world, not just in a demo. The scoring is brutally honest. Either the task gets done correctly or it doesn't. There's no partial credit for 'almost filling out the form.' When GPT-5.4 scores 75% on OSWorld, it means one in four tasks fails. In an actual business workflow, that failure rate compounds: chain ten tasks at 75% each and the odds of a clean end-to-end run fall to roughly 6%. One failed step in a ten-step process doesn't give you a 75% outcome. It gives you a broken outcome.
The Leaderboard Nobody Wants to Talk About Honestly
- GPT-5.4 (OpenAI, March 2026): 75.0% on OSWorld-Verified. OpenAI called it a breakthrough. Count the human baseline and Coasty, and the leaderboard called it third place.
- Claude Opus 4.6 (Anthropic, Feb 2026): 72.7%. Anthropic's most expensive, most powerful model. Scores lower than GPT-5.4 on the one benchmark that actually tests computer use.
- Claude Sonnet 4.6 (Anthropic, Feb 2026): 72.5%. Nearly identical to Opus at a fraction of the cost, which raises its own uncomfortable questions about what you're paying for.
- Claude Sonnet 4.5 (Anthropic, Sept 2025): 61.4%. That's the score Anthropic was celebrating just six months ago. The pace of improvement is real, but the starting point was rough.
- Human baseline on OSWorld: roughly 72-84%, depending on the task category. The AI is finally in the conversation. It's not yet pulling ahead.
- Coasty: 82% on OSWorld. No press release. No conference keynote. Just the number, sitting there, above every competitor on the board.
GPT-5.4 and Claude Opus 4.6 together represent tens of billions in R&D. Coasty beats both of them on the only benchmark that tests whether a computer use agent can actually do real work. The 7-point gap between Coasty at 82% and GPT-5.4 at 75% is roughly three times the gap between GPT-5.4 and Claude Opus 4.6.
Why 75% Is Not Good Enough for Real Business Automation
Let's do some math that the benchmark press releases conveniently skip. Say your company automates 200 computer tasks per day using an AI agent scoring 75% on OSWorld. That's 50 failures every single day. Some of those failures are recoverable. Some silently corrupt data. Some trigger downstream errors in other systems that nobody catches for a week. This is exactly the failure mode that killed the first wave of RPA. UiPath and its peers built automation on brittle, rules-based robots that broke whenever an interface changed by a pixel. Companies spent more on maintenance than they saved on labor. The promise was 'set it and forget it.' The reality was a full-time job babysitting robots. AI computer use agents were supposed to fix this by being adaptive, by seeing the screen the way a human does and figuring it out. And they are better than RPA. But a 25% failure rate is not 'figuring it out.' It's a different flavor of the same problem. The jump from 75% to 82% sounds small. In production, it's the difference between a system you can trust and a system you have to supervise.
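To make the compounding concrete, here's a back-of-the-envelope sketch. It's illustrative arithmetic only: it assumes task outcomes are independent (generous, since real failures cascade downstream) and reuses the 200-tasks-per-day and ten-step-workflow figures from above.

```python
# Illustrative only: how per-task success rates compound across a day
# of automation and across a multi-step workflow.
# Assumes independent task outcomes, which is generous to the agent.

TASKS_PER_DAY = 200
WORKFLOW_STEPS = 10

for label, p_success in [("75% agent", 0.75), ("82% agent", 0.82)]:
    daily_failures = TASKS_PER_DAY * (1 - p_success)   # expected failures per day
    clean_run = p_success ** WORKFLOW_STEPS            # odds all ten steps pass
    print(f"{label}: ~{daily_failures:.0f} failed tasks/day, "
          f"{clean_run:.1%} odds of a clean {WORKFLOW_STEPS}-step run")
```

Run it and the 'small' jump stops looking small: the 75% agent leaves about 50 failures a day and completes a clean ten-step run roughly 5.6% of the time, while the 82% agent leaves about 36 failures and more than doubles the clean-run odds to roughly 13.7%.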
The Dirty Secret of AI Agent Benchmarks in 2026
Here's what makes the OSWorld results even more interesting: most of the agents people are actually deploying in production have never been tested on OSWorld at all. The benchmark leaderboard is crowded at the top with research configurations and carefully tuned evaluation runs. The chatbot your company bolted onto a browser last quarter? It's probably scoring somewhere in the 40s on OSWorld tasks, if anyone bothered to check. Research from the Stanford 2026 AI Index confirms that while frontier models are advancing fast, the gap between benchmark performance and real-world deployment performance remains wide. Models that look good in controlled evaluations often fall apart on the messy, inconsistent interfaces of actual enterprise software. The companies winning at computer use AI right now aren't the ones with the biggest models. They're the ones who built specifically for the desktop environment, who tested against real applications, and who closed the gap between 'works in the demo' and 'works in your specific, weird, legacy-software-laden environment.' That's a product problem as much as a model problem. And it's why a purpose-built computer use agent outperforms a general-purpose LLM with computer use bolted on.
Why Coasty Exists
I'm going to be straight with you. I work at Coasty. But I also looked at every score on that leaderboard before I took this job, and the 82% on OSWorld is why I'm here. Coasty isn't a general-purpose LLM that learned to click buttons as an afterthought. It's a computer use agent built from the ground up to control real desktops, real browsers, and real terminals. Not API calls pretending to be automation. Actual screen control, the same way a human operator would do it. The desktop app runs locally. Cloud VMs are available for parallel execution. Agent swarms let you run multiple tasks simultaneously, which means the 'I'll get to that in ten minutes' bottleneck of single-threaded automation just disappears. There's a free tier if you want to test it without a procurement process. BYOK is supported if your security team has opinions about API keys. The 82% OSWorld score isn't marketing. It's a measurement of how often the agent actually completes the task. At 82%, you're above the human baseline for most task categories. You're not supervising the agent. The agent is handling it. That's the line everyone else is still trying to cross.
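If 'agent swarms' sounds abstract, here's a hypothetical sketch of what parallel dispatch buys you. Nothing below is Coasty's actual SDK; run_agent_task is a stand-in for whatever call starts one agent session, and the concurrency pattern is just standard Python.

```python
# Hypothetical sketch: swarm-style parallel dispatch vs. single-threaded
# automation. run_agent_task is a placeholder, NOT Coasty's real API.

from concurrent.futures import ThreadPoolExecutor, as_completed

def run_agent_task(task: str) -> str:
    """Stand-in for handing one task to an agent session (local desktop
    or cloud VM) and blocking until it finishes."""
    return f"completed: {task}"

tasks = [
    "Export last month's invoices from the billing app",
    "Reconcile the export against the accounting spreadsheet",
    "File the mismatches to the shared drive",
]

# Dispatch all three at once: total wall time approaches the slowest
# task, not the sum of all of them.
with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
    futures = [pool.submit(run_agent_task, t) for t in tasks]
    for done in as_completed(futures):
        print(done.result())
```

The shape is the point, not the API: single-threaded automation pays the sum of the task times, a swarm pays roughly the time of the slowest task, and that's where the ten-minute queue disappears.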
The OSWorld benchmark results for 2026 tell a clear story if you're willing to read it without the PR spin. The biggest names in AI are finally approaching human-level computer use, and that's genuinely impressive progress. But 'approaching' is not 'exceeding,' and for automation to actually save your company the 15 hours per employee per week currently being eaten by repetitive computer tasks, you need an agent that's above the human baseline, not just near it. One agent is there. The rest are still catching up. If you're evaluating computer use AI right now, the only honest question is: why would you deploy something scoring in the 70s when 82% exists and has a free tier? Go check it out at coasty.ai. Run it on your actual workflows. The benchmark score will start making sense very quickly.