The 2026 AI Agent Benchmark Results Are In, and Most Computer Use Agents Are Lying to You
Manual data entry alone costs U.S. companies $28,500 per employee every single year. That stat dropped in mid-2025 and barely anyone flinched. Meanwhile, a dozen AI companies are racing to build computer use agents that could wipe that cost out entirely, and they're all claiming to be number one. So let's do something radical: look at the actual benchmark numbers. OSWorld is the closest thing this industry has to a fair fight. It throws 369 real computer tasks at an agent, no hints, no shortcuts, just raw execution on a live desktop. The scores are public. The methodology is documented. And the gap between the leader and the pack is genuinely embarrassing for most of the field.
What OSWorld Actually Tests (And Why It's the Only Score That Matters)
Most AI benchmarks are trivia contests dressed up as capability tests. OSWorld is not. It drops an agent into a real desktop environment and says: complete this task. Open this app, find this file, fill out this form, navigate this browser, write this script. No API shortcuts. No pre-loaded context. The agent sees pixels, it moves a cursor, it types. It either finishes the task or it doesn't. That's it. The benchmark covers tasks across web browsers, productivity software, terminals, and file systems. It's the closest simulation of what a real knowledge worker actually does all day. When OpenAI's original Computer-Using Agent launched, it scored 38.1% on OSWorld. Anthropic's Claude Sonnet 4.5 hit 61.4% and made headlines. Claude Sonnet 4.6 pushed higher. GPT-5.3 Codex clocked in around 64.7%. Those are real numbers from real evaluations. They're also not first place. Not even close.
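For a sense of how unforgiving that setup is, here's a minimal sketch of what an OSWorld-style evaluation loop looks like. The names (make_env, act, step, check_success) are illustrative stand-ins, not the benchmark's actual API; the point is that the agent gets pixels in, emits GUI actions out, and a scripted checker decides pass or fail with no partial credit.

```python
# Minimal sketch of an OSWorld-style evaluation loop. All names here
# (make_env, reset, act, step, check_success) are illustrative, not the
# benchmark's real API.

def evaluate(agent, tasks, max_steps=100):
    """Run each task end-to-end on a live desktop and report the completion rate."""
    completed = 0
    for task in tasks:
        env = task.make_env()       # real OS session: apps installed, files pre-arranged
        obs = env.reset()           # initial observation is a raw screenshot, nothing more
        for _ in range(max_steps):
            # The agent sees pixels and emits a GUI action, e.g.
            # {"type": "click", "x": 412, "y": 88} or {"type": "type", "text": "Q3 totals"}
            action = agent.act(obs)
            obs, done = env.step(action)
            if done:
                break
        # Success is checked by a task-specific script (file exists, form
        # submitted, setting changed), so there is no partial credit.
        if task.check_success(env):
            completed += 1
        env.close()
    return completed / len(tasks)   # 0.381, 0.614, 0.82 -- the scores in this article
```

The inner loop is what separates this from multiple-choice benchmarks: a wrong click doesn't end the episode, it just leaves the agent somewhere it didn't expect to be, and it has to recover within the step budget.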
The Leaderboard Nobody Wants to Talk About
- OpenAI's original CUA: 38.1% on OSWorld. In 2025. They shipped it to paying customers anyway.
- GPT-5.3 Codex: ~64.7% on OSWorld. Better, but still leaves more than a third of real tasks incomplete.
- Claude Sonnet 4.5: 61.4% on OSWorld. Anthropic's best-publicized computer use result before 4.6.
- Claude Sonnet 4.6: Higher than 4.5, but Anthropic quietly switched to 'OSWorld-Verified' mid-stream, making direct comparisons slippery.
- Coasty: 82% on OSWorld. The highest verified score on the leaderboard. Not a press release claim. A number you can check.
- The human baseline on OSWorld sits around 72-75%. Coasty is the only computer use agent that has cleared it.
- Smartsheet research found workers waste a full quarter of their work week on manual, repetitive tasks. At 38% task completion, an AI agent is barely making a dent.
OpenAI shipped a computer use agent to paying customers at 38.1% task completion. That means it failed on more than 6 out of every 10 real desktop tasks. And people paid $200 a month for it.
Benchmark Laundering Is a Real Problem
Here's the thing that should make you suspicious of almost every AI agent announcement in 2026: companies get to choose which benchmark they highlight. Anthropic switched from OSWorld to 'OSWorld-Verified' partway through their model releases, which makes it genuinely hard to compare Sonnet 4.5 to Sonnet 4.6 on a level playing field. OpenAI points to WebArena and WebVoyager scores when the OSWorld numbers aren't flattering. Google touts LMArena Elo for Gemini 3 because it hit 1501 there. Every company is playing a different game and calling it a victory. The State of AI Report 2025 put it plainly: benchmarks buckled under contamination and variance all year. What you want is one consistent, hard, real-world test that hasn't been gamed. OSWorld, specifically OSWorld-Verified, is still the closest thing to that. And on that test, the ranking is not ambiguous. There is a clear leader, and it's not the company with the biggest marketing budget.
RPA Was Supposed to Fix This Already. It Didn't.
UiPath and the legacy RPA crowd have been selling 'automation' to enterprises since the mid-2010s. The pitch was compelling: record your clicks, replay them forever, never hire a data entry person again. The reality was more complicated. RPA bots are brittle. Change one pixel in your UI and the whole workflow breaks. Scaling them requires dedicated bot maintenance teams, which kind of defeats the purpose. UiPath faced an AI-related securities lawsuit in 2024 precisely because the gap between what they promised and what they delivered had become a legal liability. Meanwhile, over half of employees (56%, per Parseur's 2025 data entry report) say they're burned out by repetitive data tasks. The automation that was supposed to free them is either too brittle to trust or too expensive to maintain. That's the actual problem a real computer use agent needs to solve. Not a demo. Not a benchmark cherry-picked for a press release. A system that can sit down at a real computer and just handle it.
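To make the brittleness concrete, here's a minimal sketch of the kind of coordinate-replay script traditional RPA tooling boils down to. The coordinates and the task itself are purely illustrative, not any vendor's actual output, but the failure mode is real: every click is pinned to a pixel position, so a minor layout change leaves the bot clicking on nothing.

```python
# Illustrative only: the shape of a recorded click-replay script,
# not any specific RPA vendor's output. Coordinates are made up.
import pyautogui

pyautogui.click(412, 318)              # "Export" button, at its position on recording day
pyautogui.typewrite("Q3_report.xlsx")  # file name goes into whatever field happens to have focus
pyautogui.click(640, 512)              # "Save" button; shift the dialog 20 pixels and this clicks the void
```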
Why Coasty Exists
I'm not going to pretend I stumbled onto Coasty by accident. I was looking for a computer use agent that could actually hold up under real work conditions, not just controlled demos. The 82% OSWorld score is what made me look twice. That's above the human baseline. That means a Coasty agent completes real desktop tasks at a rate that beats the average person doing the same tasks. It controls actual desktops, real browsers, live terminals. Not API wrappers. Not simulated environments. When it clicks something, something actually gets clicked. The architecture supports agent swarms too, so if you need 10 tasks run in parallel, you're not waiting for them to queue up one by one. There's a free tier if you want to poke at it before committing. BYOK if you want to bring your own model keys. The thing that keeps bringing me back is that the benchmark score isn't a marketing number. It's the same OSWorld leaderboard every other agent gets evaluated on. Coasty just happens to be at the top of it. That's not a coincidence.
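As a concrete illustration of the swarm idea, here's a hedged sketch of what fanning independent tasks out to parallel agent sessions looks like in plain Python. The run_agent_task function and the task strings are hypothetical placeholders, not Coasty's actual SDK; the point is simply that independent desktop tasks don't have to queue behind each other.

```python
# Hypothetical sketch of dispatching tasks to an agent swarm in parallel.
# run_agent_task is a placeholder, not a real client library call.
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_agent_task(task: str) -> dict:
    # Placeholder: a real version would open one agent session per task,
    # hand over the instruction, and block until the desktop work finishes.
    return {"task": task, "status": "completed"}

tasks = [
    "Export last month's invoices from the billing portal to CSV",
    "Update the stale records flagged in the CRM",
    "Pull the weekly usage numbers into the ops spreadsheet",
]

# One session per task, all running at once, results collected as they finish.
with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
    futures = {pool.submit(run_agent_task, t): t for t in tasks}
    for future in as_completed(futures):
        print(futures[future], "->", future.result()["status"])
```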
Here's my honest take after going through all of this: 2026 is the year the computer use agent market splits into two groups. Group one is the agents that can actually do the work, the ones scoring in the high 70s and above on OSWorld, the ones that don't collapse when the UI changes or the task gets slightly weird. Group two is everyone else, running polished demos, switching benchmark definitions when the numbers get uncomfortable, and charging enterprise prices for tools that fail on a third of real tasks. The $28,500-per-employee cost of manual work isn't going away because you deployed a bot that works 62% of the time. It's going away when you deploy something that works. Right now, one computer use agent is sitting at 82% on the only benchmark that actually simulates real work. If you're serious about automation and not just shopping for a good story to tell your board, start there. Go check out coasty.ai and run the free tier against your actual workflows. The benchmark is just a number. Your own results will tell you everything.