The OSWorld Benchmark Results Are In and Most AI Computer Use Agents Should Be Embarrassed
OSWorld is the closest thing AI has to a real-world driving test. Not a trick question quiz, not a cherry-picked demo. It's 369 actual desktop tasks across real operating systems, real apps, and real workflows where the agent either gets it done or it doesn't. The latest results are out, and they tell a story that a lot of AI labs would rather you not read. The human baseline on OSWorld is 72.36%. When OpenAI launched its Computer-Using Agent (CUA) with a Super Bowl-level press cycle in January 2025, it scored 38.1%. That's not a rough start. That's barely more than half of what a regular person can do. And yet companies were being sold on replacing human workflows with it. This is why benchmarks matter, and this is why the OSWorld leaderboard in 2026 is one of the most important documents in enterprise AI right now.
What OSWorld Actually Tests (And Why Most Agents Fail It)
A lot of AI benchmarks are, to be blunt, nonsense. They test trivia recall, they test reasoning on toy problems, they let models see the answer format before answering. OSWorld is different. It drops an agent into a live computer environment, gives it a natural language instruction like a real user would, and watches what happens. File management, web research, multi-app workflows, terminal commands, spreadsheet manipulation. The full messy reality of desktop work. There's no API shortcut. The agent has to look at a screen, understand context, click, type, scroll, and verify its own work. That's what makes it the gold standard for computer use AI. And that's exactly why scores are so revealing. Passing OSWorld tasks isn't just an academic win. It's evidence that an agent can actually replace human labor on real computers. Every percentage point on that leaderboard represents real tasks, real time saved, and real money.
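To make the observe, act, verify loop concrete, here is a minimal toy sketch in Python. Everything in it is hypothetical: `ToyDesktop`, `make_scripted_agent`, and `run_task` are illustrative names, not part of OSWorld or any real agent, and the "desktop" is a single text field with a Save button. The only idea it borrows from the description above is that the agent acts step by step and the task is scored pass or fail on the final state.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Action = Tuple[str, str]


@dataclass
class ToyDesktop:
    """Hypothetical stand-in for a live desktop: one text field, one Save button."""
    text: str = ""
    saved: Optional[str] = None

    def observe(self) -> dict:
        # A real agent would get a screenshot; the toy exposes state directly.
        return {"text": self.text, "saved": self.saved}

    def act(self, action: Action) -> None:
        kind, arg = action
        if kind == "type":
            self.text += arg
        elif kind == "click" and arg == "save":
            self.saved = self.text


def make_scripted_agent(target: str) -> Callable[[str, dict], Optional[Action]]:
    """Hypothetical policy: type the missing text, click Save, then stop."""
    def agent(instruction: str, obs: dict) -> Optional[Action]:
        if obs["text"] != target:
            return ("type", target[len(obs["text"]):])
        if obs["saved"] != target:
            return ("click", "save")
        return None  # the agent believes the task is done
    return agent


def run_task(env, agent, instruction: str, check, max_steps: int = 15) -> bool:
    """Run the loop under a step budget, then score the final state pass/fail."""
    for _ in range(max_steps):
        action = agent(instruction, env.observe())
        if action is None:
            break
        env.act(action)
    return bool(check(env))


if __name__ == "__main__":
    passed = run_task(
        ToyDesktop(),
        make_scripted_agent("hello"),
        "Type hello and save the file",
        check=lambda env: env.saved == "hello",
    )
    print("task passed:", passed)
```

The point of the sketch is the scoring rule: there is no partial credit for typing the right text if the file never gets saved, which is why a step budget that runs out counts as a failure.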
The 2026 Leaderboard: A Brutal Ranking
- OpenAI CUA at launch (January 2025): 38.1% on OSWorld. Announced as 'state of the art.' Was barely more than half the human baseline.
- Claude Sonnet 4.5 (September 2025): 61.4%. Real progress, finally. But still 11 points short of what a regular human scores.
- Simular Agent S2 (December 2025): 72.6%. First agent to cross the human baseline of 72.36%. Legitimately impressive.
- Claude Sonnet 4.6 (February 2026): 72.5% on OSWorld-Verified. Matches Simular almost exactly. Two agents, both just barely human-level.
- Coasty: 82% on OSWorld. Nearly 10 points clear of the next best agent. The only system that isn't just matching humans but outrunning them.
- For context: manual data entry alone costs U.S. companies $28,500 per employee per year. At 38%, OpenAI CUA was never going to justify that ROI.
- 56% of employees report burnout from repetitive computer tasks. The benchmark gap isn't just academic. It's why automation keeps failing people.
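The gaps quoted in the list are easy to check. The short Python sketch below uses only the scores stated above and computes, for each agent, the gap to the 72.36% human baseline in percentage points and the score as a share of that baseline. The function name `compare_to_human` is just an illustrative helper, and the sketch treats the OSWorld-Verified figure as comparable to the others, which is an assumption.

```python
# Scores as quoted in the leaderboard list (percent of tasks completed).
HUMAN_BASELINE = 72.36

SCORES = {
    "OpenAI CUA (launch)": 38.1,
    "Claude Sonnet 4.5": 61.4,
    "Simular Agent S2": 72.6,
    "Claude Sonnet 4.6": 72.5,  # reported on OSWorld-Verified; treated as comparable here
    "Coasty": 82.0,
}


def compare_to_human(score, baseline=HUMAN_BASELINE):
    """Return (gap in percentage points, score as a percent of the baseline)."""
    gap = round(score - baseline, 2)
    share = round(100 * score / baseline, 1)
    return gap, share


if __name__ == "__main__":
    for name, score in SCORES.items():
        gap, share = compare_to_human(score)
        print(f"{name:<22} {score:5.1f}%  gap {gap:+6.2f} pts  {share:5.1f}% of human")
```

Run as written, this puts the launch-era CUA at roughly 53% of the human baseline and Coasty at roughly 113% of it, with the two 72.x agents within a quarter of a point of human level.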
OpenAI launched its computer use agent to global headlines and a Super Bowl ad budget. It scored 38.1% on OSWorld. The human baseline is 72.36%. Coasty scores 82%. That's not a gap. That's a different category entirely.
Why Everyone Was Lying to You About Computer Use Readiness
Here's what actually happened in 2025. A wave of AI labs, including some very well-funded ones, shipped computer use products with marketing that implied they were ready for real work. The demos looked great. Controlled environments, simple tasks, ideal conditions. Then real users tried to automate their actual workflows and hit walls constantly. Agents clicking the wrong button. Agents getting stuck in loops. Agents confidently completing the wrong task entirely. The OSWorld scores explain all of it. An agent scoring 38% or even 61% on a standardized benchmark isn't ready to handle the unpredictable, multi-step, context-heavy work that fills a real knowledge worker's day. At 38%, it fails more often than it succeeds. At 61%, it still fails roughly four tasks in ten. And when you're paying for automation that fails that often, you haven't automated anything. You've just added a new source of errors to babysit. The benchmark isn't pessimistic. It's honest. And the honest truth is that most computer-using AI products shipped in 2025 were not ready for unsupervised production use.
The Simular Moment and Why It Matters
Give credit where it's due. When Simular's Agent S2 crossed the 72.36% human baseline in December 2025, it was a genuine milestone. Eleven months earlier, the score being announced as state of the art on OSWorld was 38.1%. The field moved fast. Crossing human-level on a rigorous desktop benchmark is not a small thing, and the researchers who built Agent S2 earned that moment. But here's the thing about benchmarks. Crossing human-level isn't the finish line. It's the starting gun for serious enterprise deployment. A computer use agent that performs at human-level is interesting. An agent that performs significantly above human-level is the one that actually changes how companies operate. That's why the jump from 72% to 82% matters more than it looks on paper. It's the difference between 'as good as a person' and 'better than a person, and available 24 hours a day, running in parallel, without taking breaks.' That's where the real automation ROI lives.
Why Coasty Exists
I'm not going to pretend I stumbled onto Coasty by accident. I was looking for a computer use agent that could handle real work without me hovering over it like a nervous parent. The OSWorld score is 82%. That's not a marketing claim, it's a verifiable benchmark result, higher than every competitor including Claude, GPT-based agents, and UiPath. But the score is almost secondary to what it means in practice. Coasty controls actual desktops, real browsers, and terminals. Not API wrappers, not simulated environments. Real computer use. It runs as a desktop app and on cloud VMs, and it supports agent swarms for parallel execution, meaning you can run multiple tasks simultaneously instead of waiting for one to finish before starting the next. There's a free tier, BYOK support, and it doesn't require an engineering team to set up. The reason Coasty scores 82% on OSWorld while the next-best agents sit around 72% isn't magic. It's that the system was built specifically to handle the messy, unpredictable reality of actual computer environments, not just clean benchmark conditions. That's exactly what OSWorld is designed to test. And that's exactly why the score holds up.
Here's my take, and I'll be direct about it. We're at the point in AI computer use where the benchmark scores actually tell you something real. The gap between a 38% agent and an 82% agent isn't a nuance. It's the difference between a tool that wastes your time and a tool that genuinely replaces repetitive human labor. Forty percent of workers spend at least a quarter of their week on manual, repetitive computer tasks. Manual data entry costs companies $28,500 per employee per year. The technology to fix this exists right now, today, and it scores 82% on the hardest benchmark in the field. If you're still evaluating computer use agents based on demo videos and press releases, you're making a very expensive mistake. Look at the OSWorld scores. Compare them honestly. Then go try the one that's actually winning. That's coasty.ai. The benchmark doesn't lie.