The OSWorld Benchmark Results Are In, and Most Computer Use Agents Should Be Embarrassed
A year ago, the best AI agent in the world completed about 20% of real computer tasks correctly. That number felt embarrassing then. What's more embarrassing is that most of the tools being actively sold to businesses right now still can't reliably beat a human on the same test. OSWorld, the gold-standard benchmark for computer use agents, measures performance on 369 real desktop tasks across web browsers, file systems, and multi-app workflows. It's not a trick. It's not a cherry-picked demo. It's the closest thing we have to a fair fight between AI and the person currently doing your data entry. The results, as of early 2026, tell a story that the AI hype machine really doesn't want you to hear.
What OSWorld Actually Tests (And Why It's Hard to Fake)
OSWorld isn't a multiple-choice quiz. It's 369 tasks that mirror what a real office worker does all day: move files, fill out web forms, navigate spreadsheets, switch between apps, and complete multi-step workflows without someone holding their hand. The benchmark was introduced at NeurIPS 2024 and quickly became the standard because it's brutally hard to game. You either complete the task or you don't. The human baseline sits at 72.36%. That number is important. It's not some superhuman ceiling. It's what a normal person scores when asked to do normal computer work. For a long time, every single AI agent on the planet was below it. Not barely below. Embarrassingly below. We're talking 20%, 30%, 40% scores from systems that companies were charging enterprise prices for. The gap between the benchmark score and the sales pitch was, and in many cases still is, enormous.
The Leaderboard Right Now: Who's Winning and Who's Still Pretending
- Coasty: 82% on OSWorld. The only computer use agent currently above 80%. That's not a rounding error, that's a category lead.
- Surfer 2 (H Company): 77.0% with pass@10. Solid result, but pass@10 means it gets 10 attempts; see the short sketch after this list. Real work doesn't give you 10 attempts.
- Simular Agent S2: 72.6%. Made headlines in December 2025 for 'beating humans.' It beat the human baseline by 0.24 percentage points. That's the margin people celebrated.
- Claude Sonnet 4.5 (Anthropic): 61.4%. Anthropic called it 'a significant leap forward.' It's still 11 points below the human baseline on a test of basic computer tasks.
- Most enterprise RPA tools and older agents: Not even submitting scores publicly anymore, which tells you everything.
- The gap from 20% to 82% happened in roughly 18 months. The gap from 61% to 82% is where the real separation lives.
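About that pass@10 caveat: here's a minimal sketch of why multiple attempts inflate a headline number. It assumes each attempt succeeds independently at a fixed per-attempt rate, which is a simplification, and the per-attempt probabilities below are made up for illustration; no agent's actual per-attempt figure is being reported here.

```python
# Minimal sketch: how pass@k flatters a score, assuming independent attempts
# with a fixed per-attempt success probability p (illustrative values only).

def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k

# An agent that solves a task 30% of the time per try still "passes" ~97%
# of tasks when it gets 10 tries.
for p in (0.2, 0.3, 0.5):
    print(f"per-attempt {p:.0%} -> pass@10 {pass_at_k(p, 10):.1%}")
```

That's the difference between 'does the task' and 'eventually does the task if you let it keep retrying,' which is why single-attempt scores are the ones worth comparing.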
Manual data entry costs U.S. companies an average of $28,500 per employee per year, according to a July 2025 Parseur and QuestionPro survey of 500 professionals. Over 40% of workers spend at least a quarter of their entire work week on manual, repetitive tasks. That's not a productivity problem. That's a math problem. And most of the AI agents being sold as the solution can't even beat a human on a controlled benchmark.
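To put rough numbers on that math problem, here's a back-of-the-envelope calculation using the survey figures above. The 40-hour week, 52-week year, and team sizes are assumptions for illustration, not part of the survey.

```python
# Back-of-the-envelope arithmetic on the survey figures above.
COST_PER_EMPLOYEE = 28_500   # USD per employee per year on manual data entry
HOURS_PER_YEAR = 40 * 52     # assumed full-time work year (not from the survey)
REPETITIVE_SHARE = 0.25      # "at least a quarter" of the work week

hours_lost = HOURS_PER_YEAR * REPETITIVE_SHARE
print(f"repetitive work per affected employee: >= {hours_lost:.0f} hours/year")

for team_size in (10, 50, 250):  # hypothetical team sizes
    print(f"{team_size:>4} employees -> ${team_size * COST_PER_EMPLOYEE:,} per year on manual entry")
```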
Anthropic and OpenAI Are Losing This Race and Acting Like They're Not
Let's be direct. Anthropic has been marketing Claude's computer use capabilities since late 2024. Claude Sonnet 4.5 scores 61.4% on OSWorld. Claude Sonnet 4.6 improved things further, and Anthropic's own system card acknowledges the 'steep upward trend,' which is a polite way of saying they were starting from a low floor. Meanwhile, OpenAI launched Operator in January 2025 with enormous fanfare. By July 2025, independent reviewers at Understanding AI published a piece with the headline: 'ChatGPT Agent: a big improvement but still not very useful.' Another reviewer called it 'unfinished, unsuccessful, and unsafe.' These aren't fringe takes. These are people who wanted the tools to work and were disappointed. The computer use benchmark scores back them up. When you're scoring in the low-to-mid 60s on a test where the average human scores 72, you're not ready to replace anyone's workflow. You're a demo.
Why Benchmark Scores Actually Matter for Real Businesses
I've heard the counterargument: 'Benchmarks don't reflect real work.' That's partially true for language tasks. It's much less true for computer use. OSWorld tasks are real apps, real interfaces, real failure modes. When an agent fails a file management task in OSWorld, it fails the same way it'll fail when your accounts payable person is out sick and you need it to process invoices. The 10-point gap between 72% and 82% sounds academic until you realize it means roughly 37 more tasks completed correctly out of 369. At scale, across a team, across a year, that difference is the difference between an agent you can trust with real work and one you have to babysit. And babysitting an AI agent is somehow more exhausting than just doing the task yourself, because at least you know when you've made a mistake. A 61% agent confidently fails 39% of the time. That's a liability, not an asset.
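If you want to sanity-check that conversion, here's the arithmetic spelled out, using the rounded figures from the paragraph above (the exact 72.36% baseline shifts the result by about one task).

```python
# Converting OSWorld percentages into task counts on the 369-task suite.
TOTAL_TASKS = 369

human_tasks = 0.72 * TOTAL_TASKS   # ~266 tasks
agent_tasks = 0.82 * TOTAL_TASKS   # ~303 tasks

print(f"human baseline (~72%): ~{human_tasks:.0f} tasks completed")
print(f"82% agent:             ~{agent_tasks:.0f} tasks completed")
print(f"difference:            ~{agent_tasks - human_tasks:.0f} tasks")
```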
Why Coasty Exists and Why the Score Gap Isn't an Accident
Coasty was built around one obsession: making a computer use agent that actually works on real desktops, real browsers, and real terminals. Not API wrappers. Not sandboxed toy environments. Actual computer use, the kind where the agent sees the screen, moves the mouse, types in fields, and handles whatever weird edge case your legacy software throws at it. The 82% OSWorld score isn't a marketing stat. It's the result of building an agent that can handle multi-app workflows, recover from errors, and operate at the speed of a fast human. The desktop app runs locally. Cloud VMs are available for teams that need scale. Agent swarms let you run parallel tasks so you're not waiting on a sequential bottleneck. There's a free tier if you want to see for yourself before committing. BYOK support if you want to bring your own model keys. The architecture is built for the 82% case, not the 61% demo case. That gap matters because the tasks that fall in the difference between those two numbers are usually the ones that matter most, the edge cases, the multi-step workflows, the things that break when an agent gets confused halfway through.
Here's the honest take on the OSWorld results: we're at an inflection point. Eighteen months ago, AI computer use was a party trick. Today, one agent is genuinely above human baseline performance, and the rest of the field is scrambling to catch up. The companies still paying people $28,500 a year to copy-paste data between apps aren't waiting for AI to get better. They're waiting for someone to tell them the right tool exists. It does. The benchmark proves it. If you want to see what 82% on OSWorld looks like in practice on your actual workflows, go to coasty.ai and try it. Don't take my word for it. The leaderboard doesn't lie.