The OSWorld Benchmark Results Are In, and Most AI Agents Should Be Embarrassed
Every AI company on earth is telling you their agent can use a computer. Most of them are lying, at least partially. The OSWorld benchmark exists for exactly this reason: to cut through the demo videos and the press releases and find out which computer use agents can actually sit down at a real desktop and get work done. The latest results are out, and the spread between the best and the rest is so wide it should make you angry if you've been paying for the wrong tool. Coasty sits at 82% on OSWorld. Claude Opus 4.6, Anthropic's best model, just hit 72.7%. That's a 9-point gap. In benchmark terms, that's a canyon.
What OSWorld Actually Tests (And Why It's So Hard to Fake)
OSWorld isn't a multiple-choice quiz. It's not a coding challenge where you spit out Python and call it a day. It's a real operating system environment where an AI agent has to complete genuine computer tasks: navigating GUIs, filling out forms, moving files, using spreadsheets, writing emails, switching between apps. The kind of stuff your employees do every single day. The benchmark launched with 369 real-world tasks, built on an environment that spans Ubuntu, Windows, and macOS. Then in July 2025, the XLANG Lab released OSWorld-Verified, a tighter, noise-reduced version designed to stop agents from gaming the evaluation. Scores shifted. Some models looked worse overnight. The ones that were actually good stayed good. That's the point. When the researchers tightened the screws, the posers got exposed. Early human performance on OSWorld sits around 72 to 74%, depending on the task category. Read that again. Until very recently, no AI agent could match a regular human doing regular computer work. That's how hard this benchmark is. And that's why an 82% score isn't just impressive. It's genuinely shocking.
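To make the "hard to fake" part concrete, here is a minimal sketch of how an OSWorld-style task pairs a plain-language instruction with a programmatic check of the machine's actual state after the agent finishes. The field names and the checker below are illustrative simplifications, not the real OSWorld schema; the point is that scoring inspects the filesystem, not the agent's claims about what it did.

```python
from pathlib import Path

# Hypothetical task definition: instruction for the agent, plus an
# execution-based evaluator that checks the resulting machine state.
# Field names here are illustrative, not the actual OSWorld schema.
task = {
    "instruction": (
        "Export the open LibreOffice Calc sheet as budget.csv in the Documents folder."
    ),
    "evaluator": {
        "type": "file_exists_and_matches",
        "path": "~/Documents/budget.csv",
        "expected_header": "Category,Q1,Q2,Q3,Q4",
    },
}

def evaluate(task: dict) -> bool:
    """Score the task by inspecting the real filesystem, not the agent's transcript."""
    spec = task["evaluator"]
    target = Path(spec["path"]).expanduser()
    if not target.exists():
        return False
    lines = target.read_text().splitlines()
    return bool(lines) and lines[0].strip() == spec["expected_header"]

print("pass" if evaluate(task) else "fail")
```

An agent can't talk its way to credit under this kind of check: either the file exists with the right contents or it doesn't.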
The Full Leaderboard, Ranked Without the Spin
- Coasty: 82% on OSWorld. Highest publicly verified score. Not close.
- Claude Opus 4.6 (Anthropic): 72.7% on OSWorld-Verified. Anthropic's flagship. Good, but more than 9 points behind.
- Claude Sonnet 4.6 (Anthropic): 72.5% on OSWorld-Verified. Nearly identical to Opus at a fraction of the cost, which tells you something about how Anthropic is optimizing.
- Claude Sonnet 4.5 (Anthropic): 61.4% on OSWorld. An 11-point jump to Sonnet 4.6 in just one model generation, which is either impressive progress or a sign of how much runway was left on the table.
- GPT-5.4 (OpenAI): LinkedIn posts claim it 'surpasses human performance on OSWorld' as of March 2026. No official verified score published yet. Until the number is on the official leaderboard, it's marketing.
- OpenAI Operator / ChatGPT Agent: Launched with fanfare in January 2025, still has no credible OSWorld score to point to. A July 2025 review called it 'unfinished, unsuccessful, and unsafe.' That review was from someone who wanted it to work.
- Most other agents: Nowhere near human-level performance. Many haven't even submitted to the official benchmark.
Manual data entry alone costs U.S. companies $28,500 per employee per year in lost productivity. Meanwhile, most AI agents still can't reliably clear the roughly 73% human baseline on basic computer tasks. You're bleeding money from both ends.
Why a 9-Point Gap Is Actually Enormous
People see 82% vs 72.7% and think 'close enough.' It's not. Think about it in terms of work. If you give an AI agent 100 real computer tasks, the difference between 82% and 72.7% is roughly 9 extra tasks completed correctly. At scale, across a team, across a week, that gap compounds into hours of human intervention, error correction, and babysitting. The whole point of a computer use agent is that you don't have to watch it. Every failed task is a task a human has to redo. And the tasks that fail aren't random: they tend to be the complex, multi-step ones that take the most time. That's where the score gap hurts the most. Anthropic has made genuinely impressive progress. Going from 61.4% with Sonnet 4.5 to 72.7% with Opus 4.6 in a few months is real. Nobody should dismiss that. But 'impressive progress' and 'best in class' are two different things, and right now they're not the same agent.
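Here is the back-of-the-envelope math on that compounding. The weekly task volume, the minutes-per-redo, and the assumption that each step of a workflow succeeds at the headline benchmark rate are all illustrative assumptions, not figures from OSWorld or any vendor.

```python
# Back-of-the-envelope math on what a 9.3-point gap costs in practice.
# Weekly task volume and minutes-per-redo are illustrative assumptions,
# not figures from OSWorld or any vendor.
tasks_per_week = 500          # assumed automation volume for one team
minutes_per_failed_task = 12  # assumed time for a human to notice, redo, and verify

for name, success_rate in [("82.0% agent", 0.820), ("72.7% agent", 0.727)]:
    failed = tasks_per_week * (1 - success_rate)
    cleanup_hours = failed * minutes_per_failed_task / 60
    print(f"{name}: {failed:.0f} failed tasks/week, ~{cleanup_hours:.1f} hours of human cleanup")

# Multi-step workflows widen the gap further: if each of n steps succeeds
# at the headline rate (an assumption, not a benchmark result), the whole
# chain only completes with probability rate ** n.
for n in (1, 3, 5):
    print(f"{n}-step chain: 0.820**{n} = {0.820**n:.2f} vs 0.727**{n} = {0.727**n:.2f}")
```

Under those assumptions the 72.7% agent hands back roughly nine more hours of cleanup per week than the 82% agent, and on a five-step workflow its end-to-end completion rate drops near 20% while the 82% agent stays around 37%.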
OpenAI's Computer Use Problem Is Getting Embarrassing
OpenAI launched Operator in January 2025 with the kind of hype that makes tech journalists forget to ask hard questions. More than a year later, reviewers are still writing things like 'unfinished, unsuccessful, and unsafe.' One independent analysis from July 2025 noted that Anthropic's Computer Use was on the market a full year before Operator shipped, and Operator still doesn't work reliably. Now OpenAI has folded Operator into ChatGPT Agent and is making noise about GPT-5.4 beating human performance on OSWorld. Maybe. Show me the verified score on the official leaderboard and I'll update my opinion. Until then, this is the same pattern we've seen for 18 months: a flashy announcement, a cherry-picked demo, and real users left wondering why it can't complete a simple workflow without asking for confirmation every 30 seconds. The computer use space deserves better than vaporware benchmarks.
Why Coasty Exists and Why 82% Matters in the Real World
Coasty wasn't built to win a benchmark. It was built because the benchmark reflects real work, and real work is what companies are drowning in. Over 40% of workers spend at least a quarter of their week on manual, repetitive tasks. More than half of employees doing repetitive data work report burnout. The cost isn't abstract. It's $28,500 per employee per year, according to a 2025 Parseur study. Coasty's 82% on OSWorld means it completed 82% of the benchmark's real computer tasks without a human stepping in, and those tasks look a lot like the ones you'd throw at it. It controls actual desktops, real browsers, and live terminals. Not API wrappers. Not simulated environments. Real computer use. You can run it as a desktop app, spin up cloud VMs, or run agent swarms for parallel execution when you need to process a hundred things at once. There's a free tier if you want to test it before committing. BYOK is supported if you want to bring your own model keys. The 82% score isn't a marketing number. It's on the OSWorld leaderboard. Go look it up.
The OSWorld-Verified Controversy Nobody Is Talking About
Here's the thing about benchmark scores that most coverage skips entirely: the rules changed in July 2025. OSWorld-Verified tightened the evaluation protocol to reduce noise and stop agents from getting credit for tasks they only sort of completed. When this happened, scores dropped across the board for most models. The agents that were genuinely good held their numbers. The ones that were inflated by loose evaluation criteria took a hit. This is actually great news for anyone trying to make a real purchasing decision. It means the current leaderboard is more trustworthy than it was a year ago. It also means any score from before the OSWorld-Verified update should be taken with a grain of salt. If a company is citing an old OSWorld number to sell you their computer use agent, ask them for their OSWorld-Verified score. If they can't give you one, that tells you everything.
Here's where I land on all of this. The OSWorld benchmark is the most honest thing we have right now for evaluating computer use agents. It's not perfect, but it's real tasks on real operating systems with verified evaluation. The results show a clear hierarchy: Coasty is leading at 82%, Anthropic is genuinely competitive and improving fast, and OpenAI is still making promises it hasn't fully kept. If you're a business still paying people to do repetitive computer work in 2026, you're not waiting for the technology to get good enough. It already is. You're just using the wrong tool. And the gap between the best computer use agent and the rest isn't going to close fast enough for waiting to pay off this quarter. Stop waiting for everyone to catch up. Go to coasty.ai, run the free tier, and see what 82% actually feels like on your real workflows. The benchmark number is just a number until it's saving your team 10 hours a week.