The OSWorld Benchmark Results Are In and Most AI Computer Use Agents Should Be Embarrassed
OpenAI has been telling anyone who'll listen that its Computer-Using Agent is the future of work. The OSWorld benchmark just told a different story. OpenAI CUA scored 38.1% on OSWorld, the gold-standard test for real-world computer use tasks. That means it fails on nearly two thirds of the things a competent human does every day at a computer. Anthropic's Claude Sonnet 4.5, which Anthropic hyped as 'a significant leap forward' on computer use, came in at 61.4%. Respectable, sure. But Coasty is at 82%. That's not a minor edge. That's a different category entirely. So let's talk about what these numbers actually mean, why most vendors are quietly hoping you don't look too closely at the leaderboard, and why the OSWorld benchmark is the only score that matters right now.
What OSWorld Actually Tests (And Why It's So Hard to Fake)
OSWorld isn't a multiple-choice quiz. It's not a curated set of easy wins designed to make a press release look good. It's 369 real computer tasks across real software: LibreOffice, Chrome, VS Code, file managers, terminals, multi-app workflows. The agent has to actually control a desktop environment, see what's on screen, decide what to click, type, drag, or execute, and complete the task correctly from start to finish. No shortcuts. No API cheats. No 'close enough.' Either the spreadsheet formula is right or it isn't. Either the file got moved to the right folder or it didn't. The human baseline on OSWorld sits at 72.36%. That number is important. It's the bar that separates 'impressive demo' from 'actually useful in the real world.' For years, every AI computer use agent on the planet was comfortably below it. Simular's Agent S3 crossed it in December 2025, by a razor-thin 0.24 percentage points. Then Coasty came in and didn't just cross the line. It blew past it by nearly 10 full points. That's not incremental progress. That's a different class of agent.
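To make that concrete, here's a minimal sketch of what an OSWorld-style task loop looks like. Every name in it (run_task, env.step, task.evaluate) is an illustrative stand-in, not OSWorld's actual API; the point is the shape: the agent sees a screen, emits raw desktop actions, and the final machine state gets scored pass or fail.

```python
# Minimal sketch of an OSWorld-style evaluation loop.
# All names here are illustrative assumptions, not OSWorld's real API.

def run_task(env, agent, task, max_steps=50):
    """Drive one GUI task end to end, then score the final state."""
    obs = env.reset(task)               # boot the VM into the task's starting state
    for _ in range(max_steps):
        # The agent only gets what a human gets: a screenshot (optionally
        # plus the accessibility tree). It must return a concrete desktop
        # action, e.g. a click at (x, y) or typing "=SUM(A1:A9)".
        action = agent.act(obs)
        obs, done = env.step(action)
        if done:                        # agent declares the task finished
            break
    # Scoring is binary and inspects the final machine state, not the
    # transcript: is the formula in the cell? Did the file actually move?
    return task.evaluate(env)           # True = success, False = failure
```

That final-state check is why the benchmark is so hard to game: there's no partial credit, and a slick-looking action transcript counts for nothing if the spreadsheet is still wrong.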
The Full Leaderboard, Ranked From Honest to Humbling
- Coasty: 82% on OSWorld. Current #1. Nearly 10 points ahead of every other agent, including those built on GPT-5 and Claude.
- Simular Agent S3: 72.6%. First agent to technically beat the human baseline of 72.36%, by 0.24 points. Impressive milestone, still 9.4 points behind Coasty.
- Claude Sonnet 4.5 (Anthropic): 61.4% on OSWorld-Verified. Anthropic called this 'a significant leap forward.' It's still failing on 38.6% of tasks.
- OpenAI CUA (powers Operator): 38.1%. This is the agent OpenAI shipped to paying customers. It fails on 61.9% of real computer tasks.
- Early Claude Computer Use (2024): 22%. Anthropic's original computer use launch score. They've improved a lot. They're still not leading.
- Human baseline: 72.36%. The score every vendor should be chasing before they start running ads about replacing human workers.
OpenAI's CUA, the engine behind the Operator product it's charging people for, scores 38.1% on OSWorld. Coasty scores 82%. That's not a product comparison. That's two different eras of technology.
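If you'd rather sanity-check those margins than take my word for them, the arithmetic fits in a few lines of Python, using the scores quoted in the leaderboard above:

```python
# Recompute each agent's margin against the 72.36% human baseline,
# using the scores from the leaderboard above.
HUMAN_BASELINE = 72.36

scores = {
    "Coasty": 82.0,
    "Simular Agent S3": 72.6,
    "Claude Sonnet 4.5": 61.4,
    "OpenAI CUA (Operator)": 38.1,
    "Claude Computer Use (2024)": 22.0,
}

for agent, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    margin = score - HUMAN_BASELINE
    verdict = "above" if margin > 0 else "below"
    print(f"{agent:28s} {score:5.1f}%  ({abs(margin):5.2f} pts {verdict} baseline)")
```

Run it and only two rows come out above the line: Simular by 0.24 points and Coasty by 9.64, the 'nearly 10 full points' from earlier. Everything else, including the agents powering shipping products, is still underwater.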
Why Vendors Hate Talking About OSWorld Scores
Here's what's happening behind the scenes. Most AI companies announcing 'computer use' features are benchmarking on their own internal evals, cherry-picked demos, or narrow web-only tasks. OSWorld is uncomfortable because it covers the full stack: GUI navigation, terminal commands, file operations, cross-app workflows, and tasks that require multiple sequential decisions without any human correction. It's the difference between a driving test on a closed course and one on an actual highway. When Anthropic launched computer use in late 2024, the 22% OSWorld score got quietly buried under a wave of impressive-looking GIFs. When OpenAI launched Operator in January 2025, the 38.1% score was mentioned once in a technical footnote. Neither company is lying. They're just hoping you're too busy watching the demo to check the leaderboard. The vendors who are winning on OSWorld are talking about it loudly. The ones who aren't are talking about 'responsible deployment' and 'iterative improvement.' You can figure out which category each one falls into.
The Human Baseline Drama Nobody Is Talking About Enough
The 72.36% human baseline number deserves more attention than it gets. Think about what it means. A random human completing OSWorld tasks doesn't score 100%. They score 72.36%. That's because some tasks are genuinely ambiguous, some software behaves unexpectedly, and real computer work is messy. So when an AI agent scores below 72%, it's not just 'not as good as a human.' It's not even as reliable as a person who has never used the specific software before. For most of 2024 and well into 2025, every single AI computer use agent on the market was in that bucket. Companies were selling 'AI automation' products that, by the only objective measure we have, were less capable than handing the task to a new hire on their first day. Simular crossed the human line in December 2025. Coasty crossed it and kept going, reaching 82% in early 2026. That gap of nearly 10 points above the human baseline isn't just a benchmark win. It means the agent is completing tasks that confuse humans, handling edge cases that trip up real people, and doing it faster and without complaining about the workload.
Why Coasty Exists and Why 82% Is the Number That Actually Matters
I've used a lot of these tools. The honest truth is that most 'computer use' products are wrappers around a vision model that can click buttons when the UI is clean and the task is simple. They fall apart the moment anything unexpected happens: a popup appears, a file is in the wrong place, or a workflow spans more than two applications. Coasty was built specifically to solve that problem. It controls real desktops, real browsers, and real terminals, not sanitized API environments. The 82% OSWorld score isn't a marketing number. It's a verified result on the hardest, messiest, most realistic computer use benchmark that exists. That score means Coasty handles the multi-step chaos of actual work: opening a spreadsheet, pulling data from a web source, running a terminal command, formatting a report, and sending it off, all without you babysitting every step. It runs on a desktop app, on cloud VMs, and in agent swarms for parallel execution when you need to scale. There's a free tier if you want to see it before you commit, and BYOK support if you're already paying for your own model access. The benchmark score is the proof. The product is what you do with it.
The OSWorld leaderboard is the most honest document in AI right now. It doesn't care about your funding round, your press release, or how good your demo looked at a conference. It asks one question: can your agent actually use a computer? Most of them can't, not reliably, not above the human baseline, not in the real messy conditions where work actually happens. Coasty can. 82% is the number. It's the highest score any computer use agent has posted on this benchmark, full stop. If you're evaluating AI agents for real work, that's where the conversation should start and, honestly, end. Stop watching demos. Check the leaderboard. Then go try Coasty at coasty.ai and see what a computer use agent that actually works feels like.