
The OSWorld Benchmark Is Exposing Who's Actually Building Real Computer Use AI (And Who's Faking It)

Sophia Martinez | 7 min

In 2024, researchers released OSWorld, a benchmark that asks AI agents to do real computer tasks on a real desktop. Spreadsheets, terminals, browsers, file management. Actual work. The best AI model at launch scored around 7%. A distracted intern with no training could beat that. Humans, by comparison, scored 72%. The AI community collectively had an awkward moment and moved on. Fast forward to right now, and the number one computer use AI agent sits at 82%, surpassing human performance on the same benchmark. That jump is one of the most dramatic in benchmark history. But here's what nobody's talking about loudly enough: most of the field is still embarrassingly far behind, the scoring gaps between competitors are enormous, and a lot of companies are quietly hoping you don't look too closely at the leaderboard.

What OSWorld Actually Tests (And Why It's So Hard to Fake)

OSWorld isn't a multiple-choice quiz. It's not a coding autocomplete test. It's 369 real tasks inside real computer environments: editing a spreadsheet formula, navigating a Linux terminal, dragging files between folders, filling out a web form without breaking anything. The agent has to see the screen, decide what to click, type, scroll, or drag, and actually complete the job. No shortcuts, no API hacks, no pretending. This is why it matters so much as a computer use benchmark. You can't prompt-engineer your way to a good score. Either your agent can use a computer or it can't. That's it. The benchmark was published at NeurIPS 2024 and immediately became the standard test for anyone claiming to build a serious computer-using AI. The human baseline sits at 72.36%, which means any agent above that number isn't just matching humans, it's beating them. That threshold is the line between a demo and a product.
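To make that see-decide-act loop concrete, here is a minimal Python sketch. It is not OSWorld's actual harness or any vendor's agent: `Action`, `run_episode`, `ToyDesktop`, and `scripted_policy` are hypothetical names invented for illustration, and the "desktop" is a single text field. The only thing it tries to mirror is the shape of the test: the agent acts one step at a time, and the job only counts if the end state is right.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """Hypothetical action record: kind is 'type' or 'done' in this toy."""
    kind: str
    payload: dict = field(default_factory=dict)


def run_episode(observe, decide, execute, check, max_steps=15):
    """Observe the screen, pick an action, apply it, repeat; then score the end state."""
    history = []
    for _ in range(max_steps):
        observation = observe()                 # what the agent "sees"
        action = decide(observation, history)   # the model's choice
        if action.kind == "done":
            break
        execute(action)                         # the click / keystroke equivalent
        history.append(action)
    return check()                              # success depends on the final state


class ToyDesktop:
    """Stand-in for a real desktop: one text field and nothing else."""

    def __init__(self):
        self.text_field = ""

    def observe(self):
        return f"text_field={self.text_field!r}"

    def execute(self, action):
        if action.kind == "type":
            self.text_field += action.payload["text"]

    def check(self):
        # Illustrative task: the field must contain exactly "42".
        return self.text_field == "42"


def scripted_policy(observation, history):
    """Hard-coded 'agent': type once, then declare the task finished."""
    if not history:
        return Action("type", {"text": "42"})
    return Action("done")


desktop = ToyDesktop()
success = run_episode(desktop.observe, scripted_policy, desktop.execute, desktop.check)
print(success)
```

A policy that declares "done" without touching the field fails the same check, which is the point: in this setup there is nothing to score except what actually happened on the machine.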

The Scoreboard Is a Bloodbath for Most Competitors

  • Early 2024: Best AI agents scored roughly 7% on OSWorld. Human score: 72%. That's not a gap, that's a canyon.
  • Claude Sonnet 4 (mid-2025): Anthropic was bragging about their computer use capabilities. Their score? Around 38-39% on OSWorld. Less than half the human baseline.
  • Claude Sonnet 4.5 (September 2025): Anthropic pushed hard and reached 61.4%. Real progress, but still 10+ points below human performance.
  • OpenAI's CUA (Computer-Using Agent): Scored 38.1% on OSWorld when it launched. OpenAI called it a breakthrough. The benchmark said otherwise.
  • Simular Agent S2 (mid-2025): Hit new state-of-the-art results, pushing into the 70s, which was genuinely impressive at the time.
  • Coasty (2026): 82% on OSWorld Verified. That's above human baseline. That's above every competitor. That's the number that ends the argument.
  • The gap between first and second place on the leaderboard right now is not a rounding error. It's a different category of product entirely.
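If you want to check the gaps yourself, the arithmetic is short. The sketch below uses only figures quoted in this article (it skips the entries given as ranges or approximations), and the names `scores`, `HUMAN_BASELINE`, and `gap_to_human` are made up for the example. It also assumes the scores are directly comparable, which is how the list above treats them.

```python
# Scores as quoted in this article, in percent. Approximate entries are omitted.
HUMAN_BASELINE = 72.36

scores = {
    "OpenAI CUA": 38.1,
    "Claude Sonnet 4.5": 61.4,
    "Coasty": 82.0,
}


def gap_to_human(score, baseline=HUMAN_BASELINE):
    """Percentage points above (+) or below (-) the human baseline."""
    return round(score - baseline, 2)


gaps = {name: gap_to_human(score) for name, score in scores.items()}

# Print from furthest below the baseline to furthest above it.
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:<20} {gap:+.2f}")

# Distance between the top score and Claude Sonnet 4.5.
lead = round(scores["Coasty"] - scores["Claude Sonnet 4.5"], 1)
print(f"lead over Claude Sonnet 4.5: {lead} points")
```

With these inputs, Claude Sonnet 4.5 lands 10.96 points under the human baseline and Coasty 9.64 above it. The lead over Sonnet 4.5 comes out at 20.6 points, or about 21 if you round the scores first.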

Anthropic took Claude from 38% to 61% on OSWorld in four months and called it a 'significant leap.' Coasty is sitting at 82%. There's a 21-point gap between the loudest voice in computer use AI and the actual leader. That's not a benchmark difference. That's a product difference.

Why Companies Keep Overhyping Their Computer Use Scores

Here's the thing about the AI benchmark wars. Every company picks the metric that makes it look best. OpenAI launched Operator in January 2025 with enormous fanfare. Its OSWorld score was 38.1%. Meanwhile, the company was running WebVoyager numbers alongside it because those looked better. Anthropic did the same thing, emphasizing SWE-bench results when its computer use numbers were weak, then pivoting to celebrate OSWorld progress once Claude Sonnet 4.5 hit 61%. This is the benchmark shell game. You announce the number that sounds impressive and bury the ones that don't. OSWorld is hard to spin because it's testing the one thing everyone claims to do: use a computer autonomously, on real tasks, without hand-holding. When J.P. Morgan's 2026 Outlook report noted that leading agentic frameworks were sitting at 61% on computer use benchmarks while the human baseline was 72%, that was a polite way of saying most of these agents aren't ready for real work yet. The gap between a well-funded press release and an agent that can actually do your job is still enormous for most players in this space.

What an 82% Score Actually Means in the Real World

People get lost in percentages. Let me make this concrete. OSWorld's 369 tasks cover the kinds of things knowledge workers do every day: opening applications and navigating menus, editing documents and spreadsheets, managing files across directories, running terminal commands, filling web forms, and switching between apps to complete multi-step workflows. An agent scoring 38% is failing on nearly two-thirds of those tasks. You can't deploy that in a real business environment without constant babysitting, and at that point you're not saving time, you're adding a new layer of management overhead. An agent at 61% is getting there, but still dropping the ball on four out of ten tasks. For anything mission-critical, that failure rate is a dealbreaker. An agent at 82% is completing more tasks successfully than the average human doing the same work. That's the threshold where computer use AI stops being a toy and starts being infrastructure. It's the difference between a cool demo you show at all-hands and a tool that actually changes your headcount math.
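Here is the same point as task counts instead of percentages. This is a back-of-the-envelope sketch, assuming every one of the 369 tasks counts equally and that each score applies to that same task set; `tasks_breakdown` is a hypothetical helper, not part of any benchmark tooling.

```python
TOTAL_TASKS = 369  # task count quoted in this article


def tasks_breakdown(success_rate_pct, total=TOTAL_TASKS):
    """Convert a success rate in percent into (completed, failed) task counts."""
    completed = round(total * success_rate_pct / 100)
    return completed, total - completed


for pct in (38, 61, 82):
    completed, failed = tasks_breakdown(pct)
    print(f"{pct}% -> {completed} completed, {failed} failed")
```

Under those assumptions, 38% works out to 140 tasks completed and 229 failed, 61% to 225 completed and 144 failed, and 82% to 303 completed and 66 failed.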

Why Coasty Exists and Why the Score Isn't an Accident

I'm going to be straight with you. I work for Coasty. But the 82% on OSWorld Verified isn't a marketing claim, it's a public benchmark result that anyone can verify. The reason Coasty hits that number is architectural, not cosmetic. While most competitors are building wrappers around vision models and hoping the underlying LLM figures out the screen, Coasty is built from the ground up to control real desktops, real browsers, and real terminals. Not API calls pretending to be computer use. Actual mouse clicks, keyboard inputs, and screen-reading on live environments. The desktop app works on your local machine. The cloud VMs mean you can spin up isolated environments for sensitive tasks. The agent swarm architecture means you can run multiple tasks in parallel, which is where the real productivity math gets interesting. BYOK (bring your own key) support means you're not locked into one model provider. The free tier means you can test this without a procurement process. The 82% score is what happens when you build the right thing instead of the fastest thing to announce. Anthropic and OpenAI are chasing that number. Coasty already has it.

The OSWorld benchmark is the most honest thing in AI right now. You can't charm it, you can't spin it, and you can't cherry-pick your way to a good score. It just asks: can your agent actually use a computer? Most of the industry is still answering that question with a nervous 'kind of.' One agent is answering with 82%. If you're evaluating computer use AI for anything real, start with the benchmark and work backwards. Ask every vendor for their OSWorld Verified score. Watch how many of them change the subject. The ones who don't change the subject are the ones worth talking to. Coasty is at coasty.ai. The benchmark result is public. The free tier is real. Go test it yourself and stop taking anyone's word for it, including mine.
