The OSWorld Benchmark Results Are In, and Most AI Computer Use Agents Should Be Embarrassed
The human baseline on OSWorld is 72%. Let that sink in for a second. For most of 2024 and well into 2025, the best AI computer use agents in the world were scoring in the 30s and 40s on this benchmark. Grown adults at billion-dollar AI labs were shipping products that couldn't outperform a random office worker clicking around a desktop. And they were charging enterprise prices for it. OSWorld is the closest thing we have to a real, honest test of whether an AI agent can actually use a computer, not just pretend to. It throws 369 tasks at the agent inside real operating system environments: real apps, real file systems. No guardrails. No shortcuts. Just: can your agent get the job done? The 2025 and early 2026 results have reshuffled the entire competitive picture, and if you're still betting on the big-name incumbents, you might want to check the scoreboard.
What OSWorld Actually Tests (And Why Most Benchmarks Are Garbage)
Most AI benchmarks are a joke. They test pattern matching on datasets the models have probably seen during training. OSWorld is different, and the AI industry hates it for that reason. It runs agents inside real virtual machines: real Ubuntu, real Windows, real macOS environments. The agent has to open LibreOffice, manipulate files, navigate browsers, run terminal commands, and complete multi-step workflows that require genuine understanding of what's on screen. There's no API to call. There's no shortcut. The agent sees pixels, it reasons, it acts. It either completes the task or it doesn't. That's why the scores are so much lower here than on every other benchmark these companies love to cite. When Anthropic brags about Claude's performance on some curated reasoning test, that's marketing. When a model posts a score on OSWorld, that's reality. The benchmark was published at NeurIPS 2024 and has quickly become the de facto standard for evaluating genuine computer use capability. If your AI agent vendor isn't citing OSWorld, ask yourself why.
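To make that loop concrete, here's a minimal sketch of the observe-reason-act cycle the benchmark enforces. Every name in it (vm.capture_screen, agent.next_action, vm.perform, vm.evaluate) is a hypothetical stand-in, not OSWorld's actual harness API; the point is the shape of the constraint, not the exact calls.

```python
# Minimal sketch of the observe-reason-act loop OSWorld enforces.
# All names here (vm.capture_screen, agent.next_action, vm.perform,
# vm.evaluate) are hypothetical stand-ins, not OSWorld's actual API.
import time

MAX_STEPS = 15  # evaluations cap how many actions the agent may take


def run_task(task: str, agent, vm) -> bool:
    """Drive one task inside a real VM. The agent only ever sees pixels
    and can only respond with concrete UI actions."""
    for _ in range(MAX_STEPS):
        screenshot = vm.capture_screen()              # raw pixels; no DOM, no API
        action = agent.next_action(task, screenshot)  # reason over the image alone
        if action == "DONE":                          # agent believes it has finished
            break
        vm.perform(action)                            # click, type, scroll, or shell command
        time.sleep(0.5)                               # let the UI settle before observing again
    return vm.evaluate(task)                          # scripted checker inspects final VM state
```

Notice what's missing: there's no side channel to application state. If a popup steals focus or a page loads slowly, the agent has to notice it in the pixels, which is exactly where weaker agents fall apart.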
The Leaderboard Is a Massacre. Here Are the Real Numbers.
- Human experts score roughly 72% on OSWorld. That's the bar. Everything below it means the agent is worse than a person.
- Claude Sonnet 4.5 (Anthropic) scores 61.4% on OSWorld. That's after Anthropic spent a year hyping computer use as their flagship capability.
- OpenAI's CUA model tops out at 60.76% in its best configuration, according to the CoAct-1 paper from August 2025. Below human. Still.
- Simular's Agent S3 reached 72.6% in December 2025, finally cracking the human baseline. Big deal, genuinely. But it didn't stay on top for long.
- AGI Inc. (the team behind Coasty) published a result of 76.26% on OSWorld in October 2025, beating human performance by a meaningful margin.
- Coasty's current score sits at 82% on OSWorld. That's not a rounding error. That's nearly a 10-point gap over the next best competitor.
- The gap between 82% and 61% isn't incremental. In real tasks, that's the difference between an agent that finishes the job and one that fails more than a third of the time.
Claude Computer Use scores 61.4% on OSWorld. Coasty scores 82%. That 20-point gap means Anthropic's flagship computer use product fails on tasks that Coasty completes. At enterprise scale, that's not a footnote. That's your automation strategy collapsing.
Why This Gap Costs Real Companies Real Money
Here's where the benchmark stops being abstract. Over 40% of workers spend at least a quarter of their work week on manual, repetitive tasks, according to Smartsheet's research. UK data shows workers waste 12.6 hours per week on tasks that should be automated. Each employee stuck doing manual data entry costs organizations thousands of dollars a year in pure lost productivity. When you deploy a computer use agent that fails 38% of the time, you don't save that time. You create a new problem: someone has to audit the failures, fix the mistakes, and babysit the agent. You've traded one manual process for two. That's the dirty secret of deploying mediocre AI agents in production. A 61% success rate sounds impressive until you realize it means your agent is botching roughly 4 out of every 10 tasks. In a workflow running hundreds of tasks a week, that's a part-time job just cleaning up the mess. The OSWorld benchmark scores aren't academic vanity metrics. They're a direct proxy for how often your automation is going to call you at 2am.
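To make the "part-time job" line concrete, here's the back-of-the-envelope math behind it. The workload (300 tasks a week) and the cleanup cost (10 minutes per failed task) are illustrative assumptions I'm supplying, not measured data; only the two success rates come from the benchmark.

```python
# Back-of-the-envelope cleanup math. TASKS_PER_WEEK and the per-failure
# cleanup cost are illustrative assumptions; the success rates are the
# published OSWorld scores.
TASKS_PER_WEEK = 300
CLEANUP_MIN_PER_FAILURE = 10

for name, success in [("Claude Sonnet 4.5", 0.614), ("Coasty", 0.82)]:
    failures = TASKS_PER_WEEK * (1 - success)
    hours = failures * CLEANUP_MIN_PER_FAILURE / 60
    print(f"{name}: {failures:.0f} failures/week -> {hours:.1f}h of human cleanup")

# Claude Sonnet 4.5: 116 failures/week -> 19.3h of human cleanup
# Coasty: 54 failures/week -> 9.0h of human cleanup
```

Under those assumptions, the 61.4% agent generates roughly twice the cleanup burden of the 82% one, and the gap scales linearly with task volume.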
The Anthropic and OpenAI Computer Use Problem Nobody Wants to Say Out Loud
Anthropic launched Claude Computer Use in late 2024 with enormous fanfare. The demos were impressive. The marketing was loud. The actual OSWorld performance at launch was 22% at best, and 14.9% in the screenshot-only setting. To their credit, they've improved steadily, and Sonnet 4.6 shows continued progress on the benchmark. But here's the uncomfortable truth: Anthropic's computer use product is built on top of a general-purpose model that was never specifically architected to be a computer use agent. It's a brilliant LLM that also happens to click things. That's a fundamentally different design philosophy from building an agent from the ground up to control desktops, browsers, and terminals. OpenAI's Operator has the same problem. It's impressive for demos and light web tasks, but the benchmark scores tell you it's not ready for the complex, multi-step desktop workflows that actually matter in enterprise environments. Both companies are iterating fast; nobody's disputing that. But right now, in early 2026, you're looking at a 20-point performance gap between the incumbents and the actual best computer use agent on the market. That gap is not closing overnight.
Why Coasty Exists and Why the 82% Score Is Not a Coincidence
I'll be straight with you. I work at Coasty. But I'm going to tell you why the benchmark score is real and what's behind it, because you deserve more than a press release. Coasty is built by AGI Inc., a team that has been laser-focused on one thing: making an AI agent that can actually use a computer the way a human does. Not an LLM with a screenshot tool bolted on. A purpose-built computer use agent with a desktop app, cloud VMs, and agent swarms for parallel execution. The 82% on OSWorld isn't a cherry-picked configuration or a one-time run. It reflects an architecture designed around real computer use from day one, including handling the messy, unpredictable things that real desktops throw at you, like unexpected popups, slow-loading pages, apps that don't respond, and workflows that require genuine multi-step reasoning across different applications. The practical difference is real. Coasty controls actual desktops and browsers and terminals, not just API endpoints. It can run agent swarms to parallelize work. It has a free tier so you can test it without a six-figure contract. And it supports BYOK so you're not locked into one model provider. The OSWorld score is the proof point. The architecture is the reason.
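For readers who want to picture what "agent swarms for parallel execution" means mechanically, here's a generic fan-out sketch using Python's standard library. To be clear, this is not Coasty's SDK; run_in_vm is a hypothetical stand-in for handing one task to one isolated cloud VM.

```python
# Generic fan-out sketch of the agent-swarm idea: independent desktop
# tasks dispatched to isolated VM-backed agents in parallel. run_in_vm
# is a hypothetical stand-in, not Coasty's actual SDK.
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_vm(task: str) -> bool:
    # Stub: a real implementation would provision a cloud VM, run the
    # observe-reason-act loop against it, and report success or failure.
    return True


def run_swarm(tasks: list[str], max_parallel: int = 8) -> dict[str, bool]:
    results: dict[str, bool] = {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {pool.submit(run_in_vm, t): t for t in tasks}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()  # collect as each VM finishes
    return results


print(run_swarm(["reconcile invoices", "export CRM report", "archive tickets"]))
```

The design point is isolation: each task gets its own machine, so throughput scales with the number of VMs instead of being bottlenecked on one agent's serial speed.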
Here's my honest take after digging through all these benchmark results. We are at an inflection point for computer use AI, and the gap between the best and the rest is wider than the headlines suggest. Most companies are going to deploy the most familiar name, Claude or Operator, because it's the safe choice. And they're going to get 61% reliability on their automation workflows and wonder why the ROI isn't materializing. The OSWorld benchmark exists precisely so you don't have to learn this lesson the hard way. The scores are public. The gap is real. An agent that scores 82% on the hardest computer use benchmark in existence is not a marginal improvement. It's a different category of tool. If you're serious about automating real desktop work in 2026, not just web scraping or simple form fills but actual multi-step computer use across real applications, you owe it to yourself to test the thing that actually leads the benchmark. Go try it at coasty.ai. Free tier, no sales call required. The numbers speak for themselves.