Research

OSWorld Went from 12% to 82% in 18 Months. Most AI Agents Are Still Failing.

Daniel Kim · 7 min read

Eighteen months ago, the best AI agent on the planet could complete about 12% of real computer tasks on its own. Twelve percent. That number was so bad that most enterprise software vendors quietly ignored the benchmark entirely and hoped nobody would ask. Then something happened. Scores started climbing fast, then faster, then at a rate J.P. Morgan's analysts put at '37 percentage points per year.' Now the top of the OSWorld leaderboard sits at 82%, Anthropic's own system card admits Claude has gone from 'the teens to the low 70s' since late 2024, and the AI computer use arms race is fully, undeniably real. The problem is that most companies, and most AI tools, are still stuck at the bottom of that curve. And they're charging you like they're not.

What OSWorld Actually Tests (And Why It's the Only Number That Matters)

OSWorld is not a vibe check. It's not a curated demo or a cherry-picked use case from a vendor's marketing deck. It's a standardized benchmark of 369 real computer tasks run inside actual desktop environments, covering web browsers, file systems, terminals, spreadsheets, and native apps across Linux, macOS, and Windows. The agent either completes the task or it doesn't. No partial credit for 'almost.' When a computer use agent scores 38% on OSWorld, that means 62% of the tasks you'd actually give it will fail. When it scores 82%, the math flips. That's the difference between a toy and a tool. The reason the industry fought so hard to ignore this benchmark for so long is exactly why you should pay attention to it now. It's the one number vendors can't spin.
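To make the no-partial-credit arithmetic concrete, here's a minimal sketch in plain Python. The 369 task count is OSWorld's published figure; the rest is just the math the scores imply.

```python
N_TASKS = 369  # OSWorld's published task count

def expected_failures(score: float, n_tasks: int = N_TASKS) -> int:
    """All-or-nothing scoring: any task not fully completed counts as a failure."""
    return round((1 - score) * n_tasks)

print(expected_failures(0.381))  # Operator's CUA at launch: ~228 of 369 tasks fail
print(expected_failures(0.82))   # at 82%, the same benchmark leaves ~66 failures
```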

The Leaderboard Timeline Is Genuinely Shocking

  • Early 2024: Best available AI computer use agents score around 12-15% on OSWorld. Human performance sits at roughly 72%.
  • Early 2025: OpenAI's Computer-Using Agent (CUA) powering Operator launches at 38.1%, which felt impressive at the time. It wasn't.
  • Mid-2025: OSWorld-Verified launches in July, tightening the evaluation methodology so scores are harder to game.
  • September 2025: Claude Sonnet 4.5 hits 61.4% on OSWorld, and Anthropic calls it 'the best model at using computers.' Bold claim for a score that still fails 4 out of 10 tasks.
  • Late 2025 to early 2026: Claude Sonnet 4.6 pushes into the low 70s. The gap to human performance is closing fast.
  • Now: Coasty sits at 82% on OSWorld, above every public competitor, and above the original human baseline. That's not a marketing number. That's the leaderboard.
  • 2026 outlook: J.P. Morgan explicitly cited OSWorld improvement rates as evidence of the fastest capability ramp in AI history.

"Since Claude Sonnet 3.5 in October 2024, OSWorld scores have gone from the teens to the low 70s." That's Anthropic's own system card. In one year. And they're still not at the top.

Why Your Current Automation Stack Is Already Obsolete

Here's the part nobody in the RPA industry wants to talk about. Traditional tools like UiPath are built on brittle selectors, rigid workflows, and the assumption that every UI element will be exactly where it was last Tuesday. UiPath's own engineering blog admits their 'Healing Agent' feature exists specifically because UI automation has a 'significant failure rate' from selector breakage. They built a band-aid for a structural wound.

Meanwhile, a real computer use agent doesn't care if someone moved a button or redesigned a form. It sees the screen the way a human does and figures it out. The enterprises that are still paying for RPA maintenance, selector updates, and bot babysitters in 2026 are doing the equivalent of paying someone to maintain a fax machine. And the cost of that inertia is not abstract: McKinsey puts the AI productivity opportunity at $4.4 trillion, and manual invoice processing alone averages $15 per invoice in labor costs. Every month you wait is money you're handing to your competitors.
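To see the structural difference rather than take anyone's word for it, compare the two automation styles side by side. This is a schematic sketch, not any vendor's actual code: the RPA half uses a real Selenium call, while the agent half is a generic observe-decide-act loop with the model and screen I/O left as injected stand-ins.

```python
from selenium.webdriver.common.by import By

# --- Style 1: selector-based RPA --------------------------------------------
def rpa_submit_invoice(driver) -> None:
    # The whole workflow hangs on this hard-coded id. Rename the button in a
    # redesign and this raises NoSuchElementException; cue the 'healing' layer.
    driver.find_element(By.ID, "submit-invoice-btn").click()

# --- Style 2: vision-based computer use agent --------------------------------
def agent_run(goal: str, model, capture_screen, execute, max_steps: int = 50) -> None:
    # Re-grounds itself on raw pixels every step, so a moved button just means
    # the next screenshot looks different, not that the workflow is dead.
    for _ in range(max_steps):
        screenshot = capture_screen()            # pixels; no DOM, no selectors
        action = model.decide(goal, screenshot)  # e.g. click(x, y), type text
        if action.kind == "done":
            return
        execute(action)
```

The contrast is the point: the first function encodes an assumption about the UI at authoring time, while the second defers that decision to runtime, which is exactly the ability OSWorld measures.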

The Dirty Secret About Vendor Benchmark Claims

When OpenAI launched Operator in January 2025, they led with WebVoyager and WebArena scores, where CUA scored 87% and looked dominant. What they didn't lead with was the OSWorld number: 38.1%. Why? Because WebVoyager tests browser tasks. OSWorld tests the real computer, the full desktop, the messy reality of enterprise software. Anthropic called Claude Sonnet 4.5 'the best model at using computers' on their launch blog. At 61.4%, that was arguably true for about six weeks. The lesson here is that every AI vendor picks the benchmark where they look best. The only honest move is to look at the hardest, broadest, most manipulation-resistant evaluation available, and right now that's OSWorld-Verified. If a vendor isn't publishing their OSWorld score, ask yourself why.

Why Coasty Exists

I'm not going to pretend I don't have a dog in this fight. Coasty was built specifically to win on real-world computer use, not on hand-picked demos or narrow browser benchmarks. The 82% OSWorld score isn't a talking point, it's the result of building an agent that actually controls desktops, browsers, and terminals the way a skilled human operator would. Not API calls dressed up as automation. Not a chatbot with a screenshot plugin. A genuine computer use agent that sees your screen, reasons about what needs to happen, and executes.

The desktop app runs locally. Cloud VMs are available for teams that want to scale without touching their own infrastructure. Agent swarms let you run parallel tasks so a job that takes a human eight hours can finish in under one. There's a free tier if you want to see the OSWorld gap for yourself before committing. BYOK is supported if you're particular about model choice.

The pitch is simple: if you're evaluating computer use AI agents and you're not starting with the one that scores highest on the hardest benchmark, you're not actually evaluating, you're just shopping for the logo you already trust.
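The swarm claim, at least, is ordinary parallelism arithmetic you can sanity-check without any proprietary API. Here's a minimal sketch using Python's standard library, with a stub where real agent dispatch would go (this is not Coasty's actual interface, which this post doesn't document):

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent_task(task: str) -> str:
    # Stand-in for handing one task to one agent instance (e.g. a cloud VM).
    return f"done: {task}"

# Eight independent one-hour tasks: ~8 hours sequentially for one human,
# ~1 wall-clock hour for eight agents running in parallel.
tasks = [f"reconcile invoice batch {i}" for i in range(8)]

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_agent_task, tasks))

print(results)
```

The caveat is the same one the benchmark imposes: the speedup only holds for tasks that are genuinely independent and that the agent actually completes.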

The OSWorld benchmark went from an ignored academic paper to the most important number in enterprise automation in less than two years. The scores tell a clear story: this technology is moving faster than any analyst predicted, the gap between the best and the rest is widening, and the window to get ahead of it is closing. If you're still running manual processes because 'automation is too risky' or 'we tried RPA and it broke,' those objections made sense in 2022. They don't anymore. The benchmark says so. An 82% task completion rate on the hardest real-world computer use evaluation available is not a reason to wait and see. It's a reason to move. Go to coasty.ai, spin up the free tier, and give it the most annoying repetitive task your team does every week. You'll stop asking questions about the benchmark pretty fast.

Want to see this in action?

View Case Studies
Try Coasty Free