
The OSWorld Benchmark Is Being Demolished, and Most Companies Have No Idea What That Means for Computer Use AI

David Park · 7 min

In late 2024, Anthropic shipped Claude Computer Use and the AI world lost its mind. It could click buttons. It could open apps. It could sort of, kind of, do things on a real computer. The OSWorld benchmark score for that version? Around 22%. Everyone called it revolutionary. That's how low the bar was. Fast forward to right now, and the best computer use agents are scoring above 80% on that same benchmark. Citrix is already writing blog posts asking what happens when someone hits 100%, because saturation is a real concern. And yet, somehow, the average knowledge worker is still spending 8.2 hours every single week on manual, repetitive computer tasks that a good AI agent could handle before lunch. Something is deeply broken here, and it isn't the benchmark.

What OSWorld Actually Tests (And Why the Scores Should Scare Your Competition)

OSWorld isn't a toy benchmark. It's 369 real desktop tasks across file management, web browsing, multi-app workflows, and terminal operations, all run in actual computer environments with no hand-holding. When OpenAI launched Operator in January 2025 with massive fanfare, its underlying Computer-Using Agent model scored 38.1% on OSWorld. Anthropic's Claude Computer Use at the time sat around 22%. The research community was impressed. Investors were excited. Regular people were skeptical. They were right to be. A 38% success rate means the agent fails on more than 6 out of every 10 tasks. You wouldn't hire a contractor who only shows up 38% of the time. You definitely wouldn't automate your business on one. The benchmark exists precisely to cut through the marketing noise and ask a brutally simple question: can this thing actually do computer work? For most of 2024 and early 2025, the honest answer was 'barely.'

The Leaderboard in Early 2026: A Very Different Story

  • Claude Sonnet 4.6 (Anthropic, Feb 2026) hit 72.5% on OSWorld-Verified, a massive jump from the ~22% starting point of Claude Computer Use in late 2024
  • Claude Sonnet 4.5 (Sep 2025) scored 61.4%, showing just how fast the curve is accelerating every few months
  • OpenAI's original CUA launched at 38.1%, and forecasters who said a 65% score before 2027 would be 'nuts' were already being proven wrong by mid-2025
  • Coasty sits at 82% on OSWorld, the highest verified score of any computer use agent available today, and the gap to second place is not small
  • Citrix published a July 2025 analysis asking what the industry does when agents start hitting 100%, because the benchmark is approaching saturation at the top
  • Agent S2 and other specialized models are also pushing into high-score territory, confirming this isn't a one-lab fluke but a genuine industry inflection point

Manual data entry and repetitive computer tasks cost U.S. companies $28,500 per employee per year. Over half of those employees report burnout from the work. The AI agents that could fix this are already scoring 80%+ on real-world benchmarks. The only thing missing is someone at your company paying attention.

Why 72% Is Not Good Enough and Why the Gap to 82% Is Enormous

People see benchmark numbers and treat them like school grades. 72% sounds like a solid B. 82% sounds like an A. That framing is completely wrong when you're talking about autonomous computer use agents running real business workflows. Think about what a 10-percentage-point gap actually means in production. If you're running 500 automated tasks per day, the difference between 72% and 82% is 50 additional failures every single day. Those failures need human review. They create exceptions. They break downstream processes. They erode trust in the whole system until someone decides to just do it manually again. The history of enterprise automation is a graveyard of tools that were 'good enough' in demos and disasters in production. RPA vendors sold this dream for a decade. Fragile scripts. Constant maintenance. Massive implementation costs. The reason OSWorld matters so much is that it finally gives buyers a way to see through the demos and ask for a real number. And right now, the real numbers show a clear winner.
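
If you want to sanity-check that failure math yourself, here is a minimal back-of-the-envelope sketch in Python. The 500-tasks-per-day volume is the illustrative figure from the paragraph above, not a measurement of any real workload:

```python
# Back-of-the-envelope: how a benchmark-style success rate translates
# into daily failures that a human has to review.

def expected_failures(tasks_per_day: int, success_rate: float) -> float:
    """Expected number of failed tasks per day at a given success rate."""
    return tasks_per_day * (1.0 - success_rate)

TASKS_PER_DAY = 500  # illustrative volume from the example above

for rate in (0.72, 0.82):
    print(f"{rate:.0%} success -> {expected_failures(TASKS_PER_DAY, rate):.0f} failures/day")

# Output:
# 72% success -> 140 failures/day
# 82% success -> 90 failures/day
# The 10-point gap is 50 extra exceptions landing on someone's desk every day.
```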

The $28,500 Problem Nobody Wants to Talk About

Here's the stat that should be pinned to every executive's monitor. Manual data entry and repetitive computer tasks cost U.S. companies $28,500 per employee per year, according to a 2025 Parseur analysis. That's not total compensation. That's just the cost of the wasted time on work that computers should be doing. For a 50-person operations team, you're burning through $1.4 million a year on tasks that a well-configured computer use agent could handle. And 56% of the people doing that work report burnout from it. You're not just wasting money. You're grinding down your best people with the most soul-crushing possible version of their jobs. The argument for deploying a real AI computer use agent isn't about replacing workers. It's about stopping the practice of paying skilled humans to do things that are genuinely beneath them. The benchmark scores tell us the technology is ready. The cost data tells us the urgency is real. The only question left is why so many companies are still waiting.
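
The same multiplication, spelled out so you can plug in your own headcount; the per-employee figure is the Parseur number cited above and the 50-person team is just the example from this paragraph:

```python
# Annual cost of manual, repetitive computer work for one team,
# using the per-employee figure cited above (2025 Parseur analysis).

COST_PER_EMPLOYEE = 28_500  # USD per year spent on manual, repetitive tasks
TEAM_SIZE = 50              # illustrative operations team from the example above

annual_cost = COST_PER_EMPLOYEE * TEAM_SIZE
print(f"${annual_cost:,} per year")  # $1,425,000 -> the roughly $1.4M figure in the text
```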

Why Coasty Exists

I'll be straight with you. I work for Coasty, so take that for what it's worth. But the 82% OSWorld score isn't marketing copy; it's the highest verified result on the benchmark, higher than Anthropic, higher than OpenAI, higher than every specialized agent that's been thrown at it. What makes that number mean something in the real world is how Coasty actually works. It controls real desktops, real browsers, and real terminals. Not API wrappers. Not screen-scraping hacks. Actual computer use the way a human does it, which is exactly what OSWorld tests. You can run it as a desktop app, spin up cloud VMs, or deploy agent swarms for parallel execution when you need to scale. There's a free tier if you want to see what 82% actually feels like on your own workflows before committing to anything. BYOK is supported if you have model preferences. The reason I bring this up after walking through the benchmark data is simple: the scores are finally high enough to matter in production, and Coasty is the one sitting at the top of that leaderboard. If you're evaluating computer use agents right now, that's the relevant fact.

Here's my actual take after watching this benchmark evolve for the past 18 months. The OSWorld leaderboard is the most honest document in enterprise AI right now. It doesn't care about press releases or demo videos or how many Fortune 500 logos a vendor can put on a slide deck. It asks one question: can your agent actually do computer work? The scores have gone from 22% to 82% in roughly 18 months. That is an insane rate of progress. The companies that are paying attention and deploying real computer use agents today are building an operational advantage that will be very hard to close in two or three years. The companies still debating whether AI is 'ready' are going to look back at 2026 the way people look back at companies that were still faxing documents in 2005. Don't be that company. Check the leaderboard. Run the free tier. See what 82% actually does to your workflows. Start at coasty.ai.

Want to see this in action?

View Case Studies
Try Coasty Free