Research

AI Agent Benchmark Results 2026: Most Computer Use Scores Are Lies, and One Isn't

Marcus Sterling||7 min
+T

A UC Berkeley research team just scored 100% on a major AI agent benchmark without solving a single actual task. Let that sink in. The benchmark that half the industry uses to justify their pricing, their press releases, and their investor decks, got completely cooked by a one-character exploit. This is the state of AI agent evaluation in 2026, and if you're making purchasing decisions based on the numbers companies are throwing at you, you deserve to know how dirty this game has gotten. The Stanford HAI 2026 AI Index confirmed that computer use agents went from roughly 12% accuracy on OSWorld to around 66% in just a year. Real progress is happening. But buried inside that progress story is a benchmark integrity crisis that nobody in a vendor marketing department wants to talk about.

Berkeley Broke Every Benchmark, and Nobody Is Talking About It Enough

In April 2026, researchers from UC Berkeley's Center for Responsible, Decentralized Intelligence published a paper called 'How We Broke Top AI Agent Benchmarks: And What Comes Next.' It's one of the most important documents in AI right now and most people in the industry are quietly hoping you don't read it. The headline finding: FieldWorkArena, a benchmark with 890 tasks, could be gamed to 100% completion using a single character manipulation. Not a clever prompt. Not weeks of fine-tuning. One character. The exploit coverage across multiple top benchmarks was staggering. This isn't a fringe academic critique. This is Berkeley, Dawn Song's lab, telling the entire industry that the scorecard is broken. So when OpenAI announces GPT-5.5 hit 78.7% on OSWorld-Verified, or Anthropic's Claude Sonnet 4.6 brags about 72.5%, you have every right to ask: verified by whom, under what conditions, and who checked the exploit surface? The word 'verified' is doing a lot of heavy lifting in those press releases.

The 2026 Leaderboard, Stripped of the Marketing Spin

  • Coasty: 82% on OSWorld. The highest independently validated score on the board. Not a cherry-picked internal eval.
  • GPT-5.5 (OpenAI): 78.7% on OSWorld-Verified, per OpenAI's own April 2026 release data. Self-reported. OpenAI eventually folded Operator into ChatGPT after it underdelivered as a standalone computer use agent.
  • Claude Sonnet 4.6 (Anthropic): 72.5% on OSWorld, announced February 2026. A genuine improvement over previous Claude versions, but still 10 full points behind the top.
  • Stanford HAI 2026 AI Index baseline: The broader field sits at ~66.3% on OSWorld. Most agents you're being sold are average.
  • Legacy RPA tools (UiPath, Automation Anywhere): Not even on this leaderboard. They don't do computer use. They do brittle script execution dressed up in a suit. Different category, worse results, higher price.

UC Berkeley researchers scored 100% on a top AI agent benchmark without completing a single real task. If that doesn't make you question every benchmark score in every vendor's pitch deck, nothing will.

Why OSWorld Is Still the Benchmark That Actually Matters

Not all benchmarks are equal. OSWorld tests agents on real computer tasks across real operating systems, the kind of messy, unpredictable, multi-step work that actual humans do all day. Clicking through UIs, navigating file systems, running terminal commands, handling browser workflows. It's not a trivia quiz. It's not a coding puzzle with a clean test suite. It's as close to 'does this thing actually work on a real computer' as the field has. That's why the jump from 12% to 66% over recent years is genuinely exciting, and why the gap between 66% and 82% is enormous in practice. A 16-point difference on OSWorld isn't a rounding error. It's the difference between an agent that completes roughly two-thirds of tasks and one that completes more than four out of five. At scale, across thousands of workflows, that gap is the difference between automation that saves your team 20 hours a week and automation that creates a new category of cleanup work.

The RPA Graveyard Is Full of Companies That Believed the Hype

Here's what the benchmark wars distract from: most enterprise automation still fails. RPA promised the world in the late 2010s. Companies spent millions on UiPath and Automation Anywhere deployments that required constant maintenance every time a UI changed, a button moved, or a form got a new field. The tools weren't bad at what they did. What they did just wasn't enough. They automated the path, not the intent. A real computer use agent understands what it's trying to accomplish. It adapts when the screen looks different. It doesn't break because a dropdown menu got redesigned. That's the actual promise of AI computer use, and it's why the jump in OSWorld scores matters beyond the academic flex. Gallup's 2026 State of the Global Workplace report found that only 20% of employees worldwide are engaged, costing the global economy $10 trillion in lost productivity. A chunk of that is people doing work that should have been automated five years ago but wasn't, because the tools weren't good enough. They're good enough now. Some of them, anyway.

Why Coasty Exists, and Why 82% Isn't Just a Number

I'll be straight with you. I work at Coasty and I think it's the best computer use agent available. But I'm not asking you to take my word for it. The OSWorld number is public, it's independently benchmarked, and at 82% it's higher than every competitor in the field right now. Coasty controls real desktops, real browsers, and real terminals. Not API wrappers pretending to be agents. Not a chatbot with a screenshot tool bolted on. It runs on a desktop app, spins up cloud VMs, and supports agent swarms for parallel execution when you need to run dozens of workflows simultaneously. There's a free tier, BYOK support if you want to bring your own model keys, and it's built for the kind of work that actually wastes people's time: data entry across systems that don't talk to each other, multi-step browser workflows, report generation that requires clicking through five different tools. The reason Coasty scores 82% on OSWorld isn't because the team got lucky on the benchmark. It's because the benchmark tests exactly the kind of real computer use that the product was built to handle. That's a meaningful distinction in a field where plenty of vendors optimize for the test, not the task.

Here's my honest take on where we are in 2026. The progress is real. Going from 12% to 82% on OSWorld in a few years is not hype. That's a genuine capability shift. But the benchmark integrity crisis Berkeley exposed means you can't trust vendor scores at face value anymore. You need to know which benchmark, which version, which conditions, and whether anyone independent validated the result. The companies that are shouting the loudest about their scores are often the ones most motivated to obscure the methodology. Demand transparency. Ask for OSWorld specifically. Ask whether the results are self-reported or third-party verified. And when you're ready to actually try a computer use agent that earns its number rather than games it, start with Coasty at coasty.ai. The free tier is there. The 82% is real. Your team's time is worth more than another failed automation pilot.

Want to see this in action?

View Case Studies
Try Coasty Free