The 2026 AI Agent Benchmark Results Are In. Most Computer Use Agents Are Embarrassingly Bad.

Lisa Chen | 7 min read

Manual data entry costs U.S. companies $28,500 per employee every single year. That stat is from a July 2025 report and it's still true today, because most companies still haven't automated the repetitive computer work that's bleeding them dry. You'd think that in 2026, with every AI company on earth screaming about agents, we'd have solved this. We haven't. The 2026 Stanford HAI AI Index Report just confirmed what a lot of us suspected: OSWorld accuracy for AI agents went from roughly 12% to 66.3% in a single year, which sounds incredible until you realize the human baseline is 72.36%. Most AI agents claiming to do computer work can't even match a regular person sitting at a keyboard. The benchmarks are in. The results are messy, controversial, and in some cases, outright embarrassing. Let's go through them.

What OSWorld Actually Tests (And Why Most Agents Fail It)

OSWorld is the hardest, most respected benchmark for computer use AI. It doesn't ask an agent to answer trivia or write a poem. It drops an agent into a real desktop environment and says: open this spreadsheet, find this file, run this terminal command, fill out this form across three different apps. Real tasks. Real software. Real failure modes. The human baseline, meaning what a normal non-expert person scores when given the same tasks, is 72.36%. That's the bar. That's what you need to clear before you can honestly tell a business that your AI agent can replace manual computer work. The 2026 Stanford AI Index report confirmed that the field as a whole only just crept to 66.3% average accuracy. That means the average computer use agent on the market today still fails about one in three tasks (33.7%), while a regular person fails closer to one in four (27.64%). Think about that the next time someone's pitch deck says their agent is 'production ready.'

The Full 2026 Scoreboard: Winners, Losers, and One Shocking Gap

  • Coasty: 82% on OSWorld. The only full agent on this list that clearly beats the human baseline of 72.36%. Not close. Nearly ten percentage points (9.64) above human-level performance.
  • Claude Sonnet 4.6 (Anthropic, February 2026): 72.5% on OSWorld-Verified. Respectable. Barely human-level. Anthropic's best computer use result to date, and Coasty still beats it by nearly 10 points.
  • Simular AI: ~72% on OSWorld-Verified. Almost identical to Claude Sonnet 4.6. Two agents in a dead heat just scraping past human baseline.
  • OpenAI CUA (Computer-Using Agent): 38.1% on OSWorld. Yes, that's the score from one of the teams that put the category on the map. Less than half of what Coasty scores. This is not a rounding error.
  • GPT-5.5 on OSWorld-Verified: 78.7%. OpenAI's newest model shows real improvement, but that's a model score, not a full agent pipeline score. The agent infrastructure around it still matters enormously.
  • The field average: 66.3% per Stanford HAI. Below human baseline. Most agents you can buy or deploy right now lose to the intern.
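
For anyone who wants to check the arithmetic behind the scoreboard, here is a minimal Python sketch. It only restates the figures quoted above; the function names are my own, and because some scores are reported on OSWorld and others on OSWorld-Verified, the gaps should be read as rough comparisons rather than a like-for-like ranking.

```python
# Scores as quoted in the scoreboard above (percent of tasks completed).
# The labels note which benchmark variant each figure was reported on.
HUMAN_BASELINE = 72.36

scores = {
    "Coasty (OSWorld)": 82.0,
    "GPT-5.5 (OSWorld-Verified, model score)": 78.7,
    "Claude Sonnet 4.6 (OSWorld-Verified)": 72.5,
    "Simular AI (OSWorld-Verified, approximate)": 72.0,
    "Field average (Stanford HAI)": 66.3,
    "OpenAI CUA (OSWorld)": 38.1,
}


def gap_to_human(score, baseline=HUMAN_BASELINE):
    """Percentage points above (+) or below (-) the human baseline."""
    return round(score - baseline, 2)


def failure_rate(score):
    """Percent of benchmark tasks the agent does not complete."""
    return round(100.0 - score, 2)


# Print the scoreboard from highest to lowest score.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:45s} {score:5.1f}%  "
          f"gap {gap_to_human(score):+6.2f}  fails {failure_rate(score):5.2f}%")
```

The gap column makes the "nearly 10 points" claims concrete: 82 minus 72.36 is 9.64, and 82 minus 72.5 is 9.5.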

OpenAI CUA scored 38.1% on OSWorld. The human baseline is 72.36%. Coasty scored 82%. That's not a benchmark gap. That's a different category of product entirely.

The Benchmark Gaming Problem Nobody Wants to Talk About

Here's where it gets ugly. Berkeley researchers published a paper in April 2026 titled 'How We Broke Top AI Agent Benchmarks,' and the New York Times ran a piece in early 2025 calling benchmark cheating a full-blown scandal. The problem is real: companies cherry-pick evaluation conditions, train on benchmark-adjacent data, and report numbers that reflect their best possible run rather than typical performance. A February 2026 analysis found scores inflated by as much as 100 points through cherry-picking. One hundred points. MindStudio published a breakdown this month on what benchmark gaming in AI actually looks like and why self-reported scores are almost always inflated. This matters for computer use AI specifically because the stakes are high. You're not just getting a wrong answer in a chatbot. You're giving an agent access to real software, real files, and real business systems. An agent that scores 78% in a cherry-picked internal test might perform at 55% on your actual workflows. OSWorld is harder to game than most benchmarks because it uses real operating system environments with real applications. That's why the scores there are the ones worth trusting, and why Coasty publishing an 82% score on OSWorld specifically is significant.
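
To see why "best possible run" and "typical performance" can diverge even without any cheating, here is a small simulation sketch in Python. Everything in it is an illustrative assumption rather than data from any vendor or paper: a hypothetical agent with a fixed true success rate, a hypothetical 50-task suite, and 20 repeated runs. It simply compares the best run with the mean of all runs.

```python
import random
from statistics import mean


def simulate_run(true_rate, n_tasks, rng):
    """Percent of tasks passed in one simulated evaluation run."""
    passed = sum(rng.random() < true_rate for _ in range(n_tasks))
    return 100.0 * passed / n_tasks


def best_vs_typical(true_rate, n_tasks, n_runs, seed=0):
    """Return (best run score, mean score) over n_runs simulated runs."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    runs = [simulate_run(true_rate, n_tasks, rng) for _ in range(n_runs)]
    return max(runs), mean(runs)


# Hypothetical setup: 66% true success rate, 50 tasks, 20 runs.
best, typical = best_vs_typical(true_rate=0.66, n_tasks=50, n_runs=20)
print(f"best run: {best:.1f}%   mean of runs: {typical:.1f}%")
```

The best run can never be lower than the mean, and on a small task suite it is usually noticeably higher. That is the reason to ask a vendor how many runs a headline score was picked from.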

RPA Is Not the Answer Either. Stop Pretending It Is.

Every time AI agent benchmarks disappoint, someone in the room says 'well, we'll just use UiPath.' Here's the thing about legacy RPA: it's brittle by design. UiPath itself published a blog post in July 2025 about their new 'Healing Agent' feature, and the whole premise of that feature is that their traditional automation breaks constantly when UI elements change. They built a product to fix the failures of their product. That's not a solution. That's a patch on a patch. Over 40% of workers still spend at least a quarter of their work week on manual, repetitive computer tasks according to Smartsheet research. RPA was supposed to fix that a decade ago. It didn't. It created a new class of 'bot maintenance' jobs and a new category of IT debt. The promise of a real computer use agent, one that can see a screen, reason about what's on it, and take action the way a human would, is that it doesn't need a rigid script. It adapts. But only if the underlying model is actually good enough to handle real-world variation. Most aren't. The benchmark scores prove it.

Why Coasty Exists and Why 82% Is the Number That Matters

I'm not going to pretend I don't have a dog in this fight. Coasty is the product I use, and it's the product I recommend, and the reason is simple: 82% on OSWorld is not a marketing claim. It's a reproducible score on the hardest real-world computer use benchmark that exists. The human baseline is 72.36%. Coasty beats it by nearly 10 points. Every other full agent on the scoreboard is either at human level or below it. Coasty controls real desktops, real browsers, and real terminals. Not API wrappers. Not simulated environments. It runs as a desktop app or on cloud VMs, and it supports agent swarms for parallel execution, so you can run multiple tasks simultaneously instead of queuing them up like it's 2018. There's a free tier if you want to test it without a procurement process. Bring-your-own-key (BYOK) is supported if you have model preferences. The reason this matters for businesses bleeding $28,500 per employee on manual computer work is that accuracy is everything. A computer use agent at 38% accuracy doesn't save you time. It creates a supervision burden. You spend more time checking its work than you would have spent doing the work yourself. At 82%, you can actually trust the output. That's the line between a demo and a product. Coasty is at coasty.ai.
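
The supervision-burden argument can be put into rough numbers. The sketch below is a back-of-the-envelope model in Python, not a measurement: it assumes a person reviews every agent result and redoes every failed task by hand, and the 10-minute task and 5-minute review are hypothetical figures chosen only to illustrate the shape of the trade-off.

```python
def expected_minutes_with_agent(accuracy, manual_minutes, review_minutes):
    """Expected human minutes per task under an assumed workflow:
    every agent result is reviewed, and every failure is redone by hand."""
    return review_minutes + (1.0 - accuracy) * manual_minutes


def break_even_accuracy(manual_minutes, review_minutes):
    """Accuracy above which the agent costs less human time than manual work.
    Follows from review + (1 - a) * manual < manual, i.e. a > review / manual."""
    return review_minutes / manual_minutes


MANUAL_MINUTES = 10.0  # hypothetical time to do the task by hand
REVIEW_MINUTES = 5.0   # hypothetical time to check the agent's work

for label, accuracy in [("38.1%", 0.381), ("82%", 0.82)]:
    minutes = expected_minutes_with_agent(accuracy, MANUAL_MINUTES, REVIEW_MINUTES)
    verdict = "saves time" if minutes < MANUAL_MINUTES else "costs time"
    print(f"agent at {label}: {minutes:.2f} human minutes per task ({verdict})")

print(f"break-even accuracy: {break_even_accuracy(MANUAL_MINUTES, REVIEW_MINUTES):.0%}")
```

Under these assumed numbers the break-even point is 50% accuracy, so an agent at 38.1% costs more human time than doing the task manually and an agent at 82% saves time. With cheaper review the break-even drops, so plug in your own task and review times before drawing conclusions.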

The 2026 AI agent benchmark results tell a clear story if you're willing to read it honestly. Most computer use AI is still below human baseline. The benchmark gaming is real and widespread. RPA is not coming to save you. And the gap between the best computer use agent and the rest of the field is not a few percentage points. It's a chasm. Coasty at 82% versus the field average of 66.3% is the difference between automation that works and automation that requires a babysitter. My take: stop evaluating AI agents based on press releases and demo videos. Demand OSWorld scores. Ask if those scores are self-reported or independently verified. Ask what the accuracy is on your specific workflows, not on the vendor's curated test suite. And if you're still paying people to do repetitive computer work in 2026, that's a choice you're making, not a constraint you're stuck with. The tools exist. One of them is very clearly better than the others. Go check the scores yourself at coasty.ai.
