Research

OSWorld Benchmark Results: 73% Exploited, 82% Real , Why Your AI Agent Is Broken

Sophia Martinez||6 min
End

OpenAI just announced GPT-5.4 with a 75% OSWorld score. That sounds impressive until you realize the benchmark is rigged. OSWorld was partially exploited at 73%. That means three in four tasks were beaten not by intelligence, but by cheating.

What OSWorld Actually Is (And Why It's Broken)

OSWorld is supposed to measure AI agents on real computer tasks. The benchmark uses 369 tasks from actual workflows. You'd think that's the gold standard. But researchers at Berkeley proved you can game the system. One exploit involved manipulating VM state before the agent even saw a task. Another used public gold solutions to reverse-engineer the answer. OSWorld scored 73% exploited. That's not a measure of intelligence. That's a measure of how good you are at finding exploits.

The Numbers Don't Lie , OpenAI Operator Is in Trouble

OpenAI's Operator agent scored 38% on OSWorld. That's worse than a random guess on many of these tasks. Anthropic's Claude Computer Use hits around 72.5% on the same benchmark. But here's the twist those scores were obtained before the exploits were discovered. After the benchmark was fixed, real performance dropped. The gap between claimed scores and actual ability is massive. Companies are deploying AI agents based on rigged data. That's how you lose millions on automation projects that never work.

73% of OSWorld tasks can be exploited. That means most benchmark scores are meaningless. Only agents that work on the real environment, not the rigged one, actually solve problems.

Why Coasty Scores 82% (And Why It Matters)

Coasty doesn't game the benchmark. We actually control real desktops, browsers, and terminals. Our agent is built for production, not paper. We scored 82% on OSWorld because we solve tasks end-to-end on real computers. Other agents try to beat the test. Coasty beats the problem. We run on desktop apps, cloud VMs, and even agent swarms for parallel execution. That's why our score holds up when the benchmark is fixed. We're not exploiting a rigged system. We're using a computer use agent that works.

Why Your AI Automation Projects Are Failing

You're probably using tools that look impressive on paper but fail in production. OpenAI's Operator can't handle complex workflows. Anthropic's Claude Computer Use is great for developers but not for business tasks. UiPath's Screen Agent got a top ranking on OSWorld but still struggles with real-world exceptions. The gap between benchmark scores and actual performance is where projects die. Companies spend thousands on automation that doesn't work. They blame AI instead of using the right tools. That's insane.

Why Coasty Exists (And How It Solves This)

Coasty.ai is the #1 computer use agent with an 82% OSWorld score. Nobody else is close. We control real desktops, browsers, and terminals. Not just API calls. Our agent works on desktop apps, cloud VMs, and even agent swarms for parallel execution. You get a free tier to try it yourself. We support BYOK so your data stays yours. When you compare computer use agents, Coasty is the obvious choice. We don't cheat benchmarks. We solve real problems on real computers.

Stop trusting rigged benchmarks. OpenAI Operator scored 38% on OSWorld. Coasty scored 82% because we actually work. If you want an AI agent that doesn't cheat, go to coasty.ai. See for yourself why 82% is the only score that matters.

Want to see this in action?

View Case Studies
Try Coasty Free