
OSWorld Benchmark 2026 Results Are In: Coasty Crushes The Competition With 82% Success Rate

Lisa Chen | 6 min read

The OSWorld benchmark results are out, and they are absolutely rage-inducing. Claude hit 72.5% success. OpenAI's Computer Use Agent scored just 38.1%. Meanwhile, Coasty? We crushed it at 82%. That is not a rounding error. That is a gap of nearly 10 percentage points between the top two performers, big enough to drive a truck through.

Why These Numbers Actually Matter

OSWorld is not some text-only toy. It tests agents on 369 real desktop computing tasks inside a full desktop operating system running in a virtual machine (primarily Ubuntu). The benchmark evaluates multimodal computer use: clicking, typing, navigating menus, filling forms, installing software, and managing files. These are the exact tasks your employees do every single day. An 82% success rate means an AI agent can reliably complete about 4 out of every 5 desktop tasks it attempts. A 38% rate means it fails roughly 3 out of every 5. The difference is not academic. That 44 percentage point gap translates directly into wasted time, broken workflows, and expensive debugging sessions.
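To make those percentages concrete, here is a minimal sketch that converts the success rates quoted in this article into approximate task counts out of 369. It assumes each rate can be read as a simple pass fraction over the full task set, which is a simplification; the function names are illustrative only.

```python
TOTAL_TASKS = 369  # task count quoted in this article

# Success rates quoted in this article, in percent.
SCORES = {
    "Coasty": 82.0,
    "Claude": 72.5,
    "OpenAI Computer Use Agent": 38.1,
}


def expected_passes(rate_pct: float, total: int = TOTAL_TASKS) -> int:
    """Approximate number of tasks passed, treating the rate as a pass fraction."""
    return round(total * rate_pct / 100)


def gap_points(a_pct: float, b_pct: float) -> float:
    """Difference between two rates in percentage points, rounded to one decimal."""
    return round(a_pct - b_pct, 1)


if __name__ == "__main__":
    for name, rate in SCORES.items():
        passed = expected_passes(rate)
        failed = TOTAL_TASKS - passed
        print(f"{name}: ~{passed} passed, ~{failed} failed of {TOTAL_TASKS}")
    gap = gap_points(SCORES["Coasty"], SCORES["OpenAI Computer Use Agent"])
    print(f"Gap between top and bottom: {gap} percentage points")
```

Under that assumption, the 82% and 38.1% rates correspond to roughly 303 and 141 passed tasks, a difference of about 160 tasks on the same suite.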

The Top Three Computer Use Agents on OSWorld

  • Coasty: 82% success rate, top of the leaderboard. We actually control real desktops, browsers, and terminals.
  • Claude Opus 4.6 and Sonnet 4.6: 72.5% and 72.7% respectively. Impressive, but they leave a lot on the table.
  • OpenAI Computer Use Agent: 38.1%. This is the embarrassing one. OpenAI's flagship computer use agent fails more than 60% of the tasks.

OpenAI's Computer Use Agent has been around for over a year and still fails about 62% of desktop tasks on OSWorld. It is not a new product, and this is not an edge case. It is a fundamental reliability problem.

UiPath Is Claiming Number One on OSWorld

UiPath announced that its Screen Agent, powered by Claude Opus 4.5, ranked number one on the OSWorld-Verified benchmark for enterprise agentic automation. They even tout a 53.6% OSWorld score. That sounds impressive until you compare it with the raw OSWorld numbers. UiPath is scoring differently and promoting a different ranking. Their own tech stack is clearly built on top of models that don't achieve 70%+ on the original benchmark. If you are evaluating enterprise automation tools, pay attention to the real OSWorld scores, not the marketing spin.

The Benchmark Is Flawed and We Know It

Researchers at Berkeley and elsewhere have been exposing serious problems with AI agent benchmarks. They found that some models were being scored against broken ground truth. In one case, a single character was enough to beat 890 tasks because the evaluation logic never caught the flaw. OSWorld itself has issues with VM state manipulation and ground truth correctness. The tests are getting better, but the fundamental problem remains. Benchmarks are approximations of real work. They catch big failures, but they also create perverse incentives where models optimize for the benchmark instead of actually solving problems. That is why any OSWorld score, high or low, is only part of the picture: a system can look great in controlled demos and still fall apart in production.

Why Your Company Is Still Doing Manual Work

If you are still paying people to copy-paste data, fill out forms, and navigate between applications manually, you are wasting money. Office workers spend 1.5 hours a week on these repetitive tasks. Multiply that by your headcount and, for a large organization, you are talking about millions of dollars per year in lost productivity. AI computer use agents are supposed to fix this. In theory. In practice, most tools fail 50%+ of the time and require constant human babysitting. When the agent breaks, you have to step in and fix it. That defeats the whole purpose of automation. You end up with a hybrid workflow where you still do the work yourself but also waste time monitoring the AI.
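As a rough back-of-the-envelope check on that claim, the sketch below multiplies the 1.5 hours per week cited above by headcount. The hourly cost, the number of working weeks, and the example headcount are assumptions chosen purely for illustration; plug in your own numbers.

```python
def annual_repetitive_cost(
    headcount: int,
    hours_per_week: float = 1.5,   # figure cited in this article
    hourly_cost: float = 40.0,     # ASSUMPTION: fully loaded cost per hour, in dollars
    weeks_per_year: int = 48,      # ASSUMPTION: working weeks per year
) -> float:
    """Estimated yearly cost of time spent on repetitive desktop tasks."""
    return headcount * hours_per_week * hourly_cost * weeks_per_year


if __name__ == "__main__":
    # ASSUMPTION: a hypothetical 1,000-person company.
    cost = annual_repetitive_cost(headcount=1000)
    print(f"Estimated annual cost: ${cost:,.0f}")
```

With these assumed inputs, a 1,000-person company lands at about $2.9 million per year. Smaller teams or lower hourly costs scale the estimate down proportionally.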

Why Coasty Is the Obvious Choice

We built Coasty because the existing computer use agents were unacceptable. Most tools claim high performance, but they rely on API wrappers or limited environments that don't reflect real desktop work. Coasty is different. We actually control real desktops, browsers, and terminals. Our agent swarm architecture lets you run multiple agents in parallel across cloud VMs to speed up workflows. You can deploy Coasty on your own infrastructure with bring-your-own-key (BYOK) support. We have an 82% OSWorld success rate because we optimize for the actual tasks people do every day, not for vanity metrics. If you are serious about replacing manual work with AI, Coasty is the only option that actually delivers consistent results.

The OSWorld benchmark results expose a brutal truth. Most computer use AI agents are not ready for production. OpenAI's Computer Use Agent is embarrassing. Claude is impressive but still has room for improvement. UiPath is playing marketing games with rankings. If you want real automation that actually saves time and money, you need a computer use agent that works. Coasty has the 82% OSWorld score to prove it works. Try it yourself with our free tier. Stop burning cash on broken AI. Start using an agent that actually gets things done.
