Anthropic Computer Use Is Losing the Race It Started (Here's Who's Actually Winning)
Anthropic invented the computer use agent category. They dropped Claude Computer Use in October 2024, the whole internet lost its mind, and for about six months they owned the conversation. Then something uncomfortable happened: everyone else shipped, the benchmarks came out, and Anthropic stopped being the best at the thing they created. That's the story nobody in the AI press wants to write directly. So let's write it. Manual data entry alone costs U.S. companies $28,500 per employee per year according to a 2025 Parseur report, and over half of those employees are burned out from the repetitive grind. The promise of computer use agents was that this number goes to zero. The question in 2025 is not whether computer use AI is real. It is. The question is which tool actually delivers, and the honest answer will surprise people who've been sleeping on the benchmarks.
Anthropic Started This Fight and Is Now Losing It on Points
Credit where it's due: Anthropic's Claude Computer Use was genuinely stunning when it launched. Watching Claude scroll through a browser, fill out forms, and navigate desktop apps like a slow but determined intern felt like science fiction in real time. The demos were everywhere. The hype was real. But hype and performance are different things, and the OSWorld benchmark does not care about hype. OSWorld is the gold-standard test for computer use agents. It throws real-world GUI tasks at models and measures whether they actually complete them. When Anthropic's Claude Sonnet 4.5 launched in late September 2025, they celebrated a score of 61.4% on OSWorld. Anthropic's own blog called it 'a significant leap forward.' And sure, compared to where they started, it was. But 61.4% means the model fails on nearly 4 out of 10 real computer tasks. That's not a product you deploy at scale without a human watching over its shoulder the whole time. Meanwhile, the leaderboard kept moving. The gap between 'we invented this' and 'we're the best at this' is now embarrassingly wide.
OpenAI Operator: Late, Hyped, and Still Not Working
OpenAI watched Anthropic's computer use launch and took three months to respond with Operator. The positioning was aggressive. The reality was rougher. One independent reviewer, writing for Understanding AI in July 2025 after testing ChatGPT Agent (the rebranded Operator), put it plainly: the tool 'performed poorly' on real-world tasks and was 'still not very useful.' Another detailed review titled 'Unfinished, Unsuccessful, and Unsafe' noted that Operator arrived months after Anthropic's Computer Use and still didn't work reliably. Months behind and still catching up. That's the OpenAI computer use story. The CUA model underneath Operator combines GPT-4o's vision with reinforcement learning, and it's genuinely impressive in controlled demos. But demos and production are different planets. Users testing it on actual workflows kept running into the same wall: it looks great until it doesn't, and when it doesn't, it fails in spectacular and unpredictable ways. For anyone building real automations on top of a computer-using AI, unpredictable failure is a dealbreaker.
The RPA Graveyard: Why Old-School Tools Can't Save You Either
- 30% to 50% of RPA implementations fail outright, according to EY research that UiPath cites in its own blog posts. The vendor is quoting its own customers' failure rates. Let that sink in.
- UiPath implementations routinely take months of engineering work, require dedicated bot maintenance teams, and break every time the target application updates its UI. You're paying a six-figure implementation bill to automate something that will need re-automating in six months.
- Traditional RPA tools like UiPath work by following rigid scripts. They don't see the screen the way a human does. Change one pixel in the wrong place and the whole bot falls over. Computer use agents actually look at what's on screen and reason about it, which is a fundamentally different and more robust approach.
- The average enterprise spends more maintaining its RPA bots than it saved in the first year. This is not a secret. The RPA vendors know it. They just don't advertise it.
- 56% of employees doing repetitive data tasks report burnout, per the 2025 Parseur manual data entry report. RPA was supposed to fix this years ago. It didn't. The tools were too brittle, too expensive, and too slow to deploy.
Manual data entry costs U.S. companies $28,500 per employee per year, 56% of those workers are burned out, and the 'solutions' companies have thrown at this problem for a decade either fail half the time or cost more to maintain than they save. The computer use agent that actually works isn't a nice-to-have. It's the most important software decision a company makes in 2025.
What the OSWorld Benchmark Actually Tells You (And Why Vendors Hate It)
Here's why OSWorld matters and why you should care about it even if you've never heard of it. Most AI benchmarks test what a model knows. OSWorld tests what a model can do. It puts agents in front of real operating systems, real applications, and real tasks, then measures whether the task actually gets done. No partial credit for 'it was close.' Either the spreadsheet got filled out or it didn't. Either the form got submitted or it didn't. This is the only benchmark that maps directly to whether a computer use agent will save you money in the real world. Anthropic's best score on OSWorld sits around 61%. OpenAI's CUA model launched with scores in the 38% range on the original benchmark before improvements. These are not tools you hand a mission-critical workflow to and walk away from. The top of the OSWorld leaderboard in 2025 and into 2026 tells a very different story, one where a new generation of purpose-built computer use agents has blown past the big lab offerings by focusing on the actual problem instead of making the computer use feature a footnote in a general-purpose model announcement.
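To make the 'no partial credit' idea concrete, here is a minimal Python sketch of all-or-nothing scoring. It is an illustration of the principle described above, not OSWorld's actual evaluation harness: the `TaskResult` type, the `success_rate` function, and the task names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of one benchmark task (not an OSWorld type)."""
    task_id: str
    completed: bool  # True only if the final state passed the task's check


def success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks fully completed; a near miss counts as a failure."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.completed for r in results) / len(results)


# Made-up outcomes for illustration only.
results = [
    TaskResult("fill-spreadsheet", True),
    TaskResult("submit-form", False),   # got to the last field, never submitted
    TaskResult("rename-files", True),
    TaskResult("export-pdf", True),
    TaskResult("update-settings", False),
]

print(f"{success_rate(results):.1%}")  # prints 60.0%
```

The 'submit-form' case is the point: an agent that fills in every field but never presses submit scores exactly the same as one that never opened the page.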
Why Coasty Exists (And Why the Score Gap Is the Whole Story)
I'm going to be straight with you. I work at Coasty. But I'm also someone who tested every major computer use agent on the market before joining, and the reason I'm here is the benchmark number. Coasty sits at 82% on OSWorld. Not 82% on some internal test we designed to flatter ourselves. 82% on the independent, academic, gold-standard benchmark that every serious researcher uses to compare computer use agents. Anthropic is at 61%. The gap is 21 percentage points. In a category where every percentage point represents real tasks either completed or failed, 21 points is an enormous chasm. What makes Coasty different isn't just the score. It's the architecture. Coasty controls real desktops, real browsers, and real terminals. It's not making API calls and pretending that's the same as computer use. It actually sees the screen, moves the mouse, types the keystrokes, and handles the unexpected the way a competent human would. The desktop app works on your machine. The cloud VMs let you spin up parallel agents. The agent swarms let you run dozens of tasks simultaneously instead of waiting in the queue that Claude Pro users have been screaming about in Reddit megathreads for months. There's a free tier. You can bring your own API key. There's no reason to keep paying for a computer use agent that fails nearly 4 out of 10 times when the best one is available right now.
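For a rough sense of what the gap means in practice, the Python sketch below turns the two scores quoted in this article (82% and 61.4%) into expected failures per 1,000 tasks. The big assumption, and it is a big one, is that a benchmark success rate carries over unchanged to your own workload; real workflows may be easier or harder than OSWorld's. The function names are hypothetical.

```python
def expected_failures(success_rate: float, tasks: int) -> int:
    """Expected number of failed tasks, assuming the benchmark rate holds."""
    return round((1 - success_rate) * tasks)


def gap_in_points(a: float, b: float) -> float:
    """Difference between two success rates, in percentage points."""
    return (a - b) * 100


# Scores as quoted in this article.
coasty, claude = 0.82, 0.614

print(round(gap_in_points(coasty, claude), 1))  # 20.6
print(expected_failures(claude, 1000))          # 386
print(expected_failures(coasty, 1000))          # 180
```

Under that assumption, the difference works out to roughly 200 fewer failed tasks per 1,000 attempted.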
Anthropic deserves credit for making computer use AI a real category. That's genuinely important. But the companies that shipped first rarely end up as the companies that ship best, and in 2025 the computer use agent race has a clear leader that isn't Anthropic or OpenAI. The benchmark doesn't lie. 82% versus 61% is not a close call. If you're still copying and pasting data manually, your company is burning $28,500 per employee per year on work that should have been automated yesterday. If you tried Anthropic's computer use and found it flaky, that's not a problem with the concept. That's a problem with the specific tool. The concept works. The right tool works. Stop letting the loudest marketing budget pick your automation stack. Look at the scores, try the product, and make a decision based on what actually completes the task. Start at coasty.ai. The free tier is right there.