Anthropic Computer Use Is Impressive But It's Not Winning: Here's the Brutal Comparison
The New York Times reviewed OpenAI Operator in early 2025 and called it 'brittle and occasionally erratic.' Anthropic's computer use, for all its hype, scores 61.4% on OSWorld, the industry's gold-standard benchmark for real-world computer tasks. And yet, right now, companies everywhere are either paying for one of these half-baked tools or, worse, paying actual humans to click through the same five screens every single day. More than 40% of workers spend at least a quarter of their work week on manual, repetitive tasks, according to Smartsheet research. That's not a productivity problem. That's a management failure. The computer use agent race is real, the stakes are enormous, and most of the players are not nearly as good as their marketing departments want you to believe.
What 'Computer Use' Actually Means (And Why Most People Get It Wrong)
Let's be precise about this, because the term gets abused constantly. A computer use agent doesn't just call an API or fill out a web form with a pre-baked script. It looks at a real screen, understands what it sees, decides what to do, moves a cursor, types, clicks, navigates apps, handles unexpected popups, and recovers when things go sideways. It operates a computer the way a human does. That's the hard part. That's why the OSWorld benchmark exists. It tests AI agents on genuinely open-ended tasks across real operating system environments including browsers, file systems, terminals, and desktop apps. No guardrails. No pre-scripted paths. Just 'do the task.' When Anthropic's computer use scores 61.4% on that benchmark, it means roughly 4 out of 10 real-world tasks end in failure. That's not a minor inconvenience. In a business context, that's a broken workflow and someone still has to clean it up manually.
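To make that concrete, here's the loop every genuine computer use agent runs, stripped to its skeleton. This is a generic sketch, not any vendor's actual API; `capture_screen`, `decide`, and `execute` are hypothetical stubs standing in for real screenshot capture, a vision-model call, and OS-level input events.

```python
from dataclasses import dataclass

# Hypothetical action type: a real agent emits clicks, keystrokes, and scrolls.
@dataclass
class Action:
    kind: str                   # "click", "type", "scroll", "done", ...
    payload: dict | None = None

def capture_screen() -> bytes:
    """Stub: a real agent grabs actual screen pixels (e.g. a PNG screenshot)."""
    return b""

def decide(goal: str, screenshot: bytes) -> Action:
    """Stub: a real agent sends the screenshot to a vision model and parses out an action."""
    return Action(kind="done")

def execute(action: Action) -> None:
    """Stub: a real agent synthesizes OS-level mouse and keyboard events."""

def run_task(goal: str, max_steps: int = 50) -> bool:
    """The observe-decide-act loop that defines 'computer use'."""
    for _ in range(max_steps):
        screenshot = capture_screen()       # observe: real pixels, not a DOM or API
        action = decide(goal, screenshot)   # decide: the model picks the next UI action
        if action.kind == "done":
            return True                     # the model judges the goal complete
        execute(action)                     # act: perform the action like a human would
    return False                            # step budget exhausted: the task failed
```

Everything hard lives inside those three stubs: reading pixels instead of a DOM, choosing the right action from an ambiguous screen, and recovering when a click didn't do what the model expected. That's what OSWorld measures, and it's why the scores are where they are.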
The Honest Scorecard: Anthropic, OpenAI, and the RPA Dinosaurs
- Anthropic Computer Use (Claude Sonnet 4.5): 61.4% on OSWorld. Impressive progress, genuinely useful for certain tasks, but still failing on nearly 4 in 10 real-world computer interactions. Rate limits and message caps have been a consistent complaint from power users on Reddit since 2024.
- OpenAI Operator: The New York Times described it as 'brittle and occasionally erratic' after hands-on testing in February 2025. It's powered by their CUA model, which combines GPT-4o vision with reasoning. The vision is right. The execution is not there yet.
- UiPath and legacy RPA: A 2023 Ernst & Young analysis found RPA bots have a 30-50% failure rate when the underlying software updates. You're essentially paying a six-figure implementation fee to build something that breaks every time a vendor pushes a UI change. That's not automation. That's a maintenance contract with extra steps. (The sketch after this list shows why that brittleness is structural.)
- Google Astra and other contenders: Still largely in research preview or limited deployment. Interesting benchmarks, not yet a real enterprise product you can actually run workflows on today.
- Coasty: 82% on OSWorld. That's not a rounding error over the competition. That's a different category of performance. It controls real desktops, real browsers, and real terminals, not just sandboxed demos.
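Why is legacy RPA so fragile? Because the script encodes the UI's internals, not the task. Here's an illustrative Selenium-style bot; the page and selectors are hypothetical, but the pattern is exactly what six-figure RPA implementations ship:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://vendor.example.com/invoices")  # hypothetical internal app

# The bot is welded to exact DOM internals. If the vendor renames a CSS class,
# reorders the toolbar, or ships a redesign, every line below throws
# NoSuchElementException and the workflow dies.
driver.find_element(By.ID, "inv-search-box").send_keys("Q3-2024")
driver.find_element(By.XPATH, "//div[@class='toolbar-v2']/button[3]").click()
driver.find_element(By.CSS_SELECTOR, ".results-grid .row:first-child .export-btn").click()
driver.quit()
```

A vision-based agent looks at the rendered screen instead, so a cosmetic redesign that leaves the task recognizable to a human leaves it recognizable to the agent too. That's why the 30-50% update-failure rate isn't an implementation bug anyone can patch.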
The Real Cost of Getting This Wrong
Here's the number that should make you angry. UK data from Red Eagle Tech shows workers waste an average of 12.6 hours per week on manual processes. At a fully-loaded cost of even $50 per hour for a knowledge worker, that's $630 per person per week. For a team of 20, that's $12,600 a week going directly into the trash. Per year, you're looking at over $655,000 in wasted productivity from one mid-sized team doing work that a properly configured computer use agent could handle. And that's before you count the errors, the burnout, and the good employees who quit because they're tired of doing data entry with a master's degree. The argument for computer use AI isn't just efficiency. It's that you're currently bleeding money at a rate that would embarrass any CFO who actually looked at where the hours go.
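Don't take the math on faith; it's three multiplications. A quick sanity check, using only the two sourced inputs:

```python
hours_wasted_per_week = 12.6   # Red Eagle Tech: avg manual-process hours per worker
loaded_hourly_rate = 50        # conservative fully-loaded knowledge-worker cost ($)
team_size = 20
weeks_per_year = 52

weekly_per_person = hours_wasted_per_week * loaded_hourly_rate   # $630
weekly_team = weekly_per_person * team_size                      # $12,600
annual_team = weekly_team * weeks_per_year                       # $655,200

print(f"${weekly_per_person:,.0f}/person/week -> ${annual_team:,.0f}/team/year")
```

Swap in your own headcount and loaded rate; the number rarely gets less embarrassing.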
Why Anthropic's Computer Use Is Good But Not Good Enough for Production
I want to be fair here because Anthropic has done genuinely impressive work. Claude's computer use capabilities are real, the underlying model is smart, and the progress from early demos to Claude Sonnet 4.5 has been significant. But 'impressive for a foundation model' and 'ready to run your business operations' are two very different bars. The rate limits are a real problem. Threads on Reddit going back to 2024 are full of power users hitting walls mid-workflow, which is catastrophic if you're running an automated process that needs to finish. The 61.4% OSWorld score means you can't trust it to complete complex multi-step tasks unsupervised. You still need a human watching. At that point, you've added a layer of complexity without removing the human, which is the worst of both worlds. Anthropic is also fundamentally a model company, not an agent infrastructure company. They give you the brain. You still have to build the body, the orchestration, the error handling, the retry logic, and the deployment pipeline yourself. For most teams, that's months of engineering work before you see a single automated workflow in production.
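If you've never built that layer, here's roughly what the first afternoon looks like. This is a generic pattern sketch, not Anthropic's SDK: `call_agent_step` is a hypothetical stand-in for your actual model call, and `RateLimitError` for whatever your client library raises when you hit the wall.

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for whatever rate-limit exception your model SDK raises."""

def call_agent_step(task: str) -> str:
    """Hypothetical single agent step; replace with your real model call."""
    if random.random() < 0.5:   # simulate the mid-workflow walls users report
        raise RateLimitError
    return f"completed: {task}"

def run_step_with_retry(task: str, max_retries: int = 5) -> str:
    """Exponential backoff with jitter: table stakes for unattended workflows."""
    for attempt in range(max_retries):
        try:
            return call_agent_step(task)
        except RateLimitError:
            time.sleep(2 ** attempt + random.random())  # 1s, 2s, 4s... plus jitter
    raise RuntimeError(f"gave up after {max_retries} retries: {task}")
```

And that's just retries. You still owe yourself error classification, checkpointing so one failed step doesn't restart the whole workflow, and a deployment story. Months of engineering, as promised.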
Why Coasty Exists and Why the 82% Number Actually Matters
Coasty was built specifically to solve the problems I just described. Not to build another impressive demo. Not to ship a research preview and call it a product. The 82% OSWorld score isn't just a marketing number. It means that when you point Coasty at a real computer task, it completes it successfully more than 4 out of 5 times, across unpredictable real-world conditions. That's the difference between a tool you can trust in production and one you have to babysit. Practically, Coasty controls real desktops, real browsers, and real terminals. It's not making API calls and pretending to 'use a computer.' It's actually operating the software your business already runs, the same way a human would, just faster and without complaining about it being boring. The agent swarm capability for parallel execution is where things get genuinely exciting for operations teams. Instead of one agent working through a queue sequentially, you can run multiple computer use agents simultaneously across different tasks or different accounts. The free tier means you can actually test it against your real workflows before committing, and BYOK support means you're not locked into pricing you didn't agree to. If you've been burned by RPA implementations that cost $200,000 and broke in six months, or you've been waiting for Anthropic's computer use to get reliable enough to trust in production, Coasty is what you've been waiting for.
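From the caller's side, the swarm pattern looks something like this. To be clear, `run_agent` below is a hypothetical stand-in, not Coasty's documented SDK; the shape is the point: N agents, N tasks, one concurrent gather instead of a sequential queue.

```python
import asyncio

async def run_agent(task: str) -> str:
    """Hypothetical stand-in for one agent driving one desktop session."""
    await asyncio.sleep(1)   # placeholder for real screen-level work
    return f"done: {task}"

async def main() -> None:
    tasks = [
        "reconcile yesterday's invoices in the ERP",
        "export the weekly CRM pipeline report",
        "triage the support inbox into queues",
    ]
    # One agent per task, all running at once instead of one after another.
    results = await asyncio.gather(*(run_agent(t) for t in tasks))
    for line in results:
        print(line)

asyncio.run(main())
```

Three tasks finish in the time of the slowest one, not the sum of all three. That's the operational difference a swarm makes.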
Here's my honest take. Anthropic is building something important and Claude's computer use capabilities will keep improving. OpenAI Operator will get less brittle over time. The direction is right for all of them. But 'will get better' doesn't help you today, when your team is spending 12.6 hours a week on tasks that should be automated, when your RPA bots are breaking every quarter, and when competitors who actually deploy working AI agents are moving faster than you. The computer use agent war is not over, but the current scoreboard is clear. 82% is not the same as 61.4%. Production-ready is not the same as research preview. If you want to actually automate computer work today, not in the next model release, not after a six-month implementation, go try Coasty at coasty.ai. The free tier is right there. Your 12.6 wasted hours per week are not coming back on their own.