OpenAI Operator Review 2026: A 38% Score Is Not a Computer Use Agent, It's a Demo
OpenAI launched Operator in January 2025 with the kind of hype usually reserved for moon landings. Tech Twitter lost its mind. VCs wrote breathless threads. Sam Altman smiled. And the actual benchmark score? 38.1% on OSWorld. That's not a computer use agent. That's worse odds than a coin flip. Eighteen months later, with the product now folded into ChatGPT as 'ChatGPT agent,' the fundamental problem hasn't changed: OpenAI built a product that fails on roughly 62% of real-world computer tasks, charged $200 a month for access to it, and somehow got credit for inventing the category. I've been watching this space closely, and I think it's time someone said the quiet part out loud.
What 38.1% Actually Means in the Real World
OSWorld is the gold standard benchmark for computer use AI agents. It tests 369 real desktop tasks: file management, web browsing, multi-app workflows, spreadsheet work, the stuff your actual employees do every single day. OpenAI's Computer-Using Agent (CUA) scored 38.1% when it launched. OpenAI's own press release called this 'state-of-the-art.' And technically, at that moment in January 2025, it was. But here's the thing nobody put in the headline: state-of-the-art at 38% means the best tool available still fails on nearly two out of three tasks. You wouldn't hire a contractor who told you upfront they'd botch 62% of your jobs. You wouldn't fly an airline with a 38% on-time rate. But somehow we're supposed to celebrate this as a breakthrough in AI computer use. The benchmark isn't cruel or unfair. It's tasks like 'open this spreadsheet and format the columns' or 'find this file and attach it to an email.' Normal work. Work that a human intern could do on day one. And the best-funded AI company on the planet is getting a D-minus on it.
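To make that concrete, here's a back-of-the-envelope sketch using only the numbers above: 369 OSWorld tasks and a 38.1% success rate. The per-task counts are rounded for illustration, not OpenAI's published task-by-task results.

```python
# Rough arithmetic on what a 38.1% OSWorld score means in task counts.
# Assumes the published 369-task suite and the 38.1% success rate; counts are rounded.
total_tasks = 369
success_rate = 0.381

passed = round(total_tasks * success_rate)   # ~141 tasks completed
failed = total_tasks - passed                # ~228 tasks botched

print(f"Passed: {passed} of {total_tasks}")            # Passed: 141 of 369
print(f"Failed: {failed} of {total_tasks}")            # Failed: 228 of 369
print(f"Failure rate: {failed / total_tasks:.1%}")     # ~61.8%
```

Roughly 141 tasks done, roughly 228 left for a human. That's the "state-of-the-art" being celebrated.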
The $200/Month Problem Nobody Wants to Talk About
- OpenAI Operator launched exclusively for ChatGPT Pro subscribers at $200 per month, making it one of the most expensive consumer AI subscriptions ever sold
- Pro users get 400 ChatGPT agent messages per month. Business and Enterprise users get just 40. Forty. For a tool meant to automate your work.
- Manual data entry alone costs U.S. companies $28,500 per employee per year according to a 2025 Parseur report. A tool that fails 62% of the time doesn't fix that.
- The average knowledge worker wastes 8.2 hours per week finding, recreating, and duplicating information. A computer use agent that can't reliably complete tasks adds frustration, not relief.
- UK workers waste 15 hours per week on repetitive admin tasks per Ricoh Europe research. If your AI agent fails on 62% of those tasks, you've just created a more expensive version of the problem.
- OpenAI quietly rebranded Operator as 'ChatGPT agent' in July 2025, folding it into the main product. Rebranding something doesn't fix its benchmark score.
Manual data entry costs U.S. companies $28,500 per employee per year. A computer use agent that fails on 62% of real tasks doesn't solve that problem. It just gives you something expensive to blame.
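As a rough illustration, and only that: take the $28,500-per-employee figure above and assume, generously, that an agent with a 38.1% success rate fully automates exactly the tasks it completes and nothing else. The linear cost split below is a simplifying assumption I'm making for the sketch, not a measured result.

```python
# Hypothetical cost sketch: how much of the manual data-entry burden survives
# if an agent only reliably completes 38.1% of tasks.
# Assumes cost scales linearly with task volume (a simplifying assumption).
annual_cost_per_employee = 28_500  # USD, the Parseur figure cited above
agent_success_rate = 0.381

automated_share = annual_cost_per_employee * agent_success_rate      # ~$10,859
still_manual = annual_cost_per_employee * (1 - agent_success_rate)   # ~$17,641

print(f"Cost the agent might offset: ${automated_share:,.0f}")
print(f"Cost still done by hand:     ${still_manual:,.0f}")
```

And that's before you count the time spent reviewing runs that fail partway through and redoing them by hand.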
Claude Tried Too. The Results Were Better, But Still Not Good Enough.
To be fair to OpenAI, Anthropic's original Computer Use feature launched a few months earlier, in late 2024, and scored even worse at the time. Claude Sonnet 4.5 eventually climbed to 61.4% on OSWorld by September 2025, which is genuinely better. Anthropic has been more aggressive about iterating on computer use specifically, and it shows. But 61.4% is still a failing grade if you're trying to run a business on it. You can't automate your accounts payable process with a tool that succeeds 61% of the time. One wrong number in the wrong cell and you have an accounting problem, not a productivity win. The honest truth about both OpenAI's and Anthropic's computer use offerings in 2026 is that they were built by companies whose primary product is a chatbot. Computer use is a feature to them. A checkbox. A press release. It's not their obsession. And you can see that in the scores.
The Real Benchmark Gap Is Embarrassing for Everyone Except One Player
Here's the number that should end the debate. Coasty scores 82% on OSWorld. That's not a small improvement over OpenAI's 38.1%. That's more than double. That's the difference between a tool you can actually deploy in production and a tool you demo for investors. Claude at 61.4% is a meaningful step forward, and I respect Anthropic for pushing hard on computer use. But 82% is a different category of capability. At 82%, a computer use agent is handling four out of five real desktop tasks correctly. That's the point where you can actually start replacing manual workflows, not just automating the easy stuff and leaving your team to clean up the rest. The gap exists because Coasty is built from the ground up as a computer use agent, not bolted onto a chatbot as an afterthought. It controls real desktops, real browsers, and real terminals. Not just browser tabs. Not just API calls dressed up to look like automation. Actual computer use, the kind that works across the full stack of what your employees actually do.
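For a side-by-side view, here's the same arithmetic applied to all three scores cited in this review. The failure rates are just 100 minus each published score; nothing below goes beyond that.

```python
# Failure rates implied by the OSWorld scores cited above.
scores = {
    "OpenAI Operator (CUA)": 38.1,
    "Claude Sonnet 4.5": 61.4,
    "Coasty": 82.0,
}

for name, score in scores.items():
    fail = 100 - score
    # Roughly how many tasks out of every five a human has to step in and finish.
    per_five = fail / 100 * 5
    print(f"{name}: {score}% pass, {fail:.1f}% fail (~{per_five:.1f} of every 5 tasks)")
```

Read it as how often a person has to step in: roughly three of every five tasks with Operator, about two of five with Claude, and about one of five at 82%. That last ratio is the one you can actually staff around.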
Why Coasty Is the Answer OpenAI Promised But Couldn't Deliver
I'm not going to pretend I don't have a preference here. I do. When you look at what a computer use agent actually needs to do to be useful, the requirements are pretty clear. It needs to handle complex multi-step workflows across different apps. It needs to work on real desktops, not just web browsers. It needs to be reliable enough that you can run it without babysitting every task. And it needs to be economically sensible. Coasty hits all of those. The 82% OSWorld score is the headline, but the architecture matters too. Desktop app support, cloud VMs, and agent swarms for parallel execution mean you're not limited to one task at a time on one machine. That's actual scale. Compare that to OpenAI's 40-message monthly limit for Business users. Forty messages. If you're trying to automate any meaningful volume of work, that limit is a joke. Coasty has a free tier so you can actually test it before committing, and BYOK support if you want to bring your own API keys. The pricing model respects you as a user. The benchmark score respects you as a professional. That combination is rarer than it should be in this space.
Here's my honest take after watching this space for over a year. OpenAI Operator was a genuinely important product announcement. It proved that computer use AI agents were real, that the category mattered, and that the biggest labs were taking it seriously. That's worth acknowledging. But in 2026, a 38.1% benchmark score is not a product you build workflows around. It's a proof of concept that got a product launch. The companies still waiting for OpenAI to 'fix' Operator, or still running manual processes because 'the AI isn't ready yet,' are leaving real money on the table. The AI is ready. Just not the one you've been reading about in TechCrunch. If you're serious about computer use automation, go look at what 82% actually feels like in practice. Start at coasty.ai. There's a free tier. Run it on something real. Then come back and tell me you're still impressed by 38%.