The OSWorld Benchmark Results Are In, and Most Computer Use Agents Should Be Embarrassed

Lisa Chen|April 4, 2026|7 min

The human baseline on OSWorld, the gold-standard benchmark for real-world computer use tasks, is 72.36%. That's the number every AI lab is supposed to beat before it starts writing press releases about autonomous agents changing the future of work. OpenAI launched its Computer-Using Agent in January 2025 with enormous fanfare. It scored 38.1%. That's not a typo. That's roughly half of human performance. Anthropic then spent most of 2025 iterating on its computer use product, and by September it hit 61.4% with Claude Sonnet 4.5 and called it, quote, 'a significant leap forward.' It is progress. It is also still below what a human does on the same tasks. We're out here reading press releases about the future of AI automation from companies whose products can't yet match a junior employee clicking through a spreadsheet. Let's talk about what the benchmark actually shows, who's winning, and why most of the industry narrative around computer use agents is quietly embarrassing.

What OSWorld Actually Tests (It's Not Easy, But It's Not Rocket Science Either)

OSWorld was published at NeurIPS 2024 by researchers at the University of Hong Kong and collaborators, and it is built around 369 real computer tasks. Not toy problems. Not 'what's the capital of France.' We're talking file management across operating systems, multi-app workflows, web browsing with actual state changes, OS-level I/O operations, and tasks that require an agent to look at a real desktop, understand what it's seeing, and take action. The agent controls a real mouse and keyboard. It reads real screens. The benchmark is deliberately designed to test the thing AI labs kept promising their agents could do. So when a company says 'our agent can automate your computer tasks,' OSWorld is the receipts check. And for most of 2024 and 2025, those receipts were rough. Early agents were scoring in the single digits. The first serious models cracked 20%. The human baseline, set by real people doing the same tasks, landed at 72.36%. That's the bar. That's what you need to clear before you can honestly say your computer use agent is ready for real work.

The Scoreboard Nobody Is Talking About Loudly Enough

● OpenAI's Computer-Using Agent (CUA) launched January 2025: 38.1% on OSWorld. That's the product they shipped to paying users.
● Early Anthropic Computer Use with Claude 3.5 Sonnet: approximately 22%. They shipped this to developers and called it a 'beta.'
● Claude Sonnet 4.5, September 2025: 61.4%. Anthropic's press release used the phrase 'significant leap forward.' It's still 11 points below a human.
● Human baseline on OSWorld: 72.36%. This is the actual bar. Most shipped AI products were not clearing it for most of 2025.
● Coasty on OSWorld: 82%. That's not just above the human baseline. That's 10 points above it, and the highest verified score on the benchmark.
● Gartner predicted in June 2025 that over 40% of agentic AI projects would be canceled by end of 2027. With scores like these, that's not a surprise.

Anthropic shipped a computer use product scoring 22% and called it a beta. OpenAI shipped one at 38.1% and called it the future of work. The human baseline is 72.36%. Coasty sits at 82%. These are not close numbers.
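If you want to check the gaps yourself, the arithmetic is short. Here is a minimal Python sketch that uses only the scores quoted in this article. The dictionary and the two helper functions are illustrative names made up for this example, not part of any OSWorld tooling.

```python
# Scores quoted in this article, as percent of OSWorld tasks completed.
HUMAN_BASELINE = 72.36

scores = {
    "Claude 3.5 Sonnet (early computer use)": 22.0,  # quoted as "approximately 22%"
    "OpenAI CUA (January 2025)": 38.1,
    "Claude Sonnet 4.5 (September 2025)": 61.4,
    "Coasty": 82.0,
}


def gap_to_human(score: float, baseline: float = HUMAN_BASELINE) -> float:
    """Score minus the human baseline, in percentage points."""
    return round(score - baseline, 2)


def share_of_human(score: float, baseline: float = HUMAN_BASELINE) -> float:
    """Score as a fraction of the human baseline (1.0 means parity)."""
    return round(score / baseline, 2)


if __name__ == "__main__":
    for name, score in scores.items():
        print(
            f"{name}: {score:.1f}% | "
            f"gap {gap_to_human(score):+.2f} pts | "
            f"{share_of_human(score):.0%} of human"
        )
```

Unrounded, the gaps come out to 10.96 points below the baseline for Claude Sonnet 4.5 and 9.64 points above it for Coasty, which is where the rounded "11 points" and "10 points" in the scoreboard come from. OpenAI's 38.1% works out to about 53% of the human score, hence "roughly half."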

Why Scoring Below Human Baseline Is a Bigger Problem Than Anyone Admits

Here's the thing about deploying a computer use agent that scores below the human baseline. You're not automating work. You're creating a new category of work, which is cleaning up after the agent. If your AI completes 38 out of 100 tasks correctly, you still need a human watching every single output. You've added a layer of complexity without removing the human. That's not automation. That's a very expensive assistant with a 62% error rate. The Gartner stat about 40% of agentic AI projects being canceled by 2027 makes complete sense in this context. Companies tried these tools, did the math on supervision overhead, and pulled the plug. And honestly, that's the rational response when the tools are this unreliable. The benchmark scores aren't just academic vanity metrics. They're a direct proxy for whether you can actually trust the agent to run unsupervised. Below 72.36%, the answer is basically no. At 82%, the math starts flipping. You're above human accuracy, which means you can genuinely remove the human from the loop on a real category of tasks and expect better outcomes. That's the actual threshold that matters, and almost nobody in the industry was honest about where that line was.
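To put rough numbers on that supervision math, here is a back-of-the-envelope Python sketch. It rests on a big simplifying assumption of my own: that a benchmark score transfers directly to a per-task success rate in your workflow, and that every failed task needs a human to clean it up. Real deployments will differ, and the function name is purely illustrative.

```python
def expected_failures(score_percent: float, tasks: int = 100) -> float:
    """Expected failed tasks, treating the score as a per-task success rate.

    Assumption: the benchmark score carries over unchanged to your workload.
    """
    return round(tasks * (1 - score_percent / 100), 2)


if __name__ == "__main__":
    # Scores are the ones quoted in this article.
    for label, score in [
        ("OpenAI CUA", 38.1),
        ("Human baseline", 72.36),
        ("Coasty", 82.0),
    ]:
        print(f"{label}: about {expected_failures(score):.0f} failures per 100 tasks")
```

Under that assumption, 38.1% means about 62 failures per 100 tasks, the human baseline means about 28, and 82% means 18. The comparison against the human row is the one the argument above depends on.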

The Hype Cycle Did Real Damage Here

Let's be direct about what happened. AI labs launched computer use products in 2024 and early 2025 that were genuinely not ready for production use. They knew the benchmark scores. They shipped anyway, because the competitive pressure was enormous and the narrative was more valuable than the product. Developers integrated these tools into real workflows. Companies built internal automations on top of them. Then the error rates showed up in production and the projects stalled. This is why Gartner's 40% cancellation prediction exists. It's not that AI agents are fundamentally flawed. It's that the industry shipped half-baked computer use agents, called them production-ready, and burned enterprise trust in the category. The benchmark was always there. OSWorld was public. The scores were public. The human baseline was public. Anyone doing serious due diligence before deploying a computer use agent in 2025 could have seen that most of the marketed products were operating at roughly half of human capability. The question is whether the companies selling these tools were being straight with their customers about that. Most were not.

Why Coasty Exists and Why 82% Changes the Conversation

I'm obviously not a neutral party here. But let me make the case with numbers instead of marketing copy. Coasty built a computer use agent that scores 82% on OSWorld. That's the highest verified score on the benchmark. It's 10 points above the human baseline, which means on the 369 real-world desktop tasks that OSWorld tests, Coasty's agent is more accurate than a human doing the same work. The architecture matters here. Coasty controls real desktops, real browsers, and real terminals. It's not making API calls and pretending that's computer use. It sees what's on screen, it moves a real cursor, it types in real fields. The desktop app runs locally, there are cloud VMs for scale, and agent swarms let you run tasks in parallel so you're not waiting on sequential execution. That last part is actually the business case made concrete. A human can do one task at a time. A swarm of computer use agents running in parallel can do dozens simultaneously, all at above-human accuracy. There's a free tier if you want to see what 82% on OSWorld actually looks like in practice, and BYOK support if you want to bring your own model keys. The benchmark score is the starting point. The architecture is why it scales.

OSWorld is the most honest thing in the AI agent industry right now. It doesn't care about your press release. It doesn't care about your funding round or your keynote demo. It gives you 369 real tasks and tells you what percentage your agent actually completes. For most of 2024 and 2025, the industry's answer was somewhere between embarrassing and mediocre. OpenAI at 38.1%. Anthropic at 22%, then 61.4% after a year of work. Both still below the human baseline when they were actively marketing these products to enterprises. The score that matters is 82%. That's Coasty. That's above human. That's the number where you can actually remove humans from repetitive computer tasks and trust the output. If you're evaluating computer use agents in 2026 and you're not asking every vendor for their OSWorld score, you're making the same mistake a lot of companies made in 2025. Ask for the number. If they don't have one, or if it's below 72.36%, you already know what you're getting. If you want to see what above-human computer use actually looks like, start at coasty.ai. The benchmark is public. The score is real. The gap is not close.