Industry

Your AI Agent Just Crashed Mid-Task and Has No Idea What to Do Next. This Is a $644 Billion Problem.

James Liu | 7 min

A company handed an AI agent a multi-step workflow, walked away, and came back to a $47,000 mistake. The agent didn't crash loudly. It didn't ask for help. It just kept going, confidently, incorrectly, in a loop it couldn't recognize as a loop. This isn't a fringe horror story. It's the defining failure mode of the entire agentic AI era, and almost nobody building these tools is taking it seriously enough. MIT just reported that 95% of enterprise AI pilots are delivering zero measurable return. Analysts are calling 2025's enterprise AI spend a $644 billion act of economic vandalism. You want to know what's sitting at the center of that wreckage? It's not bad prompts. It's not hallucinations. It's agents that can't handle things going wrong.

The $47,000 Loop Nobody Talks About

The Tech Startups breakdown of that $47,000 AI agent failure is worth reading slowly. A multi-agent system was given a complex task. Somewhere in the middle, something broke. A dependency failed. An API returned something unexpected. A UI element moved. And instead of stopping, flagging the issue, and waiting for a human, the agent kept executing. It compounded the error across every downstream step. By the time anyone noticed, the damage was done and it was irreversible. This is the dirty secret of multi-agent systems right now. Researchers studying LLM-based agents have formally categorized what happens when things go sideways: agents get stuck in loops, repeating the same failed command over and over. They disobey task constraints without realizing it. They lose track of their original goal when an unexpected state interrupts their plan. And in computer use scenarios, where the agent is actually clicking through real interfaces, submitting real forms, and touching real data, these failures aren't academic. They're catastrophic.

What 'Stuck in a Loop' Actually Looks Like in Computer Use

  • An agent tries to click a button that's hidden behind a cookie consent banner. It clicks the same coordinates 40 times before timing out, having accomplished nothing and logged zero useful error state.
  • A computer use agent hits a two-factor authentication prompt mid-workflow. It has no recovery path, so it either freezes or starts the entire task over from scratch, triggering duplicate submissions.
  • OpenAI's Operator was caught during testing taking screenshots of screens instead of reading them, causing cascading OCR errors that corrupted every subsequent action in the chain.
  • Anthropic published a full postmortem in September 2025 detailing three separate bugs that intermittently degraded Claude's responses. Their own computer use tooling (computer_20250124) was hitting integration failures on Bedrock as recently as that same month.
  • Claude Sonnet 4.5 scores 61.4% on OSWorld. That means nearly 4 in 10 real-world computer tasks end in failure. What happens to those failures? Most agents just stop and report nothing useful.
  • Researchers at EMNLP 2025 specifically flagged 'stuck in loop' as a distinct failure category, where the model repeatedly executes the same command without detecting that it's not working. No self-correction. No escalation. Just repetition. (A minimal detection sketch follows this list.)
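
That last failure mode is also the easiest of the bunch to guard against, which makes its absence so damning. Here's a minimal Python sketch of one way to do it; every name in it (Action, LoopDetector) is a hypothetical illustration, not any vendor's API. The idea: keep a short window of recent actions and escalate when the same failing action keeps reappearing.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """One agent step: what was attempted and whether it worked."""
    name: str        # e.g. "click"
    target: str      # e.g. "(412, 880)"
    succeeded: bool


class LoopDetector:
    """Flags an agent that keeps replaying the same failing action."""

    def __init__(self, window: int = 5, max_repeats: int = 3):
        self.history: deque = deque(maxlen=window)
        self.max_repeats = max_repeats

    def record(self, action: Action) -> bool:
        """Return True when the agent looks stuck and should escalate."""
        self.history.append(action)
        repeats = sum(1 for a in self.history if a == action and not a.succeeded)
        return repeats >= self.max_repeats


# The cookie-banner scenario above: the same failing click trips the
# detector on the third attempt instead of the fortieth.
detector = LoopDetector()
for _ in range(40):
    if detector.record(Action("click", "(412, 880)", succeeded=False)):
        print("Loop detected: stop retrying, escalate to a human.")
        break
```

Thirty lines of bookkeeping. That's the entire gap between "clicked the same pixel 40 times" and "asked for help on attempt three."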

95% of enterprise AI pilots delivered zero measurable ROI in 2025. MIT called it. Forbes confirmed it. And the agents that failed weren't failing because the tasks were too hard. They were failing because nobody built them to handle being wrong.

The Industry Is Shipping Recovery as an Afterthought

Here's what makes this so frustrating. Error handling in software engineering is not a new concept. We've had structured exception handling since the 1960s. Retry logic, circuit breakers, dead letter queues, graceful degradation: these are solved problems in traditional systems. But when companies started wrapping LLMs around computer use tasks and calling them agents, they somehow forgot everything they knew. The result is a generation of AI tools that are brilliant on the happy path and completely helpless the moment reality doesn't cooperate.

ChatGPT Agent launched in July 2025 to some fanfare, and a detailed real-world test of four tasks found it was 'a big improvement but still not very useful.' That's the headline after months of iteration. The Washington Post called Operator 'not ready for the real world' back in February. By July it had been folded into ChatGPT Agent, and the core reliability problems were still generating the same complaints. UiPath and the legacy RPA crowd have a different version of the same disease: their bots break the moment a UI changes by a single pixel, then sit silently broken until a human notices the output queue has gone dry. At least those failures are usually contained. Computer-using AI agents can do a lot more damage before anyone realizes something went wrong.
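
To underline how unexotic this plumbing is, here's a minimal sketch of one of those solved patterns, retry with exponential backoff, in Python. The function name and the choice of retryable exceptions are illustrative assumptions, not any agent framework's actual API.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=0.5,
                       retryable=(TimeoutError, ConnectionError)):
    """Textbook retry with exponential backoff and jitter.

    Transient failures (timeouts, dropped connections) get a bounded
    number of retries; everything else propagates immediately so a
    human actually sees it.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts:
                raise  # budget spent: surface the failure, don't loop forever
            # 0.5s, 1s, 2s... plus jitter so parallel retries don't stampede
            time.sleep(base_delay * 2 ** (attempt - 1) * (1 + random.random()))
```

This pattern predates LLMs by decades. The fact that it's missing from agents touching production data is a choice, not a limitation.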

What Good Error Recovery Actually Requires

Good error handling in a computer use agent isn't just catching exceptions. It's a whole architecture.

First, the agent needs state awareness. It has to know what it successfully completed before the failure, not just what it was trying to do. Without that, every recovery attempt either starts over from scratch or picks up in an undefined state, both of which are usually worse than just stopping.

Second, it needs failure classification. A timeout is different from an authentication wall, which is different from a missing UI element, which is different from a data validation error. Each one has a different recovery strategy. An agent that treats them all the same is going to make the wrong call most of the time.

Third, it needs escalation thresholds. Some failures should trigger a retry. Some should trigger a different approach. Some should immediately stop and ask a human. The worst agents have no escalation logic at all. They're either fully autonomous with no off-ramp, or they give up on the first hiccup. Neither is acceptable for anything that touches real business data.

Fourth, and this is the one almost nobody is building yet, it needs honest failure reporting. Not just 'task failed.' What failed, at what step, in what state, with what context. That information is what lets a human or a supervisor agent actually fix the problem instead of just re-running the same broken workflow.
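
To make that concrete, here's a minimal Python sketch of those four pieces wired together. Every name in it (FailureKind, POLICY, FailureReport, handle_failure) is a hypothetical illustration of the architecture described above, not any shipping agent's internals.

```python
from dataclasses import dataclass
from enum import Enum, auto


class FailureKind(Enum):
    TIMEOUT = auto()
    AUTH_WALL = auto()
    MISSING_UI_ELEMENT = auto()
    DATA_VALIDATION = auto()
    UNKNOWN = auto()


class Recovery(Enum):
    RETRY = auto()      # transient: try the same step again
    REPLAN = auto()     # the approach is wrong: try a different path
    ESCALATE = auto()   # stop and ask a human


# Escalation thresholds: each failure kind maps to a strategy and a
# retry budget. An auth wall is never retried into oblivion.
POLICY = {
    FailureKind.TIMEOUT: (Recovery.RETRY, 3),
    FailureKind.AUTH_WALL: (Recovery.ESCALATE, 0),
    FailureKind.MISSING_UI_ELEMENT: (Recovery.REPLAN, 1),
    FailureKind.DATA_VALIDATION: (Recovery.ESCALATE, 0),
    FailureKind.UNKNOWN: (Recovery.ESCALATE, 0),
}


@dataclass
class FailureReport:
    """Honest failure reporting: not just 'task failed'."""
    step: str                   # which step failed
    kind: FailureKind           # what class of failure it was
    completed_steps: list       # state awareness: what already succeeded
    detail: str                 # raw context for the human or supervisor


def handle_failure(step, kind, completed_steps, detail, attempts_so_far):
    """Pick a recovery strategy, or build a report once the budget is spent."""
    strategy, budget = POLICY[kind]
    if strategy is not Recovery.ESCALATE and attempts_so_far < budget:
        return strategy
    return FailureReport(step, kind, completed_steps, detail)
```

The point of the policy table is that the recovery decision is data, not improvisation: a timeout gets a bounded retry budget, an auth wall escalates immediately, and anything unclassified goes straight to a human along with the list of steps that already succeeded.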

Why Coasty Was Built Around This Problem

I'm going to be straight with you. The reason I think Coasty is the right answer here isn't just the benchmark number, though 82% on OSWorld is real and nobody else is close. Claude Sonnet 4.5 is at 61.4%. Most others aren't even publishing scores because the numbers aren't flattering. The reason is that Coasty was built to control real desktops, real browsers, and real terminals, and that means it has to deal with real failure modes constantly. Cookie banners, unexpected modals, slow-loading pages, auth prompts, UI changes, all of it. The architecture reflects that. When Coasty hits something unexpected, it doesn't spiral. It classifies the failure, attempts the appropriate recovery, and escalates with full context when it can't resolve the issue on its own. The agent swarm capability for parallel execution also matters here because parallel tasks need independent error containment. One agent crashing can't be allowed to corrupt the state of five others running alongside it. That's a design constraint that forces you to get error handling right from the start, not bolt it on later. You can start on the free tier and see how it handles the messy real-world stuff that makes other computer use agents fall apart. That's the test that actually matters.
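
That containment requirement is a general engineering pattern, not a proprietary trick. Here's a rough Python asyncio sketch of the idea, with hypothetical names and no claim to reflect Coasty's actual internals: each agent runs behind its own exception boundary with private state, so one crash degrades into a structured result instead of poisoning its siblings.

```python
import asyncio


async def run_isolated(agent_id, work):
    """Run one agent behind its own exception boundary."""
    state = {"completed": []}  # private to this agent, never shared
    try:
        await work(state)
        return {"agent": agent_id, "ok": True, "completed": state["completed"]}
    except Exception as exc:
        # Report what finished before the crash, and why it died.
        return {"agent": agent_id, "ok": False,
                "completed": state["completed"], "error": repr(exc)}


async def run_swarm(work_items):
    # run_isolated never raises, so one agent failing cannot cancel
    # or corrupt the others gathered alongside it.
    return await asyncio.gather(
        *(run_isolated(name, work) for name, work in work_items.items())
    )


async def flaky(state):
    state["completed"].append("step 1")
    raise TimeoutError("page never loaded")


async def steady(state):
    state["completed"].append("step 1")
    state["completed"].append("step 2")


print(asyncio.run(run_swarm({"agent-a": flaky, "agent-b": steady})))
```

Run that and agent-a dies at step 1 while agent-b finishes cleanly, and the swarm's output tells you exactly which was which. That's the floor, not the ceiling, for parallel agents.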

The $644 billion enterprise AI failure story is going to get written one of two ways. Either companies figure out that reliability and error recovery are the actual product, not the demo, and they start demanding better from their tools. Or they keep shipping agents that work great in controlled environments, fail silently in production, and give every skeptic more ammunition to say AI isn't ready. I know which story I think is more likely if things don't change. The agents that survive the next two years won't be the ones with the flashiest demos. They'll be the ones that know what to do when something goes wrong, because something always goes wrong. If you're evaluating computer use agents right now, stop asking 'what can it do when everything works?' Start asking 'what does it do when it breaks?' The answer will tell you everything. Check out coasty.ai and ask that question. The answer should impress you.

Want to see this in action?

View Case Studies
Try Coasty Free