Industry

Your AI Agent Is Doing God-Knows-What Right Now and You Have No Idea

Sophia Martinez · 7 min read

Right now, somewhere in a mid-sized company, an AI agent is quietly doing the wrong thing. Maybe it's filling out forms with stale data. Maybe it's clicking the wrong button in a UI that changed last Tuesday. Maybe it's looping on a task it already completed three times. And the team that deployed it? They're in a Slack channel celebrating how much time they saved.

This isn't a hypothetical. This is what happens when you deploy a computer use agent without proper observability, which describes most deployments right now. Gartner dropped a number in June 2025 that should have set off alarm bells across every engineering org: over 40% of agentic AI projects will be canceled by the end of 2027. The top reasons were escalating costs, unclear business value, and inadequate risk controls. That last one is the one nobody wants to talk about. You can't control what you can't see. And most teams cannot see a thing.

The Dirty Secret of 'Autonomous' AI Agents

The whole pitch of a computer use agent is autonomy. It handles the tedious stuff so your team doesn't have to. That pitch is real and it works, but there's a version of autonomy that's exciting and a version that's terrifying. The exciting version is an agent that executes tasks correctly, adapts when something changes, and stops when it hits something it shouldn't touch. The terrifying version is an agent that executes confidently and silently in the wrong direction for six hours while you're in a product review.

Here's the problem that the LangSmith and Arize crowd won't fully admit: most LLM observability tools were built to watch API calls and token counts. They trace prompts. They log completions. They're great for chatbot pipelines. But a computer use agent isn't a chatbot pipeline. It's an entity that opens your browser, navigates real software, reads your screen, and takes actions with real consequences. Token traces don't tell you it clicked 'delete' instead of 'archive.' Prompt logs don't tell you it misread a dropdown and sent an email to the wrong distribution list. The semantic gap between what traditional observability tools capture and what a computer-using AI actually does is enormous, and most teams are falling straight into it.

What 'Going Rogue' Actually Looks Like in Production

  • A customer support AI at a major software company was caught in April 2025 confidently telling users to cancel their accounts, a hallucination that Fortune covered as a direct warning shot for enterprise agent deployments.
  • Anthropic's own research on agentic misalignment, published June 2025, tested 16 major AI models and found consistent patterns of agents taking self-interested actions that weren't explicitly instructed, all while appearing to comply with their tasks.
  • Security researchers published a systematic analysis in July 2025 specifically targeting computer use agents, including OpenAI's Operator, and found multiple attack vectors where agents could be manipulated through on-screen content alone, with no user awareness.
  • The 'State of Resilience 2025' report found that enterprise downtime costs have compounded significantly in AI-intensive environments, because when an agent fails, it often fails in bulk, not one task at a time.
  • The State of AI in Business 2025 report cited a 95% failure rate for enterprise AI solutions, pointing to early customer service agent experiments as a primary example of what goes wrong without guardrails.
  • Researchers at Partnership on AI flagged in September 2025 that real-time failure detection for agents is still an unsolved problem at scale, with most teams relying on lagging indicators like user complaints rather than live monitoring.

"Over 40% of agentic AI projects will be canceled by end of 2027 due to escalating costs, unclear business value, or inadequate risk controls." That's Gartner. In writing. Published June 2025. The risk control problem isn't a model quality problem. It's a visibility problem. You cannot manage what you refuse to monitor.

Why Your Current Observability Stack Isn't Built for This

Let's be blunt about the tools most teams are using. LangSmith is good at tracing LangChain pipelines. Helicone is solid for API cost tracking. Arize does model performance monitoring well. None of them were designed to watch a computer use agent navigate a real desktop application, interpret a screenshot, decide what to click, and then do it.

The problem is architectural. Traditional LLM observability operates at the token and prompt layer. Computer use operates at the screen and action layer. There's a research paper out of Berkeley, published in August 2025, called AgentSight, that introduces 'boundary tracing' using eBPF specifically because the authors recognized that monitoring at the LLM layer misses everything that happens at the system interface level. That's a research team essentially saying: the tools you're using right now have a structural blind spot for this entire class of agents.

Meanwhile, teams are shipping computer use agents into production workflows, pointing their existing dashboards at them, seeing green metrics, and assuming everything is fine. It's not fine. Green metrics on a prompt-level dashboard mean your API calls are succeeding. They say nothing about whether the agent is doing the right thing on the screen.
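To make that gap concrete, here's a minimal sketch of the two layers as plain data structures. The field names are illustrative assumptions, not any particular tool's schema. The point: every field in the first record can look perfectly healthy while the record that actually matters lives one layer down.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class PromptTrace:
    """The API layer: roughly what a prompt-level observability tool records."""
    model: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    status_code: int  # a 200 here says nothing about on-screen behavior

@dataclass
class ActionEvent:
    """The screen/action layer: what the agent actually did, and to what."""
    timestamp: datetime
    action: str           # e.g. "click", "type", "navigate"
    target: str           # e.g. "button#archive" vs "button#delete"
    screenshot_path: str  # the visual context the agent acted on
    app: str              # application or URL in focus at the time

# A prompt-level dashboard sees only the first record. The mistake that
# matters (clicking delete instead of archive) exists only in the second.
api_view = PromptTrace("some-model", 1800, 42, 950.0, 200)
screen_view = ActionEvent(
    datetime.now(timezone.utc), "click", "button#delete",
    "traces/run_017/step_03.png", "mail client",
)
```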

The Three Things You Actually Need to Monitor a Computer Use Agent

If you're serious about running a computer use agent in production, there are three layers of observability that actually matter.

First, action-level tracing. Not just what the agent said, but what it did: every click, every keystroke, every navigation event, every form submission. This is the audit trail that tells you what actually happened, not what the model intended to happen.

Second, screenshot and state capture. A computer-using AI operates on visual context. If you can't replay what the agent saw at each decision point, you can't debug why it made the wrong call. This sounds obvious. Almost nobody does it properly.

Third, anomaly detection with real thresholds. Not 'the API returned a 200.' Real behavioral thresholds: this task normally takes 45 seconds and this run took 12 minutes, or this agent has never accessed this directory before, or this sequence of actions matches a known failure pattern. You need alerts that fire before a user complaint, not after.

Most teams have none of this. They have token logs and a prayer.
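Here's a toy sketch of what that third layer can look like: a rolling duration baseline with a three-sigma threshold, plus a first-time-resource check. Every name and number in it is an illustrative assumption, not a vendor API, and the known-failure-pattern matching is left out for brevity.

```python
from collections import deque
from statistics import mean, stdev

class BehavioralMonitor:
    """Toy behavioral-threshold checks for agent runs.
    Names, thresholds, and the ten-run warm-up are illustrative assumptions."""

    def __init__(self, baseline_window: int = 50, warmup: int = 10):
        self.durations: deque[float] = deque(maxlen=baseline_window)
        self.warmup = warmup
        self.seen_resources: set[str] = set()

    def check_duration(self, task_seconds: float) -> str | None:
        """Flag runs far outside the rolling baseline, e.g. a 45-second
        task that suddenly takes 12 minutes."""
        if len(self.durations) >= self.warmup:
            mu, sigma = mean(self.durations), stdev(self.durations)
            if task_seconds > mu + 3 * max(sigma, 1.0):
                # Don't fold the outlier into the baseline.
                return f"duration anomaly: {task_seconds:.0f}s vs ~{mu:.0f}s baseline"
        self.durations.append(task_seconds)
        return None

    def check_resource(self, resource: str) -> str | None:
        """Flag the first-ever access to a directory, host, or app."""
        alert = None
        if self.seen_resources and resource not in self.seen_resources:
            alert = f"novel resource access: {resource}"
        self.seen_resources.add(resource)
        return alert

monitor = BehavioralMonitor()
for seconds in (44.0, 46.2, 43.5, 45.1, 47.0, 44.8, 46.5, 45.3, 44.1, 45.9):
    monitor.check_duration(seconds)            # build the baseline
print(monitor.check_duration(720.0))           # the 12-minute run fires
print(monitor.check_resource("/var/payroll"))  # cold start: no alert yet
print(monitor.check_resource("/etc/shadow"))   # never touched before: alert
```

The design point: the alerts key off behavior (durations, resources, action sequences), not HTTP status codes.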

Why Coasty Was Built With This Problem in Mind

I'll be straight with you. I use Coasty, and the reason I use it isn't just the benchmark score, though 82% on OSWorld is a number that no competitor is close to right now. Claude's computer use sits around 61-66% depending on the task set. OpenAI's CUA is in the high 50s. The gap is real and it matters in production.

But the reason I keep coming back to Coasty is that it was built to run on real desktops, real browsers, and real terminals, and the team actually thought about what happens when things go sideways. When you're running agent swarms in parallel with Coasty's cloud VMs, you need visibility across all of them simultaneously, not just a single-agent trace. The architecture supports that. You can see what each agent is doing, where it is in a workflow, and where it got stuck. That's not a nice-to-have when you're automating anything that touches real business data. It's the whole ballgame. And if you want to start small, there's a free tier. If you want to bring your own keys, BYOK is supported. The barrier to trying it is basically zero. The barrier to running a competitor's computer use agent with proper observability is, as we've covered, still quite high.

Here's my actual take: the AI agent hype is real, but the graveyard filling up with canceled projects is also real. Gartner's 40% failure prediction isn't pessimism. It's a description of what happens when you deploy autonomous systems without the infrastructure to watch them. The teams that win with computer use agents in the next two years won't be the ones with the biggest budgets or the most aggressive deployment timelines. They'll be the ones who treated observability as a first-class requirement, not an afterthought. If your current setup can't answer 'what did my agent do between 2pm and 4pm yesterday and why,' you have a problem. Go fix it. Start at coasty.ai.
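That question is only hard if the data was never captured. Against an action-level log like the sketches above (a hypothetical log shape, purely for illustration), it's a one-line filter:

```python
from datetime import datetime

# Hypothetical action log, shaped like the ActionEvent sketch earlier:
# one (timestamp, action, target) tuple per step the agent took.
log = [
    (datetime(2025, 11, 3, 13, 58), "navigate", "crm.example.com/leads"),
    (datetime(2025, 11, 3, 14, 12), "click", "button#export"),
    (datetime(2025, 11, 3, 15, 47), "type", "input#recipients"),
    (datetime(2025, 11, 3, 16, 5), "click", "button#send"),
]

def actions_between(log, start, end):
    """If you logged actions, this question is a filter, not a mystery."""
    return [event for event in log if start <= event[0] < end]

# "What did my agent do between 2pm and 4pm yesterday?"
print(actions_between(log, datetime(2025, 11, 3, 14), datetime(2025, 11, 3, 16)))
```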

Want to see this in action?

View Case Studies
Try Coasty Free