
Your AI Agent Is Doing God-Knows-What Right Now and You Have Zero Visibility

Emily Watson · 7 min read

Somewhere in your stack right now, an AI agent is making decisions you never approved, clicking things you can't see, and spending money you didn't budget for. And you won't find out until something breaks. That's not a hypothetical. That's the default state of AI agent deployment in 2025. Companies are sprinting to ship computer use agents into production, and almost none of them have figured out how to actually watch what those agents do. The result is a slow-motion disaster that the industry is only just starting to admit out loud.

The Numbers Are Genuinely Alarming

Let's start with the stat that should make every engineering lead sweat: 96% of organizations deploying generative AI report costs higher than expected. That's from DataRobot's enterprise scaling research, and it tracks with what practitioners are saying in the field. A single unmonitored agent running a bad loop overnight can consume thousands of dollars in tokens before anyone notices. Multiply that by the 10, 50, or 100 agents your company is quietly spinning up across teams, and you have a budget catastrophe that no one put in the roadmap. The Coralogix team called it 'The AI Monitoring Crisis No One's Talking About' back in mid-2025. They were right, and the industry still hasn't caught up. We're in a moment where the tooling for deploying AI agents is miles ahead of the tooling for understanding what those agents are actually doing. That gap is where money disappears and trust in AI gets torched.
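To see how quickly an unwatched loop adds up, here's a back-of-the-envelope sketch. The request rate, context size, and per-token price below are illustrative assumptions for the calculation, not any vendor's actual pricing:

```python
# Back-of-the-envelope: what a stuck retry loop costs overnight.
# All numbers below are illustrative assumptions, not vendor pricing.

REQUESTS_PER_MINUTE = 30       # agent stuck in a tight retry loop
TOKENS_PER_REQUEST = 8_000     # long context resent on every attempt
PRICE_PER_1K_TOKENS = 0.01     # USD, blended input/output (assumed)
HOURS_UNWATCHED = 10           # evening deploy to morning standup

def overnight_burn(rpm: float, tokens_per_req: float,
                   price_per_1k: float, hours: float) -> float:
    """Total token spend in USD for an unmonitored loop."""
    requests = rpm * 60 * hours
    total_tokens = requests * tokens_per_req
    return total_tokens / 1_000 * price_per_1k

cost = overnight_burn(REQUESTS_PER_MINUTE, TOKENS_PER_REQUEST,
                      PRICE_PER_1K_TOKENS, HOURS_UNWATCHED)
print(f"${cost:,.0f}")  # $1,440 under these assumptions
```

Even with these conservative numbers, one stuck agent burns over a thousand dollars before standup. Scale the rate or context up, and "thousands of dollars in a single night" stops looking like hyperbole.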

Computer Use Agents Make This So Much Harder

Monitoring a chatbot is annoying but manageable. You log inputs and outputs, track latency, watch for hallucinations. Fine. Now try monitoring a computer use agent. It's not making API calls you can intercept cleanly. It's clicking buttons, filling forms, navigating browser tabs, executing terminal commands, and taking actions in desktop software. The surface area for 'what just happened' is enormous. When a computer-using AI agent goes sideways, it doesn't just return a bad JSON response. It might submit a form with wrong data, delete a file, send an email to the wrong person, or get stuck in a loop clicking the same button 400 times. By the time you notice, the damage is done. Palo Alto Networks put it well: when agentic systems go rogue, it's rarely one dramatic failure. It's a slow buildup of quiet missteps. That's the thing about computer use agents specifically. The failure modes are physical. They touch real systems, real data, real workflows. A black box that controls a desktop is a fundamentally different risk than a black box that writes a summary.
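One cheap mitigation for the stuck-loop failure mode is a repeat-action circuit breaker sitting in front of the agent's action executor. This is a minimal sketch of the idea, not Coasty's or any vendor's actual implementation, and the threshold is an assumption you'd tune per workflow:

```python
class LoopGuard:
    """Trips a breaker when the same action repeats too many times in a row.

    The max_repeats threshold is illustrative; a real deployment would
    tune it per workflow and alert a human instead of silently blocking.
    """

    def __init__(self, max_repeats: int = 5):
        self.max_repeats = max_repeats
        self.last_action: str | None = None
        self.repeat_count = 0

    def allow(self, action: str) -> bool:
        """Return False once an identical action repeats past the limit."""
        if action == self.last_action:
            self.repeat_count += 1
        else:
            self.last_action = action
            self.repeat_count = 1
        return self.repeat_count <= self.max_repeats

guard = LoopGuard(max_repeats=3)
results = [guard.allow("click button#retry") for _ in range(5)]
print(results)  # [True, True, True, False, False]
```

A guard like this won't catch every failure mode, but it turns "clicked the same button 400 times overnight" into "blocked on the fourth repeat and raised a flag."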

What 'Observability' Actually Means for Agents (And What People Get Wrong)

  • Logging is not observability. Knowing that an agent ran a task is useless if you can't see the sequence of decisions it made to complete it.
  • Latency alerts are reactive by design. By the time your GPU utilization alarm fires, the agent has already done whatever it was going to do.
  • Token cost tracking needs to be real-time and per-agent, not aggregated at the end of the month when your cloud bill lands.
  • Multi-agent systems compound the problem. A single failure in one agent cascades through entire workflows, and without trace-level visibility you can't find the root cause.
  • Screen-level replay for computer use agents is the standard that barely anyone has actually built. If you can't replay what the agent saw and clicked, step by step, you're guessing.
  • Evaluation in production is different from evaluation in testing. Agents behave differently on live data, live systems, and live timing constraints.
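To make that list concrete, here's a minimal sketch of step-level tracing that covers three of the bullets at once: logging the sequence of decisions, metering token cost per agent in real time, and keeping a replayable record of what the agent saw and acted on. Every name and field here is illustrative, not any product's actual schema:

```python
import time
from dataclasses import dataclass, asdict, field

@dataclass
class AgentStep:
    """One observable step: what the agent saw, decided, and did."""
    agent_id: str
    action: str          # e.g. "click", "type", "exec"
    target: str          # element or command acted on
    screenshot_ref: str  # pointer to the frame the agent saw
    tokens_used: int
    ts: float = field(default_factory=time.time)

class StepTrace:
    """Append-only trace: enough to replay a run and meter cost per agent."""

    def __init__(self) -> None:
        self.steps: list[AgentStep] = []

    def record(self, step: AgentStep) -> None:
        self.steps.append(step)

    def tokens_for(self, agent_id: str) -> int:
        """Running token total for one agent, not a month-end aggregate."""
        return sum(s.tokens_used for s in self.steps if s.agent_id == agent_id)

    def replay(self, agent_id: str) -> list[dict]:
        """Ordered step dicts: the 'what did it see and click' record."""
        return [asdict(s) for s in self.steps if s.agent_id == agent_id]

trace = StepTrace()
trace.record(AgentStep("agent-7", "click", "button#submit", "frame-0041.png", 1200))
trace.record(AgentStep("agent-7", "type", "input#email", "frame-0042.png", 900))
print(trace.tokens_for("agent-7"))  # 2100
```

The point of the sketch is the shape of the data, not the storage: if every step carries the agent's identity, its action, its cost, and a reference to what it saw, then per-agent cost alerts and step-by-step replay fall out of the same record.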

'In a single night, one unmonitored agent can consume thousands of dollars in tokens.' That's not a theoretical edge case. That's the production wall that's quietly killing enterprise AI adoption right now.

The Industry's Dirty Secret: Most Teams Are Flying Completely Blind

The Stack Overflow 2025 developer survey found that the most used tools for AI agent observability are just... staples of the DevOps world. Datadog. Splunk. New Relic. Tools that were built to monitor microservices and infrastructure, now being duct-taped onto autonomous agents that control real desktops. That's like using a speedometer to diagnose why your car is steering itself into a ditch. New Relic and Splunk are scrambling to add agentic AI monitoring features in early 2026, which tells you everything. The tooling is arriving after the deployment wave, not before it. Meanwhile, teams building on OpenAI's Operator or Anthropic's Computer Use are dealing with agents that one reviewer bluntly described as 'unfinished, unsuccessful, and unsafe' in production. Anthropic's Computer Use shipped months before OpenAI's Operator, and Operator still doesn't work reliably. Both are still in research preview status. Neither has solved the observability problem. You're deploying experimental technology with no rearview mirror.

The Governance Vacuum Is Getting Dangerous

Here's the part that keeps security people up at night. AI agents can be spun up by end-users or individual business lines as easily as creating a spreadsheet. No IT review. No security sign-off. No monitoring infrastructure. A LinkedIn piece from early 2026 put it plainly: with autonomous agents, governance processes are often just missing. The METR research team raised the rogue replication threat model in late 2024, pointing out that companies might deploy millions of AI agents without supervising them. We're not at millions yet, but we're at 'enough to cause serious damage.' A DNS failure, an API outage, or a misconfigured permission can break an agent workflow in ways that silently corrupt data for hours before anyone notices. Without observability into what the agent was doing at the moment of failure, you can't fix it, you can't audit it, and you definitely can't explain it to a regulator. The governance problem and the observability problem are one and the same. You can't govern what you can't see.

Why Coasty Was Built With Observability as a First-Class Feature

I'm going to be straight with you. Most computer use agents were built to do the task. Coasty was built to do the task and let you see every step of how it did it. That's not a small distinction. Coasty sits at 82% on OSWorld, the most rigorous real-world benchmark for computer use agents. Nobody else is close. But raw benchmark performance means nothing if you can't trust what the agent is doing in your actual environment. Coasty controls real desktops, real browsers, and real terminals, not sanitized API sandboxes. And it's built so you can actually watch that happen. Agent swarms for parallel execution mean you can scale without losing visibility across the fleet. The desktop app and cloud VM options mean you choose where the agent runs and you maintain oversight of that environment. BYOK support means your data, your keys, your control. The free tier means you can actually test this before you commit. The reason I'd recommend Coasty over patching Datadog onto an Anthropic Computer Use integration isn't just the benchmark score. It's that the architecture was designed for production trust, not just demo performance. When your computer use agent is filling out forms, navigating enterprise software, or running terminal commands at scale, you need a tool that was built to be watched. Coasty is.

Here's my honest take: the AI agent observability crisis is real, it's accelerating, and most companies are going to learn about it the expensive way. An overnight token burn. A corrupted dataset. An agent that submitted something it shouldn't have, and nobody can reconstruct what happened because nobody was watching. You don't have to be that company. The rule is simple. If you can't see what your computer use agent is doing in real time, you don't actually control it. You've just automated chaos and hoped for the best. Stop deploying agents you can't observe. Start with a computer use agent that was built for production trust from day one. Go try Coasty at coasty.ai. The free tier is there. The 82% OSWorld score is there. The visibility is there. The only thing missing is your excuse not to.

Want to see this in action?

View Case Studies
Try Coasty Free