
Your QA Team Is Burning $2.4 Trillion a Year. An AI Computer Use Agent Fixes That.

Rachel Kim · 8 min read

U.S. companies lost $2.41 trillion to poor software quality in 2022. Trillion. With a T. That number comes from CISQ, the Consortium for Information and Software Quality, and it hasn't gotten better since. Meanwhile, your QA team is manually clicking through the same login flow, the same checkout funnel, the same form validation it clicked through last sprint. And the sprint before that. And the one before that. This is not a testing problem. It's a decision problem. The technology to fix this exists right now, and most engineering teams are still choosing not to use it. That's what this post is about.

The Manual QA Tax Is Bleeding You Out

Let's talk about what manual QA actually costs in real terms. A mid-level QA engineer in the U.S. runs you $80,000 to $120,000 a year in salary alone, before benefits, management overhead, and the time they spend writing test documentation that nobody reads. Cortex's 2024 State of Developer Productivity report found that 58% of teams lose more than 5 hours per developer per week to unproductive work. QA is drowning in that bucket.

Think about what a QA engineer actually does on a Tuesday. They open the staging environment. They log in. They click through a user flow they've clicked through 200 times. They fill out a form. They check that a button works. They write a ticket. They do it again. This is a human being with a computer science degree doing work that a script could do, and that a computer use agent can do better than a script because it doesn't need brittle selectors, hardcoded XPaths, or a Selenium expert to maintain it.

The worst part isn't the cost. It's the coverage gap. Humans get tired. They skip edge cases. They assume things work because they worked last week. Bugs slip through, get found in production, and suddenly you're the team that shipped a broken checkout on Black Friday. Ask Knight Capital how their $440 million trading loss in 45 minutes felt. That one came from untested legacy code. Untested. Legacy. Code.
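
If you want to put a floor under that number for your own team, here's a minimal back-of-the-envelope in Python. The salary midpoint and the 5 hours a week come from the figures above; the 1.3x loaded-cost multiplier and 48 working weeks are my assumptions, not anything from the Cortex or CISQ reports.

```python
# Back-of-the-envelope cost of repetitive manual QA, per engineer.
# Assumptions (mine, not from the reports cited above):
#   - 1.3x loaded-cost multiplier on salary (benefits, overhead)
#   - 48 working weeks of 40 hours per year
salary = 100_000                      # midpoint of the $80k-$120k range
loaded_cost = salary * 1.3            # assumed fully loaded cost
hourly_rate = loaded_cost / (48 * 40)
repetitive_hours_per_week = 5         # Cortex 2024: 5+ hours/week lost

annual_tax = hourly_rate * repetitive_hours_per_week * 48
print(f"${annual_tax:,.0f} per QA engineer per year")  # -> $16,250
```

Multiply that by a ten-person QA team and you're past six figures a year spent on clicking through flows that worked last sprint.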

Why Traditional Test Automation Isn't the Answer Either

  • Selenium and Playwright scripts break every time a developer renames a CSS class or restructures the DOM, and someone has to fix them, which means your 'automated' tests need a human babysitter (see the sketch after this list)
  • Script-based automation covers happy paths. Real bugs live in the weird, unexpected flows that nobody thought to script for, like what happens when a user pastes emoji into a phone number field
  • The average team spends 30-40% of QA time just maintaining existing test scripts, not writing new ones, not finding new bugs
  • UiPath and legacy RPA tools were built for structured, predictable workflows. Modern web apps change constantly. RPA breaks constantly. The maintenance cost often exceeds the value
  • Last-minute UI changes, A/B tests, and feature flags mean your test suite is out of date before the sprint review ends
  • Flaky tests are a morale killer. When developers see 30% of CI runs fail randomly, they start ignoring red builds entirely, which defeats the entire purpose
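
To make that first bullet concrete, here's the kind of locator that keeps Selenium suites on life support. The staging URL and the XPath are made up, but the failure mode is the standard one: any restructuring of the page breaks the test, and a human goes in to patch it.

```python
# A typical brittle Selenium locator (Python). Any DOM restructuring,
# a renamed class, or one extra wrapper <div> breaks this line with
# NoSuchElementException, and a human has to go fix the script.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://staging.example.com/checkout")  # made-up staging URL

# Absolute XPath, welded to today's exact DOM structure:
buy_button = driver.find_element(
    By.XPATH, "/html/body/div[2]/div[1]/main/div[3]/form/button[2]"
)
buy_button.click()
driver.quit()
```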

"$2.41 trillion. That's what poor software quality cost U.S. businesses in a single year. Your regression suite running on a Tuesday afternoon isn't going to save you. A computer use agent running 24/7 in parallel just might."

What a Real Computer Use Agent Actually Does in QA

Here's where it gets interesting. A proper AI computer use agent doesn't read your DOM. It doesn't need selectors. It sees the screen the same way a human tester does, visually, and it acts on what it sees. Click that button. Fill in that field. Notice that the error message is wrong. This is a fundamentally different approach from Selenium or even most 'AI testing' tools that are really just LLM wrappers around old automation frameworks.

A genuine computer use agent can run exploratory testing, not just scripted regression. It can be told 'go test the onboarding flow and find anything that looks broken' and actually do it, adapting in real time when the UI changes. It can run overnight, across dozens of parallel sessions, covering more ground in 8 hours than a human QA team covers in a week.

The comparison to OpenAI's Operator or Anthropic's Computer Use is worth making here, because people assume those are the benchmark. They're not. A reviewer testing ChatGPT's agent in July 2025 called it 'too slow, expensive, and error-prone' for real tasks. Anthropic's Computer Use has been in research preview so long it's become a running joke in the QA community on Reddit. These tools are impressive demos. They're not production-grade testing infrastructure.
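
To ground the 'sees the screen, not the DOM' claim, here's the shape of the loop in Python. Every name in this sketch (Action, propose_action, the screen object) is invented for illustration; no vendor ships this exact interface. The point is the architecture: the agent re-observes raw pixels before every step, so a renamed CSS class changes nothing.

```python
# Illustrative sketch of a computer use agent's core loop. All names
# here are invented; this shows the architecture, not any real API.
import time
from dataclasses import dataclass

@dataclass
class Action:
    kind: str        # "click", "type", "done", or "report_bug"
    x: int = 0
    y: int = 0
    text: str = ""

def run_task(agent, screen, task: str, max_steps: int = 50) -> str:
    """Drive the UI from pixels alone: look, decide, act, repeat."""
    for _ in range(max_steps):
        frame = screen.capture()                 # raw pixels, no DOM
        action = agent.propose_action(task, frame)
        if action.kind == "done":
            return "passed"
        if action.kind == "report_bug":
            return f"bug found: {action.text}"
        if action.kind == "click":
            screen.click(action.x, action.y)
        elif action.kind == "type":
            screen.type_text(action.text)
        time.sleep(0.5)                          # let the UI settle
    return "inconclusive"
```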

How to Actually Automate QA With a Computer Use Agent: A Real Workflow

Stop thinking about this like you're setting up Selenium. The workflow is different and honestly much simpler.

  • First, identify your highest-value, highest-repetition test cases. Regression suites are the obvious starting point: login flows, checkout flows, form submissions, permission checks, anything you run every sprint without fail. These are the first things you hand off.
  • Second, write your test instructions in plain language. A good computer use agent understands 'go to the signup page, create a new account with this email, verify the confirmation screen appears, then try to create a duplicate account and confirm the error message is correct.' You don't need to write code. You describe what a human tester would do.
  • Third, run in parallel. This is where the economics flip entirely. One agent session costs a fraction of one engineer-hour. Running 20 parallel sessions overnight costs less than an hour of manual QA time and covers your entire regression suite before standup. (There's a minimal sketch of this after the list.)
  • Fourth, use agent swarms for release testing. Before a major release, spin up parallel agents hitting different parts of the product simultaneously. You'll catch interaction bugs and race conditions that sequential testing misses entirely.
  • Fifth, keep humans on exploratory and judgment-heavy testing. A computer use agent is not a replacement for a senior QA engineer who understands the product deeply. It's a replacement for the 60% of their week that's mechanical repetition.
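
Here's what the second and third steps look like wired together, as a minimal runnable sketch. run_agent_test is a placeholder for whatever call starts one agent session in your tool of choice; the plain-language instruction list is the actual artifact your team maintains.

```python
# Minimal sketch of a plain-language regression suite run in parallel.
# run_agent_test is a placeholder for starting one agent session; swap
# in your tool's real client call.
import asyncio

REGRESSION_SUITE = [
    "Log in with a valid account and confirm the dashboard loads.",
    "Sign up with a new email, then retry with the same email and "
    "verify the duplicate-account error message is correct.",
    "Add an item to the cart, check out with a test card, and verify "
    "the confirmation screen shows the right total.",
]

async def run_agent_test(instruction: str) -> tuple[str, str]:
    # Placeholder session: pretend every test passes instantly.
    await asyncio.sleep(0)
    return instruction, "passed"

async def main() -> None:
    # All sessions run concurrently, so wall-clock time is roughly the
    # slowest single test, not the sum of all of them.
    results = await asyncio.gather(
        *(run_agent_test(t) for t in REGRESSION_SUITE)
    )
    for instruction, verdict in results:
        print(f"[{verdict}] {instruction}")

asyncio.run(main())
```

That concurrency is the whole economic argument in one line of asyncio.gather: adding a 301st test case barely moves your wall-clock time.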

Why Coasty Is the Obvious Choice for This

I've tried a lot of these tools. The honest answer is that most of them are impressive in a controlled demo and frustrating in production. Coasty is different, and I can back that up with a number: 82% on OSWorld, the standard benchmark for AI computer use tasks. Claude Sonnet 4.5 scores 61.4% on the same benchmark. OpenAI's agents don't publish comparable numbers because the numbers aren't comparable.

Coasty controls real desktops, real browsers, and real terminals. Not API calls pretending to be computer use. Not a browser extension with guardrails. Actual computer use the way a human tester would do it, which means it works on any app, any UI, any stack.

For QA specifically, the agent swarm capability matters a lot. You can run parallel test sessions against your staging environment, get results fast, and not wait for a sequential test run to crawl through 300 test cases one at a time. The desktop app means your tests can interact with Electron apps, native desktop software, anything that lives outside the browser. And there's a free tier, so you can stop theorizing about whether this works and just go find out. BYOK is supported if you want to bring your own API keys and keep costs down at scale.

The teams I've seen get the most out of Coasty are the ones who started small, automated their most painful regression suite first, saw the results, and then expanded from there. It's not a big-bang migration. It's a Tuesday afternoon project that turns into your entire QA strategy by Q2.

Here's my actual opinion: if you're still running manual regression testing in 2025 and you haven't at least tried a computer use agent, you're making a choice to waste money and ship more bugs. That's it. That's the whole argument. The tools exist. The benchmarks are public. The economics are not even close. Your QA engineers are talented people who should be doing the hard, judgment-intensive work that actually requires a human brain, not clicking through a login flow for the 300th time. Give them that. Automate the rest. Start with Coasty at coasty.ai. There's a free tier. Run your most painful regression suite tonight. Come back and tell me I was wrong.

Want to see this in action?

View Case Studies
Try Coasty Free