Screenshot to Action: A Deep Dive into the /v1/predict Endpoint

Rachel Kim | 6 min

Traditional UI automation relies on brittle selectors and static APIs. It cannot see the screen, so it cannot handle dynamic layouts or unexpected states. The /v1/predict endpoint lets you send a screenshot and an instruction to a vision model. It returns a list of actions that your code executes with PyAutoGUI. The result is a computer use agent that watches the screen and acts like a human, giving you robust desktop, browser, and terminal automation without hard-coded selectors.

How it works

The /v1/predict endpoint takes a base64-encoded screenshot, an instruction, and the CUA version. It returns an actions array and a status. Your code runs the loop: capture the screen, call /v1/predict, execute the returned actions, and repeat until status is done. Each call costs $0.05, and tokens are not billed separately. The model sees the screenshot and chooses a click, key press, text, or scroll action based on your instruction. You control the loop, state transitions, and error handling.

bash
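# $(...) does not expand inside single quotes, so the quote is closed around it.
# base64 -w 0 (no line wrapping) is the GNU coreutils form.
curl -X POST https://coasty.ai/v1/predict \
  -H "X-API-Key: $COASTY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "cua_version": "v4",
    "instruction": "Click the subscribe button in the top right corner.",
    "screenshot": "'"$(base64 -w 0 screenshot.png)"'"
  }'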
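
The same request can be made from Python. What follows is a minimal sketch using only the standard library. The helper names `build_payload` and `predict` are illustrative and not part of any Coasty SDK, and the response is assumed to be a JSON object carrying the `actions` and `status` fields described in this post.

```python
import base64
import json
import urllib.request

PREDICT_URL = "https://coasty.ai/v1/predict"


def build_payload(png_bytes: bytes, instruction: str, cua_version: str = "v4") -> dict:
    """Build the JSON body: CUA version, instruction, and base64 screenshot."""
    return {
        "cua_version": cua_version,
        "instruction": instruction,
        "screenshot": base64.b64encode(png_bytes).decode("ascii"),
    }


def predict(payload: dict, api_key: str, timeout: float = 60.0) -> dict:
    """POST the payload to /v1/predict and return the decoded JSON response."""
    request = urllib.request.Request(
        PREDICT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    # Assumption: the endpoint answers with a JSON object on success.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

To use it, read `screenshot.png` in binary mode, pass the bytes and your instruction to `build_payload`, and hand the result to `predict` together with the key from the `COASTY_API_KEY` environment variable. Encoding in Python also avoids the shell quoting and line-wrapping concerns of the curl form.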

Request and response fields

  • POST https://coasty.ai/v1/predict with headers X-API-Key and Content-Type: application/json.
  • Request body has cua_version (string), instruction (string), screenshot (base64 string).
  • Response body has actions (array of objects), status (string).
  • Status values include done and error, along with other intermediate states.
  • Each action object contains a type plus the data that type needs: coordinates (x, y), text, or key.
  • Billed $0.05 per POST /v1/predict call.
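
Given those fields, executing the response comes down to a small dispatcher that maps each action object to a PyAutoGUI call. The sketch below rests on several assumptions: the type names `click`, `key`, `text`, and `scroll`, the top-level `x` and `y` fields, and the `amount` field are all illustrative guesses, so check them against real responses before relying on them. The PyAutoGUI functions used (`click`, `press`, `write`, `scroll`) are part of that library.

```python
from typing import Any, Dict, List, Optional


def execute_action(action: Dict[str, Any], gui: Any) -> None:
    """Run one action object against a PyAutoGUI-like object.

    Assumed shapes (not confirmed by the API description):
      {"type": "click", "x": 10, "y": 20}
      {"type": "key", "key": "enter"}
      {"type": "text", "text": "hello"}
      {"type": "scroll", "amount": -3}
    """
    kind = action.get("type")
    if kind == "click":
        gui.click(x=action["x"], y=action["y"])
    elif kind == "key":
        gui.press(action["key"])
    elif kind == "text":
        gui.write(action["text"])
    elif kind == "scroll":
        gui.scroll(action["amount"])
    else:
        # Fail loudly instead of silently skipping an action we do not understand.
        raise ValueError(f"Unsupported action type: {kind!r}")


def execute_all(actions: List[Dict[str, Any]], gui: Optional[Any] = None) -> int:
    """Execute actions in order and return how many were run."""
    if gui is None:
        # Imported lazily so the dispatcher can be exercised without a display.
        import pyautogui
        gui = pyautogui
    for action in actions:
        execute_action(action, gui)
    return len(actions)
```

Passing the GUI backend as a parameter keeps the dispatcher testable with a stub and makes it easy to log or dry-run actions before letting them touch a real desktop.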

Loop: capture screen, POST /v1/predict, execute actions, repeat until status is done.
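
The loop above can be sketched as a single function that takes its three steps as callables. This is an illustration under assumptions, not a reference implementation: `run_agent` and its parameters are hypothetical names, any status other than done or error is treated as intermediate, and actions returned alongside a done status are still executed. A step cap is included because every iteration is a billed call.

```python
from typing import Any, Callable, Dict, List

PRICE_PER_CALL_USD = 0.05  # per POST /v1/predict call, as stated in this post


def run_agent(
    instruction: str,
    capture: Callable[[], str],
    predict: Callable[[Dict[str, Any]], Dict[str, Any]],
    execute: Callable[[List[Dict[str, Any]]], Any],
    max_steps: int = 20,
    cua_version: str = "v4",
) -> Dict[str, Any]:
    """Capture, predict, execute, and repeat until done, error, or the step cap.

    capture() returns a base64 screenshot string, predict(payload) returns the
    decoded response, and execute(actions) performs the returned actions.
    """

    def summary(status: str, calls: int) -> Dict[str, Any]:
        return {
            "status": status,
            "calls": calls,
            "cost_usd": round(calls * PRICE_PER_CALL_USD, 2),
        }

    for step in range(1, max_steps + 1):
        response = predict({
            "cua_version": cua_version,
            "instruction": instruction,
            "screenshot": capture(),
        })
        # Assumption: a response without a status is treated as an error.
        status = response.get("status", "error")
        if status == "error":
            return summary(status, step)
        # Assumption: a final response may still carry actions, so run them first.
        execute(response.get("actions", []))
        if status == "done":
            return summary(status, step)
    return summary("max_steps_reached", max_steps)
```

Because the capture, predict, and execute steps are injected, the same loop can run against stubs in tests and against a real screenshot function, HTTP client, and PyAutoGUI dispatcher in production.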

Where this beats brittle automation

API-only tools fail when elements change IDs or classes, or when animations delay them. With a computer use agent, the model sees the current screen state. It can handle dynamic layouts, overlapping windows, and user-driven events. You do not need to maintain a selector registry. You give a natural language instruction, the model chooses the right action, and you execute it. This suits web scraping, testing, data entry, and onboarding flows that require interacting with rich UIs.

Start building agents that see and click. Use /v1/predict with a real screenshot and an instruction. Create robust desktop and browser automation. Get your API key at https://coasty.ai/developers.
