Screenshot to Action: Deep Dive Into the /v1/predict Endpoint

Michael Rodriguez | 7 min read

Most UI automation relies on brittle selectors or fixed APIs. When you need an agent that sees the screen and acts on it the way a human would, the /v1/predict endpoint is your core primitive. It takes a base64-encoded screenshot and a natural-language instruction and returns a list of mouse and keyboard actions. Each call is billed at $0.05.

How it works

Send a POST request to https://coasty.ai/v1/predict with a base64-encoded screenshot, a text instruction, and a CUA (computer use agent) version. The endpoint returns a JSON payload containing a list of actions (click, type, scroll, etc.), a status ("pending" or "done"), and a session_id if you want stateful trajectory memory.

To complete a task, run a loop: capture screen → predict → execute actions, repeating until status is "done". This loop is the foundation of any computer use agent.

bash
curl -X POST https://coasty.ai/v1/predict \
  -H "X-API-Key: $COASTY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "image": "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg==",
    "instruction": "Click the blue button labeled Submit",
    "cua_version": "v3"
  }'

Request fields

  • image (string, base64): the screenshot to analyze.
  • instruction (string): natural language describing the action.
  • cua_version (string): model version, such as "v3" or "v4".
  • session_id (optional): continues a stateful trajectory created via /v1/sessions.
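As a rough Python counterpart to the curl call, the sketch below assembles the request fields listed above into a POST request without sending it. The function name `build_predict_request` and the default `cua_version="v3"` are assumptions made for illustration; the URL, header names, and field names are taken from the curl example.

```python
import base64
import json
import urllib.request

# URL and header names are copied from the curl example above.
PREDICT_URL = "https://coasty.ai/v1/predict"


def build_predict_request(png_bytes, instruction, api_key,
                          cua_version="v3", session_id=None):
    """Build (but do not send) a POST request for /v1/predict.

    Hypothetical helper: the name and the "v3" default are illustrative only.
    """
    payload = {
        # Raw image bytes are base64-encoded into an ASCII string.
        "image": base64.b64encode(png_bytes).decode("ascii"),
        "instruction": instruction,
        "cua_version": cua_version,
    }
    # session_id is optional, so it is only sent when the caller has one.
    if session_id is not None:
        payload["session_id"] = session_id

    return urllib.request.Request(
        PREDICT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

To send it, pass the returned object to `urllib.request.urlopen` and decode the body with `json.load`. Keeping construction separate from sending makes the payload easy to inspect or log before any billed call is made.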

Response fields

  • actions (array): list of mouse/keyboard steps: click, type, scroll, etc.
  • status (string): "pending" while computing, "done" when complete.
  • session_id (string): ID for stateful trajectory memory if you use session-based flows.
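One way to handle the response fields above is to parse them into a small typed object and reject any status outside the two listed values. This is a minimal sketch: `PredictResult` and `parse_predict_response` are hypothetical names, and because the article does not specify the shape of each item in `actions`, the items are left as opaque values.

```python
import json
from dataclasses import dataclass, field
from typing import Optional

# The two status values listed in the response fields above.
KNOWN_STATUSES = {"pending", "done"}


@dataclass
class PredictResult:
    """Hypothetical container for the three documented response fields."""
    actions: list = field(default_factory=list)
    status: str = "pending"
    session_id: Optional[str] = None

    @property
    def done(self) -> bool:
        return self.status == "done"


def parse_predict_response(raw):
    """Parse a JSON response body (str or bytes) into a PredictResult."""
    data = json.loads(raw)

    status = data.get("status")
    if status not in KNOWN_STATUSES:
        raise ValueError(f"unexpected status: {status!r}")

    # Individual actions are kept as-is; their schema is not assumed here.
    actions = data.get("actions", [])
    if not isinstance(actions, list):
        raise ValueError("actions must be a list")

    return PredictResult(actions=actions, status=status,
                         session_id=data.get("session_id"))
```

Failing loudly on an unknown status keeps a loop from spinning (and being billed) on a response it does not understand.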

Each call is billed at $0.05, so a task that takes 10 calls costs $0.50. To complete a task, keep calling until status is "done".
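The loop and the per-call billing can be sketched together as follows. Everything here is illustrative: `run_task`, its three callables (`capture`, `predict`, `execute`), and the `max_calls` cap are assumptions rather than part of the API. The sketch also assumes that a `session_id` returned by one call can be passed to the next. The 5-cent cost comes from the stated $0.05 per call.

```python
COST_PER_CALL_CENTS = 5  # $0.05 per call, as stated in the article


def run_task(capture, predict, execute, instruction, max_calls=20):
    """Loop capture -> predict -> execute until status is "done".

    capture()  -> screenshot bytes                  (caller-supplied)
    predict(screenshot, instruction, session_id)
               -> dict with actions/status/session_id (caller-supplied)
    execute(action) performs one returned action    (caller-supplied)

    Returns (finished, calls_made, cost_in_cents). max_calls is a
    hypothetical safety cap so an unfinished task cannot bill forever.
    """
    session_id = None
    calls = 0
    while calls < max_calls:
        screenshot = capture()
        result = predict(screenshot, instruction, session_id)
        calls += 1

        # Assumption: reuse whatever session_id the response hands back.
        session_id = result.get("session_id", session_id)

        for action in result.get("actions", []):
            execute(action)

        if result.get("status") == "done":
            return True, calls, calls * COST_PER_CALL_CENTS

    return False, calls, calls * COST_PER_CALL_CENTS
```

Passing the three steps in as callables keeps the loop independent of any particular screenshot or input library, and makes it possible to dry-run with fakes before spending on real calls. Cost is tracked in integer cents to avoid floating-point rounding.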

Where this beats brittle automation

Standard automation tools break when the UI changes or when elements lack stable IDs. The computer use API works from the visual context of the screenshot, so it can click a button whose label text changes or interact with dynamic dashboards. An agent built on /v1/predict acts on what is actually on screen, the way a human would, instead of following a script of selectors that can go stale.

Start building your computer use agent with the /v1/predict endpoint. If you want stateful trajectory memory, create a session first with POST /v1/sessions. Get your API key at https://coasty.ai/developers and start turning screenshots into actions.
