Tutorial

Screenshot to Action: A Deep Dive Into the /v1/predict Endpoint

Alex Thompson | 7 min read

Most automation tools depend on brittle selectors, XPath expressions, or an API that the target application may not offer. When a web UI changes constantly or an application exposes no endpoints, there is nothing stable to script against. The /v1/predict endpoint gives you a computer use API that reads the screen and produces real actions such as mouse clicks, key presses, and scrolling. You send a base64-encoded screenshot and an instruction. The model returns a list of actions and a status. You loop through capture, predict, and act until the status is done. This is how you build a computer use agent that works on any desktop, browser, or terminal.

How /v1/predict works

The endpoint is POST https://coasty.ai/v1/predict. It requires three fields: screenshot, a base64-encoded image; instruction, a text prompt for the model; and cua_version, which identifies the computer use agent version. The response contains actions and a status. The actions array holds objects with the fields name, x, y, params, and type. The status field is one of done, pending, or error. You call predict in a loop: take a screenshot, post it with the current instruction, receive actions, execute them, then wait for the next screen. Each pass adds a step to the trajectory. This is the core of a computer use agent.

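bash
#!/bin/bash
set -euo pipefail

# Fail early if the API key is not set in the environment.
: "${COASTY_API_KEY:?Set COASTY_API_KEY before running}"
ENDPOINT="https://coasty.ai/v1/predict"

# Encode screenshot.png as a single-line base64 string.
# GNU base64 wraps its output by default, and raw newlines would break the JSON.
SCREENSHOT_BASE64=$(base64 < screenshot.png | tr -d '\n')

INSTRUCTION="Click the login button and type your password."
CUA_VERSION="v3"

# Stream the body over stdin so a large screenshot does not hit the
# command-line length limit. An instruction that contains double quotes
# or backslashes must be JSON-escaped before it is placed here.
RESPONSE=$(printf '{"screenshot": "%s", "instruction": "%s", "cua_version": "%s"}' \
  "$SCREENSHOT_BASE64" "$INSTRUCTION" "$CUA_VERSION" |
  curl -s -X POST "$ENDPOINT" \
    -H "X-API-Key: $COASTY_API_KEY" \
    -H "Content-Type: application/json" \
    --data-binary @-)

echo "$RESPONSE"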
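
A single request becomes an agent only once it runs in a loop. The sketch below shows one way to structure that loop in bash. The three step functions (capture_screen, predict, execute_actions), the PREDICT_STATUS variable, and the step budget are hypothetical names and choices made for this illustration; they are not part of the API. It also assumes that the actions returned alongside a done status should still be executed.

```shell
#!/usr/bin/env bash
# Hypothetical step functions: replace the bodies with real capture,
# HTTP, and input-replay code. They are placeholders, not part of the API.
capture_screen()  { :; }                      # e.g. save a fresh screenshot to disk
predict()         { PREDICT_STATUS="done"; }  # e.g. POST to /v1/predict, store the status
execute_actions() { :; }                      # e.g. replay the returned actions

# run_agent [MAX_STEPS]
# Returns 0 when the status is done, 1 on error or an unexpected status,
# and 2 when the step budget runs out while the task is still pending.
run_agent() {
  local max_steps="${1:-20}" step
  for ((step = 1; step <= max_steps; step++)); do
    capture_screen
    predict
    case "$PREDICT_STATUS" in
      done|pending) execute_actions ;;
      *) echo "stopping at step $step: status '$PREDICT_STATUS'" >&2
         return 1 ;;
    esac
    if [[ "$PREDICT_STATUS" == "done" ]]; then
      return 0
    fi
  done
  echo "gave up after $max_steps steps without a done status" >&2
  return 2
}
```

Calling run_agent 30 would allow up to 30 capture, predict, act passes. The budget is a guard added for this sketch so that a task stuck on pending cannot loop forever.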

Request and response fields

  • screenshot (request): base64-encoded image data representing the current screen.
  • instruction (request): short natural language prompt telling the model what to do.
  • cua_version (request): version string such as v3. The model behavior depends on this.
  • actions (response): array of action objects with fields name, x, y, params, type.
  • status (response): done, pending, or error. You stop looping when status is done.
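
Each action object has to be turned into real input on the machine. The sketch below shows one possible translation for an X11 desktop, using xdotool as the backend. This article does not list the values that type can take, so click, key, and scroll are assumptions made for illustration, as is the meaning given to params (an X keysym name for key, up or down for scroll). The function prints the command instead of running it, which makes it easy to inspect before wiring it to a live session.

```shell
#!/usr/bin/env bash
# action_to_command TYPE X Y PARAMS
# Prints the xdotool command for one action (dry run, nothing is executed).
# The TYPE values handled here are assumptions, not a documented list.
action_to_command() {
  local type="$1" x="$2" y="$3" params="$4"
  case "$type" in
    click)
      # Move the pointer to (x, y), then press the left button.
      printf 'xdotool mousemove %s %s click 1\n' "$x" "$y"
      ;;
    key)
      # PARAMS is assumed to hold an X keysym name such as Return.
      printf 'xdotool key %s\n' "$params"
      ;;
    scroll)
      # X11 reports wheel up and wheel down as buttons 4 and 5.
      if [[ "$params" == "up" ]]; then
        echo "xdotool click 4"
      else
        echo "xdotool click 5"
      fi
      ;;
    *)
      echo "unsupported action type: $type" >&2
      return 1
      ;;
  esac
}
```

For example, action_to_command click 120 340 "" prints xdotool mousemove 120 340 click 1. Returning a non-zero code for unknown types keeps the caller from silently skipping an action it does not understand.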


Where this beats brittle automation

You cannot rely on stable element selectors when the UI changes, when pages load dynamically, or when elements are rendered by JavaScript. A computer use agent that reads the screen can adapt to layout shifts, new classes, or missing IDs. It can click buttons based on visual similarity and text. It can scroll, drag, and type in text fields that are not accessible via APIs. This makes the /v1/predict endpoint ideal for automating web browsers, desktop apps, or terminal sessions where no formal API exists. You get a computer use agent that behaves like a human user, not a library of brittle selectors.

Use the /v1/predict endpoint to build a screenshot-to-action pipeline. Capture the screen, send it to the model, execute the returned actions, and repeat until the task is done. This is the foundation of a computer use agent that works on any desktop environment. Get your API key from https://coasty.ai/developers and start building your own computer use agent today.
