Screenshot to Action: The /v1/predict Endpoint Deep Dive
Most automation tools rely on hardcoded selectors, CSS IDs, or brittle API mocks, so they break as soon as a UI changes. The Coasty computer use API flips that model: you send a raw screenshot and a natural language instruction, and the model returns concrete actions such as "click at x,y" or "type text". The /v1/predict endpoint is the core of that loop. It costs $0.05 per request and returns a list of actions plus a status. This guide covers the exact request shape, the response fields, and a working Python example that reads your API key from COASTY_API_KEY and keeps calling the endpoint until the task is done.
How /v1/predict works
The endpoint follows a capture-predict-act loop. You POST a base64 screenshot, an instruction, and the CUA version you want to use. The model analyzes the visual state and returns a list of actions plus a status. You then capture the next frame, POST again, and repeat until the status field is "done". This pattern lets you build a fully visual agent that reads the screen just like a human.
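```python
import base64
import os
import time

import requests

API_KEY = os.getenv("COASTY_API_KEY")
BASE = "https://coasty.ai/v1"


def encode_image(path: str) -> str:
    """Read an image file and return its contents as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def predict_action(
    screenshot_b64: str,
    instruction: str,
    cua_version: str = "v3",
) -> dict:
    """POST one screenshot and instruction to /v1/predict."""
    resp = requests.post(
        f"{BASE}/predict",
        headers={
            "X-API-Key": API_KEY,
            "Content-Type": "application/json",
        },
        json={
            "screenshot": screenshot_b64,
            "instruction": instruction,
            "cua_version": cua_version,
        },
        timeout=60,  # avoid hanging forever on a stalled connection
    )
    resp.raise_for_status()
    return resp.json()


def run_task(
    screenshot_path: str,
    instruction: str,
    cua_version: str = "v3",
):
    while True:
        # Re-read the file on every pass so a refreshed screenshot is picked up.
        screenshot_b64 = encode_image(screenshot_path)
        data = predict_action(screenshot_b64, instruction, cua_version)
        actions = data.get("actions", [])
        status = data.get("status")
        print("Actions:", actions)
        # Stop on any terminal status, not only "done", so the loop cannot
        # spin forever after a failure or cancellation.
        if status in ("done", "failed", "cancelled"):
            break
        # Execute the actions and save the next frame to screenshot_path here.
        time.sleep(0.5)


if __name__ == "__main__":
    if not API_KEY:
        raise SystemExit("Set COASTY_API_KEY before running.")
    run_task("screen.png", "click the submit button")
```
Fields you need to know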
- screenshot: base64-encoded image. The model sees this exactly as a human would.
- instruction: natural language task for the agent. Examples include 'click the download button' or 'fill the email field'.
- cua_version: "v3" for standard computer use, "v4" for autonomous mode with a pass/fail verifier (when used as part of a task run, not directly via /v1/predict).
- actions: a list of predicted actions for the current step. The format depends on the CUA version, but each entry typically includes an action type and coordinates.
- status: one of "queued", "running", "awaiting_human", "done", "failed", or "cancelled". When status is "done" the task is complete and you can stop looping; until then, execute the returned actions and capture the next frame.
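The status and actions fields suggest a small decision step after every response. The sketch below is one way to write it, under stated assumptions: it treats "done", "failed", and "cancelled" as terminal and "awaiting_human" as a pause, and it assumes each action is a dict with a "type" key. The helper names next_step and dispatch, the returned labels, and the handler mapping are hypothetical and not part of the Coasty API; check the real action format for your CUA version before relying on it.

```python
from typing import Callable, Dict, List

TERMINAL = {"done", "failed", "cancelled"}  # assumed to end the loop


def next_step(response: dict) -> str:
    """Map a /v1/predict response to "stop", "pause", or "continue"."""
    status = response.get("status")
    if status in TERMINAL:
        return "stop"
    if status == "awaiting_human":
        # Assumption: a human has to intervene before the loop resumes.
        return "pause"
    # "queued", "running", or a missing status: act, capture, call again.
    return "continue"


def dispatch(actions: List[dict], handlers: Dict[str, Callable[[dict], None]]) -> int:
    """Run each action through the handler registered for its type.

    Assumes every action carries a "type" key (hypothetical field name).
    Returns the number of actions that were executed.
    """
    executed = 0
    for action in actions:
        kind = action.get("type")
        handler = handlers.get(kind)
        if handler is None:
            raise ValueError(f"no handler registered for action type {kind!r}")
        handler(action)
        executed += 1
    return executed
```

Passing the handlers in as a mapping keeps the loop independent of any particular input library: the same dispatch call works whether a "click" handler drives a real mouse or simply records the action in a test.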
Each /v1/predict call costs $0.05 and returns actions plus a status. Loop through capture, predict, and act until status is "done".
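Because pricing is per call, the cost of a task scales with the number of loop iterations. The following is a rough estimator, assuming exactly one /v1/predict call per step; the function name estimate_cost and the step counts in the usage lines are illustrative, not measured figures.

```python
from decimal import Decimal

PRICE_PER_CALL = Decimal("0.05")  # per-request price stated in this guide


def estimate_cost(steps_per_task: int, tasks: int = 1) -> Decimal:
    """Estimate spend in dollars, assuming one /v1/predict call per step."""
    if steps_per_task < 0 or tasks < 0:
        raise ValueError("steps_per_task and tasks must be non-negative")
    return PRICE_PER_CALL * steps_per_task * tasks


print(estimate_cost(12))       # 0.60
print(estimate_cost(12, 500))  # 300.00
```

Decimal is used instead of float so that repeated multiplication of 0.05 does not pick up binary rounding noise.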
Where this beats brittle automation
Automation scripts that rely on selectors break when a class name changes, a layout shifts, or a page uses dynamic IDs. A computer use agent reads the actual pixels instead, so it works from the visual layout and context rather than from the markup. You can ask the agent to "click the login button in the top right" and it will locate that area regardless of how the DOM is structured. This makes your automation more resilient to UI churn, responsive redesigns, and slight differences between browsers. You still need to handle state changes between steps, but you no longer have to maintain a master list of selectors.
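One simple way to handle state changes is to wait until the screen differs from the previous frame before sending the next request. The sketch below does this by comparing SHA-256 digests of the raw screenshot bytes. It is an illustration, not part of the Coasty API: the capture callable is a hypothetical function you supply that returns the current screenshot as bytes, and an exact byte comparison will also fire on trivial changes such as a blinking cursor or a clock.

```python
import hashlib
import time
from typing import Callable, Optional


def frame_digest(image_bytes: bytes) -> str:
    """Return a SHA-256 hex digest of the raw screenshot bytes."""
    return hashlib.sha256(image_bytes).hexdigest()


def wait_for_change(
    capture: Callable[[], bytes],
    previous: bytes,
    timeout: float = 5.0,
    interval: float = 0.25,
) -> Optional[bytes]:
    """Poll capture() until the frame differs from `previous`.

    Returns the new frame, or None if nothing changed before the timeout.
    """
    before = frame_digest(previous)
    deadline = time.monotonic() + timeout
    while True:
        frame = capture()
        if frame_digest(frame) != before:
            return frame
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

If the function returns None, the action probably had no visible effect, which is a reasonable point to retry or stop rather than spend another request on an identical frame.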
The /v1/predict endpoint is the foundation for vision-based agents that see and act like humans. Start by capturing a screenshot, sending it with an instruction, and building a loop that keeps running until status is "done". From there you can integrate task runs, workflows, and cloud machines for full end-to-end automation. Ready to build a computer use agent? Get your API key at https://coasty.ai/developers.