Screenshot to Action: A Deep Dive Into the /v1/predict Endpoint
Most automation stacks rely on brittle selectors, hardcoded IDs, and text matching, so every UI change costs hours of selector maintenance. The /v1/predict endpoint flips that model. You send a base64-encoded screenshot and an instruction; the model looks at the screen and returns x/y coordinates to click or text to type. You loop through capture, predict, and act until the status is done. Each predict call costs $0.05. The result is a computer use API that drives real desktops, browsers, and terminals the way a human would.
How /v1/predict works
The endpoint follows a capture-predict-act loop. First, you capture a screenshot of the current view and base64-encode it. You then POST to /v1/predict with three fields: the screenshot as a base64 string, your natural language instruction, and cua_version. The model returns an actions array with coordinate pairs and a status field. You act on the returned actions. If the status is not done, you capture again and predict again. If the status is done, the task is complete. Two related endpoints are also available: /v1/sessions/{id}/predict adds stateful trajectory memory, and /v1/ground maps element descriptions to x,y coordinates.
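#!/bin/bash
# Replace this with your actual key
export COASTY_API_KEY="your_key_here"
# Encode an existing screenshot as single-line base64. Reading from stdin and
# stripping newlines works with both GNU (Linux) and BSD (macOS) base64.
# The payload is built with jq and streamed to curl on stdin, so a large
# screenshot never has to fit in a command-line argument.
# /v1/predict endpoint: $0.05 per call
base64 < /tmp/current_screen.png | tr -d '\n' |
  jq -Rs '{
    screenshot: .,
    instruction: "Click the button labeled Submit",
    cua_version: "v3"
  }' |
  curl -s -X POST https://coasty.ai/v1/predict \
    -H "X-API-Key: $COASTY_API_KEY" \
    -H "Content-Type: application/json" \
    --data-binary @- | jq .
Key fields and pricing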
- screenshot: base64 encoded image of the current view
- instruction: natural language task description
- cua_version: "v3" for the standard model, "v4" for autonomous mode with pass/fail verifier
- actions: JSON array with coordinate pairs (x,y) and text to type
- status: "done" means the task is complete, otherwise loop predict again
- POST /v1/predict costs $0.05 per call
- POST /v1/sessions/{id}/predict costs $0.04 per call for stateful trajectory memory
- POST /v1/ground costs $0.03 per call to map element descriptions to x,y coordinates
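Because each loop iteration is a billed call, it can help to estimate a run's cost before you start. The sketch below is only arithmetic on the per-call prices listed above. The `estimate_cost` helper and its argument order are hypothetical and not part of the API.

```shell
#!/bin/bash
# Hypothetical helper: estimate the cost of a run in dollars.
# Prices are the per-call figures listed above, expressed in cents.
# Args: $1 = /v1/predict calls, $2 = /v1/sessions/{id}/predict calls, $3 = /v1/ground calls
estimate_cost() {
  awk -v p="${1:-0}" -v s="${2:-0}" -v g="${3:-0}" \
    'BEGIN { printf "%.2f\n", (p * 5 + s * 4 + g * 3) / 100 }'
}

# Example: a task that needs 12 predict calls and 2 grounding calls.
estimate_cost 12 0 2   # prints 0.66
```

The same 12 steps through the session endpoint would come to `estimate_cost 0 12 2`, which is where the $0.01 per-call difference starts to add up on long tasks.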
Remember: capture, POST /v1/predict, act on actions, and loop until status is done.
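The loop above can be sketched in bash roughly as follows. This is a minimal sketch under stated assumptions, not a reference client: `capture_screen` and `perform_action` are hypothetical hooks you would replace with your platform's screenshot and input tools, `MAX_STEPS` and `COASTY_API_URL` are illustrative names, and the response is assumed to carry only the `actions` and `status` fields described earlier.

```shell
#!/bin/bash
API_URL="${COASTY_API_URL:-https://coasty.ai/v1/predict}"
MAX_STEPS="${MAX_STEPS:-20}"   # safety cap: every iteration is one billed call

# Hypothetical hooks: override these for your platform.
capture_screen() { echo "capture_screen: not implemented" >&2; return 1; }
perform_action() { echo "would perform: $1"; }

# Build the JSON request body from a PNG path ($1) and an instruction ($2).
build_payload() {
  base64 < "$1" | tr -d '\n' |
    jq -Rs --arg instruction "$2" --arg version "v3" \
      '{screenshot: ., instruction: $instruction, cua_version: $version}'
}

# POST the request body read from stdin and print the raw JSON response.
call_predict() {
  curl -sS --fail -X POST "$API_URL" \
    -H "X-API-Key: ${COASTY_API_KEY:?set COASTY_API_KEY first}" \
    -H "Content-Type: application/json" \
    --data-binary @-
}

# Capture, predict, act; repeat until status is "done" or the cap is hit.
run_task() {
  local instruction="$1" step=0 shot response status
  shot="$(mktemp)" || return 1
  while [ "$step" -lt "$MAX_STEPS" ]; do
    step=$((step + 1))
    capture_screen "$shot" || break
    response="$(build_payload "$shot" "$instruction" | call_predict)" || break
    # Act on each returned action, one compact JSON object per line.
    printf '%s' "$response" | jq -c '.actions[]?' |
      while IFS= read -r action; do
        perform_action "$action" </dev/null
      done
    status="$(printf '%s' "$response" | jq -r '.status // empty')"
    if [ "$status" = "done" ]; then
      rm -f "$shot"
      echo "done after $step call(s)"
      return 0
    fi
  done
  rm -f "$shot"
  echo "stopped after $step call(s) without reaching done" >&2
  return 1
}
```

After defining real hooks, a call such as `run_task "Click the button labeled Submit"` would drive the loop. The step cap is a design choice rather than an API requirement: it bounds spend if a task never reaches done.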
Where /v1/predict beats brittle automation
With selectors, you must maintain a separate mapping of element IDs, classes, and text for every app, and a single UI change can break your scripts. The computer use API instead looks at the screen, interprets it in context, and returns coordinates based on what is actually displayed. This approach carries over to dynamic layouts, SPA navigation, and localized text. You can also chain multiple predict calls to perform multi-step tasks without rewriting selectors each time. The result is automation that is more resilient to UI changes and faster to build.
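To make the contrast with selectors concrete, here is a rough sketch of acting on returned coordinates with xdotool on a Linux/X11 desktop. The action shape is an assumption: this article only says that actions contain coordinate pairs and text to type, so the `type`, `x`, `y`, and `text` field names below are hypothetical, as are the `perform_action` helper and the `XDOTOOL` override.

```shell
#!/bin/bash
# Hypothetical dispatcher: turn one action (a JSON object) into an input event.
# ASSUMED shapes: {"type":"click","x":640,"y":360} and {"type":"type","text":"hello"}.
# Set XDOTOOL=echo for a dry run that prints the arguments instead of acting.
perform_action() {
  local action="$1" tool="${XDOTOOL:-xdotool}" kind x y text
  kind="$(printf '%s' "$action" | jq -r '.type // empty')"
  case "$kind" in
    click)
      x="$(printf '%s' "$action" | jq -r '.x | floor')"
      y="$(printf '%s' "$action" | jq -r '.y | floor')"
      # Move the pointer to the coordinates, then left-click.
      "$tool" mousemove "$x" "$y" click 1
      ;;
    type)
      text="$(printf '%s' "$action" | jq -r '.text')"
      # Type the text into whatever currently has focus.
      "$tool" type "$text"
      ;;
    *)
      echo "unsupported action: $action" >&2
      return 1
      ;;
  esac
}
```

Nothing in this dispatcher refers to an element ID, class, or label, so a layout change affects only the coordinates the model returns on the next call, not the script.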
Now you understand the screenshot-to-action pipeline. Build a computer use agent that clicks, types, and navigates just like a human. Get your API key at https://coasty.ai/developers and start integrating /v1/predict into your automation stack.