Ground UI Elements to Coordinates with /v1/ground
Automating clicks, inputs, and scrolling is easiest when you know the exact x,y coordinates of the element you need to act on. The /v1/ground endpoint takes a base64-encoded screenshot and a text description of a UI element and returns that element's x,y coordinates. Paired with a computer use agent, this grounding step drives clicks and other actions on real desktops, browsers, and terminals without brittle selectors.
How /v1/ground works
The endpoint expects a base64-encoded screenshot and a text description of the target element. It returns a JSON body with a top-level coordinates object containing integer x and y values. Each call costs $0.03. You can combine grounding with other computer use operations such as /v1/sessions or /v1/predict to build a complete automation flow.
curl -X POST https://coasty.ai/v1/ground \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $COASTY_API_KEY" \
  -d '{
    "screenshot": "<base64-encoded-screenshot>",
    "description": "the green submit button near the address bar"
  }'
# Example Python snippet
import base64
import os

import requests

url = "https://coasty.ai/v1/ground"
key = os.getenv("COASTY_API_KEY")

# Read the screenshot and base64-encode it for the JSON payload
with open("screenshot.png", "rb") as f:
    img_bytes = f.read()
screenshot_b64 = base64.b64encode(img_bytes).decode("utf-8")

resp = requests.post(
    url,
    json={
        "screenshot": screenshot_b64,
        "description": "the green submit button near the address bar",
    },
    headers={"X-API-Key": key},
    timeout=30,  # avoid hanging forever on a stalled connection
)
resp.raise_for_status()  # raise on 4xx/5xx responses
print(resp.json())

Grounding vs brittle selectors
- Grounding uses visual understanding instead of DOM selectors, which break when the layout or the front-end framework changes.
- The computer use agent sees the same screenshot, so it can click the same visual element once the coordinates are grounded.
- You avoid writing a selector for every element, and the automation stays robust across different browsers and OS versions.
- Combine grounding with /v1/sessions to let the agent drive clicks, inputs, and scrolling based on real screen state.
Use /v1/ground to convert any visible UI element into precise coordinates, then let the computer use agent act on those coordinates.
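As a rough illustration of handing grounded coordinates to an action, the sketch below reads x and y from a response body shaped as described earlier (a top-level coordinates object with integer x and y) and passes them to a click callback. The helper names extract_point and click_grounded, the click callback, the sample values, and the bounds check are all assumptions made for this example. In particular, the check assumes the coordinates are expressed in the pixel space of the submitted screenshot, which this article does not state.

```python
from typing import Callable, Mapping, Tuple


def extract_point(body: Mapping) -> Tuple[int, int]:
    """Pull (x, y) out of a /v1/ground response body (hypothetical helper)."""
    coords = body.get("coordinates")
    if not isinstance(coords, Mapping):
        raise ValueError("response has no 'coordinates' object")
    try:
        x, y = coords["x"], coords["y"]
    except KeyError as exc:
        raise ValueError(f"coordinates object is missing {exc}") from exc
    for value in (x, y):
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError("x and y must be integers")
    return x, y


def click_grounded(
    body: Mapping,
    click: Callable[[int, int], None],
    width: int,
    height: int,
) -> Tuple[int, int]:
    """Validate the grounded point, then hand it to a caller-supplied click action."""
    x, y = extract_point(body)
    # Assumption: coordinates are pixels within the screenshot that was sent
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError(f"point ({x}, {y}) is outside the {width}x{height} screenshot")
    click(x, y)
    return x, y


# Illustrative response body; the values are made up for this example
sample = {"coordinates": {"x": 412, "y": 87}}

clicks = []  # stands in for a real click action
click_grounded(sample, lambda x, y: clicks.append((x, y)), width=1280, height=800)
print(clicks)  # [(412, 87)]
```

Keeping the click action as a callback means the same validation can sit in front of whatever actually performs the click, whether that is an agent session or a local automation library.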
Practical patterns
- Ground the submit button once, then use the coordinates in subsequent actions or in a workflow step.
- You can ground multiple elements in parallel or sequentially to map out a UI before the agent starts interacting.
- Grounding works with both desktop and browser environments because the agent sees the entire screen.
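The parallel pattern above could be sketched roughly as follows. ground_many, estimated_cost_cents, and the ground_one callback are hypothetical names introduced for this example. In practice ground_one would wrap the POST request from the Python snippet earlier. Here a stub with made-up coordinates stands in for it so the example runs offline. The cost figure applies the $0.03 per call price quoted above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Tuple

Point = Tuple[int, int]
COST_PER_CALL_CENTS = 3  # $0.03 per /v1/ground call, as quoted in this article


def ground_many(
    descriptions: Iterable[str],
    ground_one: Callable[[str], Point],
    max_workers: int = 4,
) -> Dict[str, Point]:
    """Ground several element descriptions concurrently (hypothetical helper)."""
    # Drop duplicate descriptions, keeping first-seen order, to avoid paying twice
    unique = list(dict.fromkeys(descriptions))
    if not unique:
        return {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as its inputs
        points = list(pool.map(ground_one, unique))
    return dict(zip(unique, points))


def estimated_cost_cents(n_calls: int) -> int:
    """Cost in cents for n_calls grounding requests."""
    return n_calls * COST_PER_CALL_CENTS


# Stub standing in for a real /v1/ground call; the coordinates are made up
fake_layout = {
    "the search box": (640, 120),
    "the green submit button": (412, 87),
    "the cancel link": (520, 87),
}


def fake_ground(description: str) -> Point:
    return fake_layout[description]


targets = [
    "the search box",
    "the green submit button",
    "the cancel link",
    "the green submit button",  # duplicate, grounded only once
]
points = ground_many(targets, fake_ground)
print(points)
print(f"estimated cost: ${estimated_cost_cents(len(points)) / 100:.2f}")  # $0.09
```

Threads suit this case because each grounding call spends most of its time waiting on the network. Keep in mind that a stored mapping is only valid while the screen still matches the screenshot that was grounded.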
Start grounding UI elements to coordinates with the /v1/ground endpoint. Pair it with a computer use agent to automate clicks, inputs, and scrolling on real desktops and browsers. Get your API key at https://coasty.ai/developers.