Tutorial

Ground UI Elements to Coordinates with /v1/ground

Alex Thompson | 4 min

Automating clicks, inputs, and scrolling is easiest when you know the exact x, y coordinates of the element you need to act on. The /v1/ground endpoint takes a base64-encoded screenshot and a text description of a UI element and returns that element's x, y coordinates. Paired with a computer use agent, this grounding step drives clicks and other actions on real desktops, browsers, and terminals without brittle selectors.

How /v1/ground works

The endpoint expects a base64-encoded screenshot and a text description of the target element. It returns a JSON body with a top-level coordinates object containing integer x and y values. Each call costs $0.03. You can combine grounding with other computer use endpoints such as /v1/sessions or /v1/predict to build a robust automation flow.

bash
curl -X POST https://coasty.ai/v1/ground \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $COASTY_API_KEY" \
  -d '{
    "screenshot": "<base64-encoded-screenshot>",
    "description": "the green submit button near the address bar"
  }'

python
# Example Python snippet
import base64
import os

import requests

url = "https://coasty.ai/v1/ground"
# Fail fast with a KeyError if the API key is not set.
key = os.environ["COASTY_API_KEY"]

# Read the screenshot and base64-encode it for the JSON payload.
with open("screenshot.png", "rb") as f:
    screenshot_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    url,
    json={
        "screenshot": screenshot_b64,
        "description": "the green submit button near the address bar",
    },
    headers={"X-API-Key": key},
    timeout=30,  # avoid hanging indefinitely on a stalled connection
)
resp.raise_for_status()  # raise on 4xx/5xx responses
print(resp.json())
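
The response shape described earlier (a top-level coordinates object with integer x and y) can be unpacked with a small helper before the values are handed to an agent. This is a minimal sketch, not part of the API: the function name `extract_point` is hypothetical, and the sample body is illustrative rather than captured from a real call. Only the `coordinates`, `x`, and `y` keys come from the description above.

```python
from typing import Any, Dict, Tuple


def extract_point(body: Dict[str, Any]) -> Tuple[int, int]:
    """Return (x, y) from a /v1/ground response body.

    Assumes the shape described in this tutorial:
    {"coordinates": {"x": <int>, "y": <int>}}.
    """
    coords = body.get("coordinates")
    if not isinstance(coords, dict):
        raise ValueError("response has no 'coordinates' object")
    x, y = coords.get("x"), coords.get("y")
    # bool is a subclass of int in Python, so reject it explicitly.
    for name, value in (("x", x), ("y", y)):
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"'{name}' is missing or not an integer")
    return x, y


# Illustrative body only; real values depend on the screenshot.
sample = {"coordinates": {"x": 412, "y": 87}}
print(extract_point(sample))
```

Validating the body up front means a malformed response fails with a clear error instead of sending a click to the wrong place.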

Grounding vs brittle selectors

  • Grounding relies on visual understanding rather than DOM selectors, which break when the layout or front-end framework changes.
  • The computer use agent works from the same screenshot, so it clicks the visual element that grounding located.
  • You avoid writing a selector for every element, and the automation stays robust across browsers and OS versions.
  • Combine grounding with /v1/sessions to let the agent drive clicks, inputs, and scrolling based on real screen state.

Use /v1/ground to convert any visible UI element into precise coordinates, then let the computer use agent act on those coordinates.

Practical patterns

  • Ground the submit button once, then use the coordinates in subsequent actions or in a workflow step.
  • You can ground multiple elements in parallel or sequentially to map out a UI before the agent starts interacting.
  • Grounding works with both desktop and browser environments because the agent sees the entire screen.
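
The patterns above can be sketched as a helper that grounds several descriptions against one screenshot in parallel and estimates the spend. This is a sketch under stated assumptions: `ground_many`, `estimate_cost`, and `fake_ground` are hypothetical names, and the grounding call is passed in as a plain function (for example, a thin wrapper around a POST to /v1/ground) so the example runs offline. The $0.03 per-call price is the figure stated earlier in this tutorial.

```python
from concurrent.futures import ThreadPoolExecutor
from decimal import Decimal
from typing import Callable, Dict, Iterable, Tuple

Point = Tuple[int, int]
# Any callable mapping (screenshot_b64, description) to (x, y).
GroundFn = Callable[[str, str], Point]

PRICE_PER_CALL = Decimal("0.03")  # per-call price stated in this tutorial


def ground_many(
    ground: GroundFn,
    screenshot_b64: str,
    descriptions: Iterable[str],
    max_workers: int = 4,
) -> Dict[str, Point]:
    """Ground each description against the same screenshot concurrently."""
    unique = list(dict.fromkeys(descriptions))  # de-duplicate, keep order
    if not unique:
        return {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each
        # description with its own result.
        points = pool.map(lambda d: ground(screenshot_b64, d), unique)
        return dict(zip(unique, points))


def estimate_cost(num_calls: int) -> Decimal:
    """Estimated spend in USD for num_calls grounding requests."""
    return PRICE_PER_CALL * num_calls


# Hypothetical stand-in so the sketch runs without network access.
def fake_ground(screenshot_b64: str, description: str) -> Point:
    return (len(description), len(screenshot_b64))


ui_map = ground_many(fake_ground, "AAAA", ["submit button", "search box"])
print(ui_map)
print(estimate_cost(len(ui_map)))
```

Coordinates produced this way describe the screenshot they were grounded against, so a map like `ui_map` is presumably only safe to reuse while the screen still matches that screenshot.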

Start grounding UI elements to coordinates with the /v1/ground endpoint. Pair it with a computer use agent to automate clicks, inputs, and scrolling on real desktops and browsers. Get your API key at https://coasty.ai/developers.
