Tutorial

Ground UI Elements to Coordinates with the Coasty Computer Use API

Marcus Sterling | 6 min

Most UI automation tools rely on brittle selectors such as IDs, XPath expressions, or CSS classes. When the layout changes, those selectors break. The /v1/ground endpoint addresses this: you describe a UI element in plain language and receive its screen coordinates in return. You get a stable, human-like way to click buttons and fill inputs without maintaining selector maps.

How /v1/ground works

The /v1/ground endpoint takes a base64-encoded screenshot and a natural-language description of the element you want to target. The server analyzes the screen image, identifies the matching visual region, and returns its top-left x and y coordinates. This is useful when you want to compose actions with other tools such as pyautogui, or when you want to ground the vision system of a computer use agent. Each call costs $0.03.

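bash
# The heredoc lets the shell expand $(...); inside single quotes it would be sent literally.
# tr -d '\n' removes the line wrapping that some base64 implementations add.
curl -X POST https://coasty.ai/v1/ground \
  -H "X-API-Key: $COASTY_API_KEY" \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "screenshot": "$(base64 < screenshot.png | tr -d '\n')",
  "description": "the blue submit button with text create account"
}
EOF

python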
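import base64
import os
import requests

# Read the key from the environment; this raises KeyError if it is not set.
API_KEY = os.environ["COASTY_API_KEY"]
url = "https://coasty.ai/v1/ground"

# Base64-encode the screenshot so it can travel inside the JSON body.
with open("screenshot.png", "rb") as f:
    screenshot_b64 = base64.b64encode(f.read()).decode()

payload = {
    "screenshot": screenshot_b64,
    "description": "the blue submit button with text create account"
}

# json= serializes the payload and sets the Content-Type header.
resp = requests.post(url, headers={"X-API-Key": API_KEY}, json=payload, timeout=30)
resp.raise_for_status()  # raise on 4xx/5xx responses
result = resp.json()
print(result)  # example: {"x": 120, "y": 450, "confidence": 0.92}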
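
To compose the result with pyautogui, the returned point usually has to be expressed in the coordinate space that pyautogui uses. The sketch below is one way to do that, not an official client. It assumes the response has the x and y keys from the example above and that coordinates are in screenshot pixels. The function names to_screen_point and click_result are hypothetical. The scaling step matters on high-DPI displays, where a screenshot can have more pixels than the logical screen size reported by pyautogui.size().

```python
from typing import Mapping, Tuple


def to_screen_point(
    result: Mapping[str, float],
    screenshot_size: Tuple[int, int],
    screen_size: Tuple[int, int],
) -> Tuple[int, int]:
    """Map screenshot-pixel coordinates onto the screen's coordinate space."""
    shot_w, shot_h = screenshot_size
    screen_w, screen_h = screen_size
    if shot_w <= 0 or shot_h <= 0:
        raise ValueError("screenshot_size must be positive")
    # Scale each axis by the ratio of screen size to screenshot size.
    x = round(result["x"] * screen_w / shot_w)
    y = round(result["y"] * screen_h / shot_h)
    return x, y


def click_result(result: Mapping[str, float], screenshot_size: Tuple[int, int]) -> None:
    """Click the grounded point with pyautogui (needs a display to run)."""
    import pyautogui  # third-party, imported lazily: pip install pyautogui

    screen_w, screen_h = pyautogui.size()
    x, y = to_screen_point(result, screenshot_size, (screen_w, screen_h))
    pyautogui.click(x, y)
```

For a 2880x1800 screenshot on a 1440x900 logical screen, to_screen_point({"x": 120, "y": 450}, (2880, 1800), (1440, 900)) gives (60, 225). When both sizes match, the point passes through unchanged.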

Key fields and parameters

  • screenshot (request): base64-encoded PNG, JPG, or similar image.
  • description (request): natural-language description of the target element.
  • x, y (response): pixel coordinates of the matched element; always present.
  • confidence (response): how strongly the description matches the visual region.
  • Authentication: send the key as X-API-Key or as Authorization: Bearer; no other authentication headers are needed.
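
The fields above can be wrapped in a couple of small helpers. This is a minimal sketch under stated assumptions: the names auth_headers, GroundResult, and parse_ground_response are hypothetical, the "confidence" key is taken from the example response shown earlier, and confidence is treated as optional because only x and y are described as always present.

```python
from dataclasses import dataclass
from typing import Any, Dict, Mapping, Optional


def auth_headers(api_key: str, use_bearer: bool = False) -> Dict[str, str]:
    """Build one of the two accepted authentication headers."""
    if not api_key:
        raise ValueError("api_key is empty; is COASTY_API_KEY set?")
    if use_bearer:
        return {"Authorization": f"Bearer {api_key}"}
    return {"X-API-Key": api_key}


@dataclass(frozen=True)
class GroundResult:
    x: int
    y: int
    confidence: Optional[float] = None  # assumed optional, see lead-in


def parse_ground_response(body: Mapping[str, Any]) -> GroundResult:
    """Validate the decoded JSON body and convert it to a typed result."""
    missing = [key for key in ("x", "y") if key not in body]
    if missing:
        raise ValueError(f"response is missing fields: {missing}")
    conf = body.get("confidence")
    return GroundResult(
        x=int(body["x"]),
        y=int(body["y"]),
        confidence=None if conf is None else float(conf),
    )
```

Parsing into a frozen dataclass means a malformed response fails at the boundary instead of later, at the moment of the click.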

Grounding an element description to coordinates makes your computer use agent robust against layout changes.

Where this beats brittle automation

Traditional automation tools require you to maintain a list of selectors, XPath expressions, or CSS class names. When a designer reorders columns or restyles buttons with new classes, your scripts break. Grounding lets you describe what you want to click in plain language. The API finds the element in the current screenshot and returns its coordinates. Because it works from the screen image, the approach is the same whether the UI is built with React, Vue, or plain HTML. It also pairs naturally with the /v1/predict endpoint, which can reason about the screen state and compose ground-based actions for a fully autonomous computer use agent.
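
One way to put this into practice is to store only the description and re-ground it against a fresh screenshot before every action, so a moved element is simply found at its new position. The sketch below illustrates that loop under stated assumptions. The function click_by_description and its parameters are hypothetical; capture, ground, and click are callables you supply (for example a screenshot helper, a wrapper around the /v1/ground request, and pyautogui.click). The 0.8 confidence threshold and the retry count are illustrative values, not documented defaults.

```python
from typing import Callable, Mapping, Tuple


def click_by_description(
    description: str,
    capture: Callable[[], str],
    ground: Callable[[str, str], Mapping[str, float]],
    click: Callable[[int, int], None],
    min_confidence: float = 0.8,
    attempts: int = 3,
) -> Tuple[int, int]:
    """Ground a description on a fresh screenshot, then click the match."""
    for _ in range(attempts):
        screenshot_b64 = capture()  # new screenshot on every attempt
        result = ground(screenshot_b64, description)
        # A missing confidence is treated as 0.0, so it never passes the check.
        if result.get("confidence", 0.0) >= min_confidence:
            x, y = int(result["x"]), int(result["y"])
            click(x, y)
            return x, y
    raise LookupError(
        f"no confident match for {description!r} after {attempts} attempts"
    )
```

Passing the three callables in keeps the loop independent of any particular screenshot or input library and makes it easy to exercise with fakes.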

Use /v1/ground to bridge vision and action. Describe UI elements, get coordinates, and build stable computer use agents. Get a free API key at https://coasty.ai/developers and start grounding today.
