Tutorial

Ground UI Elements to Coordinates with the /v1/ground Endpoint

Sophia Martinez | 6 min

Browsers and apps change. CSS selectors break. XPath selectors break. When you rely on a specific class name or ID, a UI update can stop a computer use agent in its tracks. The /v1/ground endpoint addresses this by mapping a natural language description to absolute x,y coordinates on a screen. You describe what you see, and the API returns the click location. No selectors and no fragile patterns, just a click target derived from a screenshot.

How /v1/ground works

You send a screenshot and a text description to /v1/ground. The server analyzes the image, matches the description, and returns the top-left coordinate of the element. This endpoint is billed at $0.03 per request.

bash
# screenshot.png is a placeholder path to your captured image
curl https://coasty.ai/v1/ground \
  -H "X-API-Key: $COASTY_API_KEY" \
  -F 'screenshot=@screenshot.png' \
  -F 'description=click the big blue submit button'

Request and response

The request is a multipart form with two fields: screenshot (the image file) and description (a string). The response is JSON with x and y fields in pixels, for example { "x": 150, "y": 320 }. Pass these coordinates to pyautogui.click(x, y) or the equivalent click call in your automation code.
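For readers who would rather call the endpoint from Python than from curl, here is a minimal sketch using only the standard library. The URL, the X-API-Key header, and the screenshot and description field names come from the example above. The helper names, the file part's filename and Content-Type, and the 30-second timeout are assumptions made for illustration, not documented behavior.

```python
import json
import urllib.request
import uuid

GROUND_URL = "https://coasty.ai/v1/ground"  # taken from the curl example


def build_multipart(description, image_bytes, filename="screenshot.png"):
    """Encode both form fields as multipart/form-data.

    Returns (body, content_type). The filename and the part's
    Content-Type are assumptions; the API may not require them.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="description"\r\n\r\n'
        f"{description}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="screenshot"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + image_bytes + tail, f"multipart/form-data; boundary={boundary}"


def parse_ground_response(raw):
    """Extract (x, y) from a JSON body shaped like {"x": 150, "y": 320}."""
    data = json.loads(raw)
    return int(data["x"]), int(data["y"])


def ground(image_path, description, api_key):
    """POST a screenshot and a description, return the (x, y) pixel pair."""
    with open(image_path, "rb") as f:
        body, content_type = build_multipart(description, f.read())
    request = urllib.request.Request(
        GROUND_URL,
        data=body,
        method="POST",
        headers={"X-API-Key": api_key, "Content-Type": content_type},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_ground_response(response.read())
```

A call might look like ground("screenshot.png", "the big blue submit button", api_key), with the returned pair passed straight to your click function. Splitting encoding and parsing into their own functions keeps them testable without a network connection.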

Grounding turns a screenshot and a description into clickable coordinates. Billed $0.03 per call.

Why this beats brittle automation

Modern UIs use shadow DOM, React portals, dynamic IDs, and hidden containers. Selectors often depend on these unstable parts. Computer use agents that ground to coordinates work on what the user sees, not on implementation details. If the layout changes but the element still looks the same to a user, the description stays valid and the API finds the new location. This makes your agents easier to maintain and more resilient to frequent front-end refactors.

Integrate with a computer use agent

A typical loop combines prediction and grounding. Capture a screenshot, then call /v1/predict (or /v1/sessions/{id}/predict) to get an action. If the action is a click on a UI element, call /v1/ground with the same screenshot and the description from the original instruction. Pass the returned (x, y) to the click. This keeps your agent aligned with visual state without maintaining a selector database.
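One iteration of that loop can be sketched as follows. The article does not show the shape of the /v1/predict response, so the action dictionary with "type" and "description" keys is a hypothetical stand-in. The four callables are placeholders you would wire to your own screenshot capture, API clients, and click function (for example pyautogui.click).

```python
from typing import Callable, Optional, Tuple


def run_step(
    instruction: str,
    capture: Callable[[], bytes],
    predict: Callable[[bytes, str], dict],
    ground: Callable[[bytes, str], Tuple[int, int]],
    click: Callable[[int, int], None],
) -> Optional[Tuple[int, int]]:
    """Run one capture -> predict -> ground -> click iteration.

    Returns the clicked (x, y), or None when the predicted action
    is not a click. The action dict layout is an assumption.
    """
    screenshot = capture()
    action = predict(screenshot, instruction)

    # Only click actions need grounding; other action types are left
    # to the caller to handle.
    if action.get("type") != "click":
        return None

    # Fall back to the full instruction if the (hypothetical) action
    # carries no element description of its own.
    description = action.get("description", instruction)
    x, y = ground(screenshot, description)
    click(x, y)
    return (x, y)
```

Injecting the callables keeps the loop independent of any particular HTTP client or input library, so it can be exercised with fakes before it is pointed at a real screen.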

Grounding gives you a bridge from natural language to precise clicks. Build robust computer use agents that adapt to UI changes. Get a key at https://coasty.ai/developers and start grounding.
