Ground UI Elements to Coordinates with /v1/ground
Selecting web elements by class or ID breaks when a UI changes. You need a computer use agent that sees the screen and understands what it finds. The Coasty /v1/ground endpoint turns a natural language description of an element into screen coordinates: the x, y position and the size of its bounding box. This lets your agent click and type on real UI elements without brittle selectors.
How it works
The /v1/ground endpoint takes a base64-encoded screenshot and a human-readable description of an element. It returns the element's bounding box as x, y, width, and height values. The endpoint locates the element with a server-side computer vision model. This is a paid operation: each ground request costs $0.03. Call /v1/ground after you capture a screenshot with your vision endpoint.
#!/bin/bash
# Load your API key from a local file and export it for this session
export COASTY_API_KEY="$(cat ~/.coasty_key)"
# Base64-encoded screenshot (replace with your screenshot)
SCREENSHOT="$(base64 -i screenshot.png | tr -d '\n')"
# Description of the element to ground
ELEMENT_DESC="The submit button in the center of the screen"
curl -X POST https://coasty.ai/v1/ground \
-H "X-API-Key: $COASTY_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"screenshot": "'"$SCREENSHOT"'",
"element_description": "'"$ELEMENT_DESC"'"
}'
Ground request fields
- screenshot (string): base64-encoded PNG image of the screen
- element_description (string): human-readable description of the element
- Returns: x, y, width, height coordinates of the bounding box
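A description that contains double quotes would break a hand-assembled JSON body, and a large screenshot can make the curl command line very long. The following is a minimal sketch of one way to build the request body in the shell and send it from a file. The helper names `json_escape` and `build_ground_payload` and the file name `body.json` are illustrative assumptions, not part of the Coasty API, and the escaping covers only backslashes and double quotes.

```shell
#!/bin/sh
# Hypothetical helper: escape backslashes and double quotes so the text is
# safe inside a JSON string. Control characters (for example newlines) are
# not handled in this sketch.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: print a request body with the two documented fields.
#   $1 = base64-encoded screenshot, $2 = element description
build_ground_payload() {
  printf '{"screenshot":"%s","element_description":"%s"}\n' \
    "$1" "$(json_escape "$2")"
}

# Usage (file names are placeholders):
#   build_ground_payload "$(base64 < screenshot.png | tr -d '\n')" \
#     'The "Submit" button' > body.json
#   curl -X POST https://coasty.ai/v1/ground \
#     -H "X-API-Key: $COASTY_API_KEY" \
#     -H "Content-Type: application/json" \
#     --data-binary @body.json
```

Reading the body from a file with `--data-binary @body.json` keeps the screenshot out of the curl argument list. If a JSON tool is available in your environment, it is a more robust choice than the sed-based escaping shown here.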
Ground response fields
- x (number): left edge of the element
- y (number): top edge of the element
- width (number): width of the element
- height (number): height of the element
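Because x and y mark the top-left corner of the box, an agent that wants to click the element usually aims for the center rather than the corner. The sketch below shows that arithmetic. The function name `bbox_center` is a hypothetical helper, and it assumes the four values have already been extracted from the response and are expressed in screenshot pixels.

```shell
#!/bin/sh
# Hypothetical helper: print the center point of a bounding box.
#   $1 = x (left edge), $2 = y (top edge), $3 = width, $4 = height
# awk handles the arithmetic so decimal inputs are accepted; the result is
# truncated to whole pixels by the %d format.
bbox_center() {
  awk -v x="$1" -v y="$2" -v w="$3" -v h="$4" \
    'BEGIN { printf "%d %d\n", x + w / 2, y + h / 2 }'
}

# Example: a box at (100, 200) that is 80 wide and 40 tall
bbox_center 100 200 80 40   # prints "140 220"
```

If the screenshot was scaled before upload, the center would also need to be mapped back to real screen coordinates before clicking.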
/v1/ground costs $0.03 per request and returns x, y, width, height coordinates for visual element selection.
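With a flat price of $0.03 per request, the cost of a run grows linearly with the number of ground calls. A small sketch of that estimate follows, worked in whole cents so the shell's integer arithmetic stays exact. The function names are illustrative, and the sketch ignores any other charges your account may have.

```shell
#!/bin/sh
# Hypothetical helper: cost in cents for a number of /v1/ground requests,
# at the stated price of 3 cents ($0.03) per request.
ground_cost_cents() {
  echo $(( $1 * 3 ))
}

# Hypothetical helper: format a whole number of cents as dollars.
format_usd() {
  printf '$%d.%02d\n' $(( $1 / 100 )) $(( $1 % 100 ))
}

# Example: 250 ground requests
format_usd "$(ground_cost_cents 250)"   # prints "$7.50"
```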
Where this beats brittle automation
API-only tools rely on selectors like class names, IDs, or XPath, so a single UI change can break your automation. A computer use agent that sees the screen does not depend on that markup: it describes what it needs in plain English and resolves coordinates on demand from the current screenshot. This works across browsers, desktop apps, and any UI that renders to a screen. You do not need to maintain selector maps or handle dynamic classes.
Use /v1/ground to turn natural language descriptions into precise UI locations. Pair it with /v1/predict for full computer use flows. Get your API key at https://coasty.ai/developers and start building UI automation that sees and acts like a human.