OSWorld-Style Synthetic Benchmarks for Computer Use Agents

Priya Patel | 7 min

Computer use agents that can navigate desktops and browsers still struggle to generalize. Real-world interactions are messy. They include rare events, inconsistent UI states, and constantly changing tools. This makes it hard to build reliable benchmarks or train models that actually work in production. The solution is not to abandon real data but to complement it with synthetic data that captures realistic interaction patterns at scale.

What OSWorld shows about real benchmarks

OSWorld is a widely used benchmark for computer use agents. It runs agents in real desktop environments against a defined task list. Agents must complete tasks like editing a document, updating system settings, or browsing a webpage. The score reflects how often the agent reaches the goal without human help. OSWorld revealed that even top models still fail 20 to 40% of tasks because they cannot handle unexpected UI states or rare workflows. The problem is not just accuracy. Real scenarios are too varied for any single dataset to cover, and that limits how well benchmarks can predict performance in the wild.
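
To make the scoring rule concrete, here is a minimal sketch of a success-rate calculation. The TaskResult fields and the sample task IDs are hypothetical and are not taken from OSWorld's actual harness. The only thing carried over from the description above is the assumption that a task counts as solved when the agent reaches the goal without human help.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one benchmark task (hypothetical schema, for illustration)."""
    task_id: str
    reached_goal: bool
    needed_human_help: bool


def success_rate(results: Iterable[TaskResult]) -> float:
    """Fraction of tasks where the agent reached the goal without human help."""
    results = list(results)
    if not results:
        raise ValueError("need at least one task result")
    solved = sum(1 for r in results if r.reached_goal and not r.needed_human_help)
    return solved / len(results)


# Made-up results for four tasks, not real benchmark data.
sample_results = [
    TaskResult("edit-document", reached_goal=True, needed_human_help=False),
    TaskResult("update-settings", reached_goal=True, needed_human_help=True),
    TaskResult("browse-webpage", reached_goal=False, needed_human_help=False),
    TaskResult("export-spreadsheet", reached_goal=True, needed_human_help=False),
]

print(f"success rate: {success_rate(sample_results):.0%}")  # prints "success rate: 50%"
```

A single aggregate like this hides which UI states caused the failures, which is part of why a fixed task list predicts real-world performance only loosely.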

Why synthetic data solves coverage gaps

Synthetic data lets you explore scenarios that are hard or expensive to record. You can generate thousands of task trajectories that combine different tools, workflows, and error conditions. For example, you can simulate a sequence where a user opens a spreadsheet, imports data, formats cells, and then exports the result. You can also create edge cases like a missing file, a network timeout, or an unexpected dialog. Synthetic data generation tools create these sequences by running computer use agents in controlled environments. They record mouse clicks, keyboard inputs, screen states, and natural language instructions. The result is a dataset that covers rare but realistic paths, which improves both training and evaluation.
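
One way to picture such a record is the sketch below. It is an illustrative schema, not the format of any particular tool: Step, Trajectory, FAULTS, spreadsheet_trajectory, and inject_fault are all hypothetical names, and the spreadsheet steps are hand-written here rather than captured from a running agent.

```python
import json
from dataclasses import asdict, dataclass, field, replace
from typing import List, Optional


@dataclass
class Step:
    action: str                 # e.g. "click", "type", "observe"
    target: str                 # human-readable UI element name
    text: Optional[str] = None  # keyboard input, if any
    screen_state: str = ""      # label (or screenshot path) for the resulting screen


@dataclass
class Trajectory:
    instruction: str            # natural language task description
    steps: List[Step] = field(default_factory=list)
    injected_fault: Optional[str] = None


# Edge cases named in the text above.
FAULTS = ("missing_file", "network_timeout", "unexpected_dialog")


def spreadsheet_trajectory() -> Trajectory:
    """Hand-written happy path: open, import, format, export."""
    return Trajectory(
        instruction="Import sales.csv, format the header row, and export as PDF.",
        steps=[
            Step("click", "Spreadsheet app icon", screen_state="empty_sheet"),
            Step("type", "Import path field", text="sales.csv", screen_state="data_loaded"),
            Step("click", "Bold button", screen_state="header_formatted"),
            Step("click", "Export as PDF", screen_state="export_done"),
        ],
    )


def inject_fault(base: Trajectory, fault: str, at_index: int) -> Trajectory:
    """Return a copy of `base` with a fault observation inserted at `at_index`."""
    if fault not in FAULTS:
        raise ValueError(f"unknown fault: {fault}")
    if not 0 <= at_index <= len(base.steps):
        raise IndexError("at_index is outside the trajectory")
    steps = [replace(s) for s in base.steps]  # copy steps so `base` stays untouched
    steps.insert(at_index, Step("observe", fault, screen_state=f"fault:{fault}"))
    return Trajectory(base.instruction, steps, injected_fault=fault)


faulty = inject_fault(spreadsheet_trajectory(), "missing_file", at_index=1)
print(json.dumps(asdict(faulty), indent=2))
```

Keeping the fault as an explicit field makes it easy to filter a dataset by error condition later, for example to check that rare paths are actually represented.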

Concrete benefits with real numbers

Teams that added synthetic trajectories to their training data report measurable gains. One study showed that augmenting a computer use model with 10,000 synthetic trajectories lifted task success from 58% to 71%. Another team reduced the number of real human evaluations needed by 60% after validating a synthetic benchmark on a smaller set of tasks. Synthetic data also lowers the cost of debugging. When a model fails, you can trace the failure to a specific synthetic trajectory and reproduce the exact state that caused the error. This speeds up iteration and reduces reliance on expensive real-world testing.
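
It helps to read the 58% to 71% figure in more than one way. The short sketch below only does arithmetic on the two numbers quoted above. The helper names are made up for illustration, and nothing here adds new results.

```python
def percentage_point_gain(before_pct: float, after_pct: float) -> float:
    """Absolute change, in percentage points."""
    return after_pct - before_pct


def relative_change(before: float, after: float) -> float:
    """Change as a fraction of the starting value."""
    if before == 0:
        raise ValueError("starting value must be non-zero")
    return (after - before) / before


before, after = 58.0, 71.0  # task success rates quoted in the text, in percent

points = percentage_point_gain(before, after)
relative_success = relative_change(before, after)
relative_failures = relative_change(100 - before, 100 - after)

print(f"absolute gain: {points:.0f} points")                   # prints "absolute gain: 13 points"
print(f"relative gain in success: {relative_success:.1%}")     # prints "relative gain in success: 22.4%"
print(f"relative change in failures: {relative_failures:.1%}") # prints "relative change in failures: -31.0%"
```

Put differently, the same result is a 13 point lift, a roughly 22% relative improvement in success, or about 31% fewer failed tasks. Stating which one you mean avoids confusion when comparing reports.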

Key tradeoffs and best practices

  • Quality depends on how realistic the environment and the agent simulation are. High fidelity requires reproducing real UI states and tool behaviors.
  • Coverage improves with diverse task combinations. Avoid clustering around common workflows.
  • Bias can creep in if you only simulate tasks your own team cares about. Include edge cases from real logs and user reports.
  • Labeling accuracy matters. Synthetic trajectories must be annotated with clear goals, intermediate steps, and error conditions.
  • Hybrid approaches work best. Combine synthetic data with a small but high-quality real dataset to ground the model in genuine interactions.
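
The coverage and hybrid points above can be sketched in a few lines. This is a rough illustration under stated assumptions: examples are plain dicts with hypothetical "source" and "workflow" keys, and the 75% synthetic share is an arbitrary value chosen for the demo, not a recommendation.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence


def mix_datasets(real: Sequence[dict], synthetic: Sequence[dict],
                 synthetic_fraction: float, seed: int = 0) -> List[dict]:
    """Keep every real example and sample synthetic ones to hit the target share."""
    if not 0 <= synthetic_fraction < 1:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    rng = random.Random(seed)  # seeded so the mix is reproducible
    wanted = round(len(real) * synthetic_fraction / (1 - synthetic_fraction))
    n_synth = min(wanted, len(synthetic))  # cannot sample more than we have
    mixed = list(real) + rng.sample(list(synthetic), n_synth)
    rng.shuffle(mixed)
    return mixed


def workflow_shares(examples: Sequence[dict]) -> Dict[str, float]:
    """Share of examples per workflow; one dominant share signals clustering."""
    counts = Counter(e["workflow"] for e in examples)
    total = sum(counts.values())
    return {workflow: count / total for workflow, count in counts.items()}


# Made-up data: 10 real examples and 100 synthetic ones across four workflows.
real = [{"source": "real", "workflow": "edit_document"} for _ in range(10)]
synthetic = [
    {"source": "synthetic", "workflow": w}
    for w in ("spreadsheet_export", "missing_file", "network_timeout", "unexpected_dialog")
    for _ in range(25)
]

mixed = mix_datasets(real, synthetic, synthetic_fraction=0.75)
print(len(mixed))  # prints 40 (10 real + 30 synthetic)
for workflow, share in sorted(workflow_shares(mixed).items()):
    print(f"{workflow}: {share:.0%}")
```

Checking workflow shares before training is a cheap way to notice when the mix has drifted toward the common workflows the list above warns about.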

Synthetic data does not replace real interaction data. It expands what you can test, reveals failure modes, and makes benchmarks more reliable.

How Coasty fits

Coasty runs computer use agents on real desktops and browsers. This lets you capture realistic interaction data and turn it into custom synthetic datasets and trajectories. The offering is custom and contact-led. You talk to the Coasty data team about your specific use case, and they build a synthetic data solution tailored to your needs. There is no fixed package or public price list. The conversation determines what data you need, how it will be generated, and how it will be aligned with your evaluation pipeline.

If you want to build stronger computer use agents and more reliable benchmarks, synthetic data is a practical lever. To explore how Coasty can help you generate custom synthetic datasets, book a data call with the Coasty data team at https://cal.com/coasty/coasty-data-call.
