
OSWorld-Style Synthetic Benchmarks for Computer Use Agents

Daniel Kim | 6 min read

OSWorld showed that AI agents can be evaluated on real desktop and browser environments, with open-ended tasks that span web browsing, desktop applications, and file operations. But running agents on live systems is slow, noisy, and sometimes risky, and real datasets are expensive to collect and hard to share. Synthetic benchmarks offer a faster, safer alternative that mimics real interaction patterns.

OSWorld-style benchmarks rely on real logs

OSWorld-style benchmarks run computer use agents in real desktop and browser environments and score each task by its outcome. A model's success rate on OSWorld is a direct measure of how well it can follow instructions and operate UI controls. This approach is credible because the data comes from actual usage, but it has real limits. You need hardware, time, and careful permission handling to gather enough sessions. Reproducibility suffers when agents hit blockers or when real websites change their layout. You also cannot easily share raw sessions or guarantee privacy compliance without scrubbing them first.

Synthetic data captures interaction patterns, not screenshots

The key insight is that many agents learn from sequences of actions, not from raw images. A synthetic dataset can encode the same click paths, form fills, and navigation choices observed in real logs, without storing screen pixels or cookies. Coasty’s computer use agents run on real desktops and browsers, recording interaction trajectories. Those trajectories are then transformed into synthetic datasets: sequences of clicks, keystrokes, and state transitions that mimic genuine user behavior. For common tasks like form filling or file management, synthetic trajectories can come within a few percentage points of real-log accuracy, though the gap depends on the task and should be measured rather than assumed. The cost per task drops sharply because you do not need continuous live sessions. You can also inject edge cases and errors that rarely appear in real logs.
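To make the idea of a pixel-free trajectory concrete, here is a minimal Python sketch. It is not Coasty's actual schema: the `Step` and `Trajectory` classes, the field names, and the `inject_misclick` helper are all hypothetical, chosen only to illustrate how clicks, keystrokes, and state transitions could be stored symbolically and how a rare error could be spliced into an otherwise clean run.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass(frozen=True)
class Step:
    """One symbolic interaction (hypothetical schema, no pixels or cookies)."""
    action: str            # "click" or "type"
    target: str            # abstract UI element id, not screen coordinates
    value: str = ""        # typed text, stored as a placeholder rather than real data
    state_after: str = ""  # symbolic label of the UI state after the step


@dataclass
class Trajectory:
    task: str
    steps: list[Step] = field(default_factory=list)


def inject_misclick(traj: Trajectory, at: int, wrong_target: str) -> Trajectory:
    """Return a copy of `traj` with a mis-click and a recovery step before index `at`."""
    prev_state = traj.steps[at - 1].state_after if at > 0 else "start"
    detour = [
        Step("click", wrong_target, state_after="unexpected_dialog"),
        Step("click", "dialog_close", state_after=prev_state),  # recover to prior state
    ]
    return Trajectory(traj.task, traj.steps[:at] + detour + traj.steps[at:])


# A clean form-filling trajectory with placeholder values instead of user data.
base = Trajectory(
    task="submit_contact_form",
    steps=[
        Step("click", "name_field", state_after="name_focused"),
        Step("type", "name_field", value="<NAME>", state_after="name_filled"),
        Step("click", "submit_button", state_after="form_submitted"),
    ],
)

# An edge-case variant: the agent clicks a help icon by mistake, then recovers.
variant = inject_misclick(base, at=2, wrong_target="help_icon")
print(json.dumps(asdict(variant), indent=2))
```

Because each step names an abstract element and a symbolic state, the same record can be serialized, shared, and perturbed without ever touching a screenshot. A real pipeline would need a richer state model than the string labels assumed here.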

Why synthetic benchmarks matter for evaluation

Synthetic benchmarks let you run thousands of agent evaluations overnight on a single GPU. You can test dozens of models against the same set of tasks, isolate failure modes, and iterate on prompts or tools without waiting for real-world runs. For safety-critical domains like banking or healthcare, synthetic tasks can simulate workflows that are too risky or regulated to capture in live logs. You can also design tasks that deliberately cover gaps in your real dataset, such as complex multi-step interactions or error recovery. The result is a more robust evaluation surface that scales with your experimentation needs.
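As a rough illustration of testing several agents against the same task set and isolating failure modes, the sketch below models each task as a small symbolic state machine and tallies outcomes per agent. Everything in it is assumed for the example: the `TASKS` table, the action strings, and the two scripted agents are hypothetical stand-ins, not a real benchmark or a real model.

```python
from collections import Counter
from typing import Callable

# Each hypothetical task is a state machine: (state, action) -> next state.
TASKS = {
    "submit_form": {
        "start": "start", "goal": "submitted", "max_steps": 4,
        "transitions": {
            ("start", "click:name"): "name_focused",
            ("name_focused", "type:name"): "name_filled",
            ("name_filled", "click:submit"): "submitted",
        },
    },
    "dismiss_dialog_then_save": {
        "start": "dialog_open", "goal": "saved", "max_steps": 3,
        "transitions": {
            ("dialog_open", "click:close"): "editor",
            ("editor", "click:save"): "saved",
        },
    },
}

Agent = Callable[[str, str], str]  # (task name, current state) -> action


def run_task(agent: Agent, task: str) -> str:
    """Run one agent on one task and return 'success' or a failure label."""
    spec = TASKS[task]
    state = spec["start"]
    for _ in range(spec["max_steps"]):
        if state == spec["goal"]:
            return "success"
        nxt = spec["transitions"].get((state, agent(task, state)))
        if nxt is None:
            return f"invalid_action@{state}"  # record where the agent went wrong
        state = nxt
    return "success" if state == spec["goal"] else "step_budget_exceeded"


def evaluate(agents: dict[str, Agent]) -> dict[str, dict]:
    """Score every agent on the same tasks; report success rate and failure modes."""
    report = {}
    for name, agent in agents.items():
        outcomes = Counter(run_task(agent, t) for t in TASKS)
        report[name] = {
            "success_rate": outcomes["success"] / len(TASKS),
            "failures": {k: v for k, v in outcomes.items() if k != "success"},
        }
    return report


GOOD_POLICY = {
    "start": "click:name", "name_focused": "type:name", "name_filled": "click:submit",
    "dialog_open": "click:close", "editor": "click:save",
}
# Same policy, except it ignores the blocking dialog and tries to save immediately.
NO_RECOVERY_POLICY = {**GOOD_POLICY, "dialog_open": "click:save"}

report = evaluate({
    "scripted": lambda task, state: GOOD_POLICY[state],
    "no_recovery": lambda task, state: NO_RECOVERY_POLICY[state],
})
print(report)
```

The failure label records the state in which the agent took an invalid action, so an error-recovery gap shows up as a named failure mode instead of just a lower score. In practice the agent callables would wrap model calls, and the task table would be generated from synthetic trajectories instead of written by hand.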

Synthetic trajectories from real interaction patterns can match real log performance for common tasks while dramatically lowering cost and increasing safety.

How Coasty fits into this workflow

Coasty runs computer use agents on real desktops and browsers, capturing realistic interaction data and producing synthetic datasets and trajectories. The service is custom and contact-led: you talk to the Coasty data team about your specific tasks, constraints, and evaluation targets. They can generate synthetic benchmarks tailored to your models, tools, and safety requirements. There is no fixed product or public pricing. Each engagement is scoped to your use case, whether you need a single task set or a large-scale benchmarking infrastructure.

If you want OSWorld-style benchmarks for your computer use agents but need them faster, cheaper, and safer, Coasty can help. Book a data call with the Coasty data team at https://cal.com/coasty/coasty-data-call to discuss your use case and explore what custom synthetic data can do for you.
