Three Properties of the Current Pipeline
Suppose you want to evaluate a computer-use agent on an HR platform. You need a seed state (right employees, right permissions, right leave balances), a task prompt, a golden run executed by hand, and a rubric. That’s one environment. Now do it for an ERP system. Now do it for a project management tool. Now do it for every edge case you care about.
The trouble is:
Environments don’t compose. The seed state for a leave approval shares almost nothing with an inventory reconciliation. Each encodes domain-specific semantics that resist abstraction. You cannot write a function that generalizes across them.
Coverage is sparse. You test what you think of. The long tail (unusual state configurations, rare feature interactions) goes unexamined. This is exactly where capable agents fail.
Benchmarks saturate. Benchmarks grow linearly; agent capability grows exponentially. You stop measuring progress and start measuring benchmark-fit.
N engineers produce ~kN environments. You can optimize k. You cannot escape the linearity.
The Asymmetry
FDM-1 demonstrates that the agent side of computer action scales beautifully. Train on more video, get better agents. The dataset is internet-scale. The architecture handles hours of context. Scaling laws apply cleanly.
The environment side has no such property. Every eval task is a bespoke artifact, hand-authored by someone who understands both the application’s semantics and the evaluation’s intent. There is no dataset of environments to scale on. There is no architecture that compresses the problem. There is just labor.
This is the asymmetry. And it means the binding constraint on computer-use AI is shifting from “can the agent act?” to “can we generate enough worlds to evaluate and train it in?”
An Environment Model
Here is the idea: the same intuition FDM-1 applied to agents, now applied to environments.
The proposal is a model that takes as input the surface of a web application (DOM structure, API topology, state schema) and jointly generates complete evaluation worlds. Not task prompts in isolation. Not templates. Coherent units consisting of:
Seed states. Database configurations that are semantically coherent, not random. The kind of state that creates interesting decision surfaces for an agent.
Task specs. Instructions where difficulty comes from requiring reasoning across multiple views, not from length.
Adversarial inputs. Files with merged cells, mixed date formats, hidden sheets, generated from understanding which malformations are meaningful for the task, not from a library.
Rubrics. Machine-evaluable success conditions generated by reasoning backward from the task and the app’s state model.
The key property is jointness. A template system can produce a thousand variations of “change the status of ticket X to Y.” What it cannot do is generate a seed state where that change has non-obvious downstream consequences, an adversarial file that triggers exactly the wrong behavior, and a rubric that checks the cascade. This compositional generation under semantic constraints is what makes it a learned model rather than a programmatic system.
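To make the shape of one such unit concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Environment` class, its field names, the ticket schema, and the file name are hypothetical, and the essay does not prescribe any particular representation. The point it shows is that the rubric is written against the same state model as the seed state, so it can check the downstream cascade and not just the named change.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical types: application state as a nested dict, and a rubric
# item as a predicate over the final state.
State = Dict[str, Any]
Check = Callable[[State], bool]


@dataclass
class Environment:
    """One jointly generated evaluation unit (illustrative schema only)."""
    seed_state: State
    task: str
    adversarial_inputs: List[str] = field(default_factory=list)
    rubric: Dict[str, Check] = field(default_factory=dict)

    def score(self, final_state: State) -> Dict[str, bool]:
        # Evaluate every rubric item against the state the agent left behind.
        return {name: check(final_state) for name, check in self.rubric.items()}


# Seed state in which closing T-1 has a downstream consequence for T-2.
env = Environment(
    seed_state={
        "tickets": {
            "T-1": {"status": "open", "blocks": ["T-2"]},
            "T-2": {"status": "blocked", "blocks": []},
        }
    },
    task="Close ticket T-1.",
    adversarial_inputs=["export_with_merged_cells.xlsx"],  # hypothetical file
    rubric={
        "t1_closed": lambda s: s["tickets"]["T-1"]["status"] == "closed",
        "t2_unblocked": lambda s: s["tickets"]["T-2"]["status"] != "blocked",
    },
)

# Final state left by an agent that edits only the named ticket.
shallow_final_state = {
    "tickets": {
        "T-1": {"status": "closed", "blocks": ["T-2"]},
        "T-2": {"status": "blocked", "blocks": []},
    }
}

result = env.score(shallow_final_state)
print(result)
```

In this toy case the agent passes the surface check and fails the cascade check, which is the kind of failure a template that only varies X and Y would never surface.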
What Changes
If this works, three things happen.
The linearity breaks. Marginal cost of a new environment drops from hours of engineering to seconds of compute. The constraint moves from human throughput to GPU throughput, which is the side that has scaled exponentially.
Coverage explodes. A generative model trained on the full surface of an application can explore the long tail that hand-authored benchmarks miss.
Evaluation becomes adaptive. Instead of a fixed benchmark that agents saturate, you have a generator that produces harder environments in response to improving capabilities. The benchmark is no longer a set. It’s a distribution that shifts.
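One simple way to picture a distribution that shifts is a feedback rule that nudges the generator's difficulty setting toward a target pass rate. The sketch below is an assumption, not a description of any existing system: `next_difficulty`, the scalar difficulty in [0, 1], the target of 0.5, and the step size are all hypothetical, and a real generator would likely condition on much richer signals than a single pass rate.

```python
def next_difficulty(difficulty: float, pass_rate: float,
                    target: float = 0.5, step: float = 0.1) -> float:
    """Move a scalar difficulty knob toward a target agent pass rate.

    Hypothetical controller: raise difficulty when agents pass too often,
    lower it when they pass too rarely, and clamp the result to [0, 1].
    """
    if pass_rate > target:
        difficulty += step
    elif pass_rate < target:
        difficulty -= step
    return min(1.0, max(0.0, difficulty))


# Observed pass rates from successive (made-up) evaluation rounds.
difficulty = 0.2
history = []
for pass_rate in [0.9, 0.8, 0.7, 0.4]:
    difficulty = next_difficulty(difficulty, pass_rate)
    history.append(round(difficulty, 2))

print(history)
```

The difficulty climbs while the agent is passing comfortably and backs off once the pass rate falls below the target, so the benchmark tracks the agent instead of being saturated by it.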
The Hard Parts
This is cleaner on paper than in practice.
Semantic depth. The model needs to understand apps well enough to generate coherent seed states. That means going beyond DOM structure to entity relationships, business logic, and permission models. Whether this can be learned from observation or requires access to source code is open.
Rubric precision. Generating tasks is creative work with many valid options. Generating rubrics is specification work with a much narrower band of correctness. Errors in rubrics corrupt the entire evaluation signal. This requires a kind of precision generative models don’t typically have.
Bootstrapping. You need high-quality environments to train the generator, but producing those is the bottleneck you’re trying to break. This probably requires a curriculum: start from a small hand-authored seed set, train, filter the output with human review, and iterate.
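The curriculum described here can be sketched as a loop. This is a sketch under stated assumptions, not a training recipe: `bootstrap`, `train`, and `review` are hypothetical names, environments are represented as plain strings, and the toy stand-ins at the bottom exist only so the loop runs end to end.

```python
from typing import Callable, List

Generator = Callable[[int], List[str]]


def bootstrap(seed: List[str],
              train: Callable[[List[str]], Generator],
              review: Callable[[str], bool],
              rounds: int, per_round: int) -> List[str]:
    """Grow a corpus of environments: train, generate, filter, repeat."""
    corpus = list(seed)
    for _ in range(rounds):
        generator = train(corpus)           # fit a generator on the current corpus
        candidates = generator(per_round)   # sample new candidate environments
        # Keep only candidates that pass review and are not already present.
        accepted = [c for c in candidates if review(c) and c not in corpus]
        corpus.extend(accepted)
    return corpus


# Toy stand-ins (assumptions): "training" just remembers the corpus size,
# and "human review" accepts environments with an even index.
def toy_train(corpus: List[str]) -> Generator:
    start = len(corpus)
    return lambda n: [f"env-{start + i}" for i in range(n)]


def toy_review(env: str) -> bool:
    return int(env.split("-")[1]) % 2 == 0


corpus = bootstrap(["env-0", "env-1"], toy_train, toy_review,
                   rounds=2, per_round=4)
print(corpus)
```

The review step is where the human cost stays, so the open question is whether the acceptance rate rises fast enough across rounds for that cost to shrink per environment.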
FDM-1 showed that computer action scales when you find the right data and the right encoder. The environment side is waiting for the same treatment. I think whoever builds this closes the loop on scalable computer-use AI.