Your Workday Is Monetizable RL Data

Software 1.0 easily automates what you can specify. Software 2.0 easily automates what you can verify. -- @karpathy 

For the past decade, large language models have been trained on virtually all of the public data on the internet. This has led to a wide range of impressive use cases: summarization, writing, semantic search, and, most recently, coding agents, the work domain in which LLMs currently excel most.

Until last year, however, LLMs could generate text but could not reliably take actions. Their learning process combined pretraining on static data with fine-tuning from human feedback, producing strong reasoning but inconsistent execution in real environments.

That all changed in 2025. The newest generation of models reached a capability threshold where agentic behavior, including the ability to operate software, manipulate data, and complete multi-step goals, became viable. So what was the step change in post-training that made these models ready for enterprise production? Reinforcement Learning (RL).

That’s because as models began to act, the problem shifted from intelligence to reliability. Making agents reliable now depends on experience, not additional text. And reliability comes from either human feedback (where experts demonstrate how to perform tasks) or verified rewards (where models learn by doing within environments that deliver structured, programmatic feedback).

But in both cases, the main idea is this: Important knowledge about how to do tasks rests in the minds of the expert humans who do them, not in public internet data. Thus, we need to build realistic software environments, or sandboxes, that can simulate human actions in order to train models to perform the tasks of a specific domain effectively.

Take software engineering as an example. Claude Code became a great coding agent in part because Anthropic acquired the legacy code bases of hundreds of failed tech startups on which to train their models. But crucially, models don’t learn directly from the code itself; they learn by interacting with the environment it defines. Because code can be executed and tested, it provides clear, programmatic reward signals (e.g., whether it compiles or passes tests), making it especially well-suited for reinforcement learning compared to more ambiguous domains. The brilliance of this is that a functional codebase is a machine-readable document that inherently reflects accurate and effective human software engineering.
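To make the idea of a programmatic reward signal concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of how any lab actually trains its models: the function name `verifiable_reward`, the toy `add` task, and the binary 1.0/0.0 scoring are all hypothetical, and a real environment would run model output inside an isolated sandbox rather than in-process.

```python
def verifiable_reward(candidate_source: str, test_source: str) -> float:
    """Binary reward: 1.0 if the candidate compiles and passes the tests, else 0.0.

    Assumption: illustrative only. A real environment would execute untrusted
    model output in an isolated sandbox with time and resource limits.
    """
    namespace: dict = {}
    try:
        # "Does it compile?" A SyntaxError here means zero reward.
        exec(compile(candidate_source, "<candidate>", "exec"), namespace)
        # "Does it pass the tests?" A failing assert raises AssertionError.
        exec(compile(test_source, "<tests>", "exec"), namespace)
    except Exception:
        return 0.0
    return 1.0


# Hypothetical task: the model is asked to write `add(a, b)`.
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"

good_attempt = "def add(a, b):\n    return a + b"    # correct
buggy_attempt = "def add(a, b):\n    return a - b"   # runs, but fails the tests
broken_attempt = "def add(a, b) return a + b"        # does not compile

rewards = [
    verifiable_reward(src, tests)
    for src in (good_attempt, buggy_attempt, broken_attempt)
]
print(rewards)  # [1.0, 0.0, 0.0]
```

No human had to grade any of the three attempts. The tests did the grading, which is the property that makes code such a convenient domain for RL.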

But software engineering is unique. Rarely in the workplace is the output of work so structured and so binary in terms of whether it’s right or wrong. Coding is an unusually easy field for an LLM to train on. In many other fields, by contrast, an expert human in the loop is needed to teach the LLM what’s right and what’s wrong in the data for a particular domain. And these verified datasets and sandboxes are extremely valuable.

Frontier labs have been investing heavily in RL. Hundreds of millions of dollars are being allocated to RL-based post-training, a category that barely existed a year ago. This could mark the beginning of the experiential era of AI training, where models improve through concrete interaction rather than observation. If the goal is enterprise implementation across every major domain in our economy, it’s not a stretch to see why the labs have seemingly unlimited budget for high-quality RL to bridge the gap between general and domain-specific expertise. Thus, many startups have popped up to service a market with seemingly unlimited demand.

AI data itself has become one of the fastest-growing markets in history. Frontier labs and hyperscalers are expected to spend roughly $185 billion on AI capex this year, with tens of billions allocated to data. The general rule of thumb is this: for every $1 a lab spends on compute, 10 cents will be spent on data.

Within the broader AI data market, the RL post-training category has begun to dominate frontier research budgets. Industry estimates suggest that hundreds of millions of dollars are already being spent each year, with several labs signaling budgets exceeding one billion dollars to advance agentic models and the environments they depend on. It seems possible there will be dozens of decacorn RL startups founded to service the labs’ insatiable appetite.

Why don’t the labs just bring RL capabilities in-house? We put this question to an Anthropic researcher and got the following response:

“Has NVIDIA ever been tempted to replace TSMC with its own factory? Probably not. I think at some point, that’s what all these guys [the startups we’ve backed] are building: A factory that can churn out tasks and environments [for the foundation models] with incredible fidelity and specificity. And that will be an enduring company and value.”

Throughout the second half of 2025, and continuing into 2026, Village Global has made investments in domain-specific RL companies, each with unique training methods. 

We believe the impact will not stop at the research frontier. As agentic models move into enterprise software, automating finance, analytics, customer support, operations, and even robotics, the same principles apply. Indeed, everything one does in a workday could be monetizable RL for an LLM in the near future.

---

Ben Casnocha is the cofounder and General Partner of Village Global. Max Kilberg is an Investment Partner.

If you're building an RL post-training company, Village Global invests in startups at the earliest stages.
