The Lab as One Policy: Training an All-Agent Research Organization — Part I
In March, Karpathy released autoresearch: a single agent that edits a training script, trains for a fixed five minutes, keeps the change if one metric improves and reverts it otherwise, and loops unattended through roughly a hundred experiments a night.[1] Pointed at nanochat (a codebase he had already spent a long time hand-tuning), it found a missing scalar multiplier in a QK-Norm that he hadn't caught. No organization. No learning. One agent, one metric, a greedy loop, and a human writing the research directions into a markdown file.
The most interesting thing about the system we are building at MV37 isn't that it adds specialist agents, or that it trains the researcher instead of fixing it. It's that autoresearch already works without either, and that fact is the bar every layer we add has to clear.
This post lays out the training pipeline for an organization of LLM agents that collaborate toward a scientific goal — invent a post-training algorithm, find a treatment direction in biology — and, unlike autoresearch, gets better at the act of research itself. We also discuss where the plan might break. The through-line is a problem I've written about in a narrower setting: dense feedback only helps if you know when to trust it.[2] A research organization is the hardest place I know of to manufacture that trust.
Autoresearch is not a strawman. It is the strongest baseline we have, and the one we are most likely to lose to.
I'm writing this before I have results, on purpose. What follows is the plan we're building and testing at MV37: the bets it rests on, not a report of what worked. This is Part I; I'll keep publishing what actually happens as the experiment runs, including the layers that fail to beat autoresearch. Treat everything below as a hypothesis I'm on the hook to test, not a finding.
Why automate the lab
We work on this because the binding constraint on AI progress is moving from compute and data to research labor: the number of good experiments a competent researcher can design, run, and reason about in a day. Autoresearch made the point almost by accident: most of a researcher's day is spent waiting on the GPU, not thinking.[1] If that's the bottleneck, the thing to build is a research org that compounds: one that doesn't just run more experiments but gets better at running them.
A fixed agent in a loop captures only half of that. It removes the waiting (a hundred experiments run while you sleep), but it is no better at the ten-thousandth experiment than at the first, because it re-derives everything each run from the same frozen weights and keeps nothing but the diffs to one codebase. The half we are actually after is the other one: a system whose skill at research accumulates, where solving one problem makes the next easier. Throughput is linear; learning compounds.
This is also the natural next object for where post-training is already heading (verifiable rewards, agentic tool use, and the specialist-then-consolidate pipelines I've written about), so training a research organization is an extension of that trajectory rather than a departure from it.[2] It is the highest-variance bet I know of in the space, and worth making for one reason: of everything you can automate in AI research, the researcher is the only component whose returns compound. We would rather build the thing that learns to do research than hand-tune one more pipeline that doesn't.
1. The baseline that sets the bar
It is worth stating plainly what autoresearch is, because the cleanest way to understand our proposal is as autoresearch with four switches flipped. Autoresearch is the degenerate case of the system we want: one agent rather than an org, a single scalar metric rather than a vector, a greedy keep-or-revert outer loop around a fixed off-the-shelf agent whose weights never change rather than a learned update, and a human supplying the curriculum in program.md. Everything we are designing is one of those switches turned on.
Drawing them as accumulating layers makes the cost of each obvious. V0 is autoresearch itself. V1 replaces the single metric with a constrained objective. V2 replaces the lone agent with an orchestrated organization of specialists. V3 replaces the greedy loop with a learned policy and the human curriculum with an automatic one. Only V3 ever updates the researcher's weights; only V3 can, in principle, compound.
The framing imposes a discipline that runs through the rest of this post: each layer must beat the simpler one on a held-out measure, or it does not ship. A learned orchestrator that cannot beat a greedy fixed agent is several orders of magnitude of compute spent rediscovering autoresearch. Treating V0 as the control rather than the foil is the single most useful decision in the whole plan.
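The gating rule above can be written down in a few lines. This is a minimal sketch, assuming both layers are scored on the same held-out tasks; the function name `ships`, the `margin` parameter, and the paired-mean criterion are hypothetical choices for illustration, not the test the plan will necessarily use.

```python
from statistics import mean

def ships(candidate_scores, baseline_scores, margin=0.0):
    """Return True only if the candidate layer beats the simpler one.

    Both lists hold scores on the SAME held-out tasks, in the same order,
    so the comparison is paired. `margin` is a hypothetical safety margin.
    """
    if len(candidate_scores) != len(baseline_scores):
        raise ValueError("scores must be paired per held-out task")
    if not candidate_scores:
        raise ValueError("need at least one held-out task")
    diffs = [c - b for c, b in zip(candidate_scores, baseline_scores)]
    return mean(diffs) > margin

# Made-up held-out scores, for illustration only.
v0 = [0.50, 0.62, 0.48, 0.70]   # simpler layer (the control)
v1 = [0.55, 0.60, 0.53, 0.74]   # layer under test
print(ships(v1, v0))
```

A real gate would likely want a significance test or a confidence interval on the paired differences rather than a bare mean; the point here is only that the control is evaluated on identical tasks.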
Where we expect to beat it, and where we won't
Autoresearch is greedy hill-climbing with a fixed agent and a single metric, and each of those three properties is a ceiling. The places we expect to win are exactly the places those ceilings bind.
The edge that matters most is transfer. Autoresearch starts every run from the same frozen weights and keeps nothing but the diffs to one codebase; it has no memory across problems, so it is no better at the thousandth problem than at the first. A trained policy accumulates research skill: solving one problem shifts the prior for the next. Across a portfolio, a learning curve bends upward where a memoryless one stays flat. This is the asymptotic case, the only one that can justify the cost, and what the figure below is really about.
Greedy search stalls in local optima. Keep-or-revert only ever accepts an improvement, so it halts wherever no single five-minute edit helps. Autoresearch found a missing scalar because that fix was one local step away; it will not find a change that needs three coordinated steps each of which looks neutral or worse alone. A learned policy with a value function and a proposer can take the locally-worse step and pursue the multi-step plan whose payoff only appears at the end.
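A toy version of the stall makes the argument concrete. The sketch below is not autoresearch's code: the landscape is made up, and `neighbors` is a hypothetical stand-in for "every single edit the agent could propose". It only illustrates that a keep-or-revert loop halts at the first point where no single step improves the metric.

```python
def greedy_keep_or_revert(score, start, neighbors, max_steps=100):
    """Accept a single edit only if it strictly improves the metric."""
    current, best = start, score(start)
    for _ in range(max_steps):
        improved = False
        for candidate in neighbors(current):
            s = score(candidate)
            if s > best:              # keep the edit
                current, best = candidate, s
                improved = True
                break
            # otherwise revert: `current` is left unchanged
        if not improved:
            break                     # no single edit helps, so the loop stalls
    return current, best

# Made-up landscape: index = how many coordinated edits have been applied.
# Reaching index 5 requires passing through indices 3 and 4, which look worse.
LANDSCAPE = [0.0, 1.0, 2.0, 1.0, 1.0, 5.0]

def score(x):
    return LANDSCAPE[x]

def neighbors(x):
    return [n for n in (x - 1, x + 1) if 0 <= n < len(LANDSCAPE)]

print(greedy_keep_or_revert(score, 0, neighbors))
print(max(LANDSCAPE))
```

The loop stops at index 2, three steps short of the best point, because each of the intermediate steps lowers the score when taken alone.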
Breadth is structural, not incidental. Autoresearch refines an already-good single-file codebase on one GPU, and it is superb at it. It is not built for problems whose state spans data, training, evaluation, and analysis across sub-disciplines, nor for inventing an algorithm rather than tuning one. The organization exists for problems that do not fit one agent's context and whose moves are not single code edits. A single metric, similarly, cannot say “good and cheap and reproducible at once”; where those genuinely trade off, the constrained reward (§3) finds solutions the scalar would either miss or hack. And autoresearch explores only inside the directions a human writes into program.md; the proposer can propose directions the human didn't, which is where genuinely new results, if they come, will come from.
Every one of these is a hypothesis, and it's on us to prove it. The place we expect to win is not autoresearch's home turf (local refinement of a scorable single-GPU loop, where it will keep winning per-dollar and per-night), but the regime it structurally cannot enter: portfolios of long-horizon, multi-objective, cross-domain problems where skill should carry from one to the next. If our curve never crosses its flat line there, the discipline from above settles it, and the layer does not ship.
2. The system, formally
The clean object is a hierarchical decision process: a Dec-POMDP with a manager.[3] There are domain experts π₁…πₙ, each acting on a shared project state s that holds everything produced so far: current code, datasets, training results, literature notes, open hypotheses. Above them sits an orchestrator π₀ whose action is a task sequence: who does what next. The environment is real: code execution, GPUs, literature access. A constrained reward is computed on the artifact the org eventually produces. The "policy" of the organization is the joint object Π = (π₀, π₁, …, πₙ).
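The control flow this describes can be sketched as follows. Everything here is an illustrative assumption: the names `ProjectState`, `run_episode`, and the toy experts are hypothetical, real experts would be LLM policies acting through tools, and the reward would be the constrained objective rather than a one-line check.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

@dataclass
class ProjectState:
    """Shared state s: everything the org has produced so far."""
    artifacts: Dict[str, str] = field(default_factory=dict)
    log: List[Tuple[str, str]] = field(default_factory=list)

# An expert pi_i maps (state, task) to new or updated artifacts.
Expert = Callable[[ProjectState, str], Dict[str, str]]
# The orchestrator pi_0 maps the state to (expert name, task), or None to stop.
Orchestrator = Callable[[ProjectState], Optional[Tuple[str, str]]]

def run_episode(orchestrator, experts, reward, max_steps=10):
    """Roll out the joint policy; reward is computed on the final artifact."""
    state = ProjectState()
    for _ in range(max_steps):
        decision = orchestrator(state)
        if decision is None:
            break
        name, task = decision
        state.artifacts.update(experts[name](state, task))
        state.log.append((name, task))
    return state, reward(state)

# Toy stand-ins for two experts and a hand-written orchestrator.
experts = {
    "data": lambda s, task: {"dataset": "clean"},
    "trainer": lambda s, task: {"model": "trained on " + s.artifacts["dataset"]},
}

def orchestrator(s):
    if "dataset" not in s.artifacts:
        return ("data", "build dataset")
    if "model" not in s.artifacts:
        return ("trainer", "train model")
    return None

def reward(s):
    return 1.0 if "model" in s.artifacts else 0.0

state, r = run_episode(orchestrator, experts, reward)
print(state.log, r)
```

Note that the only decision the orchestrator makes is who does what next, which is why its action space stays small relative to the experts'.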
This matters because "turn five policies into one" is three different operations, and only one of them is the architecture. Weight-space merging (soups, TIES, task arithmetic) blends fine-tuned models into a single network, and even the interference-aware variants blur the specialization you spent compute building.[4] Behavioral distillation trains one student to imitate all the teachers and folds the routing in implicitly. Hierarchical abstraction physically combines nothing: it keeps the experts separate and simply calls (π₀,…,πₙ) the policy. We take the third. Distillation re-enters later, as the consolidation step for serving, the role I've argued it is generally settling into across post-training.[2]
[Figure: … π₀. Experts are trained earlier on dense, local rewards (teal), so the global signal never has to do their credit assignment.]
We do not merge the experts. We make the orchestrator the only thing that has to learn the organization.
3. The training pipeline
The pipeline is staged, and the ordering is the whole point: the same "narrow specialists first, consolidation last" shape that the best recent post-training pipelines have converged on.[2]
- Stage 0: Specialize. Each expert is RL'd on its own domain tasks against a local, verifiable reward: data-quality and schema checks for the data engineer, throughput and convergence for the trainer, held-out solve rate for the researcher. These run in parallel and cheaply, with group-relative scoring so no critic is needed.[5] A generalist trained against one noisy global reward is the hard case; a deliberately narrow specialist is the tractable one.
- Stage 1: Orchestrate. Freeze the experts; train π₀ against the global constrained reward. Its action space is small (assignments and sequencing), so a sparse signal travels far. This is where credit assignment would otherwise destroy us, and the deliberate dodge is to keep most learning in Stage 0's local rewards and reserve the global signal for a low-dimensional policy.
- Efficiency scaffold (cross-cutting). A proxy ladder of fixed-budget surrogate episodes (autoresearch's five-minute trick, generalized), so most updates run in minutes, validated by tracking the rank correlation between proxy and full-scale outcomes. A learned value / process-reward model that approximates the expensive benchmark cheaply, supplies dense intermediate signal, and prunes doomed rollouts before they finish. Off-policy replay with hindsight relabeling, so a failed attempt that nonetheless produced a clean dataset is mined as a positive for that sub-goal.[6]
- Stage 2: Curriculum. A proposer generates tasks at the learnability frontier (roughly the band where the org succeeds about half the time and the gradient is largest) instead of sampling uniformly.[6] This is also where genuine open-endedness, and the only real self-play in the design, enters: a proposer searching for hard novel sub-problems is the asymmetric opponent the system otherwise lacks.
- Stage 3: Consolidate. Distill the trained, orchestrated org into a single fast policy for serving via on-policy distillation. Distillation is the last layer, not the engine: it compresses a multi-policy system into one deployable policy after the learning is done.
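Two of the mechanisms in the stages above reduce to short formulas: the group-relative scoring used in Stage 0 and the frontier weighting used in Stage 2. The sketch below is hedged accordingly. The normalization follows the usual GRPO form (reward minus group mean, divided by group standard deviation), and the p(1 − p) weighting is one common way to express "succeeds about half the time"; neither is confirmed as the exact form this pipeline will use, and both function names are hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each rollout against its own group's mean and spread (no critic)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def learnability_weights(success_rates):
    """Weight tasks by p * (1 - p), normalized; the weight peaks at p = 0.5."""
    raw = [p * (1.0 - p) for p in success_rates]
    total = sum(raw)
    if total == 0:
        # Every task is always solved or never solved: fall back to uniform.
        return [1.0 / len(raw)] * len(raw)
    return [w / total for w in raw]

# Four rollouts of one prompt with made-up verifiable rewards.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)

# Made-up per-task success rates for the current org.
weights = learnability_weights([0.0, 0.5, 0.9, 1.0])
print(weights)
```

Tasks the org always solves or never solves get zero weight, so the proposer's sampling budget concentrates on the band where outcomes are still uncertain.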
The reward is a constrained objective, not a weighted sum
The thing you most want to avoid is a single scalar formed by weighting eight metrics. Performance and cost are different types of quantity, and a weighted sum invites the policy to buy unlimited quality with unlimited compute, or to hit a beautiful cost number with a useless result, depending on which margin is softest. The structure that actually expresses "we care about all of this" is a small number of objectives you push (for an RL algorithm, realistically one: does it beat the baseline on held-out tasks) and a larger set of hard constraints the episode must satisfy or fail: compute budget, wall-clock, numerical stability across seeds, reproducibility, no eval overfitting. Most of what you care about turns out to be a constraint, not an objective.
Anything you leave unmeasured will be driven to its worst value. The suite is not a scoreboard; it is the specification of acceptable behavior.
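The difference between the two reward shapes is easy to show on one episode. This is a sketch under stated assumptions: the field names, the limits in `LIMITS`, the `FAIL_REWARD` value, and the cost weight are all hypothetical, chosen only to make the contrast visible.

```python
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    heldout_gain: float      # objective: gain over the baseline on held-out tasks
    gpu_hours: float         # compute actually spent
    wall_clock_hours: float
    seed_std: float          # spread of the metric across seeds
    reproduced: bool         # did an independent rerun match?

# Hypothetical limits, for illustration only.
LIMITS = {"gpu_hours": 8.0, "wall_clock_hours": 12.0, "seed_std": 0.02}
FAIL_REWARD = -1.0           # hypothetical value assigned to a failed episode

def violated_constraints(result, limits=LIMITS):
    """List every hard constraint the episode broke (empty list = all satisfied)."""
    broken = []
    if result.gpu_hours > limits["gpu_hours"]:
        broken.append("compute budget")
    if result.wall_clock_hours > limits["wall_clock_hours"]:
        broken.append("wall-clock")
    if result.seed_std > limits["seed_std"]:
        broken.append("stability across seeds")
    if not result.reproduced:
        broken.append("reproducibility")
    return broken

def constrained_reward(result, limits=LIMITS):
    """Push one objective; any broken constraint fails the whole episode."""
    return FAIL_REWARD if violated_constraints(result, limits) else result.heldout_gain

def weighted_sum_reward(result, cost_weight=0.01):
    """The scalarization the text warns against, shown for contrast."""
    return result.heldout_gain - cost_weight * result.gpu_hours

over_budget = EpisodeResult(0.30, gpu_hours=20.0, wall_clock_hours=6.0,
                            seed_std=0.01, reproduced=True)
print(weighted_sum_reward(over_budget) > 0)   # the weighted sum still pays out
print(constrained_reward(over_budget))        # the constrained reward fails it
```

Under the weighted sum, the policy can keep trading compute for gain as long as the exchange rate is favorable; under the constrained form, the over-budget run is simply a failed episode.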
4. The trust-signal problem
This is where the research-org setting turns adversarial, and where this plan connects to a question I left open in earlier writing. In the cleanest worked example I've analyzed (DeepSeek's Thinking with Visual Primitives), verifiable rewards came almost for free, because vision is spatial: grounding every reasoning step in image coordinates keeps the trajectory anchored to a space where the result is cheap to check against the image itself, so it cannot drift out of the region where the verifier operates.[7] I argued there that the deep move was not "use bounding boxes" but design the output representation so that local correctness is externally checkable, and I named open-ended research trajectories as exactly the domain where that move is hardest and the "know when to trust dense feedback" critique bites deepest. We are now standing in that domain on purpose.
A research org has two candidate trust signals, and both are flawed. The learned value model is dense and cheap but drifts on long research trajectories: the same on-policy-distillation drift problem, returning in a setting with no bounded canvas to stop it. The external outcome verifier (held-out task suites, reproduction across seeds, compute receipts, ablation checks) is trustworthy but sparse and late, available only once an artifact exists.
The manufactured anchors we can build for research (held-out suites, seed reproduction, ablation and cost ledgers) are the analog of TwVP's coordinates, but at lower density. They check that a result is real. They cannot check that it is novel. And novelty is the whole point: you can only reward what you can specify, and discovery is by construction unspecified. The benchmark that measures everything you care about today is the one that stops measuring what you'll care about tomorrow. The partial defenses are canary holdouts the org never trains on, rotating verifiers so it cannot overfit a fixed surface, and periodic human audit of the specification itself to catch exploitation of the unmeasured axes.
We can verify that the org's output is correct and cheap. We cannot verify that it is a discovery. That gap is the ceiling.
The nearest existing blueprint for closing that gap is the rubric machinery behind benchmarks like OpenAI's LifeSciBench,[9] which grades open-ended scientific answers by breaking each into a few dozen atomic, independently-checkable criteria, scoring partial credit against them, gating on a hard pass threshold, and licensing the whole rubric through a cohort of experts who validate criteria they didn't write. Every one of those moves transfers: atomic decomposition is the dense process reward §3 leans on; partial credit behind a hard gate is its constrained objective; an independent validation cohort is what a rotating verifier looks like in practice. What does not transfer is the part that defines the gap: each criterion is fixed in advance from a verifiable answer or prior consensus, so the rubric can confirm that a result is real and correct but can credit nothing the authors didn't already anticipate. That is the wall novelty runs into, and the verifier this plan is ultimately reaching for is the one that gets past it.
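The rubric mechanics described above (atomic criteria, partial credit, a hard pass gate) can be sketched in a few lines. This is an illustration only: the benchmark's actual scoring formula is not given here, so the point weights, the 0.7 threshold, and the names `Criterion` and `grade` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Criterion:
    description: str   # one atomic, independently checkable statement
    points: float      # weight of this criterion (hypothetical)
    met: bool          # the grader's verdict for this answer

def grade(criteria: List[Criterion], pass_threshold: float = 0.7) -> Tuple[float, bool]:
    """Return (partial-credit score in [0, 1], whether the hard gate is passed)."""
    total = sum(c.points for c in criteria)
    if total <= 0:
        raise ValueError("rubric must carry positive total points")
    score = sum(c.points for c in criteria if c.met) / total
    return score, score >= pass_threshold

# A made-up rubric for one open-ended answer.
rubric = [
    Criterion("states the correct mechanism", 2.0, True),
    Criterion("cites a control condition", 1.0, True),
    Criterion("reports effect size with units", 1.0, False),
]
print(grade(rubric))
```

Every criterion in `rubric` had to be written before the answer was graded, which is the limitation the paragraph above points at: the score can only credit what the authors anticipated.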
5. Pitfalls
The plan is really a stack of bets, any of which can fail on its own. The ones that worry me most, roughly in order:
- Sample economics dominate everything. Each episode is an entire research attempt (hours of wall-clock and real GPU dollars) while gradient RL wants thousands of them. We may be forced into population-based or evolutionary search over a handful of expensive evaluations, with gradient RL surviving only on the cheap local rewards of Stage 0. The cost structure picks the optimizer; the optimizer does not pick the cost structure.
- Credit assignment may not factor. A scalar end-of-episode reward over a long multi-agent trajectory is brutal to attribute. Counterfactual baselines and value decomposition exist, but they cost.[8] Our bet is to sidestep most of it with local expert rewards plus an orchestrator-only global signal, and if that bet is wrong, the global signal is simply too weak to learn from.
- Proxy decoherence. The proxy ladder only works while small-scale surrogates preserve the ranking of full-scale outcomes. A proxy that was faithful last week can quietly start lying. Continuous rank-correlation monitoring catches it; nothing fixes it if no cheap proxy tracks the real metric, and then the whole sample-efficiency story collapses.
- Reward hacking. A fixed suite gets gamed: the org overfits the measured surface and sacrifices everything unmeasured. The growing gap between train-suite score and canary-holdout score is the alarm, provided we are disciplined enough to keep the canaries secret and watch them.
- Distillation drift, again. Consolidating a long-horizon research org via on-policy distillation reintroduces exactly the drift I wrote about: as student rollouts lengthen, their prefixes leave the teacher's reliable region and dense supervision becomes cheap, low-variance, and wrong.[7] Short proxy episodes hide this; real research horizons surface it.
- Orchestrator collapse. The manager can learn to route everything to one expert, or to a degenerate plan that scores well on the proxy and nowhere else. Entropy and diversity terms plus held-out task rotation are partial guards, not guarantees.
- The expensive-rediscovery failure. The one that subsumes the rest: spend two or three orders of magnitude more compute than autoresearch and produce a learned policy that still cannot beat a greedy fixed agent. The only protection is the discipline from §1: gate every layer on beating the simpler baseline on a held-out measure before it ships.
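The rank-correlation monitor mentioned under proxy decoherence is cheap to build. The sketch below computes Spearman's rho by hand (Pearson correlation of average ranks) so that it has no dependencies; the 0.8 alarm threshold and the name `proxy_is_trustworthy` are hypothetical, and the scores are made up.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def proxy_is_trustworthy(proxy_scores, full_scores, min_rho=0.8):
    """True while the cheap proxy still orders candidates like the full run."""
    return spearman(proxy_scores, full_scores) >= min_rho

# Same four candidate changes, scored by the short proxy and at full scale.
proxy = [0.1, 0.2, 0.3, 0.4]
full = [0.5, 0.9, 0.6, 0.7]
print(spearman(proxy, full), proxy_is_trustworthy(proxy, full))
```

Running this check on a rolling window of candidates that were evaluated at both scales is what turns "a proxy that quietly starts lying" into an alarm rather than a surprise.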
Where this leaves us
The bet is narrower than "more agents discover more." On a scorable loop, coordination is a tax and a single agent wins; the multi-agent structure only earns its keep where a problem genuinely spans expertise one context cannot hold. The real bet is that a researcher that improves itself crosses a line a fixed one structurally cannot — because autoresearch can neither learn nor see past a single metric — and that the line is reachable before the sample economics make it academic. That second clause is genuinely open, and I would rather state it than dress it up.
Underneath all of it is the trust signal. The organization that can verify its own progress is the one that can train; the one that cannot is just an expensive way to overfit a benchmark. The verifier this experiment needs is the ordinary kind — held-out tasks, reproduction, constraint gates — and we can build it, so whether a learning lab beats autoresearch is a question we can ask and answer now. What we cannot yet build is a verifier that certifies a discovery: an anchor as load-bearing as TwVP's coordinates, in a domain that hands you none for free. That is the deeper problem, and it sets the ceiling on how far any of this can climb, but it is what the experiment climbs toward rather than a wall in front of it.
Autoresearch proved a fixed agent in a greedy loop can already find what a careful human missed. The open question is whether a lab that learns can find what a careful human couldn't — and whether we can ever know that it did.
References
- [1] A. Karpathy, autoresearch, GitHub, March 2026. Single-agent loop over single-GPU nanochat training: edit, train for a fixed 5-minute budget, score on one metric, keep or revert, repeat unattended (~100 experiments/night).
- [2] V. Tan, Why Post-Training Is Moving Toward On-Policy Distillation, May 2026. The "sequence-level RL on specialists, distillation as consolidation" template.
- [3] F. Oliehoek & C. Amato, A Concise Introduction to Decentralized POMDPs, Springer, 2016. The Dec-POMDP formalism used here for the org-as-one-policy framing.
- [4] Wortsman et al., Model Soups, 2022; Yadav et al., TIES-Merging, 2023; Ilharco et al., Task Arithmetic, 2023. Weight-space merging: the operation we deliberately avoid for the experts.
- [5] GRPO: group-relative policy optimization (DeepSeek), used for critic-free specialist RL in Stage 0.
- [6] Andrychowicz et al., Hindsight Experience Replay, 2017, for relabeling failed trajectories; Jiang et al., Prioritized Level Replay, 2021, for regret-based (positive value-loss) level selection. The "succeeds about half the time" learnability framing in §3 follows the learnability-sampling / ZPD line, not PLR's regret signal.
- [7] DeepSeek, Thinking with Visual Primitives, April 2026 (analyzed in V. Tan, A Well-Worked Example of OPD, and Its Limits). Coordinate grounding as a manufactured verifiable trust signal; the on-policy-distillation drift problem on long horizons.
- [8] Foerster et al., Counterfactual Multi-Agent Policy Gradients (COMA), 2018; Rashid et al., QMIX, 2018. Multi-agent credit assignment.
- [9] A. Liu, A. Ho, et al. (OpenAI & Tacit Labs), LifeSciBench: Evaluating Language Models on Realistic, Expert-Level Tasks in the Life Sciences, 2026. Open-ended scientific tasks graded against expert-written rubrics: atomic per-criterion scoring, a partial-credit score plus a hard pass threshold, and rubrics validated by a cohort disjoint from the authors.