Why Post-Training Is Moving Toward On-Policy Distillation
Reading recent papers, I have come away with a strong impression: a visible strand of post-training research is moving from PPO/GRPO-style reinforcement learning toward On-Policy Distillation (OPD).
This is not based on one or two isolated papers. Since the beginning of the year, there has been a cluster of OPD-related work: OPSD, SCOPE, Rethinking OPD, SDPO, and other variants. The common thread is clear: researchers are trying to replace sparse, high-variance sequence-level feedback with denser supervision on student-generated rollouts. OPSD, for example, frames on-policy self-distillation as dense token-level supervision on student rollouts, while SCOPE explicitly describes OPD as a way to address sparse outcome rewards and hard token-level credit assignment in on-policy RL.[1]
Industry signals point in a similar direction, though the evidence should be stated carefully. Qwen3 reports that distillation from stronger models can achieve better performance and much higher training efficiency than its full multi-stage training pipeline. DeepSeek-V4’s public model card describes a post-training pipeline where domain-specific experts are trained with SFT/GRPO and then consolidated through on-policy distillation.[2]
I think this shift is worth paying attention to because it is not merely a change in algorithmic taste. It reflects a deeper structural pressure in the post-training regime.
The simple version is:
As reasoning trajectories get longer, sequence-level reinforcement learning becomes increasingly difficult to scale.
From sparse rewards to dense supervision
PPO and GRPO operate primarily in a trajectory-level regime. The model samples a rollout, receives a reward signal, and updates based on that sampled behavior. This works reasonably well when rollouts are short and the reward is informative. But as rollouts grow from hundreds of tokens to thousands or tens of thousands, the learning problem changes qualitatively.
The reward becomes delayed. Credit assignment becomes harder. The variance of the learning signal becomes a central bottleneck.
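To make the variance point concrete, here is the standard score-function (REINFORCE) form of the policy gradient for a rollout that receives a single terminal reward. The notation is my own shorthand, not taken from the papers cited here: x is the prompt, y is a sampled rollout of T tokens, and R(y) is the outcome reward.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[ R(y) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \right]
```

Every one of the T token terms is multiplied by the same scalar R(y), so the estimator cannot by itself tell which tokens caused the outcome. A baseline or value model can reduce the variance, but the reward stays one number per rollout while the sum grows with T, which is the usual intuition for why the signal gets noisier on long trajectories.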
GRPO improves on PPO by simplifying the training stack and using group-relative advantages instead of a learned value model. This can make reasoning RL more stable and efficient. But it still largely lives in the same regime: sequence-level optimization over sampled rollouts.
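A minimal sketch of the group-relative idea, assuming the commonly described form in which each rollout's reward is normalized by the mean and standard deviation of its sampling group. The function name and the `eps` constant are illustrative choices, not any library's API.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    rewards: one scalar outcome reward per rollout for the same prompt.
    eps is an assumed constant that avoids division by zero when every
    rollout in the group receives the same reward.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Four rollouts for one prompt, scored 1.0 if the final answer is correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Note that each rollout still gets exactly one advantage value, which is then shared by all of its tokens. No value model is needed, but the feedback remains sequence-level.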
OPD changes the granularity of feedback. Instead of only asking, “Was this whole trajectory good?”, OPD asks a teacher model to provide dense token-level supervision on trajectories generated by the student itself. The student is still sampling from its own policy, so the training distribution is closer to inference-time behavior than in standard off-policy distillation. But the feedback signal is much denser than sparse outcome-level RL.
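What "dense token-level supervision" could look like in code is sketched below. I am assuming a per-token KL divergence from student to teacher, evaluated at every position of a student-sampled rollout; the direction of the KL, and whether a full-vocabulary KL or a sampled-token log-ratio is used, varies by method and is not specified here. The random logits are stand-ins for real model outputs.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def per_token_reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at each position of a student-sampled rollout.

    Both inputs have shape [T, V]: T rollout positions, V vocabulary entries.
    In real training the teacher logits would come from scoring the
    student's own prefix, not from the teacher's own generation.
    """
    s = log_softmax(np.asarray(student_logits, dtype=np.float64))
    t = log_softmax(np.asarray(teacher_logits, dtype=np.float64))
    return (np.exp(s) * (s - t)).sum(axis=-1)  # shape [T]

rng = np.random.default_rng(0)
T, V = 5, 8  # toy sizes, chosen only for illustration
student_logits = rng.normal(size=(T, V))
teacher_logits = rng.normal(size=(T, V))

kl = per_token_reverse_kl(student_logits, teacher_logits)
loss = kl.mean()  # one learning signal per token, averaged over the rollout
```

The contrast with outcome-level RL is in the shape of the signal: `kl` has one entry per token, whereas a terminal reward gives one number for the whole rollout.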
This is the core appeal of OPD:
It combines on-policy sampling with lower-variance local supervision.
In that sense, the evolution from PPO → GRPO → OPD can be understood as a progression in feedback granularity:
- PPO optimizes against a sparse sequence-level reward.
- GRPO replaces the learned value model with group-relative advantages, but remains sequence-level RL.
- OPD turns the student’s own rollouts into supervised learning targets by querying a teacher along the way.
This looks especially natural in the agent era. For long-horizon reasoning, what matters is not only whether the final answer is correct, but whether each intermediate decision keeps the trajectory inside the region where success remains possible. Dense token-level supervision can shape behavior much earlier than final-outcome RL.
Where is OPD’s ceiling?
But this also exposes the central unresolved problem.
Where is OPD’s ceiling?
The weakness of OPD is teacher reliability on student-generated prefixes. As the student’s trajectory becomes longer, its prefix may drift into states that are unusual, low-quality, or outside the teacher’s most reliable region. The teacher is then asked to provide token-level guidance conditioned on a prefix it may not itself have produced.
This creates a fundamental tension:
- Token-level supervision is low-variance, stable, and efficient. But it can become biased if the teacher’s local guidance is unreliable on long student-generated prefixes.
- Sequence-level RL is closer to the objective we ultimately care about. But it is much noisier, especially as horizon length grows.
Recent OPD papers can be read as different attempts to navigate this tension. SCOPE routes rollouts by signal quality and down-weights unreliable teacher guidance. Rethinking OPD argues that OPD succeeds only under specific conditions, including compatibility between student and teacher reasoning patterns and the teacher offering genuinely new capabilities. The paper also notes that OPD’s apparent “free lunch” of dense token-level reward comes with a cost, especially for long-horizon distillation.[3]
This is why OPD feels like both a major practical direction and an incomplete theoretical story. It is practical because it directly addresses the scaling problem of long-horizon RL: sparse rewards become harder to use as trajectories lengthen. But it is incomplete because the thing that makes OPD efficient — dense token-level teacher supervision — is also the thing that can become unreliable as student-generated trajectories drift away from the teacher’s natural distribution.
A transitional regime, not an endpoint
My current view is that OPD is not the endpoint of post-training. It is a transitional regime.
The field is moving away from pure sparse trajectory-level RL because long-horizon agent training makes it too expensive and too noisy. But pure token-level distillation is also unlikely to be enough, because local teacher supervision does not fully capture global sequence-level objectives.
The next step may be methods that combine the two more explicitly:
- dense process supervision where the teacher is reliable,
- sequence-level correction where local supervision is insufficient,
- and uncertainty-aware routing between different training signals.
In other words, the important question may not be:
RL or distillation?
It may be:
At each point in a long trajectory, what is the most reliable training signal available?
That framing makes OPD less like a replacement for RL and more like a new layer in the post-training stack: a way to densify supervision while preserving on-policy adaptation.
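One way to picture "the most reliable signal at each point" is a per-token gate that blends the two losses. The sketch below is purely hypothetical: it uses teacher entropy as a crude proxy for teacher reliability, and the names `teacher_trust`, `routed_loss`, and `max_entropy_frac` are invented for illustration. It is not the weighting scheme of SCOPE or of any other paper mentioned here.

```python
import numpy as np

def teacher_trust(teacher_probs, max_entropy_frac=0.5):
    """Hypothetical gate: trust the teacher where its entropy is low.

    teacher_probs: [T, V] next-token probabilities from the teacher.
    Returns a weight in [0, 1] per position; max_entropy_frac is an
    assumed threshold above which the teacher is ignored entirely.
    """
    p = np.asarray(teacher_probs, dtype=np.float64)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1)
    normalized = entropy / np.log(p.shape[-1])  # 0 = certain, 1 = uniform
    return np.clip(1.0 - normalized / max_entropy_frac, 0.0, 1.0)

def routed_loss(per_token_kl, student_logp, advantage, trust):
    """Blend dense distillation with a sequence-level policy-gradient term."""
    dense = trust * per_token_kl                          # teacher-guided part
    sparse = (1.0 - trust) * (-advantage * student_logp)  # outcome-guided part
    return (dense + sparse).mean()

teacher_probs = np.array([[0.97, 0.01, 0.01, 0.01],   # confident teacher
                          [0.25, 0.25, 0.25, 0.25]])  # teacher has no preference
trust = teacher_trust(teacher_probs)
loss = routed_loss(per_token_kl=np.array([0.2, 0.9]),
                   student_logp=np.array([-0.1, -1.5]),
                   advantage=1.0, trust=trust)
```

In this toy case the first position leans mostly on the teacher and the second falls back entirely to the sequence-level term. Low entropy does not guarantee that the teacher is right on an unusual prefix, so a real method would need a better reliability estimate than this.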
The trend is clear: post-training is being pushed from sparse outcome feedback toward denser, more local feedback. But the open problem is equally clear:
Dense feedback only helps if we know when to trust it.
References
- [1] OPSD / SCOPE on dense token-level supervision and sparse-reward credit assignment in on-policy RL. arxiv.org/abs/2601.18734
- [2] Qwen3 and DeepSeek-V4 post-training reports on distillation as a consolidation step after SFT/GRPO experts. arxiv.org/abs/2604.13016
- [3] SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting. arxiv.org/abs/2604.10688