DeepSeek’s Thinking with Visual Primitives: A Well-Worked Example of OPD, and Its Limits
Last week DeepSeek released Thinking with Visual Primitives (TwVP), a multimodal reasoning model whose chain of thought is forced to emit bounding boxes and point coordinates rather than free-form language. The post-training pipeline ends in on-policy distillation from two specialist teachers.[1]
The most interesting thing about TwVP isn’t what it does on visual reasoning benchmarks. It’s that the paper sidesteps on-policy distillation’s central failure mode without touching the algorithm, by changing the output representation instead.
OPD has a known failure mode: as student rollouts lengthen, prefixes drift outside the teacher’s reliable region, and dense token-level supervision becomes quietly biased. It stays cheap and low-variance, but it is wrong. TwVP forces every reasoning step to emit image coordinates of the form <ref>label</ref><box>x1,y1,x2,y2</box>, so the trajectory cannot leave the bounded canvas. The drift problem doesn’t get solved; it gets engineered out of existence.
The deeper move generalizes well beyond vision: design the output representation so local correctness is externally checkable. TwVP works because image coordinates sit inside the model’s output and are cheaply verifiable against detection annotations. That gives the system a trust signal independent of the teacher, which is what OPD has needed all along.
It is also the cleanest worked example I’ve seen of the layered post-training stack I sketched in an earlier post: sequence-level RL where local supervision is unreliable, on-policy distillation as the final consolidation step, externally verifiable rewards as the trust signal on top. This post walks through how each piece is handled in the actual paper, what the pattern says about where post-training is heading, and the limitations of this approach.
1. Getting OPD’s upside without its downside
Grounding bounds drift
TwVP attacks the drift problem at the representation level rather than the algorithmic level. Every reasoning step is forced to emit tokens of the form <ref>label</ref><box>x1,y1,x2,y2</box> that pin the trajectory to physical image coordinates. The student cannot drift into the open-ended linguistic state space where teacher guidance becomes unreliable; at every step, its output lives in the same grounded representational space as the teacher’s.
Drift is bounded by construction, not by routing.
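To make "bounded by construction" concrete, here is a minimal sketch of parsing the step format quoted above and checking that each box stays on the canvas. It assumes integer pixel coordinates with the origin at the top-left; the paper may use a different coordinate convention (for example, normalized values). The names `parse_grounded_steps` and `in_canvas` are hypothetical, not from the paper.

```python
import re

# Matches one grounded step in the tag format quoted in the text.
STEP_RE = re.compile(
    r"<ref>(?P<label>[^<]+)</ref><box>(?P<coords>[^<]+)</box>"
)


def parse_grounded_steps(text):
    """Return (label, (x1, y1, x2, y2)) for every well-formed step in text."""
    steps = []
    for match in STEP_RE.finditer(text):
        parts = match.group("coords").split(",")
        if len(parts) != 4:
            continue  # malformed box: wrong number of coordinates
        try:
            box = tuple(int(p.strip()) for p in parts)
        except ValueError:
            continue  # malformed box: non-integer coordinate (assumed format)
        steps.append((match.group("label").strip(), box))
    return steps


def in_canvas(box, width, height):
    """True if the box is non-empty and lies fully inside the image."""
    x1, y1, x2, y2 = box
    return 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height
```

A rollout whose steps all parse and pass `in_canvas` has, by definition, stayed inside a finite state space, which is the property the argument above leans on.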
Specialists satisfy compatibility
Rethinking OPD argued that OPD only succeeds when student and teacher share reasoning patterns.[2] The natural failure mode is a generalist teacher providing token-level supervision on a student that reasons in a slightly different style.
TwVP’s response is to train two narrow specialists before doing any OPD: FTwG (focused on grounding, i.e. bounding boxes) and FTwP (focused on pointing, i.e. coordinates). Each specialist’s reliable region is wider within its sub-task than a generalist’s would be. The OPD step then runs inside a regime where the teacher distribution is well-defined for the kinds of trajectories the student is producing.
The compatibility condition is satisfied not by a clever student, but by a deliberately narrow teacher.
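For readers who want "token-level supervision" spelled out: a common formulation of OPD scores each token of a student-sampled rollout with the reverse KL between the student's and the teacher's next-token distributions. The sketch below assumes that formulation; the source does not state TwVP's exact distillation objective, and the function names are illustrative.

```python
import math


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) at one token position over a shared vocabulary."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0
    )


def opd_token_losses(student_dists, teacher_dists):
    """Per-position losses along a rollout sampled from the student.

    Both arguments are lists of probability vectors, one per generated token,
    evaluated on the same student-generated prefix.
    """
    return [reverse_kl(s, t) for s, t in zip(student_dists, teacher_dists)]
```

The compatibility concern shows up here directly: the teacher distribution at each position is conditioned on a prefix the student wrote, so the loss is only meaningful where the teacher is reliable on such prefixes.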
OPD as consolidation, not engine
TwVP’s pipeline runs sequence-level objectives first, dense token-level supervision last:
- Stage 1–2. SFT cold start on a synthetic corpus of bounding-box-annotated reasoning.
- Stage 3. GRPO on each specialist, with three reward heads (format, quality, accuracy). Sequence-level RL.
- Stage 4. Unified RFT: another round of sequence-level RL, on mixed rollouts from both specialists, to merge them into a single policy.
- Stage 5. On-policy distillation, described in the paper as “bridging the gap between the unified model and the specialists.”
OPD is the last layer, not the engine. Sequence-level objectives are baked in before the dense token-level supervision step. This is the layered stack I sketched, with OPD acting as the consolidation step.
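As a rough illustration of what "sequence-level" means in Stage 3, the sketch below combines the three reward heads into one scalar per rollout and applies GRPO's group-relative normalization (reward minus group mean, divided by group standard deviation). The equal head weights and the choice of population standard deviation are assumptions for illustration, not values from the paper.

```python
from statistics import mean, pstdev


def total_reward(heads, weights=None):
    """Combine reward heads into one scalar; the weights are hypothetical."""
    weights = weights or {"format": 1.0, "quality": 1.0, "accuracy": 1.0}
    return sum(weights[name] * heads[name] for name in weights)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Every token in a rollout shares that rollout's single advantage, which is why this signal is coarser than OPD's per-token feedback but does not depend on a teacher being right at each position.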
Knowing when to trust dense feedback
The most open-ended claim in the previous post was that dense feedback only helps if we know when to trust it. SCOPE’s answer was uncertainty-aware routing of teacher KL.[3] TwVP’s answer is different and arguably stronger: do not trust the teacher’s local distribution at all; verify against ground truth instead.
The reward on TwVP’s maze navigation task decomposes into five components: causal exploration progress, exploration completeness, wall-penetration penalty, path validity, and final answer correctness. On counting, the reward uses a smooth exponential decay rather than a binary indicator.[1] These are externally verifiable signals computed against ground-truth geometry, not divergence from a teacher’s per-token distribution.
Because the trajectory is pinned to image coordinates, its reward can be computed against ground truth (the count, the final answer, the path through the maze) rather than against the teacher’s distribution. That gives the system a trust signal that does not depend on the teacher’s reliability on long student-generated prefixes. The teacher contributes shape; the verifiable reward contributes truth.
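A minimal sketch of the two reward shapes described above. The paper is quoted only as using "a smooth exponential decay" for counting and five named components for mazes; the specific form exp(-|error| / tau), the `tau` parameter, the component keys, and the weighted sum are all assumptions made here for illustration.

```python
import math


def counting_reward(predicted, truth, tau=1.0):
    """Smooth partial credit: 1.0 when exact, decaying with absolute error."""
    return math.exp(-abs(predicted - truth) / tau)


# Component names follow the five listed in the text; keys are hypothetical.
MAZE_COMPONENTS = (
    "exploration_progress",
    "exploration_completeness",
    "wall_penetration_penalty",
    "path_validity",
    "answer_correct",
)


def maze_reward(components, weights):
    """Weighted sum of the five components (a negative weight makes a penalty)."""
    return sum(weights[name] * components[name] for name in MAZE_COMPONENTS)
```

Unlike a binary indicator, the decay gives a gradient of credit between "off by one" and "off by ten", and none of the inputs require querying the teacher.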
To compress the mapping: in the previous post, I argued that OPD is a transitional regime that needs (a) better-defined reliable regions, (b) sequence-level correction underneath, and (c) external trust signals on top.
TwVP does (a) by grounding trajectories in image coordinates and using narrow specialists, (b) by running GRPO and Unified RFT before the OPD stage, and (c) by replacing teacher-KL-as-truth with decomposed, geometrically verifiable rewards.
This is not a coincidence. It is what the OPD ceiling looks like when you can engineer the output representation to do the work the algorithm cannot.
2. What it means for the field
A few patterns this work confirms.
OPD-as-consolidation is becoming the recognizable template. Qwen3 reported that distillation can outperform their full multi-stage pipeline. DeepSeek-V4 used OPD to merge experts after SFT/GRPO.[4] TwVP does the same: specialists trained with RL, then unified via OPD. The shape is consistent: dense token-level supervision is not the training method; it is the final layer that compresses a multi-policy system into a single policy. In domains with verifiable rewards, the default stack is becoming sequence-level RL on specialists followed by on-policy distillation as consolidation.
The “specialist → unified” template generalizes beyond vision. The reason TwVP needs specialists is that grounding and pointing are different sub-tasks with different reward shapes. The same is plausibly true of code agents (refactor vs. test vs. debug), of tool use (search vs. compute vs. retrieve), of research workflows (locate evidence vs. synthesize). Training narrow experts and consolidating with OPD is more tractable than RL on a generalist with a single noisy reward.
Coordinate grounding is one instance of a broader strategy: domain-engineered trust signals. TwVP works because image coordinates are simultaneously (i) part of the model’s output and (ii) cheaply verifiable against external annotations. The deeper move is not “use bounding boxes”; it is “design the output representation so that local correctness is externally checkable.” This reframes a chunk of agent RL research: instead of training better critics or denser rewards, design the action space so that local truth signals come for free.
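One way to picture "cheaply verifiable against external annotations" is an intersection-over-union check of an emitted box against ground-truth detection boxes. This is a generic sketch, not the paper's verifier: the annotation layout (a dict from label to a list of boxes) and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def step_is_grounded(label, box, annotations, threshold=0.5):
    """True if the box overlaps some annotated box with the same label."""
    return any(iou(box, gt) >= threshold for gt in annotations.get(label, []))
```

The check costs a few arithmetic operations per step and needs no model call, which is what makes the trust signal effectively free in this domain.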
The interesting open question shifts from “how do we make OPD work?” to “how do we manufacture verifiable trust signals in domains without natural anchors?”
The general lesson is that OPD’s ceiling is lifted not by better algorithms but by better representations.
3. Limitations
TwVP’s setup is unusually friendly to OPD because visual coordinates give cheap external verification. The “dense feedback only helps if we know when to trust it” critique bites hardest in domains where there is no ground-truth anchor to check against: long-horizon code agents, open-ended research trajectories, multi-turn tool use.
DeepSeek picked a domain where the trust signal is essentially free, and built the pipeline around it.
A few specific limits worth naming:
- The headline numbers are concentrated where coordinates are load-bearing. The 17-point gap over GPT-5.4 on maze and path-tracing is exactly where coordinate grounding is most useful and where conventional OPD’s drift problem is most easily sidestepped. On tasks closer to general VQA, the gap narrows considerably.
- Specialist training is expensive. The narrow-teacher fix to the compatibility condition requires training multiple specialist models before the final OPD step. Total compute is closer to a full RL pipeline than to a distillation step; that is the cost of avoiding the drift problem, not an escape from it.
- The grounding constraint is a tax, not a feature, in some domains. Forcing every step to emit a coordinate works for vision because vision is fundamentally spatial. It is much less obvious that abstract reasoning (math, philosophy, code architecture) has an analogous low-dimensional verifiable space. Manufactured anchors (line numbers, AST node IDs, retrieval source IDs) may help, but they do not carry the same density of external truth that pixels do.
- The original OPD critique still holds in its original setting. Rethinking OPD warned that dense token-level reward comes at a cost, especially in long-horizon distillation. TwVP avoids this by running OPD over short, image-grounded trajectories. Stretch the horizon (say, an agent that produces hundreds of grounded steps over an hour of video) and the drift problem returns. We have not seen that stress-tested.
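The manufactured-anchor idea from the list above can be sketched for code. The example assumes a hypothetical agent that cites a function name and a line number at each reasoning step, and uses Python's standard `ast` module to check that the citation points at something real. The helper names are illustrative.

```python
import ast


def function_spans(source):
    """Map each function name to its (first_line, last_line) in source."""
    spans = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            spans[node.name] = (node.lineno, node.end_lineno)
    return spans


def anchor_is_valid(source, name, line):
    """True if the cited line falls inside the cited function's definition."""
    span = function_spans(source).get(name)
    return span is not None and span[0] <= line <= span[1]
```

This confirms only that the anchor exists, not that the reasoning attached to it is correct, which is the sense in which such anchors carry less external truth than pixel coordinates do.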
TwVP is a thoughtfully engineered instance of OPD-as-consolidation, but the conditions that make it work — cheap external verification, naturally short trajectories, well-defined sub-tasks — are not present in most domains where post-training is actually a bottleneck.
It is a successful demonstration, not a generalizable answer.
The open question I left in the earlier post still stands: dense feedback only helps if we know when to trust it. TwVP’s contribution is one way to know: anchor the output in something the world can verify. The question for everyone else is how to manufacture that anchor when the world doesn’t give you one for free.
References
[1] DeepSeek, Thinking with Visual Primitives, April 2026. Withdrawn by DeepSeek shortly after release; the link points to an archived snapshot.
[2] Rethinking On-Policy Distillation, 2026. On the compatibility condition between student and teacher reasoning, and the hidden costs of dense token-level supervision over long horizons.
[3] SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting, 2026. Uncertainty-aware routing of teacher KL.
[4] Qwen3 and DeepSeek-V4 post-training reports on distillation as a consolidation step after SFT/GRPO experts.
[5] Ventali Tan, Why Post-Training Is Moving Toward On-Policy Distillation, May 2026.