Training AI models forces an expensive choice. You can use reinforcement learning (RL), where models learn from their own attempts but get sparse feedback—like playing chess with only a win/loss record. Or you can use supervised fine-tuning (SFT), where models study expert examples but never learn to recover from their own mistakes. Both work. Neither is efficient.
Thinking Machines Lab says they’ve found a way around this. Their latest research introduces on-policy distillation, a hybrid method that matches RL’s results with roughly 10% of the compute. In their benchmark, a math reasoning model hit 70% accuracy on AIME’24 using 1,800 GPU hours instead of 17,920.
Key Points:
- Dense feedback, relevant training: On-policy distillation combines RL’s self-generated learning with SFT’s per-token guidance—every token graded, not just the final outcome.
- Massive efficiency gains: The method reached RL-level performance with 9–30× better compute efficiency across reasoning and assistant-training tasks.
- Continual learning fix: It helps models absorb new knowledge without forgetting how to follow instructions—a long-standing pain point in fine-tuning.
Here’s the core idea: the student model generates its own responses, while a larger teacher model scores each token, showing exactly where it went wrong. Think chess.com’s move analysis—you play your own game, but the engine flags every “blunder” and “brilliant.” It’s a middle ground between watching grandmaster games (SFT) and being told you lost (RL).
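To make that loop concrete, here is a minimal sketch of what one training step could look like, assuming Hugging Face-style causal LMs (a `.generate` method, outputs with `.logits`) and a per-token reverse-KL loss. The function name, the interface, and the omission of prompt masking are illustrative assumptions, not Thinking Machines' published implementation.

```python
import torch
import torch.nn.functional as F

def on_policy_distillation_step(student, teacher, prompt_ids, optimizer, max_new_tokens=256):
    # 1) On-policy: the student generates its own responses to the prompts.
    with torch.no_grad():
        sequences = student.generate(
            prompt_ids, max_new_tokens=max_new_tokens, do_sample=True
        )

    # 2) Both models score the student's own tokens; only the student gets gradients.
    student_logits = student(sequences).logits
    with torch.no_grad():
        teacher_logits = teacher(sequences).logits

    # Logits at position t predict token t+1, so drop the final position.
    student_logprobs = F.log_softmax(student_logits[:, :-1], dim=-1)
    teacher_logprobs = F.log_softmax(teacher_logits[:, :-1], dim=-1)

    # 3) Dense feedback: a reverse KL(student || teacher) term at every position,
    #    so each generated token gets its own grade rather than one
    #    end-of-episode reward. A real pipeline would also mask the prompt tokens.
    per_token_kl = (student_logprobs.exp() * (student_logprobs - teacher_logprobs)).sum(-1)
    loss = per_token_kl.mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The per-token term is what separates this from RL: the gradient signal lands on every sampled token, while the sampling itself keeps training focused on the student's own behavior rather than a teacher's demonstrations.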
There’s one practical constraint: you need a capable teacher. Thinking Machines used Qwen3-32B to train smaller Qwen3-8B students. That’s fine when open-weight models suffice, but it’s unclear how the method scales when teacher models are proprietary or too expensive to query. And the student still needs some domain foundation—you can’t teach calculus to a model that’s never seen math.
The continual-learning angle is the real enterprise hook. Fine-tuning a model on internal data often erases its instruction-following behavior. Thinking Machines showed that on-policy distillation can restore those capabilities afterward, effectively preserving chat quality while adding domain expertise. For companies training private AI assistants, that’s gold.
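As a rough sketch of that workflow, reusing the step function above: the two-phase structure and the choice of the model's original instruct checkpoint as the teacher are assumptions on my part; the article only states that on-policy distillation can restore the lost behavior.

```python
import torch.nn.functional as F

def adapt_then_restore(domain_student, original_instruct_model,
                       internal_docs_loader, chat_prompts_loader, optimizer):
    # Phase 1: plain supervised fine-tuning on internal documents,
    # which tends to degrade instruction following.
    for token_ids in internal_docs_loader:
        logits = domain_student(token_ids).logits[:, :-1]
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            token_ids[:, 1:].reshape(-1),
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Phase 2: distill on-policy against the original instruct checkpoint on
    # ordinary chat prompts, recovering chat quality while keeping the new
    # domain knowledge.
    for prompt_ids in chat_prompts_loader:
        on_policy_distillation_step(
            domain_student, original_instruct_model, prompt_ids, optimizer
        )
```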
Early adopters at Stanford, Princeton, and Berkeley are already running experiments in Tinker: theorem proving with 20% of the data, and chemistry-reasoning accuracy jumping from 15% to 50%. The question now is whether these results hold beyond reasoning—into domains like software engineering, biology, and law—and whether smaller labs can adopt it without the Tinker stack.
If it scales, on-policy distillation could change the economics of model training. It’s not just cheaper—it’s a more human feedback loop: learn by doing, graded by example.