Technical details of DeepSeek
- The development of SFT (Supervised Fine-Tuning)
(1) SFT may no longer be necessary at the reasoning stage
- The biggest innovation of DeepSeek is not its open-source approach or its low cost, but that it no longer requires SFT at the reasoning stage.
- Does this mean that a new method or architecture makes data usage more efficient and speeds up model iteration?
- SFT remains necessary for certain tasks (e.g., data generation and alignment), but it is no longer the only way to improve reasoning performance.
(2) DeepSeek-R1 continues to use SFT in certain steps
- R1 does not completely dispense with SFT: it uses SFT for distillation in the third training step, before the final tuning with RLHF (Reinforcement Learning from Human Feedback).
- The final R1 is therefore still an SFT-trained model, but the SFT data was generated by an RL-trained model.
- Distilling well-curated data through SFT can still yield significant performance gains, without necessarily requiring complex RL methods.
(3) GRPO mechanism (reinforcement learning with verifiable rewards)
- Crucially, the base model must already be capable enough: each prompt is sampled up to 16 times so that a high-quality answer appears among the candidates (a minimal sketch of the group-relative advantage follows this list).
- The approach is particularly suitable for mathematics and programming, because answers there are easily verifiable, but it is in principle transferable to other domains.
- This process suggests that the reasoning behavior the RL model acquires is an emergent computational procedure.
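The core of the mechanism can be sketched in a few lines: for each prompt, GRPO (Group Relative Policy Optimization) samples a group of completions (e.g. 16), scores each with a verifiable reward, and computes the advantage by normalizing each reward against the group's mean and standard deviation, so no separate value model is needed. A minimal Python sketch; the reward values are invented for illustration.

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each sample's reward against its group's mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# 16 sampled answers for one prompt; only three pass the verifiable check (reward 1).
rewards = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
print(group_relative_advantages(rewards))
# Correct samples get positive advantages and are reinforced; the rest are pushed down.
```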
(4) The emergence of CoT (Chain-of-Thought)
- R1-Zero showed CoT emergence without SFT, suggesting that CoT may be a natural property of LLMs (a sketch of such a prompt setup follows this list).
- An infinitely long CoT could in theory give LLMs a kind of Turing-machine capability, but in essence it is just an optimized search strategy.
- Between R1-Preview and R1, the context window was probably enlarged, possibly via a Long2Short CoT optimization.
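How CoT can emerge without SFT is easiest to see from the prompt side: R1-Zero is reported to use only a bare template that reserves a place for reasoning, while the content of that reasoning is shaped entirely by RL against verifiable rewards. The wording below is an assumption in the spirit of that setup, not DeepSeek's exact prompt.

```python
# Assumed R1-Zero-style template: the model is told where to think and where to
# answer, but the chain of thought itself is never supervised; it emerges during RL.
R1_ZERO_STYLE_TEMPLATE = (
    "A conversation between User and Assistant. The Assistant first thinks about "
    "the reasoning process and then provides the answer. The reasoning is enclosed "
    "in <think> </think> tags and the final answer in <answer> </answer> tags.\n"
    "User: {question}\nAssistant:"
)

print(R1_ZERO_STYLE_TEMPLATE.format(question="Prove that the sum of two even numbers is even."))
```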
- The crucial role of data annotation
- DeepSeek attaches great importance to data annotation; even founder Liang Wenfeng is involved in it himself.
- Data quality matters more than algorithms, similar to Tesla’s strategy for autonomous driving.
- Scale.AI continues to have market opportunities, particularly in mathematics and programming, which require expert annotation.
- Multimodal data currently shows no significant effect relative to its high training cost, but may offer opportunities in the future.
- The advantages and disadvantages of distillation
(1) Short-term benefits of distillation
- Distillation allows small models to learn from large models and can provide significant performance improvements (a minimal sketch follows this list).
- In the short term, distillation remains an important way to improve the performance of small models, especially for startups.
- DeepSeek has developed several smaller model versions that can run on mobile devices; if successful, this could significantly increase the usability of AI applications.
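Since distillation here means supervised fine-tuning of a smaller model on traces generated by the larger one (as in the R1 pipeline described earlier), the main implementation detail is how those traces become SFT examples. A minimal PyTorch sketch under that assumption; -100 is the usual ignore index, so the loss only covers the teacher-generated tokens.

```python
import torch
import torch.nn.functional as F

def build_sft_example(prompt_ids: torch.Tensor, trace_ids: torch.Tensor):
    """Concatenate prompt and teacher-generated trace; mask the prompt tokens so the
    student is only trained to reproduce the teacher's reasoning and answer."""
    input_ids = torch.cat([prompt_ids, trace_ids])
    labels = input_ids.clone()
    labels[: prompt_ids.numel()] = -100          # ignored by the loss below
    return input_ids, labels

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Standard next-token cross-entropy with shifted targets."""
    return F.cross_entropy(
        logits[:-1].reshape(-1, logits.size(-1)),
        labels[1:].reshape(-1),
        ignore_index=-100,
    )
```

Any student model that produces per-token logits can be trained with `sft_loss`; the recipe is the same whether the traces come from R1 or another teacher.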
(2) Long-term problems of distillation
- Distillation reduces model diversity, which can lower the achievable performance ceiling.
- Some RL hacks cause the model to first generate useless reasoning and then suddenly produce the right answer, possibly because it “memorized” many questions during pretraining instead of truly understanding them.
- Without its own data pipeline, reliance on distillation could lead to long-term limitations.
(3) Possible future improvements
- Future models could use reinforcement learning with verifiable rewards (RLVR) to ensure that they truly understand rather than merely memorize answers.
- OpenAI does not rely on distillation; anyone who wants to outperform OpenAI probably should not either.
- R1-Zero, which builds up reasoning from scratch rather than relying on existing O1 data, might be the right approach.
- Future LLMs must learn to make “leaps” in their answers in order to maximize performance within a fixed context length.
- Process Reward: “The upper limit of process supervision is the human; the upper limit of outcome supervision is the model”
(1) Process Reward has potential problems
- Process Reward is not necessarily useless, but it can easily lead to reward hacking: the model learns nothing useful yet still achieves high rewards.
- Example from mathematics: a model generates 1000 solutions and none is correct, so RLVR provides no learning signal; a weak process reward, however, could still nudge it in the right direction (see the sketch after this list).
- How useful Process Reward is depends on the complexity of the task and the reliability of the assessment.
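The mathematics example above can be made concrete: with a purely outcome-based, verifiable reward, a batch in which every sample is wrong yields identical rewards, so all group-relative advantages are zero and the policy receives no gradient; even a weak, noisy process reward still orders the samples. All numbers below are invented for illustration.

```python
# All 1000 sampled solutions fail verification: the outcome rewards are identical,
# so every group-relative advantage is zero and RLVR has nothing to learn from.
outcome_rewards = [0.0] * 1000
mean_reward = sum(outcome_rewards) / len(outcome_rewards)
advantages = [r - mean_reward for r in outcome_rewards]
print(all(a == 0.0 for a in advantages))        # True

# A weak process reward (hypothetical per-step plausibility scores) still ranks the
# samples and can push the policy roughly in the right direction.
process_rewards = [0.12, 0.40, 0.05, 0.33]      # assumed scores for 4 of the samples
print(max(range(len(process_rewards)), key=process_rewards.__getitem__))  # sample 1
```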
(2) Challenges of process evaluation (Process Reward Model, PRM)
- If the PRM evaluation contains a systematic bias, it is easily exploited (reward hacking).
- Process supervision is theoretically possible, but there is currently no robust method to ensure that it cannot be gamed.
- Outcome-based supervision works by matching extracted answers against references (sketched after this list), but there is no mature method for models to evaluate themselves without being hacked.
- Process evaluation is technically feasible because the intermediate steps can be systematically enumerated, but it has been little researched so far; it is potentially a promising approach.
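A sketch of the outcome-based matching mentioned above: the final answer is extracted from the completion and compared with a reference. The `\boxed{...}` convention and the normalization are assumptions for illustration, not DeepSeek's actual matching rules.

```python
import re

def extract_answer(text: str) -> str | None:
    """Pull the final answer out of a completion, assumed here to use \\boxed{...}."""
    match = re.search(r"\\boxed\{([^}]*)\}", text)
    return match.group(1).strip() if match else None

def outcome_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the extracted answer matches the reference after light normalization."""
    answer = extract_answer(completion)
    if answer is None:
        return 0.0

    def normalize(s: str) -> str:
        return s.replace(" ", "").lower()

    return 1.0 if normalize(answer) == normalize(reference) else 0.0

print(outcome_reward(r"... therefore the result is \boxed{42}.", "42"))  # 1.0
```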
(3) The upper limit of process vs. outcome supervision
- The upper limit of process supervision is the human: people cannot imagine many of the possible solutions.
- The upper limit of outcome supervision is the model itself, as it can find new, unforeseen solutions.
(4) Comparison with AlphaZero: Why it works
- AlphaZero is effective because chess and Go have a clear win/loss outcome, so the reward can be computed directly from game results.
- LLMs lack such a clear signal: they can generate an unbounded number of answers without knowing whether any of them is a valid solution.
- Similar to genetic algorithms, the model could arrive at better answers through many iterations, but there is a risk of reward hacking.
(5) Process and Rule Validation in Math & Coding
- The advantage of mathematics and programming is that they have testable rules, which is why many RL approaches start there (see the sketch below).
- If the rules are not clearly defined, the model will try to “hack” them by generating formally correct but substantively incorrect answers.
- A robust evaluation method is crucial for the quality of reinforcement learning.
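For programming, the testable rules are simply executable tests: the candidate implementation is run against known cases and its pass rate becomes the reward. A minimal sketch under that assumption; in a real pipeline the generated code would be executed in a sandbox, and the test cases here are made up.

```python
def verify_code(source: str, fn_name: str, tests: list[tuple[tuple, object]]) -> float:
    """Execute a candidate implementation and return its test pass rate as the reward."""
    namespace: dict = {}
    try:
        exec(source, namespace)                     # run the generated code
        fn = namespace[fn_name]
        passed = sum(1 for args, expected in tests if fn(*args) == expected)
        return passed / len(tests)
    except Exception:
        return 0.0                                  # crash or missing function: no reward

candidate = "def add(a, b):\n    return a + b\n"
print(verify_code(candidate, "add", [((1, 2), 3), ((0, 0), 0)]))  # 1.0
```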
