A closed discussion about DeepSeek (Part 2)

Technical details of DeepSeek

  1. The development of SFT (Supervised Fine-Tuning)

(1) SFT may no longer be necessary at the inference level

  • The biggest innovation of DeepSeek is not its open source concept or low cost, but that it no longer requires SFT at the inference layer .
  • Does this mean that a new method or architecture makes data usage more efficient and speeds up model iteration ?
  • SFT remains necessary for certain tasks (e.g., data generation and alignment) , but it is no longer the only way to improve inference performance.

(2) DeepSeek-R1 continues to use SFT in certain steps

  • R1 does not completely dispense with SFT, but uses SFT for distillation in the third step before it is tuned with RLHF (Reinforcement Learning from Human Feedback).
  • R1 is still based on a model trained by SFT, but the data used was generated by a model trained by RLHF.
  • Distilling well-curated data through SFT can still yield significant performance gains , but complex RL methods may not be required.

(3) GRPO mechanism (Reinforcement Learning with Verifiability)

  • Crucially, the base model is intelligent enough – a prompt is generated up to 16 times to find a high-quality answer .
  • Particularly suitable for mathematics and programming as these are easily verifiable but theoretically transferable to other areas.
  • This process shows that the resulting RL model represents an emergent computational procedure .

(4) The emergence of CoT (Chain-of-Thought)

  • R1-Zero showed CoT emergence without SFT , suggesting that CoT may be a natural property of LLMs.
  • An infinitely long CoT could give LLMs a kind of Turing machine capability , but essentially it is just an optimized search strategy.
  • Between R1-Preview and R1, the context window was probably enlarged , possibly by a Long2Short CoT optimization.
  1. The crucial role of data annotation
  • DeepSeek attaches great importance to data annotation , even the founder Liang Wenfeng is involved in it.
  • Data quality is more important than algorithms , similar to Tesla’s strategy for autonomous driving.
  • Scale.AI continues to have market opportunities, particularly in mathematics and programming that require expert annotation.
  • Multimodal data currently show no significant effects due to high training costs, but may offer opportunities in the future.
  1. The Advantages and Disadvantages of Distillation

(1) Short-term benefits of distillation

  • Distillation allows small models to learn from large models and can provide significant performance improvements.
  • In the short term, distillation remains an important method for improving the performance of small models , especially for startups.
  • DeepSeek has developed several smaller model versions that can run on mobile devices , which, if successful, could significantly increase the usability of AI applications.

(2) Long-term problems of distillation

  • Reduced model variety, which can lower the upper performance limit.
  • Some RL hacks cause the model to initially generate useless ideas before suddenly giving the right answer , possibly because it “memorized” many questions during pretraining instead of truly understanding them.
  • Without its own data pipeline, reliance on distillation could lead to long-term limitations.

(3) Possible future improvements

  • Future models could use reinforcement learning with verifiable rewards (RLVR) to ensure they truly understand rather than just memorizing answers.
  • OpenAI doesn’t rely on distillation – if you want to outperform OpenAI, you probably shouldn’t.
  • R1-Zero might be the right approach by training from scratch rather than relying on existing O1 data.
  • Future LLMs must learn to make “leaps” in answers to maximize their performance within fixed context lengths.
  1. Process Reward: “The upper limit of process monitoring is the human, the upper limit of result monitoring is the model”

(1) Process Reward has potential problems

  • Process Reward is not necessarily useless , but it can easily lead to reward hacking – the model learns nothing useful but can achieve high rewards.
  • Example mathematics: A model generates 1000 solutions, but none is correct. It cannot learn anything with RLVR. However, a weak process reward could help to move in the right direction.
  • How useful Process Reward is depends on the complexity of the task and the reliability of the assessment.

(2) Challenges of process evaluation (Process Reward Model, PRM)

  • If the PRM evaluation contains a systematic bias, it is easily exploited (reward hacking).
  • Process monitoring is theoretically possible, but there is currently no robust method to ensure that it is not tampered with.
  • Outcome-based supervision is done by matching with extracted responses , but there is no mature method for models to evaluate themselves without hacking.
  • Process evaluation is technically feasible because it can be systematically enumerated , but it has been little researched so far – potentially a promising approach.

(3) The upper limit of process vs. result monitoring

  • The upper limit of process monitoring is humans – people cannot imagine many solutions.
  • The upper limit of outcome monitoring is the model itself , as it can find new, unforeseen solutions.

(4) Comparison with AlphaZero: Why it works

  • AlphaZero is effective because chess and Go games have a clear win/loss rating and the reward can be calculated by win rate.
  • LLMs do not have this clear signal – they generate an infinite number of answers without knowing whether they provide a solution.
  • Similar to genetic algorithms, the model could arrive at better answers through many iterations, but there is a risk of reward hacking.

(5) Process and Rule Validation in Math & Coding

  • The advantage of mathematics and programming is that they have testable rules , which is why many RL approaches start here.
  • If the rules are not clearly defined, the model will try to “hack” them by generating formally correct but substantively incorrect answers.
  • A robust evaluation method is crucial for the quality of reinforcement learning.
PHP Code Snippets Powered By : XYZScripts.com