Technical details of DeepSeek
- The development of SFT (Supervised Fine-Tuning)
(1) SFT may no longer be necessary at the reasoning stage
- The biggest innovation of DeepSeek is not its open-source approach or its low cost, but that it no longer requires SFT at the reasoning stage.
- Does this mean that a new method or architecture makes data usage more efficient and speeds up model iteration?
- SFT remains necessary for certain tasks (e.g., data generation and alignment), but it is no longer the only way to improve reasoning performance.
(2) DeepSeek-R1 continues to use SFT in certain steps
- R1 does not completely dispense with SFT: it uses SFT for distillation in the third training step, before the final tuning with RLHF (Reinforcement Learning from Human Feedback).
- The final R1 is therefore still an SFT-trained model, but the SFT data was generated by an RL-trained model.
- Distilling well-curated data through SFT can still yield significant performance gains, without necessarily requiring complex RL methods.
(3) GRPO mechanism (reinforcement learning with verifiable rewards)
- Crucially, the base model must already be capable enough: each prompt is sampled up to 16 times so that a high-quality answer appears among the candidates (a minimal sketch of the group-relative advantage follows this list).
- The approach is particularly suitable for mathematics and programming, because answers there are easily verifiable, but it is in principle transferable to other domains.
- This process suggests that the reasoning behavior the RL model acquires is an emergent computational procedure.
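The core of the mechanism can be sketched in a few lines: for each prompt, GRPO (Group Relative Policy Optimization) samples a group of completions (e.g. 16), scores each with a verifiable reward, and computes the advantage by normalizing each reward against the group's mean and standard deviation, so no separate value model is needed. A minimal Python sketch; the reward values are invented for illustration.

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each sample's reward against its group's mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# 16 sampled answers for one prompt; only three pass the verifiable check (reward 1).
rewards = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
print(group_relative_advantages(rewards))
# Correct samples get positive advantages and are reinforced; the rest are pushed down.
```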
(4) The emergence of CoT (Chain-of-Thought)
- R1-Zero showed CoT emergence without SFT, suggesting that CoT may be a natural property of LLMs (a sketch of such a prompt setup follows this list).
- An infinitely long CoT could in theory give LLMs a kind of Turing-machine capability, but in essence it is just an optimized search strategy.
- Between R1-Preview and R1, the context window was probably enlarged, possibly via a Long2Short CoT optimization.
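How CoT can emerge without SFT is easiest to see from the prompt side: R1-Zero is reported to use only a bare template that reserves a place for reasoning, while the content of that reasoning is shaped entirely by RL against verifiable rewards. The wording below is an assumption in the spirit of that setup, not DeepSeek's exact prompt.

```python
# Assumed R1-Zero-style template: the model is told where to think and where to
# answer, but the chain of thought itself is never supervised; it emerges during RL.
R1_ZERO_STYLE_TEMPLATE = (
    "A conversation between User and Assistant. The Assistant first thinks about "
    "the reasoning process and then provides the answer. The reasoning is enclosed "
    "in <think> </think> tags and the final answer in <answer> </answer> tags.\n"
    "User: {question}\nAssistant:"
)

print(R1_ZERO_STYLE_TEMPLATE.format(question="Prove that the sum of two even numbers is even."))
```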
- The crucial role of data annotation
- DeepSeek attaches great importance to data annotation; even founder Liang Wenfeng is involved in it himself.
- Data quality matters more than algorithms, similar to Tesla’s strategy for autonomous driving.
- Scale.AI continues to have market opportunities, particularly in mathematics and programming, which require expert annotation.
- Multimodal data currently shows no significant effect relative to its high training cost, but may offer opportunities in the future.
- The advantages and disadvantages of distillation
(1) Short-term benefits of distillation
- Distillation allows small models to learn from large models and can provide significant performance improvements (a minimal sketch follows this list).
- In the short term, distillation remains an important way to improve the performance of small models, especially for startups.
- DeepSeek has developed several smaller model versions that can run on mobile devices; if successful, this could significantly increase the usability of AI applications.
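Since distillation here means supervised fine-tuning of a smaller model on traces generated by the larger one (as in the R1 pipeline described earlier), the main implementation detail is how those traces become SFT examples. A minimal PyTorch sketch under that assumption; -100 is the usual ignore index, so the loss only covers the teacher-generated tokens.

```python
import torch
import torch.nn.functional as F

def build_sft_example(prompt_ids: torch.Tensor, trace_ids: torch.Tensor):
    """Concatenate prompt and teacher-generated trace; mask the prompt tokens so the
    student is only trained to reproduce the teacher's reasoning and answer."""
    input_ids = torch.cat([prompt_ids, trace_ids])
    labels = input_ids.clone()
    labels[: prompt_ids.numel()] = -100          # ignored by the loss below
    return input_ids, labels

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Standard next-token cross-entropy with shifted targets."""
    return F.cross_entropy(
        logits[:-1].reshape(-1, logits.size(-1)),
        labels[1:].reshape(-1),
        ignore_index=-100,
    )
```

Any student model that produces per-token logits can be trained with `sft_loss`; the recipe is the same whether the traces come from R1 or another teacher.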
(2) Long-term problems of distillation
- Distillation reduces model diversity, which can lower the achievable performance ceiling.
- Some RL hacks cause the model to first generate useless reasoning and then suddenly produce the right answer, possibly because it “memorized” many questions during pretraining instead of truly understanding them.
- Without its own data pipeline, reliance on distillation could lead to long-term limitations.
(3) Possible future improvements
- Future models could use reinforcement learning with verifiable rewards (RLVR) to ensure that they truly understand rather than merely memorize answers.
- OpenAI does not rely on distillation; anyone who wants to outperform OpenAI probably should not either.
- R1-Zero, which builds up reasoning from scratch rather than relying on existing O1 data, might be the right approach.
- Future LLMs must learn to make “leaps” in their answers in order to maximize performance within a fixed context length.
- Process Reward: “The upper limit of process supervision is the human; the upper limit of outcome supervision is the model”
(1) Process Reward has potential problems
- Process Reward is not necessarily useless, but it can easily lead to reward hacking: the model learns nothing useful yet still achieves high rewards.
- Example from mathematics: a model generates 1000 solutions and none is correct, so RLVR provides no learning signal; a weak process reward, however, could still nudge it in the right direction (see the sketch after this list).
- How useful Process Reward is depends on the complexity of the task and the reliability of the assessment.
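The mathematics example above can be made concrete: with a purely outcome-based, verifiable reward, a batch in which every sample is wrong yields identical rewards, so all group-relative advantages are zero and the policy receives no gradient; even a weak, noisy process reward still orders the samples. All numbers below are invented for illustration.

```python
# All 1000 sampled solutions fail verification: the outcome rewards are identical,
# so every group-relative advantage is zero and RLVR has nothing to learn from.
outcome_rewards = [0.0] * 1000
mean_reward = sum(outcome_rewards) / len(outcome_rewards)
advantages = [r - mean_reward for r in outcome_rewards]
print(all(a == 0.0 for a in advantages))        # True

# A weak process reward (hypothetical per-step plausibility scores) still ranks the
# samples and can push the policy roughly in the right direction.
process_rewards = [0.12, 0.40, 0.05, 0.33]      # assumed scores for 4 of the samples
print(max(range(len(process_rewards)), key=process_rewards.__getitem__))  # sample 1
```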
(2) Challenges of process evaluation (Process Reward Model, PRM)
- If the PRM evaluation contains a systematic bias, it is easily exploited (reward hacking).
- Process supervision is theoretically possible, but there is currently no robust method to ensure that it cannot be gamed.
- Outcome-based supervision works by matching extracted answers against references (sketched after this list), but there is no mature method for models to evaluate themselves without being hacked.
- Process evaluation is technically feasible because the intermediate steps can be systematically enumerated, but it has been little researched so far; it is potentially a promising approach.
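A sketch of the outcome-based matching mentioned above: the final answer is extracted from the completion and compared with a reference. The `\boxed{...}` convention and the normalization are assumptions for illustration, not DeepSeek's actual matching rules.

```python
import re

def extract_answer(text: str) -> str | None:
    """Pull the final answer out of a completion, assumed here to use \\boxed{...}."""
    match = re.search(r"\\boxed\{([^}]*)\}", text)
    return match.group(1).strip() if match else None

def outcome_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the extracted answer matches the reference after light normalization."""
    answer = extract_answer(completion)
    if answer is None:
        return 0.0

    def normalize(s: str) -> str:
        return s.replace(" ", "").lower()

    return 1.0 if normalize(answer) == normalize(reference) else 0.0

print(outcome_reward(r"... therefore the result is \boxed{42}.", "42"))  # 1.0
```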
(3) The upper limit of process vs. outcome supervision
- The upper limit of process supervision is the human: people cannot imagine many of the possible solutions.
- The upper limit of outcome supervision is the model itself, as it can find new, unforeseen solutions.
(4) Comparison with AlphaZero: Why it works
- AlphaZero is effective because chess and Go have a clear win/loss outcome, so the reward can be computed directly from game results.
- LLMs lack such a clear signal: they can generate an unbounded number of answers without knowing whether any of them is a valid solution.
- Similar to genetic algorithms, the model could arrive at better answers through many iterations, but there is a risk of reward hacking.
(5) Process and Rule Validation in Math & Coding
- The advantage of mathematics and programming is that they have testable rules, which is why many RL approaches start there (see the sketch below).
- If the rules are not clearly defined, the model will try to “hack” them by generating formally correct but substantively incorrect answers.
- A robust evaluation method is crucial for the quality of reinforcement learning.
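For programming, the testable rules are simply executable tests: the candidate implementation is run against known cases and its pass rate becomes the reward. A minimal sketch under that assumption; in a real pipeline the generated code would be executed in a sandbox, and the test cases here are made up.

```python
def verify_code(source: str, fn_name: str, tests: list[tuple[tuple, object]]) -> float:
    """Execute a candidate implementation and return its test pass rate as the reward."""
    namespace: dict = {}
    try:
        exec(source, namespace)                     # run the generated code
        fn = namespace[fn_name]
        passed = sum(1 for args, expected in tests if fn(*args) == expected)
        return passed / len(tests)
    except Exception:
        return 0.0                                  # crash or missing function: no reward

candidate = "def add(a, b):\n    return a + b\n"
print(verify_code(candidate, "add", [((1, 2), 3), ((0, 0), 0)]))  # 1.0
```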
