Fine-tuning Guide
The Analogy: Education vs. a Cheat Sheet

- Pre-training: Primary school. Learning grammar, logic, and general world knowledge.
- Fine-tuning: Medical school. Learning domain-specific jargon, formats, and style. (Changing the brain.)
- RAG: A textbook. Looking up facts during work. (Using a tool.)
1. The Decision Tree: Do I Need Fine-tuning?
```mermaid
graph TD
    Start[I have a problem] --> Q1{Does the model lack KNOWLEDGE?}
    Q1 -- Yes --> Q2{Is the knowledge changing often?}
    Q2 -- Yes --> RAG[Use RAG]
    Q2 -- No --> Q3{Is the dataset huge?}
    Q3 -- Yes --> FT["Fine-Tuning (Knowledge Injection)"]
    Q3 -- No --> RAG
    Q1 -- No --> Q4{Does the model lack STYLE/FORMAT?}
    Q4 -- Yes --> Q5{Can prompt engineering fix it?}
    Q5 -- Yes --> Prompt[Use Few-Shot Prompting]
    Q5 -- No --> FT2["Fine-Tuning (Style Injection)"]
```
Verdict: roughly 90% of use cases call for RAG, not fine-tuning.
2. PEFT: Fine-tuning for Everyone
In the past, fully fine-tuning a 7B model required 100GB+ of VRAM, because gradients and Adam optimizer states take several times more memory than the weights themselves. Now, thanks to PEFT (Parameter-Efficient Fine-Tuning), you can do it on a gaming PC.
2.1 LoRA (Low-Rank Adaptation)
The "Post-it Note" Method | “便利贴”法
Instead of rewriting the whole textbook (the weights), we stick small post-it notes (adapter layers) on the pages. Concretely, each frozen weight matrix W gets a trainable low-rank update, W + BA, where B and A are tiny rank-r matrices. A minimal sketch follows the list.

- Original weights: frozen (don't touch).
- LoRA weights: trainable (tiny, < 1% of total size).
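A minimal sketch of attaching LoRA adapters with Hugging Face PEFT. The model ID is a placeholder, and the `target_modules` names are the common choice for Llama-style attention layers; they vary by architecture, so treat both as assumptions:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base model; its weights stay frozen (the "textbook").
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder model ID
    torch_dtype=torch.bfloat16,
)

# The "post-it notes": rank-8 adapters on the attention projections.
lora_config = LoraConfig(
    r=8,                                  # rank of the update matrices B and A
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # assumed Llama-style layer names
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% trainable
```

After training, `model.save_pretrained(...)` writes only the small adapter weights, not a full copy of the base model.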
2.2 QLoRA (Quantized LoRA)
The "Compressed Post-it Note" Method | “压缩便利贴”法
Compress the textbook to 4-bit (quantization) AND use LoRA: the frozen base weights are stored in 4-bit NF4 while the adapters train in higher precision. A sketch follows.

- Result: fine-tune a 70B model on a single 48GB GPU.
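A sketch of the QLoRA recipe with `transformers` and `bitsandbytes`: load the frozen base weights in 4-bit NF4, then attach LoRA adapters exactly as above. The model ID is a placeholder, and a CUDA GPU with `bitsandbytes` installed is assumed:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Quantize the frozen base weights to 4-bit NF4 ("compress the textbook").
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # adapters/activations compute in bf16
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",  # placeholder model ID
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training (gradient checkpointing,
# fp32 norm layers, etc.), then add the trainable LoRA layers.
base = prepare_model_for_kbit_training(base)
model = get_peft_model(base, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
```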
3. RLHF: How ChatGPT Was Made
Reinforcement Learning from Human Feedback
- SFT (Supervised Fine-Tuning): Teach the model to answer questions on curated prompt-response pairs. (The "Intern" stage.)
- Reward Modeling: Train a judge model to rate answers, learned from human preference rankings. (The "Critic" stage.)
- PPO (Optimization): The model generates answers, the judge rates them, and the model updates itself to earn higher scores; see the toy loop below. (The "Training" stage.)
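To make the PPO stage concrete, here is a deliberately tiny, self-contained toy of the generate → rate → update loop. The canned answers, hard-coded reward table, and multiplicative update are didactic stand-ins, not real PPO; in practice you would reach for a library such as TRL:

```python
import random

# Toy "policy": a distribution over three canned answers.
ANSWERS = ["dunno", "The answer is 42.", "The answer is 42, and here is why..."]
# Toy "reward model": a frozen judge that prefers helpful answers.
REWARDS = {"dunno": 0.0, "The answer is 42.": 0.5,
           "The answer is 42, and here is why...": 1.0}

weights = [1.0, 1.0, 1.0]  # unnormalized policy preferences

for step in range(500):
    answer = random.choices(ANSWERS, weights=weights, k=1)[0]  # 1. generate
    reward = REWARDS[answer]                                   # 2. judge rates it
    weights[ANSWERS.index(answer)] *= 1.0 + 0.1 * reward       # 3. reinforce winners

total = sum(weights)
print({a: round(w / total, 2) for a, w in zip(ANSWERS, weights)})
# Most of the probability mass ends up on the best-rated answer.
```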
4. Tools & Frameworks
- Hugging Face PEFT: The standard library for LoRA, prefix tuning, and other adapter methods.
- Unsloth: Optimized kernels for faster LoRA fine-tuning (the project claims 2-5x speedups); see the sketch below.
- Axolotl: All-in-one, configuration-driven fine-tuning tool.
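For a feel of the tooling, a hedged sketch of Unsloth's high-level API as shown in its README; the exact signatures and the example checkpoint name may change between versions, so treat them as assumptions:

```python
from unsloth import FastLanguageModel

# Load a 4-bit base model with Unsloth's fused kernels.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # example published checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters through Unsloth's PEFT wrapper.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```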