跳转至

AI Learning Roadmap | AI学习路线图

Safety and Alignment | 安全与对齐

wendellhua/AI-Learning-Roadmap

AI Safety & Alignment | AI 安全与对齐¶

As AI models become more powerful, ensuring they are safe, helpful, and honest is critical. This field is known as AI Alignment.

随着 AI 模型变得越来越强大，确保它们安全、有益且诚实至关重要。这个领域被称为 AI 对齐。

1. The HHH Principle | HHH 原则¶

Anthropic defines alignment using three pillars: Anthropic 用三大支柱定义对齐：

Helpful (有益): The AI should attempt to perform the task as best as it can.
AI 应尽可能好地完成任务。
Honest (诚实): The AI should give accurate information and express uncertainty when it doesn't know.
AI 应提供准确信息，并在不知道时表达不确定性（避免幻觉）。
Harmless (无害): The AI should not be offensive, discriminatory, or assist in illegal acts.
AI 不应具有冒犯性、歧视性或协助非法行为。

2. Common Risks | 常见风险¶

2.1 Hallucination (幻觉)¶

Definition: The model confidently generates false information.
模型自信地生成错误信息。
Mitigation: RAG (Grounding), Chain of Thought (Reasoning), RLHF.
缓解：RAG（接地）、思维链（推理）、RLHF。

2.2 Bias & Fairness (偏见与公平)¶

Definition: Models reflecting historical biases in training data (e.g., gender roles).
模型反映训练数据中的历史偏见（如性别角色）。
Mitigation: Diverse datasets, Constitutional AI.
缓解：多样化数据集、宪法 AI。

2.3 Data Privacy (数据隐私)¶

Risk: Models memorizing and leaking PII (Personally Identifiable Information).
模型记忆并泄露个人身份信息 (PII)。
Mitigation: Data sanitization, Differential Privacy.
缓解：数据清洗、差分隐私。

3. Alignment Techniques | 对齐技术¶

RLHF (Reinforcement Learning from Human Feedback): The industry standard. Humans rank outputs to guide the model.
行业标准。人类对输出排序以指导模型。
Constitutional AI (CAI): AI trains AI. The model critiques its own outputs based on a set of principles (a "Constitution").
AI 训练 AI。模型根据一套原则（“宪法”）批评自己的输出。
Red Teaming: Hiring experts to attack the model to find vulnerabilities before release.
聘请专家攻击模型，在发布前发现漏洞。