Local Model Evaluation Scheme
How do you know which model is the better fit for your specific hardware and use case? This document provides a standardized scheme for evaluating models locally.
1. Evaluation Dimensions
| Dimension | Core Metrics | Recommended Tools/Datasets | Local Tips |
|---|---|---|---|
| Base Capability | Language: Perplexity (PPL), BLEU. Knowledge: MMLU (general), C-Eval (Chinese). Logic: GSM8K (math), ARC-AGI. Coding: HumanEval. | MMLU, C-Eval, GSM8K, HumanEval | Run the standardized datasets through batch scripts and compute accuracy or match rate. |
| Performance | Throughput: tokens/s. Latency: TTFT (time to first token). Stability: context length support. | Custom scripts, vLLM, Ollama | Fix the input and output lengths, run each test multiple times, and take the average. |
| Resource | VRAM: peak usage. RAM: peak system memory. Energy: energy per token (joules/token). | nvidia-smi, htop, DCGM | Record hardware monitoring data during stress tests. |
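The performance and energy metrics in the table reduce to simple arithmetic on raw timings. The following is a minimal sketch, not a definitive harness: the `Run` record layout and its field names are hypothetical, throughput is taken as end-to-end (generated tokens over total wall time), and energy per token is taken as average watts times seconds divided by tokens. The numbers are illustrative, not measurements.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    """Raw measurements from one generation request (hypothetical layout)."""
    start_s: float        # when the request was sent
    first_token_s: float  # when the first output token arrived
    end_s: float          # when the last output token arrived
    output_tokens: int    # number of tokens generated
    avg_power_w: float    # average GPU power draw during the run, in watts


def ttft_s(run: Run) -> float:
    """Time to first token, in seconds."""
    return run.first_token_s - run.start_s


def tokens_per_s(run: Run) -> float:
    """End-to-end throughput: generated tokens over total wall time."""
    return run.output_tokens / (run.end_s - run.start_s)


def joules_per_token(run: Run) -> float:
    """Energy per token: average watts times seconds, divided by tokens."""
    return run.avg_power_w * (run.end_s - run.start_s) / run.output_tokens


def summarize(runs: list[Run]) -> dict[str, float]:
    """Average each metric over repeated runs of the same fixed workload."""
    return {
        "ttft_s": mean(ttft_s(r) for r in runs),
        "tokens_per_s": mean(tokens_per_s(r) for r in runs),
        "joules_per_token": mean(joules_per_token(r) for r in runs),
    }


# Illustrative numbers only, not real measurements.
runs = [
    Run(start_s=0.0, first_token_s=0.5, end_s=4.0,
        output_tokens=100, avg_power_w=200.0),
    Run(start_s=10.0, first_token_s=10.3, end_s=15.0,
        output_tokens=100, avg_power_w=200.0),
]
summary = summarize(runs)
print(summary)
```

Averaging per-run values keeps a single slow run visible in the spread; if decode-only throughput matters more, subtract the TTFT from the wall time before dividing.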
2. How to Execute Local Evaluation
Step 1: Prepare Environment
- Models: Ensure all models under comparison use the same quantization level (e.g., all 4-bit) so the comparison is fair.
- Hardware: Isolate the test machine and close background applications so that other processes do not interfere with the measurements.
- Tools: Install `transformers`, `vLLM`, or `OpenCompass` (a powerful automated evaluation framework).
Step 2: Automate Testing
- Base Test: Write scripts that load the MMLU/C-Eval datasets, query the model in batches, and compare its answers with the reference answers.
- Performance Test: Use a fixed prompt (e.g., "Write a 300-word essay on AI"), repeat the run several times, and record tokens/s and latency.
- Resource Monitor: Run `nvidia-smi --query-gpu=memory.used --format=csv -l 1` in the background to log VRAM usage and capture the peak.
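Two of the steps above can be sketched in a few lines: scoring multiple-choice answers and extracting peak VRAM from the monitoring log. This is a rough illustration under stated assumptions. The helper names are hypothetical, the answer extraction is a naive heuristic (first standalone uppercase letter A to D), and the log is assumed to contain a header line followed by rows such as `7830 MiB`. The sample responses and log text are made up.

```python
import re
from typing import Optional


def extract_choice(response: str) -> Optional[str]:
    """Return the first standalone uppercase A-D in a response, if any.

    Naive heuristic: a real harness would constrain the output format.
    """
    match = re.search(r"\b([A-D])\b", response)
    return match.group(1) if match else None


def accuracy(responses: list[str], gold: list[str]) -> float:
    """Fraction of responses whose extracted choice matches the gold label."""
    if not gold or len(responses) != len(gold):
        raise ValueError("responses and gold must be non-empty and equal length")
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


def peak_vram_mib(log_text: str) -> int:
    """Largest memory reading in a log whose data rows start with an integer."""
    readings = []
    for line in log_text.splitlines():
        parts = line.split()
        if parts and parts[0].isdigit():  # skips the header and blank lines
            readings.append(int(parts[0]))
    if not readings:
        raise ValueError("no memory readings found in log")
    return max(readings)


# Made-up model responses and gold labels, for illustration only.
responses = ["The answer is B.", "C", "I think (A) is right", "no idea"]
gold = ["B", "C", "D", "A"]
score = accuracy(responses, gold)

# Made-up log text in the assumed format.
sample_log = "memory.used [MiB]\n512 MiB\n7830 MiB\n7644 MiB\n"
peak = peak_vram_mib(sample_log)
print(score, peak)
```

With several GPUs the log interleaves one row per device, so this sketch reports the largest single-device reading rather than a per-device peak.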
Step 3: Analyze & Optimize
- Visualize: Summarize the results in tables or charts that compare the models across dimensions.
- Context Matters: Focus on how the models differ across scenarios. For logic tasks, prioritize accuracy (e.g., GSM8K scores) over speed. For chatbots, prioritize TTFT.
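One way to make the scenario-dependent priorities concrete is a weighted score over normalized metrics. The sketch below is only one possible approach, not a prescribed method: the weights, the min-max normalization, the metric names, and the model names `model_a` and `model_b` are all assumptions, and the result values are invented placeholders rather than benchmark results.

```python
def min_max(values: dict[str, float], higher_is_better: bool) -> dict[str, float]:
    """Scale a metric to [0, 1] across models, where 1 is always the best."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {name: 1.0 for name in values}
    scaled = {name: (v - lo) / (hi - lo) for name, v in values.items()}
    if higher_is_better:
        return scaled
    return {name: 1.0 - s for name, s in scaled.items()}


def rank(results: dict[str, dict[str, float]],
         weights: dict[str, float],
         higher_is_better: dict[str, bool]) -> list[str]:
    """Order models by weighted sum of normalized metrics, best first."""
    models = list(results)
    totals = {m: 0.0 for m in models}
    for metric, weight in weights.items():
        column = {m: results[m][metric] for m in models}
        scaled = min_max(column, higher_is_better[metric])
        for m in models:
            totals[m] += weight * scaled[m]
    return sorted(totals, key=totals.get, reverse=True)


# Placeholder values for illustration; not real benchmark results.
results = {
    "model_a": {"gsm8k_acc": 0.60, "tokens_per_s": 40.0, "ttft_s": 0.8},
    "model_b": {"gsm8k_acc": 0.45, "tokens_per_s": 70.0, "ttft_s": 0.3},
}
HIGHER_IS_BETTER = {"gsm8k_acc": True, "tokens_per_s": True, "ttft_s": False}

# Hypothetical weightings for the two scenarios named above.
logic_weights = {"gsm8k_acc": 0.7, "tokens_per_s": 0.2, "ttft_s": 0.1}
chat_weights = {"gsm8k_acc": 0.2, "tokens_per_s": 0.3, "ttft_s": 0.5}

logic_ranking = rank(results, logic_weights, HIGHER_IS_BETTER)
chat_ranking = rank(results, chat_weights, HIGHER_IS_BETTER)
print(logic_ranking, chat_ranking)
```

The same measurements produce opposite rankings under the two weightings, which is the point of fixing the use case before comparing models.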
3. Important Notes
- No "Perfect" Model: No model excels at everything. Define your primary use case (e.g., coding vs. creative writing) before you start evaluating.
- Avoid Overfitting: Check that the test data did not appear in the model's training set (data contamination); otherwise benchmark scores overstate real capability.
- Human Evaluation: For subjective qualities such as content quality, human review is still necessary.
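A crude contamination check is to measure n-gram overlap between a test item and whatever reference text is available. The sketch below assumes you have some corpus to compare against, which is often not the case for a downloaded model's training data. The function names, the whitespace tokenization, and the default of 8-grams are illustrative choices, and the sample strings are made up.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Set of lowercase whitespace-token n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(test_item: str, reference_text: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in the reference."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item_grams & ngrams(reference_text, n)) / len(item_grams)


# Made-up strings for illustration.
question = "if a train travels 60 miles in 2 hours what is its average speed"
leaked_corpus = "Practice set: " + question + " Answer: 30 mph"
clean_corpus = ("the quick brown fox jumps over the lazy dog "
                "near the quiet river bank today")

leaked_score = overlap_ratio(question, leaked_corpus)
clean_score = overlap_ratio(question, clean_corpus)
print(leaked_score, clean_score)
```

A high ratio flags an item for manual inspection; a low ratio does not prove the item is clean, since paraphrased or translated copies would not be caught by exact n-gram matching.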