A brief history of language models
🏗 Early foundation models (late 2010s)
- 2018: ELMo (LSTM-based pretraining + fine-tuning) [Peters+ 2018]
- 2018: BERT (Transformer-based pretraining + fine-tuning) [Devlin+ 2018]
- 2019: Google T5 (unified text-to-text framing) [Raffel+ 2019]
🚀 Scaling & closed models (early 2020s)
- 2019: OpenAI GPT-2 (1.5B): fluent text generation, early zero-shot behavior [Radford+ 2019]
- 2020: Scaling laws proposed, predicting the performance of larger models [Kaplan+ 2020]
- 2020: OpenAI GPT-3 (175B): in-context learning [Brown+ 2020]
- 2022: Google PaLM (540B): very large but undertrained [Chowdhery+ 2022]
- 2022: DeepMind Chinchilla (70B): compute-optimal scaling [Hoffmann+ 2022]
🌍 Open models (mid 2020s)
- 2020/2021: EleutherAI: The Pile dataset + GPT-J [Gao+ 2020] [Wang+ 2021]
- 2022: Meta OPT (175B): GPT-3 replication [Zhang+ 2022]
- 2022: Hugging Face / BigScience BLOOM: emphasis on data sourcing [Workshop+ 2022]
- 2023: Meta LLaMA series [Touvron+ 2023]
- 2024: Alibaba Qwen series [Qwen+ 2024]
- 2024: DeepSeek series [DeepSeek-AI+ 2024]
- 2024: AI2 OLMo 2 [Groeneveld+ 2024] [OLMo+ 2024]
🔓 Levels of openness
- 2023: Closed models, e.g., OpenAI GPT-4o [OpenAI+ 2023]
- 2024: Open-weight models, e.g., DeepSeek [DeepSeek-AI+ 2024]
- 2024: Open-source models (weights + data released), e.g., OLMo [Groeneveld+ 2024]
🌌 Today's frontier models (2025)
- 2025: OpenAI o3 → https://openai.com/index/openai-o3-mini/
- 2025: Anthropic Claude 3.7 Sonnet → https://www.anthropic.com/news/claude-3-7-sonnet
- 2025: xAI Grok 3 → https://x.ai/news/grok-3
- 2025: Google Gemini 2.5 → https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/
- 2025: Meta LLaMA 3.3 → https://ai.meta.com/blog/meta-llama-3/
- 2025: DeepSeek R1 → [DeepSeek-AI+ 2025]
- 2025: Alibaba Qwen 2.5 Max → https://qwenlm.github.io/blog/qwen2.5-max/
- 2025: Tencent Hunyuan-T1 → https://tencent.github.io/llm.hunyuan.T1/README_EN.html
Efficiency components
✅ Basics
- Tokenization
- Architecture
- Loss function
- Optimizer
- Learning rate
✅ Systems
- Kernels
- Parallelism
- Quantization
- Activation checkpointing
- CPU offloading
- Inference
✅ Scaling laws
- Scaling sequence
- Model complexity
- Loss metric
- Parametric form (see the sketch after this list)
✅ Data
- Evaluation
- Curation
- Transformation
- Filtering
- Deduplication
- Mixing
✅ Alignment
- Supervised fine-tuning
- Reinforcement learning
- Preference data
- Synthetic data
- Verifiers
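To make the "parametric form" item concrete, here is a minimal sketch of the Chinchilla-style scaling-law fit L(N, D) = E + A/N^alpha + B/D^beta [Hoffmann+ 2022]. The default constants are the values reported in that paper; the function name and the example call are illustrative only.

```python
def chinchilla_loss(N: float, D: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """Parametric scaling-law form L(N, D) = E + A / N**alpha + B / D**beta.

    N: number of parameters, D: number of training tokens.
    Default constants are the fit reported by Hoffmann+ 2022 (illustrative only).
    """
    return E + A / N**alpha + B / D**beta

# Example: predicted loss for a 70B-parameter model trained on 1.4T tokens
# (roughly the Chinchilla configuration).
print(chinchilla_loss(N=70e9, D=1.4e12))
```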
Tokenization
Byte-Pair Encoding (BPE) tokenizer [Sennrich+ 2015]
👉 The core idea: repeatedly find the most frequent adjacent pair of symbols, merge it into a new token, and iterate until the target vocabulary size is reached. BPE has become the default tokenization scheme for most mainstream large models today (e.g., the GPT series).
There are also approaches that skip the tokenizer entirely:
For example, the tokenizer-free methods of [Xue+ 2021] [Yu+ 2023] [Pagnoni+ 2024] [Deiseroth+ 2024] operate directly on bytes.
These methods are promising because they remove the separate tokenization step, but they have not yet been adopted in frontier models at the scale that BPE has.
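Below is a minimal, character-level sketch of the BPE training loop just described. It is a toy: real tokenizers typically start from bytes and handle pre-tokenization and special tokens, and the function name `train_bpe` and the toy corpus are made up for illustration.

```python
from collections import Counter

def train_bpe(text: str, vocab_size: int):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent symbol pair."""
    seq = list(text)          # start from characters (real tokenizers usually start from bytes)
    vocab = set(seq)
    merges = []
    while len(vocab) < vocab_size:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:          # nothing worth merging any more
            break
        new_sym = a + b
        merges.append((a, b))
        vocab.add(new_sym)
        merged, i = [], 0      # apply the merge across the whole sequence
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(new_sym)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return merges, seq

merges, segmented = train_bpe("low lower lowest low low", vocab_size=20)
print(merges)     # first merges fuse 'l'+'o', then 'lo'+'w', ...
print(segmented)  # the corpus re-segmented with the learned merges
```

On the toy string above, frequent subwords such as "low" quickly end up as single tokens, which is the behavior the merge loop is designed to produce.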
Architecture
Variants:
Activation functions: ReLU, SwiGLU [Shazeer 2020]
Positional encodings: sinusoidal, RoPE [Su+ 2021]
Normalization: LayerNorm, RMSNorm [Ba+ 2016] [Zhang+ 2019] (see the sketch after this list)
Placement of normalization: pre-norm versus post-norm [Xiong+ 2020]
MLP: dense, mixture of experts [Shazeer+ 2017]
Attention: full, sliding window, linear [Jiang+ 2023] [Katharopoulos+ 2020]
Lower-dimensional attention: grouped-query attention (GQA), multi-head latent attention (MLA) [Ainslie+ 2023] [DeepSeek-AI+ 2024]
State-space models: Hyena [Poli+ 2023]
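To show what two of these variants look like in code, here is a minimal PyTorch sketch of RMSNorm [Zhang+ 2019] and a SwiGLU feed-forward block [Shazeer 2020]. Module and parameter names (RMSNorm, SwiGLU, w_gate, w_up, w_down) and the dimensions in the shape check are my own choices, not taken from any particular codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Rescale by the root-mean-square of the activations; no mean-centering, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """Gated feed-forward block: SiLU-activated gate multiplied with an up-projection."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Quick shape check on random input.
x = torch.randn(2, 16, 512)
y = SwiGLU(512, 1376)(RMSNorm(512)(x))
print(y.shape)  # torch.Size([2, 16, 512])
```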
Training
Optimizer (e.g., AdamW, Muon, SOAP)
Learning rate schedule (e.g., cosine, WSD; a sketch follows below)
Batch size (e.g., critical batch size)
Regularization (e.g., dropout, weight decay)
Hyperparameters (number of heads, hidden dimension): grid search
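As a sketch of the learning-rate schedules mentioned above, the following contrasts cosine decay with a warmup-stable-decay (WSD) schedule. The warmup length, minimum learning rate, and decay fraction are illustrative defaults, not values from any specific paper.

```python
import math

def cosine_schedule(step: int, max_steps: int, lr_max: float,
                    lr_min: float, warmup: int) -> float:
    """Linear warmup, then cosine decay from lr_max down to lr_min."""
    if step < warmup:
        return lr_max * step / max(warmup, 1)
    progress = (step - warmup) / max(max_steps - warmup, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

def wsd_schedule(step: int, max_steps: int, lr_max: float,
                 warmup: int, decay_frac: float = 0.1) -> float:
    """Warmup-Stable-Decay: warmup, long constant plateau, short linear decay to 0."""
    decay_start = int(max_steps * (1 - decay_frac))
    if step < warmup:
        return lr_max * step / max(warmup, 1)
    if step < decay_start:
        return lr_max
    return lr_max * (max_steps - step) / max(max_steps - decay_start, 1)

# Example: compare the two schedules at a few points in a 10k-step run.
for s in (0, 500, 5000, 9500, 9999):
    print(s,
          round(cosine_schedule(s, 10_000, 3e-4, 3e-5, 1000), 6),
          round(wsd_schedule(s, 10_000, 3e-4, 1000), 6))
```

The practical appeal of WSD is the long constant plateau: intermediate checkpoints taken during the stable phase can later be decayed briefly to get a usable model, whereas a cosine schedule ties the whole run to a fixed total step count.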