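What Are Logits? A Complete Look at the Key to Large-Model Outputs
In deep learning, logits are the raw output values of a model's final layer (usually a fully connected layer), before any normalization. Logits are an important concept because they are the basis from which the model derives its final prediction, and they determine the model's "confidence" in, or "preference" for, each class.
In this post we will look at:
- The definition and role of logits
- The relationship between logits and Softmax
- A practical example: from logits to a probability distribution
- Use cases and caveats for logits
- Summary and insights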
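1. Definition and Role of Logits
What are logits?
Logits are the raw values output by the last layer of a deep learning model during prediction. They usually form an unnormalized vector of real numbers, one value per class. A logit can be positive, negative, very large, or very small.
- Shape: if the model has $C$ classes and receives one sample, the logits have shape $[C]$.
- Properties:
- Logits have no probabilistic meaning.
- They may be positive or negative, and may fall well outside an intuitive range (for example -1000 or 1000).
Role
Logits are an intermediate result of prediction. They are not the final prediction themselves; an activation function such as Softmax normalizes them into a probability distribution.
The training objective (for example cross-entropy loss) is computed directly from the logits or from their normalized form.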
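To make "computed directly from the logits" concrete, here is a minimal sketch in plain Python. The function name `cross_entropy_from_logits` and the sample values are illustrative assumptions, not part of any particular framework. It relies on the identity that the cross-entropy for one sample equals the log-sum-exp of the logits minus the logit of the true class.

```python
import math

def cross_entropy_from_logits(logits, target):
    """Cross-entropy for one sample, computed without forming probabilities.

    Uses loss = logsumexp(logits) - logits[target], which equals
    -log(softmax(logits)[target]).
    """
    m = max(logits)  # shift by the max so exp() cannot overflow
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Hypothetical logits for three classes; the true class is index 0.
loss = cross_entropy_from_logits([2.0, 1.0, 0.1], target=0)
print(round(loss, 3))
```

The loss shrinks as the true class's logit rises relative to the others, which is why training pushes that logit up.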
2. The Relationship Between Logits and Softmax
The Softmax function
Softmax is an activation function that converts logits into a probability distribution. It is defined as:
$$P(y_i) = \frac{\exp(z_i)}{\sum_{j=1}^{C} \exp(z_j)}$$
where:
- $z_i$: the $i$-th logit.
- $C$: the number of classes.
- $P(y_i)$: the normalized probability, that is, the model's confidence in class $i$, satisfying:
$$\sum_{i=1}^{C} P(y_i) = 1$$
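The definition translates almost line for line into code. The sketch below uses plain Python with illustrative input values. It is the direct form only and does not guard against overflow when logits are very large.

```python
import math

def softmax(logits):
    """Direct translation of the definition: exp(z_i) / sum_j exp(z_j)."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for C = 3 classes.
probs = softmax([1.0, 2.0, 3.0])
print([round(p, 4) for p in probs])
print(sum(probs))  # equals 1 up to floating-point rounding
```

Each output lies between 0 and 1, and the largest logit receives the largest probability.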
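Logits versus probabilities
Logits (unnormalized):
- Range: any real number.
- No probabilistic meaning.
- Only express the model's "preference" for a class.
Softmax output (normalized):
- Range: [0, 1].
- Satisfies the properties of a probability distribution: all values sum to 1.
- Each value is the model's predicted probability for that class.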
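3. Practical Example: From Logits to a Probability Distribution
Suppose a text classification task in which the model must assign an input sentence to one of three classes:
- Class A: News
- Class B: Entertainment
- Class C: Technology
Input: the sentence "The new smartphone has amazing features."
The model's logits output:
$$\text{logits} = [2.0, 1.0, 0.1]$$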
Step 1: compute Softmax
We pass the logits through Softmax to obtain a probability distribution:
$$P(A) = \frac{\exp(2.0)}{\exp(2.0) + \exp(1.0) + \exp(0.1)}$$
$$P(B) = \frac{\exp(1.0)}{\exp(2.0) + \exp(1.0) + \exp(0.1)}$$
$$P(C) = \frac{\exp(0.1)}{\exp(2.0) + \exp(1.0) + \exp(0.1)}$$
Compute each exponential:
$$\exp(2.0) \approx 7.39, \quad \exp(1.0) \approx 2.72, \quad \exp(0.1) \approx 1.11$$
Normalize:
$$P(A) = \frac{7.39}{7.39 + 2.72 + 1.11} \approx 0.66$$
$$P(B) = \frac{2.72}{7.39 + 2.72 + 1.11} \approx 0.24$$
$$P(C) = \frac{1.11}{7.39 + 2.72 + 1.11} \approx 0.10$$
Final probability distribution:
$$P = [0.66, 0.24, 0.10]$$
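Interpretation
- Class A (News) has the highest probability, so the model considers the sentence most likely to be news.
- Classes B and C have lower probabilities; the model is less confident in them.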
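4. Use Cases and Caveats for Logits
(1) Use cases
- Classification: logits are the input to cross-entropy loss, which measures the loss of the class predictions.
- Inference: you can pick the class with the largest logit directly, without computing Softmax, because Softmax does not change the ordering of the logits.
(2) Caveats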
Numerical stability:
Very large or very negative logits can cause overflow or underflow when they are exponentiated.
- Fix: subtract the maximum logit before computing Softmax:
$$P(y_i) = \frac{\exp(z_i - \max(z))}{\sum_{j=1}^{C} \exp(z_j - \max(z))}$$
This leaves the resulting probability distribution unchanged, because the common factor $\exp(-\max(z))$ cancels between numerator and denominator, but it prevents overflow.
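The effect of the max-subtraction trick is easy to see in a small sketch. It uses plain Python, and the name `stable_softmax` and the shifted logits are illustrative assumptions. The logits below have the same gaps as [2.0, 1.0, 0.1] but are shifted up by roughly 1000, where a direct `exp` no longer fits in a float.

```python
import math

def stable_softmax(logits):
    """Softmax with the maximum logit subtracted before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # every exponent is <= 0
    total = sum(exps)                         # >= 1, since exp(0) = 1 is included
    return [e / total for e in exps]

big = [1000.0, 999.0, 998.1]

try:
    math.exp(big[0])
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True  # the direct definition fails here

stable = stable_softmax(big)
print(naive_overflowed, [round(p, 2) for p in stable])
```

After the shift the largest exponent is exactly 0, so the denominator is at least 1 and the division is always well defined.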
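Contrastive loss:
In some more advanced tasks, logits are used to compute contrastive-learning losses that directly compare the similarity of different samples.
Interpretability:
Although logits are not probabilities, they can be read as the model's "raw signal": the higher the value, the stronger the model's preference for that class.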
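5. Summary and Insights
Logits are the model's core intermediate output:
They directly reflect the model's preference for each class but are not normalized into probabilities.
Softmax converts logits into a probability distribution:
It gives a clear probabilistic reading that makes evaluation and decision-making easier.
Numerical stability and efficient computation are key:
Sensible numerical handling, such as subtracting the maximum, keeps the computation stable.
Choices in practice:
At inference time you can predict the class with the largest logit directly and skip the extra Softmax cost.
Understanding logits and their relationship to Softmax helps us optimize models and design systems more efficiently in practice.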
What Are Logits? A Comprehensive Guide
In deep learning, logits refer to the raw output values produced by the final layer of a model before applying any normalization, such as Softmax. Understanding logits is crucial as they serve as the foundation for transforming model outputs into probabilities and making predictions.
This blog covers:
- Definition and Purpose of Logits
- Relationship Between Logits and Softmax
- Practical Example: From Logits to Probability Distributions
- Applications and Considerations of Logits
- Key Takeaways and Insights
1. Definition and Purpose of Logits
What Are Logits?
Logits are the raw scores output by the model’s final layer (often a dense layer). They represent the unnormalized confidence of the model for each class. Logits are not probabilities—they can be positive, negative, or extremely large/small numbers.
- Shape: If the model predicts $C$ classes for a single input, the logits are a vector of shape $[C]$.
- Properties:
- They do not satisfy the constraints of probabilities (e.g., summing to 1 or being between 0 and 1).
- They are a direct representation of the model’s tendency or “preference” for each class.
Why Are Logits Important?
Logits are an intermediate result in a model's prediction pipeline. An activation function such as Softmax transforms them into probabilities, and those probabilities are then used for decision-making or loss computation.
2. Relationship Between Logits and Softmax
The Softmax Function
Softmax is a mathematical function that transforms logits into probabilities. It is defined as:
$$P(y_i) = \frac{\exp(z_i)}{\sum_{j=1}^{C} \exp(z_j)}$$
Where:
- $z_i$: Logit for class $i$.
- $C$: Total number of classes.
- $P(y_i)$: Normalized probability for class $i$, satisfying:
$$\sum_{i=1}^{C} P(y_i) = 1$$
Difference Between Logits and Probabilities
Logits (Unnormalized Scores):
- Range: Any real number, $(-\infty, +\infty)$.
- No probabilistic interpretation.
Softmax Output (Probabilities):
- Range: $[0, 1]$.
- Represents the model’s confidence in each class, summing to 1.
3. Practical Example: From Logits to Probabilities
Let’s take a text classification task where the model predicts the category of a sentence. Suppose the task has three categories:
- Class A: News
- Class B: Entertainment
- Class C: Technology
Input: The sentence “The new smartphone has amazing features.”
Model’s Logits Output:
$$\text{logits} = [2.0, 1.0, 0.1]$$
Step 1: Apply Softmax
To convert logits to probabilities, apply the Softmax function:
$$P(A) = \frac{\exp(2.0)}{\exp(2.0) + \exp(1.0) + \exp(0.1)}$$
$$P(B) = \frac{\exp(1.0)}{\exp(2.0) + \exp(1.0) + \exp(0.1)}$$
$$P(C) = \frac{\exp(0.1)}{\exp(2.0) + \exp(1.0) + \exp(0.1)}$$
Step 2: Compute the Exponentials
$$\exp(2.0) \approx 7.39, \quad \exp(1.0) \approx 2.72, \quad \exp(0.1) \approx 1.11$$
Step 3: Normalize
The total sum of exponentials is:
$$\text{sum} = 7.39 + 2.72 + 1.11 \approx 11.22$$
The probabilities for each class are:
$$P(A) = \frac{7.39}{11.22} \approx 0.66$$
$$P(B) = \frac{2.72}{11.22} \approx 0.24$$
$$P(C) = \frac{1.11}{11.22} \approx 0.10$$
Interpretation
- The model assigns the highest probability to Class A (News), indicating the sentence is most likely about news.
- Class B (Entertainment) and Class C (Technology) have lower probabilities, reflecting weaker confidence in those predictions.
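The three steps can be checked with a few lines of plain Python. This is only a sketch of the arithmetic above, and the `labels` list is added for illustration. Note that the unrounded sum of exponentials is about 11.21, while adding the already-rounded values gives 11.22; the probabilities agree to two decimal places either way.

```python
import math

labels = ["News", "Entertainment", "Technology"]
logits = [2.0, 1.0, 0.1]

exps = [math.exp(z) for z in logits]   # Step 2: exponentials
total = sum(exps)                      # Step 3: normalizer
probs = [e / total for e in exps]      # Step 3: probabilities

for label, z, e, p in zip(labels, logits, exps, probs):
    print(f"{label:<14} logit={z:4.1f}  exp={e:5.2f}  prob={p:.2f}")
print(f"sum of exponentials = {total:.2f}")
```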
4. Applications and Considerations of Logits
Applications
Classification Tasks:
Logits are used as input to loss functions like cross-entropy, which compares the logits (or their normalized probabilities) with ground truth labels.
Inference:
During inference, instead of computing probabilities, we can directly use the index of the largest logit value for the predicted class:
$$\text{Predicted Class} = \arg\max(\text{logits})$$
This avoids unnecessary computation and yields the same predicted class as taking the arg max of the Softmax output, because Softmax preserves the ordering of the logits.
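A short sketch shows why the shortcut is safe. The helper `argmax` is a hypothetical convenience function written here for illustration. Because `exp` is strictly increasing and every class shares the same denominator, the ranking of the classes is the same before and after Softmax.

```python
import math

def argmax(values):
    """Index of the largest element (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [2.0, 1.0, 0.1]

exps = [math.exp(z) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]

pred_from_logits = argmax(logits)  # no Softmax needed
pred_from_probs = argmax(probs)    # same answer, more work
print(pred_from_logits, pred_from_probs)
```

The shortcut only covers the predicted class; if a calibrated confidence value is also needed, the Softmax step still has to be computed.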
Considerations
Numerical Stability:
Logits that are very large or very negative can cause numerical overflow or underflow during Softmax computation. To mitigate this, subtract the maximum logit from all logits before applying Softmax:
$$P(y_i) = \frac{\exp(z_i - \max(z))}{\sum_{j=1}^{C} \exp(z_j - \max(z))}$$
This adjustment ensures stable calculations without affecting the final probabilities, since the common factor $\exp(-\max(z))$ cancels between numerator and denominator.
Gradient Behavior:
The magnitude of logits affects gradients during backpropagation, influencing model training dynamics. Proper initialization and regularization can help manage this.
Interpretability:
Logits are not human-readable probabilities but provide insight into how confident the model is about different classes before normalization.
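To illustrate the point about gradient behavior, the sketch below uses the standard result that, for Softmax followed by cross-entropy, the gradient of the loss with respect to the logits is the predicted probabilities minus the one-hot label. The function names and the scaled-up logits are illustrative assumptions.

```python
import math

def softmax(logits):
    m = max(logits)  # max-subtraction for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_wrt_logits(logits, target):
    """Gradient of Softmax cross-entropy w.r.t. the logits: p - one_hot(target)."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

moderate = grad_wrt_logits([2.0, 1.0, 0.1], target=0)   # moderately confident, correct
saturated = grad_wrt_logits([20.0, 10.0, 1.0], target=0)  # very confident, correct
wrong = grad_wrt_logits([20.0, 10.0, 1.0], target=1)      # very confident, wrong

for name, g in [("moderate", moderate), ("saturated", saturated), ("wrong", wrong)]:
    print(name, [round(x, 4) for x in g])
```

With large-magnitude logits, a correct prediction produces almost no gradient, while a confidently wrong one produces a gradient close to the maximum possible size. This is one way logit scale shapes training dynamics.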
5. Key Takeaways and Insights
Logits Are the Raw Model Outputs:
They represent unnormalized scores indicating the model’s inclination toward different classes.
Softmax Converts Logits to Probabilities:
This transformation is essential for interpreting model predictions and training with probability-based loss functions.
Numerical Stability Is Critical:
Subtracting the maximum logit during Softmax computation avoids overflow and ensures robust results.
Efficiency in Inference:
For classification tasks, the maximum logit directly gives the predicted class, eliminating the need for Softmax in inference pipelines.
By understanding logits and their transformation into probabilities, you gain deeper insights into the inner workings of deep learning models and how they make predictions. With practical examples and careful considerations, logits can be harnessed effectively for various machine learning tasks.
Afterword
Completed in Shanghai at 21:02 on December 13, 2024, with the assistance of the GPT-4o large model.