Machine Learning: Logistic Regression - EW帮帮网

I. Introduction

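Despite the word "regression" in its name, logistic regression is a classic machine learning algorithm that is widely used for binary classification. The name comes from the logistic function (also called the sigmoid function) that it uses to model the probability of a binary outcome. In essence, it extends linear regression by passing the linear output through this function.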

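The core idea of logistic regression is to map the output of a linear model to a probability between 0 and 1 through the sigmoid function. Specifically:

First, compute a linear combination of the features: z = w0 + w1*x1 + ... + wn*xn
Then transform it with the sigmoid function: σ(z) = 1 / (1 + e^(-z))
The result σ(z) is the predicted probability that the sample belongs to the positive class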
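The two steps above can be sketched in a few lines of NumPy. The weights and the sample below are made-up numbers chosen only to make the arithmetic easy to follow; they do not come from a trained model.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and one sample with two features (illustration only).
w0 = -1.0                    # intercept
w = np.array([2.0, -0.5])    # feature weights w1, w2
x = np.array([1.5, 2.0])     # feature values x1, x2

z = w0 + np.dot(w, x)        # linear combination: -1.0 + 3.0 - 1.0 = 1.0
p = sigmoid(z)               # probability of the positive class
print(f"z = {z:.2f}, p = {p:.4f}")
```

A score of z = 0 maps to exactly 0.5, which is why the sign of z and the 0.5 probability threshold describe the same decision.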

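In a binary classification problem, the model outputs a probability between 0 and 1 that the sample belongs to the positive class. A decision threshold is then needed to convert that probability into a final class prediction.

The most common default threshold is 0.5, with the following decision rule:

If the predicted probability is ≥ 0.5, the sample is classified as positive
If the predicted probability is < 0.5, the sample is classified as negative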

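For example:

A sample with a predicted probability of 0.7 is classified as positive
A sample with a predicted probability of 0.3 is classified as negative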
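As a small sketch of this rule, the snippet below applies a 0.5 threshold to a handful of hypothetical probabilities. The values are invented for illustration, and 1 stands for the positive class.

```python
import numpy as np

# Hypothetical predicted probabilities for four samples.
proba = np.array([0.7, 0.3, 0.5, 0.49])
threshold = 0.5

# 1 = positive, 0 = negative; ">=" matches the decision rule stated above.
labels = (proba >= threshold).astype(int)
print(labels)
```

Changing `threshold` is all it takes to make the classifier stricter or more lenient; the fitted model itself stays the same.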

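In practice, the threshold can be adjusted to fit the application:

Medical diagnosis may use a lower threshold such as 0.3 to increase sensitivity (fewer false negatives)
Spam filtering may use a higher threshold such as 0.8 to increase precision (fewer false positives)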

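Adjusting the threshold changes the model's performance metrics:

Raising the threshold usually increases precision but lowers recall
Lowering the threshold increases recall but usually lowers precision

A common way to evaluate a model across different thresholds is to plot the ROC curve or the precision-recall curve.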
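To make the trade-off concrete, here is a minimal sketch that evaluates precision and recall at three thresholds. It assumes a synthetic dataset from `make_classification`, so the numbers it prints say nothing about any real problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary data, generated only for illustration.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]   # probability of class 1

thresholds = [0.3, 0.5, 0.8]
precisions, recalls = [], []
for t in thresholds:
    y_hat = (proba >= t).astype(int)
    precisions.append(precision_score(y_te, y_hat, zero_division=0))
    recalls.append(recall_score(y_te, y_hat))
    print(f"threshold={t:.1f}  precision={precisions[-1]:.3f}  recall={recalls[-1]:.3f}")
```

Recall can never go up when the threshold rises, because the set of predicted positives only shrinks. Precision tends to rise, but that is a tendency rather than a guarantee.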

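In practice, logistic regression has the following typical characteristics:

Computationally efficient and fast to train
Outputs can be interpreted as probabilities
L1/L2 regularization can be added to prevent overfitting
With feature engineering (for example polynomial or interaction terms) it can model non-linear decision boundaries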
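The last point can be illustrated with a small sketch. It assumes a synthetic "two rings" dataset from `make_circles` and uses `PolynomialFeatures` as one possible form of feature engineering; other transformations would work as well.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two concentric rings: no straight line separates the classes.
X, y = make_circles(n_samples=500, factor=0.5, noise=0.05, random_state=0)

# Logistic regression on the raw coordinates.
linear = LogisticRegression().fit(X, y)

# Same model, but fed degree-2 terms (x1^2, x1*x2, x2^2, ...).
poly = make_pipeline(PolynomialFeatures(degree=2),
                     LogisticRegression(max_iter=1000)).fit(X, y)

linear_acc = linear.score(X, y)
poly_acc = poly.score(X, y)
print(f"raw features: {linear_acc:.3f}, degree-2 features: {poly_acc:.3f}")
```

The model is still linear in its inputs; the boundary only looks curved because the inputs now include squared terms.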

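Typical application scenarios include:

Financial risk control: predicting the probability of loan default
Medical diagnosis: estimating the likelihood of a disease
Marketing: predicting a customer's propensity to buy
Spam filtering: identifying spam email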

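Note that although standard logistic regression is a binary classifier, it can be extended to multiclass problems in the following ways:

One-vs-Rest (OvR) strategy
Multinomial logistic regression (softmax regression)
One-vs-One (OvO) strategy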
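The three strategies can be compared with a short sketch on the iris dataset (3 classes). It assumes the generic wrappers in `sklearn.multiclass` for OvR and OvO, and relies on recent scikit-learn versions fitting a multinomial (softmax) model by default when the lbfgs solver sees more than two classes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)   # 3 classes, 4 features

base = LogisticRegression(max_iter=1000)

softmax_clf = LogisticRegression(max_iter=1000).fit(X, y)  # one joint softmax model
ovr_clf = OneVsRestClassifier(base).fit(X, y)              # one binary model per class
ovo_clf = OneVsOneClassifier(base).fit(X, y)               # one binary model per pair of classes

print("softmax accuracy:", softmax_clf.score(X, y))
print("OvR accuracy:    ", ovr_clf.score(X, y), "models:", len(ovr_clf.estimators_))
print("OvO accuracy:    ", ovo_clf.score(X, y), "models:", len(ovo_clf.estimators_))
```

With k classes, OvR trains k binary models and OvO trains k(k-1)/2, so OvO becomes expensive as the number of classes grows.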

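Because it is simple, efficient, and highly interpretable, logistic regression remains a first choice for many classification tasks in practice, especially when probability outputs are needed or model interpretability matters.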

II. Using Logistic Regression

1. The LogisticRegression class signature

class sklearn.linear_model.LogisticRegression(
    penalty='l2',
    dual=False,
    tol=0.0001,
    C=1.0,
    fit_intercept=True,
    intercept_scaling=1,
    class_weight=None,
    random_state=None,
    solver='lbfgs',
    max_iter=100,
    multi_class='auto',
    verbose=0,
    warm_start=False,
    n_jobs=None,
    l1_ratio=None
)

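2. Core parameters in detail

(1) Regularization parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `penalty` | str | Regularization type: 'l1', 'l2', 'elasticnet' or 'none' (recent scikit-learn versions expect `None` rather than the string 'none') |
| `C` | float | Inverse of the regularization strength; must be positive, and smaller values mean stronger regularization |
| `l1_ratio` | float | Elastic-net mixing ratio (0 to 1), used only when `penalty='elasticnet'` |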
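Because `C` is easy to read the wrong way round, a quick sketch may help. It assumes synthetic data and the default L2 penalty, and simply reports the length of the weight vector for three values of `C`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, generated only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))   # Euclidean length of the weights
    print(f"C={C:<6} ||w|| = {coef_norms[C]:.3f}")
```

A small `C` shrinks the weights toward zero (strong regularization), while a large `C` lets them grow toward the unregularized fit.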

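(2) Algorithm control parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `solver` | str | Optimization algorithm: 'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga' |
| `max_iter` | int | Maximum number of iterations |
| `tol` | float | Tolerance for the stopping criterion |
| `dual` | bool | Dual or primal formulation; the dual form is only available for liblinear with the L2 penalty |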

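(3) Multiclass parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `multi_class` | str | 'auto', 'ovr' (one-vs-rest) or 'multinomial' (softmax); deprecated in recent scikit-learn releases |
| `n_jobs` | int | Number of CPU cores used; it only takes effect when parallelizing over classes in one-vs-rest multiclass problems |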

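(4) Other parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `class_weight` | dict/str | Class weights: None, 'balanced', or a dict of the form {class_label: weight} |
| `random_state` | int | Random seed, used to shuffle the data when solver is 'sag', 'saga' or 'liblinear' |
| `warm_start` | bool | Whether to reuse the solution of the previous call to fit as the starting point |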
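As an illustration of `class_weight`, the sketch below compares the default setting with `'balanced'` on an imbalanced problem. The data is synthetic (roughly 5% positives, an assumption made for the example), so only the direction of the effect is meaningful.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 5% positives, generated only for illustration.
X, y = make_classification(n_samples=4000, n_features=10, weights=[0.95, 0.05],
                           class_sep=0.8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_balanced = recall_score(y_te, balanced.predict(X_te))
print(f"minority recall, default weights:  {recall_plain:.3f}")
print(f"minority recall, balanced weights: {recall_balanced:.3f}")
```

`'balanced'` weights each class inversely to its frequency, which typically trades some precision for better recall on the rare class.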

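3. Attributes

After training, the model exposes the following attributes:

| Attribute | Description |
| --- | --- |
| `classes_` | Class labels known to the classifier |
| `coef_` | Feature coefficients (weights) |
| `intercept_` | Intercept (bias) of the decision function |
| `n_iter_` | Actual number of iterations run |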
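A quick way to get a feel for these attributes is to fit a model and print them. The sketch below assumes a small synthetic binary dataset with four features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data with 4 features, generated only for illustration.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

print(clf.classes_)          # class labels seen during fit
print(clf.coef_.shape)       # one row of weights for a binary problem
print(clf.intercept_.shape)  # one intercept for a binary problem
print(clf.n_iter_)           # iterations the solver actually ran
```

The trailing underscore is scikit-learn's convention for values that only exist after `fit` has been called.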

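4. Main methods

(1) Training the model

fit(X, y, sample_weight=None)

X: training data, shape (n_samples, n_features)
y: target values, shape (n_samples,)
sample_weight: per-sample weights, shape (n_samples,)

(2) Prediction

predict(X)

Returns the predicted class labels

(3) Probability prediction

predict_proba(X)

Returns the probability estimate for each class, shape (n_samples, n_classes)

(4) Decision function

decision_function(X)

Returns a confidence score for each sample, w·x + b, which is proportional to the signed distance from the sample to the decision hyperplane

(5) Scoring

score(X, y, sample_weight=None)

Returns the mean accuracy of the predictions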
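How these methods relate to each other is easiest to see by checking them against one another. The sketch below assumes a synthetic binary problem with class labels 0 and 1; the identities it checks are specific to the binary case.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data, generated only for illustration.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

scores = clf.decision_function(X)   # w·x + b for each sample
proba = clf.predict_proba(X)        # columns follow the order of clf.classes_
pred = clf.predict(X)

# The positive-class probability is the sigmoid of the decision score.
sigmoid_scores = 1.0 / (1.0 + np.exp(-scores))
print(np.allclose(sigmoid_scores, proba[:, 1]))

# predict() returns class 1 exactly when the score is positive.
print(np.array_equal(pred, (scores > 0).astype(int)))

# score() is the fraction of correct predictions.
print(np.isclose(clf.score(X, y), np.mean(pred == y)))
```

In other words, `predict` is `predict_proba` with a fixed 0.5 threshold, which is why custom thresholds are applied to the probabilities rather than passed to `predict`.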

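5. Solver selection guide

| solver | Supported penalties | Multiclass | Large datasets | Notes |
| --- | --- | --- | --- | --- |
| 'liblinear' | L1, L2 | OvR only | Not suitable | Good choice for small datasets |
| 'newton-cg' | L2 | Supported | Not suitable | Needs Hessian computations |
| 'lbfgs' | L2 | Supported | Suitable | Default choice |
| 'sag' | L2 | Supported | Suitable | Stochastic average gradient descent |
| 'saga' | L1, L2, Elastic-Net | Supported | Suitable | Extension of sag; supports all penalties |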
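As a rough sanity check of the table, the sketch below fits the same L2-regularized model with each solver. It assumes synthetic data and standardizes the features first, since sag and saga converge much faster on scaled inputs.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, well-separated binary data, generated only for illustration.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_clusters_per_class=1, class_sep=2.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

solver_scores = {}
for solver in ["liblinear", "newton-cg", "lbfgs", "sag", "saga"]:
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(solver=solver, max_iter=2000, random_state=0),
    )
    model.fit(X_tr, y_tr)
    solver_scores[solver] = model.score(X_te, y_te)
    print(f"{solver:<10} accuracy = {solver_scores[solver]:.3f}")
```

All five solvers minimize essentially the same objective here, so the accuracies should be close; the practical differences are in speed, memory use, and which penalties each one supports.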

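A minimal usage example. The code assumes that datingTestSet2.txt is a whitespace-separated numeric file whose last column is the class label:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the data: every column but the last is a feature, the last is the label
data = np.loadtxt('datingTestSet2.txt')

# Optional: split the rows by class label
# data_1 = data[data[:,-1]==1]
# data_2 = data[data[:,-1]==2]
# data_3 = data[data[:,-1]==3]

X = data[:, :-1]
Y = data[:, -1]

# Random 70/30 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=100)

# Train with strong regularization (small C)
lr = LogisticRegression(C=0.01)
lr.fit(X_train, y_train)

# Predict on the test set and report the accuracy
y_pred = lr.predict(X_test)
score = lr.score(X_test, y_test)
print(score)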

III. Example

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Plot a confusion matrix
def polt_matrix(x, y):  # x: true labels, y: predicted labels
    from sklearn.metrics import confusion_matrix
    import matplotlib.pyplot as plt

    cm = confusion_matrix(x, y)
    plt.matshow(cm, cmap=plt.cm.Blues)
    plt.colorbar()
    # Write each count in the centre of its cell
    for i in range(len(cm)):
        for j in range(len(cm)):
            plt.annotate(str(cm[i, j]), xy=(j, i), horizontalalignment='center', verticalalignment='center')
    # Axis labels only need to be set once, outside the loops
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    return plt
# Bar chart of the number of samples per class
def plot_z(data):
    import matplotlib.pyplot as plt

    labels_count = data['Class'].value_counts()
    print(labels_count)
    ax = labels_count.plot(kind='bar')
    ax.set_title('Positive vs. negative sample counts', fontsize=14)
    ax.set_xlabel('Class', fontsize=12)
    ax.set_ylabel('Count', fontsize=12)
    plt.show()
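# Data preprocessing
data = pd.read_csv('creditcard.csv')

# Z-score standardization of the 'Amount' column
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data['Amount'] = scaler.fit_transform(data[['Amount']])  # double brackets: fit_transform expects 2-D input
data = data.drop('Time', axis=1)

# Separate features and labels for the full dataset (the train/test split happens below)
X_train = data.drop('Class', axis=1)
Y_train = data['Class']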


# Alternative: plain random split before training
# x_train, x_test, y_train, y_test = \
#     train_test_split(X_train, Y_train, test_size=0.3, random_state=100)

# Alternative: under-sampling to handle class imbalance
# x_train['Class'] = y_train
# data = x_train
#
# negative_data = data[data['Class'] == 0]  # majority class
# positive_data = data[data['Class'] == 1]  # minority class
# # Randomly keep as many majority rows as there are minority rows
# negative_data = negative_data.sample(len(positive_data))
#
# data_c = pd.concat([positive_data, negative_data])
#
# x_train = data_c.drop('Class', axis=1)
# y_train = data_c['Class']
# plot_z(data_c)
# Over-sampling to handle class imbalance
x_train, x_test, y_train, y_test = \
    train_test_split(X_train, Y_train, test_size=0.2, random_state=0)
# SMOTE synthesizes new minority-class samples; it is applied to the training split only
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=0)

x_train, y_train = smote.fit_resample(x_train, y_train)
# x_train['Class'] = y_train
# data_c = x_train
# plot_z(data_c)

# x_train,x_test,y_train,y_test= \
#  train_test_split(x_train, y_train, test_size=0.2, random_state=0)

# Choose the regularization parameter C by cross-validation
from sklearn.model_selection import cross_val_score

scores = []
c_param_range = [0.01, 0.1, 1, 10, 100]  # candidate values of C
for c_param in c_param_range:
    lr = LogisticRegression(C=c_param, penalty='l2', solver='lbfgs', max_iter=1000)
    # 8-fold cross-validation on the training set, scored by recall
    score = cross_val_score(lr, x_train, y_train, cv=8, scoring='recall')
    scores_m = score.mean()
    scores.append(scores_m)
    print(scores_m)

best_param = c_param_range[np.argmax(scores)]
print('Best C:', best_param)

# Refit on the whole training set with the selected C
lr = LogisticRegression(C=best_param, penalty='l2', max_iter=1000)
lr.fit(x_train, y_train)

# Confusion matrices for the training set and the test set
y_spred = lr.predict(x_train)
y_pred = lr.predict(x_test)
result = lr.score(x_test, y_test)
polt_matrix(y_train, y_spred).show()
polt_matrix(y_test, y_pred).show()

print(result)  # accuracy on the test set
from sklearn import metrics
print('Train:\n', metrics.classification_report(y_train, y_spred))
print('Test:\n', metrics.classification_report(y_test, y_pred))  # classification report for the test set

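# Sweep the decision threshold applied to the sigmoid output and compare recall
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
recalls = []
proba = lr.predict_proba(x_test)[:, 1]  # probability of class 1, computed once
for t in thresholds:
    y_spred = (proba > t).astype(int)  # 1 above the threshold, 0 otherwise
    recall = metrics.recall_score(y_test, y_spred)  # recall only
    recalls.append(recall)
    print(f'Threshold: {t}, recall: {recall}')
# Recall can only fall as the threshold rises, so this selects the lowest
# threshold among those with the highest recall; check precision as well before adopting it
best_thresholds = thresholds[np.argmax(recalls)]
print(f'Best threshold: {best_thresholds}, recall: {recalls[np.argmax(recalls)]}')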
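The script keeps its under-sampling alternative commented out. As a rough, self-contained sketch of that idea, the snippet below balances a made-up DataFrame by randomly keeping as many majority rows as there are minority rows. The column names and the 950/50 split are assumptions for illustration, not properties of creditcard.csv.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical imbalanced frame: 950 rows of class 0 and 50 rows of class 1.
df = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "Class": np.r_[np.zeros(950, dtype=int), np.ones(50, dtype=int)],
})

minority = df[df["Class"] == 1]
# Randomly keep as many majority rows as there are minority rows.
majority = df[df["Class"] == 0].sample(n=len(minority), random_state=0)

# Combine and shuffle so the classes are not stored in blocks.
balanced = pd.concat([majority, minority]).sample(frac=1, random_state=0)
print(balanced["Class"].value_counts())
```

As with SMOTE, resampling of this kind is best applied to the training split only, so that the test set keeps the original class distribution.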
