About the Authors
Du Jiabao, male, School of Electronic Information, Xi'an Polytechnic University, graduate student (class of 2024).
Research interests: transformer fault early warning and detection.
Email: djb857497378@gmail.com
Wang Ziqian, male, School of Electronic Information, Xi'an Polytechnic University, graduate student (class of 2024), member of Zhang Hongwei's artificial intelligence research group.
Research interests: machine vision and artificial intelligence.
Email: 1523018430@qq.com
In this article, I will share how to classify heart disease data with a support vector machine (SVM). The workflow covers data loading, preprocessing, SMOTE oversampling, PCA dimensionality reduction, hyperparameter tuning, and the Grey Wolf Optimizer. By the end, you should see how combining these techniques can improve classification performance.
1. Installing the Required Packages
First, install the necessary Python libraries:
pip install numpy pandas scikit-learn imbalanced-learn matplotlib
These libraries provide the SVM, SMOTE oversampling, PCA, and the other data-processing tools used below.
2. The Dataset
We use the UCI Heart Disease dataset (Cleveland subset), which contains features related to heart disease such as age, sex, resting blood pressure, and cholesterol level. The target variable target has five classes (0 through 4), indicating different degrees of heart disease.
After loading the data, we split it into a feature matrix (X) and a target vector (y), and clean it by replacing the '?' placeholders for missing values with 0.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
features, target = load_data(url, columns)
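Before resampling, it is worth confirming how imbalanced the classes actually are. This quick check (the same one appears in load_data.py in Section 9) prints the per-class counts; for the Cleveland data the distribution is roughly 164/55/36/35/13 for classes 0 through 4.
import numpy as np
labels, counts = np.unique(target, return_counts=True)
print('labels:', labels, 'counts:', counts)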
3. The SVM Algorithm
Support vector machines (SVMs) are powerful classifiers, particularly well suited to high-dimensional data. The goal is to find a hyperplane that separates the classes while maximizing the margin between them. In this article we use SVC (support vector classification) to classify the heart disease data.
from sklearn.svm import SVC
svm_clf = SVC(kernel='rbf', C=1.0, gamma=0.1)
svm_clf.fit(X_train, y_train)
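Once fitted, the classifier is evaluated in the usual scikit-learn way. A minimal sketch, assuming X_test and y_test come from the train/test split shown in Section 8:
from sklearn.metrics import accuracy_score, classification_report
y_pred = svm_clf.predict(X_test)
print('accuracy:', accuracy_score(y_test, y_pred))
# per-class precision and recall are more informative than accuracy on imbalanced data
print(classification_report(y_test, y_pred))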
4. The SMOTE Algorithm
The heart disease dataset is strongly imbalanced. To address this, we use SMOTE (Synthetic Minority Over-sampling Technique), which balances the classes by generating synthetic samples for the minority classes.
from imblearn.over_sampling import SMOTE
X_resampled, y_resampled = smote(X_train, y_train, sampling_strategy={2: 55, 3: 55, 4: 55})
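Here smote is the small wrapper defined in load_data.py (Section 9), and sampling_strategy raises classes 2, 3, and 4 to 55 samples each. To verify the result, compare the class counts before and after resampling:
from collections import Counter
print('before:', Counter(y_train))
print('after: ', Counter(y_resampled))  # classes 2, 3, 4 should now have 55 samples each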
5. Grid Search
To find good SVM hyperparameters, we use GridSearchCV to search over a grid of C and gamma values and pick the best-scoring combination.
from sklearn.model_selection import GridSearchCV
gridsearch(X_train, y_train, X_test, y_test)
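The gridsearch helper is defined in utils.py (Section 9). A common alternative, sketched below, is to search C and gamma on a logarithmic grid, which covers several orders of magnitude with far fewer candidates than a fine linear grid:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
param_grid = {'C': np.logspace(-2, 2, 9),      # 0.01 ... 100
              'gamma': np.logspace(-3, 1, 9),  # 0.001 ... 10
              'kernel': ['rbf']}
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)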
6. The PCA Algorithm
Principal component analysis (PCA) is a standard dimensionality-reduction method that reduces the number of features while preserving most of the information in the data. Here we pass n_components=0.95, so PCA keeps just enough components to retain 95% of the variance.
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)  # keep enough components to retain 95% of the variance
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
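After fitting, you can check how many components were kept and how much variance they explain (a quick diagnostic, not part of the original train.py):
print('components kept:', pca.n_components_)
print('explained variance ratio:', pca.explained_variance_ratio_)
print('total variance retained:', pca.explained_variance_ratio_.sum())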
7. The Grey Wolf Optimizer
The Grey Wolf Optimizer (GWO) is a metaheuristic that mimics the hunting behavior of grey wolf packs. Here we use it to search for the best SVM hyperparameters C and gamma.
def grey_wolf_optimizer(...):
    # full implementation in Section 9 below
    return alpha_pos, alpha_score
GWO mimics the pack's leadership hierarchy: the three best wolves found so far (alpha, beta, delta) guide the movement of the rest, steering the search through the space toward the best known solutions.
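Concretely, each wolf's position X is pulled toward the three best solutions found so far. For the alpha wolf (and analogously for beta and delta), the update implemented by the code in Section 9 is

D_alpha = |C1 · X_alpha − X|,   X1 = X_alpha − A1 · D_alpha,   X_new = (X1 + X2 + X3) / 3

where A = 2a·r1 − a and C = 2·r2 for uniform random vectors r1, r2 in [0, 1], and the coefficient a decays linearly from 2 to 0 over the iterations, shifting the search from exploration toward exploitation.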
8. Implementation Walkthrough
8.1 Data Loading and Preprocessing
First, load and clean the dataset; then apply SMOTE to the training set to correct the class imbalance.
(X_train, X_test, y_train, y_test) = train_test_split(features, target, test_size=0.3, random_state=1, stratify=target)
X_train, y_train = smote(X_train, y_train, sampling_strategy={2: 38, 3: 38, 4: 38})
8.2 Standardization and PCA
Standardize the features with StandardScaler, then reduce their dimensionality with PCA.
X_train, X_test = scaler(X_train, X_test)
pca = PCA(n_components=0.95)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
8.3 Model Training and Optimization
Train the SVM and tune its C and gamma hyperparameters with the Grey Wolf Optimizer.
print(grey_wolf_optimizer(0, 100, 10000, 100, 2, X_train, y_train, X_test, y_test))
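The positional arguments map to lb=0 and ub=100 (search bounds for both C and gamma), n_wolves=10000, max_iter=100, and dim=2. Since every wolf trains one SVM per iteration, this setting is expensive; a lighter call with the same interface (pack size and iteration count chosen here purely for illustration) looks like:
best_pos, best_err = grey_wolf_optimizer(0, 100, 30, 50, 2, X_train, y_train, X_test, y_test)
print('best C:', best_pos[0], 'best gamma:', best_pos[1], 'accuracy:', 1 - best_err)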
8.4 Visualization
Use Matplotlib to plot the training data after PCA (the 3D scatter below assumes PCA retained at least three components).
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_train[:, 0], X_train[:, 1], X_train[:, 2], c=y_train, cmap='viridis', alpha=0.7)
9. Source Code
utils.py
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
def scaler(x_train, x_test):
    # Fit the scaler on the training set only, then apply the same transform to the test set
    standard_transform = StandardScaler()
    return standard_transform.fit_transform(x_train), standard_transform.transform(x_test)
def gridsearch(x_train, y_train, x_test, y_test):
    # C and gamma must be strictly positive, so start the grid just above zero.
    # Note: 100 x 100 x 2 candidates with 5-fold CV is a heavy search.
    C = np.linspace(0.01, 100, 100)
    gamma = np.linspace(0.01, 100, 100)
    param_grid = {'C': C,
                  'gamma': gamma,
                  'kernel': ['rbf', 'poly']}
    svm_clf = SVC()
    grid_search = GridSearchCV(
        estimator=svm_clf,
        param_grid=param_grid,
        n_jobs=-1,
        cv=5,
        scoring='accuracy'
    )
    grid_search.fit(x_train, y_train)
    print('Best parameters found by grid search:', grid_search.best_params_)
    print('Test-set accuracy:', grid_search.score(x_test, y_test))
def apply_pca(data, n_components):
    # Standardize the data to zero mean and unit variance
    pca_scaler = StandardScaler()
    scaled_data = pca_scaler.fit_transform(data)
    # Apply PCA
    pca = PCA(n_components=n_components)
    reduced_data = pca.fit_transform(scaled_data)
    # Explained variance ratio of the retained components
    explained_variance = pca.explained_variance_ratio_
    return reduced_data, explained_variance
def grey_wolf_optimizer(lb, ub, n_wolves, max_iter, dim, x_train, y_train, x_test, y_test):
    # Objective: the SVM's test-set error rate (to be minimized)
    def objective_function(C, gamma):
        clf = SVC(kernel='rbf', C=C, gamma=gamma, random_state=1)
        clf.fit(x_train, y_train)
        return 1 - clf.score(x_test, y_test)
    # Initialize the wolf pack at random positions in [lb, ub]^dim
    wolves = np.random.uniform(lb, ub, (n_wolves, dim))
    # Initialize alpha, beta, and delta positions and fitness values
    alpha_pos = np.zeros(dim)
    alpha_score = float('inf')
    beta_pos = np.zeros(dim)
    beta_score = float('inf')
    delta_pos = np.zeros(dim)
    delta_score = float('inf')
    # Iterative optimization
    for t in range(max_iter):
        # Evaluate the fitness of every wolf
        for i in range(n_wolves):
            wolves[i, :] = np.clip(wolves[i, :], lb, ub)  # keep wolves inside the search bounds
            C = max(float(wolves[i, 0]), 1e-3)      # C must be strictly positive
            gamma = max(float(wolves[i, 1]), 1e-3)  # gamma must be strictly positive
            fitness = objective_function(C, gamma)
            # Update alpha, beta, and delta (the three best wolves)
            if fitness < alpha_score:
                delta_score, delta_pos = beta_score, beta_pos.copy()
                beta_score, beta_pos = alpha_score, alpha_pos.copy()
                alpha_score, alpha_pos = fitness, wolves[i, :].copy()
            elif fitness < beta_score:
                delta_score, delta_pos = beta_score, beta_pos.copy()
                beta_score, beta_pos = fitness, wolves[i, :].copy()
            elif fitness < delta_score:
                delta_score, delta_pos = fitness, wolves[i, :].copy()
        # Coefficient a decays linearly from 2 to 0 (exploration -> exploitation)
        a = 2 - t * (2 / max_iter)
        # Move every wolf toward alpha, beta, and delta
        for i in range(n_wolves):
            r1, r2 = np.random.rand(dim), np.random.rand(dim)
            A1 = 2 * a * r1 - a
            C1 = 2 * r2
            D_alpha = abs(C1 * alpha_pos - wolves[i, :])
            X1 = alpha_pos - A1 * D_alpha
            r1, r2 = np.random.rand(dim), np.random.rand(dim)
            A2 = 2 * a * r1 - a
            C2 = 2 * r2
            D_beta = abs(C2 * beta_pos - wolves[i, :])
            X2 = beta_pos - A2 * D_beta
            r1, r2 = np.random.rand(dim), np.random.rand(dim)
            A3 = 2 * a * r1 - a
            C3 = 2 * r2
            D_delta = abs(C3 * delta_pos - wolves[i, :])
            X3 = delta_pos - A3 * D_delta
            # New position: the average of the three guided moves
            wolves[i, :] = (X1 + X2 + X3) / 3
        print(f"Iteration {t+1}: Best C={alpha_pos[0]}, Best gamma={alpha_pos[1]}, Best accuracy={1 - alpha_score}")
    return alpha_pos, alpha_score
load_data.py
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE
def load_data(url, columns):
    # Read the raw data
    df = pd.read_csv(url, names=columns)
    # Replace the '?' placeholders for missing values with 0
    df_cleaned = df.replace('?', 0)
    X = df_cleaned.iloc[:, :-1]  # features
    y = df_cleaned.iloc[:, -1]   # target
    return X, y

def smote(x, y, sampling_strategy, random_state=1, k_neighbors=1):
    # k_neighbors=1 because some minority classes have very few samples
    sm = SMOTE(random_state=random_state, sampling_strategy=sampling_strategy, k_neighbors=k_neighbors)
    x_resampled, y_resampled = sm.fit_resample(x, y)
    return x_resampled, y_resampled
if __name__ == '__main__':
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
    columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach",
               "exang", "oldpeak", "slope", "ca", "thal", "target"]
    features, target = load_data(url, columns)
    labels, count = np.unique(target, return_counts=True)
    print('labels', labels, ' ', 'count:', count)
    sampling_strategy = {2: 55, 3: 55, 4: 55}
    sm = SMOTE(random_state=42, sampling_strategy=sampling_strategy, k_neighbors=1)
    features_resampled, target_resampled = sm.fit_resample(features, target)
    labels, count = np.unique(target_resampled, return_counts=True)
    print('labels', labels, ' ', 'count:', count)
train.py
from sklearn.model_selection import train_test_split
from load_data import load_data, smote
from utils import scaler, gridsearch, grey_wolf_optimizer
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
features, target = load_data(url, columns)
(X_train, X_test, y_train, y_test) = train_test_split(
    features, target, test_size=0.3, random_state=1, stratify=target)
labels, count = np.unique(y_train, return_counts=True)
print('labels', labels, ' ', 'count:', count)
sampling_strategy = {2: 38, 3: 38, 4: 38}
X_train, y_train = smote(X_train, y_train, sampling_strategy)
labels, count = np.unique(y_train, return_counts=True)
print('labels', labels, ' ', 'count:', count)
X_train, X_test = scaler(X_train, X_test)
pca = PCA(n_components=0.95)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_train[:, 0], X_train[:, 1], X_train[:, 2], c=y_train, cmap='viridis', alpha=0.7)
ax.set_xlabel("Principal Component 1")
ax.set_ylabel("Principal Component 2")
ax.set_zlabel("Principal Component 3")
ax.set_title("PCA 3D Scatter Plot")
plt.show()
# Arguments: lb=0, ub=100, n_wolves=10000, max_iter=100, dim=2
print(grey_wolf_optimizer(0, 100, 10000, 100, 2, X_train, y_train, X_test, y_test))