Using SVM to Classify and Predict Heart Disease

Published: 2025-04-15

About the Authors

Du Jiabao, male, School of Electronic Information, Xi'an Polytechnic University, graduate student (class of 2024)
Research interests: transformer fault early warning and detection
Email: djb857497378@gmail.com
Wang Ziqian, male, School of Electronic Information, Xi'an Polytechnic University, graduate student (class of 2024), member of Zhang Hongwei's artificial intelligence research group
Research interests: machine vision and artificial intelligence
Email: 1523018430@qq.com

In this article, I will show how to use the Support Vector Machine (SVM) algorithm to classify heart disease data. The pipeline covers data loading, preprocessing, SMOTE oversampling, PCA dimensionality reduction, hyperparameter tuning and the Grey Wolf Optimizer. By the end, you should have a clear picture of how combining these techniques can lead to better classification results.

1. Installing the Required Packages

First, you need to install the necessary Python libraries. The main ones are:
pip install numpy pandas scikit-learn imbalanced-learn matplotlib
These libraries provide the SVM, SMOTE oversampling, PCA and the other data-processing tools used below.
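If you want to confirm that everything installed correctly before going further, a quick sanity check (nothing beyond the standard package names is assumed here):

import numpy, pandas, sklearn, imblearn, matplotlib
print(numpy.__version__, pandas.__version__, sklearn.__version__, imblearn.__version__, matplotlib.__version__)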

2. The Dataset

We use the UCI Heart Disease dataset (the processed Cleveland data), which contains features related to heart disease such as age, sex, blood pressure and cholesterol level. The target variable takes five values (0-4), representing increasing degrees of heart disease.
After loading the data, we split it into a feature matrix (X) and a target vector (y) and clean it by replacing the missing values marked with '?' by 0.

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
features, target = load_data(url, columns)
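It is worth checking the class distribution right after loading, since it motivates the oversampling step below (the same check appears in the __main__ block of load_data.py in Section 9):

import numpy as np
labels, count = np.unique(target, return_counts=True)
print('labels', labels, '  ', 'count:', count)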

3. The SVM Algorithm

The Support Vector Machine (SVM) is a powerful classification algorithm that works particularly well on high-dimensional data. Its goal is to find a hyperplane that separates the classes while maximizing the margin between them. In this article we use SVC (Support Vector Classification) from scikit-learn to classify the heart disease data.

from sklearn.svm import SVC
svm_clf = SVC(kernel='rbf', C=1.0, gamma=0.1)
svm_clf.fit(X_train, y_train)
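Once fitted, the model can be evaluated on the held-out test set. A minimal sketch, assuming X_test and y_test come from the train/test split shown in Section 8:

from sklearn.metrics import accuracy_score
y_pred = svm_clf.predict(X_test)
print('Test accuracy:', accuracy_score(y_test, y_pred))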

4. The SMOTE Algorithm

The heart disease dataset is quite imbalanced across classes. To address this, we use SMOTE (Synthetic Minority Over-sampling Technique), which balances the classes by generating synthetic samples for the minority classes.

from load_data import smote  # helper defined in load_data.py (see Section 9); it wraps imblearn's SMOTE
X_resampled, y_resampled = smote(X_train, y_train, sampling_strategy={2: 55, 3: 55, 4: 55})
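To confirm that the minority classes were actually oversampled, you can compare the class counts before and after resampling (load_data.py in Section 9 performs the same check with np.unique):

import numpy as np
print('before:', np.unique(y_train, return_counts=True))
print('after: ', np.unique(y_resampled, return_counts=True))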

5. Grid Search

To find good hyperparameters for the SVM model, we use GridSearchCV, which exhaustively evaluates combinations of C and gamma (and the kernel) and reports the best one.

from utils import gridsearch  # helper defined in utils.py (see Section 9); it wraps GridSearchCV
gridsearch(X_train, y_train, X_test, y_test)
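Under the hood, the gridsearch helper (full code in Section 9) boils down to the following call; this sketch only makes the search explicit and assumes X_train, y_train, X_test and y_test come from the split in Section 8:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {'C': np.linspace(0.01, 100, 100),      # C and gamma must be strictly positive
              'gamma': np.linspace(0.01, 100, 100),
              'kernel': ['rbf', 'poly']}
grid_search = GridSearchCV(SVC(), param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
print(grid_search.score(X_test, y_test))

Note that a 100 x 100 grid with two kernels means 20,000 candidates and 100,000 cross-validation fits; a coarser grid is usually enough to locate a promising region.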

6. The PCA Algorithm

Principal Component Analysis (PCA) is a widely used dimensionality-reduction technique: it projects the data onto fewer dimensions while keeping most of the information. In this task we set n_components=0.95, which keeps enough principal components to explain 95% of the variance.

from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
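After fitting, pca.explained_variance_ratio_ shows how much of the variance each retained component explains; a quick check:

print(pca.explained_variance_ratio_)
print('Total variance retained:', pca.explained_variance_ratio_.sum())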

7. The Grey Wolf Optimizer

The Grey Wolf Optimizer (GWO) is a metaheuristic inspired by the hunting behaviour of grey wolf packs. Here we use it to search for the best SVM hyperparameters C and gamma.

def grey_wolf_optimizer(lb, ub, n_wolves, max_iter, dim, x_train, y_train, x_test, y_test):
    # full implementation in Section 9
    return alpha_pos, alpha_score

By simulating the leadership hierarchy of a wolf pack, with the alpha, beta and delta wolves guiding the rest, GWO steers the search towards the best solutions found so far.
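For reference, the position update implemented in Section 9 follows the standard GWO equations (a brief sketch, with $T$ denoting max_iter, $r_1$, $r_2$ uniform random vectors, and $X_\alpha$, $X_\beta$, $X_\delta$ the three best wolves found so far):

$$a = 2 - t\,\frac{2}{T}, \qquad A = 2a\,r_1 - a, \qquad C = 2\,r_2$$
$$D_\alpha = \lvert C_1 X_\alpha - X \rvert, \qquad X_1 = X_\alpha - A_1 D_\alpha \quad (\text{and analogously } X_2, X_3 \text{ from } X_\beta, X_\delta)$$
$$X(t+1) = \frac{X_1 + X_2 + X_3}{3}$$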

8. Implementation Walkthrough

8.1 Data Loading and Preprocessing

First, load and clean the dataset and split it into training and test sets. Then use SMOTE to correct the class imbalance on the training set.

(X_train, X_test, y_train, y_test) = train_test_split(features, target, test_size=0.3, random_state=1, stratify=target)
X_train, y_train = smote(X_train, y_train, sampling_strategy={2: 38, 3: 38, 4: 38})

8.2 Standardization and PCA Dimensionality Reduction

Standardize the data with StandardScaler, then reduce its dimensionality with PCA.

X_train, X_test = scaler(X_train, X_test)
pca = PCA(n_components=0.95)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

8.3 Model Training and Optimization

Train the SVM and tune its hyperparameters with the Grey Wolf Optimizer. The positional arguments below are, in order: the lower and upper search bounds (0 and 100), the number of wolves (10000), the number of iterations (100) and the search dimension (2, one for C and one for gamma), followed by the training and test data.

print(grey_wolf_optimizer(0, 100, 10000, 100, 2, X_train, y_train, X_test, y_test))

8.4 Visualization

Use Matplotlib to visualize the distribution of the training data in the space of the first three principal components after PCA.

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_train[:, 0], X_train[:, 1], X_train[:, 2], c=y_train, cmap='viridis', alpha=0.7)

9. Source Code

utils.py
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA


def scaler(x_train, x_test):
    # Fit the scaler on the training data only, then apply the same transform to the test data
    standard_transform = StandardScaler()
    x_train_scaled = standard_transform.fit_transform(x_train)
    x_test_scaled = standard_transform.transform(x_test)
    return x_train_scaled, x_test_scaled

def gridsearch(x_train, y_train, x_test, y_test):
    # C and gamma must be strictly positive, so both grids start at 0.01 rather than 0
    C = np.linspace(0.01, 100, 100)
    gamma = np.linspace(0.01, 100, 100)
    param_grid = {'C': C,
                  'gamma': gamma,
                  'kernel': ['rbf', 'poly']
                  }
    svm_clf = SVC()
    grid_search = GridSearchCV(
        estimator=svm_clf,
        param_grid=param_grid,
        n_jobs=-1,
        cv=5,
        scoring='accuracy'
    )
    grid_search.fit(x_train, y_train)
    print('Best parameters found by the grid search:', grid_search.best_params_)
    print('Test set accuracy:', grid_search.score(x_test, y_test))


def apply_pca(data, n_components):

    # Standardize the data to zero mean and unit variance
    pca_scaler = StandardScaler()
    scaled_data = pca_scaler.fit_transform(data)

    # Apply PCA for dimensionality reduction
    pca = PCA(n_components=n_components)
    reduced_data = pca.fit_transform(scaled_data)

    # Proportion of variance explained by each retained component
    explained_variance = pca.explained_variance_ratio_

    return reduced_data, explained_variance
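
# Example usage (a sketch; `features` stands for the numeric feature matrix returned by
# load_data and is not defined in this file):
#   reduced_data, explained_variance = apply_pca(features, n_components=0.95)
#   print('variance retained:', explained_variance.sum())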


def grey_wolf_optimizer(lb, ub, n_wolves, max_iter, dim, x_train, y_train, x_test, y_test):
    # Objective function: 1 - test accuracy for a candidate (C, gamma), so lower is better
    def objective_function(C, gamma):
        clf = SVC(kernel='rbf', C=C, gamma=gamma, random_state=1)
        clf.fit(x_train, y_train)
        return 1 - clf.score(x_test, y_test)

    # Initialize the wolf pack with random positions in [lb, ub]
    wolves = np.random.uniform(lb, ub, (n_wolves, dim))

    # Initialize the alpha, beta and delta positions and fitness values
    alpha_pos = np.zeros(dim)
    alpha_score = float('inf')
    beta_pos = np.zeros(dim)
    beta_score = float('inf')
    delta_pos = np.zeros(dim)
    delta_score = float('inf')

    # Main optimization loop
    for t in range(max_iter):
        # Evaluate the fitness of every wolf
        for i in range(n_wolves):
            wolves[i, :] = np.clip(wolves[i, :], lb, ub)  # keep positions inside the search bounds
            C = max(float(wolves[i, 0]), 1e-3)      # C and gamma must stay strictly positive
            gamma = max(float(wolves[i, 1]), 1e-3)
            fitness = objective_function(C, gamma)

            # Update the alpha, beta and delta wolves
            if fitness < alpha_score:
                delta_score, delta_pos = beta_score, beta_pos.copy()
                beta_score, beta_pos = alpha_score, alpha_pos.copy()
                alpha_score, alpha_pos = fitness, wolves[i, :].copy()
            elif fitness < beta_score:
                delta_score, delta_pos = beta_score, beta_pos.copy()
                beta_score, beta_pos = fitness, wolves[i, :].copy()
            elif fitness < delta_score:
                delta_score, delta_pos = fitness, wolves[i, :].copy()

        # Linearly decrease the coefficient a from 2 to 0
        a = 2 - t * (2 / max_iter)

        # Update the position of every wolf
        for i in range(n_wolves):
            r1, r2 = np.random.rand(dim), np.random.rand(dim)
            A1 = 2 * a * r1 - a
            C1 = 2 * r2
            D_alpha = abs(C1 * alpha_pos - wolves[i, :])
            X1 = alpha_pos - A1 * D_alpha

            r1, r2 = np.random.rand(dim), np.random.rand(dim)
            A2 = 2 * a * r1 - a
            C2 = 2 * r2
            D_beta = abs(C2 * beta_pos - wolves[i, :])
            X2 = beta_pos - A2 * D_beta

            r1, r2 = np.random.rand(dim), np.random.rand(dim)
            A3 = 2 * a * r1 - a
            C3 = 2 * r2
            D_delta = abs(C3 * delta_pos - wolves[i, :])
            X3 = delta_pos - A3 * D_delta

            # New position: average of the moves towards alpha, beta and delta
            wolves[i, :] = (X1 + X2 + X3) / 3

        print(f"Iteration {t+1}: Best C={alpha_pos[0]}, Best gamma={alpha_pos[1]}, Best fitness={1-alpha_score}")

    return alpha_pos, alpha_score

load_data.py

import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE

def load_data(url, columns):
    # Read the raw CSV (no header row, so column names are passed explicitly)
    df = pd.read_csv(url, names=columns)
    # Replace missing values marked with '?' by 0
    df_cleaned = df.replace('?', 0)
    X = df_cleaned.iloc[:, :-1].astype(float)  # features ('ca' and 'thal' are read as strings because of the '?')
    y = df_cleaned.iloc[:, -1]                 # target (0-4)
    return X, y

def smote(x, y, sampling_strategy, random_state=1, k_neighbors=1):
    # Thin wrapper around imblearn's SMOTE; k_neighbors is kept small because the minority classes are tiny
    smote = SMOTE(random_state=random_state, sampling_strategy=sampling_strategy, k_neighbors=k_neighbors)
    x_resampled, y_resampled = smote.fit_resample(x, y)
    return x_resampled, y_resampled


if __name__ == '__main__':
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
    columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach",
               "exang", "oldpeak", "slope", "ca", "thal", "target"]
    features, target = load_data(url, columns)
    labels, count = np.unique(target, return_counts=True)
    print('labels', labels, '  ', 'count:', count)
    sampling_strategy = {2: 55, 3: 55, 4: 55}
    smote = SMOTE(random_state=42, sampling_strategy=sampling_strategy, k_neighbors=1)
    features_resampled, target_resampled = smote.fit_resample(features, target)
    labels, count = np.unique(target_resampled, return_counts=True)
    print('labels', labels, '  ', 'count:', count)


train.py
from sklearn.model_selection import train_test_split
from load_data import load_data, smote
from utils import scaler, gridsearch, grey_wolf_optimizer
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
features, target = load_data(url, columns)
(X_train,
 X_test,
 y_train,
 y_test) = train_test_split(features, target, test_size=0.3, random_state=1, stratify=target)
labels, count = np.unique(y_train, return_counts=True)
print('labels', labels, '  ', 'count:', count)
sampling_strategy = {2: 38, 3: 38, 4: 38}
X_train, y_train = smote(X_train, y_train, sampling_strategy)
labels, count = np.unique(y_train, return_counts=True)
print('labels', labels, '  ', 'count:', count)
X_train, X_test = scaler(X_train, X_test)
pca = PCA(n_components=0.95)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_train[:, 0], X_train[:, 1], X_train[:, 2], c=y_train, cmap='viridis', alpha=0.7)

ax.set_xlabel("Principal Component 1")
ax.set_ylabel("Principal Component 2")
ax.set_zlabel("Principal Component 3")
ax.set_title("PCA 3D Scatter Plot")
plt.show()
# Arguments: lb=0, ub=100, n_wolves=10000, max_iter=100, dim=2 (C and gamma)
print(grey_wolf_optimizer(0, 100, 10000, 100, 2, X_train, y_train, X_test, y_test))