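[Data Mining in Practice] House Price Prediction - EW帮帮网

This post works through Kaggle's introductory house price regression dataset and builds a model to predict house prices.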

Author's homepage: 机器学习司猫白

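OK, without further ado, let's get started.

Overview

The competition provides 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa. The task is to predict the final sale price of each home.

Dataset description

The dataset has already been uploaded, so you can download it and try this yourself.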

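File descriptions

train.csv - the training set

test.csv - the test set

data_description.txt - a full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here

sample_submission.csv - a benchmark submission from a linear regression on year and month of sale, lot square footage, and number of bedrooms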

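Modeling approach

The goal is to predict house prices, which is clearly a regression problem. The plan is to try a linear regression model and a tree-based regression model, tune their hyperparameters, pick the better of the two, and output a predicted price for every house.

Python source code

Part 1: Open the data files and inspect the basic structure of the data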

import numpy as np 
import pandas as pd 

train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
train_data.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallCond    1460 non-null   int64  
 19  YearBuilt      1460 non-null   int64  
 20  YearRemodAdd   1460 non-null   int64  
 21  RoofStyle      1460 non-null   object 
 22  RoofMatl       1460 non-null   object 
 23  Exterior1st    1460 non-null   object 
 24  Exterior2nd    1460 non-null   object 
 25  MasVnrType     588 non-null    object 
 26  MasVnrArea     1452 non-null   float64
 27  ExterQual      1460 non-null   object 
 28  ExterCond      1460 non-null   object 
 29  Foundation     1460 non-null   object 
 30  BsmtQual       1423 non-null   object 
 31  BsmtCond       1423 non-null   object 
 32  BsmtExposure   1422 non-null   object 
 33  BsmtFinType1   1423 non-null   object 
 34  BsmtFinSF1     1460 non-null   int64  
 35  BsmtFinType2   1422 non-null   object 
 36  BsmtFinSF2     1460 non-null   int64  
 37  BsmtUnfSF      1460 non-null   int64  
 38  TotalBsmtSF    1460 non-null   int64  
 39  Heating        1460 non-null   object 
 40  HeatingQC      1460 non-null   object 
 41  CentralAir     1460 non-null   object 
 42  Electrical     1459 non-null   object 
 43  1stFlrSF       1460 non-null   int64  
 44  2ndFlrSF       1460 non-null   int64  
 45  LowQualFinSF   1460 non-null   int64  
 46  GrLivArea      1460 non-null   int64  
 47  BsmtFullBath   1460 non-null   int64  
 48  BsmtHalfBath   1460 non-null   int64  
 49  FullBath       1460 non-null   int64  
 50  HalfBath       1460 non-null   int64  
 51  BedroomAbvGr   1460 non-null   int64  
 52  KitchenAbvGr   1460 non-null   int64  
 53  KitchenQual    1460 non-null   object 
 54  TotRmsAbvGrd   1460 non-null   int64  
 55  Functional     1460 non-null   object 
 56  Fireplaces     1460 non-null   int64  
 57  FireplaceQu    770 non-null    object 
 58  GarageType     1379 non-null   object 
 59  GarageYrBlt    1379 non-null   float64
 60  GarageFinish   1379 non-null   object 
 61  GarageCars     1460 non-null   int64  
 62  GarageArea     1460 non-null   int64  
 63  GarageQual     1379 non-null   object 
 64  GarageCond     1379 non-null   object 
 65  PavedDrive     1460 non-null   object 
 66  WoodDeckSF     1460 non-null   int64  
 67  OpenPorchSF    1460 non-null   int64  
 68  EnclosedPorch  1460 non-null   int64  
 69  3SsnPorch      1460 non-null   int64  
 70  ScreenPorch    1460 non-null   int64  
 71  PoolArea       1460 non-null   int64  
 72  PoolQC         7 non-null      object 
 73  Fence          281 non-null    object 
 74  MiscFeature    54 non-null     object 
 75  MiscVal        1460 non-null   int64  
 76  MoSold         1460 non-null   int64  
 77  YrSold         1460 non-null   int64  
 78  SaleType       1460 non-null   object 
 79  SaleCondition  1460 non-null   object 
 80  SalePrice      1460 non-null   int64  
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB

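The output shows that the dataset contains missing values. Left untreated, missing values affect the later modeling steps and can even make a model raise an error. One specific case is worth spelling out. Suppose the missing values sit in an object-type feature. Categorical data like this is usually converted to numbers with one-hot encoding. If a column with missing values is one-hot encoded directly, the encoder may, depending on its settings, generate an extra column that represents the missing values (a NaN indicator column). The training data and the data later used for prediction can then end up with different numbers of columns, and the model cannot be used for prediction.

In addition, many models are sensitive to NaN values, because NaN marks missing data rather than a number. Such models raise an error when they meet NaN during training, since they cannot handle non-numeric input. Missing values therefore need to be handled before modeling, so that the data is consistent and the model can train correctly. Common approaches are to fill the missing values (with the mean, median, or mode) or to drop the rows or columns that contain them.

Dimension consistency: the feature dimensions of the training data and the prediction data must match exactly; otherwise the model cannot be applied to new data.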
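To make the dimension problem concrete, here is a minimal sketch on toy data. The two small frames are hypothetical and are not taken from the competition files; only the column name Street is borrowed from the dataset. It shows how encoding the training and test sets separately can yield different columns, and one possible way to realign them with `reindex`.

```python
import pandas as pd

# Hypothetical toy frames: the test set has no "Pave" row and contains a missing value.
train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
test = pd.DataFrame({"Street": ["Grvl", None]})

# One-hot encode each frame on its own; the resulting column sets can differ.
train_ohe = pd.get_dummies(train, columns=["Street"])
test_ohe = pd.get_dummies(test, columns=["Street"])
print(list(train_ohe.columns))
print(list(test_ohe.columns))

# Force the test frame into the training layout.
# Categories that never occur in the test set become all-zero columns.
test_aligned = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)
print(test_aligned.shape == (len(test), train_ohe.shape[1]))
```

This article takes a different route (filling missing values, then label encoding), so the sketch only illustrates why the column layout has to be checked.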

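Part 2: Data processing and feature engineering

# Count the missing values in each column
missing_values = train_data.isnull().sum()
total_values = train_data.shape[0]  # total number of rows

# Percentage of missing values in each column
missing_percentage = (missing_values / total_values) * 100

# Keep the features with more than 50% missing values
high_missing_features = missing_percentage[missing_percentage > 50]

# Display the features with more than 50% missing values
high_missing_features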

Output:

Alley          93.767123
MasVnrType     59.726027
PoolQC         99.520548
Fence          80.753425
MiscFeature    96.301370
dtype: float64

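This computes the percentage of missing values in each column and lists the columns where it exceeds 50%.

train_data2 = train_data.drop(['MiscFeature', 'Fence', 'PoolQC',  'MasVnrType', 'Alley','Id'], axis=1)
test_data2 = test_data.drop(['MiscFeature', 'Fence', 'PoolQC', 'MasVnrType', 'Alley','Id'], axis=1)
test_ids = test_data['Id']  # Ids of the test rows
train_data2.shape, test_data2.shape

Columns with too many missing values are dropped, along with the Id column, which is only a row identifier. The remaining columns are filled in.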

# Fill missing values in the test set
for column in test_data2.columns:
    if test_data2[column].dtype == 'object':
        # Object type: fill with the mode of the training set
        test_data2[column] = test_data2[column].fillna(train_data2[column].mode()[0])
    else:
        # Numeric type: fill with the median of the training set
        test_data2[column] = test_data2[column].fillna(train_data2[column].median())

# Fill missing values in the training set
for column in train_data2.columns:
    if train_data2[column].dtype == 'object':
        # Object type: fill with the mode of the training set
        train_data2[column] = train_data2[column].fillna(train_data2[column].mode()[0])
    else:
        # Numeric type: fill with the median of the training set
        train_data2[column] = train_data2[column].fillna(train_data2[column].median())

# Check the shapes of the training and test sets after filling
print(train_data2.shape)
print(test_data2.shape)

Output:

(1460, 75)
(1459, 74)
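The filling step takes its statistics from the training set only, so nothing from the test set leaks into preprocessing. Below is a minimal sketch of the same idea as a reusable function. The name `fill_from_train` and the toy frames are hypothetical, and the function assumes every test column also exists in the training frame.

```python
import numpy as np
import pandas as pd

def fill_from_train(train: pd.DataFrame, test: pd.DataFrame):
    """Fill missing values in both frames using statistics computed on `train` only."""
    train_filled, test_filled = train.copy(), test.copy()
    for column in test.columns:
        if pd.api.types.is_numeric_dtype(train[column]):
            fill_value = train[column].median()   # numeric: training median
        else:
            fill_value = train[column].mode()[0]  # categorical: training mode
        train_filled[column] = train[column].fillna(fill_value)
        test_filled[column] = test[column].fillna(fill_value)
    return train_filled, test_filled

# Hypothetical toy data with gaps in both frames.
train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0],
                      "GarageType": ["Attchd", "Attchd", None]})
test = pd.DataFrame({"LotFrontage": [np.nan, 70.0],
                     "GarageType": [None, "Detchd"]})

train_filled, test_filled = fill_from_train(train, test)
print(test_filled)
```

Returning filled copies leaves the input frames untouched, which makes the step easy to rerun.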

With the missing values handled, the next step is to separate the target variable from the features.

train_data3=train_data2.drop(['SalePrice'], axis=1)

label=train_data2['SalePrice']
train_data3.shape

Output:

(1460, 74)

That is still a lot of features, so let's try using correlation to remove some of them.

import seaborn as sns
import matplotlib.pyplot as plt

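# Select all numeric columns
numerical_data = train_data3.select_dtypes(include=['number'])

# Compute the correlation matrix
correlation_matrix = numerical_data.corr()

# Set the figure size
plt.figure(figsize=(15, 8))

# Draw the heatmap with seaborn
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.1f', linewidths=0.5)

# Set the title
plt.title('Correlation Heatmap of Numerical Features')

# Show the heatmap
plt.show()

The heatmap is hard to read, so we simply apply a threshold: for every pair of features whose absolute correlation exceeds 0.8, one of the two is dropped.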

# Correlation threshold
threshold = 0.8

# Find the column pairs whose absolute correlation exceeds the threshold
to_drop = set()  # columns to drop
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > threshold:
            colname = correlation_matrix.columns[i]
            # Mark the earlier column of the pair, unless the later one is already marked
            if colname not in to_drop:
                to_drop.add(correlation_matrix.columns[j])

list(to_drop)

Output:

['GarageCars', 'GrLivArea', 'TotalBsmtSF']

# Drop the highly correlated columns
train_data4 = train_data3.drop(columns=to_drop)
test_data4 = test_data2.drop(columns=to_drop)

print(train_data4.shape)
print(test_data4.shape)
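To see which column of a correlated pair the loop removes, here is a minimal sketch that wraps the same rule in a function and runs it on synthetic data. The function name `columns_to_drop` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

def columns_to_drop(frame: pd.DataFrame, threshold: float = 0.8) -> list:
    """For each pair with |correlation| > threshold, mark the earlier column for removal."""
    corr = frame.corr().abs()
    to_drop = set()
    for i in range(len(corr.columns)):
        for j in range(i):
            # Skip the pair if the later column (i) is already marked for removal.
            if corr.iloc[i, j] > threshold and corr.columns[i] not in to_drop:
                to_drop.add(corr.columns[j])
    return sorted(to_drop)

rng = np.random.default_rng(0)
area = rng.normal(size=200)
toy = pd.DataFrame({
    "area": area,
    # Nearly a rescaled copy of "area", so the two are almost perfectly correlated.
    "area_copy": area * 2.0 + rng.normal(scale=0.01, size=200),
    # Independent noise, unrelated to the other two columns.
    "noise": rng.normal(size=200),
})

dropped = columns_to_drop(toy)
print(dropped)
```

Of the two correlated columns, the one that appears first in the frame is marked, which matches how the loop in this article ends up dropping GarageCars, GrLivArea, and TotalBsmtSF rather than their later partners.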

from sklearn.preprocessing import LabelEncoder

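# Encode every categorical feature
for column in train_data4.select_dtypes(include=['object']).columns:
    # Fit the encoder on the categories of both the training and test sets,
    # so that it knows every possible category
    all_categories = pd.concat([train_data4[column], test_data4[column]]).unique()
    encoder = LabelEncoder()
    encoder.fit(all_categories)

    # Encode the training and test sets with the same encoder
    train_data4[column] = encoder.transform(train_data4[column])
    test_data4[column] = encoder.transform(test_data4[column])

# Check the shapes of the training and test sets after encoding
print(train_data4.shape)
print(test_data4.shape)

This encodes the object-type columns as numbers. As for why label encoding is used here, I will cover that in a later post on feature encoding, so I won't go into it now.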

Part 3: Model training and evaluation

First, let's try ridge regression, a regularized linear regression, and see how it performs.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split


X = train_data4  # features
y = label  # target variable

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the ridge regression model
ridge_model = Ridge()

# Hyperparameter grid: tune alpha (the regularization strength) over 13 values from 1e-6 to 1e6
param_grid = {'alpha': np.logspace(-6, 6, 13)}

# Choose the best alpha with 5-fold cross-validation, scored by negative mean squared error
grid_search = GridSearchCV(ridge_model, param_grid, cv=5, scoring='neg_mean_squared_error')

# Fit the model
grid_search.fit(X_train, y_train)

# Print the best parameter
print("Best alpha parameter:", grid_search.best_params_)

# Get the best model
best_ridge_model = grid_search.best_estimator_

# Evaluate the best model on the validation set
score = best_ridge_model.score(X_val, y_val)
print("Model R^2 score on validation set:", score)

# Print the best cross-validation score
print("Best cross-validation score:", grid_search.best_score_)

Output:

Best alpha parameter: {'alpha': 100.0}
Model R^2 score on validation set: 0.8496053872702527
Best cross-validation score: -1348455440.2012005
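The best cross-validation score is negative because scikit-learn's 'neg_mean_squared_error' scorer returns the negated MSE, so that a higher score is always better. To read it on the same scale as an RMSE, flip the sign and take the square root. A small worked example using the value printed above:

```python
import math

best_cv_score = -1348455440.2012005  # value printed by grid_search.best_score_ above
cv_mse = -best_cv_score              # flip the sign to recover the mean squared error
cv_rmse = math.sqrt(cv_mse)          # the square root brings the error back to price units

print(f"Cross-validated RMSE: {cv_rmse:.2f}")
```

This is a cross-validated figure on the training folds, so it is only roughly comparable to an RMSE measured on the held-out validation set.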

Next, let's try LightGBM, a tree-based model, and see how it performs.

import optuna
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

X = train_data4  # features
y = label  # target variable

# Split the data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

def objective(trial):
    # Let Optuna pick the hyperparameters
    params = {
        'objective': 'regression',  # regression task
        'boosting_type': 'gbdt',  # gradient boosted decision trees
        'num_leaves': trial.suggest_int('num_leaves', 20, 100),  # maximum number of leaves per tree
        'learning_rate': trial.suggest_float('learning_rate', 1e-5, 1e-1, log=True),  # learning rate, sampled log-uniformly
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),  # number of trees
        'max_depth': trial.suggest_int('max_depth', 3, 15),  # maximum tree depth
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),  # row sampling ratio (only takes effect when subsample_freq > 0)
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),  # feature sampling ratio
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),  # minimum number of samples per leaf
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-5, 1.0, log=True),  # L1 regularization
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-5, 1.0, log=True)  # L2 regularization
    }

    # Create the LightGBM model
    model = lgb.LGBMRegressor(**params, verbose=-1)

    # Train the model
    model.fit(X_train, y_train)

    # Predict on the validation set
    y_pred = model.predict(X_valid)

    # Compute the RMSE (root mean squared error)
    rmse = np.sqrt(mean_squared_error(y_valid, y_pred))

    return rmse  # Optuna searches for the hyperparameters that minimize this RMSE

# Create the Optuna Study object
study = optuna.create_study(direction='minimize')  # minimize the RMSE

# Run the hyperparameter search
study.optimize(objective, n_trials=50)  # 50 trials

# Print the best hyperparameters and the corresponding RMSE
print(f"Best trial: {study.best_trial.params}")
print(f"Best RMSE: {study.best_value}")

# Train the final model with the best hyperparameters.
# best_trial.params holds only the suggested values; objective and boosting_type
# fall back to LightGBM's defaults (regression, gbdt).
best_params = study.best_trial.params
final_model = lgb.LGBMRegressor(**best_params, verbose=-1)

# Fit the final model
final_model.fit(X_train, y_train)

# Predict on the validation set and compute the RMSE and R2
y_pred_final = final_model.predict(X_valid)
final_rmse = np.sqrt(mean_squared_error(y_valid, y_pred_final))
final_r2 = r2_score(y_valid, y_pred_final)

print(f"Final RMSE on validation set: {final_rmse}")
print(f"Final R2 on validation set: {final_r2}")

Output:

Best trial: {'num_leaves': 97, 'learning_rate': 0.013163137448188754, 'n_estimators': 372, 'max_depth': 11, 'subsample': 0.8474988867349187, 'colsample_bytree': 0.7064845955811748, 'min_child_samples': 5, 'reg_alpha': 0.0011685340064003379, 'reg_lambda': 0.041584313394230084}
Best RMSE: 26248.97344413891
Final RMSE on validation set: 26248.97344413891
Final R2 on validation set: 0.910172189779164

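These results suggest that the LightGBM model performs better: its validation R² is about 0.910, against about 0.850 for ridge regression. A note on evaluating regression models, using the RMSE here as an example. A smaller RMSE is better, but how small is good and how large is bad? RMSE is best treated as a relative metric. If one run gives an RMSE of 1000 and the next gives 998, the second run is the better one. There is no absolute value to judge against, only comparisons between runs on the same data.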
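As a small illustration of RMSE as a relative metric, the sketch below computes it for two hypothetical runs against the same made-up prices. The helper `rmse` and all numbers are illustrative and are not taken from the competition data.

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root mean squared error, expressed in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = [100_000, 200_000, 300_000]  # hypothetical sale prices
run_a = [110_000, 190_000, 310_000]   # every prediction is off by 10,000
run_b = [105_000, 195_000, 305_000]   # every prediction is off by 5,000

rmse_a = rmse(y_true, run_a)
rmse_b = rmse(y_true, run_b)
print(rmse_a, rmse_b)

# Dividing by the mean price gives a rough sense of scale for a single run.
print(rmse_b / np.mean(y_true))
```

Run B is better than run A because its RMSE is lower on the same data. Because RMSE is in price units, dividing it by a typical price is one way to judge whether a given value is large or small.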

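1. Optuna is used to optimize the hyperparameters of the LightGBM regression model, with the goal of finding the parameter combination that minimizes the RMSE.

2. The tuned hyperparameters include the tree depth, the number of leaves, the learning rate, and others.

3. Finally, a regression model is trained with the best hyperparameters and evaluated, and its RMSE and R² on the validation set are computed.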

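Because the dataset is small, overfitting is a real risk, so L1 and L2 regularization were included in the hyperparameter search. Note that the two RMSE values printed above are both computed on the validation set; they match because the final model is refit with the same parameters on the same training data. On their own they do not show whether the model overfits. To check that, compare the RMSE on the training set with the RMSE on the validation set: if the two are close, the model is not overfitting badly.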
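One way to check for overfitting is to compute the RMSE on both the training set and the validation set and compare them. A minimal sketch follows: the helper `train_valid_rmse` and the stand-in `MeanModel` class are hypothetical, and any fitted model with a `predict` method, such as `final_model`, could be passed in instead.

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def train_valid_rmse(model, X_train, y_train, X_valid, y_valid):
    """Return (training RMSE, validation RMSE) for a fitted model with .predict()."""
    return (rmse(y_train, model.predict(X_train)),
            rmse(y_valid, model.predict(X_valid)))

class MeanModel:
    """Hypothetical stand-in for a real model: always predicts the training mean."""
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

# Tiny made-up data, just to exercise the helper.
X_train, y_train = np.zeros((4, 1)), np.array([1.0, 2.0, 3.0, 4.0])
X_valid, y_valid = np.zeros((2, 1)), np.array([2.0, 3.0])

model = MeanModel().fit(X_train, y_train)
train_rmse, valid_rmse = train_valid_rmse(model, X_train, y_train, X_valid, y_valid)
print(train_rmse, valid_rmse)
```

With the objects from this article the call would be `train_valid_rmse(final_model, X_train, y_train, X_valid, y_valid)`. A training RMSE far below the validation RMSE would point to overfitting.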

Part 4: Run the model on the real data and predict house prices

y_pred_test = final_model.predict(test_data4)
# Put the predictions into a DataFrame
y_pred_df = pd.DataFrame({
    'Id': test_data['Id'],
    'SalePrice': y_pred_test
})

# Save the predictions to a CSV file
y_pred_df.to_csv('predictions.csv', index=False)
y_pred_df

Output:

        Id      SalePrice
0     1461  128989.106316
1     1462  155402.491796
2     1463  173423.163568
3     1464  184025.799434
4     1465  200870.139148
...    ...            ...
1454  2915   84714.331635
1455  2916   89781.868635
1456  2917  171236.073006
1457  2918  121141.145259
1458  2919  220957.998442

1459 rows × 2 columns

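The model's predictions are now saved to a CSV file.

Part 5: Feature importance

import matplotlib.pyplot as plt
# Plot the feature importance
lgb.plot_importance(final_model, importance_type='split', max_num_features=10, figsize=(10, 6))
plt.title('Feature Importance (Split)')
plt.show()

According to the feature importance plot, the feature the model relies on most is 1stFlrSF, the first-floor area of the house. Note that importance_type='split' counts how often a feature is used in splits, so this ranks how heavily the model uses each feature rather than measuring its effect on price directly.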
