EDA and Predictive Analysis of Hotel Booking Demand Datasets 2.0

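Background: This dataset contains hotel booking records. Fields include hotel type, whether the booking was canceled, lead time, length of stay, number of guests, guest country, and more.

Goals: (1) analyze the distribution of customer characteristics; (2) analyze hotel business performance; (3) predict whether a booking will be canceled and identify the most influential factors.

Data source: https://www.kaggle.com/jessemostipak/hotel-booking-demand

The following uses Python to perform exploratory data analysis (EDA) and predictive analysis on the hotel booking data:

I. Data Preparation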

import os
import zipfile
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns 
import warnings
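# Suppress warnings
warnings.filterwarnings('ignore')

# Render CJK characters and the minus sign correctly (SimHei is only needed if labels contain Chinese)
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
# Show plots inline in the notebook
%matplotlib inline
# Set the plotting style
plt.style.use('ggplot')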

# Declare variables
dataset_path = './'    # dataset directory
zip_filename = 'archive_4.zip'     # zip file name
zip_filepath = os.path.join(dataset_path, zip_filename)    # zip file path

# Extract the dataset
with zipfile.ZipFile(zip_filepath) as zf:
    dataset_filename = zf.namelist()[0]      # dataset file name (inside the zip)
    dataset_filepath = os.path.join(dataset_path, dataset_filename)  # dataset file path
    print("Extracting zip...")
    zf.extractall(path=dataset_path)
    print("Done.")

Extracting zip...
Done.

# Load the dataset
df_data = pd.read_csv(dataset_filepath)

# Show basic information about the loaded data
print('Dataset info:')
df_data.info()

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal                            119390 non-null  object 
 13  country                         118902 non-null  object 
 14  market_segment                  119390 non-null  object 
 15  distribution_channel            119390 non-null  object 
 16  is_repeated_guest               119390 non-null  int64  
 17  previous_cancellations          119390 non-null  int64  
 18  previous_bookings_not_canceled  119390 non-null  int64  
 19  reserved_room_type              119390 non-null  object 
 20  assigned_room_type              119390 non-null  object 
 21  booking_changes                 119390 non-null  int64  
 22  deposit_type                    119390 non-null  object 
 23  agent                           103050 non-null  float64
 24  company                         6797 non-null    float64
 25  days_in_waiting_list            119390 non-null  int64  
 26  customer_type                   119390 non-null  object 
 27  adr                             119390 non-null  float64
 28  required_car_parking_spaces     119390 non-null  int64  
 29  total_of_special_requests       119390 non-null  int64  
 30  reservation_status              119390 non-null  object 
 31  reservation_status_date         119390 non-null  object 
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB

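There are 32 columns and 119,390 records. Four columns (children, country, agent, company) contain missing values.

II. Data Cleaning

# Preview the value counts of each field one at a time, e.g.:
col_data = df_data['meal'].value_counts()
plt.barh(y=col_data.index, width=col_data.values, height=0.5)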

<BarContainer object of 5 artists>

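[Figure: horizontal bar chart of value counts for the 'meal' field]

After reviewing what each field means, the 32 columns were grouped into preliminary categories and a cleaning method was planned for each, as summarized in the following table:
[Table: column categories and planned cleaning methods; provided as an image in the original post]

(1) Handling missing data

# Copy the source data
df_copy = df_data.copy()

# Show the share of missing values for each field that has any
print('Share of missing values:')
df_copy.isnull().sum()[df_copy.isnull().sum()!=0]/df_copy.shape[0]

Share of missing values: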
children    0.000034
country     0.004087
agent       0.136862
company     0.943069
dtype: float64
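
The same per-column ratios can be obtained in a single pass with `isna().mean()`. The sketch below uses a small made-up frame (`toy`) in place of `df_copy`, so the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_copy (values are made up for illustration)
toy = pd.DataFrame({
    'children': [0.0, 1.0, np.nan, 2.0],
    'country': ['PRT', None, 'GBR', None],
    'adults': [2, 2, 1, 3],
})

# isna().mean() gives the fraction of missing values per column
missing_ratio = toy.isna().mean()
# Keep only columns that actually have missing values, largest share first
missing_ratio = missing_ratio[missing_ratio > 0].sort_values(ascending=False)
print(missing_ratio)
```

Sorting the result makes it easier to decide, column by column, between dropping rows, filling values, or dropping the column.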

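# 'children' has very few missing values; drop those rows
df_copy.dropna(subset=['children'], inplace=True)

# 'country' has few missing values; drop those rows
df_copy.dropna(subset=['country'], inplace=True)

# 'agent' is missing in nearly 14% of rows; fill with 0 (treated as "no agent")
df_copy['agent'] = df_copy['agent'].fillna(0)

# 'company' is missing in about 94% of rows; drop the column
df_copy.drop(['company'], axis=1, inplace=True)

# Check whether any missing data remains
print('Number of columns with missing data:')
df_copy.isnull().sum()[df_copy.isnull().sum()!=0].count()

Number of columns with missing data: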
0

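(2) Removing duplicate records

The dataset has no primary key, so identical rows cannot be distinguished from genuinely separate bookings. Duplicates are left in place for now.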
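
Even when duplicates are not removed, it can be useful to know how many fully identical rows exist. A minimal sketch on a made-up frame (`toy` is hypothetical, not the real data):

```python
import pandas as pd

# Toy frame standing in for df_copy (values are made up for illustration)
toy = pd.DataFrame({
    'hotel': ['City Hotel', 'City Hotel', 'Resort Hotel', 'City Hotel'],
    'lead_time': [10, 10, 3, 10],
    'adr': [80.0, 80.0, 120.0, 95.0],
})

# duplicated() marks every row that repeats an earlier row on all columns
n_duplicates = int(toy.duplicated().sum())
dup_share = n_duplicates / len(toy)
print(n_duplicates, dup_share)
```

A large share of identical rows would be a reason to revisit the decision; without a booking ID, though, identical rows may still be distinct bookings.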

(3) Handling anomalous values

# Drop records with zero nights stayed
df_copy.drop(df_copy[df_copy['stays_in_weekend_nights']+df_copy['stays_in_week_nights']==0].index,inplace=True)

# Drop records with zero guests
df_copy.drop(df_copy[df_copy['adults']+df_copy['children']+df_copy['babies']==0].index,inplace=True)

# Convert 'children' to integer
df_copy['children'] = df_copy['children'].astype(int)

# Replace 'Undefined' in 'meal' with 'SC'
df_copy['meal'] = df_copy['meal'].replace("Undefined", "SC")

# Convert 'agent' to string (via int, so IDs read '9' rather than '9.0')
df_copy['agent'] = df_copy['agent'].astype(int).astype(str)

# Inspect the distribution of 'adr'
plt.boxplot(df_copy['adr'])

{'whiskers': [<matplotlib.lines.Line2D at 0x2359613ed60>,
  <matplotlib.lines.Line2D at 0x23596154070>],
 'caps': [<matplotlib.lines.Line2D at 0x23596154370>,
  <matplotlib.lines.Line2D at 0x23596154580>],
 'boxes': [<matplotlib.lines.Line2D at 0x2359613ea90>],
 'medians': [<matplotlib.lines.Line2D at 0x23596154850>],
 'fliers': [<matplotlib.lines.Line2D at 0x23596154b20>],
 'means': []}

[Figure: box plot of 'adr']

# Remove the extreme outlier in 'adr'
df_copy.drop(df_copy[df_copy['adr']>5000].index,inplace=True)
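
The cutoff of 5000 above was chosen by eye from the box plot. As an alternative (not the method used in this analysis), the box plot's own upper fence, Q3 + 1.5 × IQR, can be computed directly. The values below are made up for illustration.

```python
import pandas as pd

# Made-up daily rates with one extreme value
adr = pd.Series([50.0, 75.0, 90.0, 100.0, 110.0, 130.0, 5400.0])

# Tukey's rule: points above Q3 + 1.5 * IQR are flagged as outliers
q1, q3 = adr.quantile(0.25), adr.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

outliers = adr[adr > upper_fence]
print(upper_fence, outliers.tolist())
```

On skewed price data this rule usually flags far more points than a hand-picked threshold, so it is better suited to inspection than to automatic deletion.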

# Reset the index
df_copy.reset_index(drop=True,inplace=True)
# Re-check basic information about the cleaned data
print('Dataset info:')
df_copy.info()

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118087 entries, 0 to 118086
Data columns (total 31 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           118087 non-null  object 
 1   is_canceled                     118087 non-null  int64  
 2   lead_time                       118087 non-null  int64  
 3   arrival_date_year               118087 non-null  int64  
 4   arrival_date_month              118087 non-null  object 
 5   arrival_date_week_number        118087 non-null  int64  
 6   arrival_date_day_of_month       118087 non-null  int64  
 7   stays_in_weekend_nights         118087 non-null  int64  
 8   stays_in_week_nights            118087 non-null  int64  
 9   adults                          118087 non-null  int64  
 10  children                        118087 non-null  int32  
 11  babies                          118087 non-null  int64  
 12  meal                            118087 non-null  object 
 13  country                         118087 non-null  object 
 14  market_segment                  118087 non-null  object 
 15  distribution_channel            118087 non-null  object 
 16  is_repeated_guest               118087 non-null  int64  
 17  previous_cancellations          118087 non-null  int64  
 18  previous_bookings_not_canceled  118087 non-null  int64  
 19  reserved_room_type              118087 non-null  object 
 20  assigned_room_type              118087 non-null  object 
 21  booking_changes                 118087 non-null  int64  
 22  deposit_type                    118087 non-null  object 
 23  agent                           118087 non-null  object 
 24  days_in_waiting_list            118087 non-null  int64  
 25  customer_type                   118087 non-null  object 
 26  adr                             118087 non-null  float64
 27  required_car_parking_spaces     118087 non-null  int64  
 28  total_of_special_requests       118087 non-null  int64  
 29  reservation_status              118087 non-null  object 
 30  reservation_status_date         118087 non-null  object 
dtypes: float64(1), int32(1), int64(16), object(13)
memory usage: 27.5+ MB

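III. Data Analysis and Visualization

(1) Distribution of customer characteristics

Customer characteristics are examined from two angles, customer attributes and customer behavior, to judge how each relates to cancellation:

1. Geographic distribution of customers

# Load the world map table
# (geopandas.datasets was removed in GeoPandas 1.0; on newer versions, download the
#  Natural Earth low-resolution countries data and pass its file path to read_file)
import geopandas
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))

# Count orders per 'country' (rename_axis/reset_index(name=...) gives the same column names across pandas versions)
df_country = df_copy['country'].value_counts().rename_axis('code').reset_index(name='num_order')
# Merge the order counts into the world map table
df_country = world.merge(df_country, how="left", left_on=['iso_a3'], right_on=['code'])
df_country.head()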

	pop_est	continent	name	iso_a3	gdp_md_est	geometry	code	num_order
0	889953.0	Oceania	Fiji	FJI	5496	MULTIPOLYGON (((180.00000 -16.06713, 180.00000...	FJI	1.0
1	58005463.0	Africa	Tanzania	TZA	63177	POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...	TZA	4.0
2	603253.0	Africa	W. Sahara	ESH	907	POLYGON ((-8.66559 27.65643, -8.66512 27.58948...	NaN	NaN
3	37589262.0	North America	Canada	CAN	1736425	MULTIPOLYGON (((-122.84000 49.00000, -122.9742...	NaN	NaN
4	328239523.0	North America	United States of America	USA	21433226	MULTIPOLYGON (((-122.84000 49.00000, -120.0000...	USA	2089.0

# Overview of how order counts are distributed across countries
plt.boxplot(df_copy['country'].value_counts())

{'whiskers': [<matplotlib.lines.Line2D at 0x235966a2a90>,
  <matplotlib.lines.Line2D at 0x235966a2d90>],
 'caps': [<matplotlib.lines.Line2D at 0x235966af0a0>,
  <matplotlib.lines.Line2D at 0x235966af370>],
 'boxes': [<matplotlib.lines.Line2D at 0x235966a27c0>],
 'medians': [<matplotlib.lines.Line2D at 0x235966af640>],
 'fliers': [<matplotlib.lines.Line2D at 0x235966af910>],
 'means': []}

[Figure: box plot of order counts per country]

The box plot shows that orders are concentrated in a small number of countries, so the visualization should emphasize those.

# Visualization
fig, ax = plt.subplots(1, figsize=(25,10))
df_country.plot(column='num_order', cmap='ocean_r', linewidth=0.8, ax=ax, edgecolors='0.8',legend=True,\
                missing_kwds={'color': 'lightgrey'})
ax.set_title('Order Quantity of Customers from Different Countries', fontdict={'fontsize':30})
ax.set_axis_off()

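[Figure: world map shaded by number of orders per country]

The map shows that customers come mainly from Western European countries. Portugal accounts for the most orders, nearly half of the total, followed by the United Kingdom, France, Spain, and Germany.

# Select the five countries with the most orders
df_copy_top5 = df_copy.loc[df_copy['country'].isin(['PRT','GBR','FRA','ESP','DEU'])]

# Top five countries by order count
grouped_df_copy_top5  = df_copy_top5.pivot_table(values='hotel',index='country',columns='is_canceled',aggfunc='count')
# Visualization
grouped_df_copy_top5.plot(kind='bar',title='Top five countries by order count')

<AxesSubplot:title={'center':'Top five countries by order count'}, xlabel='country'>

[Figure: order counts for the top five countries, split by is_canceled]

The chart shows that customers from Portugal are far more likely to cancel than customers from the other countries.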
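
Reading a cancellation probability off a chart of counts is approximate. Because `is_canceled` is 0/1, its mean per group is the cancellation rate, which can be tabulated next to the order count. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy data standing in for df_copy_top5 (values are made up for illustration)
toy = pd.DataFrame({
    'country': ['PRT', 'PRT', 'PRT', 'PRT', 'GBR', 'GBR', 'FRA', 'FRA'],
    'is_canceled': [1, 1, 1, 0, 0, 1, 0, 0],
})

# Mean of a 0/1 column is the share of ones, i.e. the cancellation rate
summary = (toy.groupby('country')['is_canceled']
              .agg(cancel_rate='mean', orders='count')
              .sort_values('cancel_rate', ascending=False))
print(summary)
```

Keeping the `orders` column alongside the rate shows at a glance whether a high rate rests on many bookings or only a few.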

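2. New versus returning customers

# Order counts for new and returning customers
grouped_df_guest = df_copy.pivot_table(values='hotel',index='is_repeated_guest',columns='is_canceled',aggfunc='count')
# Visualization
grouped_df_guest.plot(kind='bar',title='Orders by new vs. returning customers')

<AxesSubplot:title={'center':'Orders by new vs. returning customers'}, xlabel='is_repeated_guest'>

[Figure: order counts for new vs. returning customers, split by is_canceled]

Most orders come from new customers, and returning customers are clearly less likely to cancel than new ones.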

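3. Customer type

# Order counts by customer type
grouped_df_ct = df_copy.pivot_table(values='hotel',index='customer_type',columns='is_canceled',aggfunc='count')
# Visualization
grouped_df_ct.plot(kind='bar',title='Orders by customer type')

<AxesSubplot:title={'center':'Orders by customer type'}, xlabel='customer_type'>

[Figure: order counts by customer type, split by is_canceled]

Most orders come from transient customers, more than from any other type. Group customers are very unlikely to cancel.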

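4. Customers' historical cancellation rate

# Copy df_copy
df_copy_pb = df_copy.copy()
# Compute the historical cancellation rate as a new column 'canceled_rate' (0/0 yields NaN, which is replaced with 0)
df_copy_pb['canceled_rate'] = (df_copy_pb['previous_cancellations']/(df_copy_pb['previous_bookings_not_canceled']\
                                                                     +df_copy_pb['previous_cancellations'])).replace(np.nan,0)

# Mean cancellation of the current booking for each historical cancellation rate
grouped_df_copy_pb = df_copy_pb.pivot_table(values='is_canceled',index='canceled_rate', aggfunc='mean')
# Visualization
sns.jointplot(x=grouped_df_copy_pb.index, y=grouped_df_copy_pb.is_canceled, kind='reg')

<seaborn.axisgrid.JointGrid at 0x235bd31e7c0>

[Figure: joint plot with regression line of canceled_rate against mean is_canceled]

The figure shows some correlation between a customer's historical cancellation rate and whether the current booking is canceled.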

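5. Lead time and confirmation time

# Cancellation rate by lead time
grouped_df_lt = df_copy.pivot_table(values='is_canceled',index='lead_time',aggfunc='mean')
# Visualization
fig, ax = plt.subplots(1, figsize=(18,6))
grouped_df_lt.plot(kind='bar', ax=ax, title='Cancellation rate by lead time')

<AxesSubplot:title={'center':'Cancellation rate by lead time'}, xlabel='lead_time'>

[Figure: cancellation rate for each lead_time value]

The figure shows that the longer the lead time, the markedly higher the probability that a booking is canceled.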
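
With one bar per distinct `lead_time` value the chart is dense, and long lead times have few bookings each. Grouping lead time into ranges with `pd.cut` gives a steadier view of the trend. The bin edges and data below are hypothetical choices for illustration.

```python
import pandas as pd

# Toy data standing in for df_copy (values are made up for illustration)
toy = pd.DataFrame({
    'lead_time': [0, 3, 20, 45, 100, 200, 365, 400],
    'is_canceled': [0, 0, 0, 1, 0, 1, 1, 1],
})

# Hypothetical bin edges in days; each bin is (left, right]
bins = [-1, 7, 30, 90, 180, 365, float('inf')]
labels = ['0-7', '8-30', '31-90', '91-180', '181-365', '>365']
toy['lead_time_bin'] = pd.cut(toy['lead_time'], bins=bins, labels=labels)

# Cancellation rate per bin (mean of the 0/1 flag)
rate_by_bin = toy.groupby('lead_time_bin', observed=False)['is_canceled'].mean()
print(rate_by_bin)
```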

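# Cancellation rate by confirmation time (days on the waiting list)
grouped_df_diwl = df_copy.pivot_table(values='is_canceled',index='days_in_waiting_list',aggfunc='mean')
# Visualization
fig, ax = plt.subplots(1, figsize=(18,6))
grouped_df_diwl.plot(kind='bar', ax=ax, title='Cancellation rate by days in waiting list')

<AxesSubplot:title={'center':'Cancellation rate by days in waiting list'}, xlabel='days_in_waiting_list'>

[Figure: cancellation rate for each days_in_waiting_list value]

The figure shows no apparent relationship between confirmation time and cancellation.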

6. Arrival time

# Check whether every year has a full set of months
df_copy.drop_duplicates(['arrival_date_year','arrival_date_month'])[['arrival_date_year','arrival_date_month']]

	arrival_date_year	arrival_date_month
0	2015	July
825	2015	August
1454	2015	September
2054	2015	October
2789	2015	November
3212	2015	December
3669	2016	January
3794	2016	February
4204	2016	March
4730	2016	April
5446	2016	May
6194	2016	June
6683	2016	July
7187	2016	August
7795	2016	September
8397	2016	October
9091	2016	November
9366	2016	December
9686	2017	January
9889	2017	February
10271	2017	March
10662	2017	April
11193	2017	May
11736	2017	June
12364	2017	July
13020	2017	August

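The table shows that arrival dates cover the second half of 2015, all of 2016, and January through August 2017.

# Copy df_copy
df_copy_t = df_copy.copy()
# Map arrival month to a season in a new field 'arrival_date_season'
# (seasons as defined here: Feb-Apr spring, May-Jul summer, Aug-Oct autumn, Nov-Jan winter)
df_copy_t['arrival_date_season'] = df_copy_t['arrival_date_month'].map({'January':'Winter','February':'Spring','March':'Spring',\
                                                                            'April':'Spring','May':'Summer','June':'Summer','July':'Summer',\
                                                                            'August':'Autumn','September':'Autumn','October':'Autumn',\
                                                                            'November':'Winter','December':'Winter'})

# Custom sort order for the seasons
from pandas.api.types import CategoricalDtype
cat_size_order = CategoricalDtype(['Spring','Summer','Autumn','Winter'], ordered=True)
df_copy_t['arrival_date_season'] = df_copy_t['arrival_date_season'].astype(cat_size_order)

# Use 2016 only, the one year with all twelve months, so partial years do not distort the comparison
df_copy_t_2016 = df_copy_t[df_copy_t['arrival_date_year']==2016]

# Order counts by season
grouped_df_copy_t_2016  = df_copy_t_2016.pivot_table(values='hotel',index='arrival_date_season',columns='is_canceled',aggfunc='count')
# Visualization
grouped_df_copy_t_2016.plot(kind='bar',title='Hotel orders by season')

<AxesSubplot:title={'center':'Hotel orders by season'}, xlabel='arrival_date_season'>

[Figure: 2016 order counts by season, split by is_canceled]

Order volume varies clearly by season. Autumn is the peak, with correspondingly more cancellations, and winter is the trough, with correspondingly fewer. Because cancellations rise and fall with total volume, no clear link between arrival season and cancellation rate is apparent.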
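
To separate volume from rate, the counts can be row-normalized so each season sums to 1. A minimal sketch with `pd.crosstab` on made-up data, where the busier season has more cancellations but the same cancellation rate:

```python
import pandas as pd

# Toy data: autumn has twice the volume of winter (values are made up for illustration)
toy = pd.DataFrame({
    'arrival_date_season': ['Autumn'] * 4 + ['Winter'] * 2,
    'is_canceled': [1, 1, 0, 0, 1, 0],
})

# Raw counts per season and outcome
counts = pd.crosstab(toy['arrival_date_season'], toy['is_canceled'])
# normalize='index' divides each row by its total, giving per-season shares
rates = pd.crosstab(toy['arrival_date_season'], toy['is_canceled'], normalize='index')
print(counts)
print(rates)
```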

7. Nights stayed

# Copy df_copy
df_copy_n = df_copy.copy()
# Create the total-nights field 'stays_in_nights'
df_copy_n['stays_in_nights'] = df_copy_n['stays_in_weekend_nights'] + df_copy_n['stays_in_week_nights']

# Order counts by nights stayed
grouped_df_copy_n  = df_copy_n.pivot_table(values='hotel',index='stays_in_nights',columns='is_canceled',aggfunc='count')
# Visualization
fig, ax = plt.subplots(1, figsize=(18,6))
grouped_df_copy_n.plot(kind='bar',ax=ax, title='Orders by nights stayed')

<AxesSubplot:title={'center':'Orders by nights stayed'}, xlabel='stays_in_nights'>

[Figure: order counts by nights stayed, split by is_canceled]

Most bookings are for 1 to 3 nights, and one-night bookings are clearly less likely to be canceled. Order counts jump unexpectedly at 7 and 14 nights, which calls for a breakdown by hotel.

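# Select bookings that were not canceled
df_copy_n_0 = df_copy_n[df_copy_n['is_canceled']==0]
# Non-canceled order counts by nights stayed, per hotel
grouped_df_copy_n_0  = df_copy_n_0.pivot_table(values='is_canceled',index='stays_in_nights',columns='hotel',aggfunc='count')
# Visualization
fig, ax = plt.subplots(1, figsize=(18,6))
grouped_df_copy_n_0.plot(kind='bar',ax=ax, title='Non-canceled orders by nights stayed')

<AxesSubplot:title={'center':'Non-canceled orders by nights stayed'}, xlabel='stays_in_nights'>

[Figure: non-canceled order counts by nights stayed, split by hotel]

The figure shows that the unusual spikes at 7 and 14 nights occur mainly at the Resort Hotel. They may be linked to resort promotions, but additional hotel data would be needed to confirm this.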

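8. Number of adults, children, and babies

# Cancellation rate by number of adults
grouped_df_a = df_copy.pivot_table(values='is_canceled',index='adults',aggfunc='mean')
# Visualization
grouped_df_a.plot(kind='bar',title='Cancellation rate by number of adults')

<AxesSubplot:title={'center':'Cancellation rate by number of adults'}, xlabel='adults'>

[Figure: cancellation rate by number of adults]

The figure shows that bookings with five or more adults have a very high cancellation rate.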

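# Cancellation rate by number of children
grouped_df_copy_c  = df_copy.pivot_table(values='is_canceled',index='children',aggfunc='mean')
# Visualization
grouped_df_copy_c.plot(kind='bar',title='Cancellation rate by number of children')

<AxesSubplot:title={'center':'Cancellation rate by number of children'}, xlabel='children'>

[Figure: cancellation rate by number of children]

The figure shows that bookings with 10 children have a very high cancellation rate.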
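
A rate near 1 for a rare value such as 10 children may rest on only a handful of bookings. Reporting the count next to the mean makes that visible. The data and the `min_orders` threshold below are hypothetical.

```python
import pandas as pd

# Toy data standing in for df_copy (values are made up for illustration)
toy = pd.DataFrame({
    'children': [0, 0, 0, 0, 1, 1, 10],
    'is_canceled': [0, 1, 0, 0, 1, 0, 1],
})

# Cancellation rate together with the number of bookings behind it
stats = toy.groupby('children')['is_canceled'].agg(['mean', 'count'])

# Hypothetical threshold: only trust rates based on at least this many bookings
min_orders = 2
reliable = stats[stats['count'] >= min_orders]
print(stats)
print(reliable)
```

The same check applies to the adults and babies charts, where extreme values are equally rare.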

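# Cancellation rate by number of babies
grouped_df_copy_b  = df_copy.pivot_table(values='is_canceled', index='babies', aggfunc='mean')
# Visualization
grouped_df_copy_b.plot(kind='bar',title='Cancellation rate by number of babies')

<AxesSubplot:title={'center':'Cancellation rate by number of babies'}, xlabel='babies'>

[Figure: cancellation rate by number of babies]

The figure shows that bookings without babies have a somewhat higher cancellation rate.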

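9. Meal plan

# Order counts by meal plan
grouped_df_meal = df_copy.pivot_table(values='hotel',index='meal',columns='is_canceled',aggfunc='count')
# Visualization
grouped_df_meal.plot(kind='bar',title='Orders by meal plan')

<AxesSubplot:title={'center':'Orders by meal plan'}, xlabel='meal'>

[Figure: order counts by meal plan, split by is_canceled]

The figure shows that full-board (FB) bookings, which include breakfast, lunch, and dinner, are more likely to be canceled.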

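10. Market segment and distribution channel

# Order counts by market segment
grouped_df_ms = df_copy.pivot_table(values='hotel',index='market_segment',columns='is_canceled',aggfunc='count')
# Visualization
grouped_df_ms.plot(kind='bar',title='Orders by market segment')

<AxesSubplot:title={'center':'Orders by market segment'}, xlabel='market_segment'>

[Figure: order counts by market segment, split by is_canceled]

The figure shows that the probability of cancellation differs considerably between market segments.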

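# Order counts by distribution channel
grouped_df_dc = df_copy.pivot_table(values='hotel',index='distribution_channel',columns='is_canceled',aggfunc='count')
# Visualization
grouped_df_dc.plot(kind='bar',title='Orders by distribution channel')

<AxesSubplot:title={'center':'Orders by distribution channel'}, xlabel='distribution_channel'>

[Figure: order counts by distribution channel, split by is_canceled]

The figure shows that, among distribution channels, bookings through 'TA/TO' are more likely to be canceled.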

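11. Room type

# Cancellation rate by reserved room type
grouped_df_rrt = df_copy.pivot_table(values='is_canceled',index='reserved_room_type',aggfunc='mean')
# Visualization
grouped_df_rrt.plot(kind='bar',title='Cancellation rate by reserved room type')

<AxesSubplot:title={'center':'Cancellation rate by reserved room type'}, xlabel='reserved_room_type'>

[Figure: cancellation rate by reserved room type]

The figure shows that cancellation rates differ only slightly between reserved room types.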

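# Copy df_copy
df_copy_rt = df_copy.copy()
# Create the field 'room_type_agreed': 1 if the assigned room type matches the reserved one, else 0
df_copy_rt['room_type_agreed'] = np.where(df_copy_rt['reserved_room_type']==df_copy_rt['assigned_room_type'],1,0)

# Order counts by whether the room types match
grouped_df_rt = df_copy_rt.pivot_table(values='hotel',index='room_type_agreed',columns='is_canceled',aggfunc='count')
# Visualization
grouped_df_rt.plot(kind='bar',title='Orders by room type match')

<AxesSubplot:title={'center':'Orders by room type match'}, xlabel='room_type_agreed'>

[Figure: order counts by room type match, split by is_canceled]

The figure shows that bookings whose assigned room type differs from the reserved one are less likely to be canceled. One possible explanation is that most mismatches are upgrades, which lowers cancellations. Another is that the hotel is in high demand and short of rooms, which makes customers less likely to cancel. Additional data would be needed to tell these apart.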

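12. Booking changes

# Cancellation rate by number of booking changes
grouped_df_bc = df_copy.pivot_table(values='is_canceled',index='booking_changes',aggfunc='mean')
# Visualization
grouped_df_bc.plot(kind='bar',title='Cancellation rate by number of booking changes')

<AxesSubplot:title={'center':'Cancellation rate by number of booking changes'}, xlabel='booking_changes'>

[Figure: cancellation rate by number of booking changes]

The figure shows that bookings with no changes, or with as many as 14 or 16 changes, have higher cancellation rates.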

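13. Deposit type

# Order counts by deposit type
grouped_df_dt = df_copy.pivot_table(values='hotel',index='deposit_type',columns='is_canceled',aggfunc='count')
# Visualization
grouped_df_dt.plot(kind='bar',title='Orders by deposit type')

<AxesSubplot:title={'center':'Orders by deposit type'}, xlabel='deposit_type'>

[Figure: order counts by deposit type, split by is_canceled]

The figure shows that most bookings have no deposit and that nearly all non-refundable bookings were canceled. Refundable bookings are too few to judge from counts, so the cancellation rate is examined next.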

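# Cancellation rate by deposit type
grouped_df_dtp = df_copy.pivot_table(values='is_canceled',index='deposit_type',aggfunc='mean')
# Visualization
grouped_df_dtp.plot(kind='bar',title='Cancellation rate by deposit type')

<AxesSubplot:title={'center':'Cancellation rate by deposit type'}, xlabel='deposit_type'>

[Figure: cancellation rate by deposit type]

Refundable and no-deposit bookings have similar cancellation rates, while the rate for non-refundable bookings is close to 1. The reasons could be explored further using customer attributes and booking characteristics.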

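14. Agents

# IDs of the ten agents with the most orders
df_copy['agent'].value_counts().head(10).index

Index(['9', '0', '240', '1', '14', '7', '6', '250', '241', '28'], dtype='object')

df_copy_a_top10 = df_copy.loc[df_copy['agent'].isin(['9', '0', '240', '1', '14', '7', '6', '250', '241', '28'])]
# Order counts by agent
grouped_df_a_top10 = df_copy_a_top10.pivot_table(values='hotel',index='agent',columns='is_canceled',aggfunc='count')
# Visualization
grouped_df_a_top10.plot(kind='bar',title='Orders by agent')

<AxesSubplot:title={'center':'Orders by agent'}, xlabel='agent'>

[Figure: order counts for the top ten agents, split by is_canceled]

The figure shows that agent 9 has the largest share of orders, followed by bookings with no agent (agent 0). Cancellation rates differ considerably between agents, and agent 1's rate is noticeably high. This should be examined further using profiles of the customers who canceled.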

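15. Number of requests

# Order counts by number of parking spaces required
grouped_df_rcps = df_copy.pivot_table(values='hotel',index='required_car_parking_spaces',columns='is_canceled',aggfunc='count')
# Visualization
grouped_df_rcps.plot(kind='bar',title='Orders by parking spaces required')

<AxesSubplot:title={'center':'Orders by parking spaces required'}, xlabel='required_car_parking_spaces'>

[Figure: order counts by parking spaces required, split by is_canceled]

The figure shows that most bookings require no parking space, and that bookings requiring one space were almost never canceled.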

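# Order counts by number of special requests
grouped_df_tosr = df_copy.pivot_table(values='hotel',index='total_of_special_requests',columns='is_canceled',aggfunc='count')
# Visualization
grouped_df_tosr.plot(kind='bar',title='Orders by number of special requests')

<AxesSubplot:title={'center':'Orders by number of special requests'}, xlabel='total_of_special_requests'>

[Figure: order counts by number of special requests, split by is_canceled]

The figure shows that bookings with special requests are less likely to be canceled.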

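16. Reservation status

# Order counts by reservation status
grouped_df_rs = df_copy.pivot_table(values='hotel',index='reservation_status',columns='is_canceled',aggfunc='count')
# Visualization
grouped_df_rs.plot(kind='bar',title='Orders by reservation status')

<AxesSubplot:title={'center':'Orders by reservation status'}, xlabel='reservation_status'>

[Figure: order counts by reservation status, split by is_canceled]

The figure shows that most canceled bookings were canceled directly by the customer. A small share are no-shows, where the customer did not check in but informed the hotel of the reason.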

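(2) Hotel business performance

# Copy df_copy
df_copy_tri = df_copy.copy()
# Create the 'total_rental_income' column: total nights multiplied by the average daily rate
df_copy_tri['total_rental_income']=(df_copy_tri['stays_in_weekend_nights']+df_copy_tri['stays_in_week_nights'])*df_copy_tri['adr']

# Exclude canceled bookings (copy, so that columns can be added later without a SettingWithCopyWarning)
df_copy_tri_0 = df_copy_tri[df_copy_tri['is_canceled']==0].copy()

1. By year

# Total revenue per hotel per year
df_copy_tri_0.pivot_table(values='total_rental_income',index='arrival_date_year',columns='hotel',aggfunc='sum')\
.plot(kind='bar',title='Total hotel revenue by year')

<AxesSubplot:title={'center':'Total hotel revenue by year'}, xlabel='arrival_date_year'>

[Figure: total revenue by year, split by hotel]

The chart shows that in the second half of 2015 the City Hotel earned less revenue than the Resort Hotel, whereas in full-year 2016 and January through August 2017 the City Hotel earned more.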

2. By month

# Create the numeric arrival month column 'arrival_date_month_code'
df_copy_tri_0['arrival_date_month_code'] = df_copy_tri_0['arrival_date_month']\
.map({'January':1,'February':2,'March':3,'April':4,'May':5,'June':6,\
      'July':7,'August':8,'September':9,'October':10,'November':11,'December':12
      })
# Select 2016, the only year with all twelve months
df_copy_tri_2016= df_copy_tri_0[df_copy_tri_0['arrival_date_year']==2016]

# Total revenue per hotel per month in 2016
df_copy_tri_2016.pivot_table(values='total_rental_income',index='arrival_date_month_code',columns='hotel',aggfunc='sum')\
.plot(kind='bar',title='Total hotel revenue by month, 2016')

<AxesSubplot:title={'center':'Total hotel revenue by month, 2016'}, xlabel='arrival_date_month_code'>

[Figure: 2016 total revenue by month, split by hotel]

From the chart:
1. The Resort Hotel out-earned the City Hotel only in July and August. In every other month it earned less.
2. August is the revenue peak, and the Resort Hotel's monthly revenue is more sharply peaked than the City Hotel's.
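
How sharply revenue peaks can be made concrete by converting each hotel's monthly revenue into a share of its own total. The sketch below uses a made-up four-month table (`monthly`), not the real figures.

```python
import pandas as pd

# Made-up monthly revenue per hotel, shaped like the pivot table above
monthly = pd.DataFrame(
    {'City Hotel': [100.0, 120.0, 130.0, 150.0],
     'Resort Hotel': [40.0, 60.0, 150.0, 250.0]},
    index=pd.Index([5, 6, 7, 8], name='arrival_date_month_code'),
)

# Each month's share of that hotel's total revenue (columns sum to 1)
share = monthly / monthly.sum()
# Largest monthly share per hotel, and the month in which it occurs
peak_share = share.max()
peak_month = share.idxmax()
print(peak_share)
print(peak_month)
```

A higher peak share means revenue is concentrated in fewer months, which is the pattern described for the Resort Hotel.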

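# Mean adr (average daily rate) per hotel per month in 2016
df_copy_tri_2016.pivot_table(values='adr',index='arrival_date_month_code',columns='hotel',aggfunc='mean').plot(title='Mean adr by month, 2016')

<AxesSubplot:title={'center':'Mean adr by month, 2016'}, xlabel='arrival_date_month_code'>

[Figure: 2016 mean adr by month, one line per hotel]

The chart shows that the Resort Hotel's room rate fluctuates more from month to month than the City Hotel's. The Resort Hotel's average daily rate is highest in August, which is why its total revenue also peaks in August.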

IV. Predicting Whether a Booking Will Be Canceled and Identifying Key Factors

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

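# Look at the class balance of bookings
# (rename_axis/reset_index(name=...) gives the same column names across pandas versions)
is_canceled = df_copy['is_canceled'].value_counts().rename_axis('status').reset_index(name='count')
plt.figure(figsize = (6,6))
plt.title('Hotel booking status\n (not canceled: 0, canceled: 1)')
sns.set_color_codes("pastel")
sns.barplot(x = 'status', y='count', data=is_canceled)

<AxesSubplot:title={'center':'Hotel booking status\n (not canceled: 0, canceled: 1)'}, xlabel='status', ylabel='count'>

[Figure: number of bookings not canceled (0) and canceled (1)]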

# Copy df_copy
df_copy2 = df_copy.copy()

# Based on the analysis above, drop columns that are irrelevant to the classification
# or that reveal the cancellation or post-arrival information
df_copy2.drop(['arrival_date_year','arrival_date_month', 'arrival_date_week_number', 'arrival_date_day_of_month',\
               'days_in_waiting_list','adr','reservation_status', 'reservation_status_date'] ,axis =1,inplace=True)

# Based on the analysis above, apply special handling to some numeric features
# Compute the historical cancellation rate as a new column 'canceled_rate'
df_copy2['canceled_rate'] = (df_copy2['previous_cancellations']/(df_copy2['previous_bookings_not_canceled']\
                             +df_copy2['previous_cancellations'])).replace(np.nan, 0)
df_copy2.drop(['previous_cancellations','previous_bookings_not_canceled'] ,axis =1,inplace=True)

# Select the numeric features
num_features = ['lead_time', 'stays_in_weekend_nights', 'stays_in_week_nights',\
                'adults', 'children', 'babies', 'is_repeated_guest', 'canceled_rate', 'booking_changes',\
                'required_car_parking_spaces','total_of_special_requests']
# Standardize the numeric features (note: the scaler is fit on the full dataset, before the train/test split)
X_num = StandardScaler().fit_transform(df_copy2[num_features])

# Based on the analysis above, apply special handling to some categorical features
# Create 'room_type_agreed': 1 if the assigned room type matches the reserved one, else 0
df_copy2['room_type_agreed'] = np.where(df_copy2['reserved_room_type']==df_copy2['assigned_room_type'],1,0)
df_copy2.drop(['assigned_room_type','reserved_room_type'] ,axis =1,inplace=True)

# Select the categorical features and one-hot encode them
cat_features = ['hotel', 'meal', 'country', 'market_segment', 'distribution_channel', \
                'room_type_agreed', 'deposit_type', 'agent', 'customer_type']
X_cat = OneHotEncoder(handle_unknown='ignore').fit_transform(df_copy2[cat_features]).toarray()

# Build the feature matrix and target
X = np.concatenate((X_num, X_cat),axis=1)
y = df_copy2['is_canceled'].values

# Hold out 30% as the test set; the rest is the training set
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.30, stratify = y, random_state = 2)
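
In the code above, the scaler and encoder are fit on all rows before the split, so the test set influences the preprocessing. One common alternative (a sketch, not what this analysis did) is to wrap preprocessing and model in a scikit-learn `Pipeline` so both are fit on the training rows only. The data here is synthetic and the two columns are a small stand-in for the full feature set.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for df_copy2 (values are made up for illustration)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'lead_time': rng.integers(0, 300, n),
    'deposit_type': rng.choice(['No Deposit', 'Non Refund'], n),
})
# Synthetic target: long lead times or non-refundable deposits are marked canceled
toy['is_canceled'] = ((toy['lead_time'] > 150) | (toy['deposit_type'] == 'Non Refund')).astype(int)

# Scale numeric columns and one-hot encode categorical columns in one step
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['lead_time']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['deposit_type']),
])
model = Pipeline([('prep', preprocess), ('clf', LogisticRegression(max_iter=1000))])

# Split the raw frame first; preprocessing is learned inside fit() from training rows only
X_train, X_test, y_train, y_test = train_test_split(
    toy[['lead_time', 'deposit_type']], toy['is_canceled'],
    test_size=0.3, stratify=toy['is_canceled'], random_state=2)

model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(accuracy)
```

The same pipeline object can be passed to `cross_val_score` or `GridSearchCV`, which then refit the preprocessing within each fold.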

# Build the classifiers
classifiers = [
    LogisticRegression(),
    DecisionTreeClassifier(criterion='gini'),
    RandomForestClassifier(random_state = 2, criterion = 'gini'),
    KNeighborsClassifier(metric = 'minkowski'),
    AdaBoostClassifier()
]

# Classifier names
classifier_names = [
            'LogisticRegression',
            'DecisionTreeClassifier',
            'RandomForestClassifier',
            'KNeighborsClassifier',
            'AdaBoostClassifier'
]

# Print evaluation metrics from a 2x2 confusion matrix (rows: actual, columns: predicted)
def show_metrics(cm):
    tp = cm[1,1]
    fn = cm[1,0]
    fp = cm[0,1]
    tn = cm[0,0]
    precision = tp/(tp+fp)
    recall = tp/(tp+fn)
    print('Precision: {:.3f}'.format(precision))
    print('Recall: {:.3f}'.format(recall))
    print('F1: {:.3f}'.format(2*precision*recall/(precision+recall)))

# Predict cancellations with each classifier
for model, model_name in zip(classifiers, classifier_names):
    clf = model
    clf.fit(train_x, train_y)
    predict_y = clf.predict(test_x)
    # Compute the confusion matrix
    cm = confusion_matrix(test_y, predict_y)
    # Print the evaluation scores
    print(model_name+":")
    show_metrics(cm)

LogisticRegression:
Precision: 0.817
Recall: 0.717
F1: 0.764
DecisionTreeClassifier:
Precision: 0.805
Recall: 0.806
F1: 0.806
RandomForestClassifier:
Precision: 0.853
Recall: 0.808
F1: 0.830
KNeighborsClassifier:
Precision: 0.802
Recall: 0.791
F1: 0.796
AdaBoostClassifier:
Precision: 0.822
Recall: 0.719
F1: 0.767

Of the five classifiers, the random forest performs best overall. Its precision of 0.853, recall of 0.808, and F1 of 0.830 are all the highest, so it both predicts canceled bookings most accurately and captures the largest share of them.
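
Precision, recall, and F1 computed by hand from the confusion matrix can be cross-checked against scikit-learn's `precision_recall_fscore_support`. A minimal sketch on made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Made-up true and predicted labels (1 = canceled)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)

# average='binary' reports the metrics for the positive class (label 1)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='binary')
print(manual_precision, manual_recall, precision, recall, f1)
```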

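# Tune the random forest: plot a learning curve over n_estimators in steps of 10 to find a rough range
# (the range starts at 10 because n_estimators must be at least 1)
n_range = range(10, 101, 10)
rfc_s = []
for i in n_range:
    rfc = RandomForestClassifier(random_state = 2, criterion = 'gini', n_estimators=i)
    cvs = cross_val_score(rfc, train_x, train_y, cv=3).mean()
    rfc_s.append(cvs)
plt.plot(n_range, rfc_s, label='Random forest')
plt.legend()

<matplotlib.legend.Legend at 0x236031d85e0>

[Figure: mean 3-fold cross-validation score against n_estimators]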

# Narrow the parameter range based on the curve above
parameters = {'n_estimators':[40,45,50,55,60]}
# Tune with GridSearchCV, scoring by F1
clf = GridSearchCV(estimator=RandomForestClassifier(random_state = 2, criterion = 'gini'), param_grid=parameters, scoring='f1')
# Fit on the training set
clf.fit(train_x, train_y)
print("Best score: %.3f" % clf.best_score_)
print("Best parameters:", clf.best_params_)

Best score: 0.825
Best parameters: {'n_estimators': 55}

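# Build a random forest with n_estimators=55
rf = RandomForestClassifier(random_state = 2, criterion = 'gini', n_estimators=55)
rf.fit(train_x, train_y)
# Feature importances
importances = rf.feature_importances_
# Feature labels in the same column order as X: numeric features first, then the one-hot columns.
# The original line referenced an undefined df_copy2_dum; the listing that follows was produced with it,
# so its categorical labels may be offset from their columns.
# Refitting the encoder reproduces the column order of X_cat (get_feature_names_out needs scikit-learn >= 1.0).
cat_labels = OneHotEncoder(handle_unknown='ignore').fit(df_copy2[cat_features]).get_feature_names_out(cat_features)
feat_labels = np.concatenate([num_features, cat_labels])
# Indices sorted by descending importance
indices = np.argsort(importances)[::-1]
# Print the ten most important features ('-' left-aligns and pads with spaces; '*' takes the width from the arguments)
for f in range(10):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))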

 1) lead_time                      0.219403
 2) country_PRI                    0.067354
 3) total_of_special_requests      0.065522
 4) deposit_type_Non Refund        0.065501
 5) deposit_type_Refundable        0.060866
 6) stays_in_week_nights           0.059261
 7) stays_in_weekend_nights        0.034379
 8) canceled_rate                  0.029086
 9) booking_changes                0.023928
10) required_car_parking_spaces    0.023598

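The ten features above have a relatively strong influence on whether a booking is canceled and deserve close attention: lead time, whether the customer is from Portugal, number of special requests, deposit type, nights stayed (weekday and weekend), the customer's historical cancellation rate, number of booking changes, and number of parking spaces required. The categorical labels in the listing should be read with care, because the label list was built separately from the feature matrix X and one-hot labels can end up offset from their columns. The entry 'country_PRI' most likely corresponds to 'country_PRT' (Portugal), which matches the geographic analysis above, and the two deposit_type entries may likewise be shifted by one category.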
