Pandas for Data Analysis: Fundamentals, Part 2


1 The SAC Grouping Pattern and the groupby Object

  • SAC refers to the split-apply-combine process behind grouping operations (a minimal sketch follows this list), where
    • split partitions the data into groups according to some rule
    • apply runs a function on each group independently
    • combine assembles the per-group results into some data structure
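To make the three steps concrete, here is a minimal sketch (the key/value column names are illustrative) showing that a hand-written split-apply-combine matches what groupby does in one call:

import pandas as pd

demo = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'value': [1, 2, 3, 4]})

# split: partition the rows by key
pieces = {k: g for k, g in demo.groupby('key')}
# apply: run a function on each group independently
means = {k: g['value'].mean() for k, g in pieces.items()}
# combine: assemble the per-group results into a Series
result = pd.Series(means)
# equivalent one-liner: demo.groupby('key')['value'].mean()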

1.1 The groupby object

  • Grouping students
import pandas as pd
df = pd.read_csv('./data/learn_pandas.csv')
df.head(4)
School Grade Name Gender Height Weight Transfer Test_Number Test_Date Time_Record
0 Shanghai Jiao Tong University Freshman Gaopeng Yang Female 158.9 46.0 N 1 2019/10/5 0:04:34
1 Peking University Freshman Changqiang You Male 166.5 70.0 N 1 2019/9/4 0:04:20
2 Shanghai Jiao Tong University Senior Mei Sun Male 188.9 89.0 N 2 2019/9/12 0:05:22
3 Fudan University Sophomore Xiaojuan Sun Female NaN 41.0 N 2 2020/1/3 0:04:08
df.groupby(['School','Gender'])['Height'].mean()
School                         Gender
Fudan University               Female    158.776923
                               Male      174.212500
Peking University              Female    158.666667
                               Male      172.030000
Shanghai Jiao Tong University  Female    159.122500
                               Male      176.760000
Tsinghua University            Female    159.753333
                               Male      171.638889
Name: Height, dtype: float64
df.groupby(['School','Grade','Gender'])['Gender'].count()
# count the number of male and female students by school and grade
School                         Grade      Gender
Fudan University               Freshman   Female     8
                                          Male       1
                               Junior     Female     9
                                          Male       3
                               Senior     Female     9
                                          Male       2
                               Sophomore  Female     4
                                          Male       4
Peking University              Freshman   Female     7
                                          Male       6
                               Junior     Female     6
                                          Male       2
                               Senior     Female     5
                                          Male       3
                               Sophomore  Female     4
                                          Male       1
Shanghai Jiao Tong University  Freshman   Female     8
                                          Male       5
                               Junior     Female    15
                                          Male       2
                               Senior     Female    14
                                          Male       8
                               Sophomore  Female     4
                                          Male       1
Tsinghua University            Freshman   Female    17
                               Junior     Female    13
                                          Male       9
                               Senior     Female    10
                                          Male       4
                               Sophomore  Female     8
                                          Male       8
Name: Gender, dtype: int64
# group by whether a student's height exceeds the overall mean,
# then compute the mean weight for each group
condition = df['Height'] > df['Height'].mean()
# True: height above the mean; False: otherwise
df.groupby(condition)['Weight'].mean()
Height
False    47.672414
True     66.684932
Name: Weight, dtype: float64
df.groupby(['School', 'Gender']).agg(
    {
        'Height': [
            ('height_max', 'max'),
            ('height_min', 'min'),
            ('height_mean', 'mean')
        ],
        'Weight': [
            ('weight_ptp', lambda x: x.max() - x.min()),
            ('weight_median', 'median')
        ]
    }
)
                                     Height                              Weight
                                 height_max height_min height_mean  weight_ptp weight_median
School                        Gender
Fudan University              Female      170.2      147.3  158.776923        29.0          47.5
                              Male        177.3      167.6  174.212500        19.0          73.5
Peking University             Female      170.0      147.8  158.666667        22.0          46.5
                              Male        185.3      162.4  172.030000        29.0          73.5
Shanghai Jiao Tong University Female      167.7      145.4  159.122500        23.0          49.0
                              Male        188.9      166.0  176.760000        27.0          75.0
Tsinghua University           Female      168.9      150.5  159.753333        21.0          49.0
                              Male        193.9      155.7  171.638889        28.0          71.0
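Since pandas 0.25 the same renaming can also be written with named aggregation, passing keyword=(column, function) pairs; this produces flat output column names instead of a two-level header:

df.groupby(['School', 'Gender']).agg(
    height_max=('Height', 'max'),
    height_min=('Height', 'min'),
    height_mean=('Height', 'mean'),
    weight_ptp=('Weight', lambda x: x.max() - x.min()),
    weight_median=('Weight', 'median'),
)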
  • For each aqiInfo category in the Beijing weather data, compute: the max, min, and mean of the daytime high temperature; the max, min, and mean of the nighttime low temperature; the number of records per category; the min, max, and mean AQI; and the number of days with AQI above 50
import pandas as pd
import numpy as np
data = pd.read_csv('./data/beijing_tianqi_2018.csv')
data.head(4)
          ymd bWendu yWendu  tianqi fengxiang fengli  aqi aqiInfo  aqiLevel
0  2018-01-01     3℃    -6℃  晴~多云      东北风   1-2级   59      良         2
1  2018-01-02     2℃    -5℃  阴~多云      东北风   1-2级   49      优         1
2  2018-01-03     2℃    -5℃    多云       北风   1-2级   28      优         1
3  2018-01-04     0℃    -8℃             东北风   1-2级   28      优         1
data['bWendu'] = data['bWendu'].str.replace('℃', '').astype(int)
data['yWendu'] = data['yWendu'].str.replace('℃', '').astype(int)
df = data.groupby('aqiInfo').agg(
    {
        'bWendu': [
            ('bwendu_max', 'max'),
            ('bwendu_min', 'min'),
            ('bwendu_mean', 'mean')
        ],
        'yWendu': [
            ('ywendu_max', 'max'),
            ('ywendu_min', 'min'),
            ('ywendu_mean', 'mean')
        ],
        'aqiInfo': [
            ('aqi_info_num', 'count')
        ],
        'aqi': [
            ('aqi_min', 'min'),
            ('aqi_max', 'max'),
            ('aqi_mean', 'mean'),
            ('aqi_50', lambda x: sum(x > 50))
        ]
    }
)
  • set_index('label', inplace=True)  # set a column as the index
  • reset_index(inplace=True)  # move the index back into a regular column
df.reset_index(inplace=True)
df
  aqiInfo     bWendu                             yWendu                              aqiInfo           aqi
            bwendu_max bwendu_min bwendu_mean  ywendu_max ywendu_min ywendu_mean  aqi_info_num  aqi_min aqi_max    aqi_mean  aqi_50
0    严重污染        25         25   25.000000           9          9    9.000000             1      387     387  387.000000       1
1    中度污染        36          6   20.238095          25         -5    8.047619            21      151     198  175.523810      21
2       优         35         -5   16.191304          26        -12    6.478261           115       21      50   37.426087       0
3       良         38         -2   20.233333          27        -10   10.100000           150       51     100   72.766667     150
4    轻度污染        37         -1   19.318182          27         -9    8.303030            66      101     149  117.772727      66
5    重度污染        27          7   15.916667          11          0    5.416667            12      206     293  244.333333      12

2 Data Merging

2.1 The concat function

  • With no extra arguments, concat stacks along axis 0, keeps the original indexes, and fills missing values with NaN
  • ignore_index=True: discard the original indexes and renumber
  • axis=0: the axis along which to concatenate
  • join='outer': take the union of columns; join='inner' takes the intersection
  • keys: add an extra index level on concatenation, so you can tell which source table each row came from
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3'],
                    'E': ['E0', 'E1', 'E2', 'E3']})
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7'],
                    'F': ['F4', 'F5', 'F6', 'F7']})
pd.concat([df1,df2],ignore_index=True)
A B C D E F
0 A0 B0 C0 D0 E0 NaN
1 A1 B1 C1 D1 E1 NaN
2 A2 B2 C2 D2 E2 NaN
3 A3 B3 C3 D3 E3 NaN
4 A4 B4 C4 D4 NaN F4
5 A5 B5 C5 D5 NaN F5
6 A6 B6 C6 D6 NaN F6
7 A7 B7 C7 D7 NaN F7
pd.concat([df1,df2],axis=1)
A B C D E A B C D F
0 A0 B0 C0 D0 E0 A4 B4 C4 D4 F4
1 A1 B1 C1 D1 E1 A5 B5 C5 D5 F5
2 A2 B2 C2 D2 E2 A6 B6 C6 D6 F6
3 A3 B3 C3 D3 E3 A7 B7 C7 D7 F7
pd.concat([df1,df2],join="inner")
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
0 A4 B4 C4 D4
1 A5 B5 C5 D5
2 A6 B6 C6 D6
3 A7 B7 C7 D7
pd.concat([df1,df2],keys=['df1','df2'])
A B C D E F
df1 0 A0 B0 C0 D0 E0 NaN
1 A1 B1 C1 D1 E1 NaN
2 A2 B2 C2 D2 E2 NaN
3 A3 B3 C3 D3 E3 NaN
df2 0 A4 B4 C4 D4 NaN F4
1 A5 B5 C5 D5 NaN F5
2 A6 B6 C6 D6 NaN F6
3 A7 B7 C7 D7 NaN F7
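The keys argument simply adds an outer index level, so the rows from one source table can be recovered with ordinary index selection:

combined = pd.concat([df1, df2], keys=['df1', 'df2'])
combined.loc['df1']  # only the rows that came from df1
combined.xs('df2')   # equivalent, via the cross-section accessor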
  • json.loads() parses a JSON string into Python objects (typically a dict)
import json
# Flatten the records under dt['data']['areaTree'][2]['children'] (one row per
# province-level region) into a single table and save it as an Excel file
with open('./data/list-total.json', 'r', encoding='utf-8') as f:  # read the local JSON file
    dt = json.loads(f.read())  # json.loads deserializes the JSON string into Python objects
head_info = pd.DataFrame(dt['data']['areaTree'][2]['children']).loc[:, [
                    'name', 'id', 'lastUpdateTime']]  # keep the directly usable fields
today = pd.DataFrame([i['today'] for i in dt['data']['areaTree'][2]['children']])
# pd.DataFrame turns the list of dicts into a DataFrame
total = pd.DataFrame([i['total'] for i in dt['data']['areaTree'][2]['children']])
today.columns = [f"today_{i}" for i in today.columns]  # prefix the column names by source
total.columns = [f"total_{i}" for i in total.columns]
all_info = pd.concat([head_info, today, total], axis=1)  # merge the three tables column-wise
all_info.to_excel("./data/all_provience.xlsx", index=None)
# index=None: do not write the row index to the file
dt.keys()
dict_keys(['reqId', 'code', 'msg', 'data', 'timestamp'])

2.2 The merge function

2.2.1 Merge examples

data1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3']})
data2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(data1,data2)
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3
  • If the two tables share one or more columns, merge uses those shared columns as the join key by default, even when on is not specified
pd.merge(data1,data2,on='key')
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3
  • What if the two tables share more than one column?
    • If you set the key to both employee name and employee ID, a row survives only when both fields match; note that the default join type is an inner join
    • When several columns are shared, on picks which of them to join on; if on is omitted, all shared columns are used together
data1 = pd.DataFrame([[1, '赵一', 30000, 6000, 2500, 125, 21375, False],
                      [2, '钱明', 20000, 4000, 2500, 125, 13375, False],
                      [3, '周元', 50000, 10000, 2500, 125, 37375, True],
                      [4, '李雷', 40000, 8000, 2500, 125, 29375, False],
                      [5, '王七', 60000, 12000, 2500, 125, 45375, True]],
                     columns=['工号', '员工姓名', '销售业绩', '提成收入', '基本工斐', '社保金额', '创造收益', '是否达标'])
data2 = pd.DataFrame([[4, '李雷', '脑洞有限公司', '女', '离职', '中级'],
                      [8, '王七', '脑洞有限公司', '男', '在职', '高级'],
                      [6, '赵三', '脑洞有限公司', '男', '离职', '高级'],
                      [7, '赵六', '脑洞有限公司', '女', '在职', '中级']],
                     columns=['工号', '员工姓名', '公司名称', '性别', '员工状态', '员工等级'])
pd.merge(data1,data2,on=['员工姓名'])
   工号_x 员工姓名   销售业绩  提成收入  基本工斐  社保金额   创造收益   是否达标  工号_y    公司名称 性别 员工状态 员工等级
0      4   李雷  40000   8000  2500   125  29375  False      4  脑洞有限公司  女   离职   中级
1      5   王七  60000  12000  2500   125  45375   True      8  脑洞有限公司  男   在职   高级
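The 工号_x / 工号_y names come from merge's suffixes parameter, which defaults to ('_x', '_y') for overlapping columns that are not part of the key; it can be overridden, for example:

pd.merge(data1, data2, on=['员工姓名'], suffixes=('_left', '_right'))
# the overlapping 工号 column now appears as 工号_left and 工号_right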
data1 = pd.DataFrame([[1, '赵一', 30000, 6000, 2500, 125, 21375, False],
                      [2, '钱明', 20000, 4000, 2500, 125, 13375, False],
                      [3, '周元', 50000, 10000, 2500, 125, 37375, True],
                      [4, '李雷', 40000, 8000, 2500, 125, 29375, False],
                      [5, '王七', 60000, 12000, 2500, 125, 45375, True]],
                     columns=['工号', '员工姓名', '销售业绩', '提成收入', '基本工斐', '社保金额', '创造收益', '是否达标'])
data3 = pd.DataFrame([[1, '赵一', '脑洞有限公司', '女', '离职', '初级'],
                      [2, '钱明', '脑洞有限公司', '男', '离职', '初级'],
                      [3, '周元', '脑洞有限公司', '男', '在职', '中级'],
                      [4, '李雷', '脑洞有限公司', '女', '离职', '中级'],
                      [5, '王七', '脑洞有限公司', '男', '在职', '高级'],
                      [6, '赵三', '脑洞有限公司', '男', '离职', '高级'],
                      [7, '赵六', '脑洞有限公司', '女', '在职', '中级']],
                     columns=['工号', '姓名', '公司名称', '性别', '状态', '员工等级'])
data1.set_index('工号',inplace=True)
data3.set_index('工号',inplace=True)
pd.merge(data1, data3, left_index=True, right_index=True)
# use the index on both sides as the join condition
     员工姓名   销售业绩  提成收入  基本工斐  社保金额   创造收益   是否达标  姓名    公司名称 性别  状态 员工等级
工号
1     赵一  30000   6000  2500   125  21375  False  赵一  脑洞有限公司  女  离职   初级
2     钱明  20000   4000  2500   125  13375  False  钱明  脑洞有限公司  男  离职   初级
3     周元  50000  10000  2500   125  37375   True  周元  脑洞有限公司  男  在职   中级
4     李雷  40000   8000  2500   125  29375  False  李雷  脑洞有限公司  女  离职   中级
5     王七  60000  12000  2500   125  45375   True  王七  脑洞有限公司  男  在职   高级
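When both tables are already indexed by the join key, DataFrame.join is a convenient shorthand: it joins on the index by default (a left join unless how says otherwise):

data1.join(data3, how='inner')  # same result as the index-based merge above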

2.2.2 Parameters

  • how: one of 'left', 'right', 'inner', 'outer'; on: the key column(s)
data1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3']})
data2 = pd.DataFrame({'key': ['K0s', 'K1s', 'K2s', 'K3s'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(data1,data2,how='left',on='key')
key A B C D
0 K0 A0 B0 NaN NaN
1 K1 A1 B1 NaN NaN
2 K2 A2 B2 NaN NaN
3 K3 A3 B3 NaN NaN
pd.merge(data1,data2,how='right',on='key')
key A B C D
0 K0s NaN NaN C0 D0
1 K1s NaN NaN C1 D1
2 K2s NaN NaN C2 D2
3 K3s NaN NaN C3 D3
pd.merge(data1,data2,how='outer',on='key')
key A B C D
0 K0 A0 B0 NaN NaN
1 K1 A1 B1 NaN NaN
2 K2 A2 B2 NaN NaN
3 K3 A3 B3 NaN NaN
4 K0s NaN NaN C0 D0
5 K1s NaN NaN C1 D1
6 K2s NaN NaN C2 D2
7 K3s NaN NaN C3 D3
data1 = pd.DataFrame({'key1': ['K0', 'K1', 'K2', 'K3'],
                      'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3']})
data2 = pd.DataFrame({'key2': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
  • left_on: the column in the left table to use as the key
  • right_on: the column in the right table to use as the key
pd.merge(data1,data2,how="outer",left_on='key1',right_on='key2')
key1 A B key2 C D
0 K0 A0 B0 K0 C0 D0
1 K1 A1 B1 K1 C1 D1
2 K2 A2 B2 K2 C2 D2
3 K3 A3 B3 K3 C3 D3

3 Data Transformation Functions

  • map: works on a Series only; maps each value through a dict (key -> value) or a function
  • apply: on a Series it transforms each value; on a DataFrame it processes whole Series along a given axis
  • applymap: works on a DataFrame only; applies a function to every single element
data = pd.DataFrame([['2019-10-03', 'BIDU', 104.32, 102.35, 104.73, 101.15, 2.24, 0.02],
                     ['2019-10-02', 'iq', 102.62, 100.85, 103.24, 99.5, 2.69, 0.01],
                     ['2019-10-01', 'Jd', 102.0, 102.8, 103.26, 101.0, 1.78, -0.01],
                     ['2019-10-03', 'BABA', 169.48, 166.65, 170.18, 165.0, 10.39, 0.02],
                     ['2019-10-02', 'baidu', 165.77, 162.82, 166.88, 161.9, 11.6, 0.0],
                     ['2019-10-01', 'BABA', 165.15, 168.01, 168.23, 163.64, 14.19, -0.01]],
                    columns=['日期', '公司', '收盘', '开盘', '高', '低', '交易量', '涨跌幅'])
data
日期 公司 收盘 开盘 高 低 交易量 涨跌幅
0 2019-10-03 BIDU 104.32 102.35 104.73 101.15 2.24 0.02
1 2019-10-02 iq 102.62 100.85 103.24 99.50 2.69 0.01
2 2019-10-01 Jd 102.00 102.80 103.26 101.00 1.78 -0.01
3 2019-10-03 BABA 169.48 166.65 170.18 165.00 10.39 0.02
4 2019-10-02 baidu 165.77 162.82 166.88 161.90 11.60 0.00
5 2019-10-01 BABA 165.15 168.01 168.23 163.64 14.19 -0.01
en_name = data['公司'].unique().tolist()  # deduplicated company names
cn_name = ["百度","爱奇艺","京东","阿里巴巴","百度"]
name_en_cn = dict(zip(en_name,cn_name))
name_en_cn
{'BIDU': '百度', 'iq': '爱奇艺', 'Jd': '京东', 'BABA': '阿里巴巴', 'baidu': '百度'}
  • map: works on a Series only; maps each value through a dict or a function
data["公司中文名字"] = data['公司'].map(name_en_cn)
 data["公司中文名字2"] = data['公司'].map(lambda x:name_en_cn[x])
  • apply: on a Series it transforms each value; on a DataFrame it processes whole Series along a given axis
data['公司'].apply(lambda x:x.lower())
0     bidu
1       iq
2       jd
3     baba
4    baidu
5     baba
Name: 公司, dtype: object
 data["公司名字2"] = data.apply(lambda x:x['公司'].lower(),axis=1)
data
日期 公司 收盘 开盘 高 低 交易量 涨跌幅 公司中文名字 公司中文名字2 公司名字2
0 2019-10-03 BIDU 104.32 102.35 104.73 101.15 2.24 0.02 百度 百度 bidu
1 2019-10-02 iq 102.62 100.85 103.24 99.50 2.69 0.01 爱奇艺 爱奇艺 iq
2 2019-10-01 Jd 102.00 102.80 103.26 101.00 1.78 -0.01 京东 京东 jd
3 2019-10-03 BABA 169.48 166.65 170.18 165.00 10.39 0.02 阿里巴巴 阿里巴巴 baba
4 2019-10-02 baidu 165.77 162.82 166.88 161.90 11.60 0.00 百度 百度 baidu
5 2019-10-01 BABA 165.15 168.01 168.23 163.64 14.19 -0.01 阿里巴巴 阿里巴巴 baba
  • applymap: works on a DataFrame only; applies a function to every single element
data.iloc[:, 2:8] = data.iloc[:, 2:8].applymap(int)
data
日期 公司 收盘 开盘 高 低 交易量 涨跌幅 公司中文名字 公司中文名字2 公司名字2
0 2019-10-03 BIDU 104 102 104 101 2 0 百度 百度 bidu
1 2019-10-02 iq 102 100 103 99 2 0 爱奇艺 爱奇艺 iq
2 2019-10-01 Jd 102 102 103 101 1 0 京东 京东 jd
3 2019-10-03 BABA 169 166 170 165 10 0 阿里巴巴 阿里巴巴 baba
4 2019-10-02 baidu 165 162 166 161 11 0 百度 百度 baidu
5 2019-10-01 BABA 165 168 168 163 14 0 阿里巴巴 阿里巴巴 baba
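Note that in recent pandas releases (2.1 and later) applymap is deprecated and has been renamed to DataFrame.map; if you see a deprecation warning, the equivalent element-wise call is:

data.iloc[:, 2:8] = data.iloc[:, 2:8].map(int)  # pandas >= 2.1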

4 Time Series Handling

ts = pd.Timestamp('2022-05-19 16:09:27')
ts
Timestamp('2022-05-19 16:09:27')
ts.month
5
ts.day
19
ts.year
2022
ts.hour
16
  • Generating timestamps
  • pd.date_range(start, end, freq, periods)
    • start: the start time
    • end: the end time
    • freq: the interval between timestamps
    • periods: the number of timestamps
date = pd.Series(pd.date_range('2022-01-01','2022-05-31',freq='10D'))
date
0    2022-01-01
1    2022-01-11
2    2022-01-21
3    2022-01-31
4    2022-02-10
5    2022-02-20
6    2022-03-02
7    2022-03-12
8    2022-03-22
9    2022-04-01
10   2022-04-11
11   2022-04-21
12   2022-05-01
13   2022-05-11
14   2022-05-21
15   2022-05-31
dtype: datetime64[ns]
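Of the four parameters, you normally supply start and end plus either freq or periods (freq defaults to daily); periods instead asks for a fixed number of evenly spaced timestamps, for example:

pd.date_range(start='2022-01-01', periods=5, freq='10D')
# DatetimeIndex(['2022-01-01', '2022-01-11', '2022-01-21',
#                '2022-01-31', '2022-02-10'], dtype='datetime64[ns]', freq='10D')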
date.dt.day_of_week
# Monday is 0, Tuesday is 1, ..., Saturday is 5, Sunday is 6
0     5
1     1
2     4
3     0
4     3
5     6
6     2
7     5
8     1
9     4
10    0
11    3
12    6
13    2
14    5
15    1
dtype: int64
date.dt.month_name() # returns English month names
0      January
1      January
2      January
3      January
4     February
5     February
6        March
7        March
8        March
9        April
10       April
11       April
12         May
13         May
14         May
15         May
dtype: object
date.dt.day_name() # returns English day-of-week names
0      Saturday
1       Tuesday
2        Friday
3        Monday
4      Thursday
5        Sunday
6     Wednesday
7      Saturday
8       Tuesday
9        Friday
10       Monday
11     Thursday
12       Sunday
13    Wednesday
14     Saturday
15      Tuesday
dtype: object
date.dt.is_month_end # is the date the last day of its month?
date.dt.is_month_start # is the date the first day of its month?
0      True
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9      True
10    False
11    False
12     True
13    False
14    False
15    False
dtype: bool
date.dt.is_quarter_start # is the date the first day of its quarter?
date.dt.is_quarter_end # is the date the last day of its quarter?
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
dtype: bool
  • Using dates as the index
data = pd.DataFrame(np.random.randn(365,7),index=pd.date_range('2022/01/01','2022/12/31'),
            columns=list('ABCDEFG'))
data.loc[data.index.day_of_week.isin([5, 6]), :]  # select the rows that fall on Saturday or Sunday
A B C D E F G
2022-01-01 -0.315602 1.938327 -0.661239 1.836433 -0.428376 1.541268 -0.530534
2022-01-02 1.910313 -0.044312 2.415538 -1.043056 0.569339 1.426120 0.810997
2022-01-08 -0.969020 0.261871 0.654391 -0.001796 -0.271103 -0.460436 0.919883
2022-01-09 1.303100 1.159366 -0.801309 -1.771291 -0.115401 -1.290599 -2.397392
2022-01-15 1.084151 -0.366372 -1.588546 0.710772 1.384267 -0.291067 0.630542
... ... ... ... ... ... ... ...
2022-12-17 0.294332 2.134218 -0.302565 0.317527 -0.227907 -2.086622 -0.135243
2022-12-18 -0.574041 0.008905 0.038212 -0.575164 0.970056 -0.101120 -0.191207
2022-12-24 -0.112847 -0.100422 -0.480243 -1.527839 -0.975818 0.975825 0.175665
2022-12-25 -0.474319 -0.687114 0.037217 0.756393 1.766992 1.143376 1.619943
2022-12-31 0.627957 1.562743 -0.786430 1.117696 -0.076019 -0.781706 -0.997004

105 rows × 7 columns

# select the first and last day of every month
data.loc[data.index.is_month_start | data.index.is_month_end]
A B C D E F G
2022-01-01 -0.315602 1.938327 -0.661239 1.836433 -0.428376 1.541268 -0.530534
2022-01-31 0.695454 -0.248829 1.516964 1.981671 1.349618 -0.160981 -1.139252
2022-02-01 -0.749427 0.284487 1.145231 -0.754888 0.806720 0.383182 -2.055748
2022-02-28 -0.929055 0.114310 -1.687400 -1.163224 -0.464947 -1.831223 1.089308
2022-03-01 0.804782 1.210089 0.330364 -0.647111 2.124095 -0.380592 -0.228030
2022-03-31 -1.370201 -0.019276 0.542051 -0.646306 0.645481 0.966601 -1.141636
2022-04-01 0.184640 -0.356681 -1.116461 1.436733 0.397696 0.212470 0.793653
2022-04-30 -1.104026 0.063872 -1.646085 -0.154575 0.570746 0.375219 -0.772805
2022-05-01 0.660145 -0.623160 -0.349977 0.409316 -0.559836 0.981211 -0.294337
2022-05-31 0.077000 -0.346333 0.633896 0.237591 -0.931699 0.846571 -1.423198
2022-06-01 0.032000 0.902177 0.511199 -0.802463 -1.028341 0.471876 -0.425566
2022-06-30 -0.660275 -0.316015 -0.213440 2.043720 -1.248286 -0.680060 -0.624570
2022-07-01 -0.937354 -0.400323 -0.535804 -0.482045 0.295207 -0.975954 0.725207
2022-07-31 -0.404180 0.599710 1.702778 0.670591 -0.948164 0.590056 0.301291
2022-08-01 -2.485035 -0.213797 -0.291625 -1.843586 -2.144625 -0.710480 -0.118547
2022-08-31 -1.378986 2.086181 -1.198380 -0.083847 -0.333783 -0.042462 1.855522
2022-09-01 -0.905029 2.848176 -0.906683 1.595003 -0.151549 1.087064 -0.880773
2022-09-30 0.781285 0.369520 0.870433 1.296064 1.977797 -0.148640 0.700021
2022-10-01 0.225028 1.566604 0.492757 -1.303721 -0.977534 -0.095851 2.878760
2022-10-31 0.157326 0.145141 0.104380 -0.762992 0.012724 2.832172 1.667798
2022-11-01 0.534880 -1.129539 -0.125695 -0.209681 0.764291 -0.323865 -0.460686
2022-11-30 0.870528 1.473008 1.022682 0.249725 -1.217788 1.245133 -0.029543
2022-12-01 0.561672 -0.046809 -1.072582 -0.384236 -0.255351 1.754940 -0.596977
2022-12-31 0.627957 1.562743 -0.786430 1.117696 -0.076019 -0.781706 -0.997004
dt = pd.read_excel('./data/all_provience.xlsx')
  • pd.to_datetime()
    converts ordinary date/time strings to the datetime64[ns] dtype
dt['lastUpdateTime'] = pd.to_datetime(dt['lastUpdateTime']) 
# converting plain dates to datetime64[ns] matters for time-based slicing
dt.set_index('lastUpdateTime',inplace=True)
dt.loc['2022-05-19 09:00':'2022-05-19 10:00':]
C:\Users\buzhi\AppData\Local\Temp\ipykernel_60496\3227629596.py:1: FutureWarning: Value based partial slicing on non-monotonic DatetimeIndexes with non-existing keys is deprecated and will raise a KeyError in a future Version.
  dt.loc['2022-05-19 09:00':'2022-05-19 10:00':]
name id today_confirm today_suspect today_heal today_dead today_severe today_storeConfirm total_confirm total_suspect total_heal total_dead total_severe total_input total_newConfirm total_newDead total_newHeal
lastUpdateTime
2022-05-19 09:33:57 台湾 700000 85082 NaN 0 41 NaN 85041 981141 0 13742 1224 0 0 NaN NaN NaN
2022-05-19 09:33:58 香港 810000 70 NaN 110 0 NaN -40 331840 0 61516 9361 0 0 NaN NaN NaN
2022-05-19 09:33:57 河南 410000 11 NaN 16 0 NaN -5 3163 0 2962 22 0 0 NaN NaN NaN
2022-05-19 09:04:57 辽宁 210000 0 NaN 1 0 NaN -1 1670 0 1657 2 0 0 NaN NaN NaN
2022-05-19 09:04:57 广西 450000 0 NaN 2 0 NaN -2 1611 0 1598 2 0 0 NaN NaN NaN
2022-05-19 09:04:56 安徽 340000 0 NaN 1 0 NaN -1 1065 0 1059 6 0 0 NaN NaN NaN
2022-05-19 09:33:58 重庆 500000 0 NaN 2 0 NaN -2 708 0 695 6 0 0 NaN NaN NaN
dt.reset_index(inplace=True)
dt.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   lastUpdateTime      34 non-null     datetime64[ns]
 1   name                34 non-null     object        
 2   id                  34 non-null     int64         
 3   today_confirm       34 non-null     int64         
 4   today_suspect       0 non-null      float64       
 5   today_heal          34 non-null     int64         
 6   today_dead          34 non-null     int64         
 7   today_severe        0 non-null      float64       
 8   today_storeConfirm  34 non-null     int64         
 9   total_confirm       34 non-null     int64         
 10  total_suspect       34 non-null     int64         
 11  total_heal          34 non-null     int64         
 12  total_dead          34 non-null     int64         
 13  total_severe        34 non-null     int64         
 14  total_input         34 non-null     int64         
 15  total_newConfirm    1 non-null      float64       
 16  total_newDead       1 non-null      float64       
 17  total_newHeal       1 non-null      float64       
dtypes: datetime64[ns](1), float64(5), int64(11), object(1)
memory usage: 4.9+ KB
  • .apply(lambda x: x.strftime("%Y-%m-%d %H:%M:%S"))
    converts datetime64[ns] values back to plain time strings
# convert the datetime64[ns] column back to ordinary string data
dt['lastUpdateTime'].apply(lambda x:x.strftime("%Y-%m-%d %H:%M:%S")).tolist()
['2022-05-19 09:33:57',
 '2022-05-19 09:33:58',
 '2022-05-19 00:00:27',
 '2022-05-19 10:45:03',
 '2022-05-19 14:13:16',
 '2022-05-19 13:08:11',
 '2022-05-19 00:00:26',
 '2022-05-19 09:33:57',
 '2022-05-19 13:08:11',
 '2022-05-19 12:23:08',
 '2022-05-19 08:32:55',
 '2022-05-19 08:32:56',
 '2022-05-19 08:32:56',
 '2022-05-19 00:04:26',
 '2022-05-19 11:58:09',
 '2022-05-19 08:32:56',
 '2022-05-19 00:00:26',
 '2022-05-19 00:39:33',
 '2022-05-19 00:00:27',
 '2022-05-19 09:04:57',
 '2022-05-19 09:04:57',
 '2022-05-19 11:57:07',
 '2022-05-19 11:58:09',
 '2022-05-19 09:04:56',
 '2022-05-19 00:00:26',
 '2022-05-19 09:33:58',
 '2022-05-19 00:00:26',
 '2022-05-19 12:23:08',
 '2022-05-19 00:00:26',
 '2022-05-19 00:00:27',
 '2022-05-19 11:58:09',
 '2022-05-19 00:00:26',
 '2022-05-19 00:00:26',
 '2022-05-19 00:00:26']

5 Data Cleaning

  • Duplicate data

    • Duplicate records: one or more rows whose feature values are completely identical
    • Duplicate features: one or more columns with different names but identical data
  • Missing values

    • Deletion
    • Filling
    • Interpolation
  • Outliers (a short sketch of these detection rules follows this list)

    • Outlier detection
    • Detection via summary statistics
    • Boxplot (IQR) detection
    • The 3σ rule
  • Normalization (also sketched below)

    • Min-max scaling
    • Z-score (standard deviation) scaling
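None of the outlier-detection or scaling methods above needs more than a few lines; a minimal sketch on a synthetic Series:

import pandas as pd
import numpy as np

s = pd.Series(np.random.randn(1000))

# 3-sigma rule: flag points more than 3 standard deviations from the mean
outliers_3sigma = s[(s - s.mean()).abs() > 3 * s.std()]

# boxplot (IQR) rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers_iqr = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# min-max scaling: map values into the [0, 1] interval
s_minmax = (s - s.min()) / (s.max() - s.min())

# z-score scaling: zero mean, unit standard deviation
s_zscore = (s - s.mean()) / s.std()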

5.1 Handling Duplicate Data

5.1.1 Duplicate records

import pandas as pd
data = pd.read_csv('./data/detail.csv',encoding='gbk')

drop_duplicates(subset=None, keep='first', inplace=False)

  • subset: the column name(s) to deduplicate on
  • keep:
    • 'first': keep the first occurrence of each duplicate, drop the rest
    • 'last': keep the last occurrence, drop the earlier ones
    • False: drop every row that has a duplicate
  • inplace: whether to modify the DataFrame in place
data['dishes_name'].drop_duplicates(keep=False)  # every dish name occurs more than once, so keep=False drops them all
Series([], Name: dishes_name, dtype: object)
data.drop_duplicates(subset=['order_id','emp_id']).head()
detail_id order_id dishes_id logicprn_name parent_class_name dishes_name itemis_add counts amounts cost place_order_time discount_amt discount_reason kick_back add_inprice add_info bar_code picture_file emp_id
0 2956 417 610062 NaN NaN 蒜蓉生蚝 0 1 49 NaN 2016/8/1 11:05:00 NaN NaN NaN 0 NaN NaN caipu/104001.jpg 1442
5 1899 301 610019 NaN NaN 白斩鸡 0 1 88 NaN 2016/8/1 11:15:00 NaN NaN NaN 0 NaN NaN caipu/204002.jpg 1095
11 2916 413 609966 NaN NaN 芝士烩波士顿龙虾 0 1 175 NaN 2016/8/1 12:42:00 NaN NaN NaN 0 NaN NaN caipu/101001.jpg 1147
22 2938 415 609964 NaN NaN 避风塘炒蟹 0 1 48 NaN 2016/8/1 12:51:00 NaN NaN NaN 0 NaN NaN caipu/102004.jpg 1166
28 2643 392 609930 NaN NaN 豌豆薯仔猪骨汤 0 1 39 NaN 2016/8/1 12:58:00 NaN NaN NaN 0 NaN NaN caipu/203001.jpg 1094

5.1.2 Duplicate features

  • One or more columns whose names differ but whose data is identical
  • Detect by computing a correlation/similarity coefficient
  • Or judge visually from a scatter matrix or heat map
data[['counts','amounts']].corr()
counts amounts
counts 1.000000 -0.159264
amounts -0.159264 1.000000
import numpy as np
a = np.arange(1,11)
b = a * 10
dt = pd.DataFrame({'A':a,'B':b})
dt[['A','B']].corr()
A B
A 1.0 1.0
B 1.0 1.0
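Once a pair of columns is known to be perfectly correlated, one of them can be dropped. A small sketch driven by the correlation matrix (the threshold of exactly 1.0 is illustrative; a near-1 cutoff is common in practice):

corr = dt.corr()
# look only at the upper triangle so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] == 1).any()]
dt_dedup = dt.drop(columns=redundant)  # drops column 'B' in this example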

5.2 Handling Missing Values

  • How to identify missing values
    • isnull flags missing values
    • notnull flags non-missing values
data = pd.read_csv('./data/learn_pandas.csv')
data.head()
School Grade Name Gender Height Weight Transfer Test_Number Test_Date Time_Record
0 Shanghai Jiao Tong University Freshman Gaopeng Yang Female 158.9 46.0 N 1 2019/10/5 0:04:34
1 Peking University Freshman Changqiang You Male 166.5 70.0 N 1 2019/9/4 0:04:20
2 Shanghai Jiao Tong University Senior Mei Sun Male 188.9 89.0 N 2 2019/9/12 0:05:22
3 Fudan University Sophomore Xiaojuan Sun Female NaN 41.0 N 2 2020/1/3 0:04:08
4 Fudan University Sophomore Gaojuan You Male 174.0 74.0 N 2 2019/11/6 0:05:22
  • isnull flags missing values
data.isnull().sum() # count the missing values in each column
School          0
Grade           0
Name            0
Gender          0
Height         17
Weight         11
Transfer       12
Test_Number     0
Test_Date       0
Time_Record     0
dtype: int64
data.loc[:,"Height"].isna().sum()
17
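Identifying missing values is only the first step; the deletion, filling, and interpolation strategies listed at the start of this chapter correspond to dropna, fillna, and interpolate. A minimal sketch on the Height column:

data['Height'].dropna()                       # deletion: drop missing entries
data['Height'].fillna(data['Height'].mean())  # filling: replace with the mean
data['Height'].interpolate()                  # interpolation: linear by default
data.dropna(subset=['Height', 'Weight'])      # drop rows missing either column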
