
The Pandas Data Analysis Library: Basics (Part 2)
1 The SAC Grouping Pattern in Pandas and the groupby Object
- SAC refers to the split-apply-combine process behind grouping operations, where (see the sketch below)
  - split: the data is partitioned into groups according to some rule
  - apply: a function is applied to each group independently
  - combine: the per-group results are assembled into a single data structure
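A minimal sketch of the three steps on a tiny made-up frame (the column names city and sales are invented for illustration and do not come from the dataset used below):
import pandas as pd
demo = pd.DataFrame({'city': ['BJ', 'SH', 'BJ', 'SH'], 'sales': [1, 2, 3, 4]})
gb = demo.groupby('city')                       # split: rows are partitioned by city
res = gb['sales'].apply(lambda s: s.sum())      # apply: the function runs once per group
res                                             # combine: per-group results come back as one Series
city
BJ    4
SH    6
Name: sales, dtype: int64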
1.1 The groupby object
import pandas as pd
df = pd.read_csv('./data/learn_pandas.csv')
df.head(4)
|   | School | Grade | Name | Gender | Height | Weight | Transfer | Test_Number | Test_Date | Time_Record |
|---|--------|-------|------|--------|--------|--------|----------|-------------|-----------|-------------|
| 0 | Shanghai Jiao Tong University | Freshman | Gaopeng Yang | Female | 158.9 | 46.0 | N | 1 | 2019/10/5 | 0:04:34 |
| 1 | Peking University | Freshman | Changqiang You | Male | 166.5 | 70.0 | N | 1 | 2019/9/4 | 0:04:20 |
| 2 | Shanghai Jiao Tong University | Senior | Mei Sun | Male | 188.9 | 89.0 | N | 2 | 2019/9/12 | 0:05:22 |
| 3 | Fudan University | Sophomore | Xiaojuan Sun | Female | NaN | 41.0 | N | 2 | 2020/1/3 | 0:04:08 |
df.groupby(['School','Gender'])['Height'].mean()
School Gender
Fudan University Female 158.776923
Male 174.212500
Peking University Female 158.666667
Male 172.030000
Shanghai Jiao Tong University Female 159.122500
Male 176.760000
Tsinghua University Female 159.753333
Male 171.638889
Name: Height, dtype: float64
df.groupby(['School','Grade','Gender'])['Gender'].count()
School Grade Gender
Fudan University Freshman Female 8
Male 1
Junior Female 9
Male 3
Senior Female 9
Male 2
Sophomore Female 4
Male 4
Peking University Freshman Female 7
Male 6
Junior Female 6
Male 2
Senior Female 5
Male 3
Sophomore Female 4
Male 1
Shanghai Jiao Tong University Freshman Female 8
Male 5
Junior Female 15
Male 2
Senior Female 14
Male 8
Sophomore Female 4
Male 1
Tsinghua University Freshman Female 17
Junior Female 13
Male 9
Senior Female 10
Male 4
Sophomore Female 8
Male 8
Name: Gender, dtype: int64
condition = df['Height'] > df["Height"].mean()
df.groupby(condition)['Weight'].mean()
Height
False 47.672414
True 66.684932
Name: Weight, dtype: float64
df.groupby(['School', 'Gender']).agg(
    {
        "Height": [
            ('height_max', 'max'),
            ('height_min', 'min'),
            ('height_mean', 'mean')
        ],
        "Weight": [
            ('weight_ptp', lambda x: x.max() - x.min()),
            ('weight_median', 'median')
        ]
    }
)
| School | Gender | height_max | height_min | height_mean | weight_ptp | weight_median |
|--------|--------|------------|------------|-------------|------------|---------------|
| Fudan University | Female | 170.2 | 147.3 | 158.776923 | 29.0 | 47.5 |
|                  | Male   | 177.3 | 167.6 | 174.212500 | 19.0 | 73.5 |
| Peking University | Female | 170.0 | 147.8 | 158.666667 | 22.0 | 46.5 |
|                   | Male   | 185.3 | 162.4 | 172.030000 | 29.0 | 73.5 |
| Shanghai Jiao Tong University | Female | 167.7 | 145.4 | 159.122500 | 23.0 | 49.0 |
|                               | Male   | 188.9 | 166.0 | 176.760000 | 27.0 | 75.0 |
| Tsinghua University | Female | 168.9 | 150.5 | 159.753333 | 21.0 | 49.0 |
|                     | Male   | 193.9 | 155.7 | 171.638889 | 28.0 | 71.0 |
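The same renamed aggregations can also be written with the keyword form of agg (named aggregation, available since pandas 0.25); a sketch equivalent to the dict-of-tuples call above, except that the result has flat column names instead of a two-level header:
df.groupby(['School', 'Gender']).agg(
    height_max=('Height', 'max'),
    height_min=('Height', 'min'),
    height_mean=('Height', 'mean'),
    weight_ptp=('Weight', lambda x: x.max() - x.min()),
    weight_median=('Weight', 'median'),
)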
- Exercise: group the Beijing weather data by aqiInfo and compute, for each air-quality level, the max / min / mean of the daytime temperature (bWendu) and of the nighttime temperature (yWendu), the number of records in the level, the max / min / mean of aqi, and the number of days whose AQI exceeds 50.
import pandas as pd
import numpy as np
data = pd.read_csv('./data/beijing_tianqi_2018.csv')
data.head(4)
|   | ymd | bWendu | yWendu | tianqi | fengxiang | fengli | aqi | aqiInfo | aqiLevel |
|---|-----|--------|--------|--------|-----------|--------|-----|---------|----------|
| 0 | 2018-01-01 | 3℃ | -6℃ | 晴~多云 | 东北风 | 1-2级 | 59 | 良 | 2 |
| 1 | 2018-01-02 | 2℃ | -5℃ | 阴~多云 | 东北风 | 1-2级 | 49 | 优 | 1 |
| 2 | 2018-01-03 | 2℃ | -5℃ | 多云 | 北风 | 1-2级 | 28 | 优 | 1 |
| 3 | 2018-01-04 | 0℃ | -8℃ | 阴 | 东北风 | 1-2级 | 28 | 优 | 1 |
data['bWendu'] = data['bWendu'].str.replace("℃", '').astype(int)   # np.int0 is removed in recent NumPy; plain int works
data['yWendu'] = data['yWendu'].str.replace("℃", '').astype(int)
df = data.groupby('aqiInfo').agg(
    {
        "bWendu": [
            ('bwendu_max', 'max'),
            ('bwendu_min', 'min'),
            ('bwendu_mean', 'mean')
        ],
        "yWendu": [
            ('ywendu_max', 'max'),
            ('ywendu_min', 'min'),
            ('ywendu_mean', 'mean')
        ],
        'aqiInfo': [
            ('aqi_info_num', 'count')
        ],
        'aqi': [
            ('aqi_min', 'min'),
            ('aqi_max', 'max'),
            ('aqi_mean', 'mean'),
            ('aqi_50', lambda x: sum(x > 50))
        ]
    }
)
- set_index('label', inplace=True)  # make an existing column the index
- reset_index(inplace=True)  # move the index back into a regular column and restore the default integer index
df.reset_index(inplace=True)
df
|   | aqiInfo | bwendu_max | bwendu_min | bwendu_mean | ywendu_max | ywendu_min | ywendu_mean | aqi_info_num | aqi_min | aqi_max | aqi_mean | aqi_50 |
|---|---------|------------|------------|-------------|------------|------------|-------------|--------------|---------|---------|----------|--------|
| 0 | 严重污染 | 25 | 25 | 25.000000 | 9 | 9 | 9.000000 | 1 | 387 | 387 | 387.000000 | 1 |
| 1 | 中度污染 | 36 | 6 | 20.238095 | 25 | -5 | 8.047619 | 21 | 151 | 198 | 175.523810 | 21 |
| 2 | 优 | 35 | -5 | 16.191304 | 26 | -12 | 6.478261 | 115 | 21 | 50 | 37.426087 | 0 |
| 3 | 良 | 38 | -2 | 20.233333 | 27 | -10 | 10.100000 | 150 | 51 | 100 | 72.766667 | 150 |
| 4 | 轻度污染 | 37 | -1 | 19.318182 | 27 | -9 | 8.303030 | 66 | 101 | 149 | 117.772727 | 66 |
| 5 | 重度污染 | 27 | 7 | 15.916667 | 11 | 0 | 5.416667 | 12 | 206 | 293 | 244.333333 | 12 |
2 Combining Data
2.1 The concat function
- Called with no extra arguments, concat stacks along axis 0 by default; the original indexes are kept and missing columns are filled with NaN
- ignore_index=True: discard the original indexes and renumber the result
- axis=0: the axis to concatenate along (axis=1 concatenates side by side)
- join='outer': keep the union of the columns; join='inner' keeps only the columns shared by all inputs
- keys=[...]: add an extra index level so you can tell which source table each row came from
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3'],
'E': ['E0', 'E1', 'E2', 'E3']})
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
'B': ['B4', 'B5', 'B6', 'B7'],
'C': ['C4', 'C5', 'C6', 'C7'],
'D': ['D4', 'D5', 'D6', 'D7'],
'F': ['F4', 'F5', 'F6', 'F7']})
pd.concat([df1,df2],ignore_index=True)
|   | A  | B  | C  | D  | E   | F   |
|---|----|----|----|----|-----|-----|
| 0 | A0 | B0 | C0 | D0 | E0  | NaN |
| 1 | A1 | B1 | C1 | D1 | E1  | NaN |
| 2 | A2 | B2 | C2 | D2 | E2  | NaN |
| 3 | A3 | B3 | C3 | D3 | E3  | NaN |
| 4 | A4 | B4 | C4 | D4 | NaN | F4  |
| 5 | A5 | B5 | C5 | D5 | NaN | F5  |
| 6 | A6 | B6 | C6 | D6 | NaN | F6  |
| 7 | A7 | B7 | C7 | D7 | NaN | F7  |
pd.concat([df1,df2],axis=1)
|   | A  | B  | C  | D  | E  | A  | B  | C  | D  | F  |
|---|----|----|----|----|----|----|----|----|----|----|
| 0 | A0 | B0 | C0 | D0 | E0 | A4 | B4 | C4 | D4 | F4 |
| 1 | A1 | B1 | C1 | D1 | E1 | A5 | B5 | C5 | D5 | F5 |
| 2 | A2 | B2 | C2 | D2 | E2 | A6 | B6 | C6 | D6 | F6 |
| 3 | A3 | B3 | C3 | D3 | E3 | A7 | B7 | C7 | D7 | F7 |
pd.concat([df1,df2],join="inner")
|   | A  | B  | C  | D  |
|---|----|----|----|----|
| 0 | A0 | B0 | C0 | D0 |
| 1 | A1 | B1 | C1 | D1 |
| 2 | A2 | B2 | C2 | D2 |
| 3 | A3 | B3 | C3 | D3 |
| 0 | A4 | B4 | C4 | D4 |
| 1 | A5 | B5 | C5 | D5 |
| 2 | A6 | B6 | C6 | D6 |
| 3 | A7 | B7 | C7 | D7 |
pd.concat([df1,df2],keys=['df1','df2'])
|     |   | A  | B  | C  | D  | E   | F   |
|-----|---|----|----|----|----|-----|-----|
| df1 | 0 | A0 | B0 | C0 | D0 | E0  | NaN |
|     | 1 | A1 | B1 | C1 | D1 | E1  | NaN |
|     | 2 | A2 | B2 | C2 | D2 | E2  | NaN |
|     | 3 | A3 | B3 | C3 | D3 | E3  | NaN |
| df2 | 0 | A4 | B4 | C4 | D4 | NaN | F4  |
|     | 1 | A5 | B5 | C5 | D5 | NaN | F5  |
|     | 2 | A6 | B6 | C6 | D6 | NaN | F6  |
|     | 3 | A7 | B7 | C7 | D7 | NaN | F7  |
- json.loads() parses a JSON string (here the file contents read into memory) into a Python dict
import json
with open('./data/list-total.json','r',encoding='utf-8') as f:
dt = json.loads(f.read())
head_info = pd.DataFrame(dt['data']['areaTree'][2]['children']).loc[:,[
'name', 'id', 'lastUpdateTime']]
today = pd.DataFrame([i['today'] for i in dt['data']['areaTree'][2]['children']])
total = pd.DataFrame([i['total'] for i in dt['data']['areaTree'][2]['children']])
today.columns = [f"today_{i}" for i in today.columns]
total.columns = [f"total_{i}" for i in total.columns]
all_info = pd.concat([head_info,today,total],axis=1)
all_info.to_excel("./data/all_provience.xlsx",index=None)
dt.keys()
dict_keys(['reqId', 'code', 'msg', 'data', 'timestamp'])
2.2 Join-style merging with merge()
2.2.1 Merge basics
data1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
data2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(data1,data2)
|   | key | A  | B  | C  | D  |
|---|-----|----|----|----|----|
| 0 | K0  | A0 | B0 | C0 | D0 |
| 1 | K1  | A1 | B1 | C1 | D1 |
| 2 | K2  | A2 | B2 | C2 | D2 |
| 3 | K3  | A3 | B3 | C3 | D3 |
- If the two tables share a column, merge joins on that common column by default, even when no key is specified with on
pd.merge(data1,data2,on='key')
|   | key | A  | B  | C  | D  |
|---|-----|----|----|----|----|
| 0 | K0  | A0 | B0 | C0 | D0 |
| 1 | K1  | A1 | B1 | C1 | D1 |
| 2 | K2  | A2 | B2 | C2 | D2 |
| 3 | K3  | A3 | B3 | C3 | D3 |
- What if the two tables share more than one column?
- For example, with the employee name and the employee ID both as join keys, a row is kept only when the two values match in both tables; note that the join used here is an inner join
- When several columns are shared, passing on selects which of them to join on; if on is omitted, all shared columns are used as join keys
data1 = pd.DataFrame([[1, '赵一', 30000, 6000, 2500, 125, 21375, False],
[2, '钱明', 20000, 4000, 2500, 125, 13375, False],
[3, '周元', 50000, 10000, 2500, 125, 37375, True],
[4, '李雷', 40000, 8000, 2500, 125, 29375, False],
[5, '王七', 60000, 12000, 2500, 125, 45375, True]],
columns=['工号', '员工姓名','销售业绩', '提成收入', '基本工斐', '社保金额', '创造收益',
'是否达标'])
data2 = pd.DataFrame([[4, '李雷', '脑洞有限公司', '女', '离职', '中级'],
[8, '王七', '脑洞有限公司', '男', '在职', '高级'],
[6, '赵三', '脑洞有限公司', '男', '离职', '高级'],
[7, '赵六', '脑洞有限公司', '女', '在职', '中级']],
columns=['工号', '员工姓名', '公司名称', '性别', '员工状态', '员工等级'])
pd.merge(data1,data2,on=['员工姓名'])
|   | 工号_x | 员工姓名 | 销售业绩 | 提成收入 | 基本工斐 | 社保金额 | 创造收益 | 是否达标 | 工号_y | 公司名称 | 性别 | 员工状态 | 员工等级 |
|---|--------|----------|----------|----------|----------|----------|----------|----------|--------|--------------|------|----------|----------|
| 0 | 4 | 李雷 | 40000 | 8000 | 2500 | 125 | 29375 | False | 4 | 脑洞有限公司 | 女 | 离职 | 中级 |
| 1 | 5 | 王七 | 60000 | 12000 | 2500 | 125 | 45375 | True | 8 | 脑洞有限公司 | 男 | 在职 | 高级 |
data1 = pd.DataFrame([[1, '赵一', 30000, 6000, 2500, 125, 21375, False],
[2, '钱明', 20000, 4000, 2500, 125, 13375, False],
[3, '周元', 50000, 10000, 2500, 125, 37375, True],
[4, '李雷', 40000, 8000, 2500, 125, 29375, False],
[5, '王七', 60000, 12000, 2500, 125, 45375, True]],
columns = ['工号', '员工姓名', '销售业绩', '提成收入', '基本工斐', '社保金额', '创造收益',
'是否达标'])
data3 = pd.DataFrame([[1, '赵一', '脑洞有限公司', '女', '离职', '初级'],
[2, '钱明', '脑洞有限公司', '男', '离职', '初级'],
[3, '周元', '脑洞有限公司', '男', '在职', '中级'],
[4, '李雷', '脑洞有限公司', '女', '离职', '中级'],
[5, '王七', '脑洞有限公司', '男', '在职', '高级'],
[6, '赵三', '脑洞有限公司', '男', '离职', '高级'],
[7, '赵六', '脑洞有限公司', '女', '在职', '中级']],
columns = ['工号', '姓名', '公司名称', '性别', '状态', '员工等级'])
data1.set_index('工号',inplace=True)
data3.set_index('工号',inplace=True)
pd.merge(data1,data3,left_index=True,right_index=True)
| 工号 | 员工姓名 | 销售业绩 | 提成收入 | 基本工斐 | 社保金额 | 创造收益 | 是否达标 | 姓名 | 公司名称 | 性别 | 状态 | 员工等级 |
|------|----------|----------|----------|----------|----------|----------|----------|------|--------------|------|------|----------|
| 1 | 赵一 | 30000 | 6000 | 2500 | 125 | 21375 | False | 赵一 | 脑洞有限公司 | 女 | 离职 | 初级 |
| 2 | 钱明 | 20000 | 4000 | 2500 | 125 | 13375 | False | 钱明 | 脑洞有限公司 | 男 | 离职 | 初级 |
| 3 | 周元 | 50000 | 10000 | 2500 | 125 | 37375 | True | 周元 | 脑洞有限公司 | 男 | 在职 | 中级 |
| 4 | 李雷 | 40000 | 8000 | 2500 | 125 | 29375 | False | 李雷 | 脑洞有限公司 | 女 | 离职 | 中级 |
| 5 | 王七 | 60000 | 12000 | 2500 | 125 | 45375 | True | 王七 | 脑洞有限公司 | 男 | 在职 | 高级 |
2.2.2 Parameters
- how='left' / 'right' / 'inner' / 'outer', plus on to choose the join key
data1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
data2 = pd.DataFrame({'key': ['K0s', 'K1s', 'K2s', 'K3s'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(data1,data2,how='left',on='key')
|   | key | A  | B  | C   | D   |
|---|-----|----|----|-----|-----|
| 0 | K0  | A0 | B0 | NaN | NaN |
| 1 | K1  | A1 | B1 | NaN | NaN |
| 2 | K2  | A2 | B2 | NaN | NaN |
| 3 | K3  | A3 | B3 | NaN | NaN |
pd.merge(data1,data2,how='right',on='key')
|   | key | A   | B   | C  | D  |
|---|-----|-----|-----|----|----|
| 0 | K0s | NaN | NaN | C0 | D0 |
| 1 | K1s | NaN | NaN | C1 | D1 |
| 2 | K2s | NaN | NaN | C2 | D2 |
| 3 | K3s | NaN | NaN | C3 | D3 |
pd.merge(data1,data2,how='outer',on='key')
|   | key | A   | B   | C   | D   |
|---|-----|-----|-----|-----|-----|
| 0 | K0  | A0  | B0  | NaN | NaN |
| 1 | K1  | A1  | B1  | NaN | NaN |
| 2 | K2  | A2  | B2  | NaN | NaN |
| 3 | K3  | A3  | B3  | NaN | NaN |
| 4 | K0s | NaN | NaN | C0  | D0  |
| 5 | K1s | NaN | NaN | C1  | D1  |
| 6 | K2s | NaN | NaN | C2  | D2  |
| 7 | K3s | NaN | NaN | C3  | D3  |
data1 = pd.DataFrame({'key1': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
data2 = pd.DataFrame({'key2': ['K0', 'K1', 'K2', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
- left_on: the column in the left table to use as the join key
- right_on: the column in the right table to use as the join key
pd.merge(data1,data2,how="outer",left_on='key1',right_on='key2')
|   | key1 | A  | B  | key2 | C  | D  |
|---|------|----|----|------|----|----|
| 0 | K0 | A0 | B0 | K0 | C0 | D0 |
| 1 | K1 | A1 | B1 | K1 | C1 | D1 |
| 2 | K2 | A2 | B2 | K2 | C2 | D2 |
| 3 | K3 | A3 | B3 | K3 | C3 | D3 |
3 Data Transformation Functions
- map: works only on a Series; maps each value through a dict (key -> value) or a function
- apply: on a Series it transforms every value; on a DataFrame it applies a function to each Series along the chosen axis (row- or column-wise)
- applymap: works only on a DataFrame; applies a function to every single element
data = pd.DataFrame([['2019-10-03', 'BIDU', 104.32, 102.35, 104.73, 101.15, 2.24, 0.02],
['2019-10-02', 'iq', 102.62, 100.85, 103.24, 99.5, 2.69, 0.01],
['2019-10-01', 'Jd', 102.0, 102.8, 103.26, 101.0, 1.78, -0.01],
['2019-10-03', 'BABA', 169.48, 166.65, 170.18, 165.0, 10.39, 0.02],
['2019-10-02', 'baidu', 165.77, 162.82, 166.88, 161.9, 11.6, 0.0],
['2019-10-01', 'BABA', 165.15, 168.01, 168.23, 163.64, 14.19,
-0.01]], columns = ['日期', '公司', '收盘', '开盘', '高', '低', '交易量', '涨跌幅'])
data
|   | 日期 | 公司 | 收盘 | 开盘 | 高 | 低 | 交易量 | 涨跌幅 |
|---|------|------|------|------|----|----|--------|--------|
| 0 | 2019-10-03 | BIDU | 104.32 | 102.35 | 104.73 | 101.15 | 2.24 | 0.02 |
| 1 | 2019-10-02 | iq | 102.62 | 100.85 | 103.24 | 99.50 | 2.69 | 0.01 |
| 2 | 2019-10-01 | Jd | 102.00 | 102.80 | 103.26 | 101.00 | 1.78 | -0.01 |
| 3 | 2019-10-03 | BABA | 169.48 | 166.65 | 170.18 | 165.00 | 10.39 | 0.02 |
| 4 | 2019-10-02 | baidu | 165.77 | 162.82 | 166.88 | 161.90 | 11.60 | 0.00 |
| 5 | 2019-10-01 | BABA | 165.15 | 168.01 | 168.23 | 163.64 | 14.19 | -0.01 |
en_name = data['公司'].unique().tolist()
cn_name = ["百度","爱奇艺","京东","阿里巴巴","百度"]
name_en_cn = dict(zip(en_name,cn_name))
name_en_cn
{'BIDU': '百度', 'iq': '爱奇艺', 'Jd': '京东', 'BABA': '阿里巴巴', 'baidu': '百度'}
- map: works only on a Series; maps each value through a dict or a function
data["公司中文名字"] = data['公司'].map(name_en_cn)
data["公司中文名字2"] = data['公司'].map(lambda x:name_en_cn[x])
- apply: on a Series it transforms every value; on a DataFrame it works on a whole Series along the chosen axis
data['公司'].apply(lambda x:x.lower())
0 bidu
1 iq
2 jd
3 baba
4 baidu
5 baba
Name: 公司, dtype: object
data["公司名字2"] = data.apply(lambda x:x['公司'].lower(),axis=1)
data
|   | 日期 | 公司 | 收盘 | 开盘 | 高 | 低 | 交易量 | 涨跌幅 | 公司中文名字 | 公司中文名字2 | 公司名字2 |
|---|------|------|------|------|----|----|--------|--------|--------------|---------------|-----------|
| 0 | 2019-10-03 | BIDU | 104.32 | 102.35 | 104.73 | 101.15 | 2.24 | 0.02 | 百度 | 百度 | bidu |
| 1 | 2019-10-02 | iq | 102.62 | 100.85 | 103.24 | 99.50 | 2.69 | 0.01 | 爱奇艺 | 爱奇艺 | iq |
| 2 | 2019-10-01 | Jd | 102.00 | 102.80 | 103.26 | 101.00 | 1.78 | -0.01 | 京东 | 京东 | jd |
| 3 | 2019-10-03 | BABA | 169.48 | 166.65 | 170.18 | 165.00 | 10.39 | 0.02 | 阿里巴巴 | 阿里巴巴 | baba |
| 4 | 2019-10-02 | baidu | 165.77 | 162.82 | 166.88 | 161.90 | 11.60 | 0.00 | 百度 | 百度 | baidu |
| 5 | 2019-10-01 | BABA | 165.15 | 168.01 | 168.23 | 163.64 | 14.19 | -0.01 | 阿里巴巴 | 阿里巴巴 | baba |
- applymap: works only on a DataFrame; applies a function to every single element
data.iloc[:,2:8] = data.iloc[:,2:8].applymap(lambda x: int(x))
data
|   | 日期 | 公司 | 收盘 | 开盘 | 高 | 低 | 交易量 | 涨跌幅 | 公司中文名字 | 公司中文名字2 | 公司名字2 |
|---|------|------|------|------|----|----|--------|--------|--------------|---------------|-----------|
| 0 | 2019-10-03 | BIDU | 104 | 102 | 104 | 101 | 2 | 0 | 百度 | 百度 | bidu |
| 1 | 2019-10-02 | iq | 102 | 100 | 103 | 99 | 2 | 0 | 爱奇艺 | 爱奇艺 | iq |
| 2 | 2019-10-01 | Jd | 102 | 102 | 103 | 101 | 1 | 0 | 京东 | 京东 | jd |
| 3 | 2019-10-03 | BABA | 169 | 166 | 170 | 165 | 10 | 0 | 阿里巴巴 | 阿里巴巴 | baba |
| 4 | 2019-10-02 | baidu | 165 | 162 | 166 | 161 | 11 | 0 | 百度 | 百度 | baidu |
| 5 | 2019-10-01 | BABA | 165 | 168 | 168 | 163 | 14 | 0 | 阿里巴巴 | 阿里巴巴 | baba |
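Note: newer pandas releases (2.1 and later) rename DataFrame.applymap to DataFrame.map, so on such versions the element-wise conversion above can be written as the sketch below; applymap still works there but emits a deprecation warning:
data.iloc[:, 2:8] = data.iloc[:, 2:8].map(lambda x: int(x))   # element-wise, same effect as applymap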
4 Time Series Handling
ts = pd.Timestamp('2022-05-19 16:09:27')
ts
Timestamp('2022-05-19 16:09:27')
ts.month
5
ts.day
19
ts.year
2022
ts.hour
16
- Generating a range of timestamps
- pd.date_range(start, end, freq, periods) (a small example with periods is sketched below)
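Instead of an end date, date_range can also take periods to fix the number of timestamps generated; a small sketch:
pd.date_range('2022-01-01', periods=5, freq='D')
DatetimeIndex(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05'], dtype='datetime64[ns]', freq='D')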
date = pd.Series(pd.date_range('2022-01-01','2022-05-31',freq='10D'))
date
0 2022-01-01
1 2022-01-11
2 2022-01-21
3 2022-01-31
4 2022-02-10
5 2022-02-20
6 2022-03-02
7 2022-03-12
8 2022-03-22
9 2022-04-01
10 2022-04-11
11 2022-04-21
12 2022-05-01
13 2022-05-11
14 2022-05-21
15 2022-05-31
dtype: datetime64[ns]
date.dt.day_of_week
0 5
1 1
2 4
3 0
4 3
5 6
6 2
7 5
8 1
9 4
10 0
11 3
12 6
13 2
14 5
15 1
dtype: int64
date.dt.month_name()
0 January
1 January
2 January
3 January
4 February
5 February
6 March
7 March
8 March
9 April
10 April
11 April
12 May
13 May
14 May
15 May
dtype: object
date.dt.day_name()
0 Saturday
1 Tuesday
2 Friday
3 Monday
4 Thursday
5 Sunday
6 Wednesday
7 Saturday
8 Tuesday
9 Friday
10 Monday
11 Thursday
12 Sunday
13 Wednesday
14 Saturday
15 Tuesday
dtype: object
date.dt.is_month_end
date.dt.is_month_start
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 True
10 False
11 False
12 True
13 False
14 False
15 False
dtype: bool
date.dt.is_quarter_start
date.dt.is_quarter_end
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
dtype: bool
data = pd.DataFrame(np.random.randn(365,7),index=pd.date_range('2022/01/01','2022/12/31'),
columns=list('ABCDEFG'))
data.loc[data.index.day_of_week.isin([5,6]),:]
|            | A | B | C | D | E | F | G |
|------------|---|---|---|---|---|---|---|
| 2022-01-01 | -0.315602 | 1.938327 | -0.661239 | 1.836433 | -0.428376 | 1.541268 | -0.530534 |
| 2022-01-02 | 1.910313 | -0.044312 | 2.415538 | -1.043056 | 0.569339 | 1.426120 | 0.810997 |
| 2022-01-08 | -0.969020 | 0.261871 | 0.654391 | -0.001796 | -0.271103 | -0.460436 | 0.919883 |
| 2022-01-09 | 1.303100 | 1.159366 | -0.801309 | -1.771291 | -0.115401 | -1.290599 | -2.397392 |
| 2022-01-15 | 1.084151 | -0.366372 | -1.588546 | 0.710772 | 1.384267 | -0.291067 | 0.630542 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2022-12-17 | 0.294332 | 2.134218 | -0.302565 | 0.317527 | -0.227907 | -2.086622 | -0.135243 |
| 2022-12-18 | -0.574041 | 0.008905 | 0.038212 | -0.575164 | 0.970056 | -0.101120 | -0.191207 |
| 2022-12-24 | -0.112847 | -0.100422 | -0.480243 | -1.527839 | -0.975818 | 0.975825 | 0.175665 |
| 2022-12-25 | -0.474319 | -0.687114 | 0.037217 | 0.756393 | 1.766992 | 1.143376 | 1.619943 |
| 2022-12-31 | 0.627957 | 1.562743 | -0.786430 | 1.117696 | -0.076019 | -0.781706 | -0.997004 |

105 rows × 7 columns
data.loc[data.index.is_month_start | data.index.is_month_end]
|            | A | B | C | D | E | F | G |
|------------|---|---|---|---|---|---|---|
| 2022-01-01 | -0.315602 | 1.938327 | -0.661239 | 1.836433 | -0.428376 | 1.541268 | -0.530534 |
| 2022-01-31 | 0.695454 | -0.248829 | 1.516964 | 1.981671 | 1.349618 | -0.160981 | -1.139252 |
| 2022-02-01 | -0.749427 | 0.284487 | 1.145231 | -0.754888 | 0.806720 | 0.383182 | -2.055748 |
| 2022-02-28 | -0.929055 | 0.114310 | -1.687400 | -1.163224 | -0.464947 | -1.831223 | 1.089308 |
| 2022-03-01 | 0.804782 | 1.210089 | 0.330364 | -0.647111 | 2.124095 | -0.380592 | -0.228030 |
| 2022-03-31 | -1.370201 | -0.019276 | 0.542051 | -0.646306 | 0.645481 | 0.966601 | -1.141636 |
| 2022-04-01 | 0.184640 | -0.356681 | -1.116461 | 1.436733 | 0.397696 | 0.212470 | 0.793653 |
| 2022-04-30 | -1.104026 | 0.063872 | -1.646085 | -0.154575 | 0.570746 | 0.375219 | -0.772805 |
| 2022-05-01 | 0.660145 | -0.623160 | -0.349977 | 0.409316 | -0.559836 | 0.981211 | -0.294337 |
| 2022-05-31 | 0.077000 | -0.346333 | 0.633896 | 0.237591 | -0.931699 | 0.846571 | -1.423198 |
| 2022-06-01 | 0.032000 | 0.902177 | 0.511199 | -0.802463 | -1.028341 | 0.471876 | -0.425566 |
| 2022-06-30 | -0.660275 | -0.316015 | -0.213440 | 2.043720 | -1.248286 | -0.680060 | -0.624570 |
| 2022-07-01 | -0.937354 | -0.400323 | -0.535804 | -0.482045 | 0.295207 | -0.975954 | 0.725207 |
| 2022-07-31 | -0.404180 | 0.599710 | 1.702778 | 0.670591 | -0.948164 | 0.590056 | 0.301291 |
| 2022-08-01 | -2.485035 | -0.213797 | -0.291625 | -1.843586 | -2.144625 | -0.710480 | -0.118547 |
| 2022-08-31 | -1.378986 | 2.086181 | -1.198380 | -0.083847 | -0.333783 | -0.042462 | 1.855522 |
| 2022-09-01 | -0.905029 | 2.848176 | -0.906683 | 1.595003 | -0.151549 | 1.087064 | -0.880773 |
| 2022-09-30 | 0.781285 | 0.369520 | 0.870433 | 1.296064 | 1.977797 | -0.148640 | 0.700021 |
| 2022-10-01 | 0.225028 | 1.566604 | 0.492757 | -1.303721 | -0.977534 | -0.095851 | 2.878760 |
| 2022-10-31 | 0.157326 | 0.145141 | 0.104380 | -0.762992 | 0.012724 | 2.832172 | 1.667798 |
| 2022-11-01 | 0.534880 | -1.129539 | -0.125695 | -0.209681 | 0.764291 | -0.323865 | -0.460686 |
| 2022-11-30 | 0.870528 | 1.473008 | 1.022682 | 0.249725 | -1.217788 | 1.245133 | -0.029543 |
| 2022-12-01 | 0.561672 | -0.046809 | -1.072582 | -0.384236 | -0.255351 | 1.754940 | -0.596977 |
| 2022-12-31 | 0.627957 | 1.562743 | -0.786430 | 1.117696 | -0.076019 | -0.781706 | -0.997004 |
dt = pd.read_excel('./data/all_provience.xlsx')
- pd.to_datetime()
  converts plain date/time strings to the datetime64[ns] dtype
dt['lastUpdateTime'] = pd.to_datetime(dt['lastUpdateTime'])
dt.set_index('lastUpdateTime',inplace=True)
dt.loc['2022-05-19 09:00':'2022-05-19 10:00':]
C:\Users\buzhi\AppData\Local\Temp\ipykernel_60496\3227629596.py:1: FutureWarning: Value based partial slicing on non-monotonic DatetimeIndexes with non-existing keys is deprecated and will raise a KeyError in a future Version.
dt.loc['2022-05-19 09:00':'2022-05-19 10:00':]
| lastUpdateTime | name | id | today_confirm | today_suspect | today_heal | today_dead | today_severe | today_storeConfirm | total_confirm | total_suspect | total_heal | total_dead | total_severe | total_input | total_newConfirm | total_newDead | total_newHeal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2022-05-19 09:33:57 | 台湾 | 700000 | 85082 | NaN | 0 | 41 | NaN | 85041 | 981141 | 0 | 13742 | 1224 | 0 | 0 | NaN | NaN | NaN |
| 2022-05-19 09:33:58 | 香港 | 810000 | 70 | NaN | 110 | 0 | NaN | -40 | 331840 | 0 | 61516 | 9361 | 0 | 0 | NaN | NaN | NaN |
| 2022-05-19 09:33:57 | 河南 | 410000 | 11 | NaN | 16 | 0 | NaN | -5 | 3163 | 0 | 2962 | 22 | 0 | 0 | NaN | NaN | NaN |
| 2022-05-19 09:04:57 | 辽宁 | 210000 | 0 | NaN | 1 | 0 | NaN | -1 | 1670 | 0 | 1657 | 2 | 0 | 0 | NaN | NaN | NaN |
| 2022-05-19 09:04:57 | 广西 | 450000 | 0 | NaN | 2 | 0 | NaN | -2 | 1611 | 0 | 1598 | 2 | 0 | 0 | NaN | NaN | NaN |
| 2022-05-19 09:04:56 | 安徽 | 340000 | 0 | NaN | 1 | 0 | NaN | -1 | 1065 | 0 | 1059 | 6 | 0 | 0 | NaN | NaN | NaN |
| 2022-05-19 09:33:58 | 重庆 | 500000 | 0 | NaN | 2 | 0 | NaN | -2 | 708 | 0 | 695 | 6 | 0 | 0 | NaN | NaN | NaN |
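The FutureWarning above is raised because the DatetimeIndex is not sorted; partial-string slicing on a non-monotonic index is deprecated. Sorting the index first avoids it (a minimal sketch):
dt = dt.sort_index()                                 # make the DatetimeIndex monotonic
dt.loc['2022-05-19 09:00':'2022-05-19 10:00']        # the same slice now runs without the warning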
dt.reset_index(inplace=True)
dt.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 lastUpdateTime 34 non-null datetime64[ns]
1 name 34 non-null object
2 id 34 non-null int64
3 today_confirm 34 non-null int64
4 today_suspect 0 non-null float64
5 today_heal 34 non-null int64
6 today_dead 34 non-null int64
7 today_severe 0 non-null float64
8 today_storeConfirm 34 non-null int64
9 total_confirm 34 non-null int64
10 total_suspect 34 non-null int64
11 total_heal 34 non-null int64
12 total_dead 34 non-null int64
13 total_severe 34 non-null int64
14 total_input 34 non-null int64
15 total_newConfirm 1 non-null float64
16 total_newDead 1 non-null float64
17 total_newHeal 1 non-null float64
dtypes: datetime64[ns](1), float64(5), int64(11), object(1)
memory usage: 4.9+ KB
- .apply(lambda x: x.strftime("%Y-%m-%d %H:%M:%S"))
  converts datetime64[ns] values back into formatted time strings
import time
dt['lastUpdateTime'].apply(lambda x:x.strftime("%Y-%m-%d %H:%M:%S")).tolist()
['2022-05-19 09:33:57',
'2022-05-19 09:33:58',
'2022-05-19 00:00:27',
'2022-05-19 10:45:03',
'2022-05-19 14:13:16',
'2022-05-19 13:08:11',
'2022-05-19 00:00:26',
'2022-05-19 09:33:57',
'2022-05-19 13:08:11',
'2022-05-19 12:23:08',
'2022-05-19 08:32:55',
'2022-05-19 08:32:56',
'2022-05-19 08:32:56',
'2022-05-19 00:04:26',
'2022-05-19 11:58:09',
'2022-05-19 08:32:56',
'2022-05-19 00:00:26',
'2022-05-19 00:39:33',
'2022-05-19 00:00:27',
'2022-05-19 09:04:57',
'2022-05-19 09:04:57',
'2022-05-19 11:57:07',
'2022-05-19 11:58:09',
'2022-05-19 09:04:56',
'2022-05-19 00:00:26',
'2022-05-19 09:33:58',
'2022-05-19 00:00:26',
'2022-05-19 12:23:08',
'2022-05-19 00:00:26',
'2022-05-19 00:00:27',
'2022-05-19 11:58:09',
'2022-05-19 00:00:26',
'2022-05-19 00:00:26',
'2022-05-19 00:00:26']
5 Data Cleaning
Handling duplicate data
- Record duplication: several rows have identical values in one or more features
- Feature duplication: one or more columns have different names but exactly the same data
Handling missing values
Handling outliers
- Outlier detection
  - detection via summary statistics
  - box-plot detection
  - the 3σ rule
Data standardization (a sketch of the 3σ rule and min-max standardization follows this list)
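The 3σ rule and standardization are listed above but not demonstrated later in these notes; a minimal sketch on an arbitrary numeric Series s (the variable is made up for illustration):
import numpy as np
import pandas as pd
s = pd.Series(np.random.randn(1000))                          # any numeric column works here
mu, sigma = s.mean(), s.std()
outliers = s[(s < mu - 3 * sigma) | (s > mu + 3 * sigma)]     # 3σ rule: flag points beyond mean ± 3 std
s_minmax = (s - s.min()) / (s.max() - s.min())                # min-max standardization to [0, 1]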
5.1 Handling Duplicate Data
5.1.1 Record duplication
import pandas as pd
data = pd.read_csv('./data/detail.csv',encoding='gbk')
drop_duplicates(subset, keep, inplace)
- subset: the column(s) used to decide whether rows count as duplicates
- keep:
  - 'first': keep the first occurrence and drop the later duplicates
  - 'last': keep the last occurrence and drop the earlier duplicates
  - False: drop every row that has a duplicate
- inplace: whether to modify the object in place
data['dishes_name'].drop_duplicates(keep=False)
Series([], Name: dishes_name, dtype: object)
data.drop_duplicates(subset=['order_id','emp_id']).head()
|    | detail_id | order_id | dishes_id | logicprn_name | parent_class_name | dishes_name | itemis_add | counts | amounts | cost | place_order_time | discount_amt | discount_reason | kick_back | add_inprice | add_info | bar_code | picture_file | emp_id |
|----|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0  | 2956 | 417 | 610062 | NaN | NaN | 蒜蓉生蚝 | 0 | 1 | 49 | NaN | 2016/8/1 11:05:00 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/104001.jpg | 1442 |
| 5  | 1899 | 301 | 610019 | NaN | NaN | 白斩鸡 | 0 | 1 | 88 | NaN | 2016/8/1 11:15:00 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/204002.jpg | 1095 |
| 11 | 2916 | 413 | 609966 | NaN | NaN | 芝士烩波士顿龙虾 | 0 | 1 | 175 | NaN | 2016/8/1 12:42:00 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/101001.jpg | 1147 |
| 22 | 2938 | 415 | 609964 | NaN | NaN | 避风塘炒蟹 | 0 | 1 | 48 | NaN | 2016/8/1 12:51:00 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/102004.jpg | 1166 |
| 28 | 2643 | 392 | 609930 | NaN | NaN | 豌豆薯仔猪骨汤 | 0 | 1 | 39 | NaN | 2016/8/1 12:58:00 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/203001.jpg | 1094 |
5.1.2 Feature duplication
- one or more columns have different names but exactly the same data
- detect it by computing similarity / correlation coefficients
- or inspect a scatter matrix or a correlation heatmap
data[['counts','amounts']].corr()
|         | counts    | amounts   |
|---------|-----------|-----------|
| counts  | 1.000000  | -0.159264 |
| amounts | -0.159264 | 1.000000  |
import numpy as np
a = np.arange(1,11)
b = a * 10
dt = pd.DataFrame({'A':a,'B':b})
dt[['A','B']].corr()
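For the constructed pair above, corr() returns 1.0 off the diagonal. One way (not from the original notes) to drop such duplicated features is to scan the upper triangle of the correlation matrix and remove any column that is almost perfectly correlated with an earlier one; a sketch:
corr = dt.corr()
dup_cols = [b for i, a in enumerate(corr.columns)
            for b in corr.columns[i + 1:]
            if corr.loc[a, b] > 0.999]        # columns nearly identical to an earlier column
dt_clean = dt.drop(columns=dup_cols)          # here 'B' is dropped and 'A' is kept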
5.2 Handling Missing Values
- How to identify missing values
  - isnull: flags values that are missing
  - notnull: flags values that are not missing
data = pd.read_csv('./data/learn_pandas.csv')
data.head()
|   | School | Grade | Name | Gender | Height | Weight | Transfer | Test_Number | Test_Date | Time_Record |
|---|--------|-------|------|--------|--------|--------|----------|-------------|-----------|-------------|
| 0 | Shanghai Jiao Tong University | Freshman | Gaopeng Yang | Female | 158.9 | 46.0 | N | 1 | 2019/10/5 | 0:04:34 |
| 1 | Peking University | Freshman | Changqiang You | Male | 166.5 | 70.0 | N | 1 | 2019/9/4 | 0:04:20 |
| 2 | Shanghai Jiao Tong University | Senior | Mei Sun | Male | 188.9 | 89.0 | N | 2 | 2019/9/12 | 0:05:22 |
| 3 | Fudan University | Sophomore | Xiaojuan Sun | Female | NaN | 41.0 | N | 2 | 2020/1/3 | 0:04:08 |
| 4 | Fudan University | Sophomore | Gaojuan You | Male | 174.0 | 74.0 | N | 2 | 2019/11/6 | 0:05:22 |
data.isnull().sum()
School 0
Grade 0
Name 0
Gender 0
Height 17
Weight 11
Transfer 12
Test_Number 0
Test_Date 0
Time_Record 0
dtype: int64
data.loc[:,"Height"].isna().sum()
17
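Counting missing values is only the identification step; the usual next step, not shown in the original notes, is dropna or fillna. A brief sketch on the same frame:
data.dropna(subset=['Height'])                                    # drop the rows whose Height is missing
data['Height'] = data['Height'].fillna(data['Height'].mean())     # or fill them with the column mean
data['Height'].isna().sum()                                       # now 0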