Table of Contents
- Importing the data
- 1. Pivot tables
- 1.1 pivot
- 1.2 pivot_table
- 1.3 crosstab (cross-tabulation)
- 2. Other reshaping methods
- 2.1 melt
- 2.2 Compressing and expanding
- 3. Dummy variables and factorization
- 3.1 Dummy variables
- 3.2 The factorize method
- 4. Exercises
Importing the data
import numpy as np
import pandas as pd
df = pd.read_csv('data/table.csv')
df.head()
| | School | Class | ID | Gender | Address | Height | Weight | Math | Physics |
|---|---|---|---|---|---|---|---|---|---|
| 0 | S_1 | C_1 | 1101 | M | street_1 | 173 | 63 | 34.0 | A+ |
| 1 | S_1 | C_1 | 1102 | F | street_2 | 192 | 73 | 32.5 | B+ |
| 2 | S_1 | C_1 | 1103 | M | street_2 | 186 | 82 | 87.2 | B+ |
| 3 | S_1 | C_1 | 1104 | F | street_2 | 167 | 81 | 80.4 | B- |
| 4 | S_1 | C_1 | 1105 | F | street_4 | 159 | 64 | 84.8 | B+ |
1. Pivot tables
1.1 pivot
Normally, data in a DataFrame is stored in stacked (long) form; in the table above, for example, the two Gender categories are stacked in a single column. The pivot function spreads a chosen column out into new columns:
df.pivot(index='ID',columns='Gender',values='Height').head()
| Gender | F | M |
|---|---|---|
| ID | | |
| 1101 | NaN | 173.0 |
| 1102 | 192.0 | NaN |
| 1103 | NaN | 186.0 |
| 1104 | 167.0 | NaN |
| 1105 | 159.0 | NaN |
However, pivot is quite limited: besides its narrower feature set, it does not allow duplicated index/column pairs for the values. The following statement, for example, raises an error:
#df.pivot(index='School',columns='Gender',values='Height').head()
This is why we use pivot_table instead.
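A minimal sketch of the difference (using the df loaded above): pivot raises on the duplicated School/Gender pairs, while pivot_table aggregates them:

# pivot cannot decide which Height to keep for a duplicated pair:
try:
    df.pivot(index='School', columns='Gender', values='Height')
except ValueError as err:
    print(err)  # complains about duplicate entries in the index
# pivot_table resolves the conflict by aggregating (mean by default):
pd.pivot_table(df, index='School', columns='Gender', values='Height')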
1.2 pivot_table
First, let's reproduce the operation above:
pd.pivot_table(df,index='ID',columns='Gender',values='Height').head()
| Gender | F | M |
|---|---|---|
| ID | | |
| 1101 | NaN | 173.0 |
| 1102 | 192.0 | NaN |
| 1103 | NaN | 186.0 |
| 1104 | 167.0 | NaN |
| 1105 | 159.0 | NaN |
Since it does quite a bit more, pivot_table cannot match the speed of the plain pivot function:
%timeit df.pivot(index='ID',columns='Gender',values='Height')
%timeit pd.pivot_table(df,index='ID',columns='Gender',values='Height')
3.74 ms ± 240 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
16.9 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
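One way to see where the extra time goes: pivot_table is essentially a groupby followed by unstack. A minimal sketch of the equivalence (on this data, both should produce the same frame):

# group by the future row/column labels, aggregate, then move Gender to columns
via_groupby = df.groupby(['ID', 'Gender'])['Height'].mean().unstack()
via_pivot = pd.pivot_table(df, index='ID', columns='Gender', values='Height')
print(via_groupby.equals(via_pivot))  # expected: True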
pivot_table provides many options; the commonly used parameters are introduced below:
- aggfunc: the aggregation applied within each group; all the usual functions can be passed, and the default is 'mean'
pd.pivot_table(df,index='School',columns='Gender',values='Height',aggfunc=['mean','sum']).head()
| | mean | | sum | |
|---|---|---|---|---|
| Gender | F | M | F | M |
| School | | | | |
| S_1 | 173.125000 | 178.714286 | 1385 | 1251 |
| S_2 | 173.727273 | 172.000000 | 1911 | 1548 |
- margins: adds marginal (summary) rows and columns
pd.pivot_table(df,index='School',columns='Gender',values='Height',aggfunc=['mean','sum'],margins=True).head()
# margins_name sets the label of the margins; the default is 'All'
| | mean | | | sum | | |
|---|---|---|---|---|---|---|
| Gender | F | M | All | F | M | All |
| School | | | | | | |
| S_1 | 173.125000 | 178.714286 | 175.733333 | 1385 | 1251 | 2636 |
| S_2 | 173.727273 | 172.000000 | 172.950000 | 1911 | 1548 | 3459 |
| All | 173.473684 | 174.937500 | 174.142857 | 3296 | 2799 | 6095 |
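As the comment above notes, margins_name controls the label of the summary row and column; a minimal sketch with a hypothetical label:

# 'Total' replaces the default 'All' in both the margin row and column
pd.pivot_table(df, index='School', columns='Gender', values='Height',
               aggfunc='mean', margins=True, margins_name='Total')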
- Rows, columns, and values can all be multi-level (a sketch of indexing into the result follows the table below):
pd.pivot_table(df,index=['School','Class'], columns=['Gender','Address'], values=['Height','Weight'])
| | | Height | | | | | | | | | | | | Weight | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gender | | F | | | | | | M | | | | | | F | | | | | | M | | | | | |
| Address | | street_1 | street_2 | street_4 | street_5 | street_6 | street_7 | street_1 | street_2 | street_4 | street_5 | street_6 | street_7 | street_1 | street_2 | street_4 | street_5 | street_6 | street_7 | street_1 | street_2 | street_4 | street_5 | street_6 | street_7 |
| School | Class | | | | | | | | | | | | | | | | | | | | | | | | |
| S_1 | C_1 | | 179.5 | 159 | | | | 173 | 186 | | | | | | 77 | 64 | | | | 63 | 82 | | | | |
| S_1 | C_2 | | | 176 | 162 | 167 | | | | | 188 | 160 | | | | 94 | 63 | 63 | | | | | 68 | 53 | |
| S_1 | C_3 | 175 | | | 187 | | | | 195 | 161 | | | 188 | 57 | | | 69 | | | | 70 | 68 | | | 82 |
| S_2 | C_1 | | | | 159 | 161 | | | | 163.5 | | | 174 | | | | 97 | 61 | | | | 71 | | | 84 |
| S_2 | C_2 | | | | | | 188.5 | 175 | | 155 | 193 | | | | | | | | 76.5 | 74 | | 91 | 100 | | |
| S_2 | C_3 | | | 157 | | 164 | 190 | | | 187 | 171 | | | | | 78 | | 81 | 99 | | | 73 | 88 | | |
| S_2 | C_4 | | 176 | | | 175.5 | | | | | | | 166 | | 73 | | | 57 | | | | | | | 82 |
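Indexing into such a result uses tuples across the column levels; a minimal sketch (res is the same pivot table as above):

res = pd.pivot_table(df, index=['School', 'Class'],
                     columns=['Gender', 'Address'], values=['Height', 'Weight'])
res['Height', 'F'].head()                  # all female Height columns, one per Address
res.loc[('S_1', 'C_1'), ('Weight', 'M')]   # male Weight cells for a single class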
1.3 crosstab (cross-tabulation)
A cross-tabulation is a special kind of pivot table whose typical use is grouped counting. For instance, to count records grouped by street and gender:
pd.crosstab(index=df['Address'],columns=df['Gender'])
| Gender | F | M |
|---|---|---|
| Address | | |
| street_1 | 1 | 2 |
| street_2 | 4 | 2 |
| street_4 | 3 | 5 |
| street_5 | 3 | 3 |
| street_6 | 5 | 1 |
| street_7 | 3 | 3 |
crosstab is also quite powerful (though it does not yet support multi-level grouping). Some important parameters:
- values and aggfunc: aggregate some column within each group; these two parameters must appear together
pd.crosstab(index=df['Address'],columns=df['Gender'], values=np.random.randint(1,20,df.shape[0]),aggfunc='min')
# Since the values above are random, the table below changes from run to run.
# The default parameters are equivalent to:
# pd.crosstab(index=df['Address'],columns=df['Gender'],values=1,aggfunc='count')
| Gender | F | M |
|---|---|---|
| Address | | |
| street_1 | 13 | 3 |
| street_2 | 16 | 4 |
| street_4 | 15 | 10 |
| street_5 | 1 | 7 |
| street_6 | 3 | 18 |
| street_7 | 4 | 1 |
- Besides the margins parameter, crosstab adds a normalize parameter, which accepts 'all', 'index', or 'columns' (a sketch of the row/column variants follows the table below):
pd.crosstab(index=df['Address'],columns=df['Gender'],normalize='all',margins=True)
| Gender | F | M | All |
|---|---|---|---|
| Address | | | |
| street_1 | 0.028571 | 0.057143 | 0.085714 |
| street_2 | 0.114286 | 0.057143 | 0.171429 |
| street_4 | 0.085714 | 0.142857 | 0.228571 |
| street_5 | 0.085714 | 0.085714 | 0.171429 |
| street_6 | 0.142857 | 0.028571 | 0.171429 |
| street_7 | 0.085714 | 0.085714 | 0.171429 |
| All | 0.542857 | 0.457143 | 1.000000 |
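For comparison, normalize='index' scales each row to sum to 1 (the gender split within each street), while normalize='columns' scales each column (the street split within each gender); a minimal sketch:

pd.crosstab(index=df['Address'], columns=df['Gender'], normalize='index')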
2. Other reshaping methods
2.1 melt
The melt function can be seen as the inverse of pivot: it compresses unstacked data back into stacked form, turning a "wide" DataFrame into a "long" one.
df_m = df[['ID','Gender','Math']]
df_m.head()
| | ID | Gender | Math |
|---|---|---|---|
| 0 | 1101 | M | 34.0 |
| 1 | 1102 | F | 32.5 |
| 2 | 1103 | M | 87.2 |
| 3 | 1104 | F | 80.4 |
| 4 | 1105 | F | 84.8 |
df.pivot(index='ID',columns='Gender',values='Math').head()
| Gender | F | M |
|---|---|---|
| ID | | |
| 1101 | NaN | 34.0 |
| 1102 | 32.5 | NaN |
| 1103 | NaN | 87.2 |
| 1104 | 80.4 | NaN |
| 1105 | 84.8 | NaN |
In melt, id_vars names the columns to keep and value_vars names the group of columns to stack:
pivoted = df.pivot(index='ID',columns='Gender',values='Math')
result = pivoted.reset_index().melt(id_vars=['ID'],value_vars=['F','M'],value_name='Math')\
                .dropna().set_index('ID').sort_index()
# To verify this matches df before reshaping, unfold the chained methods and inspect each intermediate step
result.equals(df_m.set_index('ID'))
True
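If value_vars is omitted, melt stacks every column not listed in id_vars; var_name names the new variable column. A minimal sketch on the same pivoted frame:

pivoted.reset_index().melt(id_vars=['ID'], var_name='Gender', value_name='Math').head()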
2.2 Compressing and expanding
- stack: the most basic reshaping function, with just two parameters: level and dropna (a dropna sketch follows the first example below)
df_s = pd.pivot_table(df,index=['Class','ID'],columns='Gender',values=['Height','Weight'])
df_s.groupby('Class').head(2)
| | | Height | | Weight | |
|---|---|---|---|---|---|
| Gender | | F | M | F | M |
| Class | ID | | | | |
| C_1 | 1101 | NaN | 173.0 | NaN | 63.0 |
| | 1102 | 192.0 | NaN | 73.0 | NaN |
| C_2 | 1201 | NaN | 188.0 | NaN | 68.0 |
| | 1202 | 176.0 | NaN | 94.0 | NaN |
| C_3 | 1301 | NaN | 161.0 | NaN | 68.0 |
| | 1302 | 175.0 | NaN | 57.0 | NaN |
| C_4 | 2401 | 192.0 | NaN | 62.0 | NaN |
| | 2402 | NaN | 166.0 | NaN | 82.0 |
df_stacked = df_s.stack()
df_stacked.groupby('Class').head(2)
| | | | Height | Weight |
|---|---|---|---|---|
| Class | ID | Gender | | |
| C_1 | 1101 | M | 173.0 | 63.0 |
| | 1102 | F | 192.0 | 73.0 |
| C_2 | 1201 | M | 188.0 | 68.0 |
| | 1202 | F | 176.0 | 94.0 |
| C_3 | 1301 | M | 161.0 | 68.0 |
| | 1302 | F | 175.0 | 57.0 |
| C_4 | 2401 | F | 192.0 | 62.0 |
| | 2402 | M | 166.0 | 82.0 |
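The dropna parameter mentioned above controls whether all-NaN rows are dropped after stacking; keeping them makes a later unstack a lossless round trip. A minimal sketch:

df_s.stack(dropna=False).head(4)   # each ID now keeps both its F and M rows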
stack can be viewed as moving a horizontal (column) index down onto the vertical (row) axis, so it works much like melt. The level parameter selects which column level to move (or which levels, passed as a list):
df_stacked = df_s.stack(0)
df_stacked.groupby('Class').head(2)
| Gender | | | F | M |
|---|---|---|---|---|
| Class | ID | | | |
| C_1 | 1101 | Height | NaN | 173.0 |
| | | Weight | NaN | 63.0 |
| C_2 | 1201 | Height | NaN | 188.0 |
| | | Weight | NaN | 68.0 |
| C_3 | 1301 | Height | NaN | 161.0 |
| | | Weight | NaN | 68.0 |
| C_4 | 2401 | Height | 192.0 | NaN |
| | | Weight | 62.0 | NaN |
- unstack: the inverse of stack, functionally similar to pivot_table (a level sketch follows the check below)
df_stacked.head()
| Gender | | | F | M |
|---|---|---|---|---|
| Class | ID | | | |
| C_1 | 1101 | Height | NaN | 173.0 |
| | | Weight | NaN | 63.0 |
| | 1102 | Height | 192.0 | NaN |
| | | Weight | 73.0 | NaN |
| | 1103 | Height | NaN | 186.0 |
result = df_stacked.unstack().swaplevel(1,0,axis=1).sort_index(axis=1)
result.equals(df_s)
# a level argument can likewise be passed to unstack
True
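As the comment says, unstack also takes a level argument; unstacking level 0 moves Class into the columns instead of the innermost level. A minimal sketch:

df_stacked.unstack(level=0).head()   # columns become (Gender, Class) pairs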
3. Dummy variables and factorization
3.1 Dummy variables
The main tool here is the get_dummies function, whose job is one-hot encoding:
df_d = df[['Class','Gender','Weight']]
df_d.head()
| | Class | Gender | Weight |
|---|---|---|---|
| 0 | C_1 | M | 63 |
| 1 | C_1 | F | 73 |
| 2 | C_1 | M | 82 |
| 3 | C_1 | F | 81 |
| 4 | C_1 | F | 64 |
Now convert the first two columns of the table above into dummy variables and join the numeric Weight column back on:
pd.get_dummies(df_d[['Class','Gender']]).join(df_d['Weight']).head()
# the optional prefix parameter adds a prefix to the column names; prefix_sep sets the separator
| | Class_C_1 | Class_C_2 | Class_C_3 | Class_C_4 | Gender_F | Gender_M | Weight |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 1 | 63 |
| 1 | 1 | 0 | 0 | 0 | 1 | 0 | 73 |
| 2 | 1 | 0 | 0 | 0 | 0 | 1 | 82 |
| 3 | 1 | 0 | 0 | 0 | 1 | 0 | 81 |
| 4 | 1 | 0 | 0 | 0 | 1 | 0 | 64 |
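The prefix and prefix_sep options mentioned in the comment rename the generated columns; a minimal sketch with hypothetical prefixes:

# a dict maps each encoded column to its own prefix, e.g. cls=C_1, sex=F
pd.get_dummies(df_d[['Class', 'Gender']],
               prefix={'Class': 'cls', 'Gender': 'sex'}, prefix_sep='=').head()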
3.2 The factorize method
This method encodes values as consecutive integers, with missing values coded as -1. The sort parameter controls whether codes are assigned after sorting the unique values:
codes, uniques = pd.factorize(['b', None, 'a', 'c', 'b'], sort=True)
display(codes)
display(uniques)
array([ 1, -1,  0,  2,  1], dtype=int32)
array(['a', 'b', 'c'], dtype=object)
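Without sort=True, codes follow order of first appearance, and missing values are still coded as -1; a minimal sketch:

codes, uniques = pd.factorize(['b', None, 'a', 'c', 'b'])
print(codes)    # [ 0 -1  1  2  0]
print(uniques)  # ['b' 'a' 'c']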
4. Exercises
4.1
- Load the data
df = pd.read_csv('data/Drugs.csv',index_col=['State','COUNTY']).sort_index()
df.head()
| | | YYYY | SubstanceName | DrugReports |
|---|---|---|---|---|
| State | COUNTY | | | |
| KY | ADAIR | 2010 | Methadone | 1 |
| | ADAIR | 2010 | Hydrocodone | 6 |
| | ADAIR | 2011 | Oxycodone | 4 |
| | ADAIR | 2011 | Buprenorphine | 3 |
| | ADAIR | 2011 | Morphine | 2 |
result = pd.pivot_table(df,index=['State','COUNTY','SubstanceName'],columns='YYYY',
                        values='DrugReports',fill_value='-').reset_index().rename_axis(columns={'YYYY':''})
result.head()
| | State | COUNTY | SubstanceName | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KY | ADAIR | Buprenorphine | - | 3 | 5 | 4 | 27 | 5 | 7 | 10 |
| 1 | KY | ADAIR | Codeine | - | - | 1 | - | - | - | - | 1 |
| 2 | KY | ADAIR | Fentanyl | - | - | 1 | - | - | - | - | - |
| 3 | KY | ADAIR | Heroin | - | - | 1 | 2 | - | 1 | - | 2 |
| 4 | KY | ADAIR | Hydrocodone | 6 | 9 | 10 | 10 | 9 | 7 | 11 | 3 |
Now restore the result from (a) to the original table, and use the equals function to check that the original table and the new result are identical (returning True):
result_melted = result.melt(id_vars=result.columns[:3],value_vars=result.columns[-8:],
                            var_name='YYYY',value_name='DrugReports').query('DrugReports != "-"')
result2 = result_melted.sort_values(by=['State','COUNTY','YYYY','SubstanceName'])\
                       .reset_index().drop(columns='index')
# What follows is just housekeeping: swap two columns and fix the dtypes
# (the '-' fill value turned the columns into object dtype)
cols = list(result2.columns)
a, b = cols.index('SubstanceName'), cols.index('YYYY')
cols[b], cols[a] = cols[a], cols[b]
result2 = result2[cols].astype({'DrugReports':'int','YYYY':'int'})
result2.head()
| | State | COUNTY | YYYY | SubstanceName | DrugReports |
|---|---|---|---|---|---|
| 0 | KY | ADAIR | 2010 | Hydrocodone | 6 |
| 1 | KY | ADAIR | 2010 | Methadone | 1 |
| 2 | KY | ADAIR | 2011 | Buprenorphine | 3 |
| 3 | KY | ADAIR | 2011 | Hydrocodone | 9 |
| 4 | KY | ADAIR | 2011 | Morphine | 2 |
df_tidy = df.reset_index().sort_values(by=result2.columns[:4].tolist()).reset_index().drop(columns='index')
df_tidy.head()
| | State | COUNTY | YYYY | SubstanceName | DrugReports |
|---|---|---|---|---|---|
| 0 | KY | ADAIR | 2010 | Hydrocodone | 6 |
| 1 | KY | ADAIR | 2010 | Methadone | 1 |
| 2 | KY | ADAIR | 2011 | Buprenorphine | 3 |
| 3 | KY | ADAIR | 2011 | Hydrocodone | 9 |
| 4 | KY | ADAIR | 2011 | Morphine | 2 |
df_tidy.equals(result2)
True
4.2
Here is a dataset on earthquakes in a certain region (columns: 日期 = date, 时间 = time, 维度 = latitude, 经度 = longitude, 方向 = direction, 距离 = distance, 深度 = depth, 烈度 = intensity). Solve the following problems:
pd.read_csv('data/Earthquake.csv').head()
| | 日期 | 时间 | 维度 | 经度 | 方向 | 距离 | 深度 | 烈度 |
|---|---|---|---|---|---|---|---|---|
| 0 | 2003.05.20 | 12:17:44 AM | 39.04 | 40.38 | west | 0.1 | 10.0 | 0.0 |
| 1 | 2007.08.01 | 12:03:08 AM | 40.79 | 30.09 | west | 0.1 | 5.2 | 4.0 |
| 2 | 1978.05.07 | 12:41:37 AM | 38.58 | 27.61 | south_west | 0.1 | 0.0 | 0.0 |
| 3 | 1997.03.22 | 12:31:45 AM | 39.47 | 36.44 | south_west | 0.1 | 10.0 | 0.0 |
| 4 | 2000.04.02 | 12:57:38 AM | 40.80 | 30.24 | south_west | 0.1 | 7.0 | 0.0 |
- Reshape the table into the form below: spread the direction (方向) column into columns, and stack the distance (距离), depth (深度), and intensity (烈度) attributes:
df = pd.read_csv('data/Earthquake.csv')
df = df.sort_values(by=df.columns.tolist()[:3]).sort_index(axis=1).reset_index().drop(columns='index')
df.head()
| | 方向 | 日期 | 时间 | 深度 | 烈度 | 经度 | 维度 | 距离 |
|---|---|---|---|---|---|---|---|---|
| 0 | south_east | 1912.08.09 | 12:29:00 AM | 16.0 | 6.7 | 27.2 | 40.6 | 4.3 |
| 1 | south_west | 1912.08.10 | 12:23:00 AM | 15.0 | 6.0 | 27.1 | 40.6 | 2.0 |
| 2 | south_west | 1912.08.10 | 12:30:00 AM | 15.0 | 5.2 | 27.1 | 40.6 | 2.0 |
| 3 | south_east | 1912.08.11 | 12:19:04 AM | 30.0 | 4.9 | 27.2 | 40.6 | 4.3 |
| 4 | south_west | 1912.08.11 | 12:20:00 AM | 15.0 | 4.5 | 27.1 | 40.6 | 2.0 |
result = pd.pivot_table(df,index=['日期','时间','维度','经度'],columns='方向',
                        values=['烈度','深度','距离'],fill_value='-')\
           .stack(level=0).rename_axis(index={None:'地震参数'})
result.head(6)
| | | | | 方向 | east | north | north_east | north_west | south | south_east | south_west | west |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 日期 | 时间 | 维度 | 经度 | 地震参数 | | | | | | | | |
| 1912.08.09 | 12:29:00 AM | 40.6 | 27.2 | 深度 | - | - | - | - | - | 16 | - | - |
| | | | | 烈度 | - | - | - | - | - | 6.7 | - | - |
| | | | | 距离 | - | - | - | - | - | 4.3 | - | - |
| 1912.08.10 | 12:23:00 AM | 40.6 | 27.1 | 深度 | - | - | - | - | - | - | 15 | - |
| | | | | 烈度 | - | - | - | - | - | - | 6 | - |
| | | | | 距离 | - | - | - | - | - | - | 2 | - |
- Restore the result from (a) to the original table, and use equals to check that the original table and the new result are identical (returning True):
df_result = result.unstack().stack(0)[(~(result.unstack().stack(0)=='-')).any(1)].reset_index()
df_result.columns.name = None
df_result = df_result.sort_index(axis=1).astype({'深度':'float64','烈度':'float64','距离':'float64'})
df_result.head()
| | 方向 | 日期 | 时间 | 深度 | 烈度 | 经度 | 维度 | 距离 |
|---|---|---|---|---|---|---|---|---|
| 0 | south_east | 1912.08.09 | 12:29:00 AM | 16.0 | 6.7 | 27.2 | 40.6 | 4.3 |
| 1 | south_west | 1912.08.10 | 12:23:00 AM | 15.0 | 6.0 | 27.1 | 40.6 | 2.0 |
| 2 | south_west | 1912.08.10 | 12:30:00 AM | 15.0 | 5.2 | 27.1 | 40.6 | 2.0 |
| 3 | south_east | 1912.08.11 | 12:19:04 AM | 30.0 | 4.9 | 27.2 | 40.6 | 4.3 |
| 4 | south_west | 1912.08.11 | 12:20:00 AM | 15.0 | 4.5 | 27.1 | 40.6 | 2.0 |
df_result.astype({'深度':'float64','烈度':'float64','距离':'float64'},copy=False).dtypes
方向     object
日期     object
时间     object
深度    float64
烈度    float64
经度    float64
维度    float64
距离    float64
dtype: object
df.equals(df_result)
True