tidyr总结篇-续
欢迎来到医科研,这里是白介素2的读书笔记,跟我一起聊临床与科研的故事, 生物医学数据挖掘,R语言,TCGA、GEO数据挖掘。
tidyr总结篇
gather(data,key="“,value=”") ## key是变量,value是值 gather的意义是重新塑造数据的变量,原有数据的变量并不是真正的变量,这时候变量不是变量,变量还是变量。
举例说明: 神奇的gather
参数1:data 参数2:key变量名,参数3:value变量名 参数4:gather的变量指定 其中-表示除外某向量,全部gather
Sys.setlocale('LC_ALL','C')
## [1] "C"
library(tidyverse)
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
## Registered S3 method overwritten by 'rvest':
## method from
## read_xml.response xml2
## -- Attaching packages -------------------------------------------- tidyverse 1.2.1 --
## <U+221A> ggplot2 3.1.0 <U+221A> purrr 0.3.0
## <U+221A> tibble 2.0.1 <U+221A> dplyr 0.8.0.1
## <U+221A> tidyr 0.8.2 <U+221A> stringr 1.4.0
## <U+221A> readr 1.3.1 <U+221A> forcats 0.4.0
## -- Conflicts ----------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
stocks <- tibble(
time = as.Date('2009-01-01') + 0:9,
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2),
Z = rnorm(10, 0, 4)
)
stocks
## # A tibble: 10 x 4
## time X Y Z
## <date> <dbl> <dbl> <dbl>
## 1 2009-01-01 -0.497 -1.20 5.93
## 2 2009-01-02 1.22 1.58 -4.43
## 3 2009-01-03 1.68 -2.50 8.03
## 4 2009-01-04 1.58 0.744 -2.00
## 5 2009-01-05 0.775 1.87 -3.14
## 6 2009-01-06 0.0405 0.629 4.31
## 7 2009-01-07 -1.42 -1.36 9.63
## 8 2009-01-08 1.18 5.21 -0.231
## 9 2009-01-09 -0.581 -1.02 -0.680
## 10 2009-01-10 0.768 0.900 6.43
gather起stocks中的,X,Y,Z. 新命名一个key,命名一个value, 除去time不变化
gather(stocks, stock, price, -time)
## # A tibble: 30 x 3
## time stock price
## <date> <chr> <dbl>
## 1 2009-01-01 X -0.497
## 2 2009-01-02 X 1.22
## 3 2009-01-03 X 1.68
## 4 2009-01-04 X 1.58
## 5 2009-01-05 X 0.775
## 6 2009-01-06 X 0.0405
## 7 2009-01-07 X -1.42
## 8 2009-01-08 X 1.18
## 9 2009-01-09 X -0.581
## 10 2009-01-10 X 0.768
## # ... with 20 more rows
stocks %>% gather(stock, price, -time)##保留time不变化
## # A tibble: 30 x 3
## time stock price
## <date> <chr> <dbl>
## 1 2009-01-01 X -0.497
## 2 2009-01-02 X 1.22
## 3 2009-01-03 X 1.68
## 4 2009-01-04 X 1.58
## 5 2009-01-05 X 0.775
## 6 2009-01-06 X 0.0405
## 7 2009-01-07 X -1.42
## 8 2009-01-08 X 1.18
## 9 2009-01-09 X -0.581
## 10 2009-01-10 X 0.768
## # ... with 20 more rows
##
mini_iris <- iris[c(1, 51, 101), ]
mini_iris
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
gather(mini_iris,key = "flower_att",value = "value",Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
## Species flower_att value
## 1 setosa Sepal.Length 5.1
## 2 versicolor Sepal.Length 7.0
## 3 virginica Sepal.Length 6.3
## 4 setosa Sepal.Width 3.5
## 5 versicolor Sepal.Width 3.2
## 6 virginica Sepal.Width 3.3
## 7 setosa Petal.Length 1.4
## 8 versicolor Petal.Length 4.7
## 9 virginica Petal.Length 6.0
## 10 setosa Petal.Width 0.2
## 11 versicolor Petal.Width 1.4
## 12 virginica Petal.Width 2.5
gather(mini_iris,key = "flower_att",value = "value",Sepal.Length:Petal.Width)
## Species flower_att value
## 1 setosa Sepal.Length 5.1
## 2 versicolor Sepal.Length 7.0
## 3 virginica Sepal.Length 6.3
## 4 setosa Sepal.Width 3.5
## 5 versicolor Sepal.Width 3.2
## 6 virginica Sepal.Width 3.3
## 7 setosa Petal.Length 1.4
## 8 versicolor Petal.Length 4.7
## 9 virginica Petal.Length 6.0
## 10 setosa Petal.Width 0.2
## 11 versicolor Petal.Width 1.4
## 12 virginica Petal.Width 2.5
-表示不gather的变量
gather(mini_iris,key = "flow_att",value = "value",-Species)
## Species flow_att value
## 1 setosa Sepal.Length 5.1
## 2 versicolor Sepal.Length 7.0
## 3 virginica Sepal.Length 6.3
## 4 setosa Sepal.Width 3.5
## 5 versicolor Sepal.Width 3.2
## 6 virginica Sepal.Width 3.3
## 7 setosa Petal.Length 1.4
## 8 versicolor Petal.Length 4.7
## 9 virginica Petal.Length 6.0
## 10 setosa Petal.Width 0.2
## 11 versicolor Petal.Width 1.4
## 12 virginica Petal.Width 2.5
省略掉key, value
gather(mini_iris,flow_att,value,-Species)##得到的结果相同
## Species flow_att value
## 1 setosa Sepal.Length 5.1
## 2 versicolor Sepal.Length 7.0
## 3 virginica Sepal.Length 6.3
## 4 setosa Sepal.Width 3.5
## 5 versicolor Sepal.Width 3.2
## 6 virginica Sepal.Width 3.3
## 7 setosa Petal.Length 1.4
## 8 versicolor Petal.Length 4.7
## 9 virginica Petal.Length 6.0
## 10 setosa Petal.Width 0.2
## 11 versicolor Petal.Width 1.4
## 12 virginica Petal.Width 2.5
在管道中演示一套
注意group_by与slice联用时,slice切割的是总分组的数目 如果group分了3组,那slice切割1的话 就是显示1*3,如果切割1:2的话,那就是2*3,显示6个观测
下面举例说明
展示分组中的序列1,包含3个species
注意slice与group_by的联用
library(dplyr)
mini_iris <-
iris %>%
group_by(Species) %>%
slice(1)
mini_iris %>% gather(key = flower_att, value = measurement, -Species)
## # A tibble: 12 x 3
## # Groups: Species [3]
## Species flower_att measurement
## <fct> <chr> <dbl>
## 1 setosa Sepal.Length 5.1
## 2 versicolor Sepal.Length 7
## 3 virginica Sepal.Length 6.3
## 4 setosa Sepal.Width 3.5
## 5 versicolor Sepal.Width 3.2
## 6 virginica Sepal.Width 3.3
## 7 setosa Petal.Length 1.4
## 8 versicolor Petal.Length 4.7
## 9 virginica Petal.Length 6
## 10 setosa Petal.Width 0.2
## 11 versicolor Petal.Width 1.4
## 12 virginica Petal.Width 2.5
再来举个例子 - mtcars数据集中的cyl分组为4-6-8 - 切割slice 1:2,即显示2组,4-6-8
by_cyl <- group_by(mtcars, cyl)
##
slice(by_cyl, 1:2)
## # A tibble: 6 x 11
## # Groups: cyl [3]
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 2 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 3 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 4 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
spread函数
这是一个与gather互为逆向操作的函数 函数参数有:参数1:data, 参数2:key,参数3:value,达到按key与value展开的效果
library(dplyr)
stocks <- data.frame(
time = as.Date('2009-01-01') + 0:9,
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2),
Z = rnorm(10, 0, 4)
)
stocks
## time X Y Z
## 1 2009-01-01 -0.9981963 0.6149012 -1.880305
## 2 2009-01-02 0.9763906 1.2292060 -2.244749
## 3 2009-01-03 1.3475060 -1.6466510 4.497477
## 4 2009-01-04 0.6845907 -2.8694272 -10.145486
## 5 2009-01-05 -0.3132428 0.2366398 -1.401196
## 6 2009-01-06 1.0542915 -0.5094071 1.311380
## 7 2009-01-07 -2.5360015 1.5011045 1.188158
## 8 2009-01-08 -0.2878114 1.6744369 -3.015077
## 9 2009-01-09 -0.3004896 -2.8344579 4.376036
## 10 2009-01-10 -0.1714464 0.8319891 -2.288022
先key value, gather一下
stocksm <- stocks %>% gather(stock, price, -time)
stocksm
## time stock price
## 1 2009-01-01 X -0.9981963
## 2 2009-01-02 X 0.9763906
## 3 2009-01-03 X 1.3475060
## 4 2009-01-04 X 0.6845907
## 5 2009-01-05 X -0.3132428
## 6 2009-01-06 X 1.0542915
## 7 2009-01-07 X -2.5360015
## 8 2009-01-08 X -0.2878114
## 9 2009-01-09 X -0.3004896
## 10 2009-01-10 X -0.1714464
## 11 2009-01-01 Y 0.6149012
## 12 2009-01-02 Y 1.2292060
## 13 2009-01-03 Y -1.6466510
## 14 2009-01-04 Y -2.8694272
## 15 2009-01-05 Y 0.2366398
## 16 2009-01-06 Y -0.5094071
## 17 2009-01-07 Y 1.5011045
## 18 2009-01-08 Y 1.6744369
## 19 2009-01-09 Y -2.8344579
## 20 2009-01-10 Y 0.8319891
## 21 2009-01-01 Z -1.8803050
## 22 2009-01-02 Z -2.2447486
## 23 2009-01-03 Z 4.4974771
## 24 2009-01-04 Z -10.1454861
## 25 2009-01-05 Z -1.4011960
## 26 2009-01-06 Z 1.3113796
## 27 2009-01-07 Z 1.1881581
## 28 2009-01-08 Z -3.0150769
## 29 2009-01-09 Z 4.3760358
## 30 2009-01-10 Z -2.2880217
spread展开数据
按stock, price展开
stocksm %>% spread(stock, price)
## time X Y Z
## 1 2009-01-01 -0.9981963 0.6149012 -1.880305
## 2 2009-01-02 0.9763906 1.2292060 -2.244749
## 3 2009-01-03 1.3475060 -1.6466510 4.497477
## 4 2009-01-04 0.6845907 -2.8694272 -10.145486
## 5 2009-01-05 -0.3132428 0.2366398 -1.401196
## 6 2009-01-06 1.0542915 -0.5094071 1.311380
## 7 2009-01-07 -2.5360015 1.5011045 1.188158
## 8 2009-01-08 -0.2878114 1.6744369 -3.015077
## 9 2009-01-09 -0.3004896 -2.8344579 4.376036
## 10 2009-01-10 -0.1714464 0.8319891 -2.288022
按time, price展开
stocksm %>% spread(time, price)
## stock 2009-01-01 2009-01-02 2009-01-03 2009-01-04 2009-01-05 2009-01-06
## 1 X -0.9981963 0.9763906 1.347506 0.6845907 -0.3132428 1.0542915
## 2 Y 0.6149012 1.2292060 -1.646651 -2.8694272 0.2366398 -0.5094071
## 3 Z -1.8803050 -2.2447486 4.497477 -10.1454861 -1.4011960 1.3113796
## 2009-01-07 2009-01-08 2009-01-09 2009-01-10
## 1 -2.536001 -0.2878114 -0.3004896 -0.1714464
## 2 1.501104 1.6744369 -2.8344579 0.8319891
## 3 1.188158 -3.0150769 4.3760358 -2.2880217
说明一下gather-spread的互补性质
stocks
## time X Y Z
## 1 2009-01-01 -0.9981963 0.6149012 -1.880305
## 2 2009-01-02 0.9763906 1.2292060 -2.244749
## 3 2009-01-03 1.3475060 -1.6466510 4.497477
## 4 2009-01-04 0.6845907 -2.8694272 -10.145486
## 5 2009-01-05 -0.3132428 0.2366398 -1.401196
## 6 2009-01-06 1.0542915 -0.5094071 1.311380
## 7 2009-01-07 -2.5360015 1.5011045 1.188158
## 8 2009-01-08 -0.2878114 1.6744369 -3.015077
## 9 2009-01-09 -0.3004896 -2.8344579 4.376036
## 10 2009-01-10 -0.1714464 0.8319891 -2.288022
stocks %>%
gather(key=stock,value = price,-time) %>% ##先聚合
spread(key = stock,value = price) %>% ## 又展开还原
identical(stocks) ## 判断与原来的stocks是否完全一样
## [1] TRUE
总结一下 gather与spread,可以自如的将数据变换为宽数据或窄数据 gather的数据格式非常适用用于ggplot2的导入,用于可视化 说到这里了我们就绘制一下吧,当然关于可视化的内容暂时不展开讲。
牛刀小试
library(ggplot2)
p<-stocks %>%
gather(key = stock,value = price,-time) %>%
as_tibble() %>% ##直接导入到ggplot2进行可视化
ggplot2::ggplot(aes(x=stock,y=price,fill=stock))+
geom_boxplot()
p
改改颜色
p+scale_fill_brewer(palette="Dark2")
放上自己喜欢的颜色
p+scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9"))
tidyr::unite函数
能够方便的实现将多列粘贴到一起的功能 参数1:data数据框,参数2:新列名,参数3:sep分隔符,参数4:remove=T移除原列 下面举例说明,这个功能好用,但用起来比较简单
粘贴vs与am列
library(dplyr)
unite_(mtcars, "vs_am", c("vs","am"))
## mpg cyl disp hp drat wt qsec vs_am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0_1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0_1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1_1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1_0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0_0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1_0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0_0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1_0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1_0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1_0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1_0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0_0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0_0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0_0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0_0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0_0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0_0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1_1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1_1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1_1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1_0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0_0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0_0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0_0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0_0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1_1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0_1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1_1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0_1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0_1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0_1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1_1 4 2
粘贴再分割是可逆的操作
mtcars %>%
unite(vs_am, vs, am) %>%
separate(vs_am, c("vs", "am"))
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
separate函数-逆向的unite操作
参数1:data,参数2: 要拆分的列,参数3:拆分成的新变量,参数4:Sep分割模式 这个函数能做到将1列拆解为多列,用法与unite非常相似 这里不重复过多,举几个简单示例说明即可
library(dplyr)
df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
df
## x
## 1 <NA>
## 2 a.b
## 3 a.d
## 4 b.c
分割x为A-B
df %>% separate(x, c("A", "B"))
## A B
## 1 <NA> <NA>
## 2 a b
## 3 a d
## 4 b c
如果你想只保留第二个变量
df %>% separate(x, c(NA, "B"))
## B
## 1 <NA>
## 2 b
## 3 d
## 4 c
一个比较难办的问题是,如果需要裂解的列,裂解出来并不是相同的长度怎么办?separate函数提供了几个参数,extra与fill参数来控制裂解的方式 extra用于控制裂解碎片过多,warn:警告信息但扔掉多余,drop:扔掉但并不警告,merge:不扔掉,多余的merge起来 fill,warn警告但从左侧开始填充,right,右侧填充NA,left左侧填充NA
df <- data.frame(x = c("a", "a b", "a b c", NA))
df
## x
## 1 a
## 2 a b
## 3 a b c
## 4 <NA>
这样的方式会有warning
df %>% separate(x, c("a", "b"))
## Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [3].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [1].
## a b
## 1 a <NA>
## 2 a b
## 3 a b
## 4 <NA> <NA>
扔掉多余信息,右侧填充NA
df %>% separate(x, c("a", "b"), extra = "drop", fill = "right")
## a b
## 1 a <NA>
## 2 a b
## 3 a b
## 4 <NA> <NA>
merge多余的,并从左侧开始填充
df
## x
## 1 a
## 2 a b
## 3 a b c
## 4 <NA>
df %>% separate(x, c("a", "b"), extra = "merge", fill = "left")
## a b
## 1 <NA> a
## 2 a b
## 3 a b c
## 4 <NA> <NA>
同上
df <- data.frame(x = c("x: 123", "y: error: 7"))
df
## x
## 1 x: 123
## 2 y: error: 7
df %>% separate(x, c("key", "value"), ": ", extra = "merge")
## key value
## 1 x 123
## 2 y error: 7