R语言统计系列第13篇-K-M生存曲线与logrank检验
白介素2的读书笔记,分享临床科研干货,一起见证时间的力量
【临床数据挖掘专辑点我】【SEER系列文章专辑点我】
今天是各类统计方法R语言实现的第13期,我们主要介绍R语言统计系列第13篇-K-M生存曲线与logrank检验。
生存分析是将事件的结果和出现这一结果经历的事件结合起来分析的一类统计方法,其因变量的特征是既有事件是否发生,也有事件出现的时间长短。K-M生存曲线、logrank检验以及之后会讲到的Cox回归分析均属于生存分析。
数据整理
Kaplan-Meier法又称为积乘极限法,常用于估计生存率及其标准误,计算可信区间。
此处使用的是TCGA肝癌的数据,数据经下载整理,此处仍需再次整理。
# 载入数据
lihc<-read.table("tcga_lihc.tsv",header = T,row.names = 1,quote = "",sep = '\t')
summary(lihc)
## samples sample_type.samples
## TCGA-2V-A95S-01A: 1 Primary Tumor :377
## TCGA-2Y-A9GS-01A: 1 Recurrent Tumor : 3
## TCGA-2Y-A9GT-01A: 1 Solid Tissue Normal: 89
## TCGA-2Y-A9GU-01A: 1
## TCGA-2Y-A9GV-01A: 1
## TCGA-2Y-A9GW-01A: 1
## (Other) :463
## age_at_initial_pathologic_diagnosis tumor_stage.diagnoses
## Min. :16.00 stage i :212
## 1st Qu.:52.00 stage ii :107
## Median :62.00 stage iiia : 80
## Mean :60.26 not reported: 34
## 3rd Qu.:70.00 stage iiib : 12
## Max. :90.00 stage iiic : 11
## NA's :1 (Other) : 13
## neoplasm_histologic_grade OS OS.time
## : 8 Min. :0.000 Min. : 1.0
## G1: 68 1st Qu.:0.000 1st Qu.: 358.0
## G2:227 Median :0.000 Median : 636.0
## G3:153 Mean :0.406 Mean : 876.7
## G4: 13 3rd Qu.:1.000 3rd Qu.:1214.5
## Max. :1.000 Max. :3675.0
## NA's :6 NA's :6
## adjacent_hepatic_tissue_inflammation_extent_type PIK3CA
## :161 Min. :0.2124
## Mild :124 1st Qu.:0.9635
## None :162 Median :1.2005
## Severe: 22 Mean :1.2202
## 3rd Qu.:1.4643
## Max. :2.6090
## NA's :45
## AKT1 PTEN MYC TP53
## Min. :1.461 Min. :0.5392 Min. :0.3709 Min. :0.7284
## 1st Qu.:2.708 1st Qu.:2.4475 1st Qu.:2.7515 1st Qu.:2.3810
## Median :2.963 Median :2.7105 Median :3.9235 Median :2.9070
## Mean :2.976 Mean :2.7179 Mean :3.6971 Mean :2.8948
## 3rd Qu.:3.321 3rd Qu.:3.0350 3rd Qu.:4.7097 3rd Qu.:3.4730
## Max. :4.786 Max. :4.1610 Max. :7.0430 Max. :5.2620
## NA's :45 NA's :45 NA's :45 NA's :45
可以看到该数据包含样本名称,属于原发、复发还是癌旁,诊断年龄,分期,分级,OS状态,OS时间,癌旁炎症情况,还有几个癌基因/抑癌基因的log2(FPKM+1)
此处简单将有缺失值的样本删去,将癌旁和复发样本删去,将stage分为1、2、3、4,grade转化为1,2,3,4,癌旁炎症情况转换为1,2,3。当然也可考虑使用一些缺失值填充的方法,可以保留更多的样本。
lihc<-lihc[lihc$sample_type.samples=="Primary Tumor",]
lihc<-na.omit(lihc)
lihc<-lihc[,-c(1,2)]
lihc<-lihc[! lihc$tumor_stage.diagnoses =="not reported" & ! lihc$neoplasm_histologic_grade == "" &
! lihc$adjacent_hepatic_tissue_inflammation_extent_type == ""& !lihc$tumor_stage.diagnoses == "(Other)" ,]
summary(lihc)
## age_at_initial_pathologic_diagnosis tumor_stage.diagnoses
## Min. :16.00 stage i :115
## 1st Qu.:51.00 stage ii : 58
## Median :61.00 stage iiia: 34
## Mean :59.29 stage iiib: 5
## 3rd Qu.:69.00 stage iiic: 4
## Max. :84.00 stage iii : 2
## (Other) : 3
## neoplasm_histologic_grade OS OS.time
## : 0 Min. :0.0000 Min. : 1.0
## G1: 24 1st Qu.:0.0000 1st Qu.: 409.0
## G2:115 Median :0.0000 Median : 662.0
## G3: 77 Mean :0.2805 Mean : 960.1
## G4: 5 3rd Qu.:1.0000 3rd Qu.:1386.0
## Max. :1.0000 Max. :3675.0
##
## adjacent_hepatic_tissue_inflammation_extent_type PIK3CA
## : 0 Min. :0.3078
## Mild : 93 1st Qu.:0.9649
## None :112 Median :1.2170
## Severe: 16 Mean :1.2237
## 3rd Qu.:1.5000
## Max. :2.3290
##
## AKT1 PTEN MYC TP53
## Min. :1.505 Min. :1.095 Min. :0.5372 Min. :0.7284
## 1st Qu.:2.663 1st Qu.:2.439 1st Qu.:2.6570 1st Qu.:2.4230
## Median :2.970 Median :2.704 Median :3.7650 Median :2.9220
## Mean :2.946 Mean :2.724 Mean :3.5609 Mean :2.8823
## 3rd Qu.:3.326 3rd Qu.:3.041 3rd Qu.:4.4950 3rd Qu.:3.5330
## Max. :4.786 Max. :4.161 Max. :7.0430 Max. :5.2620
##
##此处四期太少,归入三期
lihc$tumor_stage.diagnoses<- ifelse(lihc$tumor_stage.diagnoses == "stage i",1,
ifelse(lihc$tumor_stage.diagnoses == "stage ii"|lihc$tumor_stage.diagnoses == "stage iia"|lihc$tumor_stage.diagnoses == "stage iib"|
lihc$tumor_stage.diagnoses == "stage iic",2,3))
##G4太少,归入G3
lihc$neoplasm_histologic_grade<- ifelse(lihc$neoplasm_histologic_grade == "G1",1,
ifelse(lihc$neoplasm_histologic_grade == "G2",2,3))
lihc$adjacent_hepatic_tissue_inflammation_extent_type<-ifelse(lihc$adjacent_hepatic_tissue_inflammation_extent_type == "None",1,
ifelse(lihc$adjacent_hepatic_tissue_inflammation_extent_type == "Mild",2,3))
summary(lihc)
## age_at_initial_pathologic_diagnosis tumor_stage.diagnoses
## Min. :16.00 Min. :1.000
## 1st Qu.:51.00 1st Qu.:1.000
## Median :61.00 Median :1.000
## Mean :59.29 Mean :1.697
## 3rd Qu.:69.00 3rd Qu.:2.000
## Max. :84.00 Max. :3.000
## neoplasm_histologic_grade OS OS.time
## Min. :1.000 Min. :0.0000 Min. : 1.0
## 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.: 409.0
## Median :2.000 Median :0.0000 Median : 662.0
## Mean :2.262 Mean :0.2805 Mean : 960.1
## 3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.:1386.0
## Max. :3.000 Max. :1.0000 Max. :3675.0
## adjacent_hepatic_tissue_inflammation_extent_type PIK3CA
## Min. :1.000 Min. :0.3078
## 1st Qu.:1.000 1st Qu.:0.9649
## Median :1.000 Median :1.2170
## Mean :1.566 Mean :1.2237
## 3rd Qu.:2.000 3rd Qu.:1.5000
## Max. :3.000 Max. :2.3290
## AKT1 PTEN MYC TP53
## Min. :1.505 Min. :1.095 Min. :0.5372 Min. :0.7284
## 1st Qu.:2.663 1st Qu.:2.439 1st Qu.:2.6570 1st Qu.:2.4230
## Median :2.970 Median :2.704 Median :3.7650 Median :2.9220
## Mean :2.946 Mean :2.724 Mean :3.5609 Mean :2.8823
## 3rd Qu.:3.326 3rd Qu.:3.041 3rd Qu.:4.4950 3rd Qu.:3.5330
## Max. :4.786 Max. :4.161 Max. :7.0430 Max. :5.2620
str(lihc)
## 'data.frame': 221 obs. of 11 variables:
## $ age_at_initial_pathologic_diagnosis : int 84 82 81 81 80 80 80 79 79 78 ...
## $ tumor_stage.diagnoses : num 1 2 1 1 3 1 1 2 2 1 ...
## $ neoplasm_histologic_grade : num 2 2 3 2 2 2 2 1 2 2 ...
## $ OS : int 0 1 1 0 1 1 0 0 0 1 ...
## $ OS.time : int 10 848 410 1168 1210 688 673 1241 387 1694 ...
## $ adjacent_hepatic_tissue_inflammation_extent_type: num 1 1 1 1 1 2 3 1 1 1 ...
## $ PIK3CA : num 1.03 1.61 1.94 1.04 1.22 ...
## $ AKT1 : num 4.79 3.88 3.02 2.7 3.59 ...
## $ PTEN : num 2.44 2.66 3.08 3.02 2.67 ...
## $ MYC : num 3.38 1.32 5.41 3.85 3.69 ...
## $ TP53 : num 0.882 2.604 5.262 3.719 2.878 ...
整理完毕。
首先查看分期
##此处直接绘图展示K-M生存曲线与logrank检验分析结果
library("survival")
## Warning: package 'survival' was built under R version 3.6.3
library("survminer")
## Warning: package 'survminer' was built under R version 3.6.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.6.3
## Loading required package: ggpubr
## Loading required package: magrittr
fit <- survfit(Surv(OS.time, OS) ~ tumor_stage.diagnoses, data = lihc)
print(fit)
## Call: survfit(formula = Surv(OS.time, OS) ~ tumor_stage.diagnoses,
## data = lihc)
##
## n events median 0.95LCL 0.95UCL
## tumor_stage.diagnoses=1 115 23 NA 2456 NA
## tumor_stage.diagnoses=2 58 16 3258 1386 NA
## tumor_stage.diagnoses=3 48 23 1210 770 NA
res.sum <- surv_summary(fit)
## Warning in .get_data(x, data = data): The `data` argument is not provided. Data
## will be extracted from model fit.
head(res.sum)
## time n.risk n.event n.censor surv std.err upper lower
## 1 9 115 0 1 1.0000000 0.000000000 1 1.0000000
## 2 10 114 0 1 1.0000000 0.000000000 1 1.0000000
## 3 14 113 1 0 0.9911504 0.008888977 1 0.9740321
## 4 16 112 1 0 0.9823009 0.012627410 1 0.9582880
## 5 34 111 1 0 0.9734513 0.015535494 1 0.9442574
## 6 44 110 0 1 0.9734513 0.015535494 1 0.9442574
## strata tumor_stage.diagnoses
## 1 tumor_stage.diagnoses=1 1
## 2 tumor_stage.diagnoses=1 1
## 3 tumor_stage.diagnoses=1 1
## 4 tumor_stage.diagnoses=1 1
## 5 tumor_stage.diagnoses=1 1
## 6 tumor_stage.diagnoses=1 1
surv_diff <- survdiff(Surv(OS.time, OS) ~ tumor_stage.diagnoses, data = lihc)
surv_diff
## Call:
## survdiff(formula = Surv(OS.time, OS) ~ tumor_stage.diagnoses,
## data = lihc)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## tumor_stage.diagnoses=1 115 23 34.2 3.6438 8.194
## tumor_stage.diagnoses=2 58 16 14.9 0.0783 0.104
## tumor_stage.diagnoses=3 48 23 12.9 7.8537 9.980
##
## Chisq= 11.7 on 2 degrees of freedom, p= 0.003
此处可以看出三组之间生存时间不全相等,具体那两组不等,需要进行多重比较。
##此处展示两两之间多重比较结果
surv_pair_diff <- pairwise_survdiff(Surv(OS.time, OS) ~ tumor_stage.diagnoses, data = lihc, p.adjust.method = "BH")
surv_pair_diff
##
## Pairwise comparisons using Log-Rank test
##
## data: lihc and tumor_stage.diagnoses
##
## 1 2
## 2 0.1389 -
## 3 0.0018 0.1389
##
## P value adjustment method: BH
可以看出仅有1和3之间生存有统计学差异
##绘图
ggsurvplot(fit,
pval = TRUE, conf.int = TRUE,
risk.table = TRUE, # Add risk table
risk.table.col = "strata", # Change risk table color by groups
linetype = "strata", # Change line type by groups
ggtheme = theme_bw(), # Change ggplot2 theme
palette = c("#E7B800", "#2E9FDF","#00AFBB")
)
## Warning: Vectorized input to `element_text()` is not officially supported.
## Results may be unexpected or may change in future versions of ggplot2.
ggsurvplot(fit,
pval = TRUE,conf.int = TRUE,
risk.table.col = "strata", # Change risk table color by groups
ggtheme = theme_bw(), # Change ggplot2 theme
palette = c("#E7B800", "#2E9FDF","#00AFBB"),
fun = "event")
再看分级
##此处直接绘图展示K-M生存曲线与logrank检验分析结果
library("survival")
library("survminer")
fit <- survfit(Surv(OS.time, OS) ~ neoplasm_histologic_grade, data = lihc)
print(fit)
## Call: survfit(formula = Surv(OS.time, OS) ~ neoplasm_histologic_grade,
## data = lihc)
##
## n events median 0.95LCL 0.95UCL
## neoplasm_histologic_grade=1 24 3 NA 2131 NA
## neoplasm_histologic_grade=2 115 31 2456 1685 NA
## neoplasm_histologic_grade=3 82 28 NA 1372 NA
res.sum <- surv_summary(fit)
## Warning in .get_data(x, data = data): The `data` argument is not provided. Data
## will be extracted from model fit.
head(res.sum)
## time n.risk n.event n.censor surv std.err upper lower
## 1 20 24 0 1 1 0 1 1
## 2 44 23 0 1 1 0 1 1
## 3 314 22 0 1 1 0 1 1
## 4 361 21 0 1 1 0 1 1
## 5 387 20 0 1 1 0 1 1
## 6 608 19 0 1 1 0 1 1
## strata neoplasm_histologic_grade
## 1 neoplasm_histologic_grade=1 1
## 2 neoplasm_histologic_grade=1 1
## 3 neoplasm_histologic_grade=1 1
## 4 neoplasm_histologic_grade=1 1
## 5 neoplasm_histologic_grade=1 1
## 6 neoplasm_histologic_grade=1 1
surv_diff <- survdiff(Surv(OS.time, OS) ~ neoplasm_histologic_grade, data = lihc)
surv_diff
## Call:
## survdiff(formula = Surv(OS.time, OS) ~ neoplasm_histologic_grade,
## data = lihc)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## neoplasm_histologic_grade=1 24 3 8.54 3.5974 4.2020
## neoplasm_histologic_grade=2 115 31 32.11 0.0382 0.0794
## neoplasm_histologic_grade=3 82 28 21.35 2.0724 3.1834
##
## Chisq= 5.8 on 2 degrees of freedom, p= 0.06
此处可以看出三组之间生存时间比较没有统计学差异。
##此处展示两两之间多重比较结果
surv_pair_diff <- pairwise_survdiff(Surv(OS.time, OS) ~ neoplasm_histologic_grade, data = lihc, p.adjust.method = "BH")
surv_pair_diff
##
## Pairwise comparisons using Log-Rank test
##
## data: lihc and neoplasm_histologic_grade
##
## 1 2
## 2 0.092 -
## 3 0.084 0.242
##
## P value adjustment method: BH
依旧没有统计学差异
##绘图
ggsurvplot(fit,
pval = TRUE, conf.int = TRUE,
risk.table = TRUE, # Add risk table
risk.table.col = "strata", # Change risk table color by groups
linetype = "strata", # Change line type by groups
ggtheme = theme_bw(), # Change ggplot2 theme
palette = c("#E7B800", "#2E9FDF","#00AFBB")
)
## Warning: Vectorized input to `element_text()` is not officially supported.
## Results may be unexpected or may change in future versions of ggplot2.
ggsurvplot(fit,
pval = TRUE,conf.int = TRUE,
risk.table.col = "strata", # Change risk table color by groups
ggtheme = theme_bw(), # Change ggplot2 theme
palette = c("#E7B800", "#2E9FDF","#00AFBB"),
fun = "event")
最后看一下癌旁炎症情况
##此处直接绘图展示K-M生存曲线与logrank检验分析结果
library("survival")
library("survminer")
fit <- survfit(Surv(OS.time, OS) ~ adjacent_hepatic_tissue_inflammation_extent_type, data = lihc)
print(fit)
## Call: survfit(formula = Surv(OS.time, OS) ~ adjacent_hepatic_tissue_inflammation_extent_type,
## data = lihc)
##
## n events median 0.95LCL
## adjacent_hepatic_tissue_inflammation_extent_type=1 112 33 2542 1791
## adjacent_hepatic_tissue_inflammation_extent_type=2 93 25 NA 1386
## adjacent_hepatic_tissue_inflammation_extent_type=3 16 4 NA NA
## 0.95UCL
## adjacent_hepatic_tissue_inflammation_extent_type=1 NA
## adjacent_hepatic_tissue_inflammation_extent_type=2 NA
## adjacent_hepatic_tissue_inflammation_extent_type=3 NA
res.sum <- surv_summary(fit)
## Warning in .get_data(x, data = data): The `data` argument is not provided. Data
## will be extracted from model fit.
head(res.sum)
## time n.risk n.event n.censor surv std.err upper lower
## 1 6 112 0 1 1.0000000 0.000000000 1 1.0000000
## 2 9 111 0 1 1.0000000 0.000000000 1 1.0000000
## 3 10 110 0 1 1.0000000 0.000000000 1 1.0000000
## 4 11 109 1 0 0.9908257 0.009216688 1 0.9730877
## 5 20 108 0 1 0.9908257 0.009216688 1 0.9730877
## 6 34 107 1 0 0.9815656 0.013157325 1 0.9565767
## strata
## 1 adjacent_hepatic_tissue_inflammation_extent_type=1
## 2 adjacent_hepatic_tissue_inflammation_extent_type=1
## 3 adjacent_hepatic_tissue_inflammation_extent_type=1
## 4 adjacent_hepatic_tissue_inflammation_extent_type=1
## 5 adjacent_hepatic_tissue_inflammation_extent_type=1
## 6 adjacent_hepatic_tissue_inflammation_extent_type=1
## adjacent_hepatic_tissue_inflammation_extent_type
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
surv_diff <- survdiff(Surv(OS.time, OS) ~ adjacent_hepatic_tissue_inflammation_extent_type, data = lihc)
surv_diff
## Call:
## survdiff(formula = Surv(OS.time, OS) ~ adjacent_hepatic_tissue_inflammation_extent_type,
## data = lihc)
##
## N Observed Expected
## adjacent_hepatic_tissue_inflammation_extent_type=1 112 33 35.84
## adjacent_hepatic_tissue_inflammation_extent_type=2 93 25 21.81
## adjacent_hepatic_tissue_inflammation_extent_type=3 16 4 4.35
## (O-E)^2/E (O-E)^2/V
## adjacent_hepatic_tissue_inflammation_extent_type=1 0.2255 0.5561
## adjacent_hepatic_tissue_inflammation_extent_type=2 0.4661 0.7521
## adjacent_hepatic_tissue_inflammation_extent_type=3 0.0275 0.0298
##
## Chisq= 0.8 on 2 degrees of freedom, p= 0.7
此处可以看出三组之间生存时间比较没有统计学差异。
##此处展示两两之间多重比较结果
surv_pair_diff <- pairwise_survdiff(Surv(OS.time, OS) ~ adjacent_hepatic_tissue_inflammation_extent_type, data = lihc, p.adjust.method = "BH")
surv_pair_diff
##
## Pairwise comparisons using Log-Rank test
##
## data: lihc and adjacent_hepatic_tissue_inflammation_extent_type
##
## 1 2
## 2 0.97 -
## 3 0.97 0.97
##
## P value adjustment method: BH
依旧没有统计学差异
##绘图
ggsurvplot(fit,
pval = TRUE, conf.int = TRUE,
risk.table = TRUE, # Add risk table
risk.table.col = "strata", # Change risk table color by groups
linetype = "strata", # Change line type by groups
ggtheme = theme_bw(), # Change ggplot2 theme
palette = c("#E7B800", "#2E9FDF","#00AFBB")
)
## Warning: Vectorized input to `element_text()` is not officially supported.
## Results may be unexpected or may change in future versions of ggplot2.
ggsurvplot(fit,
pval = TRUE,conf.int = TRUE,
risk.table.col = "strata", # Change risk table color by groups
ggtheme = theme_bw(), # Change ggplot2 theme
palette = c("#E7B800", "#2E9FDF","#00AFBB"),
fun = "event")
好了,今天的R语言实现统计方法系列推文暂时告一段落,我们下次再见吧!小伙伴们如果有什么统计上的问题,或者如果想要学习什么方面的生物信息内容,可以在微信群或者知识星球提问,没准哪天的推文就是专门解答你的问题哦!