带临床信息的肿瘤突变maf文件分析维度更多
每个癌症都去找各自的肿瘤突变maf文件很麻烦,所以我们才会选择 PanCanAtlas Publications Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines ,详见:https://gdc.cancer.gov/about-data/publications/mc3-2017 :它提供如下所示的文件:
MC3 Public MAF - mc3.v0.2.8.PUBLIC.maf.gz MC3 Controlled MAF - mc3.v0.2.8.CONTROLLED.maf.gz mc3.v0.2.8.BROAD_VALIDATIONv2.maf.gz mc3.v0.2.9.CONTROLLED_lt3_b.maf.gz
目前只有第一个文件 mc3.v0.2.8.PUBLIC.maf.gz 是公开可以下载的,718M的文件, MC3 这个项目,使用了以下的标准 过滤germline variants:
low normal depth coverage non-exonic sites sites outside of capture kit sites marked by the Broad Panel of Normals samples marked as being contaminated by ContEst variants that were only called by a single caller
其中controlled-access MAF 文件包含 22,485,627 variants (来自10,510个肿瘤样本,共有13,044,511 single-nucleotide variant (SNV) 和 9,441,116 indels), 但是 open-access MAF文件包含3,600,963 variants (10,295样本,3,427,680 SNV和173,283 indels),所以粗略估算一下,open-access 的MC3 call set 仅占原始MAF中variant call 的44%。
我看到了GitHub有一个project把这个数据文件制作成为了R包,https://github.com/PoisonAlien/TCGAmutations ,不知道是否有授权,是否合规。我们无需使用这样的r包了,自己下载文件,并且 读入这个718M的文件 mc3.v0.2.8.PUBLIC.maf.gz 即可!代码如下:
rm(list = ls())
require(maftools)
laml=read.maf('mc3.v0.2.8.PUBLIC.maf.gz')
laml
save(laml,file = 'mc3.Rdata')
rm(list = ls())
require(maftools)
load(file = 'mc3.Rdata')
加上临床信息
同样的是在 https://gdc.cancer.gov/about-data/publications/pancanatlas 页面下载 :
TCGA-Clinical Data Resource (CDR) Outcome - TCGA-CDR-SupplementalTableS1.xlsx
简单检查临床信息
文件里面主要关心的临床指标有:
"age_at_initial_pathologic_diagnosis" "gender" "race" "ajcc_pathologic_tumor_stage" "clinical_stage" "histological_type" "histological_grade" "initial_pathologic_dx_year" "menopause_status" "birth_days_to"
自行下载Excel后打开 TCGA-CDR-SupplementalTableS1.xlsx查看,如下所示:
首先看看性别情况:
library(readxl)
phe=read_excel('TCGA-CDR-SupplementalTableS1.xlsx',sheet = 1)
library(gplots)
balloonplot(table(phe$gender,phe$type))
可以看到蛮多癌症具有性别特异性,比如妇科癌症的卵巢癌子宫癌宫颈癌等等:
也可以看到 clinical_stage 信息比较乱;
相比起来 ,另外一个临床指标 ajcc_pathologic_tumor_stage 就好看很多 :
简单检查结局事件生存情况
并且病人的结局事件也是 多种多样:
OS: overall survival event, 1 for death from any cause, 0 for alive.
DSS: disease-specific survival event
DFI: disease-free interval event
PFI: progression-free interval event
自行下载Excel后打开 TCGA-CDR-SupplementalTableS1.xlsx查看,如下所示:
也可以简单的统计:
balloonplot(table(phe$vital_status,phe$type))
这个时候,替换为比例好一点。
首先病人按照临床指标分组后看突变差异
这里我们仅仅是按照stage举例,首先把stage给格式化:
table(phe$ajcc_pathologic_tumor_stage)
phe$new_stage = ifelse(phe$ajcc_pathologic_tumor_stage %in% c( 'Stage I','Stage IA','Stage IB'),'s1',
ifelse(phe$ajcc_pathologic_tumor_stage %in% c('Stage II' ,'Stage IIA','Stage IIB','Stage IIC'),'s2',
ifelse(phe$ajcc_pathologic_tumor_stage %in% c('Stage III','Stage IIIA','Stage IIIB','Stage IIIC'),'s3',
ifelse(phe$ajcc_pathologic_tumor_stage %in% c( 'Stage IV','Stage IVA' ,'Stage IVB','Stage IVC'),'s4','other'
) ) ) )
table(phe$new_stage )
balloonplot(table(phe$new_stage,phe$type))
现在看起来清晰很多,我们就选择BRCA吧:
每次加载前面的 mc3.Rdata 文件,耗费好几分钟,所以还是按照癌症拆分开来!
require(maftools)
load(file = 'mc3.Rdata')
library(readxl)
phe=read_excel('TCGA-CDR-SupplementalTableS1.xlsx',sheet = 1)[,2:3]
phe=as.data.frame(phe)
table(phe$type)
cp_list=split(phe$bcr_patient_barcode,phe$type)
cg_tmp=laml@data
lapply(1:length(cp_list), function(i){
cp=cp_list[[i]]
pro=names(cp_list)[i]
cg_cp_tmp=cg_tmp[substring(cg_tmp$Tumor_Sample_Barcode,1,12) %in% cp,]
laml = read.maf(maf = cg_cp_tmp)
save(laml,file= paste0('maftools-',pro,'.Rdata') )
})
得到每个癌症各自的rdata文件大小如下所示:
14M Sep 16 11:39 maftools-ACC.Rdata
167M Sep 16 11:39 maftools-BLCA.Rdata
149M Sep 16 11:40 maftools-BRCA.Rdata
98M Sep 16 11:40 maftools-CESC.Rdata
3.4M Sep 16 11:40 maftools-CHOL.Rdata
290M Sep 16 11:40 maftools-COAD.Rdata
8.2M Sep 16 11:40 maftools-DLBC.Rdata
44M Sep 16 11:40 maftools-ESCA.Rdata
81M Sep 16 11:41 maftools-GBM.Rdata
128M Sep 16 11:41 maftools-HNSC.Rdata
3.8M Sep 16 11:41 maftools-KICH.Rdata
36M Sep 16 11:41 maftools-KIRC.Rdata
37M Sep 16 11:41 maftools-KIRP.Rdata
10M Sep 16 11:41 maftools-LAML.Rdata
51M Sep 16 11:41 maftools-LGG.Rdata
63M Sep 16 11:41 maftools-LIHC.Rdata
235M Sep 16 11:42 maftools-LUAD.Rdata
219M Sep 16 11:42 maftools-LUSC.Rdata
4.4M Sep 16 11:42 maftools-MESO.Rdata
64M Sep 16 11:42 maftools-OV.Rdata
36M Sep 16 11:42 maftools-PAAD.Rdata
3.3M Sep 16 11:42 maftools-PCPG.Rdata
38M Sep 16 11:43 maftools-PRAD.Rdata
77M Sep 16 11:43 maftools-READ.Rdata
30M Sep 16 11:43 maftools-SARC.Rdata
542M Sep 16 11:43 maftools-SKCM.Rdata
255M Sep 16 11:44 maftools-STAD.Rdata
3.7M Sep 16 11:44 maftools-TGCT.Rdata
14M Sep 16 11:44 maftools-THCA.Rdata
5.0M Sep 16 11:44 maftools-THYM.Rdata
907M Sep 16 11:45 maftools-UCEC.Rdata
13M Sep 16 11:45 maftools-UCS.Rdata
2.4M Sep 16 11:45 maftools-UVM.Rdata
5.6G Sep 5 00:30 mc3.Rdata
需要认真读文档:https://www.bioconductor.org/packages/devel/bioc/vignettes/maftools/inst/doc/maftools.html
首先查看BRCA整个癌症的突变特征:
load(file = 'Rdata/maftools-BRCA.Rdata')
oncoplot(maf = laml, top = 10)
然后加上前面制作好的stage信息:
laml@clinical.data
pos=match(substring(as.character(laml@clinical.data$Tumor_Sample_Barcode),1,12),phe$bcr_patient_barcode)
laml@clinical.data$stage=phe$new_stage[pos]
table(laml@clinical.data$stage)
other s1 s2 s3 s4
24 173 590 220 19
oncoplot(maf = laml,
clinicalFeatures=c('stage'),sortByAnnotation=T,
top = 10)
这样虽然可以肉眼查看不同stage的BRCA病人的突变全景图,但是很难区分各个基因是否有比例差异。首先呢,我们可以借助mafCompare: compare two cohorts (MAF). ,可以举例说明s1和s3两个分组的差异。
df=as.data.frame(laml@data)
cg=as.character(laml@clinical.data$Tumor_Sample_Barcode)[laml@clinical.data$stage=='s1']
cg
s1.maf <- read.maf(df[df$Tumor_Sample_Barcode %in% cg, ])
cg=as.character(laml@clinical.data$Tumor_Sample_Barcode)[laml@clinical.data$stage=='s3']
cg
s3.maf <- read.maf(df[df$Tumor_Sample_Barcode %in% cg, ])
pt.vs.rt <- mafCompare(m1 = s1.maf, m2 = s3.maf, m1Name = 'stage1',
m2Name = 'stage3', minMut = 5)
forestPlot(mafCompareRes = pt.vs.rt, pVal = 0.1)
代码难度有点多,需要R语言功底扎实。两个分组,差异还是有的,这个结果就是大家的生物学知识发挥的用武之地咯!
下期预告:有生存统计信息就可以做生存分析,根据突变位点, 可以在各个癌症内部批量进行生存分析检验,挖掘有意义的基因或者基因列表。