带临床信息的肿瘤突变maf文件分析维度更多 / 四六文摘

每个癌症都去找各自的肿瘤突变maf文件很麻烦，所以我们才会选择 PanCanAtlas Publications Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines ，详见：https://gdc.cancer.gov/about-data/publications/mc3-2017 ：它提供如下所示的文件：

MC3 Public MAF - mc3.v0.2.8.PUBLIC.maf.gz
MC3 Controlled MAF - mc3.v0.2.8.CONTROLLED.maf.gz
mc3.v0.2.8.BROAD_VALIDATIONv2.maf.gz
mc3.v0.2.9.CONTROLLED_lt3_b.maf.gz

目前只有第一个文件 mc3.v0.2.8.PUBLIC.maf.gz 是公开可以下载的，718M的文件， MC3 这个项目，使用了以下的标准过滤germline variants:

low normal depth coverage
non-exonic sites
sites outside of capture kit
sites marked by the Broad Panel of Normals
samples marked as being contaminated by ContEst
variants that were only called by a single caller

其中controlled-access MAF 文件包含 22,485,627 variants （来自10,510个肿瘤样本，共有13,044,511 single-nucleotide variant (SNV) 和 9,441,116 indels），但是 open-access MAF文件包含3,600,963 variants （10,295样本，3,427,680 SNV和173,283 indels），所以粗略估算一下，open-access 的MC3 call set 仅占原始MAF中variant call 的44%。

我看到了GitHub有一个project把这个数据文件制作成为了R包，https://github.com/PoisonAlien/TCGAmutations ，不知道是否有授权，是否合规。我们无需使用这样的r包了，自己下载文件，并且读入这个718M的文件 mc3.v0.2.8.PUBLIC.maf.gz 即可！代码如下：

rm(list = ls()) require(maftools) laml=read.maf('mc3.v0.2.8.PUBLIC.maf.gz') laml save(laml,file = 'mc3.Rdata')

rm(list = ls()) require(maftools) load(file = 'mc3.Rdata')

加上临床信息

同样的是在 https://gdc.cancer.gov/about-data/publications/pancanatlas 页面下载：

TCGA-Clinical Data Resource (CDR) Outcome - TCGA-CDR-SupplementalTableS1.xlsx

简单检查临床信息

文件里面主要关心的临床指标有：

"age_at_initial_pathologic_diagnosis"
"gender"
"race"
"ajcc_pathologic_tumor_stage"
"clinical_stage"
"histological_type"
"histological_grade"
"initial_pathologic_dx_year"
"menopause_status"
"birth_days_to"

自行下载Excel后打开 TCGA-CDR-SupplementalTableS1.xlsx查看，如下所示：

主要关心的临床指标

首先看看性别情况：

library(readxl) phe=read_excel('TCGA-CDR-SupplementalTableS1.xlsx',sheet = 1) library(gplots) balloonplot(table(phe$gender,phe$type))

可以看到蛮多癌症具有性别特异性，比如妇科癌症的卵巢癌子宫癌宫颈癌等等：

癌症具有性别特异性

也可以看到 clinical_stage 信息比较乱；

clinical_stage

相比起来，另外一个临床指标 ajcc_pathologic_tumor_stage 就好看很多：

ajcc_pathologic_tumor_stage

简单检查结局事件生存情况

并且病人的结局事件也是多种多样：

OS: overall survival event, 1 for death from any cause, 0 for alive.
DSS: disease-specific survival event
DFI: disease-free interval event
PFI: progression-free interval event

自行下载Excel后打开 TCGA-CDR-SupplementalTableS1.xlsx查看，如下所示：

结局事件及其对应的统计时间

也可以简单的统计：

balloonplot(table(phe$vital_status,phe$type))

各个癌症肿瘤死亡率不一样

这个时候，替换为比例好一点。

首先病人按照临床指标分组后看突变差异

这里我们仅仅是按照stage举例，首先把stage给格式化：

table(phe$ajcc_pathologic_tumor_stage) phe$new_stage = ifelse(phe$ajcc_pathologic_tumor_stage %in% c( 'Stage I','Stage IA','Stage IB'),'s1', ifelse(phe$ajcc_pathologic_tumor_stage %in% c('Stage II' ,'Stage IIA','Stage IIB','Stage IIC'),'s2', ifelse(phe$ajcc_pathologic_tumor_stage %in% c('Stage III','Stage IIIA','Stage IIIB','Stage IIIC'),'s3', ifelse(phe$ajcc_pathologic_tumor_stage %in% c( 'Stage IV','Stage IVA' ,'Stage IVB','Stage IVC'),'s4','other' ) ) ) ) table(phe$new_stage ) balloonplot(table(phe$new_stage,phe$type))

现在看起来清晰很多，我们就选择BRCA吧：

每次加载前面的 mc3.Rdata 文件，耗费好几分钟，所以还是按照癌症拆分开来！

require(maftools) load(file = 'mc3.Rdata') library(readxl) phe=read_excel('TCGA-CDR-SupplementalTableS1.xlsx',sheet = 1)[,2:3] phe=as.data.frame(phe) table(phe$type) cp_list=split(phe$bcr_patient_barcode,phe$type) cg_tmp=laml@data lapply(1:length(cp_list), function(i){ cp=cp_list[[i]] pro=names(cp_list)[i] cg_cp_tmp=cg_tmp[substring(cg_tmp$Tumor_Sample_Barcode,1,12) %in% cp,] laml = read.maf(maf = cg_cp_tmp) save(laml,file= paste0('maftools-',pro,'.Rdata') ) })

得到每个癌症各自的rdata文件大小如下所示：

14M Sep 16 11:39 maftools-ACC.Rdata 167M Sep 16 11:39 maftools-BLCA.Rdata 149M Sep 16 11:40 maftools-BRCA.Rdata 98M Sep 16 11:40 maftools-CESC.Rdata 3.4M Sep 16 11:40 maftools-CHOL.Rdata 290M Sep 16 11:40 maftools-COAD.Rdata 8.2M Sep 16 11:40 maftools-DLBC.Rdata 44M Sep 16 11:40 maftools-ESCA.Rdata 81M Sep 16 11:41 maftools-GBM.Rdata 128M Sep 16 11:41 maftools-HNSC.Rdata 3.8M Sep 16 11:41 maftools-KICH.Rdata 36M Sep 16 11:41 maftools-KIRC.Rdata 37M Sep 16 11:41 maftools-KIRP.Rdata 10M Sep 16 11:41 maftools-LAML.Rdata 51M Sep 16 11:41 maftools-LGG.Rdata 63M Sep 16 11:41 maftools-LIHC.Rdata 235M Sep 16 11:42 maftools-LUAD.Rdata 219M Sep 16 11:42 maftools-LUSC.Rdata 4.4M Sep 16 11:42 maftools-MESO.Rdata 64M Sep 16 11:42 maftools-OV.Rdata 36M Sep 16 11:42 maftools-PAAD.Rdata 3.3M Sep 16 11:42 maftools-PCPG.Rdata 38M Sep 16 11:43 maftools-PRAD.Rdata 77M Sep 16 11:43 maftools-READ.Rdata 30M Sep 16 11:43 maftools-SARC.Rdata 542M Sep 16 11:43 maftools-SKCM.Rdata 255M Sep 16 11:44 maftools-STAD.Rdata 3.7M Sep 16 11:44 maftools-TGCT.Rdata 14M Sep 16 11:44 maftools-THCA.Rdata 5.0M Sep 16 11:44 maftools-THYM.Rdata 907M Sep 16 11:45 maftools-UCEC.Rdata 13M Sep 16 11:45 maftools-UCS.Rdata 2.4M Sep 16 11:45 maftools-UVM.Rdata 5.6G Sep 5 00:30 mc3.Rdata

需要认真读文档：https://www.bioconductor.org/packages/devel/bioc/vignettes/maftools/inst/doc/maftools.html

首先查看BRCA整个癌症的突变特征:

load(file = 'Rdata/maftools-BRCA.Rdata') oncoplot(maf = laml, top = 10)

然后加上前面制作好的stage信息：

laml@clinical.data pos=match(substring(as.character(laml@clinical.data$Tumor_Sample_Barcode),1,12),phe$bcr_patient_barcode) laml@clinical.data$stage=phe$new_stage[pos] table(laml@clinical.data$stage) other s1 s2 s3 s4 24 173 590 220 19 oncoplot(maf = laml, clinicalFeatures=c('stage'),sortByAnnotation=T, top = 10)

这样虽然可以肉眼查看不同stage的BRCA病人的突变全景图，但是很难区分各个基因是否有比例差异。首先呢，我们可以借助mafCompare: compare two cohorts (MAF). ，可以举例说明s1和s3两个分组的差异。

df=as.data.frame(laml@data) cg=as.character(laml@clinical.data$Tumor_Sample_Barcode)[laml@clinical.data$stage=='s1'] cg s1.maf <- read.maf(df[df$Tumor_Sample_Barcode %in% cg, ])


cg=as.character(laml@clinical.data$Tumor_Sample_Barcode)[laml@clinical.data$stage=='s3']

cg

s3.maf <- read.maf(df[df$Tumor_Sample_Barcode %in% cg, ])
pt.vs.rt <- mafCompare(m1 = s1.maf, m2 = s3.maf, m1Name = 'stage1',

                       m2Name = 'stage3', minMut = 5)

forestPlot(mafCompareRes = pt.vs.rt, pVal = 0.1)

代码难度有点多，需要R语言功底扎实。两个分组，差异还是有的，这个结果就是大家的生物学知识发挥的用武之地咯！

下期预告：有生存统计信息就可以做生存分析，根据突变位点，可以在各个癌症内部批量进行生存分析检验，挖掘有意义的基因或者基因列表。

带临床信息的肿瘤突变maf文件分析维度更多

加上临床信息

简单检查临床信息

简单检查结局事件生存情况

首先病人按照临床指标分组后看突变差异

相关推荐