肿瘤突变分析系列之乳腺癌
肿瘤突变分析系列的帖子还在继续,TCGA中涉及的肿瘤此处进行一个系统的分析。
今天和大家一起探究一下乳腺癌的突变分析,在开始之前呢,我们先来看一篇JCO的文章(Frequency of Germline Mutations in 25 Cancer Susceptibility Genes in a Sequential Series of Patients With Breast Cancer. Journal of clinical oncology. March 14, 2016.),这篇文章后面还跟进了3篇评论,可以自行查看。
此处我们来一起学习一下摘要。
作者在单中心评估一系列乳腺癌患者中25种癌症易感基因(包括BRCA1/2)的突变频率和预测因子以检验基因检测在该人群中应用价值。
2010年至2012年期间在单一癌症中心纳入488例I至III期乳腺癌患者。收集个人和家族癌症病史,并用NGS对种系DNA进行测序以鉴定突变。
结果: 在10.7%的女性中发现了有害突变,其中包括BRCA1/2(非犹太患者为5.1%)占6.1%和其他乳腺/卵巢癌易感基因占4.6%,包括CHEK2(n = 10),ATM(n = 4),BRIP1(n = 4),PALB2,PTEN,NBN,RAD51C,RAD51D,MSH6和PMS2各一个。然而低年龄(P <.01),德系犹太血统(P <.01),三阴性乳腺癌(P = .01),乳腺癌/卵巢癌家族史(P = .01)可以预测BRCA1/2突变,没有预测其他乳腺癌易感基因突变的因素。当这些基因被分析为单个组时,预测BRCA1/2突变的因素不能预测其他乳腺/卵巢癌易感基因的突变。可能需要寻找其他的生物标志。
《TCGA体细胞突变系列教程--胃癌》进行了各种包的安装以及数据的下载,此处不再介绍包的安装了。这次介绍整体和上次是一直的,此次在突变分析基础上给大家介绍一下,如何绘制一个基因突变与否两组的生存关系,以及在突变的瀑布图中添加临床信息(这两个功能都可以直接在TCGA官网的网页上进行完成)。
options(stringsAsFactors=F)
setwd("D:\\PaperProject\\All.mutationFiles\\somatic.maf")
library("maftools") ## 包的安装请自行查看上次的帖子。
## Warning message:
## "package 'maftools' was built under R version 3.5.1"
## 在读入文件之前我们先开看一下read.maf()这个函数
help(read.maf) #查看帮助选项,每个参数有详细的解释
read.maf(maf, clinicalData = NULL, removeDuplicatedVariants = TRUE,
useAll = TRUE, gisticAllLesionsFile = NULL, gisticAmpGenesFile = NULL,
gisticDelGenesFile = NULL, gisticScoresFile = NULL, cnLevel = "all",
cnTable = NULL, isTCGA = FALSE, vc_nonSyn = NULL, verbose = TRUE)
## 现在主要关注的是:maf,clinicalData,两者联系起来靠Tumor_Sample_Barcode,
## 这里就需要,整理一下maf,clinicalData从XENA或者TCGA下载最新的。
library(data.table)
## Warning message:
## "package 'data.table' was built under R version 3.4.4"
## 文件下载自 GDC
d1 <- fread("TCGA.BRCA.mutect.995c0111-d90b-4140-bee7-3845436c3b42.DR-10.0.somatic.maf")
dim(d1)
## 120988 120
d1[1:3,1:17]
Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode
USP24 23358 WUGSC GRCh38 chr1 55159655 55159655 + Missense_Mutation SNP T T C rs150880897 by1000G;byCluster;byFrequency TCGA-D8-A1XY-01A-11D-A14K-09 TCGA-D8-A1XY-10A-01D-A14K-09
ERICH3 127254 WUGSC GRCh38 chr1 74571494 74571494 + Missense_Mutation SNP C C T TCGA-D8-A1XY-01A-11D-A14K-09 TCGA-D8-A1XY-10A-01D-A14K-09
KIF26B 55083 WUGSC GRCh38 chr1 245419680 245419680 + Silent SNP G G T TCGA-D8-A1XY-01A-11D-A14K-09 TCGA-D8-A1XY-10A-01D-A14K-09
sample = substr(d1$Tumor_Sample_Barcode,1,16) #此处调整sample id是为了和临床数据整合到一起。
sample[1]
## 'TCGA-D8-A1XY-01A'
d1$Tumor_Sample_Barcode <- sample
d1[1:3,]
Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Variant_Type ... FILTER CONTEXT src_vcf_id tumor_bam_uuid normal_bam_uuid case_id GDC_FILTER COSMIC MC3_Overlap GDC_Validation_Status
USP24 23358 WUGSC GRCh38 chr1 55159655 55159655 + Missense_Mutation SNP ... panel_of_normals CTGGATTGTAG d083d669-6646-463b-853e-c58da8d06439 4374e19d-c5e7-49cf-8707-05ae5aeb7369 aadee87c-6a68-4580-bd10-64ac273b1e3d 0130d616-885e-4a6c-9d03-2f17dd692a05 common_in_exac;gdc_pon True Unknown
ERICH3 127254 WUGSC GRCh38 chr1 74571494 74571494 + Missense_Mutation SNP ... PASS TTCCTCTACCA d083d669-6646-463b-853e-c58da8d06439 4374e19d-c5e7-49cf-8707-05ae5aeb7369 aadee87c-6a68-4580-bd10-64ac273b1e3d 0130d616-885e-4a6c-9d03-2f17dd692a05 COSM1474194 True Unknown
KIF26B 55083 WUGSC GRCh38 chr1 245419680 245419680 + Silent SNP ... PASS GCCTCGCAGGG d083d669-6646-463b-853e-c58da8d06439 4374e19d-c5e7-49cf-8707-05ae5aeb7369 aadee87c-6a68-4580-bd10-64ac273b1e3d 0130d616-885e-4a6c-9d03-2f17dd692a05 COSM1473725;COSM1473726 True Unknown
fwrite(d1,file="BRCA.maf",sep="\t")
## 此处临床数据从xena下载,数据需要自行整理出:生存时间:"time",
## 生存状态:"status",这里为了对数据有个直观的认识和探索,
## 此处就在Excel中写个if()语句就能完成了,临床信息很多,
## 在R中整理这些也不难,所以此处不贴代码了。但是这里先读入临床数据给大家也一起看一下。
clinicalData = fread("phonotype.txt")
dim(clinicalData)
## 1283 189
clincalData[1:3,]
Tumor_Sample_Barcode additional_pharmaceutical_therapy additional_radiation_therapy additional_surgery_locoregional_procedure additional_surgery_metastatic_procedure age_at_initial_pathologic_diagnosis anatomic_neoplasm_subdivision anatomic_treatment_site axillary_lymph_node_stage_method_type axillary_lymph_node_stage_other_method_descriptive_text ... code.tissue_source_site name.tissue_source_site project.tissue_source_site days_to_collection.samples initial_weight.samples is_ffpe.samples oct_embedded.samples sample_type.samples sample_type_id.samples state.samples
TCGA-A8-A07F-01A 65 Left Upper Outer Quadrant Primary Tumor Field ... A8 Indivumed Breast invasive carcinoma 453 150 FALSE FALSE Primary Tumor 1 live
TCGA-A8-A081-01A 80 Right ... A8 Indivumed Breast invasive carcinoma 637 150 FALSE FALSE Primary Tumor 1 live
TCGA-A2-A25A-01A 44 Left Upper Inner Quadrant Axillary lymph node dissection alone ... A2 Walter Reed Breast invasive carcinoma 2927 160 FALSE TRUE Primary Tumor 1 live
## 准备和整理了两个文件maf,clincald data,保证有相同Tumor_Sample_Barcode
d2 <- read.maf(maf="BRCA.maf",clinicalData="phonotype.txt")
## reading maf..
## silent variants: 45177
ID N
1: Samples 986
2: 3'Flank 870
3: 3'UTR 8850
4: 5'Flank 725
5: 5'UTR 2874
6: IGR 9
7: Intron 6094
8: RNA 1948
9: Silent 22391
10: Splice_Region 1416
## Summarizing..
ID summary Mean Median
1: NCBI_Build GRCh38 NA NA
2: Center WUGSC NA NA
3: Samples 986 NA NA
4: nGenes 16131 NA NA
5: Frame_Shift_Del 2947 2.992 2
6: Frame_Shift_Ins 1849 1.877 1
7: In_Frame_Del 574 0.583 0
8: In_Frame_Ins 420 0.426 0
9: Missense_Mutation 62316 63.265 32
10: Nonsense_Mutation 5734 5.821 2
11: Nonstop_Mutation 86 0.087 0
12: Splice_Site 1804 1.831 1
13: Translation_Start_Site 81 0.082 0
14: total 75811 76.965 39
## Gene Summary..
Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins
1: TP53 47 15 5 0
2: PIK3CA 0 0 14 0
3: TTN 13 12 1 7
---
16130: ZSCAN31 0 0 0 0
16131: ZSCAN4 0 0 0 0
Missense_Mutation Nonsense_Mutation Nonstop_Mutation Splice_Site
1: 205 46 0 26
2: 342 1 0 0
3: 258 24 0 6
---
16130: 1 0 0 0
16131: 0 1 0 0
Translation_Start_Site total MutatedSamples AlteredSamples
1: 0 344 338 338
2: 0 357 324 324
3: 0 321 190 190
---
16130: 0 1 1 1
16131: 0 1 1 1
NOTE: Possible FLAGS among top ten genes:
[1] "TTN" "MUC16" "HMCN1"
## Checking clinical data..
## Done !
getClinicalData(d2) #查看临床信息
Tumor_Sample_Barcode additional_pharmaceutical_therapy additional_radiation_therapy additional_surgery_locoregional_procedure additional_surgery_metastatic_procedure age_at_initial_pathologic_diagnosis anatomic_neoplasm_subdivision anatomic_treatment_site axillary_lymph_node_stage_method_type axillary_lymph_node_stage_other_method_descriptive_text ... code.tissue_source_site name.tissue_source_site project.tissue_source_site days_to_collection.samples initial_weight.samples is_ffpe.samples oct_embedded.samples sample_type.samples sample_type_id.samples state.samples
TCGA-A8-A07F-01A NA NA NA NA 65 Left_Upper_Outer_Quadrant Primary_Tumor_Field NA NA ... A8 Indivumed Breast_invasive_carcinoma 453 150 FALSE FALSE Primary_Tumor 1 live
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
TCGA-BH-A0B2-11A NA NA NA NA NA NA NA NA NA ... NA NA NA NA NA NA NA NA NA NA
getSampleSummary(d2) #查看每个样品发生突变的情况
Tumor_Sample_Barcode Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins Missense_Mutation Nonsense_Mutation Nonstop_Mutation Splice_Site Translation_Start_Site total
TCGA-AN-A046-01A 2 12 0 0 4168 697 5 78 0 4962
... ... ... ... ... ... ... ... ... ... ...
TCGA-A8-A08C-01A 0 0 0 0 1 0 0 0 0 1
plotmafSummary(maf = d2, rmOutlier = TRUE,
addStat = 'median', dashboard = TRUE,
titvRaw=FALSE)#绘制整体的突变情况
![](http://n4.ikafan.com/assetsj/blank.gif)
## 乳腺癌中突变中错义突变是重要部分,TP53榜上有名
oncoplot(maf = d2, top = 20,clinicalFeatures = c('tumor_stage.diagnoses', 'gender.demographic'), fontSize = 12)
# 绘制前20个突变基因的瀑布图。
# oncoplot()中参数gene=c()可以指定基因名,绘制感兴趣的基因的瀑布图,
# 应用clinicalFeatures可以在瀑布图下添加临床特征。
![](http://n4.ikafan.com/assetsj/blank.gif)
oncostrip(maf = d2, genes = c('TP53','NEB', 'FLG'))
##挑选目的基因,此处随机挑选作为展示
somaticInteractions(maf = d2, top = 25, pvalue = c(0.05, 0.1)) #突变间的相关性分析
![](http://n4.ikafan.com/assetsj/blank.gif)
geneCloud(input = d2, minMut = 20)
![](http://n4.ikafan.com/assetsj/blank.gif)
## Survival analysis based on grouping of DNMT3A mutation status
mafSurvival(maf = d2, genes = 'TP53', time = 'time', Status = 'status', isTCGA = F)
## 因为time,status为自己整理,所以isTCGA选择了F,这样分析更具有普遍性。
Looking for clinical data in annoatation slot of MAF..
Number of mutated samples for given genes:
TP53
338
Median survival..
Group medianTime N
1: Mutant 822.5 338
2: WT 952.0 945
![](http://n4.ikafan.com/assetsj/blank.gif)
还等什么,去TCGA数据库下载MAF文件,同样的代码,你马上就能复现这些美图!