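Downloading the MAF files for all TCGA cancer types for mutational signature analysis
The Sanger Institute has already done this analysis, but it is worth reproducing ourselves. The result looks like this: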

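First, the MAF files for all TCGA cancer types
In TCGA, mutation records in MAF format are already level 4 data, so they are fully open access and can be downloaded freely with a few clicks on the official GDC portal.
The main steps: at https://portal.gdc.cancer.gov/repository, filter by file type and select the MAF format, then filter by access and select open. The resulting 132 files are the ones we need.
They total 2.19 GB. Each cancer type has four MAF files, holding the somatic mutations called by four tools: MuTect, MuSE, VarScan and SomaticSniper.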
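A quick way to confirm that every cancer type really has one file per caller is to tabulate the file names. The sketch below assumes the names follow the `TCGA.<PROJECT>.<caller>.…` pattern seen in this post; the file names and the `UUID…` parts are made-up placeholders, and `parse_maf_name` is a hypothetical helper, not part of any package.

```r
# Hypothetical file names; the UUID fields are placeholders, not real GDC identifiers.
maf_files <- c(
  "TCGA.BRCA.mutect.UUID1.somatic.maf.gz",
  "TCGA.BRCA.muse.UUID2.somatic.maf.gz",
  "TCGA.BRCA.varscan.UUID3.somatic.maf.gz",
  "TCGA.BRCA.somaticsniper.UUID4.somatic.maf.gz",
  "TCGA.STAD.muse.UUID5.somatic.maf.gz"
)

# Split each name on "." and keep the project (field 2) and the caller (field 3).
parse_maf_name <- function(files) {
  parts <- strsplit(basename(files), ".", fixed = TRUE)
  data.frame(
    file    = files,
    project = vapply(parts, function(p) p[2], character(1)),
    caller  = vapply(parts, function(p) p[3], character(1)),
    stringsAsFactors = FALSE
  )
}

info <- parse_maf_name(maf_files)
# A complete download should show exactly one file in every project/caller cell.
print(table(info$project, info$caller))
```

With the real download, `maf_files` would come from something like `list.files(pattern = "maf.gz$", recursive = TRUE)`.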
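To download, I chose to fetch the manifest file for these 132 files and then use the official tool provided by GDC. I have written a tutorial about this tool on the 生信技能树 forum (the post on how convenient downloading TCGA data has become, starting with the GDC website and client), so I will not repeat it here; go and read it yourself. Once the tool is installed, just run ./gdc-client download -m manifest_xxx.txt. The manifest file is the one you just created and downloaded.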
cd ~/institute/TCGA/GDC_NCBI/all
~/biosoft/GDC/gdc-client download -m gdc_manifest.2017-08-25T02-57-11.281090.txt
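However, the tool is only provided for a limited set of operating systems: "If you are a user of CentOS 6 or RedHat Enterprise Release 6 and wish to use the Data Transfer Tool, contact the GDC Help Desk for assistance."
So I downloaded everything on my Mac and then uploaded it to my server.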
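Because the files are downloaded on one machine and copied to another, it can be worth checking them against the manifest afterwards. The following is a minimal sketch under two assumptions that should be verified against your own download: that the manifest is a tab-separated table with `id`, `filename` and `md5` columns, and that each file sits in a sub-directory named after its `id`. `check_download` is a hypothetical helper, and the demo uses a toy manifest built in a temporary directory.

```r
# Compare a manifest against a download directory.
# Assumed manifest columns: id, filename, md5 (tab-separated, with a header line).
check_download <- function(manifest_file, download_dir) {
  manifest <- read.delim(manifest_file, colClasses = "character")
  paths    <- file.path(download_dir, manifest$id, manifest$filename)
  present  <- file.exists(paths)
  md5_ok   <- rep(NA, length(paths))
  # Checksums are only computed for files that exist; missing files stay NA.
  md5_ok[present] <- unname(tools::md5sum(paths[present])) == manifest$md5[present]
  data.frame(filename = manifest$filename, present = present, md5_ok = md5_ok,
             stringsAsFactors = FALSE)
}

# Toy demonstration: one file present, one missing.
demo_dir <- tempfile("gdc_demo_")
dir.create(file.path(demo_dir, "id-1"), recursive = TRUE)
toy_file <- file.path(demo_dir, "id-1", "a.maf.gz")
writeLines("toy content", toy_file)
toy_manifest <- data.frame(
  id       = c("id-1", "id-2"),
  filename = c("a.maf.gz", "b.maf.gz"),
  md5      = c(unname(tools::md5sum(toy_file)), "0"),
  stringsAsFactors = FALSE
)
manifest_file <- file.path(demo_dir, "manifest.txt")
write.table(toy_manifest, manifest_file, sep = "\t", quote = FALSE, row.names = FALSE)

report <- check_download(manifest_file, demo_dir)
print(report)
```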
Next, build the signatures from the MAF files
I followed the method in the paper "Mutational Profile of Metastatic Breast Cancers: A Retrospective Analysis". Their description of the method is:
De novo mutational signature analysis was done using the Matlab Wellcome Trust Sanger Institute’s signature framework. We used the deconstructSigs R package to determine the contribution of the known signatures that explain each sample mutational profile with more than 50 somatic mutations. We considered the 13 signatures (Signatures 1, 2, 3, 5, 6, 8, 10, 13, 17, 18, 20, 26, and 30) operative in breast cancer as defined in COSMIC (http://cancer.sanger.ac.uk/signatures/matrix.png). A signature was defined as operative or predominant if its contribution to the mutational pattern was respectively >25% (or >100 mutations) or >50%.
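The thresholds in that description can be written down as a small rule. This is only a sketch of one reading of the text: it assumes "(or >100 mutations)" means the signature's weight multiplied by the sample's mutation count exceeds 100, which the paper excerpt does not spell out. `classify_signatures` and the toy weights are hypothetical.

```r
# weights: named numeric vector of signature contributions for one sample (fractions).
# n_mut:   total number of somatic mutations in that sample.
classify_signatures <- function(weights, n_mut, min_mut = 50) {
  if (n_mut <= min_mut) {
    return(NULL)  # only samples with more than 50 somatic mutations are considered
  }
  # Assumption: "mutations attributed to a signature" = weight * total mutations.
  operative   <- weights > 0.25 | weights * n_mut > 100
  predominant <- weights > 0.50
  data.frame(signature   = names(weights),
             weight      = unname(weights),
             operative   = unname(operative),
             predominant = unname(predominant),
             stringsAsFactors = FALSE)
}

# Toy weights, not real results.
toy <- c(Signature.1 = 0.55, Signature.3 = 0.30, Signature.13 = 0.15)
print(classify_signatures(toy, n_mut = 200))
```

In a sample with 200 mutations, a signature at 15% accounts for about 30 mutations and is neither operative nor predominant under this reading; in a sample with 800 mutations the same 15% would pass the 100-mutation criterion.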
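I have previously written a similar tutorial, on using the SomaticSignatures package to parse MAF mutation data and obtain mutation signatures, but here I want to learn the newer tool, the deconstructSigs R package.
It is an R package, so it can be installed directly from RStudio. The plan is to take the BRCA somatic mutation MAF files and see whether the variants found by the four callers differ in their signatures. (The example code that follows reads the STAD MuSE file; substitute whichever file you want to analyse.)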
install.packages('deconstructSigs')
# dependencies 'BSgenome', 'BSgenome.Hsapiens.UCSC.hg19'
# (on current Bioconductor releases, use BiocManager::install() instead of biocLite())
BiocInstaller::biocLite('BSgenome')
BiocInstaller::biocLite('BSgenome.Hsapiens.UCSC.hg19')
## https://github.com/raerose01/deconstructSigs
library(deconstructSigs)

file1='TCGA.STAD.muse..somatic.maf.gz'
TCGA.STAD.muse=read.table(file1,sep = '\t',quote="",header = T)
TCGA.STAD.muse[1:5,1:15]

## data frame including 5 columns: sample.ID,chr,pos,ref,alt
## columns 5, 6, 11 and 13 are Chromosome, Start_Position, Reference_Allele
## and Tumor_Seq_Allele2 in the 120-column MAF layout
sample.mut.ref <- data.frame(Sample='TCGA.STAD.muse',
                             chr = TCGA.STAD.muse[,5],
                             pos = TCGA.STAD.muse[,6],
                             ref = TCGA.STAD.muse[,11],
                             alt = TCGA.STAD.muse[,13])
sigs.input <- mut.to.sigs.input(mut.ref = sample.mut.ref,
                                sample.id = "Sample",
                                chr = "chr",
                                pos = "pos",
                                ref = "ref",
                                alt = "alt")
class(sigs.input)

sample_1 = whichSignatures(tumor.ref = sigs.input,
                           signatures.ref = signatures.nature2013,
                           sample.id = 'TCGA.STAD.muse',
                           contexts.needed = TRUE,
                           tri.counts.method = 'exome')

# Plot example
# sigs.input holds raw counts, so contexts.needed = TRUE is set here as well
plot_example <- whichSignatures(tumor.ref = sigs.input,
                                signatures.ref = signatures.nature2013,
                                sample.id = 'TCGA.STAD.muse',
                                contexts.needed = TRUE)
# Plot output
plotSignatures(plot_example, sub = 'example')
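There is one problem here: the deconstructSigs R package appears to support only the hg19 genome build, whereas the TCGA MAF files I downloaded are on hg38. So although the code itself is correct, the results it produces are not; the coordinates in the downloaded TCGA MAF files need to be converted to hg19 first.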
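Since the script runs without complaint on the wrong build, a guard that reads the NCBI_Build column and refuses to continue can catch this mistake early. The sketch below only detects the mismatch; the conversion itself (for example with a liftOver tool and a chain file) is not shown. `check_build` is a hypothetical helper, and the accepted labels for hg19 are an assumption about how the build may be written in a MAF.

```r
# Stop early when a MAF table is on a different genome build than expected.
# Assumption: hg19 may appear in NCBI_Build as "GRCh37", "hg19" or "37".
check_build <- function(maf, expected = c("GRCh37", "hg19", "37")) {
  builds <- unique(as.character(maf$NCBI_Build))
  if (!all(builds %in% expected)) {
    stop("MAF build is ", paste(builds, collapse = ","),
         " but one of ", paste(expected, collapse = "/"),
         " is required; convert the coordinates first.")
  }
  invisible(TRUE)
}

# Toy table standing in for a freshly downloaded hg38 MAF.
toy_hg38 <- data.frame(NCBI_Build = c("GRCh38", "GRCh38"),
                       Chromosome = c("chr1", "chr2"),
                       stringsAsFactors = FALSE)
result <- tryCatch(check_build(toy_hg38), error = function(e) conditionMessage(e))
print(result)
```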
As for batch processing, that is simply a matter of wrapping the R script above in a loop.
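One possible shape for that loop is sketched here. `run_batch` is a hypothetical wrapper that applies any per-file function and keeps going when a single file fails; `count_rows` is a stand-in for the real per-file analysis, and the files are tiny toy tables written to a temporary location.

```r
# Apply one analysis function to many files and collect the results by file name.
run_batch <- function(files, analyse) {
  results <- lapply(files, function(f) {
    tryCatch(analyse(f), error = function(e) {
      message("Skipping ", basename(f), ": ", conditionMessage(e))
      NULL  # a failed file yields NULL instead of aborting the whole batch
    })
  })
  names(results) <- basename(files)
  results
}

# Stand-in analysis: count the mutation rows of a tab-separated file.
count_rows <- function(f) {
  if (!file.exists(f)) stop("file not found")
  nrow(read.table(f, sep = "\t", quote = "", header = TRUE))
}

# Toy input files.
tmp <- tempfile(c("a_", "b_"), fileext = ".maf")
writeLines(c("Chromosome\tStart_Position", "chr1\t100", "chr2\t200"), tmp[1])
writeLines(c("Chromosome\tStart_Position", "chr3\t300"), tmp[2])

batch <- run_batch(c(tmp, "missing_file.maf"), count_rows)
print(unname(unlist(batch)))
```

In the real run, `analyse` would be a function that reads one MAF file and performs the signature fit on it.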
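Note that the downloaded MAF files can come in two formats, with either 47 or 120 columns. The first line is normally the header, naming every column, and it is admittedly a lot of information. Column positions differ between the two layouts (Reference_Allele is column 12 in the 47-column format and column 11 in the 120-column format), so selecting columns by number is fragile. The two layouts are: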
47-column format:
1 Hugo_Symbol
2 Entrez_Gene_Id
3 Center
4 NCBI_Build
5 Chromosome
6 Start_Position
7 End_Position
8 Strand
9 Consequence
10 Variant_Classification
11 Variant_Type
12 Reference_Allele
13 Tumor_Seq_Allele1
14 Tumor_Seq_Allele2
15 dbSNP_RS
16 dbSNP_Val_Status
17 Tumor_Sample_Barcode
18 Matched_Norm_Sample_Barcode
19 Match_Norm_Seq_Allele1
20 Match_Norm_Seq_Allele2
21 Tumor_Validation_Allele1
22 Tumor_Validation_Allele2
23 Match_Norm_Validation_Allele1
24 Match_Norm_Validation_Allele2
25 Verification_Status
26 Validation_Status
27 Mutation_Status
28 Sequencing_Phase
29 Sequence_Source
30 Validation_Method
31 Score
32 BAM_File
33 Sequencer
34 t_ref_count
35 t_alt_count
36 n_ref_count
37 n_alt_count
38 HGVSc
39 HGVSp
40 HGVSp_Short
41 Transcript_ID
42 RefSeq
43 Protein_position
44 Codons
45 Hotspot
46 cDNA_change
47 Amino_Acid_Change
120-column format:
1 Hugo_Symbol
2 Entrez_Gene_Id
3 Center
4 NCBI_Build
5 Chromosome
6 Start_Position
7 End_Position
8 Strand
9 Variant_Classification
10 Variant_Type
11 Reference_Allele
12 Tumor_Seq_Allele1
13 Tumor_Seq_Allele2
14 dbSNP_RS
15 dbSNP_Val_Status
16 Tumor_Sample_Barcode
17 Matched_Norm_Sample_Barcode
18 Match_Norm_Seq_Allele1
19 Match_Norm_Seq_Allele2
20 Tumor_Validation_Allele1
21 Tumor_Validation_Allele2
22 Match_Norm_Validation_Allele1
23 Match_Norm_Validation_Allele2
24 Verification_Status
25 Validation_Status
26 Mutation_Status
27 Sequencing_Phase
28 Sequence_Source
29 Validation_Method
30 Score
31 BAM_File
32 Sequencer
33 Tumor_Sample_UUID
34 Matched_Norm_Sample_UUID
35 HGVSc
36 HGVSp
37 HGVSp_Short
38 Transcript_ID
39 Exon_Number
40 t_depth
41 t_ref_count
42 t_alt_count
43 n_depth
44 n_ref_count
45 n_alt_count
46 all_effects
47 Allele
48 Gene
49 Feature
50 Feature_type
51 One_Consequence
52 Consequence
53 cDNA_position
54 CDS_position
55 Protein_position
56 Amino_acids
57 Codons
58 Existing_variation
59 ALLELE_NUM
60 DISTANCE
61 TRANSCRIPT_STRAND
62 SYMBOL
63 SYMBOL_SOURCE
64 HGNC_ID
65 BIOTYPE
66 CANONICAL
67 CCDS
68 ENSP
69 SWISSPROT
70 TREMBL
71 UNIPARC
72 RefSeq
73 SIFT
74 PolyPhen
75 EXON
76 INTRON
77 DOMAINS
78 GMAF
79 AFR_MAF
80 AMR_MAF
81 ASN_MAF
82 EAS_MAF
83 EUR_MAF
84 SAS_MAF
85 AA_MAF
86 EA_MAF
87 CLIN_SIG
88 SOMATIC
89 PUBMED
90 MOTIF_NAME
91 MOTIF_POS
92 HIGH_INF_POS
93 MOTIF_SCORE_CHANGE
94 IMPACT
95 PICK
96 VARIANT_CLASS
97 TSL
98 HGVS_OFFSET
99 PHENO
100 MINIMISED
101 ExAC_AF
102 ExAC_AF_Adj
103 ExAC_AF_AFR
104 ExAC_AF_AMR
105 ExAC_AF_EAS
106 ExAC_AF_FIN
107 ExAC_AF_NFE
108 ExAC_AF_OTH
109 ExAC_AF_SAS
110 GENE_PHENO
111 FILTER
112 CONTEXT
113 src_vcf_id
114 tumor_bam_uuid
115 normal_bam_uuid
116 case_id
117 GDC_FILTER
118 COSMIC
119 MC3_Overlap
120 GDC_Validation_Status
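Because the same field sits at different positions in the two layouts, picking columns by name is safer than picking them by number. The following sketch builds the five-column table (Sample, chr, pos, ref, alt) from column names that appear in both layouts listed above. `maf_to_mut_ref` is a hypothetical helper, the toy rows are invented, and restricting to rows with Variant_Type "SNP" is a choice made here because the signatures are defined on single-base substitutions.

```r
# Build the five-column input (Sample, chr, pos, ref, alt) from a MAF table
# by column name, keeping single-nucleotide variants only.
maf_to_mut_ref <- function(maf, sample_label = NULL) {
  needed <- c("Chromosome", "Start_Position", "Variant_Type",
              "Reference_Allele", "Tumor_Seq_Allele2", "Tumor_Sample_Barcode")
  missing_cols <- setdiff(needed, colnames(maf))
  if (length(missing_cols) > 0) {
    stop("MAF is missing columns: ", paste(missing_cols, collapse = ", "))
  }
  snv <- maf[maf$Variant_Type == "SNP", , drop = FALSE]
  # With a sample_label, all rows are pooled under one name (as in the post);
  # without it, each tumour barcode is kept as its own sample.
  sample_col <- if (is.null(sample_label)) {
    as.character(snv$Tumor_Sample_Barcode)
  } else {
    rep(sample_label, nrow(snv))
  }
  data.frame(Sample = sample_col,
             chr = as.character(snv$Chromosome),
             pos = snv$Start_Position,
             ref = as.character(snv$Reference_Allele),
             alt = as.character(snv$Tumor_Seq_Allele2),
             stringsAsFactors = FALSE)
}

# Two toy tables with the same content but different column order.
toy_a <- data.frame(
  Chromosome = c("chr1", "chr2"), Start_Position = c(100L, 200L),
  Variant_Type = c("SNP", "DEL"), Reference_Allele = c("C", "AT"),
  Tumor_Seq_Allele1 = c("C", "AT"), Tumor_Seq_Allele2 = c("T", "-"),
  Tumor_Sample_Barcode = c("S1", "S1"), stringsAsFactors = FALSE)
toy_b <- toy_a[, rev(colnames(toy_a))]

print(maf_to_mut_ref(toy_a))
print(identical(maf_to_mut_ref(toy_a), maf_to_mut_ref(toy_b)))
```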
