把vcf文件转换为maf格式，肿瘤外显子上游分析教程到此为止 / 四六文摘

cd ~/biosoft/vcf2maf/ export VCF2MAF_URL=`curl -sL https://api.github.com/repos/mskcc/vcf2maf/releases | grep -m1 tarball_url | cut -d\" -f4` curl -L -o mskcc-vcf2maf.tar.gz $VCF2MAF_URL; tar -zxf mskcc-vcf2maf.tar.gz; cd mskcc-vcf2maf-* mv * ../ perl ~/biosoft/vcf2maf/vcf2maf.pl --man

帮助文档如下：

[jianmingzeng@jade mskcc-vcf2maf-747a1bb]$ perl vcf2maf.pl --help Usage: perl vcf2maf.pl --help perl vcf2maf.pl --input-vcf WD4086.vcf --output-maf WD4086.maf --tumor-id WD4086 --normal-id NB4086 Options: --input-vcf Path to input file in VCF format --output-maf Path to output MAF file --tmp-dir Folder to retain intermediate VCFs after runtime [Default: Folder containing input VCF] --tumor-id Tumor_Sample_Barcode to report in the MAF [TUMOR] --normal-id Matched_Norm_Sample_Barcode to report in the MAF [NORMAL] --vcf-tumor-id Tumor sample ID used in VCF's genotype columns [--tumor-id] --vcf-normal-id Matched normal ID used in VCF's genotype columns [--normal-id] --custom-enst List of custom ENST IDs that override canonical selection --vep-path Folder containing the vep script [~/vep] --vep-data VEP's base cache/plugin directory [~/.vep] --vep-forks Number of forked processes to use when running VEP [4] --buffer-size Number of variants VEP loads at a time; Reduce this for low memory systems [5000] --any-allele When reporting co-located variants, allow mismatched variant alleles too --ref-fasta Reference FASTA file [~/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz] --filter-vcf A VCF for FILTER tag common_variant. Set to 0 to disable [~/.vep/ExAC_nonTCGA.r0.3.1.sites.vep.vcf.gz] --max-filter-ac Use tag common_variant if the filter-vcf reports a subpopulation AC higher than this [10] --species Ensembl-friendly name of species (e.g. mus_musculus for mouse) [homo_sapiens] --ncbi-build NCBI reference assembly of variants MAF (e.g. GRCm38 for mouse) [GRCh37] --cache-version Version of offline cache to use with VEP (e.g. 75, 84, 91) [Default: Installed version] --maf-center Variant calling center to report in MAF [.] --retain-info Comma-delimited names of INFO fields to retain as extra columns in MAF [] --min-hom-vaf If GT undefined in VCF, minimum allele fraction to call a variant homozygous [0.7] --remap-chain Chain file to remap variants to a different assembly before running VEP --help Print a brief help message and quit --man Print the detailed manual

上面的命令参数看的头晕，没关系，学习一个新的软件最好的方法就是实践：

## 自己单独运行VEP，注释速度很慢 # Processed 1969 total variants (9 vars/sec, 9 vars/sec total) perl ~/vep/variant_effect_predictor.pl -i somatic.vcf -o test.vcf \ --cache --force_overwrite --assembly GRCh38 --vcf # 看起来上面的VEP完全没有必要运行

# 一步到位，包括VEP和 vcf2maf，而且注释非常快,因为调用了5个线程 # Processed 1969 total variants (53 vars/sec, 53 vars/sec total) perl ~/biosoft/vcf2maf/vcf2maf.pl --input-vcf somatic.vcf --output-maf test.maf \ --ref-fasta ~/.vep/homo_sapiens/86_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \ --tumor-id 1092NTT --normal-id 1257NT --ncbi-build GRCh38

通常我们是批量运行：

for i in *filter.vcf ; do echo $i j=$(basename "$i" _filter.vcf ) echo ${j^^} perl ~/biosoft/vcf2maf/vcf2maf.pl --input-vcf $i --output-maf ${j^^}.maf \ --ref-fasta ~/.vep/homo_sapiens/86_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \ --tumor-id ${j^^}_T --normal-id ${j^^}_N --ncbi-build GRCh38 done 1.3M May 30 14:00 OSCC_01.maf 745K May 30 14:01 OSCC_04.maf 576K May 30 14:01 OSCC_06.maf 730K May 30 14:02 OSCC_09.maf

默认调用5个线程，所以注释速度非常快！！！

得到的maf后，肿瘤外显子测序数据的上游分析就到此为止了，接下来的下游分析，也就是联系自己的项目设计实际情况来讲生物学故事，会更折磨人！！！

如果你没有自己的肿瘤外显子下面，却想一步到位拿到maf去练手，可以直接下载tcga数据库的，看昨天教程：

生信小技巧第8课，加上 TCGA的28篇教程- 批量下载TCGA所有数据

比如你可以下载TCGA里面的乳腺癌的1000个左右的样本的突变信息，绘图如下：

是不是很赞啊！

把vcf文件转换为maf格式，肿瘤外显子上游分析教程到此为止

相关推荐