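Installing snpEff and annotating VCF files [Live series] My Genome 85
This tool matters a lot, especially for anyone studying genetic variation. After SNP calling, many people like to annotate with ANNOVAR, but that annotation is relatively basic: it mainly tells you which gene a variant falls in and which region of that gene. For more detail you need a more function-oriented tool, and snpEff is one of the best. It runs on Java, so it is very easy to install and use, and its manual is very thorough: http://snpeff.sourceforge.net/SnpEff_manual.html The project home page is http://snpeff.sourceforge.net/
Installing the software
Download the latest release from https://sourceforge.net/projects/snpeff/files/ and unzip it; it is then ready to use.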
wget https://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip
unzip snpEff_latest_core.zip
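[Screenshot: the software after installation]
Downloading a suitable database
List the available species and database versions, then download the one that matches your data. My human WGS data were aligned to hg19, so I chose GRCh37.75: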
java -jar ~/biosoft/SnpEff/snpEff/snpEff.jar databases|grep GRCh
java -jar ~/biosoft/SnpEff/snpEff/snpEff.jar download GRCh37.75
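PS: The download command sometimes fails, but it prints the download link, so you can fetch the file yourself with wget: wget https://nchc.dl.sourceforge.net/project/snpeff/databases/v4_2/snpEff_v4_2_GRCh37.75.zip
If you download it yourself, make sure the unzipped data directory ends up inside the snpEff directory.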
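The manual route can be sketched as below. This is only a sketch: it assumes snpEff sits in ~/biosoft/SnpEff/snpEff as in the commands above, and that the archive unpacks to data/GRCh37.75/ containing a file named snpEffectPredictor.bin. That file name is my assumption about the database layout, so check what your own archive contains. db_installed is a hypothetical helper, not part of snpEff.

```shell
# Hypothetical helper: succeeds if the database for genome $2 appears to be
# unpacked under the snpEff directory $1.
db_installed() {
    # Assumption: a snpEff database folder holds snpEffectPredictor.bin
    [ -f "$1/data/$2/snpEffectPredictor.bin" ]
}

# Usage (run by hand, needs network access):
#   cd ~/biosoft/SnpEff/snpEff
#   wget https://nchc.dl.sourceforge.net/project/snpeff/databases/v4_2/snpEff_v4_2_GRCh37.75.zip
#   unzip snpEff_v4_2_GRCh37.75.zip    # expected to create ./data/GRCh37.75/
#   db_installed "$PWD" GRCh37.75 && echo "database in place"
```

Unzipping from inside the snpEff directory is what keeps data/ next to snpEff.jar, which is the requirement stated above.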
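The human snpEff_v4_2_GRCh37.75.zip is only 663 MB; at my download speed of 10 MB/s it finished in the time it takes to drink a cup of coffee.
All the human database versions are listed below: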
GRCh37.70 Homo_sapiens http://downloads.sourceforge.net/project/snpeff/databases/v4_2/snpEff_v4_2_GRCh37.70.zip
GRCh37.75 Homo_sapiens http://downloads.sourceforge.net/project/snpeff/databases/v4_2/snpEff_v4_2_GRCh37.75.zip
GRCh37.GTEX Homo_sapiens, Gencode 12, GTEX project http://downloads.sourceforge.net/project/snpeff/databases/v4_2/snpEff_v4_2_GRCh37.GTEX.zip
GRCh38.81 Homo_sapiens http://downloads.sourceforge.net/project/snpeff/databases/v4_2/snpEff_v4_2_GRCh38.81.zip
GRCh38.82 Homo_sapiens http://downloads.sourceforge.net/project/snpeff/databases/v4_2/snpEff_v4_2_GRCh38.82.zip
GRCh38.p2.RefSeq Human genome GRCh38 using RefSeq transcripts http://downloads.sourceforge.net/project/snpeff/databases/v4_2/snpEff_v4_2_GRCh38.p2.RefSeq.zip
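[Screenshot: the directory after downloading and unzipping the database]
All the database files sit under the GRCh37.75 folder. The other files in the screenshot are other databases I downloaded. The screenshot dates from my 2015 installation; the download is now 663 MB where it used to be 642 MB, a small difference.
The software itself downloads quickly, but the database takes a while, so go have a coffee.
PS: The package also ships example files: a variety of VCF data together with the shell commands to run them (for example examples.sh). They are very simple, so you can run them directly to test the installation and see how the tool works.
Annotating your own data
Running it is also simple: java -Xmx4G -jar snpEff.jar -i vcf -o vcf GRCh37.75 example.vcf >example_snpeff.vcf
This sets both the input and the output format to VCF, names the database downloaded above, and then gives the input file, with the output redirected to a new file. My command was: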
java -jar ~/biosoft/SnpEff/snpEff/snpEff.jar -i vcf GRCh37.75 hg19_exon.snp.vcf>hg19_exon.snp.snpEff.vcf
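If you annotate several files, the command above can be wrapped in a small function. The sketch below is one way to do it; annotated_name, run_snpeff and the SNPEFF_JAR variable are hypothetical names I made up for illustration, and the jar path defaults to the install location used in this post.

```shell
# Hypothetical helper: derive the output name used above,
# e.g. hg19_exon.snp.vcf -> hg19_exon.snp.snpEff.vcf
annotated_name() {
    printf '%s\n' "${1%.vcf}.snpEff.vcf"
}

# Hypothetical wrapper around the annotation command shown above.
# $1 = database name (e.g. GRCh37.75), $2 = input VCF
run_snpeff() {
    snpeff_genome=$1
    snpeff_input=$2
    snpeff_jar=${SNPEFF_JAR:-$HOME/biosoft/SnpEff/snpEff/snpEff.jar}
    # Refuse to start if the input file is missing
    [ -f "$snpeff_input" ] || { echo "missing input: $snpeff_input" >&2; return 1; }
    java -Xmx4G -jar "$snpeff_jar" -i vcf -o vcf "$snpeff_genome" "$snpeff_input" \
        > "$(annotated_name "$snpeff_input")"
}

# Usage: run_snpeff GRCh37.75 hg19_exon.snp.vcf
```

The -Xmx4G setting is taken from the generic command earlier in this section; adjust it to the memory you have available.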
To save time I annotated only the VCF file of variants located in exons. Here is the annotation for one arbitrarily chosen site:
1 1263362 rs4970433 G A 1109.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=4.912;ClippingRankSum=-0.335;DB;DP=75;ExcessHet=3.0103;FS=3.074;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=-1.133;QD=14.80;ReadPosRankSum=1.144;SOR=1.075;ANN=A|3_prime_UTR_variant|MODIFIER|GLTPD1|ENSG00000224051|transcript|ENST00000343938|protein_coding|3/3|c.*219G>A|||||219|,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000540437|protein_coding||c.-6029C>T|||||3373|,A|upstream_gene_variant|MODIFIER|TAS1R3|ENSG00000169962|transcript|ENST00000339381|protein_coding||c.-3364G>A|||||3332|,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000435064|protein_coding||c.-3374C>T|||||3291|,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000323275|retained_intron||n.-3361C>T|||||3361|,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000411962|protein_coding||c.-3374C>T|||||3309|,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000421495|protein_coding||c.-14068C>T|||||3345|,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000458452|nonsense_mediated_decay||c.-3374C>T|||||3332|,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000419704|protein_coding||c.-3374C>T|||||3329|,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000532772|nonsense_mediated_decay||c.-3374C>T|||||3354|,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000450926|protein_coding||c.-3374C>T|||||3312|,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000545578|protein_coding||c.-6948C>T|||||3345|,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000527098|nonsense_mediated_decay||c.-3374C>T|||||3316|,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000528879|nonsense_mediated_decay||c.-3374C>T|||||3334|,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000434694|protein_coding||c.-3374C>T|||||3352|WARNING_TRANSCRIPT_INCOMPLETE,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000488042|retained_intron||n.-3353C>T|||||3353|,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000526797|nonsense_mediated_decay||c.-3374C>T|||||3324|,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000526332|protein_coding||c.-8630C>T|||||3322|WARNING_TRANSCRIPT_NO_STOP_CODON,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000430786|nonsense_mediated_decay||c.-3374C>T|||||3357|,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000527719|protein_coding||c.-6029C>T|||||3311|WARNING_TRANSCRIPT_INCOMPLETE,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000530031|protein_coding||c.-3374C>T|||||3323|WARNING_TRANSCRIPT_INCOMPLETE,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000526113|retained_intron||n.-3345C>T|||||3345|,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000526904|retained_intron||n.-3324C>T|||||3324|,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000429572|retained_intron||n.-3316C>T|||||3316|,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000498173|retained_intron||n.-3295C>T|||||3295|,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000496353|retained_intron||n.-3371C>T|||||3371|,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000493534|processed_transcript||n.-3324C>T|||||3324|,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000534345|protein_coding||c.-3374C>T|||||3311|WARNING_TRANSCRIPT_INCOMPLETE,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000532952|processed_transcript||n.-3340C>T|||||3340|,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000490853|processed_transcript||n.-3333C>T|||||3333|,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000531019|nonsense_mediated_decay||c.-3374C>T|||||3322|,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000498476|protein_coding||c.-3374C>T|||||3322|WARNING_TRANSCRIPT_INCOMPLETE,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000470679|nonsense_mediated_decay||c.-3380C>T|||||3379|WARNING_TRANSCRIPT_NO_START_CODON,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000525285|nonsense_mediated_decay||c.-3371C>T|||||3369|WARNING_TRANSCRIPT_NO_START_CODON,A|upstream_gene_variant|MODIFIER|CPSF3L|ENSG00000127054|transcript|ENST00000530233|retained_intron||n.-3324C>T|||||3324|,A|downstream_gene_variant|MODIFIER|GLTPD1|ENSG00000224051|transcript|ENST00000488011|protein_coding||c.*511G>A|||||511|WARNING_TRANSCRIPT_INCOMPLETE,A|non_coding_exon_variant|MODIFIER|GLTPD1|ENSG00000224051|transcript|ENST00000464957|processed_transcript|2/2|n.1089G>A|||||| GT:AD:DP:GQ:PL 0/1:34,41:75:99:1138,0,787
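A record like this is easier to read with one ANN entry per line. The sketch below does that with awk, assuming a real tab-separated VCF file (the record as pasted here has lost its tabs). Within each entry, fields are separated by "|", and as the example record shows, field 1 is the allele, 2 the effect, 3 the impact, 4 the gene name, 7 the transcript ID and 10 the HGVS.c notation. split_ann is a hypothetical helper name.

```shell
# Hypothetical helper: print one line per ANN entry of an annotated VCF.
# Output columns: chrom:pos, allele, effect, impact, gene, transcript, HGVS.c
split_ann() {
    grep -v '^#' "$1" | awk -F'\t' -v OFS='\t' '
    {
        # INFO is column 8; its key=value items are separated by ";"
        n = split($8, info, ";")
        for (i = 1; i <= n; i++) {
            if (info[i] ~ /^ANN=/) {
                sub(/^ANN=/, "", info[i])
                # ANN entries are separated by ","
                m = split(info[i], ann, ",")
                for (j = 1; j <= m; j++) {
                    # fields inside one entry are separated by "|"
                    split(ann[j], f, "[|]")
                    print $1 ":" $2, f[1], f[2], f[3], f[4], f[7], f[10]
                }
            }
        }
    }'
}

# Usage: split_ann hg19_exon.snp.snpEff.vcf | less -S
```

For the site above, this would give one line per affected transcript, which makes it clear that almost all of the entries are upstream_gene_variant annotations for CPSF3L transcripts.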
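Interpreting the results:
As you can see, the result is quite complex; how much of it we understand reflects how well we understand the software.
For the details, please read the readme. There is a great deal of annotation information, and I find some of it redundant, so take what you need: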
1. chromosome_number_variation
2. exon_loss_variant
3. frameshift_variant
4. stop_gained
5. stop_lost
6. start_lost
7. splice_acceptor_variant
8. splice_donor_variant
9. rare_amino_acid_variant
10. missense_variant
11. inframe_insertion
12. disruptive_inframe_insertion
13. inframe_deletion
14. disruptive_inframe_deletion
15. 5_prime_UTR_truncation+exon_loss_variant
16. 3_prime_UTR_truncation+exon_loss
17. splice_branch_variant
18. splice_region_variant
19. splice_branch_variant
20. stop_retained_variant
21. initiator_codon_variant
22. synonymous_variant
23. initiator_codon_variant+non_canonical_start_codon
24. stop_retained_variant
25. coding_sequence_variant
26. 5_prime_UTR_variant
27. 3_prime_UTR_variant
28. 5_prime_UTR_premature_start_codon_gain_variant
29. upstream_gene_variant
30. downstream_gene_variant
31. TF_binding_site_variant
32. regulatory_region_variant
33. miRNA
34. custom
35. sequence_feature
36. conserved_intron_variant
37. intron_variant
38. intragenic_variant
39. conserved_intergenic_variant
40. intergenic_region
41. coding_sequence_variant
42. non_coding_exon_variant
43. nc_transcript_variant
44. gene_variant
45. chromosome
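To see which of these effect terms actually occur in your own file, you can tally them. The sketch below counts the effect field (field 2) of every ANN entry, assuming a tab-separated annotated VCF; combined effects that snpEff joins with "&" are counted separately. count_effects is a hypothetical helper name.

```shell
# Hypothetical helper: count how often each effect term appears in the
# ANN field of an annotated VCF. Output: count <TAB> effect, largest first.
count_effects() {
    grep -v '^#' "$1" | awk -F'\t' '
    {
        n = split($8, info, ";")
        for (i = 1; i <= n; i++) {
            if (info[i] ~ /^ANN=/) {
                sub(/^ANN=/, "", info[i])
                m = split(info[i], ann, ",")
                for (j = 1; j <= m; j++) {
                    split(ann[j], f, "[|]")
                    # one entry may carry several effects joined by "&"
                    k = split(f[2], eff, "&")
                    for (e = 1; e <= k; e++) count[eff[e]]++
                }
            }
        }
    }
    END { for (t in count) print count[t] "\t" t }' | sort -k1,1nr -k2,2
}

# Usage: count_effects hg19_exon.snp.snpEff.vcf
```

Note that this counts annotation entries, not variants: a single site such as the one shown earlier contributes one count per transcript it is annotated against.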
The official specification is at: http://snpeff.sourceforge.net/VCFannotationformat_v1.0.pdf