在R里面玩转vcf文件
看到一个成熟的R包,所以推荐一下
https://knausb.github.io/vcfR_documentation/
https://cran.r-project.org/web/packages/vcfR/vignettes/intro_to_vcfR.html
最重要的使用方法,当然是读入vcf文件啦,如下:
library(vcfR)
vcf_file='/Users/jmzeng/germline/merge.dbsnp.vcf'
vcf <- read.vcfR( vcf_file, verbose = FALSE )
十几秒钟就轻轻松松读入一个300多M的vcf文件啦,成为一个S4对象:
> vcf
***** Object of Class vcfR *****
39 samples
24 CHROMs
212,875 variants
Object size: 356.6 Mb
0 percent missing data
***** ***** *****
>
懂R的朋友就应该是知道如何去查看帮助文档来学习它啦,最重要的两句话是:
Objects of class vcfR can be manipulated with vcfR-method and extract.gt.
Contents of the vcfR object can be visualized with the plot method.
最基本的3个元素是:
meta
character vector for the meta information
fix
matrix for the fixed information
gt
matrix for the genotype information
其中meta存储着vcf的头文件,而fix存储在vcf的固定列,gt存储在样本基因型信息。
最基本的操作函数如下:
show(object)
colnames(vcf@fix)
vcf@fix[1:4,1:4]
colnames(vcf@gt)
vcf@gt[1:4,1:4]
head(x, n = 6, maxchar = 80)
plot(x, y, ...) ## 主要查看突变位点的质量值的分布情况
x[i, j, samples = NULL, ..., drop]
dim(x)
nrow(x)
解析出来了读入的文件所变成的对象,后续分析就简单多了。
包自带的测试数据
pkg <- "pinfsc50"
vcf_file <- system.file("extdata", "pinf_sc50.vcf.gz", package = pkg)
dna_file <- system.file("extdata", "pinf_sc50.fasta", package = pkg)
gff_file <- system.file("extdata", "pinf_sc50.gff", package = pkg)
library(vcfR)
vcf <- read.vcfR( vcf_file, verbose = FALSE )
dna <- ape::read.dna(dna_file, format = "fasta")
gff <- read.table(gff_file, sep="\t", quote="")
library(vcfR)
chrom <- create.chromR(name='Supercontig', vcf=vcf, seq=dna, ann=gff)
plot(chrom)
chromoqc(chrom, dp.alpha=20)
chromoqc(chrom, xlim=c(5e+05, 6e+05))
可以出一下还不错的图,我觉得颜值还行!
■ ■ ■