Highly efficient!!! A replacement for BlastX: ghostz
Compute resources are limited,
time is limited, and so is patience.
It is not feasible to wait one or two months for an annotation that is not absolutely reliable anyway.
BlastX is unbearably slow.
There are many alternative tools, including ghostz and diamond.
This article introduces ghostz.
##########
# ghostz user manual
############
# ghostz primary publication
http://bioinformatics.oxfordjournals.org/content/31/8/1183.full.pdf+html
# The author's publication list
http://www.researchgate.net/profile/Takashi_Ishida2/publications
# ghostz homepage
http://www.bi.cs.titech.ac.jp/ghostz/
# The main strategy: cluster the database subsequences whose overlap regions contain the seeds, which cuts down the time needed for the initial search step
# Create the program directory
mkdir -p /opt/bioSoftWare/ghostz
cd $_
# Download the release tarball
wget http://www.bi.cs.titech.ac.jp/ghostz/releases/ghostz-1.0.0.tar.gz
tar xvzf ghostz-1.0.0.tar.gz
cd ghostz-1.0.0   # adjust if the extracted directory has a different name
make              # that completes the installation; the ghostz binary is built in place
# Add the program to the environment (PATH)
echo "export PATH=$(pwd):\$PATH" >> /etc/bash.bashrc
source /etc/bash.bashrc
##############
# Usage notes
###############
Usage
GHOSTZ requires specifically formatted database files for homology search. These files can be generated from FASTA formatted DNA/protein sequence files.
Users have to prepare a database file in FASTA format and convert it into GHOSTZ format database files by using the GHOSTZ "db" command at first. The GHOSTZ "db" command requires 2 args ([-i dbFastaFile] and [-o dbName]). The GHOSTZ "db" command divides a database FASTA file into several database chunks and generates several files (.inf, .ind, .nam, .pos, .seq).
# Note this sentence
All generated files are needed for the search. Users can specify the size of each chunk. Smaller chunk size requires smaller memory, but efficiency of the search will decrease.
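# If database construction runs out of memory, the chunk size can be lowered through the -l option documented below (value in bytes). A minimal sketch; the 256 MB figure is just an illustrative choice, not a ghostz recommendation:
# 256*1024*1024 = 268435456 bytes; smaller chunks use less memory, but the search itself will be somewhat slower
ghostz db -i ./db.fasta -o exdb -l 268435456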
#
For executing homology search, the GHOSTZ "aln" command is used, and that command requires at least 2 args ([-i qryName] and [-d dbName]).
# Example
ghostz db -i ./db.fasta -o exdb
ghostz aln -i exqry -d exdb -o exout
# help
db: convert a FASTA file to GHOSTX format database files
# Format a FASTA file into ghostz database files
ghostz db [-i dbFastaFile] [-o dbName] [-C clustering] [-l chunkSize]
          [-L clusteringSubsequenceLength] [-s seedThreshold]
Options:
(Required)
-i STR Protein sequences in FASTA format for a database
# Input protein FASTA sequences -- in other words, if the help is wrong here, nucleotide sequences might also be accepted... to be tested?
-o STR The name of the database
# Name of the output database files
(Optional)
-C STR Clustering, T (enable) or F (disable) [T]
# Whether to cluster the database; this is the core of ghostz. If you set F, there is little point in using ghostz at all.
-l INT Chunk size of the database (bytes) [1073741824 (=1GB)]
# Database chunk size -- presumably affects speed; my expectation is that bigger is faster. 1 GB = 1024*1024*1024 = 1073741824 bytes.
# Note: with the 1 GB chunk size, memory usage reached about 41 GB while building the database and about 6.7 GB during alignment.
# Those figures are for the KEGG protein database.
-L INT Length of a subsequence for clustering [10]
# On the KEGG protein database, L=10 gives the best trade-off between accuracy/sensitivity and computing time.
# Length of the subsequence used for clustering -- presumably smaller is faster.
# Clustering uses the subsequences whose overlap regions contain a seed.
-s INT The seed threshold [39]
# A threshold on the 'complexity' (score) of subsequences used as seeds?
# See the original paper, "Faster sequence homology searches by clustering subsequences".
# However, I have not worked out exactly how this score is computed.
# For example, when Tseed = 39, 'HDGLNP' is not used in the seed search because its score is 38 and does not exceed Tseed. However, 'HDGLNPA' is used in the seed search because its score is 42, which exceeds Tseed. In addition, in our implementation, the length of subsequences for seed is restricted to 6–8 residues, because a perfect hash function is used.
aln: Search homologues of queries from database
ghostz aln [-i queries] [-o output] [-d database] [-v maxNumAliSub]
           [-b maxNumAliQue] [-h hitsSize] [-l queriesChunkSize] [-q queryType]
           [-t databaseType] [-F filter] [-a numThreads]
Options:
(Required)
-i STR Sequences in FASTA format # Query sequences must be in FASTA format
-o STR Output file # Name of the output file
-d STR database name (must be formatted) # The formatted database .. a suffix? Judging from the exdb example above, no: pass the same name given to -o when the database was built
(Optional)
-v INT Maximum number of alignments for each subject [1] # For each subject hit, the maximum number of alignments (HSPs?) reported # To be checked: this does not seem to take effect -- setting it higher did not change the results
-b INT Maximum number of the output for a query [10] # Upper limit on the number of hits reported per query (for NR annotation that is then converted to GO annotation, at least 20 is recommended; 2016-04-23: nowadays 30 or even 50+ may be reasonable because of all the PREDICTED entries) -- see the example after this option list
-l INT Chunk size of the queries (bytes) [134217728 (=128MB)] # Size of the query batch searched at a time; 128 MB should be enough for any transcriptome assembly
-q STR Query sequence type, p (protein) or d (dna) [p] # Whether the queries are protein or DNA
-t STR Database sequence type, p (protein) or d (dna) [p] # Whether the database is protein or DNA
-F STR Filter query sequence, T (enable) or F (disable) [T] # Whether to filter the query sequences? If so, filter based on what? # To be checked
-a INT The number of threads [1] # Number of threads to use
#
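# Putting these options together, a possible blastx-style run with a deeper per-query hit list (a sketch; the file names are placeholders, and -b 20 simply follows the NR/GO suggestion above):
ghostz aln -i Unigene.fa -d nrGhost -o Unigene2nrGhost.tab -q d -t p -b 20 -a 16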
##########
# Examples
###########
# Build the database
ghostz db -i ./db.fasta -o exdb
# Run the search
ghostz aln -i exqry -d exdb -o exout -a 16
# Build a database from NR
ghostz db -i nr -o nrGhost
# blastx-style search (DNA queries against a protein database)
ghostz aln -i Ananas.Unigene.fa -d /opt/bioSoftWare/ghostz/database/nrGhost -o exout -q d -t p -a 30
# blastn-style search (DNA against DNA) # does not seem to work
# ghostz aln -i Ananas.Unigene.fa -d /opt/bioSoftWare/ghostz/database/nrGhost -o exout -q d -t d -a 30
# ghostz aln -i sample.fa -d ghostzNT -o exout -q d -t d -a 30 -F F
##########
# Output format
############
Search results
GHOSTZ outputs a tab-delimited file as search results.
Example)
hsa:124045... hsa:124045... 100 139 0 0 1 139 1 139 2.04391e-76 283.878
hsa:124045... ptr:454320... 99.2126 127 1 0 13 139 14 140 5.96068e-68 255.758
hsa:124045... mcc:714360... 88.9764 127 14 0 13 139 14 140 5.05773e-59 226.098
hsa:124045... rno:292078... 58.6777 121 46 2 13 133 14 130 1.38697e-32 138.272
hsa:124045... mmu:320869... 55.9055 127 50 3 13 139 12 132 1.17414e-31 135.191
hsa:124045... pon:100434... 96.4912 57 2 0 13 69 14 70 3.65839e-25 113.62
hsa:124045... bta:100335... 44.9275 138 71 3 2 139 25 157 4.04482e-24 110.153
hsa:124045... aml:100464... 26.6667 75 46 2 13 81 1183 1254 0.820692 32.7278
hsa:124045... bfo:BRAFLD... 56 25 10 1 108 131 581 605 0.820692 32.7278
hsa:124045... tgu:100227... 26.1682 107 69 3 25 130 150 247 1.82831 31.5722
Each column shows:
1. Name of a query sequence
2. Name of a homologue sequence (subject)
3. Sequence identity # this is per HSP
4. Alignment length # for the whole alignment, presumably including gaps ## !!! Note, very important: for blastx-style searches this length is in amino acids, so multiply by 3 to get the equivalent nucleotide-level alignment length (see the one-liner after the example rows below)
5. The number of mismatches in the alignment # presumably for the current HSP
6. The number of gap openings in the alignment
7. Start position of the query in the alignment # on the whole query
8. End position of the query in the alignment # on the whole query
9. Start position of the subject in the alignment # on the whole subject
10. End position of the subject in the alignment # on the whole subject --> see below for details
11. E-value # for the whole subject or only the first HSP? Probably the whole subject
12. Normalized score # probably for the whole subject
# This format matches the BLAST+ tabular output format (-outfmt 6) exactly (or at least almost exactly)
'qseqid sseqid pident length mismatch gapopen qstart qend sstart send
evalue bitscore'
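# For reference, a BLAST+ run producing this same 12-column layout would look roughly like this (a sketch; the BLAST database name is a placeholder and is not part of this pipeline):
blastx -query exqry -db nr_blastdb -evalue 1e-5 -num_threads 16 -outfmt 6 -out exout.blastx.tab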
#QueryName SubjectName Identity AlignmentLen MisMatchesNum GapNums QueryStart QueryEnd SubjectStart SubjectEnd E-value NormScore
comp113679_c0 gi|743859531|ref|XP_010942772.1| PREDICTED: uncharacterized protein LOC105060676 [Elaeis guineensis] 9.87124 233 158 7 796 230 1 225 3.72427e-46 191.815
comp113679_c0 gi|743859531|ref|XP_010942772.1| PREDICTED: uncharacterized protein LOC105060676 [Elaeis guineensis] 21.03 233 132 7 796 230 1 225 3.72427e-46 191.815
comp113679_c0 gi|743859531|ref|XP_010942772.1| PREDICTED: uncharacterized protein LOC105060676 [Elaeis guineensis] 26.1803 233 120 7 796 230 1 225 3.72427e-46 191.815
comp113679_c0 gi|743859531|ref|XP_010942772.1| PREDICTED: uncharacterized protein LOC105060676 [Elaeis guineensis] 42.9185 233 81 7 796 230 1 225 3.72427e-46 191.815
comp113679_c0 gi|743859531|ref|XP_010942772.1| PREDICTED: uncharacterized protein LOC105060676 [Elaeis guineensis] 47.2103 233 71 7 796 230 1 225 3.72427e-46 191.815
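# As flagged for column 4 above, blastx-style (-q d -t p) alignment lengths are reported in amino acids. A minimal sketch (not part of ghostz; file names are placeholders) that rescales that column to the nucleotide level before any coverage calculation:
perl -F'\t' -lane '$F[3] *= 3; print join qq{\t}, @F' exout > exout.ntAlnLen.xls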
#
#######
# Commands for filtering the search results
######
# Run the search
ghostz aln -i bna_cds.fa -d /opt/bioSoftWare/ghostz/database/nrGhost -o AllTrans.fa2nrGhost -q d -t p -a 30
# Filter by E-value
perl -F'\t' -lane 'print if ($F[10]<=1e-5)' AllTrans.fa2nrGhost > AllTrans.fa2nrGhost.1e_5.xls
# Filter by query coverage !!! Note: when computing queryCov for blastx-style results, alignmentLen must be multiplied by 3
perl -e 'my $label_pattern=q{>};my $minCov=0.33;my $file=shift;my $table=shift;my $ref=&Single_Label_Reader($file,$label_pattern);my $content;my %length;while($content=&$ref){my ($label,@content)=@$content;chomp($label,@content);$label=~s/^>//;$label=~s/[\r\n]//g;$seq=join q{},@content;$seq=~s/\s+//g;$length{$label}=length($seq)}open TABLE,q{<},$table;while(<TABLE>){($queryName,$alignmentLen)=(split /\t/,$_)[0,3];print if ($alignmentLen*3/$length{$queryName} >= $minCov);}sub Single_Label_Reader {my $file=shift;my $label_pattern=shift;die "Error:$file unreachable.\n" unless (-s $file);open IN,q{<},$file or die "Error:$file unreadable.\n";my $temp1=<IN>;my $temp2=$temp1;my @temp;return sub {return q{} if(eof(IN));while(!eof(IN)){$temp1=$temp2;@temp=();while(<IN>){if(/$label_pattern/){$temp2=$_;last;}push @temp,$_;}return [$temp1,@temp];}}}' in.fasta AllTrans.fa2nrGhost.1e_5.xls > AllTrans.fa2nrGhost.1e_5.0.33.xls
# This step filters out very little: 936974 -> 927141 lines
# Add a header row to the table
perl -F'\t' -lane 'BEGIN{print "QueryName\tSubjectName\tIdentity\tAlignmentLen\tMisMatchesNum\tGapNums\tQueryStart\tQueryEnd\tSubjectStart\tSubjectEnd\tE-value\tNormScore"}$F[0]=~s/(.*)_i1/$1/;print join qq{\t},@F' CaixinUnigene2plantNR.table.filter.xls > NR.anno
# Extract a query-to-GI table
perl -lane '@gi=map {/gi\|(\d+)/g} $_;$gi{$F[0]}->{$_}=1 for @gi;}for(sort keys %gi){print qq{$_\t},join qq{,},keys %{$gi{$_}}}{' pineappleGenomeCDS2plant_nr.1e-5.0.33.table > query2gi.table
# Extract the species distribution
perl -F'\t' -lane 'next if /^#/;next if ($seen{$F[0]}++);next unless($F[1]=~/[\[\]]/);$F[1]=~s/.*\[(.*?)\].*$/$1/;print $F[1]' Unigenes.fasta2nrGhost.1e_5.xls | sort | uniq -c | sort -n -r -k1 | perl -lne 's/^\s+//;@F=split /\s+/,$_,2;print qq{$F[1]\t$F[0]}' > species.distribution.count.xls
# Remove the influence of subspecies, cultivars, etc. --- with binomial nomenclature, keep only the first two words of each name
perl -F'\t' -lane '$F[0]=~s/^(\S+\s+\S+).*/$1/;$count{$F[0]}+=$F[1]}for(sort {$count{$b} <=> $count{$a}} keys %count){print qq{$_\t$count{$_}}}{' species.distribution.count.xls > species.distribution.count.mod.xls
###########
# Plot a pie chart showing the percentages
############
dd<-read.table("species.distribution.count.mod.xls",header=F,sep="\t",stringsAsFactors=F)
head(dd)
# Re-sort and keep the top categories (sepNums-1 of them), lumping the rest into "others"
sepNums<-9
dd_top<-head(dd[order(dd$V2,decreasing=T),],sepNums-1);dd_top
dd_down<-tail(dd[order(dd$V2,decreasing=T),],nrow(dd)-sepNums+1)
tail(dd[order(dd$V2,decreasing=T),],nrow(dd)-sepNums+1)
dd_down<-data.frame("others",sum(dd_down$V2));dd_down
colnames(dd_down)<-c("V1","V2")
dd_final<-rbind(dd_top,dd_down)
str(dd_final)
library(ggplot2)
p<-ggplot(dd_final)
head(dd_final)
library(RColorBrewer)
# colours<-brewer.pal(9, "YlGn")
# colours<-rev(colours)
# colours<-colours[order(dd_final$V1)]
#
# Build a custom colour palette
Newcolor<-colorRampPalette(c("darkgreen","orange","lightyellow"))(100)[c(seq(1,50,by=2),seq(50,100,by=2))]
# barplot(rep(1,length(Newcolor)),col=Newcolor)
colours<-Newcolor[seq(1,length(Newcolor),length(Newcolor)/dim(dd_final)[1])]
colours<-colours[order(dd_final$V1)]
tiff(file="SpeciesDisCumSum.tiff",res=300,width=(600*4.17),height=(600*4.17))
p+geom_bar(aes(x=factor(1),y=V2,fill=V1),stat="identity")+geom_text(aes(x=factor(1),y=V2,label=ifelse(V2/sum(V2)>0.05,paste0(round(cumsum(V2)/sum(V2)*100,2),"%"),"")),stat="identity",position="stack")+coord_polar(theta="y")+theme(axis.ticks=element_blank(),axis.text.y=element_blank(),axis.title.y=element_blank(),panel.grid.major=element_line(linetype="dashed",colour="grey"),panel.background=element_rect(fill="white"))+scale_fill_manual(guide = guide_legend(title ="SpeciesDistribution"),values = colours)+xlab("")+ylab("")
dev.off()
tiff(file="SpeciesDisSum.tiff",res=300,width=(600*4.17),height=(600*4.17))
p+geom_bar(aes(x=factor(1),y=V2,fill=V1),stat="identity")+geom_text(aes(x=factor(1),y=cumsum(V2)-V2/2,label=ifelse(V2/sum(V2)>0.05,paste0(round(V2/sum(V2)*100,2),"%"),(""))),stat="identity")+coord_polar(theta="y")+theme(axis.ticks=element_blank(),axis.text.y=element_blank(),axis.title.y=element_blank(),panel.grid.major=element_line(linetype="dashed",colour="grey"),panel.background=element_rect(fill="white"))+xlab("")+ylab("")+scale_fill_manual(guide = guide_legend(title ="SpeciesDistribution"),values = colours)
dev.off()
#