高效!!!替代BlastX的程序-ghostz

计算资源是有限的,

时间是有限的,耐心也是有限的

不可能为了一个并不绝对可靠的注释,等上一两个月

BlastX的速度,慢得无话可说

替代工具有很多,包括 ghostz 和 diamond

本文介绍ghostz这个工具

##########

#   ghostz使用手册

############

# ghostz主文献

http://bioinformatics.oxfordjournals.org/content/31/8/1183.full.pdf+html

# 作者发表的文献列表

http://www.researchgate.net/profile/Takashi_Ishida2/publications

# ghostz 主页

http://www.bi.cs.titech.ac.jp/ghostz/

# 主要策略是,对库的序列子串重叠区包括种子的序列进行聚类,缩减了最初步骤需要的时间

# 创建程序主文件夹

# 创建程序文件夹

mkdir/opt/bioSoftWare/ghostz

cd$_

# 下载程序文件

wget http://www.bi.cs.titech.ac.jp/ghostz/releases/ghostz-1.0.0.tar.gz

tar xvzf ghostz.tar.gz

cd ghostz

make    # 即可安装完毕

# 声明程序到环境

echo"export PATH=`pwd`:$PATH">>/etc/bash.bashrc

source /etc/bash.bashrc

##############

#   操作说明

###############

Usage

GHOSTZ requires specifically formatted database files for homology search.These files can be generated from FASTA formatted DNA/protein sequencefiles.

Users have to prepare a database file in FASTA format and convert it intoGHOSTZ format database files by using GHOSTZ "db" command at first.GHOSTZ"db" command requires 2 args ([-i dbFastaFile] and [-o dbName]). GHOSTZ "db" command divides adatabase FASTA file into several database chunks and generates several files (.inf,.ind,.nam,.pos,.seq).

# 注意这一句

All generated files are needed for the search. Users can specify the sizeof each chunk. Smaller chunk size requires smaller memory, but efficiency ofthe search will decrease.

#

For executing homology search, GHOSTZ "aln" command is used and that command requires at least2 args([-i qryName] and [-d dbName]).

# 示例

Example

ghostz db  -i ./db.fasta -o exdb

ghostz aln -i exqry -d exdb -oexout

# help

db: convert a FASTAfile to GHOSTX format database files

# 格式化fasta序列未ghostx数据库文件

ghostz db [-i dbFastaFile][-o dbName][-C clustering][-l chunkSize]

[-LclusteringSubsequenceLength]  [-s seedThreshold]

Options:

(Required)

-i STR    Protein sequences in FASTA format for adatabase

# 输入的蛋白fasta序列 -- 亦即是说,help有误,也可以是核酸序列...待试验?

-o STR    The name of the database

# 输出的数据库文件名字

(Optional)

-C STR    Clustering, T (enable) or F (disable)[T]

# 是否进行库聚类,这是Ghostz的核心,如果选择F, 就没必要用ghostz了

-l INT    Chunk size of the database (bytes)[1073741824(=1GB)]

# 数据库块大小 -- 应是会影响速度,个人认为,越大越快  1GB=1024*1024*1024=1073741824字节 # 注意 1GB 建库的时候内存会占用到41GB,比对的时候会到6.7GB # 这些都是针对KEGG蛋白库而言的

-L INT    Length of a subsequence for clustering [10]# 在KEGG蛋白库上,L=10 可以最好的权衡准确性敏感性和计算时间

# 用于聚类的子串长度 -- 应是越小越快 # 使用的是重叠区有种子的序列子串做clustering

-s INT    The seed threshold [39]

# 用于做seed的序列'复杂度'阈值?

# 见原文《Faster sequencehomology searches by clustering subsequences》

# 然而,我并没有读明白这个怎么计算来

# For example, when Tseed¼39, 'HDGLNP’ is not usedin the seed search because its score is 38 and does not exceed Tseed. However, 'HDGLNPA’is used in the seed search because its score is 42, which exceeds Tseed. Inaddition, in our implementation, the length of subsequences for seed isrestricted to 6–8 residues, because a perfect hash function is used.

aln:  Search homologues of queries from database

ghostz aln [-i queries][-o output][-d database][-v maxNumAliSub]

[-b maxNumAliQue][-h hitsSize][-l queriesChunkSize][-q queryType]

[-t databaseType][-F filter][-a numThreads]

Options:

(Required)

-i STR    Sequences in FASTA format  # 查询序列必须是fasta格式的

-o STR    Output file  # 输出文件名字

-d STR    database name (must be formatted)# 格式化好的数据库文件 .. 要后缀吗?

(Optional)

-v INT    Maximum number of alignments for eachsubject [1]  # 被比对到的序列,最高产生多少个比对结果(hsp?) # 这个待查看似乎并没有起作用,设置更高,也没有改变比对结果

-b INT    Maximum number of the output for a query [10]# 查询序列输出的结果个数上限(如果是做NR注释,然后转GO注释的话,建议至少20,20160423 如果是之后,可能设置到30甚至50以上都正常,因为太多predict)

-l INT    Chunk size of the queries (bytes)[134217728(=128MB)]# 查询序列同时用于查询的size大小,128Mb 应是满足所有转录组组装结果

-q STR    Query sequence type, p (protein) or d (dna)[p]# 查询序列的类型蛋白还是 DNA

-t STR    Database sequence type, p (protein) or d (dna)[p]# 数据库的序列类型蛋白还是 DNA

-F STR    Filter query sequence, T (enable) or F (disable)[T]   # 过滤查询序列与否?过滤,基于什么过滤?   # 这个待查看

-a INT    The number of threads [1]  # 使用的线程数

#

##########

#   示例

###########

# 建库

ghostz db  -i ./db.fasta -o exdb

# 比对

ghostz aln -i exqry-d exdb -o exout -a 16

# 建库

ghostz db -i nr -onrGhost

# 比对 blastx

ghostz aln -iAnanas.Unigene.fa -d /opt/bioSoftWare/ghostz/database/nrGhost -o exout -q d -t p -a 30

# 比对 blastn  # 似乎没办法使用

# ghostz aln -i Ananas.Unigene.fa -d/opt/bioSoftWare/ghostz/database/nrGhost -o exout -q d -t d -a 30

# ghostz aln -i sample.fa -d ghostzNT -o exout -q d -t d -a 30 -F F

##########

#   结果展示

############

Search results

GHOSTZ outputs the tab-deliminated file as search results.

Example)

hsa:124045...  hsa:124045...   100       139     0       0       1      139     1       139     2.04391e-76     283.878

hsa:124045...  ptr:454320...   99.2126        127     1       0       13     139     14      140     5.96068e-68     255.758

hsa:124045...  mcc:714360...   88.9764        127     14      0       13     139     14      140     5.05773e-59     226.098

hsa:124045...  rno:292078...   58.6777        121     46      2       13     133     14      130     1.38697e-32     138.272

hsa:124045...  mmu:320869...   55.9055        127     50      3       13     139     12      132     1.17414e-31     135.191

hsa:124045...  pon:100434...   96.4912        57      2       0       13     69      14      70      3.65839e-25     113.62

hsa:124045...  bta:100335...   44.9275        138     71      3       2      139     25      157     4.04482e-24     110.153

hsa:124045...  aml:100464...   26.6667        75      46      2       13     81      1183    1254    0.820692        32.7278

hsa:124045...  bfo:BRAFLD...   56    25      10      1       108    131     581     605     0.820692        32.7278

hsa:124045...  tgu:100227...   26.1682        107     69      3       25     130     150     247     1.8283131.5722

Each column shows;

1.  Name of a query sequence

2.  Name of a homologue sequence (subject)

3.  Sequence Identity # 这个对应的是一个hsp的

4.  Alignment length  # 这个对应全局的,应该是包含了gap长度 ##!!!!!!!!!!!!!! 注意,非常重要,对于blastx,需要乘以3 才能得到相当的核酸水平的比对结果长度

5.  The number of mismatches in the alignment # 对应的应该是当前 hsp 的

6.  The number of gap openingsin the alignemt

7.  Start position of the query in the alignment  # 整个query的

8.  End position of the query in the alignemnt    # 整个query的

9.  Start position of the subject in thealignment   # 整个subject的

10. End position of thesubject in the alignment # 整个subject的 --> 详细见下文

11. E-value   # ? 整个subject的还是第一个hsp的? 应该是整个subject的

12. Normalized score  # 应该是整个subject的

# 这个格式和blast+的-m 6 格式完全一致 (或者说几乎一致)

'qseqid sseqid pident length mismatch gapopen qstart qend sstart send

evalue bitscore'

#QueryName SubjectName   Identity   AlignmentLen  MisMatchesNum GapNums    QueryStart    QueryEnd   SubjectStart  SubjectEnd E-value    NormScore

comp113679_c0   gi|743859531|ref|XP_010942772.1| PREDICTED: uncharacterizedprotein LOC105060676 [Elaeis guineensis]   9.87124233     158     7       796    230     1       225     3.72427e-46     191.815

comp113679_c0   gi|743859531|ref|XP_010942772.1| PREDICTED: uncharacterizedprotein LOC105060676 [Elaeis guineensis]   21.03   233     132     7       796     230     1       225    3.72427e-46     191.815

comp113679_c0   gi|743859531|ref|XP_010942772.1| PREDICTED: uncharacterizedprotein LOC105060676 [Elaeis guineensis]   26.1803233     120     7       796    230     1       225     3.72427e-46     191.815

comp113679_c0   gi|743859531|ref|XP_010942772.1| PREDICTED: uncharacterizedprotein LOC105060676 [Elaeis guineensis]   42.9185233     81      7       796    230     1       225     3.72427e-46     191.815

comp113679_c0   gi|743859531|ref|XP_010942772.1| PREDICTED: uncharacterizedprotein LOC105060676 [Elaeis guineensis]   47.2103233     71      7       796    230     1       225     3.72427e-46     191.815

#

#######

#   比对结果过滤命令

######

# 进行比对

ghostz aln -ibna_cds.fa -d /opt/bioSoftWare/ghostz/database/nrGhost -oAllTrans.fa2nrGhost -q d -t p -a 30

# 进行evalue筛选

perl-F'\t'-lane 'print if ($F[10]<=1e-5)' AllTrans.fa2nrGhost >AllTrans.fa2nrGhost.1e_5.xls

# 按照queryCov筛选 !!! 注意,要计算queryCov的时候,blastx 必须对alignmentLen*3

perl -e 'my $label_pattern=q{>};my $minCov=0.33;my$file=shift;my $table=shift;my$ref=&Single_Label_Reader($file,$label_pattern);my $content;my%length;while($content=&$ref){my ($label,@content)=@$content;chomp($label,@content);$label=~s/^>//;$label=~s/[\r\n]//g;$seq=join q{},@content;$seq=~s/\s+//g;$length{$label}=length($seq)}openTABLE,q{<},$table;while(<TABLE>){($queryName,$alignmentLen) = (split/\t/,$_)[0,3];print if ($alignmentLen*3/$length{$queryName} >= $minCov);}subSingle_Label_Reader {my $file=shift;my $label_pattern=shift;die"Error:$file unreachable.\n" unless (-s $file);open IN,q{<},$fileor die "Error:$file unreadable.\n";my $temp1=<IN>;my$temp2=$temp1;my @temp;return sub {return q{} if(eof(IN));while(!eof(IN)){$temp1=$temp2;@temp=();while(<IN>){if(/$label_pattern/){$temp2=$_;last;}push @temp,$_;}return [$temp1,@temp];}}}' in.fastaAllTrans.fa2nrGhost.1e_5.xls > AllTrans.fa2nrGhost.1e_5.0.33.xls

# 这一步筛选掉的很少 936974 -> 927141

# 针对表格,添加Header

perl-F'\t'-lane 'BEGIN{print"QueryName\tSubjectName\tIdentity\tAlignmentLen\tMisMatchesNum\tGapNums\tQueryStart\tQueryEnd\tSubjectStart\tSubjectEnd\tE-value\tNormScore"}$F[0]=~s/(.*)_i1/$1/;printjoin qq{\t},@F' CaixinUnigene2plantNR.table.filter.xls > NR.anno

# 提取query2gi 表格

perl-lane '@gi=map {/gi\|(\d+)/g} $_;$gi{$F[0]}->{$_}=1 for@gi;}for(sort keys %gi){print qq{$_\t},join qq{,},keys %{$gi{$_}}}{'pineappleGenomeCDS2plant_nr.1e-5.0.33.table > query2gi.table

# 提取物种分布

perl-F'\t'-lane 'next if /^#/;next if ($seen{$F[0]}++);next unless($F[1]=~/[\[\]]/);$F[1]=~s/.*\[(.*?)\].*$/$1/;print $F[1]'Unigenes.fasta2nrGhost.1e_5.xls|sort|uniq -c|sort -n -r -k1|perl-lne 's/^\s+//;@F=split/\s+/,$_,2;print qq{$F[1]\t$F[0]}'> species.distribution.count.xls

# 去除亚种品种等的影响 --- 由于是双名法那么就是

perl-F'\t'-lane '$F[0]=~s/^(\S+\s+\S+).*/$1/;$count{$F[0]}+=$F[1]}for(sort {$count{$b} <=> $count{$a}} keys %count){printqq{$_\t$count{$_}}}{'     ecies.distribution.count.xls >species.distribution.count.mod.xls

###########

# 绘制显示百分比的热图

############

dd<-read.table("species.distribution.count.mod.xls",header=F,sep="\t",stringsAsFactors=F)

head(dd)

# 重新排序一遍并保存前20个

sepNums<-9

dd_top<-head(dd[order(dd$V2,decreasing=T),],sepNums-1);dd_top

dd_down<-tail(dd[order(dd$V2,decreasing=T),],nrow(dd)-sepNums+1)

tail(dd[order(dd$V2,decreasing=T),],nrow(dd)-sepNums+1)

dd_down<-data.frame("others",sum(dd_down$V2));dd_down

colnames(dd_down)<-c("V1","V2")

dd_final<-rbind(dd_top,dd_down)

str(dd_final)

library(ggplot2)

p<-ggplot(dd_final)

head(dd_final)

library(RColorBrewer)

# colours<-brewer.pal(9, "YlGn")

# colours<-rev(colours)

# colours<-colours[order(dd_final$V1)]

#

# 自己调一个颜色

Newcolor<-colorRampPalette(c("darkgreen","orange","lightyellow"))(100)[c(seq(1,50,by=2),seq(50,100,by=2))]

# barplot(rep(1,length(Newcolor)),col=Newcolor)

colours<-Newcolor[seq(1,length(Newcolor),length(Newcolor)/dim(dd_final)[1])]

colours<-colours[order(dd_final$V1)]

tiff(file="SpeciesDisCumSum.tiff",res=300,width=(600*4.17),height=(600*4.17))

p+geom_bar(aes(x=factor(1),y=V2,fill=V1),stat="identity")+geom_text(aes(x=factor(1),y=V2,label=ifelse(V2/sum(V2)>0.05,paste0(round(cumsum(V2)/sum(V2)*100,2),"%"),"")),stat="identity",position="stack")+coord_polar(theta="y")+theme(axis.ticks=element_blank(),axis.text.y=element_blank(),axis.title.y=element_blank(),panel.grid.major=element_line(linetype="dashed",colour="grey"),panel.background=element_rect(fill="white"))+scale_fill_manual(guide = guide_legend(title ="SpeciesDistribution"),values = colours)+xlab("")+ylab("")

dev.off()

tiff(file="SpeciesDisSum.tiff",res=300,width=(600*4.17),height=(600*4.17))

p+geom_bar(aes(x=factor(1),y=V2,fill=V1),stat="identity")+geom_text(aes(x=factor(1),y=cumsum(V2)-V2/2,label=ifelse(V2/sum(V2)>0.05,paste0(round(V2/sum(V2)*100,2),"%"),(""))),stat="identity")+coord_polar(theta="y")+theme(axis.ticks=element_blank(),axis.text.y=element_blank(),axis.title.y=element_blank(),panel.grid.major=element_line(linetype="dashed",colour="grey"),panel.background=element_rect(fill="white"))+xlab("")+ylab("")+scale_fill_manual(guide = guide_legend(title ="SpeciesDistribution"),values = colours)

dev.off()

#

(0)

相关推荐