3 - Learning Metagenomics with Science: Read-Based One-Stop Taxonomic and Functional Annotation with MEGAN, and Analysis of the Results

Preface:

Our "learning metagenomics with Science" series has finally produced its first batch of results. Today's tool, MEGAN, is a solid annotation pipeline: its all-in-one mapping database, even in the Community Edition, supports annotation against many reference databases. On top of this pipeline we also cleaned up the resulting tables, and it is worth noting that the cleanup is done in R. Don't say R can't handle this kind of work: modern R can, and in speed it can even outrun pandas. Try it yourself if you don't believe me.

Step 10: One-stop taxonomic and functional annotation of the results with MEGAN

Learning the MEGAN workflow

Installing MEGAN

# Download the software
wget -c https://software-ab.informatik.uni-tuebingen.de/download/megan6/MEGAN_Community_unix_6_21_1.sh

# NCBI-nr accession to taxonomy and functional annotation mapping
wget -c https://software-ab.informatik.uni-tuebingen.de/download/megan6/megan-map-Jan2021.db.zip

# Nucleotide accession to taxonomy mapping, 655 MB
wget -c https://software-ab.informatik.uni-tuebingen.de/download/megan6/megan-nucl-Jan2021.db.zip
# Install Java
sudo apt install openjdk-13-jre-headless
# Install MEGAN
./MEGAN_Community_unix_6_21_1.sh

First, run taxonomic and functional annotation on the DAA files produced by DIAMOND:

~/megan/tools/daa2rma -i ./diamond/SUBERR793599.1.daa ./diamond/SUBERR793599.2.daa --paired -ms 50 -me 0.01 -top 50 -mdb ~/db/megan/megan-map-Oct2019.db -o ./diamond/SUBERR793599.rma

The next step is to extract information from the RMA files; here I use the rma2info command. The Science paper instead used the Blast2LCA tool to go directly from the DAA files to taxonomic assignments, but Blast2LCA is no longer shipped with the latest MEGAN. The issue has been discussed on the MEGAN community forum: Blast2LCA could not produce reproducible results. So below we use rma2info, which has been available since earlier versions and now takes the place of Blast2LCA.

Extracting taxonomic and functional information

# Extract the taxonomic annotation
# ~/megan/tools/rma2info -i ./diamond/SUBERR793599.rma -c2c Taxonomy -r2c Taxonomy -n true --paths true --ranks true --list true --listMore true --bacteriaOnly true -v > ./diamond/SUBERR793599Taxonomy.txt

~/megan/tools/rma2info -i ./diamond/SUBERR793599.rma -c2c Taxonomy -r2c Taxonomy --paths true --list true > ./diamond/SUBERR793599Taxonomy.txt

# Extract the EGGNOG annotation
# We only extract the raw annotation file here; it is merged and tidied later with a small utility.
~/megan/tools/rma2info -i ./diamond/SUBERR793599.rma -r2c EGGNOG --paths > ./diamond/SUBERR793599eggnog.txt

# Extract the SEED annotation
~/megan/tools/rma2info -i ./diamond/SUBERR793599.rma -r2c SEED --paths > ./diamond/SUBERR793599SEED.txt

# Extract the INTERPRO2GO annotation
~/megan/tools/rma2info -i ./diamond/SUBERR793599.rma -r2c INTERPRO2GO --paths > ./diamond/SUBERR793599INTERPRO2GO.txt

Batch run

# ${base} holds the sample name prefix
for i in ./rawdata_0.01/*_1.fq.gz
do
    base=$(basename $i _1.fq.gz)
    echo $base

    # MEGAN annotation
    ~/megan/tools/daa2rma -i ./diamond/${base}.1.daa ./diamond/${base}.2.daa --paired -ms 50 -me 0.01 -top 50 -mdb ~/db/megan/megan-map-Oct2019.db -o ./diamond/${base}.rma

    # Extract the taxonomic annotation
    ~/megan/tools/rma2info -i ./diamond/${base}.rma -c2c Taxonomy -r2c Taxonomy --paths true --list true > ./diamond/${base}Taxonomy.txt

    # Extract the EGGNOG annotation (raw output only; merged and tidied later)
    ~/megan/tools/rma2info -i ./diamond/${base}.rma -r2c EGGNOG --paths > ./diamond/${base}eggnog.txt

    # Extract the SEED annotation
    ~/megan/tools/rma2info -i ./diamond/${base}.rma -r2c SEED --paths > ./diamond/${base}SEED.txt

    # Extract the INTERPRO2GO annotation
    ~/megan/tools/rma2info -i ./diamond/${base}.rma -r2c INTERPRO2GO --paths > ./diamond/${base}INTERPRO2GO.txt
done

Downstream processing of the MEGAN annotation data

We extract the raw annotation files; working from these makes the downstream analysis much more convenient.

~/megan/tools/rma2info -i ./diamond/MEGAN_RMA/C1.rma -r2c Taxonomy --paths --majorRanksOnly > ./diamond/C1Taxonomycs.txt

# Light cleanup with the sed commands below
sed -i "s/\x27/ /" ./diamond/C1Taxonomycs.txt
sed -i 's|size=||' ./diamond/C1Taxonomycs.txt
sed -i "s/$(echo '\t')/;/" ./diamond/C1Taxonomycs.txt
#get rid of single/double quotes in taxon names KP
sed -i "s/'//g" ./diamond/C1Taxonomycs.txt
sed -i 's/"//g' ./diamond/C1Taxonomycs.txt

~/megan/tools/rma2info --in ./diamond/MEGAN_RMA/S1.rma --read2class Taxonomy --paths --majorRanksOnly > ./diamond/S1Taxonomycs.txt

sed -i "s/\x27/ /" ./diamond/S1Taxonomycs.txt
sed -i 's|size=||' ./diamond/S1Taxonomycs.txt
sed -i "s/$(echo '\t')/;/" ./diamond/S1Taxonomycs.txt
#get rid of single/double quotes in taxon names KP
sed -i "s/'//g" ./diamond/S1Taxonomycs.txt
sed -i 's/"//g' ./diamond/S1Taxonomycs.txt
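Incidentally, since the preface mentions doing this kind of cleanup in R, the same substitutions can be sketched in a few lines of base R. This is only a minimal equivalent under the assumption that the rma2info output is two tab-separated columns (read ID, then the semicolon-separated taxonomic path); the file names are just the examples above, and the output name is a placeholder rather than an in-place edit:

# Minimal R version of the sed cleanup above (assumed input/output names)
lines <- readLines("./diamond/S1Taxonomycs.txt")
lines <- sub("\t", ";", lines)       # join read ID and taxonomic path with ";" (first tab only, like the sed call)
lines <- gsub("size=", "", lines)    # drop any "size=" tags
lines <- gsub("['\"]", "", lines)    # strip single and double quotes from taxon names
writeLines(lines, "./diamond/S1Taxonomycs_clean.txt")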

Extracting the EGGNOG annotation

Again, we only extract the raw annotation file; the merging and tidying is done later with a small utility.

~/megan/tools/rma2info -i ./diamond/MEGAN_RMA/C1.rma -r2c EGGNOG --paths > ./diamond/C1eggnogcs.txt

Extracting the SEED annotation

~/megan/tools/rma2info -i ./diamond/MEGAN_RMA/C1.rma -r2c SEED --paths > ./diamond/C1SEED.txt

Extracting the INTERPRO2GO annotation

~/megan/tools/rma2info -i ./diamond/MEGAN_RMA/C1.rma -r2c INTERPRO2GO --paths > ./diamond/C1INTERPRO2GO.txt

Tidying the MEGAN annotation results

Next step: build the tables that carry the MEGAN output into downstream analysis

  • Taxonomy table: read-level annotation here only goes down to species. First remove duplicates, then collapse by species, merge all samples into one table, and construct OTU-style names (giving a table that looks like an amplicon taxonomy table).

  • The read-based functional annotations can likewise only be collapsed by functional unit, analogous to the taxonomy file.

I wrote a full set of R scripts to process the annotation results; everything is assembled into phyloseq objects that feed directly into my downstream analysis workflow.

#---- Tidy the Taxonomy annotation table ----#
# -- Load packages and data
library(tidyverse)
library(phyloseq)

path <- "./"
fl.0 =dir(path, pattern = c("Taxonomy.txt"), full.names = TRUE, ignore.case = TRUE)

for (i in 1:length(fl.0)) {
file = fl.0[i]
samp.name <- sapply(strsplit(basename(file), "Taxonomy.txt"), `[`, 1)

otut <- read.delim(file,header = F)

#-Drop the per-rank summary rows that MEGAN also writes (their first column contains "root")
otut.1 <- otut[!str_detect(otut$V1, "root"),]

head(otut.1)
dim(otut.1)
# Remove duplicated rows; the rows dropped here have identical read IDs and identical annotations
otut.1 %>% distinct( V1,V2, .keep_all = TRUE) %>%
dim()
otut.2 <- otut.1 %>% distinct( V1,V2, .keep_all = TRUE)
# Tally counts per taxon

otut.3 <- as.data.frame(table(otut.2$V2))

head(otut.3)
colnames(otut.3) <- c("ASV",samp.name)

write.csv(otut.3,paste(samp.name,"_count.csv",sep = ""),row.names = F)

}

# Merge the per-sample taxonomy count tables
path <- "./"
fl.1 =dir(path, pattern = c("_count.csv"), full.names = TRUE, ignore.case = TRUE)

if (length(fl.1) > 2) {
data <- read.csv(fl.1[1]) %>% full_join(read.csv(fl.1[2 ]),by = "ASV")
for (i in 3:length(fl.1)) {
data <- data %>%
full_join(read.csv(fl.1[i ]),by = "ASV")
}

}else if(length(fl.1) == 2 ) {
data <- read.csv(fl.1[1]) %>% full_join(read.csv(fl.1[2 ]),by = "ASV")
}

head(data)

write.csv(data,"./otu_count.csv",row.names = F)

#--Build the taxonomy annotation table

otutax = separate(data, col = ASV,
into = c("subKingdom","sub2Kingdom","Kingdom","Phylum","subPhylum","Class","Order","Family","Genus","Species"),
sep = ";",
remove = F)
head(otutax)
otutax$OTUname = paste("ASV_",1:length(data$ASV),sep = "")

otu = otutax[,c(colnames(data)[-1])]
row.names(otu) = otutax$OTUname

tax = otutax[,c(4,5,7,8,9,10,11)]
row.names(tax) = otutax$OTUname
head(tax)

ps <- phyloseq(otu_table(as.matrix(otu),taxa_are_rows = T),
tax_table(as.matrix(tax)))

saveRDS(ps,"ps_tax.rds")

#---- Tidy the EGGNOG functional table ----#

fl.0 =dir(path, pattern = c("eggnog.txt"), full.names = TRUE, ignore.case = TRUE)
for (i in 1:length(fl.0)) {
file = fl.0[i]
samp.name <- sapply(strsplit(basename(file), "eggnog.txt"), `[`, 1)

otut <- read.delim(file,header = F)

# Remove duplicated rows; the rows dropped here have identical read IDs and identical annotations
otut %>% distinct( V1,V2, .keep_all = TRUE) %>%
dim()
otut.2 <- otut %>% distinct( V1,V2, .keep_all = TRUE)
# Tally counts per EGGNOG category

otut.3 <- as.data.frame(table(otut.2$V2))

head(otut.3)
colnames(otut.3) <- c("eggnog",samp.name)

write.csv(otut.3,paste(samp.name,"_eggnogcount.csv",sep = ""),row.names = F)

}

# Merge the per-sample EGGNOG count tables
path <- "./"
fl.1 =dir(path, pattern = c("_eggnogcount.csv"), full.names = TRUE, ignore.case = TRUE)

if (length(fl.1) > 2) {
data <- read.csv(fl.1[1]) %>% full_join(read.csv(fl.1[2 ]),by = "eggnog")
for (i in 3:length(fl.1)) {
data <- data %>%
full_join(read.csv(fl.1[i ]),by = "eggnog")
}

}else if(length(fl.1) == 2 ) {
data <- read.csv(fl.1[1]) %>% full_join(read.csv(fl.1[2 ]),by = "eggnog")
}

head(data)

write.csv(data,"./eggnog_count.csv",row.names = F)

#--Build the EGGNOG annotation table

otutax = separate(data, col = eggnog,
into = c("eggNOG.id","eggNOG.group","eggNOG.descrip","COG1192"),
sep = ";",
remove = F)
head(otutax)
otutax$OTUname = paste("eggnog_",1:length(data$eggnog),sep = "")

otu = otutax[,c(colnames(data)[-1])]
row.names(otu) = otutax$OTUname

tax = otutax[,c(2,3,4,5)]
row.names(tax) = otutax$OTUname

head(tax)

ps <- phyloseq(otu_table(as.matrix(otu),taxa_are_rows = T),
tax_table(as.matrix(tax)))

saveRDS(ps,"ps_eggnog.rds")

#---- Tidy the INTERPRO2GO functional annotation ----#

fl.0 =dir(path, pattern = c("INTERPRO2GO.txt"), full.names = TRUE, ignore.case = TRUE)
for (i in 1:length(fl.0)) {
file = fl.0[i]
samp.name <- sapply(strsplit(basename(file), "INTERPRO2GO.txt"), `[`, 1)

otut <- read.delim(file,header = F)

# Remove duplicated rows; the rows dropped here have identical read IDs and identical annotations
otut %>% distinct( V1,V2, .keep_all = TRUE) %>%
dim()
otut.2 <- otut %>% distinct( V1,V2, .keep_all = TRUE)
# Tally counts per INTERPRO2GO term

otut.3 <- as.data.frame(table(otut.2$V2))

head(otut.3)
colnames(otut.3) <- c("INTERPRO2GO",samp.name)

write.csv(otut.3,paste(samp.name,"_INTERPRO2GOcount.csv",sep = ""),row.names = F)

}

# Merge the per-sample INTERPRO2GO count tables
path <- "./"
fl.1 =dir(path, pattern = c("_INTERPRO2GOcount.csv"), full.names = TRUE, ignore.case = TRUE)

if (length(fl.1) > 2) {
data <- read.csv(fl.1[1]) %>% full_join(read.csv(fl.1[2 ]),by = "INTERPRO2GO")
for (i in 3:length(fl.1)) {
data <- data %>%
full_join(read.csv(fl.1[i ]),by = "INTERPRO2GO")
}

}else if(length(fl.1) == 2 ) {
data <- read.csv(fl.1[1]) %>% full_join(read.csv(fl.1[2 ]),by = "INTERPRO2GO")
}

head(data)

write.csv(data,"./INTERPRO2GO_count.csv",row.names = F)

#--Build the INTERPRO2GO annotation table

otutax = separate(data, col = INTERPRO2GO,
into = c("INTERPRO2GO","GO.group","GO.sub","IPR"),
sep = ";",
remove = F)
head(otutax)
otutax$OTUname = paste("INTERPRO2GO_",1:length(data$INTERPRO2GO),sep = "")

otu = otutax[,c(colnames(data)[-1])]
row.names(otu) = otutax$OTUname

tax = otutax[,c(2,3,4,5)]
row.names(tax) = otutax$OTUname

head(tax)

ps <- phyloseq(otu_table(as.matrix(otu),taxa_are_rows = T),
tax_table(as.matrix(tax)))

saveRDS(ps,"ps_INTERPRO2GO.rds")

#---- Tidy the SEED functional annotation ----#

fl.0 =dir(path, pattern = c("SEED.txt"), full.names = TRUE, ignore.case = TRUE)
for (i in 1:length(fl.0)) {
file = fl.0[i]
samp.name <- sapply(strsplit(basename(file), "SEED.txt"), `[`, 1)

otut <- read.delim(file,header = F)

# Remove duplicated rows; the rows dropped here have identical read IDs and identical annotations
otut %>% distinct( V1,V2, .keep_all = TRUE) %>%
dim()
otut.2 <- otut %>% distinct( V1,V2, .keep_all = TRUE)
# Tally counts per SEED category

otut.3 <- as.data.frame(table(otut.2$V2))

head(otut.3)
colnames(otut.3) <- c("SEED",samp.name)

write.csv(otut.3,paste(samp.name,"_SEEDcount.csv",sep = ""),row.names = F)

}

# Merge the per-sample SEED count tables
path <- "./"
fl.1 =dir(path, pattern = c("_SEEDcount.csv"), full.names = TRUE, ignore.case = TRUE)

if (length(fl.1) > 2) {
data <- read.csv(fl.1[1]) %>% full_join(read.csv(fl.1[2 ]),by = "SEED")
for (i in 3:length(fl.1)) {
data <- data %>%
full_join(read.csv(fl.1[i ]),by = "SEED")
}

}else if(length(fl.1) == 2 ) {
data <- read.csv(fl.1[1]) %>% full_join(read.csv(fl.1[2 ]),by = "SEED")
}

head(data)

write.csv(data,"./SEED_count.csv",row.names = F)

#--Build the SEED annotation table

otutax = separate(data, col = SEED,
into = c("SEED","SEED.group","gene.id","SEED.discrip"),
sep = ";",
remove = F)
head(otutax)
otutax$OTUname = paste("SEED_",1:length(data$SEED),sep = "")

otu = otutax[,c(colnames(data)[-1])]
row.names(otu) = otutax$OTUname

tax = otutax[,c(2,3,4,5)]
row.names(tax) = otutax$OTUname

head(tax)

ps <- phyloseq(otu_table(as.matrix(otu),taxa_are_rows = T),
tax_table(as.matrix(tax)))

saveRDS(ps,"ps_SEED.rds")

Processing the taxonomic and functional data

This part of the workflow is straightforward and is not the main focus of the Science paper, especially since the data have already been tidied into phyloseq objects, which slot easily into our multivariate statistics workflow.
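As a minimal sketch of what that downstream step can look like, here is one way to load the saved taxonomy object and run a quick ordination. The ordination method and distance are only illustrative choices, not something prescribed by the paper, and sample metadata would still need to be added to the object before colouring points by group:

library(phyloseq)
library(ggplot2)

# Load the taxonomy-level phyloseq object saved above
ps <- readRDS("ps_tax.rds")

# full_join() leaves NA for taxa absent from a sample; treat those as zero counts
otu <- as(otu_table(ps), "matrix")
otu[is.na(otu)] <- 0
otu_table(ps) <- otu_table(otu, taxa_are_rows = TRUE)

# Relative abundance, then a Bray-Curtis PCoA
ps.rel <- transform_sample_counts(ps, function(x) x / sum(x))
ord <- ordinate(ps.rel, method = "PCoA", distance = "bray")
plot_ordination(ps.rel, ord) + theme_bw()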

These are the original commands from the Science paper:

What puzzles me is that I cannot fully make sense of these commands: where does the output file actually end up?

## The commented-out first rule does specify an output file
# shell: "/data/tools/MEGAN/mtools/mtools/bin/lcamapper.sh -i {input} -f Detect -ms 50 -me 0.01 -tp 50 -gt /data/tools/MEGAN/mtools/mtools/data/gi_taxid_prot-4March2015.bin -o {output}"

# But where exactly does the output of the command below go?
shell: "java -Xmx32G -Djava.awt.headless=true -Duser.language=en -Duser.region=US -cp '/data/tools/MEGAN/{params.megan_version}/jars/MEGAN.jar:/data/tools/MEGAN/{params.megan_version}/jars/data.jar' megan.tools.Blast2LCA -i {input} -f DAA -ms 50 -me 0.01 -top 50 -a2t {params.megan_mapping}"

Team work and achievements

Get to know us, exchange ideas, collaborate

  • Team member emails: 袁军: junyuan@njau.edu.cn; 文涛: 2018203048@njau.edu.cn

  • Team WeChat official account:

