差异分析及功能注释（下） / 四六文摘

作者 | 单细胞天地小编刘小泽

课程链接在：http://jm.grazy.cn/index/mulitcourse/detail.html?cid=55
这次会介绍如何进行差异分析及功能注释，以及比较不同分组。对应视频第三单元9-11讲

回顾

差异分析及功能注释（上）

3 差异基因功能注释

差异基因得到的方法有很多，例如DESeq2、EdgeR、Wilcox、SCDE等等，真正有意义的差异基因，即使使用不同的方法，最后结果也不会相差太多

有了差异基因，然后就是去注释

3.1 先对monocle的结果进行注释

1rm(list=ls()) 2options(stringsAsFactors = F) 3library(clusterProfiler) 4 5load(file = 'step3.1-DEG-monocle_summary.Rdata')

得到差异基因名

1de_genes <- subset(de_clusters, qval<0.05) 2> length(de_genes$genes) 3[1] 4435

然后基因ID转换

1entrez_genes <- bitr(de_genes$genes, fromType="SYMBOL", 2 toType="ENTREZID", 3 OrgDb="org.Mm.eg.db")

作者剔除掉一个基因名

1entrez_genes <- entrez_genes[!entrez_genes$ENTREZID %in% "101055843",] 2> length(entrez_genes$SYMBOL) 3[1] 4281

因为有一些de_genes的SYMBOL没有对应的ENTREZID，因此看到少了100多个基因。

然后，把存在ENTREZID的那些基因的基因名和cluster信息提取出来

1de_gene_clusters <- de_genes[de_genes$genes %in% entrez_genes$SYMBOL, 2 c("genes", "cluster")] 3# 保持de_gene_clusters$genes的顺序不变，将他的symbol变成entrez ID 4de_gene_clusters <- data.frame( 5 ENTREZID=entrez_genes$ENTREZID[entrez_genes$SYMBOL %in% de_gene_clusters$genes], 6 cluster=de_gene_clusters$cluster 7)

将差异基因对应到每个cluster

也就是拆分成4个cluster组成的列表

1list_de_gene_clusters <- split(de_gene_clusters$ENTREZID, 2 de_gene_clusters$cluster)

然后可以将4个cluster同时进行GO注释，并且放在一起（耗时！）

1formula_res <- compareCluster( 2 ENTREZID~cluster, 3 data=de_gene_clusters, 4 fun="enrichGO", 5 OrgDb="org.Mm.eg.db", 6 ont = "BP", 7 pAdjustMethod = "BH", 8 pvalueCutoff = 0.01, 9 qvalueCutoff = 0.05 10) 11# org.Mm.eg.db更新于2019年4月 12 13# 可视化 14pdf('DEG_GO_each_cluster.pdf',width = 11,height = 6) 15dotplot(formula_res, showCategory=5) 16dev.off()

简化GO注释结果

The simplified version of enriched result is more clear and give us a more comprehensive view of the whole story
https://guangchuangyu.github.io/2015/10/use-simplify-to-remove-redundancy-of-enriched-go-terms/

1start_time <- Sys.time() 2lineage1_ego <- simplify( 3 formula_res, 4 cutoff=0.5, 5 by="p.adjust", 6 select_fun=min 7) 8end_time <- Sys.time() 9(end_time - start_time) 10# Time difference of 1.588762 mins 11 12pdf('DEG_GO_each_cluster_simplified.pdf',width = 11,height = 6) 13dotplot(lineage1_ego, showCategory=5) 14dev.off()

保存结果

1write.csv(formula_res@compareClusterResult, 2 file="DEG_GO_each_cluster.csv") 3# 简化版本 4write.csv(lineage1_ego@compareClusterResult, 5 file="DEG_GO_each_cluster_simplified.csv")

这几个数字需要好好了解一下

看第一行：背景基因是20000多个，其中属于这个GO通路的有256个基因；然后C1这个cluster这里总共得到的差异基因是1236个，其中属于这个GO通路的是66个

4 以第一个GO:0140014为例，进行探索

首先获得GO:0140014中的基因 =》 256个

1library(org.Mm.eg.db) 2# 结果有3百多万个 3go2gene <- toTable(org.Mm.egGO2ALLEGS)

其实看到里面的基因ID存在重复，一个go_id会对应多个同样的gene_id，那么就要去重复，获取unique gene id

1uni_gene <- unique(go2gene$gene_id[go2gene$go_id == 'GO:0140014']) 2> length(uni_gene) 3[1] 256

结果和👆的BgRatio中记录的一致

然后拿到第一群的差异基因 =》 1284个

1c1_genes <- list_de_gene_clusters[['C1']] 2> length(c1_genes) 3[1] 1284

这里的1284比之前的1236多个几十个，说明存在几十个基因没有GO注释

找Cluster1在GO:0140014中的基因 =》 66个

其实就是找c1_genes和go2gene的交叉

1overlap_genes <- intersect(c1_genes,uni_gene) 2> length(overlap_genes) 3[1] 66

也就是对应上面GO通路的66个基因

5 差异分析结果画热图

Monocle的结果会用到上一篇的 1.6 用包装好的pheatmap画热图的代码

ROST的结果进行差异分析

导入数据

1rm(list = ls()) 2options(warn=-1) 3options(stringsAsFactors = F) 4source("../analysis_functions.R") 5 6# 加载RPKM表达矩阵 7load('../female_rpkm.Rdata') 8# 加载ROTS 的差异分析结果 9load('DEG-ROTS_summary.Rdata') 10head(summary_pop1) 11head(summary_pop2) 12head(summary_pop3) 13head(summary_pop4) 14 15# 6个发育时期获取 16head(colnames(females)) 17female_stages <- sapply(strsplit(colnames(females), "_"), `[`, 1) 18names(females) <- colnames(females) 19table(female_stages) 20# 4个cluster获取 21cluster <- read.csv('female_clustering.csv') 22female_clustering=cluster[,2];names(female_clustering)=cluster[,1] 23table(female_clustering) 24 25RPKM.full=females

每个群挑选各自的top18的差异基因绘制热图

1population_subset<-c(rownames(summary_pop1[summary_pop1$ROTS.statistic<0,])[1:18], 2 rownames(summary_pop2[summary_pop2$ROTS.statistic<0,])[1:18], 3 rownames(summary_pop3[summary_pop3$ROTS.statistic<0,])[1:18], 4 rownames(summary_pop4[summary_pop4$ROTS.statistic<0,])[1:18])

总共得到了72个基因，然后取出

1RPKM_heatmap<-RPKM.full[population_subset,]

将小表达矩阵的列名重新按照4个cluster排序，并且log转换

1RPKM_heatmap<-RPKM_heatmap[,order(female_clustering)] 2RPKM_heatmap<-log2(RPKM_heatmap+1)

每个群赋予不同颜色

1popul.col<-sort(female_clustering);table(popul.col) 2popul.col<-replace(popul.col, popul.col=='C1',"#1C86EE" ) 3popul.col<-replace(popul.col, popul.col=='C2',"#00EE00" ) 4popul.col<-replace(popul.col, popul.col=='C3',"#FF9912" ) 5popul.col<-replace(popul.col, popul.col=='C4',"#FF3E96" )

画图

1library(gplots) 2png("ROST-DEG-heatmap.png") 3heatmap.2(as.matrix(RPKM_heatmap), 4 ColSideColors = as.character(popul.col), tracecol = NA, 5 dendrogram = "none",col=bluered, labCol = FALSE, 6 scale="none", key = TRUE, symkey = F, symm=F, key.xlab = "", 7 key.ylab = "", density.info = "density", 8 key.title = "log2(RPKM+1)", keysize = 1.2, denscol="black", Colv=FALSE) 9dev.off()

6 两种差异分析方法结果比较

首先还是加载它们的结果

1rm(list=ls()) 2options(stringsAsFactors = F) 3 4load(file = 'DEG-monocle_summary.Rdata') 5load(file = 'DEG-ROTS_summary.Rdata') 6 7# monocle差异基因 8mnc_DEG <- subset(de_clusters, qval<0.05) 9# ROTS差异基因 10head(summary_pop1) 11head(summary_pop2) 12head(summary_pop3) 13head(summary_pop4)

对第一群比较

1g1=rownames(summary_pop1[summary_pop1$FDR>0.05,]) 2g2=rownames(mnc_DEG[mnc_DEG$cluster=='C1',]) 3> length(intersect(g1,g2));length(g1);length(g2) 4[1] 194 5[1] 7006 6[1] 1319

看到ROTS得到了7006个差异基因，monocle得到1319个，它们共有194个。就数字来讲，ROTS挑选的有些”粗糙“，一般差异基因都不会挑选太多

1## 对第二群比较 2g1=rownames(summary_pop2[summary_pop2$FDR>0.05,]) 3g2=rownames(mnc_DEG[mnc_DEG$cluster=='C2',]) 4> length(intersect(g1,g2));length(g1);length(g2) 5[1] 620 6[1] 7015 7[1] 1100 8# 第二群，二者的重叠度较高 9 10## 对第三群比较 11g1=rownames(summary_pop3[summary_pop3$FDR>0.05,]) 12g2=rownames(mnc_DEG[mnc_DEG$cluster=='C3',]) 13> length(intersect(g1,g2));length(g1);length(g2) 14[1] 205 15[1] 7909 16[1] 1129 17 18## 对第四群比较 19g1=rownames(summary_pop2[summary_pop2$FDR>0.05,]) 20g2=rownames(mnc_DEG[mnc_DEG$cluster=='C2',]) 21> length(intersect(g1,g2));length(g1);length(g2) 22[1] 620 23[1] 7015 24[1] 1100

“

差异分析及功能注释（下）

回顾