PGSEA和GSVA你会怎么选择呢?

GSEA 相信看过我生信菜鸟团博客的朋友都已经耳熟能详了的,其需要样本的描述以及分组信息,来计算每个基因的差异度量对它们进行排序,然后走GSEA。

虽然有ssGSEA这样的单样本的分析,但仍然不够,也有GSVA这样的算法来弥补,这里要介绍的是另外一个包,PGSEA。

it tests for each sample whether the average expression of genes in a gene sets deviates from the overall average expression (expression of all genes in all samples).

使用GSVA方法计算某基因集在各个样本的表现

安装PGSEA这个R包

安装并且查看 PDF教程:

## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/")
biocLite("PGSEA")
library(PGSEA)
browseVignettes("PGSEA")  

最新版教程: https://bioconductor.org/packages/release/bioc/html/PGSEA.html

重点就是PGSEA函数及smcPlot函数,前者根据表达矩阵及基因集来进行GSEA分析,后者用来可视化分析后的结果。

因为PGSEA分析后的结果是每个基因集在每个样本的一个score,所以也是一个表达矩阵,也可以进行limma的差异分析流程。

执行PGSEA分析及可视化

library(PGSEA)
library(gskb)
data(mm_miRNA)
gse<-read.csv("http://ge-lab.org/gskb/GSE40261.csv",header=TRUE, row.name=1)
# Gene are centered by mean expression
gse <- gse - apply(gse,1,mean)

## 保证表达矩阵和基因集的基因名字是对应的即可。

d1 <- scan("http://ge-lab.org/gskb/2-MousePath/MousePath_Co-expression_gmt.gmt", what="", sep="\n", skip=1)
mm_Co_expression <- strsplit(d1, "\t")
names(mm_Co_expression) <- sapply(mm_Co_expression, '[[', 1)

pg <- PGSEA(gse, cl=mm_Co_expression, range=c(15,2000), p.value=NA)

# Remove pathways that has all NAs. This could be due to that pathway has
# too few matching genes.
pg2 <- pg[rowSums(is.na(pg))!= dim(gse)[2], ]

# Difference in Average Z score in two groups of samples is calculated and
# the pathways are ranked by absolute value.
diff <- abs( apply(pg2[,1:4],1,mean) - apply(pg2[,5:8], 1, mean) )
pg2 <- pg2[order(-diff), ]

sub <- factor( c( rep("Control",4),rep("Anti-miR-29",4) ) )

smcPlot(pg2[1:15,],sub,scale=c(-12,12),show.grid=TRUE,margins=c(1,1,7,19),col=.rwb)


分析结果也可以走limma流程

library(PGSEA)
library(GEOquery)
library(GSEABase)
gse <- getGEO("GSE7023",GSEMatrix=TRUE)
#load("gse.rda")
subtype <- gsub("\\.", "_",gsub("subtype: ", "", phenoData(gse[[1]])$"characteristics_ch1"))
pheno <- new("AnnotatedDataFrame", data = data.frame(subtype), varMetadata = data.frame(labelDescription="subtype"))
rownames(pheno@data) <- colnames(exprs(gse[[1]]))
eset <- new("ExpressionSet", exprs = exprs(gse[[1]]), phenoData = pheno)

data(VAIgsc)
details(VAIgsc[[1]])

pgNF <- PGSEA(eset, VAIgsc, ref=which(subtype=="NO"), p.value=NA)

library(limma)

design <- model.matrix(~ -1+factor(subtype))
colnames(design) <- names(table(subtype))
fit <- lmFit(pgNF, design)
contrast.matrix <- makeContrasts(P2B-NO , levels=design)
fit <- contrasts.fit(fit, contrast.matrix)
fit <- eBayes(fit)

topTable(fit, n=10)[,c("logFC","t","adj.P.Val")]

可以看到都是给一个基因集(GO/KEGG/BIOCARTA/REACTOME/MSIGDB)在每个样本里面打分,把所有基因在所有样本的表达量矩阵转换为了所有基因集在所有样本的打分矩阵!!!

那么,回答我,你会怎么选择呢?

(0)

相关推荐