OSCA单细胞数据分析笔记11—Cell type annotation / 四六文摘

对应原版教程第12章http://bioconductor.org/books/release/OSCA/overview.html

“物以类聚”的类是什么类？比如将一群水果分为不同的类群，则又红又圆特征的可能是苹果。对于单细胞聚类的结果，类的最直接注释就是细胞类型。本节将学习单细胞数据分析过程中注释细胞类型的三种思路。

笔记要点

1、参考已注释细胞类型的单细胞表达矩阵

2、观察各种细胞类型marker gene sets在每个细胞里的表达“活性”

3、cluster差异基因富集分析结果联系细胞类型

1、参考已注释细胞类型的单细胞表达矩阵

1.1 背景知识

如果已知若干细胞类型的全部基因表达情况，对于未知的单细胞表达矩阵的每一个细胞，可以计算与前者的各个相似性程度，根据最优的相似结果，进而推断出细胞类型。
在如上描述中，有两个关键点。首先是已知细胞类型的表达矩阵(列名为细胞类型，行名为基因)。常用的celldex数据包提供了人/鼠多个参考表达矩阵，多来自于Bulk RNA-seq或芯片测序技术，详见https://bioconductor.org/packages/3.12/data/experiment/vignettes/celldex/inst/doc/userguide.html
第二的关键点是评价未知细胞的表达情况与已知细胞类型的表达图谱的相似性方法。这里重点介绍SingleR包提供的评价以及预测方法

1.2 示例数据

测试数据集(未知细胞类型)：人外周血组织单细胞测序结果

sce.pbmc #已完成质控、标准化、聚类；具体可参考原教程 #class: SingleCellExperiment #dim: 33694 737280

参考数据集：celldex包提供的BlueprintEncodeData()

The Blueprint/ENCODE reference consists of bulk RNA-seq data for pure stroma and immune cells generated by Blueprint (Martens and Stunnenberg 2013) and ENCODE projects (The ENCODE Project Consortium 2012).

library(celldex) ref <- BlueprintEncodeData() ref

1.3 `SingleR`细胞类型预测

（1）默认直接预测每个细胞的类型

library(SingleR) pred <- SingleR(test=sce.pbmc, ref=ref, labels=ref$label.main) head(pred) # DataFrame with 6 rows and 5 columns # scores first.labels tuning.scores labels pruned.labels # <matrix> <character> <DataFrame> <character> <character> # AAACCTGAGAAGGCCT-1 0.251873:0.118890:0.287805:... Monocytes 0.413717:0.00491616 Monocytes Monocytes # AAACCTGAGACAGACC-1 0.280080:0.133100:0.334590:... Monocytes 0.469530:0.40291032 Monocytes Monocytes # AAACCTGAGGCATGGT-1 0.211503:0.153752:0.345878:... CD4+ T-cells 0.180486:0.06419868 CD4+ T-cells CD4+ T-cells # AAACCTGCAAGGTTCT-1 0.218346:0.155343:0.367597:... CD8+ T-cells 0.307063:0.19412309 CD8+ T-cells CD8+ T-cells # AAACCTGCAGGCGATA-1 0.323697:0.182205:0.483679:... Monocytes 0.558153:0.50413520 Monocytes Monocytes # AAACCTGCATGAAGTA-1 0.312739:0.166972:0.461094:... Monocytes 0.531659:0.46923231 Monocytes Monocytes


table(pred$labels)

# B-cells CD4+ T-cells CD8+ T-cells           DC  Eosinophils Erythrocytes          HSC    Monocytes     NK cells 

# 549          773         1274            1            1            5           14         1117          251

celldex包提供的参考数据集里一般提供label.main、label.fine两种resolution的细胞类型注释，前者的分辨率低，重点关注主要的细胞类型；后者则适合具体的细胞亚类研究。

观察具体每个位置细胞对所有参考细胞类型的评分。从图中可以看出B-cells、Monocytes的评分具有专一性，而CD4+与CD8+细胞可能会出现难以区分的情况。

plotScoreHeatmap(pred)

如下图，每一列代表一个未知细胞对所有参考细胞类型的相似性评分情况。

进一步观察注释的细胞类型的分类与之前聚类结果的混淆关系。从图中可以看出，多个cluster对应于一种细胞类型Monocytes，可能反应可不同的细胞亚类情况。而CD4+与CD8+细胞会出现对应于同一个cluster的情况。

tab <- table(Assigned=pred$pruned.labels, Cluster=colLabels(sce.pbmc))


# Adding a pseudo-count of 10 to avoid strong color jumps with just 1 cell.

library(pheatmap)

pheatmap(log2(tab+10), color=colorRampPalette(c("white", "blue"))(101))

（2）统一预测每个cluster的细胞类型

pred2 <- SingleR(test=sce.pbmc, ref=ref, clusters = colLabels(sce.pbmc), labels=ref$label.main) # B-cells CD4+ T-cells CD8+ T-cells HSC Monocytes NK cells # 1 2 4 1 7 1 table(pred2$labels) plotScoreHeatmap(pred2)

2、观察各种细胞类型marker gene sets在每个细胞里的表达“活性”

2.1 背景知识

首先得到每种细胞类型的marker gene set(一般指特异高表达基因集)。
然后计算对应于单细胞测序结果中，每个细胞的表达信息中，上述基因集的高表达程度。
AUCell包对此提供的计算方式是采用Wilcoxon rank sum test。简单来说先将一个细胞的基因按表达量从高到低排序，根据秩次比较在基因集中的基因表达水平是否区分于(高于)不在基因集中的基因表达水平。最终展示的指标为AUC值，值越接近1说明该细胞越特异表达该基因基，越有可能为该基因集对应的细胞类型。

2.2 示例数据

测试数据集：鼠脑单细胞测序数据

library(scRNAseq) sce.tasic <- TasicBrainData() sce.tasic # class: SingleCellExperiment # dim: 24058 1809

细胞类型marker gene set：获取方式很多，也有专门的数据库整理。教程中是根据已标注细胞类型的单细胞表达矩阵计算出每个细胞类型的marker gene set

sce.zeisel # class: SingleCellExperiment # dim: 19839 2816 library(scran) wilcox.z <- pairwiseWilcox(sce.zeisel, sce.zeisel$level1class, lfc=1, direction="up") markers.z <- getTopMarkers(wilcox.z$statistics, wilcox.z$pairs, pairwise=FALSE, n=50) #每个细胞类型的marker gene set为元素组成的list对象

2.3 `AUCell`计算

首先需提前使用GSEABase包整理marker gene sets的对象为指定的GeneSetCollection对象格式

library(GSEABase) all.sets <- lapply(names(markers.z), function(x) { GeneSet(markers.z[[x]], setName=x) }) all.sets <- GeneSetCollection(all.sets)

然后计算测序数据中每个细胞的基因表达排名

rankings <- AUCell_buildRankings(counts(sce.tasic), plotStats=FALSE, verbose=FALSE)

计算每个细胞对多种细胞mark基因集的AUC结果

cell.aucs <- AUCell_calcAUC(all.sets, rankings) results <- t(assay(cell.aucs)) head(results) # gene sets # cells astrocytes_ependymal endothelial-mural interneurons microglia oligodendrocytes pyramidal CA1 pyramidal SS # Calb2_tdTpositive_cell_1 0.1386528 0.04264310 0.5306093 0.04845394 0.1317958 0.2317668 0.3476778 # Calb2_tdTpositive_cell_2 0.1366283 0.04884635 0.4538357 0.02682648 0.1211293 0.2062570 0.2762466 # Calb2_tdTpositive_cell_3 0.1087323 0.07269892 0.3458998 0.03583482 0.1567204 0.3218726 0.5244492 # Calb2_tdTpositive_cell_4 0.1321658 0.04993108 0.5112665 0.05387632 0.1480893 0.2546714 0.3505890 # Calb2_tdTpositive_cell_5 0.1512892 0.07161420 0.4929771 0.06655747 0.1385766 0.2088478 0.3009582 # Calb2_tdTpositive_cell_6 0.1342012 0.09161375 0.3378310 0.03201310 0.1552946 0.4010508 0.5392798

对于每一个细胞来说，一般使用AUC值最高的细胞类型标签

new.labels <- colnames(results)[max.col(results)] # new.labels # astrocytes_ependymal endothelial-mural interneurons microglia oligodendrocytes pyramidal SS # 69 29 776 23 41 871

对于这种注释细胞类型的方法比较适用于不同细胞类型能够的marker基因集不重叠的情况，即对于细胞亚类的区分效果可能不会很好。因此，在选取细胞类型特异基因集时需要多加考虑。

3、cluster差异基因GO/KEGG富集分析结果联系细胞类型

3.1 背景知识

这是基于聚类结果，计算每个cluster相较于其它cluster的上调差异基因。然后对每个cluster的up DEG进行富集分析，最后根据富集分析结果，手动注释出细胞类型。
富集背景通路基可以选择GO(Gene Ontology)的BP, biologicalterm set结果，因为方便与细胞类型联系起来。

3.2 示例数据

测试数据集：小鼠乳腺组织测序数据

sce.mam

3.3 `limma`包go富集分析`goana()`

#cluster差异分析 markers.mam <- findMarkers(sce.mam, direction="up", lfc=1)


#选择cluster 2的DEG

chosen <- "2"

cur.markers <- markers.mam[[chosen]]
# 基因名格式转换

library(org.Mm.eg.db)

entrez.ids <- mapIds(org.Mm.eg.db, keys=rownames(cur.markers), 

    column="ENTREZID", keytype="SYMBOL")
# 进一步筛选具有显著意义的差异基因用于接下来的富集分析

is.de <- cur.markers$FDR <= 0.05 

summary(is.de)
library(limma)

go.out <- goana(unique(entrez.ids[is.de]), species="Mm", 

    universe=unique(entrez.ids))
# Only keeping biological process terms that are not overly general.

go.out <- go.out[order(go.out$P.DE),]

go.useful <- go.out[go.out$Ont=="BP" & go.out$N <= 200,]

head(go.useful, 10)

# Term Ont   N DE         P.DE

# GO:0006641                         triglyceride metabolic process  BP 105 11 2.799728e-10

# GO:0006639                         acylglycerol metabolic process  BP 135 11 4.173564e-09

# GO:0006638                        neutral lipid metabolic process  BP 137 11 4.876288e-09

# GO:0006119                              oxidative phosphorylation  BP  94  9 2.722992e-08

# GO:0042775 mitochondrial ATP synthesis coupled electron transport  BP  51  7 8.032239e-08

# GO:0042773               ATP synthesis coupled electron transport  BP  52  7 9.225569e-08

# GO:0046390                  ribose phosphate biosynthetic process  BP 147 10 1.203158e-07

# GO:0022408              negative regulation of cell-cell adhesion  BP 187 11 1.225564e-07

# GO:0009152             purine ribonucleotide biosynthetic process  BP 131  9 4.817966e-07

# GO:0035148                                         tube formation  BP 174 10 5.775054e-07

We see an enrichment for genes involved in lipid synthesis, cell adhesion and tube formation. Given that this is a mammary gland experiment, we might guess that cluster 2 contains luminal epithelial cells responsible for milk production and secretion. 如上推理过程可以看出，这种预测细胞类型需要有较强的生物背景知识。

如果想进一步看具体富集到的某一term里所有基因在该cluster的DEG列表分布，可如下操作：

# Extract symbols for each GO term; done once. tab <- select(org.Mm.eg.db, keytype="SYMBOL", keys=rownames(sce.mam), columns="GOALL") by.go <- split(tab[,1], tab[,2])


# Identify genes associated with an interesting term.

adhesion <- unique(by.go[["GO:0022408"]])

head(cur.markers[rownames(cur.markers) %in% adhesion,1:4], 10)

# DataFrame with 10 rows and 4 columns

# Top     p.value         FDR summary.logFC

# <integer>   <numeric>   <numeric>     <numeric>

#   Spint2         11 3.28234e-34 1.37163e-31       2.39280

# Epcam          17 8.86978e-94 7.09531e-91       2.32968

# Cebpb          21 6.76957e-16 2.03800e-13       1.80192

# Cd24a          21 3.24195e-33 1.29669e-30       1.72318

# Btn1a1         24 2.16574e-13 6.12488e-11       1.26343

# Cd9            51 1.41373e-11 3.56592e-09       2.73785

# Ceacam1        52 1.66948e-38 7.79034e-36       1.56912

# Sdc4           59 9.15001e-07 1.75467e-04       1.84014

# Anxa1          68 2.58840e-06 4.76777e-04       1.29724

# Cdh1           69 1.73658e-07 3.54897e-05       1.31265

以上分别介绍了三种注释细胞类型的方法，各有侧重。细胞类型注释是一个单细胞数据分析过程中的重要步骤，还有其它一些注释方法，有机会再多多学习。

OSCA单细胞数据分析笔记11—Cell type annotation