OSCA单细胞数据分析笔记-3 SingleCellExperiment数据结构 / 四六文摘

对应原版教程第4章 http://bioconductor.org/books/release/OSCA/
SingleCellExperiment数据结构（下面简称sce）基于SummarizedExperiment(https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html)衍生而来，专门用于存储scRNA-seq数据。这是导入测序比对数据到R的第一步，也是之后分析流程的主要对象。

1、sce主要结构组成

如下图所示，我目前对sce结构的理解是，围绕scRNA-seq的原始count数据，储存了4组相关信息

（1）Assays，即counts表达矩阵的标准化处理的矩阵(可以有任意多种，但常见的也就两三种)；
（2）colData，即scRNA-seq的每个细胞的信息(例如批次信息、分组信息、表达概况信息)；
（3）rowData，即scRNA-seq的每个基因的信息(例如表达概况、不同类基因名ID)；
（4）reducedDims，即每个细胞的降维特征信息(主要有PCA、tSNE、uMAP三类)

2、R包准备与简单构建sce

2.1 R包安装

主要用到两个包：SingleCellExperiment，是构建sce对象的基础包；scater，是分析scRNA-seq的常用工具包之一。

BiocManager::install('SingleCellExperiment') BiocManager::install('scater') library(SingleCellExperiment) library(scater)

值得一提的是加载基于sce的单细胞分析工具包时都会自动加载包括SingleCellExperiment在内的依赖包。

2.2 简单构建sce

简单构建sce对象只需要提供单细胞count表达矩阵即可；
如下，模拟一个包含3个细胞，比对到10个基因上的count表达矩阵

counts_matrix <- data.frame(cell_1 = rpois(10, 10), cell_2 = rpois(10, 10), cell_3 = rpois(10, 30)) rownames(counts_matrix) <- paste0("gene_", 1:10) counts_matrix <- as.matrix(counts_matrix) # must be a matrix object! # cell_1 cell_2 cell_3 # gene_1 17 12 21 # gene_2 14 6 36 # gene_3 12 11 30 # gene_4 8 15 26 # gene_5 15 11 31 # gene_6 9 8 29 # gene_7 8 12 28 # gene_8 7 10 32 # gene_9 9 10 27 # gene_10 6 6 26

然后套到下面这个命令里即可

sce <- SingleCellExperiment(assays = list(counts = counts_matrix)) sce # class: SingleCellExperiment # dim: 10 3 # metadata(0): # assays(1): counts #rownames(10): gene_1 gene_2 ... gene_9 gene_10 #rowData names(0): #colnames(3): cell_1 cell_2 cell_3 #colData names(0): #reducedDimNames(0): #altExpNames(0):

3、sce四组成

3.1 标准化数据之assays

3.2 细胞信息之colData

主要包括两类：一是细胞的实验、分组等信息；二是根据表达矩阵概括的每个细胞的表达信息。

相关函数命令有
colData(sce) 查看当前sce对象所有的细胞信息；配合$美元符可方便的用于查看或新添某一条colData。尤其特殊一点是colData(sce)$foo等价于sce$foo

#当前sce有一条sizeFactor cell-info，为scater::logNormCounts时自动添加的 colData(sce) # DataFrame with 3 rows and 1 column # sizeFactor # <numeric> # cell_1 0.640244 # cell_2 0.615854 # cell_3 1.743902


#添加批次信息

#colData(sce)$batch

sce$batch=c(1,1,2)  #注意顺序要一致

colnames(colData(sce))

#[1] "batch"      "sizeFactor"
#也可在构建sce对象时，就添加上

cell_metadata <- data.frame(batch = c(1, 1, 2))

rownames(cell_metadata) <- paste0("cell_", 1:3)

sce <- SingleCellExperiment(assays = list(counts = counts_matrix),

                            colData = cell_metadata)
#利用scater包，注释每个细胞的表达概况

sce <- scater::addPerCellQC(sce)

colData(sce)

# DataFrame with 3 rows and 9 columns

# batch sizeFactor       sum  detected percent_top_50 percent_top_100 percent_top_200 percent_top_500     total

#           <numeric>  <numeric> <integer> <integer>      <numeric>       <numeric>       <numeric>       <numeric> <integer>

# cell_1         1   0.640244       105        10            100             100             100             100       105

# cell_2         1   0.615854       101        10            100             100             100             100       101

# cell_3         2   1.743902       286        10            100             100             100             100       286

3.3 gene信息之colData

我目前所认识主要包括两类：一是不同种基因ID，二是基因的表达情况

3.4 细胞降维信息之reducedDims

细胞降维信息，简单来说就是根据每个细胞的上千维表达情况压缩成几十个，甚至几个特征属性从而概括这个细胞的表达特征。方式有很多，常见的有三种PCA、tSNE、uMAP

4、sce取子集

如上，sce的四部分都是基于原始的细胞名和gene名搭建的。
如果只取其中一个细胞信息，则其余注释信息也当相应的取该细胞的注释信息；同理对基因筛选也是。

sce # class: SingleCellExperiment # dim: 10 3 # metadata(0): # assays(2): counts logcounts # rownames(10): gene_1 gene_2 ... gene_9 gene_10 # rowData names(2): mean detected # colnames(3): cell_1 cell_2 cell_3 # colData names(10): batch sizeFactor ... total foo # reducedDimNames(1): PCA # altExpNames(0):


sce[1:2,]  #筛选基因

# class: SingleCellExperiment 

# dim: 2 3 

# metadata(0):

# assays(2): counts logcounts

# rownames(2): gene_1 gene_2

# rowData names(2): mean detected

# colnames(3): cell_1 cell_2 cell_3

# colData names(10): batch sizeFactor ... total foo

# reducedDimNames(1): PCA

# altExpNames(0):
sce[,c(1,3)]  #筛选细胞

sce[,sce$batch==1]  #根据注释信息筛选

5、简单对比Seurat数据结构

library(Seurat) scRNA=as.Seurat(sce) scRNA # An object of class Seurat # 10 features across 3 samples within 1 assay # Active assay: RNA (10 features, 0 variable features) # 1 dimensional reduction calculated: PCA scRNA@assays$RNA@counts #对应sce的counts assay scRNA@assays$RNA@data #对应sce的log2count assay scRNA@meta.data #对应sce的colData scRNA@assays$RNA@meta.features #应该对应sce的rowData，但显示为空值 scRNA@reductions$PCA@cell.embeddings #对应sce的reduceDim subset(scRNA, subset = batch == 1) #利用subset取子集

以上学习了SingleCellExperiment对象结构的基础知识，部分知识点如altExp()、colLabels()、sizeFactors()等到具体应用时再做学习，就不一一介绍了。至此OSCA的第一部分就到这里了，内容比较少。第二部分就主要聚焦于scRNA-seq的分析流程知识点。

OSCA单细胞数据分析笔记-3 SingleCellExperiment数据结构