单细胞转录组3大R包之Seurat

牛津大学的Rahul Satija等开发的Seurat,最早公布在Nature biotechnology, 2015,文章是; Spatial reconstruction of single-cell gene expression data , 在2017年进行了非常大的改动,所以重新在biorxiv发表了文章在 Integrated analysis of single cell transcriptomic data across conditions, technologies, and species 。 功能涵盖了scRNA-seq的QC、过滤、标准化、批次效应、PCA、tNSE亚群聚类分析、差异基因分析、亚群特异性标志物鉴定等等等。其GitHub地址是:http://satijalab.org/seurat/给初学者提供了一个2,700 PBMC scRNA-seq dataset from 10X genomics的数据实战指导,非常容易学会: http://satijalab.org/seurat/pbmc3k_tutorial.html 数据在: https://personal.broadinstitute.org/rahuls/seurat/seurat_files_nbt.zip同时还提供两个公共数据的实战演练教程:https://www.dropbox.com/s/4d00eyd84qscyd2/IntegratedAnalysis_Examples.zip?dl=1http://bit.ly/IAexpmat下载后如下所示:IntegratedAnalysis_Examples├── [ 211]  INSTALL├── [1.4K]  README.md├── [ 170]  data│   ├── [ 83K]  Supplementary_Table_MarrowCellData.tsv│   ├── [547K]  Supplementary_Table_PancreasCellData.tsv│   └── [ 561]  regev_lab_cell_cycle_genes.txt├── [ 170]  examples│   ├── [2.9K]  marrow_commandList.R│   └── [2.5K]  pancreas_commandList.R└── [ 170]  tutorial├── [8.2K]  Seurat_AlignmentTutorial.Rmd└── [7.6M]  Seurat_AlignmentTutorial.pdfIntegratedAnalysis_ExpressionMatrices├── [102M]  marrow_mars.expressionMatrix.txt├── [ 66M]  marrow_ss2.expressionMatrix.txt├── [330M]  pancreas_human.expressionMatrix.txt├── [ 54M]  pancreas_mouse.expressionMatrix.txt├── [165M]  pbmc_10X.expressionMatrix.txt└── [101M]  pbmc_SeqWell.expressionMatrix.txtseurat的用法这里的测试数据是经由Illumina NextSeq 500测到的2,700 single cells 表达矩阵,下载地址;根据表达矩阵构建seurat对象需要准备好3个输入文件library(Seurat)library(dplyr)library(Matrix)## https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz## 下载整个压缩包解压即可重现整个流程# Load the PBMC datasetlist.files("~/Downloads/filtered_gene_bc_matrices/hg19/")## [1] "barcodes.tsv" "genes.tsv"    "matrix.mtx"pbmc.data <- Read10X(data.dir = "~/Downloads/filtered_gene_bc_matrices/hg19/")# Examine the memory savings between regular and sparse matricesdense.size <- object.size(x = as.matrix(x = pbmc.data))dense.size## 709264728 bytessparse.size <- object.size(x = pbmc.data)sparse.size## 38715120 bytesdense.size / sparse.size## 18.3 bytes# Initialize the Seurat object with the raw (non-normalized data).  Keep all# genes expressed in >= 3 cells (~0.1% of the data). Keep all cells with at# least 200 detected genespbmc <- CreateSeuratObject(raw.data = pbmc.data, min.cells = 3, min.genes = 200,project = "10X_PBMC")pbmc## An object of class seurat in project 10X_PBMC##  13714 genes across 2700 samples.进行一系列的QC步骤mito.genes <- grep(pattern = "^MT-", x = rownames(x = pbmc@data), value = TRUE)percent.mito <- Matrix::colSums(pbmc@raw.data[mito.genes, ]) / Matrix::colSums(pbmc@raw.data)# AddMetaData adds columns to object@meta.data, and is a great place to stash QC statspbmc <- AddMetaData(object = pbmc, metadata = percent.mito, col.name = "percent.mito")VlnPlot(object = pbmc, features.plot = c("nGene", "nUMI", "percent.mito"), nCol = 3)

# GenePlot is typically used to visualize gene-gene relationships, but can be used for anything# calculated by the object, i.e. columns in object@meta.data, PC scores etc.# Since there is a rare subset of cells with an outlier level of high mitochondrial percentage# and also low UMI content, we filter these as wellpar(mfrow = c(1, 2))GenePlot(object = pbmc, gene1 = "nUMI", gene2 = "percent.mito")GenePlot(object = pbmc, gene1 = "nUMI", gene2 = "nGene")

# We filter out cells that have unique gene counts over 2,500 or less than 200# Note that low.thresholds and high.thresholds are used to define a 'gate'# -Inf and Inf should be used if you don't want a lower or upper threshold.pbmc <- FilterCells(object = pbmc, subset.names = c("nGene", "percent.mito"), low.thresholds = c(200, -Inf), high.thresholds = c(2500, 0.05))可以看到这里选择的QC标准是 200~2500基因范围内,以及线粒体基因表达占比小于5%的才保留。normalization这里默认根据细胞测序文库大小进行normalization,简单的做一个log转换即可。summary(pbmc@raw.data[,1])##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.##  0.0000  0.0000  0.0000  0.1764  0.0000 76.0000pbmc <- NormalizeData(object = pbmc, normalization.method = "LogNormalize",scale.factor = 10000)summary(pbmc@data[,1])##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.##  0.0000  0.0000  0.0000  0.1171  0.0000  5.7531rDetection of variable genes across the single cellspbmc <- FindVariableGenes(object = pbmc, mean.function = ExpMean, dispersion.function = LogVMR, x.low.cutoff = 0.0125, x.high.cutoff = 3, y.cutoff = 0.5)

length(x = pbmc@var.genes)## [1] 1838Scaling the data and removing unwanted sources of variation需要去除那些technical noise,batch effects, or even biological sources of variation (cell cycle stage)pbmc <- ScaleData(object = pbmc, vars.to.regress = c("nUMI", "percent.mito"))summary(pbmc@scale.data[,1])PCA分析pbmc <- RunPCA(object = pbmc, pc.genes = pbmc@var.genes, do.print = TRUE, pcs.print = 1:5, genes.print = 5)## [1] "PC1"## [1] "CST3"   "TYROBP" "FCN1"   "LST1"   "AIF1"## [1] ""## [1] "PTPRCAP" "IL32"    "LTB"     "CD2"     "CTSW"## [1] ""## [1] ""## [1] "PC2"## [1] "NKG7" "GZMB" "PRF1" "CST7" "GZMA"## [1] ""## [1] "CD79A"    "MS4A1"    "HLA-DQA1" "TCL1A"    "HLA-DQB1"## [1] ""## [1] ""## [1] "PC3"## [1] "PF4"   "PPBP"  "SDPR"  "SPARC" "GNG11"## [1] ""## [1] "CYBA"     "HLA-DPA1" "HLA-DPB1" "HLA-DRB1" "CD37"## [1] ""## [1] ""## [1] "PC4"## [1] "IL32"   "GIMAP7" "AQP3"   "FYB"    "MAL"## [1] ""## [1] "CD79A"    "HLA-DQA1" "CD79B"    "MS4A1"    "HLA-DQB1"## [1] ""## [1] ""## [1] "PC5"## [1] "FCER1A"  "LGALS2"  "MS4A6A"  "S100A8"  "CLEC10A"## [1] ""## [1] "FCGR3A"        "CTD-2006K23.1" "IFITM3"        "ABI3"## [5] "CEBPB"## [1] ""## [1] ""对PCA分析结果可以进行一系列的可视化: PrintPCA, VizPCA, PCAPlot, and PCHeatmap# Examine and visualize PCA results a few different waysPrintPCA(object = pbmc, pcs.print = 1:5, genes.print = 5, use.full = FALSE)## [1] "PC1"## [1] "CST3"   "TYROBP" "FCN1"   "LST1"   "AIF1"## [1] ""## [1] "PTPRCAP" "IL32"    "LTB"     "CD2"     "CTSW"## [1] ""## [1] ""## [1] "PC2"## [1] "NKG7" "GZMB" "PRF1" "CST7" "GZMA"## [1] ""## [1] "CD79A"    "MS4A1"    "HLA-DQA1" "TCL1A"    "HLA-DQB1"## [1] ""## [1] ""## [1] "PC3"## [1] "PF4"   "PPBP"  "SDPR"  "SPARC" "GNG11"## [1] ""## [1] "CYBA"     "HLA-DPA1" "HLA-DPB1" "HLA-DRB1" "CD37"## [1] ""## [1] ""## [1] "PC4"## [1] "IL32"   "GIMAP7" "AQP3"   "FYB"    "MAL"## [1] ""## [1] "CD79A"    "HLA-DQA1" "CD79B"    "MS4A1"    "HLA-DQB1"## [1] ""## [1] ""## [1] "PC5"## [1] "FCER1A"  "LGALS2"  "MS4A6A"  "S100A8"  "CLEC10A"## [1] ""## [1] "FCGR3A"        "CTD-2006K23.1" "IFITM3"        "ABI3"## [5] "CEBPB"## [1] ""## [1] ""VizPCA(object = pbmc, pcs.use = 1:2)

PCAPlot(object = pbmc, dim.1 = 1, dim.2 = 2)

# ProjectPCA scores each gene in the dataset (including genes not included in the PCA) based on their correlation# with the calculated components. Though we don't use this further here, it can be used to identify markers that# are strongly correlated with cellular heterogeneity, but may not have passed through variable gene selection.# The results of the projected PCA can be explored by setting use.full=T in the functions abovepbmc <- ProjectPCA(object = pbmc, do.print = FALSE)最重要的就是 PCHeatmap 函数了PCHeatmap(object = pbmc, pc.use = 1, cells.use = 500, do.balanced = TRUE, label.columns = FALSE)

PCHeatmap(object = pbmc, pc.use = 1:12, cells.use = 500, do.balanced = TRUE, label.columns = FALSE, use.full = FALSE)

找到有统计学显著性的主成分主成分分析结束后需要确定哪些主成分所代表的基因可以进入下游分析,这里可以使用JackStraw做重抽样分析。可以用JackStrawPlot可视化看看哪些主成分可以进行下游分析。pbmc <- JackStraw(object = pbmc, num.replicate = 100, do.print = FALSE)JackStrawPlot(object = pbmc, PCs = 1:12)

当然,也可以用最经典的碎石图来确定主成分。PCElbowPlot(object = pbmc)

这个确定主成分是非常有挑战性的: - The first is more supervised, exploring PCs to determine relevant sources of heterogeneity, and could be used in conjunction with GSEA for example. - The second implements a statistical test based on a random null model, but is time-consuming for large datasets, and may not return a clear PC cutoff. - The third is a heuristic that is commonly used, and can be calculated instantly.在本例子里面,3种方法结果差异不大,可以在PC7~10直接挑选。Cluster the cells# save.SNN = T saves the SNN so that the clustering algorithm can be rerun using the same graph# but with a different resolution value (see docs for full details)pbmc <- FindClusters(object = pbmc, reduction.type = "pca", dims.use = 1:10, resolution = 0.6, print.output = 0, save.SNN = TRUE)A useful feature in Seurat v2.0 is the ability to recall the parameters that were used in the latest function calls for commonly used functions. For FindClusters, we provide the function PrintFindClustersParams to print a nicely formatted formatted summary of the parameters that were chosen.PrintFindClustersParams(object = pbmc)## Parameters used in latest FindClusters calculation run on: 2018-01-22 07:43:31## =============================================================================## Resolution: 0.6## -----------------------------------------------------------------------------## Modularity Function    Algorithm         n.start         n.iter##      1                   1                 100             10## -----------------------------------------------------------------------------## Reduction used          k.param          k.scale          prune.SNN##      pca                 30                25              0.0667## -----------------------------------------------------------------------------## Dims used in calculation## =============================================================================## 1 2 3 4 5 6 7 8 9 10# While we do provide function-specific printing functions, the more general function to# print calculation parameters is PrintCalcParams().Run Non-linear dimensional reduction (tSNE)同样也是一个函数,这个结果也可以像PCA分析一下挑选合适的PC进行下游分析。pbmc <- RunTSNE(object = pbmc, dims.use = 1:10, do.fast = TRUE)# note that you can set do.label=T to help label individual clustersTSNEPlot(object = pbmc)

这一步很耗时,可以保存该对象,便于重复,以及分享交流save(pbmc, file = "pbmc3k.rData")Finding differentially expressed genes (cluster biomarkers)差异分析在seurat包里面被封装成了函数:FindMarkers,有一系列参数可以选择,然后又4种找差异基因的算法:ROC test (“roc”)t-test (“t”)LRT test based on zero-inflated data (“bimod”, default)LRT test based on tobit-censoring models (“tobit”)# find all markers of cluster 1cluster1.markers <- FindMarkers(object = pbmc, ident.1 = 1, min.pct = 0.25)print(x = head(x = cluster1.markers, n = 5))##                p_val avg_logFC pct.1 pct.2    p_val_adj## S100A9  0.000000e+00  3.827593 0.996 0.216  0.00000e+00## S100A8  0.000000e+00  3.786535 0.973 0.123  0.00000e+00## LGALS2  0.000000e+00  2.634722 0.908 0.060  0.00000e+00## FCN1    0.000000e+00  2.369524 0.956 0.150  0.00000e+00## CD14   8.129864e-290  1.949317 0.663 0.029 1.11493e-285# find all markers distinguishing cluster 5 from clusters 0 and 3cluster5.markers <- FindMarkers(object = pbmc, ident.1 = 5, ident.2 = c(0,3), min.pct = 0.25)print(x = head(x = cluster5.markers, n = 5))##                p_val avg_logFC pct.1 pct.2     p_val_adj## GZMB   3.854665e-190  3.195021 0.955 0.084 5.286288e-186## IGFBP7 2.967797e-155  2.175917 0.542 0.010 4.070037e-151## GNLY   7.492111e-155  3.514718 0.961 0.143 1.027468e-150## FGFBP2 2.334109e-150  2.559484 0.852 0.085 3.200998e-146## FCER1G 4.819154e-141  2.280724 0.839 0.100 6.608987e-137# find markers for every cluster compared to all remaining cells, report only the positive onespbmc.markers <- FindAllMarkers(object = pbmc, only.pos = TRUE, min.pct = 0.25, thresh.use = 0.25)pbmc.markers %>% group_by(cluster) %>% top_n(2, avg_logFC)## # A tibble: 16 x 7## # Groups:   cluster [8]##            p_val avg_logFC pct.1 pct.2     p_val_adj cluster     gene##            <dbl>     <dbl> <dbl> <dbl>         <dbl>  <fctr>    <chr>##  1 1.315805e-234  1.149058 0.924 0.483 1.804495e-230       0     LDHB##  2 3.311687e-129  1.068122 0.662 0.202 4.541648e-125       0     IL7R##  3  0.000000e+00  3.827593 0.996 0.216  0.000000e+00       1   S100A9##  4  0.000000e+00  3.786535 0.973 0.123  0.000000e+00       1   S100A8##  5  0.000000e+00  2.977399 0.936 0.042  0.000000e+00       2    CD79A##  6 1.038405e-271  2.492236 0.624 0.022 1.424068e-267       2    TCL1A##  7 8.029765e-207  2.158812 0.974 0.230 1.101202e-202       3     CCL5##  8 1.118949e-181  2.113428 0.588 0.050 1.534527e-177       3     GZMK##  9 1.066599e-173  2.275509 0.962 0.137 1.462733e-169       4   FCGR3A## 10 1.996623e-123  2.151881 1.000 0.316 2.738169e-119       4     LST1## 11 9.120707e-265  3.334634 0.955 0.068 1.250814e-260       5     GZMB## 12 6.251673e-192  3.763928 0.961 0.131 8.573544e-188       5     GNLY## 13 2.510362e-238  2.729243 0.844 0.011 3.442711e-234       6   FCER1A## 14  7.037034e-21  1.965168 1.000 0.513  9.650588e-17       6 HLA-DPB1## 15 2.592342e-186  4.952160 0.933 0.010 3.555138e-182       7      PF4## 16 7.813553e-118  5.889503 1.000 0.023 1.071551e-113       7     PPBP值得注意的是: The ROC test returns the 'classification power’ for any individual marker (ranging from 0 - random, to 1 - perfect).cluster1.markers <- FindMarkers(object = pbmc, ident.1 = 0, thresh.use = 0.25, test.use = "roc", only.pos = TRUE)同时,该包提供了一系列可视化方法来检查差异分析的结果的可靠性:VlnPlot (shows expression probability distributions across clusters)FeaturePlot (visualizes gene expression on a tSNE or PCA plot) are our most commonly used visualizationsJoyPlot, CellPlot, and DotPlotVlnPlot(object = pbmc, features.plot = c("MS4A1", "CD79A"))

# you can plot raw UMI counts as wellVlnPlot(object = pbmc, features.plot = c("NKG7", "PF4"), use.raw = TRUE, y.log = TRUE)

FeaturePlot(object = pbmc, features.plot = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"), cols.use = c("grey", "blue"), reduction.use = "tsne")DoHeatmap generates an expression heatmap for given cells and genes. In this case, we are plotting the top 20 markers (or all markers if less than 20) for each cluster.pbmc.markers %>% group_by(cluster) %>% top_n(10, avg_logFC) -> top10# setting slim.col.label to TRUE will print just the cluster IDS instead of every cell nameDoHeatmap(object = pbmc, genes.use = top10$gene, slim.col.label = TRUE, remove.key = TRUE)

Assigning cell type identity to clusters这个主要取决于生物学背景知识:Cluster IDMarkersCell Type0IL7RCD4 T cells1CD14, LYZCD14+ Monocytes2MS4A1B cells3CD8ACD8 T cells4FCGR3A, MS4A7FCGR3A+ Monocytes5GNLY, NKG7NK cells6FCER1A, CST3Dendritic Cells7PPBPMegakaryocytescurrent.cluster.ids <- c(0, 1, 2, 3, 4, 5, 6, 7)new.cluster.ids <- c("CD4 T cells", "CD14+ Monocytes", "B cells", "CD8 T cells", "FCGR3A+ Monocytes", "NK cells", "Dendritic cells", "Megakaryocytes")pbmc@ident <- plyr::mapvalues(x = pbmc@ident, from = current.cluster.ids, to = new.cluster.ids)TSNEPlot(object = pbmc, do.label = TRUE, pt.size = 0.5)

Further subdivisions within cell types# First lets stash our identities for laterpbmc <- StashIdent(object = pbmc, save.name = "ClusterNames_0.6")# Note that if you set save.snn=T above, you don't need to recalculate the SNN, and can simply put:# pbmc <- FindClusters(pbmc,resolution = 0.8)pbmc <- FindClusters(object = pbmc, reduction.type = "pca", dims.use = 1:10, resolution = 0.8, print.output = FALSE)# Demonstration of how to plot two tSNE plots side by side, and how to color points based on different criteriaplot1 <- TSNEPlot(object = pbmc, do.return = TRUE, no.legend = TRUE, do.label = TRUE)plot2 <- TSNEPlot(object = pbmc, do.return = TRUE, group.by = "ClusterNames_0.6", no.legend = TRUE, do.label = TRUE)plot_grid(plot1, plot2)

# Find discriminating markerstcell.markers <- FindMarkers(object = pbmc, ident.1 = 0, ident.2 = 1)# Most of the markers tend to be expressed in C1 (i.e. S100A4). However, we can see that CCR7 is upregulated in# C0, strongly indicating that we can differentiate memory from naive CD4 cells.# cols.use demarcates the color palette from low to high expressionFeaturePlot(object = pbmc, features.plot = c("S100A4", "CCR7"), cols.use = c("green", "blue"))

The memory/naive split is bit weak, and we would probably benefit from looking at more cells to see if this becomes more convincing. In the meantime, we can restore our old cluster identities for downstream processing.还有一个非常给力的用法,限于篇幅,就不介绍了,大家可以自行探索。后面还有一个10X的单细胞实战,用的就是这个包,敬请期待。

(0)

相关推荐

  • 单细胞Marker基因可示化包Nebulosa

    与传统的转录组测序相比,单细胞测序技术噪声很大,使得单细胞转录组数据包含大量的dropout事件(导致基因表达量为0或接近0),即使是一些标记(Marker)基因也有可能表达量很低.当在使用其对聚类的 ...

  • Seurat学习与使用(一)

    简介Seurat是一个r包,被设计用于单细胞rna-seq数据的细胞质控和分析.Seurat旨在使用户能够识别和解释单细胞转录组数据中的异质性来源,同时提供整合不同类型的单细胞数据的函数.目前Seur ...

  • 单细胞工具箱|Seurat官网标准流程

    学习单细胞转录组肯定先来一遍Seurat官网的标准流程. 数据来源于Peripheral Blood Mononuclear Cells (PBMC),共2700个单细胞, Illumina Next ...

  • 单细胞转录组3大R包之scater

    scater 这个R包很强大,是McCarthy et al. 2017 发表的,包含的功能有: Automated computation of QC metrics Transcript quan ...

  • 单细胞转录组3大R包之monocle2

    主要是针对单细胞转录组测序数据开发的,用来找不同细胞类型或者不同细胞状态的差异表达基因.分析起始是表达矩阵,作者推荐用比较老旧的Tophat+Cufflinks流程,或者RSEM, eXpress,S ...

  • 在Ubuntu下安装单细胞3大R包

    查看Ubuntu系统以及R版本 cat /etc/issue 通常来说,很多R包的安装对R版本是有要求的,比如BiocManager需要 R (≥ 3.5.0),但是并不需要最新版R语言. R到3.5 ...

  • 一些单细胞转录组R包的对象

    对象应该是很重要的,至少是在R语言里面 ExpressionSet Bioconductor的ExpressionSet是基石,多次讲解过,GEO数据库在R里面下载的就是这个对象. 通常不需要自己从头 ...

  • 10X单细胞转录组理论上有3个文件才能被读入R进行seurat分析

    我在单细胞天地教程:表达矩阵逆转为10X的标准输出3个文件,详细介绍过 10X文件的3个标准文件: 比如SRR7722939数据集里面,文件barcodes.tsv 和 genes.tsv,就是表达矩 ...

  • 把这个R包大卸八块

    本来应该这是一个很正常的学习过程,之前总结了一篇博文Bioconductor的质谱蛋白组学数据分析,对蛋白组学定量那块比较感兴趣,正好看到一个R包-MSstats,其可用来对DDA,SRM和DIA的结 ...

  • 多个单细胞转录组样本的数据整合之CCA-Seurat包

    单细胞水平的研究是仅次于NGS的一次生物信息学领域的革命,同样的随随便便发CNS的黄金时期也过去了,现在想发高分文章,拿多个病人的多个样本进行单细胞转录组测序是非常正常的,比如下面的: 发表在 Nat ...

  • 学习使用各种单细胞R包来处理数据

    号外:中秋节广州3天入门课程报名马上截止:(中秋节一起来学习!)全国巡讲第16站-广州(生信入门课加量不加价) 单细胞R包如过江之卿,这里只考讲解5个R包,分别是: scater,monocle,Se ...

  • 单细胞转录组的肿瘤研究3大应用方向等你来攻克

    还记得几年前前准备单细胞课程作业,查遍全网基本上找不到中文资料,甚至英文文献都少得可怜,虽然那个时候单细胞就已经显现出热点的趋势,大多数CNS之作,但是在癌症领域仅仅是6个癌症类型有单细胞转录组技术应 ...