单细胞转录组3大R包之scater
scater 这个R包很强大,是McCarthy et al. 2017 发表的,包含的功能有:Automated computation of QC metricsTranscript quantification from read data with pseudo-alignmentData format standardisationRich visualizations for exploratory analysisSeamless integration into the Bioconductor universeSimple normalisation methodsR包工作流程图
S4对象主要是基于 SCESet 对象来进行下游分析,跟ExpressionSet对象类似,也是常见的3个组成:exprs, a numeric matrix of expression values, where rows are features, and columns are cellsphenoData, an AnnotatedDataFrame object, where rows are cells, and columns are cell attributes (such as cell type, culture condition, day captured, etc.)featureData, an AnnotatedDataFrame object, where rows are features (e.g. genes), and columns are feature attributes, such as biotype, gc content, etc.主要就是读取scRNA上游分析处理得到的表达矩阵,加上每个样本的描述信息,形成矩阵之后。对样本进行过滤,然后对基因进行过滤。针对过滤后的表达矩阵进行各种分类的可视化。最新的文档如下:HTMLR ScriptAn introduction to the scater packageHTMLR ScriptData visualisation methods in scaterHTMLR ScriptExpression quantification and importHTMLR ScriptQuality control with scaterHTMLR ScriptTransition from SCESet to SingleCellExperimentPDF其GitHub上面专门开设了介绍它用法的课程:http://hemberg-lab.github.io/scRNA.seq.course/测试数据suppressPackageStartupMessages(library(scater))data("sc_example_counts")data("sc_example_cell_info")example_sce <- SingleCellExperiment(assays = list(counts = sc_example_counts),colData = sc_example_cell_info)exprs(example_sce) <- log2(calculateCPM(example_sce, use.size.factors = FALSE) + 1)keep_feature <- rowSums(exprs(example_sce) > 0) > 0example_sce <- example_sce[keep_feature,]example_sce <- calculateQCMetrics(example_sce,feature_controls = list(eg = 1:40))#scater_gui(example_sce)但是真的非常好用,所有的可视化都集中在了 scater_gui 这个函数产生的shiny网页里面:plotScater: a plot method exists for SingleCellExperiment objects, which gives an overview of expression across cells.plotQC: various methods are available for producing QC diagnostic plots.plotPCA: produce a principal components plot for the cells.plotTSNE: produce a t-distributed stochastic neighbour embedding (reduced dimension) plot for the cells.plotDiffusionMap: produce a diffusion map (reduced dimension) plot for the cells.plotMDS: produce a multi-dimensional scaling plot for the cells.plotReducedDim: plot a reduced-dimension representation of the cells.plotExpression: plot expression levels for a defined set of features.plotPlatePosition: plot cells in their position on a plate, coloured by cell metadata and QC metrics or feature expression level.plotColData: plot cell metadata and QC metrics.plotRowData: plot feature metadata and QC metrics.可以充分的探索自己的数据,随便看一个可视化函数的结果:## ----plot-expression, eval=TRUE--------------------------------------------plotExpression(example_sce, rownames(example_sce)[1:6],x = "Mutation_Status", exprs_values = "exprs",colour = "Treatment")
详细的QC做QC要结合上面的可视化步骤,所有没办法自动化,只能先可视化,肉眼分辨一下哪些样本或者基因数据是需要舍弃的。library(knitr)opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, dev = 'png')library(ggplot2)theme_set(theme_bw(12))## ----quickstart-load-data, message=FALSE, warning=FALSE--------------------suppressPackageStartupMessages(library(scater))data("sc_example_counts")data("sc_example_cell_info")## ----quickstart-make-sce, results='hide'-----------------------------------gene_df <- DataFrame(Gene = rownames(sc_example_counts))rownames(gene_df) <- gene_df$Geneexample_sce <- SingleCellExperiment(assays = list(counts = sc_example_counts),colData = sc_example_cell_info,rowData = gene_df)example_sce <- normalise(example_sce)## Warning in .local(object, ...): using library sizes as size factors## ----quickstart-add-exprs, results='hide'----------------------------------exprs(example_sce) <- log2(calculateCPM(example_sce, use.size.factors = FALSE) + 1)## ----filter-no-exprs-------------------------------------------------------keep_feature <- rowSums(exprs(example_sce) > 0) > 0example_sce <- example_sce[keep_feature,]example_sceset <- calculateQCMetrics(example_sce, feature_controls = list(eg = 1:40))colnames(colData(example_sceset))## [1] "Cell"## [2] "Mutation_Status"## [3] "Cell_Cycle"## [4] "Treatment"## [5] "total_features"## [6] "log10_total_features"## [7] "total_counts"## [8] "log10_total_counts"## [9] "pct_counts_top_50_features"## [10] "pct_counts_top_100_features"## [11] "pct_counts_top_200_features"## [12] "pct_counts_top_500_features"## [13] "total_features_endogenous"## [14] "log10_total_features_endogenous"## [15] "total_counts_endogenous"## [16] "log10_total_counts_endogenous"## [17] "pct_counts_endogenous"## [18] "pct_counts_top_50_features_endogenous"## [19] "pct_counts_top_100_features_endogenous"## [20] "pct_counts_top_200_features_endogenous"## [21] "pct_counts_top_500_features_endogenous"## [22] "total_features_feature_control"## [23] "log10_total_features_feature_control"## [24] "total_counts_feature_control"## [25] "log10_total_counts_feature_control"## [26] "pct_counts_feature_control"## [27] "total_features_eg"## [28] "log10_total_features_eg"## [29] "total_counts_eg"## [30] "log10_total_counts_eg"## [31] "pct_counts_eg"## [32] "is_cell_control"colnames(rowData(example_sceset))## [1] "Gene" "is_feature_control"## [3] "is_feature_control_eg" "mean_counts"## [5] "log10_mean_counts" "rank_counts"## [7] "n_cells_counts" "pct_dropout_counts"## [9] "total_counts" "log10_total_counts"首先是基于样本的过滤,用 colData(object) 可以查看各个样本统计情况total_counts: total number of counts for the cell (aka 'library size’)log10_total_counts: total_counts on the log10-scaletotal_features: the number of features for the cell that have expression above the detection limit (default detection limit is zero)filter_on_total_counts: would this cell be filtered out based on its log10-total_counts being (by default) more than 5 median absolute deviations from the median log10-total_counts for the dataset?filter_on_total_features: would this cell be filtered out based on its total_features being (by default) more than 5 median absolute deviations from the median total_features for the dataset?counts_feature_controls: total number of counts for the cell that come from (a set of user-defined) control features. Defaults to zero if no control features are indicated.counts_endogenous_features: total number of counts for the cell that come from endogenous features (i.e. not control features). Defaults to total_counts if no control features are indicated.log10_counts_feature_controls: total number of counts from control features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated.log10_counts_endogenous_features: total number of counts from endogenous features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated.n_detected_feature_controls: number of defined feature controls that have expression greater than the threshold defined in the object. *pct_counts_feature_controls: percentage of all counts that come from the defined control features. Defaults to zero if no control features are defined.然后是基于基因的过滤,用 rowData(object) 可以查看各个基因统计情况mean_exprs: the mean expression level of the gene/feature.exprs_rank: the rank of the feature’s expression level in the cell.total_feature_counts: the total number of counts mapped to that feature across all cells.log10_total_feature_counts: total feature counts on the log10-scale.pct_total_counts: the percentage of all counts that are accounted for by the counts mapping to the feature.is_feature_control: is the feature a control feature? Default is FALSE unless control features are defined by the user.n_cells_exprs: the number of cells for which the expression level of the feature is above the detection limit (default detection limit is zero).scater一站式过滤低质量样本scater包自己提供了一个基于PCA的QC标准,不需要自己根据文库大小,覆盖的基因数量,外源的ERCC spike-ins 含量以及线粒体DNA含量来进行人工过滤。默认的筛选条件如下:pct_counts_top100featurestotal_featurespct_counts_feature_controlsn_detected_feature_controlslog10_counts_endogenous_featureslog10_counts_feature_controls一站式QC函数如下:dat_pca <- scater::plotPCA(dat_qc,size_by = "total_features",shape_by = "use",pca_data_input = "pdata",detect_outliers = TRUE,return_SCESet = TRUE)还有更详细的教程,需要看https://www.bioconductor.org/help/workflows/simpleSingleCell/http://hemberg-lab.github.io/scRNA.seq.course/index.htmlsessionInfo()过滤只是它最基本的工具,它作为单细胞转录组3大R包,功能肯定是非常全面的,比如前面我们讲解的normalization,DEG, features selection,cluster,它都手到擒来,只不过是包装的是其它R包的函数。