单细胞转录组3大R包之scater

scater 这个R包很强大,是McCarthy et al. 2017 发表的,包含的功能有:Automated computation of QC metricsTranscript quantification from read data with pseudo-alignmentData format standardisationRich visualizations for exploratory analysisSeamless integration into the Bioconductor universeSimple normalisation methodsR包工作流程图

S4对象主要是基于 SCESet 对象来进行下游分析,跟ExpressionSet对象类似,也是常见的3个组成:exprs, a numeric matrix of expression values, where rows are features, and columns are cellsphenoData, an AnnotatedDataFrame object, where rows are cells, and columns are cell attributes (such as cell type, culture condition, day captured, etc.)featureData, an AnnotatedDataFrame object, where rows are features (e.g. genes), and columns are feature attributes, such as biotype, gc content, etc.主要就是读取scRNA上游分析处理得到的表达矩阵,加上每个样本的描述信息,形成矩阵之后。对样本进行过滤,然后对基因进行过滤。针对过滤后的表达矩阵进行各种分类的可视化。最新的文档如下:HTMLR ScriptAn introduction to the scater packageHTMLR ScriptData visualisation methods in scaterHTMLR ScriptExpression quantification and importHTMLR ScriptQuality control with scaterHTMLR ScriptTransition from SCESet to SingleCellExperimentPDF其GitHub上面专门开设了介绍它用法的课程:http://hemberg-lab.github.io/scRNA.seq.course/测试数据suppressPackageStartupMessages(library(scater))data("sc_example_counts")data("sc_example_cell_info")example_sce <- SingleCellExperiment(assays = list(counts = sc_example_counts),colData = sc_example_cell_info)exprs(example_sce) <- log2(calculateCPM(example_sce, use.size.factors = FALSE) + 1)keep_feature <- rowSums(exprs(example_sce) > 0) > 0example_sce <- example_sce[keep_feature,]example_sce <- calculateQCMetrics(example_sce,feature_controls = list(eg = 1:40))#scater_gui(example_sce)但是真的非常好用,所有的可视化都集中在了 scater_gui 这个函数产生的shiny网页里面:plotScater: a plot method exists for SingleCellExperiment objects, which gives an overview of expression across cells.plotQC: various methods are available for producing QC diagnostic plots.plotPCA: produce a principal components plot for the cells.plotTSNE: produce a t-distributed stochastic neighbour embedding (reduced dimension) plot for the cells.plotDiffusionMap: produce a diffusion map (reduced dimension) plot for the cells.plotMDS: produce a multi-dimensional scaling plot for the cells.plotReducedDim: plot a reduced-dimension representation of the cells.plotExpression: plot expression levels for a defined set of features.plotPlatePosition: plot cells in their position on a plate, coloured by cell metadata and QC metrics or feature expression level.plotColData: plot cell metadata and QC metrics.plotRowData: plot feature metadata and QC metrics.可以充分的探索自己的数据,随便看一个可视化函数的结果:## ----plot-expression, eval=TRUE--------------------------------------------plotExpression(example_sce, rownames(example_sce)[1:6],x = "Mutation_Status", exprs_values = "exprs",colour = "Treatment")

详细的QC做QC要结合上面的可视化步骤,所有没办法自动化,只能先可视化,肉眼分辨一下哪些样本或者基因数据是需要舍弃的。library(knitr)opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, dev = 'png')library(ggplot2)theme_set(theme_bw(12))## ----quickstart-load-data, message=FALSE, warning=FALSE--------------------suppressPackageStartupMessages(library(scater))data("sc_example_counts")data("sc_example_cell_info")## ----quickstart-make-sce, results='hide'-----------------------------------gene_df <- DataFrame(Gene = rownames(sc_example_counts))rownames(gene_df) <- gene_df$Geneexample_sce <- SingleCellExperiment(assays = list(counts = sc_example_counts),colData = sc_example_cell_info,rowData = gene_df)example_sce <- normalise(example_sce)## Warning in .local(object, ...): using library sizes as size factors## ----quickstart-add-exprs, results='hide'----------------------------------exprs(example_sce) <- log2(calculateCPM(example_sce, use.size.factors = FALSE) + 1)## ----filter-no-exprs-------------------------------------------------------keep_feature <- rowSums(exprs(example_sce) > 0) > 0example_sce <- example_sce[keep_feature,]example_sceset <- calculateQCMetrics(example_sce, feature_controls = list(eg = 1:40))colnames(colData(example_sceset))##  [1] "Cell"##  [2] "Mutation_Status"##  [3] "Cell_Cycle"##  [4] "Treatment"##  [5] "total_features"##  [6] "log10_total_features"##  [7] "total_counts"##  [8] "log10_total_counts"##  [9] "pct_counts_top_50_features"## [10] "pct_counts_top_100_features"## [11] "pct_counts_top_200_features"## [12] "pct_counts_top_500_features"## [13] "total_features_endogenous"## [14] "log10_total_features_endogenous"## [15] "total_counts_endogenous"## [16] "log10_total_counts_endogenous"## [17] "pct_counts_endogenous"## [18] "pct_counts_top_50_features_endogenous"## [19] "pct_counts_top_100_features_endogenous"## [20] "pct_counts_top_200_features_endogenous"## [21] "pct_counts_top_500_features_endogenous"## [22] "total_features_feature_control"## [23] "log10_total_features_feature_control"## [24] "total_counts_feature_control"## [25] "log10_total_counts_feature_control"## [26] "pct_counts_feature_control"## [27] "total_features_eg"## [28] "log10_total_features_eg"## [29] "total_counts_eg"## [30] "log10_total_counts_eg"## [31] "pct_counts_eg"## [32] "is_cell_control"colnames(rowData(example_sceset))##  [1] "Gene"                  "is_feature_control"##  [3] "is_feature_control_eg" "mean_counts"##  [5] "log10_mean_counts"     "rank_counts"##  [7] "n_cells_counts"        "pct_dropout_counts"##  [9] "total_counts"          "log10_total_counts"首先是基于样本的过滤,用 colData(object) 可以查看各个样本统计情况total_counts: total number of counts for the cell (aka 'library size’)log10_total_counts: total_counts on the log10-scaletotal_features: the number of features for the cell that have expression above the detection limit (default detection limit is zero)filter_on_total_counts: would this cell be filtered out based on its log10-total_counts being (by default) more than 5 median absolute deviations from the median log10-total_counts for the dataset?filter_on_total_features: would this cell be filtered out based on its total_features being (by default) more than 5 median absolute deviations from the median total_features for the dataset?counts_feature_controls: total number of counts for the cell that come from (a set of user-defined) control features. Defaults to zero if no control features are indicated.counts_endogenous_features: total number of counts for the cell that come from endogenous features (i.e. not control features). Defaults to total_counts if no control features are indicated.log10_counts_feature_controls: total number of counts from control features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated.log10_counts_endogenous_features: total number of counts from endogenous features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated.n_detected_feature_controls: number of defined feature controls that have expression greater than the threshold defined in the object. *pct_counts_feature_controls: percentage of all counts that come from the defined control features. Defaults to zero if no control features are defined.然后是基于基因的过滤,用 rowData(object) 可以查看各个基因统计情况mean_exprs: the mean expression level of the gene/feature.exprs_rank: the rank of the feature’s expression level in the cell.total_feature_counts: the total number of counts mapped to that feature across all cells.log10_total_feature_counts: total feature counts on the log10-scale.pct_total_counts: the percentage of all counts that are accounted for by the counts mapping to the feature.is_feature_control: is the feature a control feature? Default is FALSE unless control features are defined by the user.n_cells_exprs: the number of cells for which the expression level of the feature is above the detection limit (default detection limit is zero).scater一站式过滤低质量样本scater包自己提供了一个基于PCA的QC标准,不需要自己根据文库大小,覆盖的基因数量,外源的ERCC spike-ins 含量以及线粒体DNA含量来进行人工过滤。默认的筛选条件如下:pct_counts_top100featurestotal_featurespct_counts_feature_controlsn_detected_feature_controlslog10_counts_endogenous_featureslog10_counts_feature_controls一站式QC函数如下:dat_pca <- scater::plotPCA(dat_qc,size_by = "total_features",shape_by = "use",pca_data_input = "pdata",detect_outliers = TRUE,return_SCESet = TRUE)还有更详细的教程,需要看https://www.bioconductor.org/help/workflows/simpleSingleCell/http://hemberg-lab.github.io/scRNA.seq.course/index.htmlsessionInfo()过滤只是它最基本的工具,它作为单细胞转录组3大R包,功能肯定是非常全面的,比如前面我们讲解的normalization,DEG, features selection,cluster,它都手到擒来,只不过是包装的是其它R包的函数。

(0)

相关推荐

  • 单细胞RNA

    一.单细胞single cell RNA-seq简介 1.Bulk RNA-seq(大量RNA-seq) Measures the average expression level for each ...

  • 单细胞转录组3大R包之Seurat

    牛津大学的Rahul Satija等开发的Seurat,最早公布在Nature biotechnology, 2015,文章是: Spatial reconstruction of single-ce ...

  • 单细胞转录组3大R包之monocle2

    主要是针对单细胞转录组测序数据开发的,用来找不同细胞类型或者不同细胞状态的差异表达基因.分析起始是表达矩阵,作者推荐用比较老旧的Tophat+Cufflinks流程,或者RSEM, eXpress,S ...

  • 在Ubuntu下安装单细胞3大R包

    查看Ubuntu系统以及R版本 cat /etc/issue 通常来说,很多R包的安装对R版本是有要求的,比如BiocManager需要 R (≥ 3.5.0),但是并不需要最新版R语言. R到3.5 ...

  • 一些单细胞转录组R包的对象

    对象应该是很重要的,至少是在R语言里面 ExpressionSet Bioconductor的ExpressionSet是基石,多次讲解过,GEO数据库在R里面下载的就是这个对象. 通常不需要自己从头 ...

  • 把这个R包大卸八块

    本来应该这是一个很正常的学习过程,之前总结了一篇博文Bioconductor的质谱蛋白组学数据分析,对蛋白组学定量那块比较感兴趣,正好看到一个R包-MSstats,其可用来对DDA,SRM和DIA的结 ...

  • 多个单细胞转录组样本的数据整合之CCA-Seurat包

    单细胞水平的研究是仅次于NGS的一次生物信息学领域的革命,同样的随随便便发CNS的黄金时期也过去了,现在想发高分文章,拿多个病人的多个样本进行单细胞转录组测序是非常正常的,比如下面的: 发表在 Nat ...

  • 学习使用各种单细胞R包来处理数据

    号外:中秋节广州3天入门课程报名马上截止:(中秋节一起来学习!)全国巡讲第16站-广州(生信入门课加量不加价) 单细胞R包如过江之卿,这里只考讲解5个R包,分别是: scater,monocle,Se ...

  • 单细胞转录组的肿瘤研究3大应用方向等你来攻克

    还记得几年前前准备单细胞课程作业,查遍全网基本上找不到中文资料,甚至英文文献都少得可怜,虽然那个时候单细胞就已经显现出热点的趋势,大多数CNS之作,但是在癌症领域仅仅是6个癌症类型有单细胞转录组技术应 ...

  • TCGA(转录组)差异分析三大R包及其结果对比

    最近我们最优秀的R语言讲师小洁也开启了TCGA知识库打卡之旅,分享一下她其中一个学习成果,TCGA(转录组)差异分析三大R包及其结果对比. 如果你跟着她的教程学会了相关分析,可以尝试完成一个学徒作业: ...