scRNA-seq: the Tung et al. (2017) paper from Yoav Gilad's lab at the University of Chicago

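The paper is "Batch effects and the effective design of single-cell gene expression studies". It covers a lot of work, and I am not sure why it ended up in SR (Scientific Reports). The authors ran single-cell RNA sequencing (scRNA-seq) on cells from three individuals from the HapMap project. Single cells were captured on the Fluidigm C1 platform, and the material was three human induced pluripotent stem cell (iPSC) lines from three Yoruba individuals (abbreviation: YRI).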

The C1 platform works in a 96-well format, so each set of 96 samples forms one batch.

Sequencing strategy

We added ERCC spike-in controls to each sample, and used 5-bp random sequence UMIs to allow for the direct quantification of mRNA molecule numbers.

We obtained an average of 6.3 ± 2.1 million sequencing reads per sample (range 0.4–11.2 million reads).

Download the raw FASTQ files at GEO record GSE77288.

Quality control of the single-cell expression data

Selecting true single cells

The first step is a visual inspection:

Based on that visual inspection, we flagged 21 samples that did not contain any cell, and 54 samples that contained more than one cell (across all batches).

Accounting for batch effects

This part is mostly a series of PCA-like clustering analyses and the corresponding plots.

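Using the spike-ins to correct for batch effects is not feasible. The spike-ins do vary significantly between batches, but they also vary significantly between cells, so technical and biological variation cannot be told apart.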

Cell filtering criteria

  • Only one cell observed per well.

  • At least 1,556,255 mapped reads.

  • Less than 36.4% unmapped reads.

  • Less than 3.2% ERCC reads.

  • More than 6,788 genes with at least one read.

After filtering, we maintained 564 high quality single cells (NA19098: 142, NA19101: 201, NA19239: 221).
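The five cell-level criteria above can be written down as a single predicate. The following is only a minimal sketch: the `WellQC` record and its field names are hypothetical, and the denominators of the two fractions (total reads for the unmapped fraction, mapped reads for the ERCC fraction) are my assumption, since the notes above do not state them.

```python
from dataclasses import dataclass


@dataclass
class WellQC:
    """Per-well QC summary. All field names are hypothetical."""
    cell_number: int          # cells seen in the well during visual inspection
    mapped_reads: int         # number of mapped reads
    unmapped_fraction: float  # unmapped / total reads (assumed denominator)
    ercc_fraction: float      # ERCC / mapped reads (assumed denominator)
    genes_detected: int       # genes with at least one read


def passes_cell_filter(w: WellQC) -> bool:
    """Apply the five thresholds quoted in the list above."""
    return (
        w.cell_number == 1                # only one cell observed per well
        and w.mapped_reads >= 1_556_255   # "at least" is inclusive
        and w.unmapped_fraction < 0.364   # less than 36.4% unmapped reads
        and w.ercc_fraction < 0.032       # less than 3.2% ERCC reads
        and w.genes_detected > 6_788      # "more than" is exclusive
    )
```

A well with two cells, or one that misses any single threshold, is rejected no matter how good its other metrics are.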

Gene filtering criteria

  • The quality control analyses were performed using all protein-coding genes (Ensembl GRCh37 release 82) with at least one observed read.

  • Using the high quality single cells, we further removed genes with low expression levels for downstream analyses. We removed all genes with a mean log2 cpm less than 2.

  • We also removed genes with molecule counts larger than 1,024 for the correction of collision probability.

    In the end we kept 13,058 endogenous genes and 48 ERCC spike-in genes.
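The 1,024 threshold lines up with the UMI design: a 5-bp UMI has 4^5 = 1,024 possible sequences, so that is the most distinct UMIs that can be told apart. The sketch below shows the two gene-level filters plus a commonly used collision correction. Several things here are my assumptions, not taken from the notes above: the voom-style log2 cpm with offsets 0.5 and 1, reading the 1,024 rule as "any cell exceeds 1,024", and the correction formula -K * ln(1 - m / K) itself. All function names are hypothetical.

```python
import math

UMI_LENGTH = 5
UMI_SPACE = 4 ** UMI_LENGTH  # 1,024 possible 5-bp UMI sequences


def mean_log2_cpm(counts, library_sizes):
    """Mean log2 counts-per-million of one gene across cells.

    Uses offsets of 0.5 and 1 to avoid log2(0); this variant is an assumption.
    """
    values = [
        math.log2((c + 0.5) / (n + 1.0) * 1e6)
        for c, n in zip(counts, library_sizes)
    ]
    return sum(values) / len(values)


def keep_gene(counts, molecule_counts, library_sizes, min_mean_log2_cpm=2.0):
    """Return True if the gene survives both filters from the list above."""
    if mean_log2_cpm(counts, library_sizes) < min_mean_log2_cpm:
        return False  # expression too low
    # Assumed reading: drop the gene if any cell has more than 1,024 molecules.
    return max(molecule_counts) <= UMI_SPACE


def collision_corrected(observed, umi_space=UMI_SPACE):
    """Estimate the true molecule number from the observed distinct UMIs.

    Assumed formula: -K * ln(1 - m / K). It is undefined once m reaches K.
    """
    if not 0 <= observed < umi_space:
        raise ValueError("observed count must be in [0, umi_space)")
    return -umi_space * math.log(1.0 - observed / umi_space)
```

Under this formula the correction is negligible for small counts and grows quickly near saturation: 512 observed UMIs would correspond to roughly 710 molecules.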

Data processing pipeline

  • To assess read quality, we ran FastQC and observed a decrease in base quality at the 3′ end of the reads.

  • Thus we removed low quality bases from the 3′ end using sickle with default settings.

  • To handle the UMI sequences at the 5′ end of each read, we used umitools to find all reads with a UMI of the pattern NNNNNGGG (reads without UMIs were discarded).

  • We then mapped reads to human genome hg19 (only including chromosomes 1–22, X, and Y, plus the ERCC sequences) with Subjunc, discarding non-uniquely mapped reads (option -u).

  • To obtain gene-level counts, we assigned reads to protein-coding genes (Ensembl GRCh37 release 82) and the ERCC spike-in genes using featureCounts.
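The UMI step in the pipeline above comes down to a pattern match at the 5′ end of each read: five random bases followed by GGG. The sketch below illustrates that idea in plain Python. It is not the umitools implementation. The function names, the choice to accept only A/C/G/T in the UMI, the trimming of the GGG, and the `_UMI_` read-name suffix are all my assumptions.

```python
import re

# Five UMI bases followed by GGG at the very start of the read (NNNNNGGG).
# Assumption: an uncalled base (N) inside the UMI disqualifies the read.
UMI_PATTERN = re.compile(r"^([ACGT]{5})GGG")


def extract_umi(sequence, quality):
    """Return (umi, trimmed_sequence, trimmed_quality), or None if no UMI."""
    match = UMI_PATTERN.match(sequence)
    if match is None:
        return None
    cut = match.end()  # first position after the UMI and the GGG
    return match.group(1), sequence[cut:], quality[cut:]


def filter_fastq_records(records):
    """Yield trimmed (name, sequence, quality) records that carry a UMI.

    `records` is any iterable of (name, sequence, quality) tuples.
    Reads without a UMI are discarded, as described in the list above.
    """
    for name, sequence, quality in records:
        hit = extract_umi(sequence, quality)
        if hit is None:
            continue
        umi, trimmed_seq, trimmed_qual = hit
        # Hypothetical naming scheme: carry the UMI along in the read name.
        yield f"{name}_UMI_{umi}", trimmed_seq, trimmed_qual
```

Keeping the UMI in the read name lets it survive mapping, so that reads with the same UMI on the same gene can later be collapsed into one molecule.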

Subsampling to assess the effect of sequencing depth on scRNA-seq

This looks somewhat involved.
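To get a feel for the idea, here is a minimal sketch of subsampling at the count level: draw a fixed number of reads without replacement from one cell's gene counts and see how many genes are still detected. This is a simplified stand-in that I chose for illustration, not the paper's procedure, and the function names are hypothetical.

```python
import random


def subsample_counts(gene_counts, depth, seed=0):
    """Draw `depth` reads without replacement from a {gene: count} dict."""
    total = sum(gene_counts.values())
    if depth > total:
        raise ValueError("depth exceeds the number of available reads")
    rng = random.Random(seed)
    # One list entry per read: fine for a sketch, memory-heavy for real data.
    pool = [gene for gene, count in gene_counts.items() for _ in range(count)]
    subsampled = {}
    for gene in rng.sample(pool, depth):
        subsampled[gene] = subsampled.get(gene, 0) + 1
    return subsampled


def genes_detected(counts):
    """Number of genes with at least one read."""
    return sum(1 for count in counts.values() if count > 0)
```

Repeating this over a range of depths and plotting `genes_detected` against depth gives a saturation curve for each cell.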

Removing batch effects

This also seems rather involved.
