你的2个10x数据可以整合公共数据库的2个一起分析

看到华大基因团队的课题组丢在GigaScience杂志的一篇文章:Single-cell transcriptomic landscape of nucleated cells in umbilical cord blood ,链接是:https://academic.oup.com/gigascience/article/8/5/giz047/5484799  ,该研究仅仅是2个志愿者的脐带血单细胞转录组而已,但是成功的跟10X官网的外周血单细胞转录组细胞分群进行对比描述,讲述了一个比较好的生物学故事!

  • To acquire a transcriptomic map of UCB cells at single-cell resolution, we collected samples of UCB from 2 healthy donors and isolated nucleated cells for single-cell RNA-seq using the 10 × Chromium platform.

  • After stringent quality control and filtering by multiple criteria (see Methods), transcriptomes of 7,852 and 9,785 single cells from the 2 UCB samples (UCB1 and UCB2) were acquired, detecting a mean of 1,270 and 1,460 genes per cell, respectively.

上面是单细胞转录组数据量的描述,是10X仪器的单细胞转录组数据,并没有看到表达矩阵上传到GEO数据库,但是在NCBI和EBI可以下载原始的fastq数据文件,ID是 PRJNA524398 。

To determine the unique cell subpopulations and the specific state of gene expression in UCB, we used the public single-cell transcriptomics dataset of PB cells for comparison. This dataset includes 2 independently generated libraries (PB1 and PB2), containing a total of 11,948 single-cell profiles of peripheral blood mononuclear cells (PBMCs) measuring 1,069 genes per cell on average. These are at a comparable level with those of the UCB data.

两个数据集的比较如下:

 

公共数据来源10X官网:https://support.10xgenomics.com/single-cell-gene-expression/datasets

Public single-cell gene expression datasets of PBMCs (PB1 and PB2) were generated from a sample from a single donor. In the present study, PB1 and PB2 correspond to Cell Ranger 2.0.1-processed “8k PBMCs from a healthy donor” and “4k PBMCs from a healthy donor,” respectively。

自有数据采用BGISEQ-500 sequencer,看起来这个文章是为了支撑他们的测序仪的准确性吧。

Purified DNA nanoballs were sequenced using the BGISEQ-500 sequencer, generating reads containing 16 base pairs of 10x™ barcodes, 10 base pairs of UMIs, and 100 base pairs of 3′ complementary DNA sequences. Each library was sequenced in 3 lanes, yielding ∼1.9 billion reads in total

数据上传到了两个地方:

  • The raw data reported in this study are deposited in the NCBI Sequence Read Archive under bioproject No. PRJNA524398
  • and in the CNGB Nucleotide Sequence Archive (CNSA) (CNSA: https://db.cngb.org/cnsa/) with accession No. CNP0000090.

其中作者测试了3个方法矫正技术误差,To minimize any technical variance that could lead to misinterpretation of the data, we rigorously tested 3 widely used algorithms for batch effect correction, namely, CCA, SVA, and MNN. Based on a quantitative evaluation of cell segregation in the tSNE space, MNN and CCA appeared comparable and effective for our datasets, although MNN scored marginally higher.

文章的描述如下:

To isolate biological variance from the interfering technical variances in the remaining data, we employed 3 independent computational methods, canonical correlation analysis (CCA) [19], surrogate variable analysis (SVA) [20], and mutual nearest neighbors (MNN) [21], to systemically correct the potential technical variance (Supplementary Fig. S2A–D).

最后选择了  mutual nearest neighbors (MNN)  方法,而不是我们一直推荐的CCA

Results indicated that the MNN algorithm most successfully eliminated the batch effect in the current dataset (Supplementary Fig. S2E and F). Thus, we proceeded to use MNN-corrected expression matrices for the Seurat pipeline and all subsequent analysis.

假如你的10X样品数据量也不够

试试看自己领域的公共数据库,虽然算不上多如牛毛,但基本上十几个公共数据集还是可以找到的,比较10X走入了寻常百姓家!

(0)

相关推荐