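Introduction to UCSCXenaTools
UCSCXenaTools is an R client for downloading data from the UCSC Xena platform. It is recommended in the official documentation at https://ucsc-xena.gitbook.io/project/overview-of-features/download-data.
Installation
Install the stable release from CRAN: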
install.packages("UCSCXenaTools")
Install the development version from GitHub:
# install.packages("remotes")
remotes::install_github("ropensci/UCSCXenaTools")
To make sure the package vignettes are built locally, add two extra options:
remotes::install_github("ropensci/UCSCXenaTools", build_vignettes = TRUE, dependencies = TRUE)
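If you run into a problem or find a bug, please open a GitHub issue.
Data Hub List
The UCSC Xena platform organizes its data into data hubs by source. (Almost) all datasets can be found at https://xenabrowser.net/datapages/.
UCSCXenaTools currently supports 10 data hubs: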
UCSC Public Hub: https://ucscpublic.xenahubs.net
TCGA Hub: https://tcga.xenahubs.net
GDC Xena Hub: https://gdc.xenahubs.net
ICGC Xena Hub: https://icgc.xenahubs.net
Pan-Cancer Atlas Hub: https://pancanatlas.xenahubs.net
GA4GH (TOIL) Hub: https://toil.xenahubs.net
Treehouse Hub: https://xena.treehouse.gi.ucsc.edu
PCAWG Hub: https://pcawg.xenahubs.net
ATAC-seq Hub: https://atacseq.xenahubs.net
Single Cell Xena Hub: https://singlecell.xenahubs.net
If a data hub URL changes or a new data hub appears, please contact me by email at w_shixiang@163.com or through a GitHub issue.
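Usage
I divide the standard workflow supported by the package into 5 steps, each implemented by its own function:
Generate a XenaHub object: XenaGenerate()
Filter the datasets: XenaFilter()
Query the data: XenaQuery()
Download the data: XenaDownload()
Import the data: XenaPrepare()
These functions can be chained with the pipe operator %>%.
The workflow is demonstrated below by downloading lung cancer clinical data from the TCGA hub.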
The XenaData data frame
UCSCXenaTools uses its built-in dataset XenaData to help generate XenaHub objects. This dataset records information about all currently available datasets.
library(UCSCXenaTools)
data(XenaData)
head(XenaData)
#> # A tibble: 6 x 17
#> XenaHosts XenaHostNames XenaCohorts XenaDatasets SampleCount DataSubtype
#> <chr> <chr> <chr> <chr> <int> <chr>
#> 1 https://… publicHub Breast Can… ucsfNeve_pu… 51 gene expre…
#> 2 https://… publicHub Breast Can… ucsfNeve_pu… 57 phenotype
#> 3 https://… publicHub Glioma (Ko… kotliarov20… 194 copy number
#> 4 https://… publicHub Glioma (Ko… kotliarov20… 194 phenotype
#> 5 https://… publicHub Lung Cance… weir2007_pu… 383 copy number
#> 6 https://… publicHub Lung Cance… weir2007_pu… 383 phenotype
#> # … with 11 more variables: Label <chr>, Type <chr>,
#> # AnatomicalOrigin <chr>, SampleType <chr>, Tags <chr>, ProbeMap <chr>,
#> # LongTitle <chr>, Citation <chr>, Version <chr>, Unit <chr>,
#> # Platform <chr>
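Because XenaData is an ordinary data frame, it can be explored with base R before building a XenaHub object. The sketch below uses only the column names visible in the output above; the variable names hub_counts and tcga_clinical are illustrative, and the counts you see depend on the installed package version.

```r
library(UCSCXenaTools)
data(XenaData)

# Count how many datasets each data hub contributes
hub_counts <- table(XenaData$XenaHostNames)
print(hub_counts)

# Keep TCGA hub rows whose dataset name mentions "clinical"
tcga_clinical <- subset(
  XenaData,
  XenaHostNames == "tcgaHub" & grepl("clinical", XenaDatasets)
)

# Show a few identifying columns for the matches
print(head(tcga_clinical[, c("XenaCohorts", "XenaDatasets", "SampleCount")]))
```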
Workflow
Generate the object and filter the datasets:
# The options in XenaFilter function support Regular Expression
XenaGenerate(subset = XenaHostNames=="tcgaHub") %>%
XenaFilter(filterDatasets = "clinical") %>%
XenaFilter(filterDatasets = "LUAD|LUSC|LUNG") -> df_todo
df_todo
#> class: XenaHub
#> hosts():
#> https://tcga.xenahubs.net
#> cohorts() (3 total):
#> TCGA Lung Cancer (LUNG)
#> TCGA Lung Adenocarcinoma (LUAD)
#> TCGA Lung Squamous Cell Carcinoma (LUSC)
#> datasets() (3 total):
#> TCGA.LUNG.sampleMap/LUNG_clinicalMatrix
#> TCGA.LUAD.sampleMap/LUAD_clinicalMatrix
#> TCGA.LUSC.sampleMap/LUSC_clinicalMatrix
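Since the filter patterns are regular expressions, the effect of chaining two filters can be approximated with base R's grepl(). This is only a rough sketch of the idea, not the package's implementation; the dataset names are a small hand-picked sample taken from the outputs in this post.

```r
# Hand-picked sample of dataset names taken from the outputs in this post
dataset_names <- c(
  "TCGA.LUNG.sampleMap/LUNG_clinicalMatrix",
  "TCGA.LUAD.sampleMap/LUAD_clinicalMatrix",
  "TCGA.LUSC.sampleMap/LUSC_clinicalMatrix",
  "TCGA.LUNG.sampleMap/HumanMethylation450",
  "TCGA.LAML.sampleMap/mutation_wustl"
)

# First filter: keep names that match "clinical"
step1 <- dataset_names[grepl("clinical", dataset_names)]

# Second filter: "|" means "or", so keep names matching LUAD, LUSC, or LUNG
step2 <- step1[grepl("LUAD|LUSC|LUNG", step1)]

print(step2)
```

Each filter narrows the result of the previous one, which is why the two XenaFilter() calls in the pipeline behave like a logical AND of the two patterns.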
Sometimes we only know a few keywords. XenaScan() scans every column of XenaData row by row for a pattern.
x1 = XenaScan(pattern = 'Blood')
x2 = XenaScan(pattern = 'LUNG', ignore.case = FALSE)
x1 %>%
XenaGenerate()
#> class: XenaHub
#> hosts():
#> https://ucscpublic.xenahubs.net
#> https://tcga.xenahubs.net
#> cohorts() (6 total):
#> Connectivity Map
#> TARGET Acute Lymphoblastic Leukemia
#> Pediatric tumor (Khan)
#> Acute lymphoblastic leukemia (Mullighan 2008)
#> TCGA Pan-Cancer (PANCAN)
#> TCGA Acute Myeloid Leukemia (LAML)
#> datasets() (34 total):
#> cmap/rankMatrix_reverse
#> TARGET_ALL/TARGETcnv_genomicMatrix
#> TARGET_ALL/TARGETexp_genomicMatrix
#> ...
#> TCGA.LAML.sampleMap/mutation_wustl
#> TCGA.LAML.sampleMap/Pathway_Paradigm_RNASeq_And_Copy_Number
x2 %>%
XenaGenerate()
#> class: XenaHub
#> hosts():
#> https://tcga.xenahubs.net
#> cohorts() (1 total):
#> TCGA Lung Cancer (LUNG)
#> datasets() (13 total):
#> TCGA.LUNG.sampleMap/HumanMethylation27
#> TCGA.LUNG.sampleMap/HumanMethylation450
#> TCGA.LUNG.sampleMap/Gistic2_CopyNumber_Gistic2_all_data_by_genes
#> ...
#> TCGA.LUNG.sampleMap/HiSeqV2_exon
#> TCGA.LUNG.sampleMap/AgilentG4502A_07_3
Query and download:
XenaQuery(df_todo) %>%
XenaDownload() -> xe_download
#> This will check url status, please be patient.
#> All downloaded files will under directory /var/folders/mx/rfkl27z90c96wbmn3_kjk8c80000gn/T//RtmpNjqKIQ.
#> The 'trans_slash' option is FALSE, keep same directory structure as Xena.
#> Creating directories for datasets...
#> Downloading TCGA.LUNG.sampleMap/LUNG_clinicalMatrix.gz
#> Downloading TCGA.LUAD.sampleMap/LUAD_clinicalMatrix.gz
#> Downloading TCGA.LUSC.sampleMap/LUSC_clinicalMatrix.gz
Import into R:
cli = XenaPrepare(xe_download)
class(cli)
#> [1] "list"
names(cli)
#> [1] "LUNG_clinicalMatrix.gz" "LUAD_clinicalMatrix.gz"
#> [3] "LUSC_clinicalMatrix.gz"
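As the output shows, the result here is a named list with one element per downloaded file. A minimal sketch of working with such a list follows. It uses a toy list (cli_toy, with made-up sample IDs and columns) in place of the real download so that it runs offline. Real clinical matrices usually have different columns per cohort, so the rbind() step is only an assumption that holds for this toy data.

```r
# Toy stand-in for the list returned by XenaPrepare(); values are made up
cli_toy <- list(
  "LUAD_clinicalMatrix.gz" = data.frame(
    sampleID = c("s1", "s2"),
    gender = c("MALE", "FEMALE")
  ),
  "LUSC_clinicalMatrix.gz" = data.frame(
    sampleID = "s3",
    gender = "MALE"
  )
)

# Drop the ".gz" suffix so the names are easier to use as labels
names(cli_toy) <- sub("\\.gz$", "", names(cli_toy))

# Number of rows in each table
print(sapply(cli_toy, nrow))

# Tag each table with its source name, then stack them into one data frame
tagged <- Map(function(df, nm) cbind(source = nm, df), cli_toy, names(cli_toy))
combined <- do.call(rbind, tagged)
print(combined)
```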
Browse datasets
Create two XenaHub objects:
to_browse: contains 1 cohort and 1 dataset
to_browse2: contains 2 cohorts and 2 datasets
XenaGenerate(subset = XenaHostNames=="tcgaHub") %>%
XenaFilter(filterDatasets = "clinical") %>%
XenaFilter(filterDatasets = "LUAD") -> to_browse
to_browse
#> class: XenaHub
#> hosts():
#> https://tcga.xenahubs.net
#> cohorts() (1 total):
#> TCGA Lung Adenocarcinoma (LUAD)
#> datasets() (1 total):
#> TCGA.LUAD.sampleMap/LUAD_clinicalMatrix
XenaGenerate(subset = XenaHostNames=="tcgaHub") %>%
XenaFilter(filterDatasets = "clinical") %>%
XenaFilter(filterDatasets = "LUAD|LUSC") -> to_browse2
to_browse2
#> class: XenaHub
#> hosts():
#> https://tcga.xenahubs.net
#> cohorts() (2 total):
#> TCGA Lung Adenocarcinoma (LUAD)
#> TCGA Lung Squamous Cell Carcinoma (LUSC)
#> datasets() (2 total):
#> TCGA.LUAD.sampleMap/LUAD_clinicalMatrix
#> TCGA.LUSC.sampleMap/LUSC_clinicalMatrix
XenaBrowse() opens the UCSC Xena page for the data in the default browser. By default, I only allow one page to be opened at a time, to avoid opening too many pages.
# The calls below open the browser
XenaBrowse(to_browse)
XenaBrowse(to_browse, type = "cohort")
# The calls below raise an error
XenaBrowse(to_browse2)
#> Error in XenaBrowse(to_browse2): This function limite 1 dataset to browse.
#> Set multiple to TRUE if you want to browse multiple links.
XenaBrowse(to_browse2, type = "cohort")
#> Error in XenaBrowse(to_browse2, type = "cohort"): This function limite 1 cohort to browse.
#> Set multiple to TRUE if you want to browse multiple links.
If you are sure you want to browse multiple pages, set multiple = TRUE:
XenaBrowse(to_browse2, multiple = TRUE)
XenaBrowse(to_browse2, type = "cohort", multiple = TRUE)
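More usage
The sections above cover the package's core functionality. For more usage, see the links below:
UCSCXenaTools detailed introduction - PDF
UCSCXenaTools API - PDF
I have also published a blog post on rOpenSci that explains how to use the package to download and clean data for survival analysis: UCSCXenaTools: Retrieve Gene Expression and Clinical Information from UCSC Xena for Survival Analysis
Resumable downloads
Recently, after discussion with the official developers, the UCSC Xena platform added support for resuming interrupted downloads, which makes downloading large datasets more reliable. The code below shows a download without resuming, followed by resumable downloads with curl and with wget.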
library(UCSCXenaTools)
xe = XenaGenerate(subset = XenaDatasets == "TcgaTargetGtex_expected_count")
xe
xq = XenaQuery(xe)
# Default: the download is not resumable
XenaDownload(xq, destdir = "~/test/", force = TRUE)
# Resume an interrupted download with curl
XenaDownload(xq, destdir = "~/test/", method = "curl", extra = "-C -", force = TRUE)
# Resume an interrupted download with wget
XenaDownload(xq, destdir = "~/test/", method = "wget", extra = "-c", force = TRUE)
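Whether curl or wget is available depends on the machine, so the choice between the two calls above could be made at run time. The helper below is a hypothetical sketch (pick_resume_args is not part of UCSCXenaTools). It only uses base R's Sys.which() to look for the command-line tools and returns the same method and extra values used above.

```r
# Hypothetical helper, not part of UCSCXenaTools:
# choose arguments for a resumable download based on the tools installed
pick_resume_args <- function() {
  if (nzchar(Sys.which("curl"))) {
    list(method = "curl", extra = "-C -")
  } else if (nzchar(Sys.which("wget"))) {
    list(method = "wget", extra = "-c")
  } else {
    list()  # neither tool found: use the default method, without resuming
  }
}

resume_args <- pick_resume_args()
str(resume_args)

# Possible use with the query object from above (needs network access):
# do.call(XenaDownload, c(list(xq, destdir = "~/test/", force = TRUE), resume_args))
```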
This article is a translation of the official documentation at https://cran.r-project.org/web/packages/UCSCXenaTools/vignettes/USCSXenaTools.html.