TCGA的28篇教程- 指定癌症查看感兴趣基因的表达量
m长期更新列表:
使用R语言的cgdsr包获取TCGA数据(cBioPortal)TCGA的28篇教程- 使用R语言的RTCGA包获取TCGA数据 (离线打包版本)TCGA的28篇教程- 使用R语言的RTCGAToolbox包获取TCGA数据 (FireBrowse portal)TCGA的28篇教程- 批量下载TCGA所有数据 ( UCSC的 XENA)TCGA的28篇教程- 数据下载就到此为止吧
- 文章来源
- 那么首先需要得到上图的基因列表
- 使用下载这些基因在GBM的芯片数据的表达量。
- 下载感兴趣基因的表达矩阵
- 画散点图
- 相关系数的总结
文章来源
这里我重现一篇CELL文章的某个分析要点,具体见我博客的解读:http://www.bio-info-trainee.com/3072.html
在TCGA的GBM的芯片表达数据里面可以看到 以下正相关关系:
NCR2 expression
upregulation of PDGF-DD-induced NK cell cytokine genes (芯片数据)
downregulation of tumor cell-cycle genes
greater survival
TCGA GBM mRNA gene expression data obtained using the Affymetrix HT Human Genome U133a microarray platform (n = 539 patients)was downloaded through the UCSC data portal (https://xenabrowser.net) and matched to the gene expression data (Figures 4C and 4D). 7 of 9 cytokine genes (Figure 4C) and 27 of 34 cell cycle genes (Figure 4D) were matched with the GBM cohort.
分别如下图
NCR2基因表达与细胞因子相关基因表达正相关
NCR2表达与细胞周期相关基因表达负相关
NCR2高表达的保护作用
那么首先需要得到上图的基因列表
在KEGG官网可以查看细胞周期基因列表:http://www.genome.jp/kegg-bin/show_pathway?hsa04110 , 而细胞因子是:http://www.genome.jp/kegg-bin/show_pathway?hsa04060
这里可以用R语言本身的包KEGG.db
来获取.
library(KEGG.db)
ls("package:KEGG.db")## [1] "KEGG" "KEGGENZYMEID2GO" "KEGGEXTID2PATHID"
## [4] "KEGGGO2ENZYMEID" "KEGGMAPCOUNTS" "KEGGPATHID2EXTID"
## [7] "KEGGPATHID2NAME" "KEGGPATHNAME2ID" "KEGG_dbInfo"
## [10] "KEGG_dbconn" "KEGG_dbfile" "KEGG_dbschema"cellcycle_genes=KEGGPATHID2EXTID[['hsa04110']]
cytokine_genes=KEGGPATHID2EXTID[['hsa04060']]
使用下载这些基因在GBM的芯片数据的表达量。
library(cgdsr)
library(DT)
# Get list of cancer studies at server
## 获取有哪些数据集
# save(all_TCGA_studies,all_dataset,all_tables,cellcycle_expr,cytokine_expr,NCR2_expr,file = 'GBM_microarray_TCGA.Rdata')
load(file = 'GBM_microarray_TCGA.Rdata')
mycgds <- CGDS("http://www.cbioportal.org/public-portal/")
#all_TCGA_studies <- getCancerStudies(mycgds)
all_TCGA_studies[grepl('gbm',all_TCGA_studies$cancer_study_id),]## cancer_study_id name
## 55 gbm_tcga_pub2013 Glioblastoma (TCGA, Cell 2013)
## 56 gbm_tcga_pub Glioblastoma (TCGA, Nature 2008)
## 57 gbm_tcga Glioblastoma Multiforme (TCGA, Provisional)
## 97 lgggbm_tcga_pub Merged Cohort of LGG and GBM (TCGA, Cell 2016)
## description
## 55 <a href="http://cancergenome.nih.gov/">The Cancer Genome Atlas (TCGA)</a> Glioblastoma project.<br> <a href="http://tcga-data.nci.nih.gov/tcga/">Raw data via the TCGA Data Portal</a>.
## 56 <a href="http://cancergenome.nih.gov/">The Cancer Genome Atlas (TCGA)</a> Glioblastoma project. 206 primary glioblastoma samples.<br> <a href="https://tcga-data.nci.nih.gov/docs/publications/gbm_2008/">Raw data via the TCGA Data Portal</a>.
## 57 TCGA Glioblastoma Multiforme; raw data at the <A HREF="https://tcga-data.nci.nih.gov/">NCI</A>.
## 97 Whole-exome sequencing from TCGA LGG and GBM cases
带有GBM的项目有4个。
下载感兴趣基因的表达矩阵
很明显那篇文章作者提到了Affymetrix HT Human Genome U133a microarray platform (n = 539 patients),所以用的是gbm_tcga那个数据集,用下面的代码下载:
## 获取在 "gbm_tcga" 数据集中有哪些表格(每个表格都是一个样本列表)
#all_tables <- getCaseLists(mycgds, "gbm_tcga")
DT::datatable(all_tables[,1:4],
extensions = 'FixedColumns',
options = list(
#dom = 't',
scrollX = TRUE,
fixedColumns = TRUE
))## 而后获取可以下载哪几种数据,一般是mutation,CNV和表达量数据
#all_dataset <- getGeneticProfiles(mycgds, "gbm_tcga")
DT::datatable(all_dataset,
extensions = 'FixedColumns',
options = list(
#dom = 't',
scrollX = TRUE,
fixedColumns = TRUE
))all_dataset[4,]
## genetic_profile_id genetic_profile_name
## 4 gbm_tcga_mrna_U133 mRNA expression (U133 microarray only)
## genetic_profile_description
## 4 mRNA expression data from the Affymetrix U133 microarray.
## cancer_study_id genetic_alteration_type show_profile_in_analysis_tab
## 4 740 MRNA_EXPRESSION falseall_tables[9,1:4]
## case_list_id case_list_name
## 9 gbm_tcga_mrna_U133 Tumor Samples with mRNA data (U133 microarray only)
## case_list_description cancer_study_id
## 9 All samples with mRNA expression data (528 samples) 740my_dataset <- 'gbm_tcga_mrna_U133'
my_table <- "gbm_tcga_mrna_U133"
#cellcycle_expr <- getProfileData(mycgds, cellcycle_genes, my_dataset, my_table)
dim(cellcycle_expr)## [1] 528 124
#cytokine_expr <- getProfileData(mycgds, cytokine_genes, my_dataset, my_table)
dim(cytokine_expr)## [1] 528 265
#NCR2_expr <- getProfileData(mycgds, 'NCR2', my_dataset, my_table)
画散点图
cor_plot <- function(x,y){
#x=NCR2_expr$NCR2
#y=cellcycle_expr$CCNB1
plot(x,y, xlab = 'NCR2', ylab = 'gene')
model = lm(y ~ x)
summary(model)
int = model$coefficient["(Intercept)"]
slope =model$coefficient["x"]
abline(int, slope,
lty=1, lwd=2, col="red")
r= format(cor(x,y),digits = 4)
p= format(cor.test(x,y)$p.value,digits = 4)
title(main = paste0('p value=',p),
sub = paste0('correlation=',r))
}
cor_plot(x=NCR2_expr$NCR2,y=cellcycle_expr$CCNB1)
cor_plot(x=NCR2_expr$NCR2,y=cellcycle_expr$BUB1B)
cor_plot(x=NCR2_expr$NCR2,y=cytokine_expr$CCL1)
cor_plot(x=NCR2_expr$NCR2,y=cytokine_expr$CCL4)
这些图跟作者的差不多了,但是很明显作者的展现方式并不是最好的。因为这样的散点图太多了。 其实只需要p值以及相关性系数即可。
相关系数的总结
boxplot(as.numeric(cor(NCR2_expr$NCR2,cellcycle_expr)))
可以看到,NCR2这个基因跟整个细胞周期的基因整体表达量都是负相关的。
boxplot(as.numeric(cor(NCR2_expr$NCR2,cytokine_expr)))
至于跟细胞因子的关系嘛,就跨度很大,有正有负,首先是因为细胞因子涉及到的基因比较多。
TCGA不只是套路,更多的是理解这个宏大计划的背景,挖掘数据背后的价值,尤其是应该结合自己的课题实际!