P值0.05界限为什么不好(纯R代码复现一篇数据挖掘文章)
感觉好像每隔几年学术界就掀起了一阵批判P值的讨论,正好我这里有个例子,分享给大家。
依然是安排给学徒完成的:菜鸟团(周一数据挖掘专栏)成果展
文献是:Cancer Manag Res. 2018 标题经过了更改:
最开始时 Tumour purity as a prognostic factor in colon cancer
然后是 Low tumor purity is associated with poor prognosis, heavy mutation burden, and intense immune phenotype in colon cancer
数据来源: GSE17536/17537 and GSE39582 (n=794), 还有TCGA的 (n=454).Colorectal cancer (CRC)
作者收集了 GSE17536/17537,GSE39582, and TCGA 共 1248病人,分析表明 Stage III and MMR-deficient
(dMMR) 病人具有比较低的肿瘤纯度,而较低的较低的肿瘤纯度是一个独立的病人预后风险因子。(大家去文章里面看P值)
以前的pan-cancer研究表明,在 idney renal clear cell carcinoma and lower-grade glioma 里面,高纯度的肿瘤预示着好的生存。
使用ESTIMATE算法看肿瘤纯度
首先下载表达矩阵,然后使用 算法处理它,得到每个样本的肿瘤纯度值,进行统计,如下:
把肿瘤纯度来工具不同的分类指标进行量化,如下:
对值得探索的因子绘森林图:
看肿瘤纯度对生存的影响:
使用CIBERSORT算法看细胞比例
作者 We applied CIBERSORT algorithm on RNA-Seq and microarray data
to estimate the relative proportion of 22 immune cells of leukocytes for each CC sample.
发现:immunotherapy-associated markers (PD-1, PD-L1, CTLA-4, LAG-3, and TIM-3) were high expressed in low purity CC. 散点图展示相关性,如下:
然后这些细胞比例和肿瘤纯度的相关性如下:
https://www.biorxiv.org/content/biorxiv/early/2018/02/12/263723.full.pdf
文章使用了三个结肠癌的geo数据集和tcga数据,将样本按照分期进行分类,利用estimate包计算出个样本的肿瘤纯度后绘图,查看分期与肿瘤纯度是否有显著关系。
# Step0 Before starting your project --------------------------------------
rm(list = objects( all = TRUE ))
options( stringsAsFactors = FALSE )
## 这是踩过的坑,最好设置环境为英文
Sys.setlocale("LC_ALL","English")
# Step1 download GEO dateset by GEOquery ----------------------------------
## 一共有三个数据,每个数据集都要下载
GSE_ID <- 'GSE17536'
GSE_ID <- 'GSE17537'
GSE_ID <- 'GSE39582'
GSE_file <- paste('./data/', GSE_ID, '.Rdata', sep = "", collapse = NULL)
library( "GEOquery" )
if (!file.exists( GSE_file )) {
## dir.create("./data/"), 我将所有的数据保存在data目录下面
download_GEO <- function(GSE_ID) {
options( download.file.method.GEOquery = 'libcurl' )
## dir.create("./raw_data/"),我将原始下载的数据保存在raw_data目录下面
gset <- getGEO( GSE_ID, getGPL = T, destdir = "./raw_data/" )
save( gset, file = GSE_file )
}
download_GEO( GSE_ID )
}
# Step2 Data loading ---------------------------------------------------
load( GSE_file )
class( gset )
length( gset )
exprSet <- gset[[1]]
str( exprSet, max.level = 2 )
## assayData <- exprSet@assayData$exprs[1:5,1:6]
assayData <- exprs(exprSet)
dim(assayData)
assayData[1:5, 1:6]
## phenoData <- exprSet@phenoData@data
phenoData <- pData(exprSet)
dim(phenoData)
phenoData[1:2,1:6]
# Step3 Grouping by special clinical information --------------------------
pheno_num <- c()
invisible(
lapply(1:ncol(phenoData),
function(col_num){
## Assume that the classification project is between 2 and 4
if (1 < dim(table(phenoData[,col_num])) &
dim(table(phenoData[,col_num])) < 7) {
pheno_num <<- append(pheno_num, col_num, after = length(pheno_num))
}
}
)
)
View(phenoData[, pheno_num])
names(phenoData[, pheno_num])
for (i in pheno_num) {
print(table(phenoData[,i]))
}
## choose special pheno
## 三个数据集样本信息的的分期列名是不一致的,根据实际情况做修改
group_list <- phenoData[, "tnm.stage:ch1"]
table(group_list)
names(group_list) <- rownames(phenoData)
## 利用group_list为表达矩阵排序,保证数据顺序一致
newAssayData <- assayData[, names(group_list)]
dim(newAssayData)
newAssayData[1:5, 1:6]
head(group_list)
## GSE39582数据集中存在非,1,2,3,4期的分类,需要把这几个样本剔除,
## group_list我在这里没有处理,后面会处理掉
newAssayData = newAssayData[ , names(group_list)[!grepl( "0", group_list )]]
newAssayData = newAssayData[ , names(group_list[!grepl( "0", group_list )])[!grepl( 'N/A', group_list[!grepl( "0", group_list )])]]
dim(newAssayData)
newAssayData[1:5, 1:6]
if (!file.exists( "./data/AssayData_by_group.Rdata" )) {
save( newAssayData, group_list, file = "./data/AssayData_by_group.Rdata" )
}
# Step4 Filt gene ---------------------------------------------------------
load( './data/AssayData_by_group.Rdata' )
exprSet <- gset[[1]]
## Prioritize data uploaded by authors
featureData <- exprSet@featureData@data
dim( featureData )
colnames( featureData )
View( featureData )
## 表达矩阵仅保留最大表达量的探针,这是在探针维度上进行了筛选
featureData$max <- apply(newAssayData, 1, max)
featureData <- featureData[order(featureData$`ENTREZ_GENE_ID`,
featureData$max,
decreasing = T), ]
featureData <- featureData[featureData$`ENTREZ_GENE_ID` != '', ]
dim( featureData )
featureData <- featureData[!duplicated(featureData$`ENTREZ_GENE_ID`), ]
dim( featureData )
## 数据注释里面存在一个探针对应多个基因的情况,我选择保留所有的基因与探针的对应关系
library("plyr")
split_gene <- strsplit( as.character( featureData[, 11] ), " /// ")
probe2gene <- mapply( cbind, featureData[,1], split_gene )
probe2gene <- ldply(probe2gene, data.frame)
ID2gene <- probe2gene[,2:3]
dim(ID2gene)
AssayData <- newAssayData[ID2gene$X1, ]
dim(AssayData)
AssayData[1:5, 1:6]
## 但是又有多个相同的基因出现了,我们再对基因排序,任何取表达量最大的探针
ID2gene$max <- apply(AssayData, 1, max)
ID2gene <- ID2gene[order(ID2gene$X2,
ID2gene$max,
decreasing = T), ]
ID2gene <- ID2gene[!duplicated(ID2gene$X2), ]
dim(ID2gene)
AssayData <- AssayData[ID2gene$X1, ]
dim(AssayData)
AssayData[1:5, 1:6]
##
## 到这里表达矩阵的处理就结束了,代码比较繁杂,也可以选择,直接舍弃“///”列,或者使用R包去注释数据
rownames(AssayData) <- ID2gene$X2
# AssayData = log2(AssayData)
# dim(AssayData)
# AssayData[1:5,1:6]
# group_list <- unname(group_list)
## 数据集较多,我把每组数据都分别保留
AssayData_GSE17536 <- AssayData
group_GSE17536 <- group_list
AssayData_GSE17537 <- AssayData
group_GSE17537 <- group_list
AssayData_GSE39582 <- AssayData
group_GSE39582 <- group_list
save(AssayData_GSE17536, group_GSE17536,
AssayData_GSE17537, group_GSE17537,
AssayData_GSE39582, group_GSE39582,
file = './data/final_AssayData.Rdata')
# Step0 Before starting your project --------------------------------------
## Remove everything in the working environment, not including loaded libraries.
rm(list = objects( all = TRUE ))
options( stringsAsFactors = FALSE )
## Change the library location of the packages
## Even your updated your R, you can still use your packages.
.libPaths( c( "G:/R-packages",
"C:/Program Files/R/R-3.5.2/library") )
.libPaths()
# Step1 Install packages --------------------------------------------------
## 这个R包需要从rforge上下载
library(utils)
rforge <- "http://r-forge.r-project.org"
install.packages("estimate", repos = rforge, dependencies = TRUE)
# Step2 calculate tumor purity --------------------------------------------
load("./data/final_AssayData.Rdata")
library(estimate)
dim(AssayData_GSE17536)
dim(AssayData_GSE17537)
dim(AssayData_GSE39582)
AssayData_GSE17536 <- as.data.frame(AssayData_GSE17536)
AssayData_GSE17537 <- as.data.frame(AssayData_GSE17537)
AssayData_GSE39582 <- as.data.frame(AssayData_GSE39582)
AssayData_GSE17536$gene <- rownames(AssayData_GSE17536)
AssayData_GSE17537$gene <- rownames(AssayData_GSE17537)
AssayData_GSE39582$gene <- rownames(AssayData_GSE39582)
## 将三个数据集合并
all_data <- merge(AssayData_GSE17536, AssayData_GSE17537, by = "gene")
all_data <- merge(all_data, AssayData_GSE39582, by = "gene")
rownames(all_data) <- all_data[, 1]
all_data <- all_data[, -1]
## 这里处理了group_list
group_GSE39582 <- group_GSE39582[!grepl( "0", group_GSE39582 )]
group_GSE39582 <- group_GSE39582[!grepl( "N/A", group_GSE39582 )]
group_GSE39582[group_GSE39582 == "tnm.stage: 1"] <- '1'
group_GSE39582[group_GSE39582 == "tnm.stage: 2"] <- '2'
group_GSE39582[group_GSE39582 == "tnm.stage: 3"] <- '3'
group_GSE39582[group_GSE39582 == "tnm.stage: 4"] <- '4'
## 下面拿到的group_list和all_data就是最终的
group_list <- c(group_GSE17536, group_GSE17537, group_GSE39582)
all_data <- all_data[, names(group_list)]
## 这个包比较神奇,只接收文件,所以要把数据在写到文档里面
write.table(as.data.frame(all_data), './data/varianCancerExpr.txt',
quote = FALSE, sep = "\t")
## 就两步,就完成了肿瘤纯度的计算
filterCommonGenes(input.f = "./data/varianCancerExpr.txt",
output.f = "./data/genes.gct", id = "GeneSymbol")
estimateScore("./data/genes.gct", "./data/estimate_score.gct",
platform = "affymetrix")
## 在把数据读进来,就可以绘图了
estimate_score <- read.table("./data/estimate_score.gct",
sep = "\t", skip = 2,
header = T)
rownames(estimate_score) <- estimate_score[, 1]
estimate_score <- as.data.frame(t(estimate_score[, -c(1,2)]))
library(ggpubr)
gghistogram(estimate_score, x = "StromalScore",
add = "mean", bins = 50)
gghistogram(estimate_score, x = "ImmuneScore",
add = "mean", bins = 50)
gghistogram(estimate_score, x = "ESTIMATEScore",
add = "mean", bins = 50)
gghistogram(estimate_score, x = "TumorPurity",
add = "mean", bins = 50)
estimate_score <- estimate_score[rownames(phenoData),]
estimate_score$sex <- phenoData[,'gender:ch1']
estimate_score$grade <- phenoData[,'grade:ch1']
estimate_score$survial <- phenoData[,'dss_event (disease specific survival; death from cancer):ch1']
estimate_score <- estimate_score[names(group_list),]
estimate_score$stage <- as.character(group_list)
ggboxplot(estimate_score, x = "sex", y = "TumorPurity",
color = "sex", palette = c("#00AFBB", "#E7B800"),
add = "jitter", shape = "sex") + stat_compare_means(label.y = 0.4)
ggboxplot(estimate_score, x = "grade", y = "TumorPurity",
color = "grade", palette = c("#00AFBB", "#E7B800", "#FC4E07"),
add = "jitter", shape = "grade") + stat_compare_means(label.y = 0.4)
ggboxplot(estimate_score, x = "survial", y = "TumorPurity",
color = "survial", palette = c("#00AFBB", "#E7B800"),
add = "jitter", shape = "survial") + stat_compare_means(label.y = 0.4)
## 我两两分期比较了一下,感觉结果不理想
my_comparisons <- list( c("1", "2"), c("2", "3"), c("3", "4"),
c("1", "3"), c("2", "4"), c("1", "4"))
ggviolin(estimate_score, x = "stage", y = "TumorPurity", fill = "stage",
palette = c("#00AFBB", "#E7B800", "#FC4E07", "#c00000"),
add = "boxplot", add.params = list(fill = "white")) +
stat_compare_means(comparisons = my_comparisons) +
stat_compare_means(label.y = 1.6)
这个样子就很诡异,O(∩_∩)O哈哈~