使用MethylMix包识别甲基化驱动的癌症基因
真正做生信工具(R包,软件,网页)的很少,但是需要用工具的却超级多,比如:GEPIA2详解(中国智造-肿瘤数据库),主要是看表达量和生存,引用就是(1659+130),再比如GDCRNATools这个R包,早就了ceRNA等数据挖掘套路。我在:惊!3个同样的数据挖掘策略居然同时发表,提到过大家耳熟能详的策略,有:
差异分析+PPI网络+hub基因 WGCNA+hub基因 诊断模型构建 预后模型构建 肿瘤免疫,CIBERSOFT计算的LM22比例分组,以及ESTIMATE算法等等 m6A等生物学功能基因集 药敏信息
(mRNA,lncRNA,miRNA,甲基化,蛋白)均可走上述流程,也就是说33种癌症乘以5种亚型,乘以5种分子,乘以15个策略就已经是过万篇数据挖掘课题了,而且你仔细搜索一下就发现,真的是已经有了过万篇数据挖掘文章了哦!
人民群众的模仿能力是远超出我的想象,这里分享一个策略给大家。最近看到一篇文章,做的是甲基化和表达量联合分析,核心数据集如下所示:
完整的路线图如下:
文章链接是:https://www.frontiersin.org/articles/10.3389/fgene.2020.00294/full
我有预感,这样的策略,马上也会应用于33种癌症乘以5种亚型,就是一百多篇文章出来啦。不过,能模仿下来,首先就有一个R语言的拦路虎哦!再怎么强调生物信息学数据分析学习过程的计算机基础知识的打磨都不为过,我把它粗略的分成基于R语言的统计可视化,以及基于Linux的NGS数据处理:
把R的知识点路线图搞定,如下:
了解常量和变量概念 加减乘除等运算(计算器) 多种数据类型(数值,字符,逻辑,因子) 多种数据结构(向量,矩阵,数组,数据框,列表) 文件读取和写出 简单统计可视化 无限量函数学习
如果你学会了R,那么很容易就可以看懂MethylMix包的使用方法啦。去探索你感兴趣的癌症吧,当然了,又不是一定需要癌症数据,只要是有转录组和甲基化数据的项目,都可以使用起来。这个时候学习起来也不晚,虽然你错过了这个数据挖掘策略,但是未来十年仍然是有几十上百个好用的策略,到时候就需要你的R语言水平啦!可以考虑我们生信技能树的官方学习班(下一期是8月31号开始哈):
数据挖掘学习班第5期(线上直播3周,马拉松式陪伴,带你入门),原价4800的数据挖掘全套课程, 疫情期间半价即可抢购。 生信爆款入门-第7期(线上直播4周,马拉松式陪伴,带你入门),原价9600的生信入门全套课程,疫情期间3.3折即可抢购。
MethylMix说明书
简介
用途:识别甲基化驱动的癌症基因,可以识别癌症中高甲基化和低甲基化基因。
数据集
以下是包含在R包中的数据集。
GEcancer
数据集构成:胶质母细胞瘤患者的癌症基因表达矩阵,14行基因和251列样本。强调基因要放在行,样本要放在列。 参考链接:Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008 Oct 23; 455(7216):1061-8. doi: 10.1038/nature07385. 用法: data(GEcancer)
METcancer
数据集构成:胶质母细胞瘤患者的癌细胞甲基化矩阵,14行基因和251列样本。 参考链接:Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008 Oct 23; 455(7216):1061-8. doi: 10.1038/nature07385. 用法: data(METcancer)
METnormal
数据集构成:胶质母细胞瘤患者的正常细胞甲基化矩阵,14行基因和251列样本。 参考链接:Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008 Oct 23; 455(7216):1061-8. doi: 10.1038/nature07385. 用法: data(METnormal)
ProbeAnnotation
数据集构成:注释文件。把CpG位点注释成基因
SNPprobes
数据集构成:含有SNP探针的数据集
函数
ClusterProbes
用途:把每个探针注释到一个基因。一个基因有多个CpG位点,对这些CpG位点聚类,如果同时提供正常组织的样本,那么只对同时出现在正常样本和癌症样本中的探针进行操作。去除了SNP探针。用法:
ClusterProbes(MET_Cancer, MET_Normal, CorThreshold = 0.4)
### MET_Cancer: 癌症组织样本
### MET_Normal: 正常组织样本
### CorThreshold: 划分聚类的相关性阈值
输出结果:
聚类之后的数据集 探针和基因的对应关系
Download_DNAmethylation
用途:从TCGA下载DNA甲基化数据用法:
Download_DNAmethylation(CancerSite, TargetDirectory, downloadData = TRUE)
### CancerSite:各种癌症的缩写名称,例如OV,THCA等
### TargetDirectory:下载文件存放位置
### downloadData (默认:TRUE). 如果是false, 则返回下载链接
输出结果:
Download_GeneExpression
用途:从TCGA下载RNAseq数据用法:
Download_GeneExpression(CancerSite, TargetDirectory, downloadData = TRUE)
### CancerSite:同上
### TargetDirectory:下载文件存放位置
### downloadData (默认:TRUE). 如果是false, 则返回下载链接
输出结果:
Preprocess_DNAmethylation
用途:预处理从TCGA下载DNA甲基化数据用法:
Preprocess_DNAmethylation(CancerSite, METdirectories,
MissingValueThreshold = 0.2)
### CancerSite: 同上
### METdirectories: 由Download_DNAmethylation返回的对象。由下载数据路径组成的向量。
### MissingValueThreshold: 当存在缺失值时,移除样本或基因的阈值
输出结果:预处理之后的癌症组织和正常组织矩阵
Preprocess_GeneExpression
用途:预处理从TCGA下载RNAseq数据用法:
Preprocess_GeneExpression(CancerSite, MAdirectories,
MissingValueThresholdGene = 0.3, MissingValueThresholdSample = 0.1)
### CancerSite: 同上
### MAdirectories: 由Download_DNAmethylation返回的对象。由下载数据路径组成的向量。
### MissingValueThresholdGene:默认值0.3,当存在缺失值时,移除样本或基因的阈值。
### MissingValueThresholdSample:默认值0.1,当存在缺失值时,移除样本或基因的阈值。
输出结果:预处理之后的癌症组织和正常组织矩阵
GetData
用途:打包了Download_DNAmethylation
,Download_GeneExpression
,Preprocess_DNAmethylation
,Preprocess_GeneExpression
, ClusterProbes
函数。
用法:
GetData(cancerSite, targetDirectory)
### cancerSite:同上
### targetDirectory:输出文件存放位置
输出结果:
gdac: 存放直接从TCGA下载的数据的文件夹 MET_CancerSite_Processed.rds: 处理过的甲基化矩阵(未聚类,未注释) GE_CancerSite_Processed.rds: 处理过的基因表达矩阵 data_CancerSite.rds: 甲基化矩阵(已聚类,已从探针注释到基因)和基因表达矩阵
MethylMix
用途:对DNA甲基化数据建立mixture model
用法:
MethylMix(METcancer, GEcancer, METnormal = NULL, listOfGenes = NULL, filter = TRUE, NoNormalMode = FALSE, OutputRoot = "")
### 三个数据集:METcancer, GEcancer,METnormal
### listOfGenes:和rownames(METcancer)一致
### filter:默认值TRUE 选择甲基化和基因表达呈负相关的基因
### NoNormalMode:默认值FALSE,在癌症组织中的甲基化状态不和正常组织比较
### OutputRoot:存储 MethylMix results object的路径
输出结果:
MethylationDrivers:符合以下两个条件的基因被挑选出来:
转录预测基因 差异甲基化 NrComponents:driver基因的甲基化状态数量
MixtureStates:每个driver基因的 Differential Methylation values(DM-values)
MethylationStates: DM-values矩阵,driver genes作为行名sample作为列名
Classifications:Matrix with integers indicating to which mixture component each cancer sample was assigned to, for each gene.
Models:Beta mixture model parameters for each driver gene.
MethylMix_ModelGeneExpression
用途:对DNA甲基化数据建立mixture model
用法:
MethylMix_ModelGeneExpression(METcancer, GEcancer, CovariateData = NULL)
### 两个数据集:METcancer和GEcancer
### CovariateData:一般不需要设置,如果样本来源于不同的组织类型,则可以设置。
输出结果:输出甲基化和基因表达呈显著负相关的的基因名
MethylMix_PlotModel
用途:根据MethylMix函数结果画图
用法:
MethylMix_PlotModel(GeneName, MixtureModelResults, METcancer, GEcancer = NULL, METnormal = NULL)
#GeneName: Name of the gene for which to create a MethylMix plot.
#MixtureModelResults:MethylMix函数返回的list
### 三个数据集:METcancer,GEcancer,METnormal
输出结果:两张图:
MethylMix plots:a histogram of the methylation data (MixtureModelPlot) scatterplot between DNA methylation and gene expression
MethylMix_Predict
用途:Given a new data set with methylation data, this function predicts the mixture component for each new sample and driver gene. Predictions are based on posterior probabilities calculated with MethylMix’x fitted mixture model
用法:
MethylMix_Predict(newBetaValuesMatrix, MethylMixResult)
### newBetaValuesMatrix:新矩阵,genes/cpg sites in rows, samples in columns. genes/cpg sites的名字要和MethylMix函数中用到的矩阵中的genes/cpg sites的名字保持一致,但是genes/cpg sites数量可以不一致。
### MethylMixResult:MethylMix函数输出结果
输出结果:输出一个预测矩阵,driver genes in rows, new samples in columns
predictOneGene
用途:Given a new vector of beta values, this function calculates a matrix with posterior prob of belonging to each mixture commponent (columns) for each new beta value (rows), and return the number of the mixture component with highest posterior probabilit
用法:
predictOneGene(newVector, mixtureModel)
### newVector:vector with new beta values
### mixtureModel:beta mixture model object for the gene being evaluated.
输出结果:输出一个预测矩阵,driver genes in rows, new samples in columns
代码事例
### Optional register cluster to run in parallel
library(doParallel)
cl <- makeCluster(5)
registerDoParallel(cl)
### Methylation data for ovarian cancer
cancerSite <- "OV"
targetDirectory <- paste0(getwd(), "/")
### GetData可以替代下面的代码
GetData(cancerSite, targetDirectory)
### Downloading methylation data
METdirectories <- Download_DNAmethylation(cancerSite, targetDirectory, TRUE)
### Processing methylation data
METProcessedData <- Preprocess_DNAmethylation(cancerSite, METdirectories)
### Saving methylation processed data
saveRDS(METProcessedData, file = paste0(targetDirectory, "MET_", cancerSite, "_Processed.rds"))
### Clustering methylation data
res <- ClusterProbes(METProcessedData[[1]], METProcessedData[[2]])
### Saving methylation clustered data
toSave <- list(METcancer = res[[1]], METnormal = res[[2]], ProbeMapping = res$ProbeMapping)
saveRDS(toSave, file = paste0(targetDirectory, "MET_", cancerSite, "_Clustered.rds"))
stopCluster(cl)
### load the three data sets needed for MethylMix
data(METcancer)
data(METnormal)
data(GEcancer)
### run MethylMix on a small set of example data
MethylMixResults <- MethylMix(METcancer, GEcancer, METnormal)
### run in parallel
library(doParallel)
cl <- makeCluster(5)
registerDoParallel(cl)
MethylMixResults <- MethylMix(METcancer, GEcancer, METnormal)
stopCluster(cl)
### load data sets
data(METcancer)
data(GEcancer)
### model gene expression
MethylMixResults <- MethylMix_ModelGeneExpression(METcancer, GEcancer)
### Load the three data sets needed for MethylMix
data(METcancer)
data(METnormal)
data(GEcancer)
### Run methylmix on a small set of example data
MethylMixResults <- MethylMix(METcancer, GEcancer, METnormal)
### Plot the most famous methylated gene for glioblastoma
MethylMix_PlotModel("MGMT", MethylMixResults, METcancer)
### Plot MGMT also with its normal methylation variation
MethylMix_PlotModel("MGMT", MethylMixResults, METcancer, METnormal = METnormal)
### Plot a MethylMix model for another gene
MethylMix_PlotModel("ZNF217", MethylMixResults, METcancer, METnormal = METnormal)
### Also plot the inverse correlation with gene expression (creates two separate plots)
MethylMix_PlotModel("MGMT", MethylMixResults, METcancer, GEcancer, METnormal)
### Plot all functional and differential genes
for (gene in MethylMixResults$MethylationDrivers) {
MethylMix_PlotModel(gene, MethylMixResults, METcancer, METnormal = METnormal)
}
### load the three data sets needed for MethylMix
data(METcancer)
data(METnormal)
data(GEcancer)
### run MethylMix on a small set of example data
MethylMixResults <- MethylMix(METcancer, GEcancer, METnormal)
### toy example new data, of same dimension of original METcancer data
newMETData <- matrix(runif(length(METcancer)), nrow = nrow(METcancer))
rownames(newMETData) <- rownames(METcancer)
colnames(newMETData) <- paste0("sample", 1:ncol(METcancer))
predictions <- MethylMix_Predict(newMETData, MethylMixResults)