ChAMP 包分析450K甲基化芯片数据(一站式)
早在我们举办甲基化芯片专题学习的时候,见:
就有非常棒的一站式教程投稿,也因此我结识了优秀的六六,以及其教程大力推荐的R包作者,见:
但是当时的教程题目并没有着重宣传该R包,恰好技能树联盟新成员也总结了自己的经验,成员介绍见:
那么我们就一起学习其优秀的总结笔记吧!
ChAMP PACKAGE
▛ 用来分析illuminate甲基化数据的包 (EPIC and 450k) ▟
⊙ 不同格式的数据导入
| .idat files
| a beta-valued matrix
⊙ Quality Control plots
⊙ Type-2 探针的矫正方法
| SWAN1
| Peak Based Correction (PBC)2
| BMIQ3 (the default choice)
⊙ Functional Normalization function|the minfi package
⊙ 查看批次效应的方法|singular value decomposition (SVD) method,for correction of multiple batch effects the ComBat method.
⊙ 矫正cell-type heterogeneity|RefbaseEWAS
⊙ 推断CNV变异
⊙ Differentially Methylated Regions (DMR)
| Lasso method
| Bumphunter
| DMRcate
⊙ find Differentially Methylated Blocks
⊙ Gene Set Enrichment Analysis (GSEA)
⊙ Infer gene modules in user-specified gene-networks that exhibit differential methylation between phenotypes (整合FEM package)
⊙ 其他分析甲基化数据的包
| IMA
| minfi
| methylumi
| RnBeads
| wateRmelon
1、安装ChAMP包
source("https://bioconductor.org/biocLite.R")
biocLite("ChAMP")
source("http://bioconductor.org/biocLite.R")
biocLite(c("minfi","ChAMPdata","Illumina450ProbeVariants.db","sva","IlluminaHumanMethylation450kmanifest","limma"))
biocLite("YourErrorPackage")
library("ChAMP")
如果报错:
错误: package or namespace load failed for 'ChAMP' in inDL(x, as.logical(local), as.logical(now), ...):
无法载入共享目标对象'D:/work/R-3.4.3/library/mvtnorm/libs/x64/mvtnorm.dll’::
已达到了DLL数目的上限...
解决方案:
设置环境变量R_MAX_NUM_DLLS, 不管是什么操作系统,R语言对应的环境变量都可以在.Renviron文件中进行设置。
这个文件可以保存在任意目录下,文件中就一句话,内容如下:
R_MAX_NUM_DLLS=500
500表示允许的最多的dll文件数目,设置好之后,重新启动R, 然后输入如下命令:
normalizePath("d:/Documents/.Renviron", mustWork = FALSE)
第一个参数为.Renviron文件的真实路径,然后在加载ChAMP包就可以了。
2、用测试数据跑流程
测试数据包括 450k(.idat) 和 850k(simulated EPIC data) 两个数据集
testDir=system.file("extdata",package="ChAMPdata")
myLoad <- champ.load(testDir,arraytype="450K")
data(EPICSimData)
3、ChAMP Pipeline
- 绿色发光线表示主要的分析步骤
- 灰色线条为可选的步骤
- 黑点表示准备好的甲基化数据
- 蓝色表示准备工作:Loading, Normalization, Quality Control checks
- 红色表示产生分析结果:Differentially Methylated Positions (DMPs), Differentially Methylated Regions (DMRs), Differentially methylated Blocks, EpiMod (a method for detecting differentially methylated gene modules derived from FEM package), Pathway Enrichment Results etc.
- 黄色表示交互界面画图
450K步骤
Full Pipeline
一步跑完结果,但是可能报错
champ.process(directory = testDir)
一步一步跑
myLoad <- cham.load(testDir)
# Or you may separate about code as champ.import(testDir) + champ.filter()
CpG.GUI()
champ.QC() # Alternatively: QC.GUI()
myNorm <- champ.norm()
champ.SVD()
# If Batch detected, run champ.runCombat() here.
myDMP <- champ.DMP()
DMP.GUI()
myDMR <- champ.DMR()
DMR.GUI()
myBlock <- champ.Block()
Block.GUI()
myGSEA <- champ.GSEA()
myEpiMod <- champ.EpiMod()
myCNA <- champ.CNA()
# If DataSet is Blood samples, run champ.refbase() here.
myRefbase <- champ.refbase()
EPIC pipelinedata(EPICSimData)
CpG.GUI(arraytype="EPIC")
champ.QC()
myNorm <- champ.norm(arraytype="EPIC")
champ.SVD()
myDMP <- champ.DMP(arraytype="EPIC")
DMP.GUI()
myDMR <- champ.DMR()
DMR.GUI()
myDMR <- champ.DMR(arraytype="EPIC")
DMR.GUI(arraytype="EPIC")
myBlock <- champ.Block(arraytype="EPIC")
Block.GUI(arraytype="EPIC")
myGSEA <- champ.GSEA(arraytype="EPIC")
myEpiMod <- champ.EpiMod(arraytype="EPIC")
最多在8G内存电脑上可以跑200个样本,如果在服务器上多核跑,需要命令
library("doParallel")
detectCores()
ChAMP pipeline
1. Loading Data
.idat files 为原始芯片文件,包括pd file (Sample_Sheet.csv)文件(表型,编号等)
myLoad$pd
2. Filtering Data
ChAMP提供了 champ.filter() 函数,可以输入 (beta, M, Meth, UnMeth, intensity)格式的文件并进行过滤质控。 新版本的ChAMP包中champ.load()函数已经包含了此功能。
champ.filter() 函数有个参数autoimpute,可以填补或保留由过滤导致的NA空缺值。
如果输入多个数据框进行过滤,他们的行名和列名必须一致,否则champ.filter()认为是不同来源的数据,将停止过滤。
低质量的样本(有较多的探针没有信号)将会被过滤掉,Sample_Name 要与pd file中的列名称一致。
imputation需要detection P matrix, beta or M matrix信息,且ProbeCutoff 不能等于0,这个参数控制探针的NA ratio,来决定是否填补。
如果想用beadcount信息进行过滤,champ.import() 函数会返回beads信息。
使用方法为:
myImport <- champ.import(testDir)
myLoad <- champ.filter()
Section 1: Read PD Files Start: Reading CSV File
Section 2: Read IDAT files Start:Extract Mean value for Green and Red Channel Success
Your Red Green Channel contains 622399 probes.
Section 3: Use Annotation Start:Reading 450K Annotation,there are 613 control probes in Annotation,Generating Meth and UnMeth Matrix,485512 Meth probes
Generating beta Matrix
Generating M Matrix
Generating intensity Matrix
Calculating Detect P value
Counting Beads
You may want to process champ.filter() next,This function is provided for user need to do filtering on some beta (or M) matrix, which contained most filtering system in champ.load except beadcount.
Section 1: Check Input Start:You have inputed beta,intensity for Analysis.
Checking Finished :filterDetP,filterBeads,filterMultiHit,filterSNPs,filterNoCG,filterXY would be done on beta,intensity.
You also provided :detP,beadcount .
Section 2: Filtering Start
The fraction of failed positions per sample
Failed CpG Fraction.
C1 0.0013429122
C2 0.0022162171
C3 0.0003563249
C4 0.0002842360
T1 0.0003831007
T2 0.0011946152
T3 0.0014953286
T4 0.0015447610
Filtering probes with a detection p-value above 0.01.
Removing 2728 probes.
Filtering BeadCount Start
Filtering probes with a beadcount <3 in at least 5% of samples.
Removing 9291 probes
Filtering NoCG Start
Only Keep CpGs, removing 2959 probes from the analysis.
Filtering SNPs Start
Using general 450K SNP list for filtering.
Filtering probes with SNPs as identified in Zhou's Nucleic Acids Research Paper 2016.
Removing 49231 probes from the analysis.
Filtering MultiHit Start
Filtering probes that align to multiple locations as identified in Nordlund et al
Removing 7003 probes from the analysis.
Filtering XY Start
Filtering probes located on X,Y chromosome, removing 9917 probes from the analysis.
Updating PD file
Fixing Outliers Start
Replacing all value smaller/equal to 0 with smallest positive value.
Replacing all value greater/equal to 1 with largest value below 1..
过滤步骤为:
detection p-value (< 0.01)。这个值储存在.idat文件中,champ.import()函数读入这个值并形成数据框。p< 0.01的探针认为实验失败。过滤过程为:样本探针失败率阈值=0.1,再在剩下的样本中过滤探针。参数SampleCutoff 和 ProbeCutoff控制这两个阈值。
ChAMP will filter out probes with <3 beads ( filterBeads 参数控制) in at least 5% (beadCutoff 参数控制)of samples per probe.
默认过滤non-CpG probes
by default ChAMP will filter all SNP-related probe。需要用population参数选择群体。如果不选,用General Recommended Probes provided by Zhou to do filtering。
ChAMP will filter all multi-hit probes.
默认过滤掉chromosome X and Y上的探针。filterXY 参数控制。
如果没有原始的.IDAT 数据,用champ.filter() 函数进行过滤。
注意:
champ.load() can not perform filtering on beta matrix alone. For users have no .IDAT data but beta matrix and Sample_Sheet.csv, you may want perform filtering using the champ.filter() function and then use following functions to do analysis.
CpG.GUI() 函数查看甲基化位点的分布情况。CpGs on chromosome, CpG island, TSS reagions.
CpG.GUI(CpG=rownames(myLoad$beta),arraytype="450K")
3. Further quality control and exploratory analysis
用champ.QC() function and QC.GUI() function检查数据质量
champ.QC()
champ.QC()函数会生成三个图:
mdsPlot (Multidimensional Scaling Plot): 基于前1000个最易变化的位点查看样本的相似度,用颜色标记不同的样本分组。
densityPlot: 查看每个样本的beta值分布,有严重偏离的样本预示着质量较差(如亚硫酸盐处理不完全等)
dendrogram:所有样本的聚类图。champ.QC()函数中Feature.sel="None" 参数表示直接通过探针数值来计算样本的距离,比较耗内存;还有 “SVD” method。
QC.GUI() 函数也可以画图,但是比较耗内存。
包括5张图:mdsPlot, type-I&II densityPlot, sample beta distribution plot, dendrogram plot and top 1000 most variable CpG’s heatmap.
QC.GUI(beta=myLoad$beta,arraytype="450K")
type-I&II densityPlot图可以帮助查看两个探针的标准化状态。
Top variable CpGs’ heatmap将前1000个差异最大的位点和状态表示出来。
4. Normalization
type-I 和 type-II 两种探针化学反应不同,设计也不同,导致分布区域也不同。这两种探针检测出的差异可能是因为探针所在位置不平衡导致的生物学差异引起的(如CpG位置的差异引起的)。最主要是type-II 探针exhibit a reduced dynamic range. 因此,针对 type-II probe bias的矫正是必要的。
champ.norm() 函数可以实现这个功能。针对type-II 探针有4种标准化的方法:BMIQ, SWAN, PBC 和 unctionalNormliazation。
850k 芯片用BMIQ标准化要好一点。但是BMIQ对质量差的样本或者甲基化偏差比较大的control样本效果不好。“cores”参数控制电脑核数,PDFplot=TRUE将图保存在resultsDir里。
myNorm <- champ.norm(beta=myLoad$beta,arraytype="450K",cores=5)
QC.GUI(myNorm,arraytype="450K")
5. SVD Plot
The singular value decomposition method (SVD) 用来用于评估数据集中变量的主要成分。这些显著性位点可能与我们感兴趣的生物学现象相关联,也可能与技术相关,如批次效应或群体效应。样本的病历信息越详细越好(如:dates of hybridization, season in which samples were collected, epidemiological information, etc),可以将这些因素包含进SVD中。
如果从 .idat导入原始文件,设置champ.SVD()函数的RGEffect=TRUE ,芯片上18个内置的对照探针(包括亚硫酸盐处理效率)将纳入确信的因素进行分析。
champ.SVD()函数将把pd文件中的所有协变量和表型数据纳入进行分析。可以用cbind()函数将自己的协变量与myLoad$pd合并进行分析。但是对于分类变量和数字变量处理方法是不一样的。 分类变量要转换成“factor” or “character”类型,数字变量转换成数字类型。
champ.SVD()分析时会把协变量打印在屏幕上,结果是热图,保存为SVDsummary.pdf文件。黑色表示最显著的p值。如果发现技术因素有影响,就需要用ComBat等方法重新标准化数据,包括variation related to the beadchip, position and/or plate。
champ.SVD(beta=myNorm,pd=myLoad$pd)
上图是用自带的测试数据绘制的,不是很复杂,看不出来。下图用GSE40279的656个样本绘制的。其中年龄是数字变量,其他都为分类变量。
6. Batch Effect Correction
ComBat方法是sva 包里的一个方法,已经整合到ChAMP包里了,batchname=c("Slide")参数控制矫正因素。champ.runCombat() 函数自动把Sample_Group作为协变量矫正,现在又加入了另一个参数variablename用来加入自己的协变量进行矫正。如果用户在 champ.runCombat()函数中写的 batchname正确,函数将自动进行批次效应矫正。
ComBat如果直接用beta值进行矫正,输出可能不在0-1之间,所以计算机在计算前需要做一个变换。如果用M-values矫正,参数 logitTrans=FALSE设置。有时候批次效应和变异会混杂在一起,如果矫正了批次效应,变异也会消失,
myCombat <- champ.runCombat(beta=myNorm,pd=myLoad$pd,batchname=c("Slide"))
champ.SVD()
7.1 Differential Methylation Probes(DMP & DMR & DMB)
目的是找出几百万CpG中的哪些在疾病中发生了变化,而这些变化又是如何导致了基因发生了变化,最终导致了人体生病。
DMP代表找出Differential Methylation Probe(差异化CpG位点),DMR代表找出Differential Methylation Region(差异化CpG区域),Block代表Differential Methylation Block(更大范围的差异化region区域)。
champ.DMP() 实现了 limma包中利用linear model计算差异甲基化位点的p-value。最新的champ.DMP()包支持分析数值型变量如年龄,分类型变量如包含多个表型的:“tumor”, “metastasis”, “control”。数值型变量(如年龄)会用linear regression模型作为协变量进行分析,to find your covariate-related CpGs, say age-related CpGs.分类型变量会按类型分类进行比较,如比较“tumor–metastatic”, “tumor-control”, and “metastatis-control”之间的差异,结果会输出一个数据框,包含差异的探针:P-value, t-statistic and difference in mean methylation(被转换为logFC,类似于RNA-seq中的log fold-change)。还包括每个探针的注释,相同组的平均beta值,两组之间的delta beta值(与 logFC相同的意思,老版本的包需要)。高级用户可以用limma 包进一步用输出的探针及p值进行DMR分析。
myDMP <- champ.DMP(beta = myNorm,pheno=myLoad$pd$Sample_Group)
head(myDMP[[1]])
champ.DMP()返回的是list,新版本的ChAMP包含GUI交互界面检查myDMP的结果。用户提供未经修改的champ.DMP (myDMP)函数产生的orginal beta matrix结果和covariates,DMP.GUI() 函数自动检测covariates是数值型还是分类型。分类型如case/control, DMP.GUI()自动画出显著性差异位点。
DMP.GUI(DMP=myDMP[[1]],beta=myNorm,pheno=myLoad$pd$Sample_Group)
7.2 Hydroxymethylation Analysis 羟甲基化
一些用户想做羟甲基化,下面为示例代码
myDMP <- champ.DMP(beta=myNorm, pheno=myLoad$pd$Sample_Group, compare.group=c("oxBS", "BS"))
hmc <- myDMP[[1]][myDMP[[1]]$deltaBeta>0,]
8. Differential Methylation Regions 差异甲基化区域
DMRs主要指一连串的CpG都会出现很明显的差异,champ.DMR()函数计算并返回一个数据框,包括:detected DMRs, with their length, clusters, number of CpGs annotated.
函数包含三种算法Bumphunter, ProbeLasso and DMRcate. Bumphunter比较可靠,精确度可以有90%以上,ProbeLasso有75%左右,DMRcate是后来集成进去的,没有评测过。Bumphunter 算法首先将所有的探针分成几小类,然后用随机permutation方法评估候选的DMRs.
myDMR <- champ.DMR(beta=myNorm,pheno=myLoad$pd$Sample_Group,method="Bumphunter")
head(myDMR$DMRcateDMR)
DMR.GUI(DMR=myDMR)
9. Differential Methylation Blocks
在Block-finder 功能中,champ.Block()函数首先在全基因组范围上计算small clusters (regions) ,然后对于每个cluster,计算平均值和位置,将每个区域压缩为一个单元。 When we finding DMB, only single unit from open sea would be used to do clustering. Here Bumphunter algorithm will be used to find “DMRs” over these regions (single units after collapse). In our previous paper23, and other scientists’ work24 we demonstrated that Differential Methylated Blocks may show universal feature across various cancers
myBlock <- champ.Block(beta=myNorm,pheno=myLoad$pd$Sample_Group,arraytype="450K")
head(myBlock$Block)
Block.GUI(Block=myBlock,beta=myNorm,pheno=myLoad$pd$Sample_Group,runDMP=TRUE,compare.group=NULL,arraytype="450K")
10. Gene Set Enrichment Analysis
寻找作用通路网络中的疾病关联小网络
After previous steps, you may already get some significant DMPs or DMRs, thus you may want to know if genes involved in these significant DMPs or DMRs are enriched for specific biological terms or pathways. To achieve this analysis, you can use champ.GSEA() to do GSEA analysis.champ.GSEA() would automatically extract gene information, transfer CpG information into gene information then conduct GSEA on each list.
There are two ways to do GSEA. In previous version, ChAMP used pathway information downloaded fromMSigDB. ThenFisher Exact Test will be used to calculate the enrichment status of each pathway. After gene enrichment analysis, champ.GSEA() function would automatically return pathways with P-value smaller then adjPval cutoff.
However, as pointed out by Geeleher [citation], since different genes has different numbers of CpGs contained inside, the two situation that one genes with 50 CpGs inside but only one of them show significant methylation, and one gene with 2 CpGs inside but two are significant methylated should not be eaqualy treated. The solution is use number of CpGs contained by genes to correct significant genes. as implemented in the gometh function from missMethyl package25. In gometh function, it used number of CpGs contained by each gene replace length as biased data, to correct this issue. The idea of gometh is fitting a curve for numbers of CpGs across genes related with GSEA, then using the probability weighting function to correct GO’s p value.
champ.GSEA() function as “goseq” to use goseq method to do GSEA, or user may set it as “fisher” to do normal Gene Set Enrichment Analysis.
myGSEA <- champ.GSEA(beta=myNorm,DMP=myDMP[[1]], DMR=myDMR, arraytype="450K",adjPval=0.05, method="fisher")
# myDMP and myDMR could (not must) be used directly.
11. Differential Methylated Interaction Hotspots
champ.EpiMod() This function usesFEM package to infer differentially methylated gene modules within a user-specific functional gene-network. This network could be e.g. a protein-protein interaction network. Thus, the champ.EpiMod() function can be viewed as a functional supervised algorithm, which uses a network of relations between genes (usually a PPI network), to identify subnetworks where a significant number of genes are associated with a phenotype of interest (POI). The EpiMod algorithm can be run in two different modes: at the probe level, in which case the most differentially methylated probe is assigned to each gene, or at the gene-level in which case a DNAm value is assigned to each gene using an optimized procedure described in detail in Jiao Y, Widschwendter M, Teschendorff AE Bioinformatics 2014. Originally, the FEM package was developed to infer differentially methylated gene modules which are also deregulated at the gene expression level, however here we only provide the EpiMod version, which only infers differentially methylated modules. More advanced user may refer to FEM package for more information.
myEpiMod <- champ.EpiMod(beta=myNorm,pheno=myLoad$pd$Sample_Group)
12. Cell Type Heterogeneity
由于DNA甲基化有高的细胞特异性,许多DMPs/DMRs的变化是由细胞成分导致的。许多方法可以矫正这个问题:RefbaseEWAS用组织的细胞类型做参考数据库,确定细胞比例。In ChAMP, we include a reference databases for whole blood, one for 27K and the other for 450K. After champ.refbase() function, cell type heterogeneity corrected beta matrix, and cell-type specific proportions in each sample will be returned. Do remember champ.refbase() can only works on Blood Sample Data Set.
myRefBase <- champ.refbase(beta=myNorm,arraytype="450K")
# Our test data set is not whole blood. So it should not be run here.
参考:https://bioconductor.org/packages/release/bioc/vignettes/ChAMP/inst/doc/ChAMP.html
http://blog.csdn.net/joshua_hit/article/details/54982018
https://www.jianshu.com/p/6411e8acfab3
◆ ◆ ◆ ◆ ◆