GWAS Still Has an Edge
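In February of this year (2020), a research team from the Department of Radiation Oncology at the PLA General Hospital published a paper in the international journal Journal of Cancer titled "Precise prediction of the radiation pneumonitis in lung cancer: an explorative preliminary mathematical model using genotype information".
The paper shows that radiosensitivity is predictable at the genetic level and builds a precise prediction model whose sensitivity and specificity both exceed 90%. The result is billed as world-leading and as a solid scientific basis for stratifying radiotherapy patients in the clinic.
I skimmed it and found, to my surprise, that it is plain GWAS thinking. The differences are that the sample size is tiny and the genotyping array is fairly customized; what makes the paper stand out is the patient cohort and the clinical question it addresses.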
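Study outline
It is very easy to follow: the authors collected just over a hundred lung cancer patients (different stages, different histologies), half of whom had **RP grade ≥2**, and genotyped them before radiotherapy, which left about 300,000 sites after quality control (299,054 according to the methods quoted further down). In the paper's words: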
Radiation pneumonitis (RP) is the most significant dose-limiting toxicity and is one major obstacle for lung cancer radiotherapy. Grade ≥2 RP usually needs clinical interventions and severe RP could be life threatening. The purpose of this study is to develop an approach for the personalized RP risk prediction. A multiple linear regression model named Radiation Pneumonitis Index (RPI) was built, for the assessment of Grade ≥2 RP risk.
The RP grade is the core of the study: Once diagnosed, RP was further graded by at least two radiation oncologists following the Common Toxicity Criteria for Adverse Events (CTCAE) version 4.03.
Methods
The genotyping array is the Infinium® Global Screening Array system (Illumina, San Diego, CA, USA), with about 720,000 sites (720,078 SNPs before quality control, per the methods quoted below).
The samples: Peripheral blood leukocytes from patients before the radiotherapy was used for genomic DNA extraction using the Maxwell system (Promega, Madison, WI, USA).
The patient cohort: Archived information of 118 lung cancer patients was obtained from the People's Liberation Army General Hospital.
The GWAS analysis steps:
We excluded SNPs in each individual dataset that had a mean GenCall score < 0.7, missingness >5%, MAF < 0.01 or a Hardy-Weinberg equilibrium test P < 10⁻⁶ using PLINK. We also excluded variants with multiple alleles. A total of 720,078 SNPs in the genotypic data set and 299,054 SNPs in the dataset passed this process for further prediction.
So about 299,000 sites remain after quality control. Genotypes are coded 0, 1, 2, a three-class scheme for wild-type, heterozygous, and homozygous variant (We assigned value 0 to 'WW' genotype, value 1 to 'WA'/'AW' genotype and value 2 to 'AA' genotype.).
After modeling: Thirty-nine effective SNP sites were discovered after applying the GLMNET regression on 90 sets of random training data.
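The 0/1/2 coding and two of the quoted filters (missingness > 5%, MAF < 0.01) are simple enough to sketch in a few lines. This is only an illustration on a made-up toy matrix, not the authors' pipeline: they used PLINK, the `"NN"` missing code and the function names are my own assumptions, and the GenCall and Hardy-Weinberg filters are left out.

```python
import numpy as np

def encode_genotype(gt):
    """Map a two-letter genotype to 0/1/2 by counting 'A' alleles ('W' = wild type)."""
    if gt in (None, "", "NN"):  # hypothetical codes for a missing call
        return np.nan
    return float(gt.count("A"))

def qc_mask(G, max_missing=0.05, min_maf=0.01):
    """Boolean mask over SNP columns that pass the missingness and MAF filters."""
    missing = np.isnan(G).mean(axis=0)         # fraction of uncalled patients per SNP
    alt_freq = np.nanmean(G, axis=0) / 2.0     # 'A' allele frequency among called genotypes
    maf = np.minimum(alt_freq, 1.0 - alt_freq)
    return (missing <= max_missing) & (maf >= min_maf)

# Toy data: 4 patients (rows) x 4 SNPs (columns), invented for illustration.
raw = [
    ["WW", "WA", "AA", "WW"],
    ["WA", "WW", "AA", "WW"],
    ["AW", "NN", "AA", "WW"],
    ["WW", "WW", "AA", "WW"],
]
G = np.array([[encode_genotype(g) for g in row] for row in raw])
mask = qc_mask(G)
print(G)
print(mask)
```

In this toy matrix only the first SNP survives: the second has a missing call in one of four patients, and the last two are monomorphic, so their minor allele frequency is zero.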
The key piece is the GLMNET algorithm, whose full name is: Generalized Linear Models via Lasso and Elastic-Net Regularization.
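The paper does not spell out how the "90 sets of random training data" were turned into 39 SNPs. One plausible reading, and it is only my guess, is to fit a sparse model on each random training subset and keep the SNPs that are selected often. The sketch below does that with a small hand-written lasso (Gaussian loss, cyclic coordinate descent) on synthetic genotypes. The function names, the 80% training fraction, the penalty value, and the data are all assumptions, and a 0/1 outcome is fitted with a linear model purely for simplicity.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=50):
    """Lasso for (1/2n)*||y - b0 - X b||^2 + lam*||b||_1 via cyclic coordinate descent."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)            # centering absorbs the intercept
    resid = y - y.mean()               # residual for beta = 0
    col_sq = (Xc ** 2).sum(axis=0) / n
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue               # SNP is constant in this subset
            rho = Xc[:, j] @ resid / n + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= Xc[:, j] * (new - beta[j])
            beta[j] = new
    return beta

def selection_frequency(X, y, lam, n_sets=90, train_frac=0.8, seed=0):
    """Fraction of random training subsets in which each SNP gets a nonzero coefficient."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_train = int(round(train_frac * n))
    counts = np.zeros(p)
    for _ in range(n_sets):
        idx = rng.choice(n, size=n_train, replace=False)
        counts += (lasso_cd(X[idx], y[idx], lam) != 0.0)
    return counts / n_sets

# Synthetic stand-in data: 118 patients, 50 SNPs coded 0/1/2; only SNP 0 and SNP 1 matter.
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(118, 50)).astype(float)
y = (X[:, 0] + X[:, 1] + rng.normal(0.0, 0.5, size=118) > 2.0).astype(float)

freq = selection_frequency(X, y, lam=0.1)
print(np.round(freq[:5], 2))
```

SNPs with a real effect should be picked in most subsets, while noise SNPs show up only occasionally, so a frequency cutoff gives a short list. Whether the authors did anything like this is not stated.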
The resulting figure:
(Figure: genotypes at about 300,000 sites for the roughly 100 patients)
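At this point the setup looks a lot like a conventional survival analysis on an expression matrix, except that in genotype data the "expression level" takes only three values, 0, 1, and 2, instead of the continuous values of a real RNA-seq or microarray expression matrix. The clinical endpoint is very similar too: in survival analysis a patient is usually either alive or dead, and here the patients are likewise split by whether or not they developed Grade ≥2 RP.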
The authors are vague about this GLMNET algorithm. The R package glmnet covers:
Ridge regression, and the LASSO (least absolute shrinkage and selection operator)
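It is mostly used from R: LASSO regression is α=1, ridge regression is α=0, and the general elastic net model has 0<α<1. The parameter α controls how the model behaves on highly correlated predictors. Before learning it you should be comfortable with simple linear regression and logistic regression; here is one resource I recommend: https://rstudio-pubs-static.s3.amazonaws.com/208326_53c039603b3c45619f9fb2c0baf5fa28.html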
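To make the role of α concrete: glmnet penalizes the coefficients with λ·[(1−α)/2·‖β‖₂² + α·‖β‖₁], so α=1 leaves only the L1 (LASSO) term and α=0 only the squared L2 (ridge) term. The small Python function below just evaluates that penalty for a made-up coefficient vector; the function name and the numbers are illustrative and have nothing to do with the paper's fitted model.

```python
import numpy as np

def elastic_net_penalty(beta, lam, alpha):
    """Elastic-net penalty: lam * ((1 - alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1)."""
    beta = np.asarray(beta, dtype=float)
    l1 = np.abs(beta).sum()
    l2_sq = (beta ** 2).sum()
    return lam * ((1.0 - alpha) / 2.0 * l2_sq + alpha * l1)

beta = [3.0, -4.0, 0.0]  # hypothetical coefficients
lasso_pen = elastic_net_penalty(beta, lam=1.0, alpha=1.0)  # pure L1 term
ridge_pen = elastic_net_penalty(beta, lam=1.0, alpha=0.0)  # pure (halved) squared L2 term
mixed_pen = elastic_net_penalty(beta, lam=1.0, alpha=0.5)  # a blend of the two
print(lasso_pen, ridge_pen, mixed_pen)
```

The L1 term is what pushes coefficients to exactly zero, which is why a LASSO-like setting can reduce hundreds of thousands of SNPs to a few dozen.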
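The paper refers to the GLMNET algorithm only by its full name, Generalized Linear Models via Lasso and Elastic-Net Regularization, and never says which specific method (that is, which α) was used. Do the authors not want us to reproduce it? Then again, they provide neither the genotype matrix (about 300,000 sites for the roughly 100 patients) nor per-patient clinical information, so all we can do is look and keep quiet.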
Finally, we at 生信技能树 admittedly have no GWAS tutorial, but over at 生信菜鸟团 we do have a GWAS series: