A senior labmate published five papers in one year with clinical data, and it turns out this is the trick! (Worth bookmarking!)
Hi everyone, I'm Feng, and this is Feng's rambling column. Laba has passed, which means Chinese New Year is just around the corner. I started writing this post at midnight on Laba, wondering whether I'll make it home for Spring Festival this year. Some people have had nothing to do during the pandemic, while others used the free time to publish several SCI papers. A senior student in the dorm next door, for example, managed to publish five papers in a row during the pandemic break. Outrageous! To uncover this legendary secret technique, I downloaded all of his papers and compared them carefully. And sure enough, I found his secret: clinical data plus decision trees!
What is a decision tree?
As clinicians or graduate students, the data we know best is the clinical data in our own hands. If the data come with follow-up outcomes, great: we can run a prognostic analysis. But what if they don't? Are we out of options? Of course not! Prognostic studies have their playbook, and diagnostic studies have their own models too, such as today's topic: the decision tree.
So what is a "decision tree"? It is a widely used model in data mining. The basic idea is to recursively apply binary splits to the predictor variables, building a tree that can then be used to predict which class a new sample belongs to. Common variants include the classic decision tree and the conditional inference tree.
Classic decision trees
A classic decision tree is built from one binary outcome variable and a set of predictors. The binary outcome can be any two-class variable: benign vs. malignant, survival vs. death, recurrence vs. no recurrence, and so on. The predictors can include clinical variables such as age and sex, blood test results, or cytological features from pathology. Here we use predicting whether a tumor is benign or malignant as the running example. We do not need to understand the tree-growing algorithm itself in detail; we only need to know how to apply a decision tree to our own data.
Building a classic decision tree
We will grow the tree with the rpart() function from the rpart package. A classic decision tree usually has many branches, so we use 10-fold cross-validation to select the tree with the smallest prediction error, and then cut the rpart result back to that size with the prune() function. Below we take a breast cancer dataset as an example and build a decision tree that classifies breast masses as benign or malignant:
library(rpart)
library(tidyverse)
# data preparation
breast <- read.table('breast_cancer.data',
                     sep = ',',
                     header = FALSE,
                     na.strings = '?') # load the data (file available from the official account backend)
names(breast) <- c('ID',
                   'Tumor_Thickness',
                   'SizeUniformity',
                   'shapeUniformity',
                   'maginalAdhesion',
                   'singleEpiCellSize',
                   'bareNuc',
                   'blandChr',
                   'normalNuc',
                   'mitosis',
                   'class') # rename the columns
df <- breast[-1] # drop the ID column
df$class <- factor(df$class, levels = c(2, 4),
                   labels = c('benign', 'malignant')) # recode class as a factor: benign vs malignant
set.seed(2021) # set the seed before sampling so the split is reproducible
train <- sample(nrow(df), 0.7*nrow(df)) # draw 70% of the rows as the training set
df.train <- df[train,]
df.validata <- df[-train,] # the remaining 30% form the validation set
table(df.train$class) # class counts in the training set
##
## benign malignant
## 330 159
table(df.validata$class) # class counts in the validation set
##
## benign malignant
## 128 82
# classic decision tree
set.seed(202101292)
# grow the tree
dtree <- rpart(class ~ ., # class is the outcome; '.' stands for all remaining variables
               data = df.train, # fit the model on the training set
               method = 'class',
               parms = list(split = 'information'))
print(dtree) # print the fitted tree
## n= 489
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 489 159 benign (0.67484663 0.32515337)
## 2) SizeUniformity< 2.5 309 9 benign (0.97087379 0.02912621) *
## 3) SizeUniformity>=2.5 180 30 malignant (0.16666667 0.83333333)
## 6) SizeUniformity< 4.5 68 27 malignant (0.39705882 0.60294118)
## 12) bareNuc< 3.5 26 4 benign (0.84615385 0.15384615) *
## 13) bareNuc>=3.5 42 5 malignant (0.11904762 0.88095238) *
## 7) SizeUniformity>=4.5 112 3 malignant (0.02678571 0.97321429) *
summary(dtree) # detailed information about the tree
## Call:
## rpart(formula = class ~ ., data = df.train, method = 'class',
## parms = list(split = 'information'))
## n= 489
##
## CP nsplit rel error xerror xstd
## 1 0.75471698 0 1.0000000 1.0000000 0.06514843
## 2 0.05660377 1 0.2452830 0.3081761 0.04176119
## 3 0.01000000 3 0.1320755 0.1761006 0.03231305
##
## Variable importance
## SizeUniformity shapeUniformity bareNuc singleEpiCellSize
## 23 17 16 15
## blandChr normalNuc maginalAdhesion Tumor_Thickness
## 15 14 1 1
##
## Node number 1: 489 observations, complexity param=0.754717
## predicted class=benign expected loss=0.3251534 P(node) =1
## class counts: 330 159
## probabilities: 0.675 0.325
## left son=2 (309 obs) right son=3 (180 obs)
## Primary splits:
## SizeUniformity < 2.5 to the left, improve=186.6152, (0 missing)
## shapeUniformity < 2.5 to the left, improve=177.5238, (0 missing)
## bareNuc < 2.5 to the left, improve=176.0200, (11 missing)
## blandChr < 3.5 to the left, improve=172.9661, (0 missing)
## normalNuc < 2.5 to the left, improve=149.5745, (0 missing)
## Surrogate splits:
## shapeUniformity < 3.5 to the left, agree=0.914, adj=0.767, (0 split)
## singleEpiCellSize < 2.5 to the left, agree=0.885, adj=0.689, (0 split)
## blandChr < 3.5 to the left, agree=0.881, adj=0.678, (0 split)
## normalNuc < 2.5 to the left, agree=0.877, adj=0.667, (0 split)
## bareNuc < 2.5 to the left, agree=0.871, adj=0.650, (0 split)
##
## Node number 2: 309 observations
## predicted class=benign expected loss=0.02912621 P(node) =0.6319018
## class counts: 300 9
## probabilities: 0.971 0.029
##
## Node number 3: 180 observations, complexity param=0.05660377
## predicted class=malignant expected loss=0.1666667 P(node) =0.3680982
## class counts: 30 150
## probabilities: 0.167 0.833
## left son=6 (68 obs) right son=7 (112 obs)
## Primary splits:
## SizeUniformity < 4.5 to the left, improve=21.59943, (0 missing)
## bareNuc < 3.5 to the left, improve=21.24393, (2 missing)
## blandChr < 3.5 to the left, improve=20.17688, (0 missing)
## shapeUniformity < 3.5 to the left, improve=16.80110, (0 missing)
## Tumor_Thickness < 6.5 to the left, improve=15.72829, (0 missing)
## Surrogate splits:
## shapeUniformity < 5.5 to the left, agree=0.789, adj=0.441, (0 split)
## singleEpiCellSize < 3.5 to the left, agree=0.750, adj=0.338, (0 split)
## blandChr < 3.5 to the left, agree=0.711, adj=0.235, (0 split)
## maginalAdhesion < 2.5 to the left, agree=0.706, adj=0.221, (0 split)
## bareNuc < 1.5 to the left, agree=0.672, adj=0.132, (0 split)
##
## Node number 6: 68 observations, complexity param=0.05660377
## predicted class=malignant expected loss=0.3970588 P(node) =0.1390593
## class counts: 27 41
## probabilities: 0.397 0.603
## left son=12 (26 obs) right son=13 (42 obs)
## Primary splits:
## bareNuc < 3.5 to the left, improve=18.806650, (1 missing)
## Tumor_Thickness < 6.5 to the left, improve=10.921810, (0 missing)
## blandChr < 3.5 to the left, improve= 8.433749, (0 missing)
## shapeUniformity < 2.5 to the left, improve= 8.035364, (0 missing)
## maginalAdhesion < 6.5 to the left, improve= 6.901954, (0 missing)
## Surrogate splits:
## shapeUniformity < 2.5 to the left, agree=0.731, adj=0.308, (1 split)
## Tumor_Thickness < 4.5 to the left, agree=0.716, adj=0.269, (0 split)
## maginalAdhesion < 2.5 to the left, agree=0.687, adj=0.192, (0 split)
## blandChr < 2.5 to the left, agree=0.672, adj=0.154, (0 split)
## normalNuc < 2.5 to the left, agree=0.672, adj=0.154, (0 split)
##
## Node number 7: 112 observations
## predicted class=malignant expected loss=0.02678571 P(node) =0.2290389
## class counts: 3 109
## probabilities: 0.027 0.973
##
## Node number 12: 26 observations
## predicted class=benign expected loss=0.1538462 P(node) =0.05316973
## class counts: 22 4
## probabilities: 0.846 0.154
##
## Node number 13: 42 observations
## predicted class=malignant expected loss=0.1190476 P(node) =0.08588957
## class counts: 5 37
## probabilities: 0.119 0.881
dtree$cptable # inspect the cptable to pick the best tree
## CP nsplit rel error xerror xstd
## 1 0.75471698 0 1.0000000 1.0000000 0.06514843
## 2 0.05660377 1 0.2452830 0.3081761 0.04176119
## 3 0.01000000 3 0.1320755 0.1761006 0.03231305
We have now grown a decision tree and printed its details, but how do we use these results to prune it? We only need to read the cptable. In our model, CP is the complexity parameter, used to penalize overly large trees. The size of the tree is the number of splits, nsplit; a tree with n splits has n+1 terminal nodes. rel error is the error on the training data itself. xerror is the error after 10-fold cross-validation, which you can think of as the corrected error, and xstd is the standard deviation of the cross-validated error.
With the columns understood, let's interpret our model. The smallest cross-validated error is 0.1761006 with a standard error of 0.03231305, so the optimal tree is the smallest one whose cross-validated error lies within 0.1761006 ± 0.03231305, i.e. between 0.1437876 and 0.2084137. Going back to the cptable, the only xerror in that interval is 0.1761006 itself, with a CP of 0.01 and nsplit of 3. So our optimal tree has 3 splits and 4 terminal nodes. With that settled, let's visualize the optimal tree:
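This one-standard-error rule can also be applied in code rather than by eye. Here is a minimal sketch with the cptable values hard-coded from the output printed above, so it runs on its own; in practice you would index `dtree$cptable` directly, since it already has these columns:

```r
# 1-SE rule: pick the smallest tree whose cross-validated error is
# within one standard error of the minimum xerror
# (values copied from the cptable printed above)
cptab <- data.frame(
  CP     = c(0.75471698, 0.05660377, 0.01000000),
  nsplit = c(0, 1, 3),
  xerror = c(1.0000000, 0.3081761, 0.1761006),
  xstd   = c(0.06514843, 0.04176119, 0.03231305)
)
best      <- which.min(cptab$xerror)               # row with the smallest xerror
threshold <- cptab$xerror[best] + cptab$xstd[best] # minimum + 1 SE
chosen    <- min(which(cptab$xerror <= threshold)) # smallest qualifying tree
cptab$CP[chosen] # CP value to hand to prune(), here 0.01
```

The same three lines of logic work unchanged on `dtree$cptable` from your own run.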
# plot the cross-validation results to double-check the choice above
plotcp(dtree) # the plot agrees, so we choose CP = 0.01
dtree.pruned <- prune(dtree, cp = .0100) # prune back to the optimal tree
# draw the pruned tree with prp() from the rpart.plot package
library(rpart.plot)
prp(dtree.pruned,
    type = 2,
    extra = 104,
    fallen.leaves = TRUE, main = 'Decision Tree')
# if you don't like that plot style, the partykit package gives a prettier version
library(partykit)
plot(as.party(dtree.pruned))
# next, check the model's reliability on the hold-out validation set
dtree.pred <- predict(dtree.pruned,
df.validata,
type = 'class')
dtree.perf <- table(df.validata$class,
dtree.pred,
dnn = c('Actual', 'Predicted'))
dtree.perf
## Predicted
## Actual benign malignant
## benign 124 4
## malignant 9 73
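From the confusion matrix above you can also report the usual diagnostic metrics. A quick sketch, with the counts hard-coded from `dtree.perf` and malignant treated as the positive class:

```r
# counts copied from the confusion matrix printed above
tp <- 73   # malignant predicted malignant
fn <- 9    # malignant predicted benign
tn <- 124  # benign predicted benign
fp <- 4    # benign predicted malignant
accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn) # recall for malignant
specificity <- tn / (tn + fp)
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
# accuracy 0.938, sensitivity 0.890, specificity 0.969
```

These are the numbers reviewers usually want alongside the tree plot.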
And with that, the classic decision tree model is complete. It's not hard, but it is competitive: after seeing so many risk prediction models, finishing a paper with a decision tree is a nice way to make it shine, right? Can we go one step fancier? Of course. Next up: the conditional inference tree.
Building a conditional inference tree
A conditional inference tree is similar to a classic decision tree; the difference lies in the algorithm. Its variables and split points are chosen based on significance tests rather than on measures of homogeneity or purity. Again, let's go straight to the how-to; this part of the code is very simple:
library(party)
fit.ctree <- ctree(class ~ ., data = df.train) # fit the model with ctree()
print(fit.ctree)
## 
##   Conditional inference tree with 7 terminal nodes
## 
## Response:  class 
## Inputs:  Tumor_Thickness, SizeUniformity, shapeUniformity, maginalAdhesion, singleEpiCellSize, bareNuc, blandChr, normalNuc, mitosis 
## Number of observations:  489 
## 
## 1) bareNuc <= 2; criterion = 1, statistic = 333.722
##   2) SizeUniformity <= 3; criterion = 1, statistic = 226.573
##     3) mitosis <= 1; criterion = 1, statistic = 79.262
##       4)*  weights = 295 
##     3) mitosis > 1
##       5)*  weights = 8 
##   2) SizeUniformity > 3
##     6)*  weights = 16 
## 1) bareNuc > 2
##   7) blandChr <= 3; criterion = 1, statistic = 42.137
##     8) Tumor_Thickness <= 6; criterion = 1, statistic = 20.794
##       9)*  weights = 26 
##     8) Tumor_Thickness > 6
##       10)*  weights = 15 
##   7) blandChr > 3
##     11) bareNuc <= 8; criterion = 0.991, statistic = 10.744
##       12)*  weights = 40 
##     11) bareNuc > 8
##       13)*  weights = 89 
summary(fit.ctree)
##     Length      Class       Mode 
##          1 BinaryTree         S4
plot(fit.ctree) # draw the conditional inference tree
# validate the model on the hold-out validation set
ctree.pred <- predict(fit.ctree,
df.validata,
type = 'response')
ctree.perf <- table(df.validata$class,
ctree.pred,
dnn = c('Actual','Predicted'))
ctree.perf
## Predicted
## Actual benign malignant
## benign 124 4
## malignant 7 75
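For a head-to-head comparison of the two models, overall accuracy can be read straight off the two confusion matrices (counts hard-coded from `dtree.perf` and `ctree.perf` above):

```r
# 210 validation samples in total (128 benign + 82 malignant)
acc_rpart <- (124 + 73) / 210 # pruned classic decision tree
acc_ctree <- (124 + 75) / 210 # conditional inference tree
round(c(rpart = acc_rpart, ctree = acc_ctree), 3) # 0.938 vs 0.948
```

On this particular split the conditional inference tree misclassifies two fewer malignant cases, but with only 210 validation samples the difference is small; don't over-interpret it.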
It's 2 a.m. and I've just finished this post. They say late nights invite sentimentality, and it's true! I can't help remembering when, years ago, another senior asked me whether I could build a decision tree, and all I could do was ask him what on earth a decision tree was /(ㄒoㄒ)/~~ So maybe the things I can't do today, I'll have mastered by the end of Spring Festival? Ah, late-night daydreaming, rambling nonsense, wishful thinking! That's all for today. Good night, everyone. Reply "feng34" in the backend to get the code and data for this post, and see you next time! *^_^*~