Non-linear modeling

1. Background

1.1 Linearity

Linearity refers to a mathematical relationship or function that can be represented graphically as a straight line. [Wikipedia]

1.2 The linearity assumption

Linearity is a common assumption when building linear regression models. In a linear relationship, a continuous variable has the same effect across its entire range. This makes the estimates easy to interpret: a one-unit increase in the independent variable changes the outcome by the corresponding coefficient. While this yields simple models with their own merits, it is hard to believe that nature follows only simple straight-line relationships. A large dataset should allow the non-linear effects of a model to be tested; when non-linearity is present, other forms of regression are needed for the fit.
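As a minimal sketch of that interpretation (the simulated data and variable names below are purely illustrative):

set.seed(1)
some_data <- data.frame(Age = runif(500, 40, 90),
                        Sex = sample(c("Male", "Female"), 500, replace = TRUE))
some_data$outcome <- 0.1 + 0.002 * some_data$Age +
  0.05 * (some_data$Sex == "Male") + rnorm(500, sd = 0.1)

# Under the linearity assumption each coefficient is a single per-unit effect
fit <- lm(outcome ~ Age + Sex, data = some_data)
coef(fit)["Age"]  # the expected change in the outcome per additional year of age,
                  # assumed to be the same at age 45 as at age 85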

1.3 Handling non-linearity

Age is an important confounder for most health-related outcomes, and probably the most common adjustment variable. In general, modeling age as a purely linear variable should be avoided.

The Swedish Hip Arthroplasty Register database includes more than 30,000 patients who completed the EQ-5D form both before and after surgery. The study looks at age and its effect on the EQ-5D index and the EQ-VAS one year after total hip arthroplasty [1].

For example: y ~ β0 + βage * Age + βsex * Sex

There are many non-linear alternatives that can better capture the actual relationship, and most of them rely on converting a single continuous variable into several continuous variables. The simplest form is the polynomial, expanding age with age squared:

y ~ β0 + βage * Age + βage² * Age² + βsex * Sex

This allows the line to bend, but the coefficient of the squared term is hard to interpret, and if a cubic term such as Age³ is added, interpreting the coefficients becomes close to impossible. Because of this difficulty, either the rms::contrast function or stats::predict is used to illustrate the effect of the variable.
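A rough sketch of the stats::predict approach (the simulated data below are purely illustrative):

set.seed(1)
d <- data.frame(Age = runif(1000, 40, 90))
d$eq5d <- 0.6 + 0.012 * d$Age - 0.0001 * d$Age^2 + rnorm(1000, sd = 0.05)

# The coefficients of the polynomial terms are hard to read on their own...
poly_fit <- lm(eq5d ~ poly(Age, 2), data = d)

# ...so predict over a grid of ages instead and look at the shape of the curve
new_data <- data.frame(Age = seq(40, 90, by = 1))
new_data$pred <- predict(poly_fit, newdata = new_data)
plot(new_data$Age, new_data$pred, type = "l",
     xlab = "Age", ylab = "Predicted outcome")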

1.4 Splines

The most common alternative to polynomials is the spline. In its most basic form, a spline consists of lines joined at "knots". For example, a linear spline with one knot can take a V shape, while two knots can give an N shape. The number of knots determines the flexibility of the spline, i.e. more knots allow the relationship to be described in more detail.
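A minimal sketch of a linear spline with a single knot (the knot position and simulated data are made up for illustration):

set.seed(123)
x <- runif(200)
y <- abs(x - 0.5) + rnorm(200, sd = 0.05)  # a V-shaped relationship

# The linear spline adds a "hinge" term that is zero below the knot
knot <- 0.5
hinge <- pmax(x - knot, 0)

# The fit may now change slope at the knot:
# slope below the knot = coefficient of x,
# slope above the knot = coefficient of x + coefficient of hinge
fit <- lm(y ~ x + hinge)
coef(fit)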

2. Models

This post covers the following: (1) modeling non-linear effects in a linear regression setting; (2) benchmarking the methods on a real dataset; (3) looking at what the non-linearity actually looks like.

2.1 The restricted cubic spline

The restricted cubic spline, also known as the natural spline: ① uses cubic terms in the center of the data and restricts the ends to straight lines, which prevents the center from distorting the tails (curling); ② the segments join smoothly at the knots, so there is none of the step behavior seen in piecewise regression.

First, the dataset is split with caret::createDataPartition, and then Frank Harrell's rms package is set up via rms::datadist so that its restricted cubic spline functions can be used.

library(caret)
# splitting the dataset using the caret::createDataPartition
train_no <- createDataPartition(1:nrow(my_dataset), list = FALSE, p = .7)
train <- my_dataset[train_no, ]
test <- my_dataset[-train_no, ]

# A list with all the fits that are later to be benchmarked
fits <- list(eq5d = list(), eq_vas = list())

# The rms-setup
library(rms)
ddist <- datadist(train)
options("datadist" = "ddist")

Frank Harrell is a proponent of the restricted cubic spline (also known as the natural spline). This type of spline uses cubic terms in the center of the data and restricts the ends to straight lines, preventing the center from distorting the tails (curling). His rcs() also integrates nicely with anova for checking whether non-linearity is actually present:

idx_model <- ols(eq5d1 ~ rcs(Age, 3) + Sex * Charnley_class +
                 rcs(eq5d0, 3) + rcs(pain0, 3),
                 data=train, x=TRUE, y=TRUE)
anova(idx_model, ss=FALSE)
# gives:
#
# Analysis of Variance    Response: eq5d1
#
#  Factor                                               F      d.f. P
#  Age                                                  140.71  2   <.0001
#   Nonlinear                                            41.07  1   <.0001
#  Sex  (Factor+Higher Order Factors)                    26.94  3   <.0001
#   All Interactions                                      6.80  2   0.0011
#  Charnley_class  (Factor+Higher Order Factors)        275.08  4   <.0001
#   All Interactions                                      6.80  2   0.0011
#  eq5d0                                                522.37  2   <.0001
#   Nonlinear                                            36.54  1   <.0001
#  pain0                                                  3.85  2   0.0213
#   Nonlinear                                             4.56  1   0.0328
#  Sex * Charnley_class  (Factor+Higher Order Factors)    6.80  2   0.0011
#  TOTAL NONLINEAR                                       30.48  3   <.0001
#  TOTAL NONLINEAR + INTERACTION                         21.02  5   <.0001
#  TOTAL                                                330.16 11   <.0001
#
# Error d.f.: 19096

The model seems fine for the EQ-5D index: there is support both for non-linearity and for an interaction between Charnley_class and sex. If for some reason the rms package cannot be used, linearity can also be checked by comparing two regular models with splines::ns, as shown below.

lm1 <- lm(eq5d1 ~ Age + Sex * Charnley_class + ns(eq5d0, 3) + ns(pain0, 3), data=train)
lm2 <- lm(eq5d1 ~ ns(Age, 3) + Sex * Charnley_class + ns(eq5d0, 3) + ns(pain0, 3), data=train)
anova(lm1, lm2)
# gives:
#
# Analysis of Variance Table
# Model 1: eq5d1 ~ Age + Sex * Charnley_class + ns(eq5d0, 3) + ns(pain0, 3)
# Model 2: eq5d1 ~ ns(Age, 3) + Sex * Charnley_class + ns(eq5d0, 3) + ns(pain0, 3)
#
#   Res.Df    RSS Df Sum of Sq      F    Pr(>F)
# 1  19095 193.01
# 2  19093 192.66  2   0.35112 17.398 2.825e-08 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To avoid overfitting, the model is selected according to the AIC (Akaike information criterion) / BIC (Bayesian information criterion): choose the model with the smallest value. Compared with BIC, AIC generally allows a slightly more complex model. We will use AIC to find the best number of knots for the EQ-5D index:

#' A simple function for updating the formula and extracting the information criteria
#'
#' @param no A number that is used together with the add_var_str
#' @param fit A regression fit that is used for the update
#' @param rm_var The variable that is to be substituted
#' @param add_var_str A sprintf() string that accepts the no parameter for each update
#' @param ic_fn The information criteria function (AIC/BIC)
getInfCrit <- function(no, fit, rm_var, add_var_str, ic_fn) {
  new_var <- sprintf(add_var_str, no)
  updt_frml <- as.formula(sprintf(".~.-%s+%s", rm_var, new_var))
  ret <- ic_fn(update(fit, updt_frml))
  names(ret) <- new_var
  return(ret)
}

# We start with a model where the other splines have been AIC-selected
idx_model <- ols(eq5d1 ~ rcs(Age, 3) + Sex * Charnley_class +
                 rcs(eq5d0, 8) + rcs(pain0, 6),
                 data=train, x=TRUE, y=TRUE)
sapply(3:9, getInfCrit,
       fit = idx_model, rm_var = "rcs(Age, 3)",
       add_var_str = "rcs(Age, %d)", ic_fn = AIC)
# rcs(Age, 3) rcs(Age, 4) rcs(Age, 5) rcs(Age, 6) rcs(Age, 7) rcs(Age, 8) rcs(Age, 9)
#   -33678.89   -33674.68   -33686.30   -33686.93   -33685.95   -33686.73   -33699.37

fits$eq5d[["RCS with AIC"]] <- update(idx_model, .~.-rcs(Age, 3)+rcs(Age, 5))

Whether the model should stop at 3 knots can be debated, since the AIC values at 5 or more knots are clearly lower. Unfortunately the 4-knot model appears to be a local maximum, and the 5-knot version is therefore used. Even more knots could be tried, but those are hard to interpret even with proper visualization. When modeling confounders, such as the preoperative EQ-5D index (eq5d0) and preoperative pain (pain0), a higher number of knots may be preferred in order to avoid residual confounding, since there is no need to worry about visualizing or interpreting those relationships.

Now, if we apply the same approach to the EQ-VAS, we get:

vas_model <- ols(eq_vas1 ~ Sex * Charnley_cat + Sex + rcs(Age, 3) +
                 rcs(eq_vas0, 3) + OpNr + rcs(pain0, 3),
                 data=train, x=TRUE, y=TRUE)
sapply(3:9, getInfCrit,
       fit = vas_model, rm_var = "rcs(Age, 3)",
       add_var_str = "rcs(Age, %d)", ic_fn = AIC)
# rcs(Age, 3) rcs(Age, 4) rcs(Age, 5) rcs(Age, 6) rcs(Age, 7) rcs(Age, 8) rcs(Age, 9)
#    166615.6    166619.8    166600.2    166600.1    166602.0    166603.0    166596.7

fits$eq_vas[["RCS with AIC"]] <- update(vas_model, .~.-rcs(Age, 3)+rcs(Age, 5))

The BIC criterion: BIC is more conservative than the more commonly used AIC; when the sample size is large, BIC can be preferred.

idx_model <- ols(eq5d1 ~ rcs(Age, 3) + Sex * Charnley_cat +
                 rcs(eq5d0, 3) + rcs(pain0, 3),
                 data=train, x=TRUE, y=TRUE)
sapply(3:9, getInfCrit,
       fit = idx_model, rm_var = "rcs(Age, 3)",
       add_var_str = "rcs(Age, %d)", ic_fn = BIC)
# rcs(Age, 3) rcs(Age, 4) rcs(Age, 5) rcs(Age, 6) rcs(Age, 7) rcs(Age, 8) rcs(Age, 9)
#   -33486.35   -33474.16   -33477.95   -33470.69   -33462.17   -33455.28   -33459.98
fits$eq5d[["RCS with BIC"]] <- idx_model

vas_model <- ols(eq_vas1 ~ rcs(Age, 3) + Sex * Charnley_cat +
                 rcs(eq_vas0, 3) + OpNr + rcs(pain0, 3),
                 data=train, x=TRUE, y=TRUE)
sapply(3:9, getInfCrit,
       fit = vas_model, rm_var = "rcs(Age, 3)",
       add_var_str = "rcs(Age, %d)", ic_fn = BIC)
# rcs(Age, 3) rcs(Age, 4) rcs(Age, 5) rcs(Age, 6) rcs(Age, 7) rcs(Age, 8) rcs(Age, 9)
#    166725.7    166737.8    166726.0    166733.8    166743.5    166752.4    166754.0
fits$eq_vas[["RCS with BIC"]] <- vas_model

2.2 B-splines

B-splines, i.e. basis splines, are a common alternative to the restricted cubic spline and likewise use knots to control flexibility. Because they are not restricted at the ends, their tails are more flexible than those of the restricted cubic spline. For comparability, the same model is used for the other variables:

# Use the same setting model as used in the RCS
vars <- attr(terms(fits$eq5d[["RCS with AIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
idx_model <- update(fits$eq5d[["RCS with AIC"]],
                    sprintf(".~.-%s+bs(Age, 3)", rm_var))
sapply(3:9, getInfCrit,
       fit = idx_model, rm_var = "bs(Age, 3)",
       add_var_str = "bs(Age, %d)", ic_fn = AIC)
# bs(Age, 3) bs(Age, 4) bs(Age, 5) bs(Age, 6) bs(Age, 7) bs(Age, 8) bs(Age, 9)
#  -33669.07  -33683.35  -33680.55  -33683.44  -33683.03  -33681.79  -33685.55
# Chose 6 knots for illustration as it otherwise be the same as the BIC model - not that interesting
fits$eq5d[["BS with AIC"]] <- update(idx_model, .~.-bs(Age, 3)+bs(Age, 6))

vars <- attr(terms(fits$eq5d[["RCS with BIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
idx_model <- update(fits$eq5d[["RCS with BIC"]],
                    sprintf(".~.-%s+bs(Age, 3)", rm_var))
sapply(3:9, getInfCrit,
       fit = idx_model, rm_var = "bs(Age, 3)",
       add_var_str = "bs(Age, %d)", ic_fn = BIC)
# bs(Age, 3) bs(Age, 4) bs(Age, 5) bs(Age, 6) bs(Age, 7) bs(Age, 8) bs(Age, 9)
#  -33468.29  -33475.09  -33464.40  -33459.42  -33451.12  -33442.38  -33438.71
fits$eq5d[["BS with BIC"]] <- update(idx_model, .~.-bs(Age, 3)+bs(Age, 4))

vars <- attr(terms(fits$eq_vas[["RCS with AIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
vas_model <- update(fits$eq_vas[["RCS with AIC"]],
                    sprintf(".~.-%s+bs(Age, 3)", rm_var))
sapply(3:9, getInfCrit,
       fit = vas_model, rm_var = "bs(Age, 3)",
       add_var_str = "bs(Age, %d)", ic_fn = AIC)
# bs(Age, 3) bs(Age, 4) bs(Age, 5) bs(Age, 6) bs(Age, 7) bs(Age, 8) bs(Age, 9)
#   166640.3   166617.2   166621.1   166612.9   166613.2   166614.8   166615.0
fits$eq_vas[["BS with AIC"]] <- update(vas_model, .~.-bs(Age, 3)+bs(Age, 6))

vars <- attr(terms(fits$eq_vas[["RCS with BIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
vas_model <- update(fits$eq_vas[["RCS with BIC"]],
                    sprintf(".~.-%s+bs(Age, 3)", rm_var))
sapply(3:9, getInfCrit,
       fit = vas_model, rm_var = "bs(Age, 3)",
       add_var_str = "bs(Age, %d)", ic_fn = BIC)
# bs(Age, 3) bs(Age, 4) bs(Age, 5) bs(Age, 6) bs(Age, 7) bs(Age, 8) bs(Age, 9)
#   166750.4   166735.1   166746.9   166746.5   166754.7   166764.2   166772.2
fits$eq_vas[["BS with BIC"]] <- update(vas_model, .~.-bs(Age, 3)+bs(Age, 4))

2.3 Polynomials

Polynomials can be of different degrees and are often regarded as a simple way to fit more flexible non-linear relationships. Since most patients lie close to the mean age of about 69.1 years, it is worth keeping in mind that these patients have the greatest influence on the shape of the curve; the tails are therefore likely to be less reliable than the central part.

# Use the same setting model as used in the RCS
vars <- attr(terms(fits$eq5d[["RCS with AIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
idx_model <- update(fits$eq5d[["RCS with AIC"]],
                    sprintf(".~.-%s+poly(Age, 2)", rm_var))
sapply(2:9, getInfCrit,
       fit = idx_model, rm_var = "poly(Age, 2)",
       add_var_str = "poly(Age, %d)", ic_fn = AIC)
# poly(Age, 2) poly(Age, 3) poly(Age, 4) poly(Age, 5) poly(Age, 6) poly(Age, 7) poly(Age, 8) poly(Age, 9)
#    -33669.58    -33669.07    -33680.10    -33678.83    -33681.82    -33681.89    -33680.35    -33680.17
fits$eq5d[["Polynomial with AIC"]] <- update(idx_model, .~.-poly(Age, 2)+poly(Age, 6))

vars <- attr(terms(fits$eq5d[["RCS with BIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
idx_model <- update(fits$eq5d[["RCS with BIC"]],
                    sprintf(".~.-%s+poly(Age, 2)", rm_var))
sapply(2:9, getInfCrit,
       fit = idx_model, rm_var = "poly(Age, 2)",
       add_var_str = "poly(Age, %d)", ic_fn = BIC)
# poly(Age, 2) poly(Age, 3) poly(Age, 4) poly(Age, 5) poly(Age, 6) poly(Age, 7) poly(Age, 8) poly(Age, 9)
#    -33476.79    -33468.29    -33471.83    -33462.59    -33457.76    -33450.37    -33440.97    -33432.99
fits$eq5d[["Polynomial with BIC"]] <- idx_model

vars <- attr(terms(fits$eq_vas[["RCS with AIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
vas_model <- update(fits$eq_vas[["RCS with AIC"]],
                    sprintf(".~.-%s+poly(Age, 2)", rm_var))
sapply(2:9, getInfCrit,
       fit = vas_model, rm_var = "poly(Age, 2)",
       add_var_str = "poly(Age, %d)", ic_fn = AIC)
# poly(Age, 2) poly(Age, 3) poly(Age, 4) poly(Age, 5) poly(Age, 6) poly(Age, 7) poly(Age, 8) poly(Age, 9)
#     166638.4     166640.3     166622.3     166623.9     166615.7     166616.8     166617.2     166617.5
fits$eq_vas[["Polynomial with AIC"]] <- update(vas_model, .~.-poly(Age, 2)+poly(Age, 6))

vars <- attr(terms(fits$eq_vas[["RCS with BIC"]]), "term.labels")
rm_var <- vars[grep("Age", vars, fixed = TRUE)]
vas_model <- update(fits$eq_vas[["RCS with BIC"]],
                    sprintf(".~.-%s+poly(Age, 2)", rm_var))
sapply(2:9, getInfCrit,
       fit = vas_model, rm_var = "poly(Age, 2)",
       add_var_str = "poly(Age, %d)", ic_fn = BIC)
# poly(Age, 2) poly(Age, 3) poly(Age, 4) poly(Age, 5) poly(Age, 6) poly(Age, 7) poly(Age, 8) poly(Age, 9)
#     166740.6     166750.4     166740.2     166749.7     166749.3     166758.3     166766.6     166774.7
fits$eq_vas[["Polynomial with BIC"]] <- update(vas_model, .~.-poly(Age, 2)+poly(Age, 4))

2.4 Multiple fractional polynomial models

Multiple fractional polynomial (MFP) models have been proposed as an alternative to splines. They use a defined set of power transformations of the variable and exclude predictors based on p-values. The mfp package has a built-in selection algorithm, so the MFP does not need to rely on BIC or AIC.

library(mfp)
# We will use the simpler BIC model and splines::ns() as a stand-in for rcs(),
# which is not available within mfp
fits$eq5d[["MFP"]] <- mfp(eq5d1 ~ fp(Age) + Sex * Charnley_class +
                          ns(eq5d0, 3) + ns(pain0, 3), data=train)
fits$eq_vas[["MFP"]] <- mfp(eq_vas1 ~ fp(Age) + Sex * Charnley_class +
                            ns(eq_vas0, 3) + ns(pain0, 3), data=train)

2.5 Generalized additive models

Generalized additive models (GAMs) are an extension of generalized linear models designed for studying non-linear relationships. The Elements of Statistical Learning is an excellent source for digging deeper into them. The simplest way to think of GAM smoothing is as a penalty on flexibility in order to avoid overfitting. There are many choices of smoother; the following are used here:

Thin plate regression splines: usually the most common smooth type in GAM models. The name refers to the physical analogy of bending a thin sheet of metal.

Cubic regression splines: the spline basis consists of cubics with evenly spaced knots.

P-splines: P-splines are similar to B-splines in that they share the same basis; the main difference is that P-splines penalize the coefficients.

library(mgcv)
fits$eq5d[["GAM thin plate"]] <- gam(eq5d1 ~ s(Age, bs="tp") + Sex * Charnley_class + ns(eq5d0, 3) + ns(pain0, 3),data=train)fits$eq_vas[["GAM thin plate"]] <- gam(eq_vas1 ~ s(Age, bs="tp") + Sex * Charnley_class + ns(eq_vas0, 3) + ns(pain0, 3), data=train)
fits$eq5d[["GAM cubic regression"]] <-update(fits$eq5d[["GAM thin plate"]], .~.-s(Age, bs="tp")+s(Age, bs="cr"))fits$eq_vas[["GAM cubic regression"]] <- update(fits$eq_vas[["GAM thin plate"]], .~.-s(Age, bs="tp")+s(Age, bs="cr"))
fits$eq5d[["GAM P-splines"]] <- update(fits$eq5d[["GAM thin plate"]], .~.-s(Age, bs="tp")+s(Age, bs="ps"))fits$eq_vas[["GAM P-splines"]] <- update(fits$eq_vas[["GAM thin plate"]], .~.-s(Age, bs="tp")+s(Age, bs="ps"))

3. Benchmark

With all these fancy models in place, we will first try to evaluate how they perform under cross-validation, and then evaluate them on the test set that was set aside at the very start. Performance is assessed using the root-mean-square error (RMSE) and the mean absolute percentage error (MAPE). RMSE tends to punish outliers, while MAPE is more descriptive of the average performance. Our test functions are therefore:

rmse <- function(fit, newdata, outcome_var){
  pred <- predict(fit, newdata=newdata)
  sqrt(mean((newdata[, outcome_var]-pred)^2, na.rm=TRUE))
}

mape <- function(fit, newdata, outcome_var){
  pred <- predict(fit, newdata=newdata)
  mean(abs(newdata[, outcome_var]-pred)/mean(newdata[, outcome_var], na.rm=TRUE)*100, na.rm=TRUE)
}

In addition, in this particular post I want to look at what happens in the tails, since almost 70% of the patients are between 60 and 80 years old while the variable spans 40 to 100 years. I therefore define a central part versus a peripheral part, where the central part is defined as lying between the 15th and 85th percentiles.
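A minimal sketch of that split, applied to the train set created earlier (the same logic reappears inside the cross-validation function below):

# Central = Age between the 15th and 85th percentiles, peripheral = the tails
central <- with(train, Age >= quantile(Age, probs = .15) &
                       Age <= quantile(Age, probs = .85))
prop.table(table(central))  # by construction roughly 70% of the patients are central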

Cross-validation is a straightforward implementation. It can take a while, so parallelization is worth considering; the parallel package is used here, although foreach is also an excellent alternative.

crossValidateInParallel <- function(fit, df, outcome_var, k=10, tail_prop=.15){
  df$central <- with(df, Age >= quantile(Age, probs=tail_prop) &
                         Age <= quantile(Age, probs=1-tail_prop))
  df$cv <- rep(1:k, length.out=nrow(df))

  cv_internal_fn <- function(x, cv_fit, df, outcome_var){
    lc_train <- subset(df, cv != x)
    lc_test <- subset(df, cv == x)
    # In the original version I had a BIC/AIC optimizing function that worked with the cross-validation
    # Here, we'll keep it simple and just use the previously chosen functions
    cv_fit <- update(fit, data=lc_train)
    return(c(Main_RMSE = rmse(cv_fit, newdata=lc_test, outcome_var=outcome_var),
             Central_RMSE = rmse(cv_fit, newdata=subset(lc_test, central == TRUE), outcome_var=outcome_var),
             Peripheral_RMSE = rmse(cv_fit, newdata=subset(lc_test, central == FALSE), outcome_var=outcome_var),
             Main_MAPE = mape(cv_fit, newdata=lc_test, outcome_var=outcome_var),
             Central_MAPE = mape(cv_fit, newdata=subset(lc_test, central == TRUE), outcome_var=outcome_var),
             Peripheral_MAPE = mape(cv_fit, newdata=subset(lc_test, central == FALSE), outcome_var=outcome_var)))
  }

  # It is convenient to use the detectCores() function in order to use the machine's full capacity.
  # Subtracting 1 is also preferable as it prohibits the computer from stopping other tasks,
  # i.e. you can keep surfing the web 🙂
  cl <- makeCluster(mc <- getOption("cl.cores",
                                    ifelse(detectCores() <= 1, 1, detectCores() - 1)))

  # In Windows each core starts out fresh and we therefore need to export the functions,
  # data etc so that they can access it as expected or you will get nasty errors
  clusterEvalQ(cl, library(mfp))
  clusterEvalQ(cl, library(mgcv))
  clusterEvalQ(cl, library(rms))
  clusterEvalQ(cl, library(splines))
  clusterEvalQ(cl, library(stringr))
  clusterExport(cl, c("rmse", "mape"))

  res <- parallel::parLapply(1:k,
                             cl=cl,
                             fun=cv_internal_fn,
                             outcome_var=outcome_var,
                             cv_fit = fit,
                             df=df)
  stopCluster(cl)

  res <- as.data.frame(Gmisc::mergeLists(lapplyOutput=res))
  ret <- colMeans(res)
  attr(ret, "raw") <- res
  return(ret)
}

Running all the models gives the following results:

Although the differences are almost negligible, the polynomials seem to perform slightly worse than the other methods. This is most evident in the test set, where the polynomials have some issues in the tails.

cv_results <- list(eq5d = NULL, eq_vas = NULL)
for (group in names(fits)){
  for (method in names(fits[[group]])){
    cv_results[[group]] <- rbind(cv_results[[group]],
                                 crossValidateInParallel(fits[[group]][[method]],
                                                         df=train,
                                                         outcome_var = sprintf("%s1", group)))
  }
  rownames(cv_results[[group]]) <- names(fits[[group]])
}

out <- rbind(t(apply(cv_results$eq5d, 1, function(x)
               c(sprintf("%.3f", x[1:3]), sprintf("%.2f", x[4:6])))),
             t(apply(cv_results$eq_vas, 1, function(x)
               c(sprintf("%.2f", x[1:3]), sprintf("%.1f", x[4:6])))))
cgroup <- unique(gsub("^.+_", "", colnames(cv_results$eq5d)))
n.cgroup <- as.numeric(table(gsub("^.+_", "", colnames(cv_results$eq5d))))
colnames(out) <- gsub("_.+$", "", colnames(cv_results$eq5d))
rownames(out) <- capitalize(gsub("(RCS|BS|GAM|Polynomial) (with |)", "",
                                 c(names(fits$eq5d), names(fits$eq_vas))))
htmlTable(out,
          rgroupCSSstyle = "",
          rowlabel = "Method",
          cgroup = cgroup,
          n.cgroup = n.cgroup,
          rgroup = rep(c("Restricted cubic splines", "B-splines", "Polynomials", "", "Generalized additive models"), 2),
          n.rgroup = rep(c(2, 2, 2, 1, 3), 2),
          tspanner = c("The EQ-5D index", "The EQ VAS"),
          n.tspanner = sapply(cv_results, nrow),
          ctable=TRUE)

test$central <- with(test, Age >= quantile(Age, probs=.15) &
                           Age <= quantile(Age, probs=.85))

testset_results <- list(eq5d = NULL, eq_vas = NULL)
for (group in names(fits)){
  outcome_var <- sprintf("%s1", group)
  for (method in names(fits[[group]])){
    testset_results[[group]] <-
      rbind(testset_results[[group]],
            c(Main_RMSE = rmse(fits[[group]][[method]], newdata=test, outcome_var=outcome_var),
              Central_RMSE = rmse(fits[[group]][[method]], newdata=subset(test, central == TRUE), outcome_var=outcome_var),
              Peripheral_RMSE = rmse(fits[[group]][[method]], newdata=subset(test, central == FALSE), outcome_var=outcome_var),
              Main_MAPE = mape(fits[[group]][[method]], newdata=test, outcome_var=outcome_var),
              Central_MAPE = mape(fits[[group]][[method]], newdata=subset(test, central == TRUE), outcome_var=outcome_var),
              Peripheral_MAPE = mape(fits[[group]][[method]], newdata=subset(test, central == FALSE), outcome_var=outcome_var)))
  }
  rownames(testset_results[[group]]) <- names(fits[[group]])
}

out <- rbind(t(apply(testset_results$eq5d, 1, function(x)
               c(sprintf("%.3f", x[1:3]), sprintf("%.2f", x[4:6])))),
             t(apply(testset_results$eq_vas, 1, function(x)
               c(sprintf("%.2f", x[1:3]), sprintf("%.1f", x[4:6])))))
cgroup <- unique(gsub("^.+_", "", colnames(testset_results$eq5d)))
n.cgroup <- as.numeric(table(gsub("^.+_", "", colnames(testset_results$eq5d))))
colnames(out) <- gsub("_.+$", "", colnames(testset_results$eq5d))
rownames(out) <- capitalize(gsub("(RCS|BS|GAM|Polynomial) (with |)", "",
                                 c(names(fits$eq5d), names(fits$eq_vas))))
htmlTable(out,
          rgroupCSSstyle = "",
          rowlabel = "Method",
          cgroup = cgroup,
          n.cgroup = n.cgroup,
          rgroup = rep(c("Restricted cubic splines", "B-splines", "Polynomials", "", "Generalized additive models"), 2),
          n.rgroup = rep(c(2, 2, 2, 1, 3), 2),
          tspanner = c("The EQ-5D index", "The EQ VAS"),
          n.tspanner = sapply(testset_results, nrow),
          ctable=TRUE)

4. Graphs

Now, let's look at what the identified relationships actually look like. For quick styling of all the graphs, a common setup is used:

# The ggplot look and feel
my_theme <- theme_bw() +
  theme(legend.position = "bottom",
        legend.text = element_text(size=7),
        axis.title.x = element_text(size=9),
        axis.title.y = element_text(size=9),
        axis.text.x = element_text(size=8),
        axis.text.y = element_text(size=8))

getPredDf <- function(ds, eq_type, eq_values, Age){
  new_data <- data.frame(Age = Age)
  new_data$Sex = "Male"
  new_data[[sprintf("%s0", eq_type)]] = eq_values
  new_data$Charnley_class = "A"
  new_data$pain0 = median(train$pain0, na.rm=TRUE)
  new_data$OpNr = 1
  ret <- list("Best" = new_data)

  new_data$Sex = "Female"
  new_data$Charnley_class = "C"
  new_data$OpNr = 2
  ret[["Worst"]] <- new_data
  return(ret)
}

getAgePredDf <- function(ds, eq_type){
  mode_for_preop_eq <- as.numeric(names(which.max(table(ds[[sprintf("%s0", eq_type)]]))))
  Age <- seq(from=40, to=90, by=.1)
  return(getPredDf(ds, eq_type,
                   eq_values = rep(mode_for_preop_eq, length.out=length(Age)),
                   Age=Age))
}

getRmsPred <- function(nd, model, eq_type){
  new_data <- nd[["Best"]]
  best_age_spl <- predict(model, newdata=new_data, conf.type="mean", conf.int=.95)
  new_data <- nd[["Worst"]]
  worst_age_spl <- predict(model, newdata=new_data, conf.type="mean", conf.int=.95)

  getDf <- function(pred, newdata, eq_type){
    df <- data.frame(Age = newdata$Age, Lower = pred$lower, Upper = pred$upper)
    df[[sprintf("%s1", eq_type)]] = pred$linear.predictors
    df[[sprintf("%s0", eq_type)]] = newdata[[sprintf("%s0", eq_type)]]
    df$Pain = newdata$pain0
    df$OpNr = newdata$OpNr
    return(df)
  }

  df <- getDf(best_age_spl, new_data, eq_type)
  tmp <- getDf(worst_age_spl, new_data, eq_type)
  df$Cat = 1
  tmp$Cat = 2
  df <- rbind(df, tmp)
  rm(tmp)
  df$Cat <- factor(as.numeric(df$Cat),
                   labels=c(" Sex: Male \n Charnley class: A \n First THR ",
                            " Sex: Female \n Charnley class: C \n Previous contralateral THR "))
  return(df)
}

getGlmPred <- function(nd, model, eq_type){
  new_data <- nd[["Best"]]
  best_age_spl <- predict(model, newdata=new_data, se.fit = TRUE)
  new_data <- nd[["Worst"]]
  worst_age_spl <- predict(model, newdata=new_data, se.fit = TRUE)

  getDf <- function(pred, newdata, eq_type){
    df <- data.frame(Age = newdata$Age,
                     Lower = pred$fit - qnorm(0.975)*pred$se.fit,
                     Upper = pred$fit + qnorm(0.975)*pred$se.fit)
    df[[sprintf("%s1", eq_type)]] = pred$fit
    df[[sprintf("%s0", eq_type)]] = newdata[[sprintf("%s0", eq_type)]]
    df$Pain = newdata$pain0
    df$OpNr = newdata$OpNr
    return(df)
  }

  df <- getDf(best_age_spl, new_data, eq_type)
  tmp <- getDf(worst_age_spl, new_data, eq_type)
  df$Cat = 1
  tmp$Cat = 2
  df <- rbind(df, tmp)
  rm(tmp)
  df$Cat <- factor(as.numeric(df$Cat),
                   labels=c(" Sex: Male \n Charnley class: A \n First THR ",
                            " Sex: Female \n Charnley class: C \n Previous contralateral THR "))
  return(df)
}

g_legend <- function(a.gplot){
  tmp <- ggplot_gtable(ggplot_build(a.gplot))
  leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
  legend <- tmp$grobs[[leg]]
  return(legend)
}

plot2Lines <- function(eq5d_df, vas_df){
  clrs <- brewer.pal(3, "Pastel1")
  clrs <- GISTools::add.alpha(clrs, .5)
  clrs <- clrs[c(2, 1, 3)]

  gg_eq5d <- ggplot(data=eq5d_df, aes(x=Age, y=eq5d1, fill=Cat, linetype=Cat)) +
    geom_hline(yintercept=eq5d_df$eq5d0[1], lwd=1, lty=2, colour="#00640055") +
    geom_ribbon(aes(ymin=Lower, ymax=Upper)) +
    geom_line() +
    ylab("EQ-5D index one year after surgery") +
    xlab("Age (years)") +
    scale_x_continuous(expand=c(0,0)) +
    scale_linetype_manual(name = "", values=c(1,2)) +
    scale_fill_manual(name = "", values = clrs[1:2], guide="none") +
    my_theme

  gg_eqvas <- ggplot(data=vas_df, aes(x=Age, y=eq_vas1, fill=Cat, linetype=Cat)) +
    geom_hline(yintercept=vas_df$eq_vas0[1], lwd=1, lty=2, colour="#00640055") +
    geom_ribbon(aes(ymin=Lower, ymax=Upper)) +
    geom_line() +
    theme_bw() +
    ylab("EQ VAS one year after surgery") +
    xlab("Age (years)") +
    scale_x_continuous(expand=c(0,0)) +
    scale_linetype_manual(name = "", values=c(1,2)) +
    scale_fill_manual(name = "", values = clrs[1:2]) +
    theme(legend.position = "bottom") +
    my_theme

  mylegend <- g_legend(gg_eqvas)
  grid.arrange(arrangeGrob(gg_eq5d + theme(legend.position="none"),
                           gg_eqvas + theme(legend.position="none"),
                           nrow=1),
               mylegend, nrow=2, heights=c(10, 2))
}

It is important to remember that different algorithms will find different optima and may therefore look different to the eye, even if they fit the data equally well. I think of it as a kind of skeleton that defines how the fit is allowed to move: it will try to adapt, but only within its range of motion. For example, we can bend at the elbow but not along the forearm (unless you are in need of my surgical skills).

4.1 Restricted cubic splines

AIC

BIC

4.2 B-splines

4.3 Polynomials

Note that the y-scale differs for the polynomials

4.4 MFP - multiple fractional polynomial

4.5 Generalized additive models

4.5.1 Thin plate regression splines
4.5.2 Cubic regression splines
4.5.3 P-splines

5. RCS code example

# options("repos" = c(CRAN="http://mirrors.cloud.tencent.com/CRAN/"))# options("repos" = c(CRAN="https://mirrors.aliyun.com/CRAN/"))
# 0.R包的安装和载入----list.of.packages <- c("dplyr","survival","broom","ggplot2", "rms","splines","Hmisc")new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]if(length(new.packages)) install.packages(new.packages)suppressPackageStartupMessages(lapply(list.of.packages, require, character.only = TRUE))
# Knots个数选择#' A simple function for updating the formula and extracting the information criteria#' #' @param no A number that is used together with the add_var_str#' @param fit A regression fit that is used for the update#' @param rm_var The variable that is to be substituted#' @param add_var_str A sprintf() string that accepts the no parameter for each update#' @param ic_fn The information criteria function (AIC/BIC)getInfCrit <- function(no, fit, rm_var, add_var_str, ic_fn) { new_var <- sprintf(add_var_str, no) updt_frml <- as.formula(sprintf(".~.-%s+%s", rm_var, new_var)) ret <- ic_fn(update(fit, updt_frml)) names(ret) <- new_var return(ret)}
# 测试数据data(pbc)d <- pbcrm(pbc)d$died <- ifelse(d$status == 2, 1, 0)d$status <- ifelse(d$status != 0, 1, 0)head(d)
ddist <- datadist(d)options(datadist='ddist')
# test----# We start with a model where the other splines have been AIC-selectedfit <- lrm(status ~ rcs(age, 4)+sex, data=d, x=TRUE, y=TRUE)sapply(3:9, FUN=getInfCrit, fit = fit, rm_var = "rcs(age, 4)", add_var_str = "rcs(age, %d)", ic_fn=AIC)

# 1. RCS in Logistic regression ----
fit <- lrm(status ~ rcs(age, 4), data=d)
(an <- anova(fit))
plot(Predict(fit), anova=an, pval=TRUE)
plot(Predict(fit, fun=exp), anova=an, pval=TRUE, ylab="Odds ratio")
names(fit)
str(fit$linear.predictors)

# 2. RCS in Cox regression ----
# https://rdrr.io/cran/Greg/man/plotHR.html
fit2 <- cph(Surv(time, status) ~ rcs(age, 4) + sex, x=T, y=T, surv=T, data=d)
(an2 <- anova(fit2))
(Pnonlinear = paste0(expression("P for non-linear relation: ="), round(an2[8], digits = 3)))
plot(Predict(fit2), anova=an2, pval=TRUE)
plot(Predict(fit2, fun=exp), anova=an2, pval=TRUE, ylab="Hazard ratio")

dataplot <- Predict(fit2, age, ref.zero = TRUE, fun=exp)
ggplot(dataplot, aes(age, yhat)) +
  # geom_line(colour="Black", linetype="dashed", size=1.5) +
  theme(plot.subtitle = element_text(vjust = 1),
        plot.caption = element_text(vjust = 1),
        axis.line = element_line(size = 0.5, linetype = "solid"),
        panel.grid.major = element_line(colour = "gray98"),
        axis.title = element_text(size = 15),
        panel.background = element_rect(fill = "gray99", colour = "white", linetype = "twodash"),
        plot.background = element_rect(fill = "white")) +
  scale_x_continuous(breaks = seq(30, 80, by=5), expression(paste("Age", " ", ))) +
  scale_y_continuous(limits = c(-0.1, 3), breaks = seq(0, 3, by=0.5), "Hazard ratio (95%CI)") +
  labs(caption = NULL) +
  annotate("text", x=40, y=3, parse = TRUE, label="label", size=5) +
  geom_hline(yintercept = 1, linetype="dashed")

# 3. Linear Model Estimation Using Ordinary Least Squares ----
set.seed(1)
x1 <- runif(200)
x2 <- sample(0:3, 200, TRUE)
distance <- (x1 + x2/3 + rnorm(200))^2
d <- datadist(x1, x2)
options(datadist="d")  # No d -> no summary, plot without giving all details
head(d)

f <- ols(sqrt(distance) ~ rcs(x1, 4) + scored(x2), x=TRUE)
an = anova(f)
plot(Predict(f), anova=an, pval=TRUE)
which.influence(f)
summary(f)
summary.lm(f)  # will only work if penalty and penalty.matrix not used

1. Original article and code source:

(0)
