微生信生物 walks you through the 235-page official PLS-PM tutorial (Partial Least Squares Path Modeling)
Partial least squares path modeling (PLS-PM)
So far the only tutorial available online is this one: http://blog.sciencenet.cn/home.php?do=blog&id=940864&mod=space&uid=2379401. It only provides the R example data, and the explanation given there is not enough to build a clear understanding of this analysis or to apply it in practice. The official R tutorial for this analysis runs to 235 pages and covers everything from understanding the model, to building it, to interpreting the result files, to comparing and testing models. Learning it here therefore also needs to be split into several parts:
Part 1: understanding the model and basic usage
Let us now take a closer look at PLS-PM.
The PLS path model
The PLS path model was proposed by Wold and Lohmöller as a follow-up to partial least squares regression; it analyzes linear statistical relationships among several blocks of variables and is an extension of PLS regression. Its main advantages are: (1) it can be estimated on small samples; (2) it makes no specific distributional assumptions about the observed variables or the errors, so the problem of an unidentifiable model does not arise; (3) fairly complex models with many latent and manifest variables can be estimated. A PLS path model consists of two parts: the measurement model (describing the relationships between manifest and latent variables, also called the outer model) and the structural model (describing the relationships among latent variables, also called the inner model). Each block of manifest variables Xj corresponds to a latent variable ξj; Xj and its ξj form the measurement model, while the ξj across blocks form the structural model. A schematic of the PLS path model is shown in Figure 2.
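Written out in the standard formulation, the reflective measurement (outer) model expresses each manifest variable in block j as Xjh = λjh·ξj + εjh, where λjh is the loading, and the structural (inner) model expresses each endogenous latent variable as ξj = Σi βji·ξi + ζj, where the βji are the path coefficients among latent variables. These two equations correspond exactly to the outer and inner models described above.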
The model generally takes one of two forms:
Mode A (reflective): the latent variable is the cause of its manifest variables, and the arrows point toward the manifest variables.
Mode B (formative): the manifest variables are the cause of the latent variable, and the arrows point toward the latent variable. In most applications we estimate the manifest variables from the latent variable, i.e. the reflective mode.
Model evaluation: evaluating and testing a PLS path model first requires two conditions to be met: (1) each latent variable is highly and positively correlated with its own manifest variables, with loadings above 0.4; (2) each block of indicators is unidimensional (the "unidimensionality" check), verified by a principal component check on each block together with the correlation matrix between manifest and latent variables. Both checks can be retrieved with the corresponding functions in R.
Reliability assessment of a reflective measurement model:
Cronbach's alpha: α ≥ 0.7; retrieved with $unidim
Composite reliability: CR ≥ 0.6
Absolute standardized loadings of the indicators: ≥ 0.7; indicators below 0.4 should be dropped
Validity assessment of a reflective measurement model:
Principal component analysis (eigenvalues): only the first eigenvalue should be greater than 1
Average variance extracted (AVE): convergent validity, > 0.5; obtained with $inner_summary
Cross-loadings: the standardized loading of each manifest variable on its own latent variable must be larger than its cross-loadings on the other latent variables; retrieved with $crossloadings, checking whether every block-diagonal value is larger than the other values in its row. The loading measures how well a manifest variable reflects its own latent variable, the cross-loadings how well it reflects the other latent variables; if a cross-loading exceeds the own-block loading, something is wrong.
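As a quick reference, here is a minimal sketch of where these checks live in the plspm output; the object name pls_fit is only a placeholder for a fitted model (the tutorial fits its own objects below):
# a minimal sketch, assuming a fitted plspm object named 'pls_fit'
pls_fit$unidim          # Cronbach's alpha, Dillon-Goldstein's rho, first/second eigenvalues per block
pls_fit$outer_model     # loadings and communalities of the manifest variables
pls_fit$crossloadings   # own-block loadings vs. cross-loadings
pls_fit$inner_summary   # block communality and AVE (plus R2 for endogenous blocks)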
Evaluation of the structural model:
Coefficients of determination (R2) of the endogenous latent variables: ≥ 0.60 is good, around 0.30 is moderate, < 0.30 is poor. In R this part of the output is retrieved with $inner_summary; use $inner_summary[, "R2", drop = FALSE] to extract the R2 values on their own.
For each regression in the structural model we have an R2 that is interpreted in the same way as in any multiple regression analysis: it indicates the amount of variance in the endogenous latent variable explained by its independent latent variables. The inner model seems to be fine, although we must keep in mind that this is a very simple model. We have an R2 = 0.85, which by PLS-PM standards can be considered an outstanding R2. In fact, R-squared values can be classified into three categories (please don't take them as absolute truth):
Low: R2 < 0.30 (although some authors consider R2 < 0.20)
Moderate: 0.30 < R2 < 0.60 (alternatively 0.20 < R2 < 0.50)
High: R2 > 0.60 (alternatively R2 > 0.50)
Path effect size f2: ≥ 0.35 is large, around 0.15 is moderate, < 0.02 is negligible
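plspm does not report f2 directly, but it can be computed by hand from two fits of the structural model, one including and one excluding the predictor block of interest. The object names pls_full and pls_reduced below are illustrative only, and "Satisfaction" stands for whichever endogenous latent variable is being examined:
# minimal f2 sketch (hypothetical objects): pls_full includes the predictor path,
# pls_reduced is the same model with that path removed
r2_full    = pls_full$inner_summary["Satisfaction", "R2"]
r2_reduced = pls_reduced$inner_summary["Satisfaction", "R2"]
f2 = (r2_full - r2_reduced) / (1 - r2_full)
f2   # >= 0.35 large, around 0.15 moderate, < 0.02 negligible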
Hands-on in R:
# install.packages("plspm")   # if plspm is not on CRAN, try devtools::install_github("gastonstat/plspm")
# install.packages("plsdepot")
library(plspm)
library(plsdepot)
Import the data and run some basic descriptive statistics
education = read.table("./sample-data/education.txt", header = TRUE,
row.names = 1)
dim(education)
summary(education[, 1:20])
# This dataset is a survey questionnaire; the last three columns record the respondents' gender, income, and occupation, and the answers are coded on a seven-point agreement scale.
# A simple tabulation of the value distribution of the first variable:
aux_distrib = table(education[, 1])/nrow(education)
barplot(aux_distrib, border = NA, main = colnames(education)[1])
library(RColorBrewer)
# bar charts of the first four indicators (the Support block)
# questions of Support indicators
sq1 = "Help when not doing well"
sq2 = "I feel underappreciated"
sq3 = "I can find a place where I feel safe"
sq4 = "Concerns about school"
# put questions in one vector
sup_questions = c(sq1, sq2, sq3, sq4)
# setting graphical parameters
op = par(mfrow = c(2,2), mar = c(2.5, 3.2, 2, 0.8))
# bar-chart for each indicator of Support
for (j in 1:4) {
distribution = table(education[,j]) / nrow(education)
barplot(distribution, border = NA, col = brewer.pal(8, "Blues")[2:8],
axes = FALSE, main = sup_questions[j], cex.main = 1)
# add vertical axis, and rectangle around figure
axis(side = 2, las=2)
box("figure", col="gray70")
}
# reset default graphical parameters
par(op)
## check the correlations among the first four indicators
cor(education[, 1:4])
# run a PCA (NIPALS) to inspect the weights of the first four indicators
library(plsdepot)
# PCA of Support indicators with nipals
support_pca = nipals(education[,1:4])
# plot
plot(support_pca, main = "Support indicators (circle of correlations)",
cex.main = 1)
Starting the PLS-PM path analysis
First we specify the hypothesized paths. Note that since we are looking for causal relationships, the relations among latent variables must be unidirectional, which means the path matrix can only be a lower triangular matrix: the diagonal corresponds to each latent variable itself, and the hypothesized paths are set in the lower triangle. Every path model contains two types of latent variables (blocks definition):
Exogenous: exogenous latent variables, used purely as explanatory variables to explain the endogenous ones.
Endogenous: endogenous latent variables, which can act as explanatory variables for other endogenous variables and can also, as outcomes, be explained by other exogenous or endogenous latent variables.
# set up the path analysis
# rows of path matrix
Support = c(0, 0, 0, 0, 0, 0)
Advising = c(0, 0, 0, 0, 0, 0)
Tutoring = c(0, 0, 0, 0, 0, 0)
Value = c(1, 1, 1, 0, 0, 0)
Satisfaction = c(1, 1, 1, 1, 0, 0)
Loyalty = c(0, 0, 0, 0, 1, 0)
# matrix (by row binding)
edu_path = rbind(Support, Advising, Tutoring, Value, Satisfaction, Loyalty)
colnames(edu_path) = rownames(edu_path)
# plot the inner matrix
innerplot(edu_path, box.size = 0.1)
Define the blocks of manifest variables for each latent variable, and the measurement mode
# outer model
edu_blocks = list(1:4, 5:8, 9:12, 13:16, 17:19, 20:23)
# modes (reflective blocks)
edu_modes = rep("A", 6)
Run the model
# apply plspm
edu_pls1 = plspm(education, edu_path, edu_blocks, modes = edu_modes)
# print edu_pls1
edu_pls1
View the full set of model results
# summary() displays the complete PLS-PM output
summary(edu_pls1)
The manifest variables within each block should be consistent (unidimensional), but here there is an obvious inconsistency.
# check unidimensionality: values below 0.7 point to a problem within that block
edu_pls1$unidim
# plot the loadings to inspect
plot(edu_pls1, what = "loadings")
To make the manifest variables within each latent variable consistent, we modify two variables and rerun the analysis.
# to keep within-block consistency we reverse these indicators, so their meaning is flipped
# adding Support 'appreciated'
education$sup.appre = 8 - education$sup.under
# adding 'Loyalty' pleased
education$loy.pleas = 8 - education$loy.asha
# after fixing this, re-specify the manifest variables for each latent variable
edu_blocks2 = list(c(1, 27, 3, 4), 5:8, 9:12, 13:16, 17:19, c(20, 21, 28,
23))
# apply plspm
edu_pls2 = plspm(education, edu_path, edu_blocks2, modes = edu_modes)
# now every block is internally consistent
plot(edu_pls2, what = "loadings")
# check unidimensionality again
edu_pls2$unidim
# outer model: loadings above 0.7 are acceptable; communality above 0.49 is acceptable
# (communality is the squared loading, i.e. the share of a manifest variable's variance
#  explained by its latent variable; above 0.49 means more than about 50% is explained)
edu_pls2$outer_model
# plot the loading column with ggplot2; loadings above 0.7 mean the manifest variable is well reflected by its latent variable
library(ggplot2)
# barchart of loadings
ggplot(data = edu_pls2$outer_model,
aes(x = name, y = loading, fill = block)) +
geom_bar(stat = "identity" , position = "dodge") +
# threshold line (to peek acceptable loadings above 0.7)
geom_hline(yintercept = 0.7, color = "gray50" ) +
# add title
ggtitle("Barchart of Loadings") +
# rotate x-axis names
theme(axis.text.x = element_text(angle = 90))
# check the cross-loadings: the block-diagonal values should be larger than the other values in the same row
edu_pls2$crossloadings
# path coefficients
edu_pls2$path_coefs
# loadings of the manifest variables on their latent variables
edu_pls2$outer_model
# R-squared values
aa= summary(edu_pls2)
aa$inner_summary
# inner model regressions with significance tests
edu_pls2$inner_model
# goodness-of-fit (GoF) index
edu_pls2$gof
# plot the inner path diagram
innerplot(edu_pls2)
# extract the direct, indirect, and total effects
edu_pls2$effects
Showing the contributions among latent variables
# plot, with arrow width scaled by the size of the path coefficients
plot(edu_pls2, arr.pos = 0.35)
Paths = edu_pls2$path_coefs
arrow_lwd = 10 * round(Paths, 2)
plot(edu_pls2, arr.pos = 0.35, arr.lwd = arrow_lwd)
# visualize the effects
# keep only the rows of the effects table that carry nonzero effects
good_rows = c(3:5, 7:15)
# direct and indirect effects (columns 2 and 3) of the selected rows
path_effs = as.matrix(edu_pls2$effects[good_rows, 2:3])
rownames(path_effs) = edu_pls2$effects[good_rows, 1]
# setting margin size
op = par(mar = c(8, 3, 1, 0.5))
# barplots of total effects (direct + indirect)
barplot(t(path_effs), border = NA, col = c("#9E9AC8", "#DADAEB"),
las = 2, cex.names = 0.8, cex.axis = 0.8,
legend = c("Direct", "Indirect"),
args.legend = list(x = "top", ncol = 2, border = NA,
bty = "n", title = "Effects"))
# resetting default margins
par(op)
Bootstrap validation: after running it we obtain standard errors and confidence intervals for the various estimates, including the weights and loadings of the manifest variables on the latent variables ($boot$weights and $boot$loadings), the path coefficients ($boot$paths), the R2 values ($boot$rsq), and the total effects among latent variables ($boot$total.efs). Each estimate comes with its standard error and the lower and upper bounds of the 95% interval. Not many published papers use this yet.
What we obtain in foot_val$boot is a list with results for:
the outer weights (foot_val$boot$weights)
the loadings (foot_val$boot$loadings)
the path coefficients (foot_val$boot$paths)
the R2 (foot_val$boot$rsq)
the total effects (foot_val$boot$total.efs)
Each one of these elements is a matrix with five columns: the original value of the parameter, the bootstrap mean value, the bootstrap standard error, and the lower and upper percentiles of the 95% bootstrap confidence interval.
# bootstrap validation
foot_val = plspm(education, edu_path, edu_blocks2, modes = edu_modes,
boot.val = TRUE, br = 200)
foot_val$boot
Next, analyzing the difference between two models: the prerequisite is that the two models are fitted to different data (groups) while everything else stays identical; only then can we compare them:
###-------- model comparison
library(plspm)
# load data college
data(college)
# what does the data look like
head(college, n = 5)
# path matrix (inner model)
HighSchool = c(0, 0, 0, 0)
Intro = c(1, 0, 0, 0)
Medium = c(1, 1, 0, 0)
Graduation = c(1, 1, 1, 0)
gpa_path = rbind(HighSchool, Intro, Medium, Graduation)
# list of blocks (outer model)
gpa_blocks = list(1:3, 4:7, 8:11, 12)
# vector of reflective modes
gpa_modes = rep("A", 4)
# apply plspm
gpa_pls = plspm(college, gpa_path, gpa_blocks, modes = gpa_modes,
boot.val = TRUE)
summary(gpa_pls)
# plot path coefficients
plot(gpa_pls)
Now fit the model separately for each gender
# select data of female students
female = college[college$Gender == "FEMALE", ]
# female students plspm
female_gpa_pls = plspm(female, gpa_path, gpa_blocks, modes = gpa_modes)
# select data of male students
male = college[college$Gender == "MALE", ]
# male students plspm
male_gpa_pls = plspm(male, gpa_path, gpa_blocks, modes = gpa_modes)
# plot path coefficients
plot(female_gpa_pls, box.size = 0.14)
plot(male_gpa_pls, box.size = 0.14)
We want to compare whether the female and male models are the same; this is where the function plspm.groups comes in.
# apply plspm.groups bootstrap
gpa_boot = plspm.groups(gpa_pls, college$Gender, method = "bootstrap")
# see the results
gpa_boot
# apply plspm.groups permutation
gpa_perm = plspm.groups(gpa_pls, college$Gender, method = "permutation")
# see the results
gpa_perm
# path coefficients between female and male students
barplot(t(as.matrix(gpa_boot$test[,2:3])), border = NA, beside = TRUE,
col = c("#FEB24C","#74A9CF"), las = 2, ylim = c(-0.1, 1),
cex.names = 0.8, col.axis = "gray30", cex.axis = 0.8)
# add horizontal line
abline(h = 0, col = "gray50")
# add title
title("Path coefficients of Female and Male Students",
cex.main = 0.95, col.main = "gray30")
# add legend
legend("top", legend = c("female", "male"), pt.bg = c("#FEB24C", "#A6BDDB"),
ncol = 2, pch = 22, col = c("#FEB24C", "#74A9CF"), bty = "n",
text.col = "gray40")
Once this is done, we see that the same paths take different values in the two gender groups. We have tested this difference, although in this example it is not significant. When applying the approach to our own studies we can annotate the significance, which makes the comparison much clearer.