Stata连享会 主页 || 视频 || 推文 || 知乎 || Bilibili 站
温馨提示: 定期 清理浏览器缓存,可以获得最佳浏览体验。
New!
lianxh
命令发布了:
随时搜索推文、Stata 资源。安装:
. ssc install lianxh
详情参见帮助文件 (有惊喜):
. help lianxh
连享会新命令:cnssc
,ihelp
,rdbalance
,gitee
,installpkg
⛳ Stata 系列推文:
作者: 余坚 (贵州财经大学)
邮箱: yujiangeren@163.com
Source: Xiang Zhou, 2021, Semiparametric estimation for causal mediation analysis with multiple causally ordered mediators. -Link-,-Data-
目录
在有序因果中介分析的半参数估计理论篇中,我们已经对
作者使用的数据来源于美国 1997 年全国青年纵向调查 (National Longitudinal Survey of Youth 1997),数据集中共有 2969 人,在 1997 年他们的年龄为 15 岁至 17 岁,并在 20 岁前完成了高中学业。相关研究表明,大学入学率对美国的政治参与 (选举中是否投票) 有实质性的积极影响,但这种因果关系背后的机制仍不清楚。大学入学率对政治参与的影响可能通过公民和政治兴趣的发展、经济状况的提高、或社会和职业网络等其他途径发挥作用。为了检验这些直接和间接的影响,我们考虑一个因果结构,令
在该数据集中,处理
预处理协变量
我们在上一篇推文中已经介绍过反映因果路径
其中,第一个部分是大学入学率的 NDE,第二个和第三个部分反映了分别通过公民/政治兴趣中介和经济状况中介的处理效应的大小。
首先,我们进行一系列前期工作,包括载入所需的包,设定画图的主题,以及导入数据。
rm(list = ls())
library(haven)
library(survey)
library(gbm)
library(ranger)
library(rlang)
library(glmnet)
library(rsample)
library(caret)
library(tidyverse)
library(mice)
library(SuperLearner)
source("zmisc.R")
set.seed(02138)
# set my ggplot theme
mytheme <- theme_minimal(base_size = 16) +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5),
plot.caption = element_text(color = "grey30"))
theme_set(mytheme)
# load data
nlsy97_ed3 <- readRDS("nlsy97_ed3.RDS")
并且,我们使用多重插补法 (Multiple Imputation) 补齐缺失的协变量。
##########################################################
# Multiple Imputation for All Missing Covariates
##########################################################
# number of imputed datasets
I <- 10
nlsy97_ed3_mi <- nlsy97_ed3 %>%
# remove redundant variables for imputation
select(-id, -weight) %>%
mice(m = I, maxit = 10, defaultMethod = "rf")
nlsy97_ed3_full <- map(1:I, ~ complete(nlsy97_ed3_mi, .x)) %>%
map(., ~ mutate(.x,
id = nlsy97_ed3$id,
weight = nlsy97_ed3$weight,
logearn2007 = log(pmax(total_earnings_adj_2007, 1000)),
logearn2008 = log(pmax(total_earnings_adj_2008, 1000)),
logearn2009 = log(pmax(total_earnings_adj_2009, 1000)),
logearn2010 = log(pmax(total_earnings_adj_2010, 1000)),
))
saveRDS(nlsy97_ed3_full, file = "nlsy97_ed3_full.RDS")
首先,我们载入所需的包并对一些变量进行预处理。
rm(list = ls())
library(survey)
library(gbm)
library(ranger)
library(glmnet)
library(rsample)
library(caret)
library(rlang)
library(tidyverse)
library(Hmisc)
library(SuperLearner)
source("zmisc.R")
set.seed(02138)
nlsy97_full <- readRDS("nlsy97_ed3_full.RDS") %>%
map(., ~ mutate(.x,
aveearn = rowMeans(select(.x, num_range("total_earnings_adj_", 2007:2010)), na.rm = TRUE),
logearn = log(pmax(aveearn, 1000)),
interest = 5 - rowMeans(select(.x, interest2008, interest2010), na.rm = TRUE)))
n <- nrow(nlsy97_full[[1]])
接着,我们开始设定变量与公式,此处 a
为处理组的平均处理效应 (Average Treatment Effects on Treated, ATT);y
为结果变量,即是否在 2010 年选举中投票;第一个中介 m1
为经济状况;第二个中介 m2
为一组多元变量,包括受访者在 2007 至 2010 年间对政府和公共事务的兴趣,以及参与志愿服务、捐赠、社区团体活动的情况;x
为一组控制变量。我们分别对无中介、仅有中介 m1
、中介 m1
和 m2
三种情况下的处理变量和结果变量设定计算公式。
##########################################################
# Formulas for treatment and outcome models
##########################################################
# continuous variables
nlsy97_full[[1]] %>%
map_lgl( ~ length(unique(.x)) > 2) %>%
which()
a <- "att"
y <- "vote2010"
m1 <- c("logearn")
m2 <- c("volunteer2007", "community2007", "donate2007", "interest")
x <- exprs(age97, female, black, hispan, parinc, paredu, parast, livebio, father,
afqt3, gpa, children18, substind, delinqind, rural97, south97,
frcoll_75, frcoll_90, schstolen, schthreat, schfight) %>%
map_chr(as_string)
a0_form <- as.formula(paste(a, " ~ ", paste(x, collapse= "+")))
a1_form <- as.formula(paste(a, " ~ ", paste(c(x, m1), collapse= "+")))
a2_form <- as.formula(paste(a, " ~ ", paste(c(x, m1, m2), collapse= "+")))
y0_form <- as.formula(paste(y, " ~ ", paste(c(x, a), collapse= "+")))
y1_form <- as.formula(paste(y, " ~ ", paste(c(x, a, m1), collapse= "+")))
y2_form <- as.formula(paste(y, " ~ ", paste(c(x, a, m1, m2), collapse= "+")))
最后,则是对原文中的理论估计模型进行编程并得到估计结果。作者使用了两种方法,一是结合结果模型与迭代回归输入,二是使用目标极大似然估计方法。注意以下命令全部运行完约需两个小时。
##########################################################
# Main analyses
##########################################################
estimands <- expand.grid(c(0, 1), c(0, 1), c(0, 1)) %>%
`colnames<-`(c("a1", "a2", "a3"))
I <- length(nlsy97_full)
S <- nrow(estimands)
K <- 5
out <- vector(mode = "list", I)
for(i in 1:I){
cat("imputed sample ", i, "\n")
df <- nlsy97_full[[i]]
df_p0 <- model.matrix(a0_form, data = df)[, -1] %>% as_tibble()
df_p1 <- model.matrix(a1_form, data = df)[, -1] %>% as_tibble()
df_p2 <- model.matrix(a2_form, data = df)[, -1] %>% as_tibble()
df_mu0 <- model.matrix(y0_form, data = df)[, -1] %>% as_tibble()
df_mu1 <- model.matrix(y1_form, data = df)[, -1] %>% as_tibble()
df_mu2 <- model.matrix(y2_form, data = df)[, -1] %>% as_tibble()
df_mu2n <- model.matrix(y2_form, data = mutate(df, att = 0))[, -1] %>% as_tibble()
df_mu1n <- model.matrix(y1_form, data = mutate(df, att = 0))[, -1] %>% as_tibble()
df_mu0n <- model.matrix(y0_form, data = mutate(df, att = 0))[, -1] %>% as_tibble()
df_mu2y <- model.matrix(y2_form, data = mutate(df, att = 1))[, -1] %>% as_tibble()
df_mu1y <- model.matrix(y1_form, data = mutate(df, att = 1))[, -1] %>% as_tibble()
df_mu0y <- model.matrix(y0_form, data = mutate(df, att = 1))[, -1] %>% as_tibble()
# create cross-fitting split
cf_fold <- createFolds(df$vote2010, K)
main_list <- vector(mode = "list", K)
for(k in 1:K){
cat(" cross-fitting fold ", k, "\n")
#################################################
# Design matrices for different models
#################################################
# auxiliary and main data
aux <- df[-cf_fold[[k]], ]
main <- df[cf_fold[[k]], ]
aux_p0 <- df_p0[-cf_fold[[k]], ]
aux_p1 <- df_p1[-cf_fold[[k]], ]
aux_p2 <- df_p2[-cf_fold[[k]], ]
main_p0 <- df_p0[cf_fold[[k]], ]
main_p1 <- df_p1[cf_fold[[k]], ]
main_p2 <- df_p2[cf_fold[[k]], ]
aux_mu0 <- df_mu0[-cf_fold[[k]], ]
aux_mu1 <- df_mu1[-cf_fold[[k]], ]
aux_mu2 <- df_mu2[-cf_fold[[k]], ]
main_mu0 <- df_mu0[cf_fold[[k]], ]
main_mu1 <- df_mu1[cf_fold[[k]], ]
main_mu2 <- df_mu2[cf_fold[[k]], ]
#################################################
# Treatment Models
#################################################
p0_sl <- SuperLearner(
Y = aux$att,
X = aux_p0,
newX = df_p0,
family = binomial(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE, trimLogit = 0.001),
cvControl = list(V = 5L, stratifyCV = TRUE, shuffle = TRUE, validRows = NULL)
)
p1_sl <- SuperLearner(
Y = aux$att,
X = aux_p1,
newX = df_p1,
family = binomial(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE, trimLogit = 0.001),
cvControl = list(V = 5L, stratifyCV = TRUE, shuffle = TRUE, validRows = NULL)
)
p2_sl <- SuperLearner(
Y = aux$att,
X = aux_p2,
newX = df_p2,
family = binomial(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE, trimLogit = 0.001),
cvControl = list(V = 5L, stratifyCV = TRUE, shuffle = TRUE, validRows = NULL)
)
df <- df %>% mutate(
p0_fit = p0_sl$SL.predict,
p1_fit = p1_sl$SL.predict,
p2_fit = p2_sl$SL.predict,
w0_n = I(att == 0)/(1 - p0_fit),
w0_y = I(att == 1)/p0_fit,
w1_nn = I(att == 0)/(1 - p0_fit) * 1,
w1_ny = I(att == 1)/(1 - p0_fit) * (1 - p1_fit)/p1_fit,
w1_yn = I(att == 0)/p0_fit * p1_fit/(1 - p1_fit),
w1_yy = I(att == 1)/p0_fit * 1,
w2_000 = I(att == 0)/(1 - p0_fit) * 1 * 1,
w2_001 = I(att == 1)/(1 - p0_fit) * 1 * (1 - p2_fit)/p2_fit,
w2_010 = I(att == 0)/(1 - p0_fit) * (1 - p1_fit)/p1_fit * p2_fit/(1 - p2_fit),
w2_011 = I(att == 1)/(1 - p0_fit) * (1 - p1_fit)/p1_fit * 1,
w2_100 = I(att == 0)/p0_fit * p1_fit/(1 - p1_fit) * 1,
w2_101 = I(att == 1)/p0_fit * p1_fit/(1 - p1_fit) * (1 - p2_fit)/p2_fit,
w2_110 = I(att == 0)/p0_fit * 1 * p2_fit/(1 - p2_fit),
w2_111 = I(att == 1)/p0_fit * 1 * 1,
)
#################################################
# Outcome models
#################################################
mu2_sl <- SuperLearner(
Y = aux$vote2010,
X = aux_mu2,
family = binomial(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
df$mu2_fit_a3n <- predict.SuperLearner(mu2_sl, newdata = df_mu2n)$pred
df$mu2_fit_a3y <- predict.SuperLearner(mu2_sl, newdata = df_mu2y)$pred
mu1_sl_a3n <- SuperLearner(
Y = df$mu2_fit_a3n[-cf_fold[[k]]],
X = aux_mu1,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu1_sl_a3y <- SuperLearner(
Y = df$mu2_fit_a3y[-cf_fold[[k]]],
X = aux_mu1,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
df$mu1_fit_a3n_a2n <- predict.SuperLearner(mu1_sl_a3n, newdata = df_mu1n)$pred
df$mu1_fit_a3n_a2y <- predict.SuperLearner(mu1_sl_a3n, newdata = df_mu1y)$pred
df$mu1_fit_a3y_a2n <- predict.SuperLearner(mu1_sl_a3y, newdata = df_mu1n)$pred
df$mu1_fit_a3y_a2y <- predict.SuperLearner(mu1_sl_a3y, newdata = df_mu1y)$pred
mu0_sl_a3n_a2n <- SuperLearner(
Y = df$mu1_fit_a3n_a2n[-cf_fold[[k]]],
X = aux_mu0,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu0_sl_a3n_a2y <- SuperLearner(
Y = df$mu1_fit_a3n_a2y[-cf_fold[[k]]],
X = aux_mu0,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu0_sl_a3y_a2n <- SuperLearner(
Y = df$mu1_fit_a3y_a2n[-cf_fold[[k]]],
X = aux_mu0,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu0_sl_a3y_a2y <- SuperLearner(
Y = df$mu1_fit_a3y_a2y[-cf_fold[[k]]],
X = aux_mu0,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
df$mu0_fit_a3n_a2n_a1n <- predict.SuperLearner(mu0_sl_a3n_a2n, newdata = df_mu0n)$pred
df$mu0_fit_a3n_a2n_a1y <- predict.SuperLearner(mu0_sl_a3n_a2n, newdata = df_mu0y)$pred
df$mu0_fit_a3n_a2y_a1n <- predict.SuperLearner(mu0_sl_a3n_a2y, newdata = df_mu0n)$pred
df$mu0_fit_a3n_a2y_a1y <- predict.SuperLearner(mu0_sl_a3n_a2y, newdata = df_mu0y)$pred
df$mu0_fit_a3y_a2n_a1n <- predict.SuperLearner(mu0_sl_a3y_a2n, newdata = df_mu0n)$pred
df$mu0_fit_a3y_a2n_a1y <- predict.SuperLearner(mu0_sl_a3y_a2n, newdata = df_mu0y)$pred
df$mu0_fit_a3y_a2y_a1n <- predict.SuperLearner(mu0_sl_a3y_a2y, newdata = df_mu0n)$pred
df$mu0_fit_a3y_a2y_a1y <- predict.SuperLearner(mu0_sl_a3y_a2y, newdata = df_mu0y)$pred
#################################################
# Targeted MLE
#################################################
# targeted mu2
mu2_tmle_000 <- glm(vote2010 ~ 0 + offset(I(qlogis(mu2_fit_a3n))) + w2_000, family = binomial(), weights = weight, data = df[cf_fold[[k]], ])
mu2_tmle_001 <- glm(vote2010 ~ 0 + offset(I(qlogis(mu2_fit_a3y))) + w2_001, family = binomial(), weights = weight, data = df[cf_fold[[k]], ])
mu2_tmle_010 <- glm(vote2010 ~ 0 + offset(I(qlogis(mu2_fit_a3n))) + w2_010, family = binomial(), weights = weight, data = df[cf_fold[[k]], ])
mu2_tmle_011 <- glm(vote2010 ~ 0 + offset(I(qlogis(mu2_fit_a3y))) + w2_011, family = binomial(), weights = weight, data = df[cf_fold[[k]], ])
mu2_tmle_100 <- glm(vote2010 ~ 0 + offset(I(qlogis(mu2_fit_a3n))) + w2_100, family = binomial(), weights = weight, data = df[cf_fold[[k]], ])
mu2_tmle_101 <- glm(vote2010 ~ 0 + offset(I(qlogis(mu2_fit_a3y))) + w2_101, family = binomial(), weights = weight, data = df[cf_fold[[k]], ])
mu2_tmle_110 <- glm(vote2010 ~ 0 + offset(I(qlogis(mu2_fit_a3n))) + w2_110, family = binomial(), weights = weight, data = df[cf_fold[[k]], ])
mu2_tmle_111 <- glm(vote2010 ~ 0 + offset(I(qlogis(mu2_fit_a3y))) + w2_111, family = binomial(), weights = weight, data = df[cf_fold[[k]], ])
df$mu2b_fit_000 <- predict(mu2_tmle_000, newdata = df, type = "response")
df$mu2b_fit_001 <- predict(mu2_tmle_001, newdata = df, type = "response")
df$mu2b_fit_010 <- predict(mu2_tmle_010, newdata = df, type = "response")
df$mu2b_fit_011 <- predict(mu2_tmle_011, newdata = df, type = "response")
df$mu2b_fit_100 <- predict(mu2_tmle_100, newdata = df, type = "response")
df$mu2b_fit_101 <- predict(mu2_tmle_101, newdata = df, type = "response")
df$mu2b_fit_110 <- predict(mu2_tmle_110, newdata = df, type = "response")
df$mu2b_fit_111 <- predict(mu2_tmle_111, newdata = df, type = "response")
# refit mu1
mu1b_sl_000 <- SuperLearner(
Y = df$mu2b_fit_000[-cf_fold[[k]]],
X = aux_mu1,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu1b_sl_001 <- SuperLearner(
Y = df$mu2b_fit_001[-cf_fold[[k]]],
X = aux_mu1,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu1b_sl_010 <- SuperLearner(
Y = df$mu2b_fit_010[-cf_fold[[k]]],
X = aux_mu1,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu1b_sl_011 <- SuperLearner(
Y = df$mu2b_fit_011[-cf_fold[[k]]],
X = aux_mu1,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu1b_sl_100 <- SuperLearner(
Y = df$mu2b_fit_100[-cf_fold[[k]]],
X = aux_mu1,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu1b_sl_101 <- SuperLearner(
Y = df$mu2b_fit_101[-cf_fold[[k]]],
X = aux_mu1,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu1b_sl_110 <- SuperLearner(
Y = df$mu2b_fit_110[-cf_fold[[k]]],
X = aux_mu1,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu1b_sl_111 <- SuperLearner(
Y = df$mu2b_fit_111[-cf_fold[[k]]],
X = aux_mu1,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
df$mu1_fit_000 <- predict.SuperLearner(mu1b_sl_000, newdata = df_mu1n)$pred
df$mu1_fit_001 <- predict.SuperLearner(mu1b_sl_001, newdata = df_mu1n)$pred
df$mu1_fit_010 <- predict.SuperLearner(mu1b_sl_010, newdata = df_mu1y)$pred
df$mu1_fit_011 <- predict.SuperLearner(mu1b_sl_011, newdata = df_mu1y)$pred
df$mu1_fit_100 <- predict.SuperLearner(mu1b_sl_100, newdata = df_mu1n)$pred
df$mu1_fit_101 <- predict.SuperLearner(mu1b_sl_101, newdata = df_mu1n)$pred
df$mu1_fit_110 <- predict.SuperLearner(mu1b_sl_110, newdata = df_mu1y)$pred
df$mu1_fit_111 <- predict.SuperLearner(mu1b_sl_111, newdata = df_mu1y)$pred
# targeted mu1
mu1_tmle_000 <- lm(mu2b_fit_000 ~ 0 + offset(mu1_fit_000) + w1_nn, weights = weight, data = df[cf_fold[[k]], ])
mu1_tmle_001 <- lm(mu2b_fit_001 ~ 0 + offset(mu1_fit_001) + w1_nn, weights = weight, data = df[cf_fold[[k]], ])
mu1_tmle_010 <- lm(mu2b_fit_010 ~ 0 + offset(mu1_fit_010) + w1_ny, weights = weight, data = df[cf_fold[[k]], ])
mu1_tmle_011 <- lm(mu2b_fit_011 ~ 0 + offset(mu1_fit_011) + w1_ny, weights = weight, data = df[cf_fold[[k]], ])
mu1_tmle_100 <- lm(mu2b_fit_100 ~ 0 + offset(mu1_fit_100) + w1_yn, weights = weight, data = df[cf_fold[[k]], ])
mu1_tmle_101 <- lm(mu2b_fit_101 ~ 0 + offset(mu1_fit_101) + w1_yn, weights = weight, data = df[cf_fold[[k]], ])
mu1_tmle_110 <- lm(mu2b_fit_110 ~ 0 + offset(mu1_fit_110) + w1_yy, weights = weight, data = df[cf_fold[[k]], ])
mu1_tmle_111 <- lm(mu2b_fit_111 ~ 0 + offset(mu1_fit_111) + w1_yy, weights = weight, data = df[cf_fold[[k]], ])
df$mu1b_fit_000 <- predict(mu1_tmle_000, newdata = df)
df$mu1b_fit_001 <- predict(mu1_tmle_001, newdata = df)
df$mu1b_fit_010 <- predict(mu1_tmle_010, newdata = df)
df$mu1b_fit_011 <- predict(mu1_tmle_011, newdata = df)
df$mu1b_fit_100 <- predict(mu1_tmle_100, newdata = df)
df$mu1b_fit_101 <- predict(mu1_tmle_101, newdata = df)
df$mu1b_fit_110 <- predict(mu1_tmle_110, newdata = df)
df$mu1b_fit_111 <- predict(mu1_tmle_111, newdata = df)
# refit mu0
mu0b_sl_000 <- SuperLearner(
Y = df$mu1b_fit_000[-cf_fold[[k]]],
X = aux_mu0,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu0b_sl_001 <- SuperLearner(
Y = df$mu1b_fit_001[-cf_fold[[k]]],
X = aux_mu0,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu0b_sl_010 <- SuperLearner(
Y = df$mu1b_fit_010[-cf_fold[[k]]],
X = aux_mu0,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu0b_sl_011 <- SuperLearner(
Y = df$mu1b_fit_011[-cf_fold[[k]]],
X = aux_mu0,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu0b_sl_100 <- SuperLearner(
Y = df$mu1b_fit_100[-cf_fold[[k]]],
X = aux_mu0,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu0b_sl_101 <- SuperLearner(
Y = df$mu1b_fit_101[-cf_fold[[k]]],
X = aux_mu0,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu0b_sl_110 <- SuperLearner(
Y = df$mu1b_fit_110[-cf_fold[[k]]],
X = aux_mu0,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
mu0b_sl_111 <- SuperLearner(
Y = df$mu1b_fit_111[-cf_fold[[k]]],
X = aux_mu0,
family = gaussian(),
obsWeights = aux$weight,
SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
control = list(saveFitLibrary = TRUE),
cvControl = list(V = 5L, shuffle = TRUE, validRows = NULL)
)
df$mu0_fit_000 <- predict.SuperLearner(mu0b_sl_000, newdata = df_mu0n)$pred
df$mu0_fit_001 <- predict.SuperLearner(mu0b_sl_001, newdata = df_mu0n)$pred
df$mu0_fit_010 <- predict.SuperLearner(mu0b_sl_010, newdata = df_mu0y)$pred
df$mu0_fit_011 <- predict.SuperLearner(mu0b_sl_011, newdata = df_mu0y)$pred
df$mu0_fit_100 <- predict.SuperLearner(mu0b_sl_100, newdata = df_mu0n)$pred
df$mu0_fit_101 <- predict.SuperLearner(mu0b_sl_101, newdata = df_mu0n)$pred
df$mu0_fit_110 <- predict.SuperLearner(mu0b_sl_110, newdata = df_mu0y)$pred
df$mu0_fit_111 <- predict.SuperLearner(mu0b_sl_111, newdata = df_mu0y)$pred
# targeted mu0
mu0_tmle_000 <- lm(mu1b_fit_000 ~ 0 + offset(mu0_fit_000) + w0_n, weights = weight, data = df[cf_fold[[k]], ])
mu0_tmle_001 <- lm(mu1b_fit_001 ~ 0 + offset(mu0_fit_001) + w0_n, weights = weight, data = df[cf_fold[[k]], ])
mu0_tmle_010 <- lm(mu1b_fit_010 ~ 0 + offset(mu0_fit_010) + w0_n, weights = weight, data = df[cf_fold[[k]], ])
mu0_tmle_011 <- lm(mu1b_fit_011 ~ 0 + offset(mu0_fit_011) + w0_n, weights = weight, data = df[cf_fold[[k]], ])
mu0_tmle_100 <- lm(mu1b_fit_100 ~ 0 + offset(mu0_fit_100) + w0_y, weights = weight, data = df[cf_fold[[k]], ])
mu0_tmle_101 <- lm(mu1b_fit_101 ~ 0 + offset(mu0_fit_101) + w0_y, weights = weight, data = df[cf_fold[[k]], ])
mu0_tmle_110 <- lm(mu1b_fit_110 ~ 0 + offset(mu0_fit_110) + w0_y, weights = weight, data = df[cf_fold[[k]], ])
mu0_tmle_111 <- lm(mu1b_fit_111 ~ 0 + offset(mu0_fit_111) + w0_y, weights = weight, data = df[cf_fold[[k]], ])
df$mu0b_fit_000 <- predict(mu0_tmle_000, newdata = df)
df$mu0b_fit_001 <- predict(mu0_tmle_001, newdata = df)
df$mu0b_fit_010 <- predict(mu0_tmle_010, newdata = df)
df$mu0b_fit_011 <- predict(mu0_tmle_011, newdata = df)
df$mu0b_fit_100 <- predict(mu0_tmle_100, newdata = df)
df$mu0b_fit_101 <- predict(mu0_tmle_101, newdata = df)
df$mu0b_fit_110 <- predict(mu0_tmle_110, newdata = df)
df$mu0b_fit_111 <- predict(mu0_tmle_111, newdata = df)
main_list[[k]] <- df[cf_fold[[k]], ]
}
main_df <- reduce(main_list, bind_rows)
for (s in 1:S){
a1 <- estimands$a1[[s]]
a2 <- estimands$a2[[s]]
a3 <- estimands$a3[[s]]
main_df <- main_df %>%
mutate(
wt0_deno = a1 * p0_fit + (1 - a1) * (1 - p0_fit),
wt1_nume = a1 * p1_fit + (1 - a1) * (1 - p1_fit),
wt1_deno = a2 * p1_fit + (1 - a2) * (1 - p1_fit),
wt2_nume = a2 * p2_fit + (1 - a2) * (1 - p2_fit),
wt2_deno = a3 * p2_fit + (1 - a3) * (1 - p2_fit),
!!sym(paste0("w0_", a1, a2, a3)) := 1/trim(wt0_deno),
!!sym(paste0("w1_", a1, a2, a3)) := !!sym(paste0("w0_", a1, a2, a3)) * wt1_nume/trim(wt1_deno),
!!sym(paste0("w2_", a1, a2, a3)) := !!sym(paste0("w1_", a1, a2, a3)) * wt2_nume/trim(wt2_deno),
!!sym(paste0("mu2fit_", a1, a2, a3)) := a3 * mu2_fit_a3y + (1 - a3) * mu2_fit_a3n,
!!sym(paste0("mu1fit_", a1, a2, a3)) := a3 * a2 * mu1_fit_a3y_a2y + a3 * (1 - a2) * mu1_fit_a3y_a2n +
(1 - a3) * a2 * mu1_fit_a3n_a2y + (1 - a3) * (1 - a2) * mu1_fit_a3n_a2n,
!!sym(paste0("mu0fit_", a1, a2, a3)) := a3 * a2 * a1 * mu0_fit_a3y_a2y_a1y + a3 * a2 * (1 - a1) * mu0_fit_a3y_a2y_a1n +
a3 * (1 - a2) * a1 * mu0_fit_a3y_a2n_a1y + a3 * (1 - a2) * (1 - a1) * mu0_fit_a3y_a2n_a1n +
(1 - a3) * a2 * a1 * mu0_fit_a3n_a2y_a1y + (1 - a3) * a2 * (1 - a1) * mu0_fit_a3n_a2y_a1n +
(1 - a3) * (1 - a2) * a1 * mu0_fit_a3n_a2n_a1y + (1 - a3) * (1 - a2) * (1 - a1) * mu0_fit_a3n_a2n_a1n,
!!sym(paste0("www_", a1, a2, a3)) := as.double(att==a3) * !!sym(paste0("w2_", a1, a2, a3)) * vote2010,
!!sym(paste0("iii_", a1, a2, a3)) := !!sym(paste0("mu0fit_", a1, a2, a3)),
!!sym(paste0("eif_", a1, a2, a3)) := as.double(att==a3) * !!sym(paste0("w2_", a1, a2, a3)) *
(vote2010 - !!sym(paste0("mu2fit_", a1, a2, a3))) +
as.double(att==a2) * !!sym(paste0("w1_", a1, a2, a3)) * (!!sym(paste0("mu2fit_", a1, a2, a3)) -
!!sym(paste0("mu1fit_", a1, a2, a3))) +
as.double(att==a1) * !!sym(paste0("w0_", a1, a2, a3)) * (!!sym(paste0("mu1fit_", a1, a2, a3)) -
!!sym(paste0("mu0fit_", a1, a2, a3))) +
!!sym(paste0("mu0fit_", a1, a2, a3)),
!!sym(paste0("tmle_", a1, a2, a3)) := as.double(att==a3) * !!sym(paste0("w2_", a1, a2, a3)) *
(vote2010 - !!sym(paste0("mu2b_fit_", a1, a2, a3))) +
as.double(att==a2) * !!sym(paste0("w1_", a1, a2, a3)) * (!!sym(paste0("mu2b_fit_", a1, a2, a3)) -
!!sym(paste0("mu1b_fit_", a1, a2, a3))) +
as.double(att==a1) * !!sym(paste0("w0_", a1, a2, a3)) * (!!sym(paste0("mu1b_fit_", a1, a2, a3)) -
!!sym(paste0("mu0b_fit_", a1, a2, a3))) +
!!sym(paste0("mu0b_fit_", a1, a2, a3)),
!!sym(paste0("tmle2_", a1, a2, a3)) := !!sym(paste0("mu0b_fit_", a1, a2, a3))
)
}
out[[i]] <- main_df
}
save.image(file = "05_vote_main.RData")
我们同样载入所需包、导入数据、设定绘图主题后,根据之前的估计结果计算得到 ATE 与其分解所得的 PSEs 的估计值与标准误,并通过 ggplot
绘制图形。
rm(list = ls())
library(survey)
library(gbm)
library(ranger)
library(glmnet)
library(rsample)
library(caret)
library(rlang)
library(tidyverse)
library(Hmisc)
library(SuperLearner)
library(scales)
source("zmisc.R")
set.seed(02138)
load("05_vote_main.RData")
# set my ggplot theme
mytheme <- theme_minimal(base_size = 18) +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5),
plot.caption = element_text(color = "grey30"))
theme_set(mytheme)
out_df <- out %>%
imap( ~ mutate(.x, imp = .y)) %>%
reduce(bind_rows) %>%
mutate(eif_type1_ate = eif_111 - eif_000,
eif_type2_ate = eif_111 - eif_000,
eif_type3_ate = eif_111 - eif_000,
eif_type1_pse3 = eif_001 - eif_000,
eif_type1_pse2 = eif_011 - eif_001,
eif_type1_pse1 = eif_111 - eif_011,
eif_type2_pse1 = eif_100 - eif_000,
eif_type2_pse2 = eif_110 - eif_100,
eif_type2_pse3 = eif_111 - eif_110,
eif_type3_pse3 = eif_001 - eif_000,
eif_type3_pse1 = eif_101 - eif_001,
eif_type3_pse2 = eif_111 - eif_101,
tmle_type1_ate = tmle_111 - tmle_000,
tmle_type2_ate = tmle_111 - tmle_000,
tmle_type3_ate = tmle_111 - tmle_000,
tmle_type1_pse3 = tmle_001 - tmle_000,
tmle_type1_pse2 = tmle_011 - tmle_001,
tmle_type1_pse1 = tmle_111 - tmle_011,
tmle_type2_pse1 = tmle_100 - tmle_000,
tmle_type2_pse2 = tmle_110 - tmle_100,
tmle_type2_pse3 = tmle_111 - tmle_110,
tmle_type3_pse3 = tmle_001 - tmle_000,
tmle_type3_pse1 = tmle_101 - tmle_001,
tmle_type3_pse2 = tmle_111 - tmle_101) %>%
group_by(imp) %>%
summarise_at(vars(contains("type")), list(est = ~ wtd.mean(.x, weight),
se = ~ sqrt(wtd.var(.x, weight)/length(.x)))) %>%
ungroup() %>%
pivot_longer(-imp) %>%
separate(name, into = c("estimator", "type", "estimand", "measure")) %>%
pivot_wider(names_from = measure, values_from = value) %>%
group_by(estimator, type, estimand) %>%
summarise(within_var = mean(se^2),
between_var = var(est),
total_var = within_var + (1 + 1/I)*between_var,
est = mean(est),
se = sqrt(total_var)) %>%
ungroup() %>%
filter(type == "type1") %>%
mutate(estimator = factor(estimator,
levels = c("eif", "tmle"),
labels = c(expression(paste("Estimating Equation (", hat(psi)[np]^eif2, ")")),
expression(paste("TMLE (", hat(psi)[tmle]^eif2, ")")))),
estimand = factor(estimand,
levels = rev(c("ate", "pse3", "pse2", "pse1")),
labels = rev(c(expression(paste("Total Effect (", psi[`111`]-psi[`000`], ")")),
expression(paste("Direct Effect (", psi[`001`]-psi[`000`], ")")),
expression(paste("via Civic/Political Interest (", psi[`011`]-psi[`001`], ")")),
expression(paste("via Economic Status (", psi[`111`]-psi[`011`], ")"))))))
ggplot(out_df, aes(x = estimand, y = est, shape = estimator)) +
geom_pointrange(aes(ymin = est - 1.96 * se, ymax = est + 1.96 * se),
position = position_dodge(width = - 0.5), size = 1) +
geom_hline(yintercept = 0, linetype = 2) +
scale_shape("", labels = parse_format()) +
scale_color_discrete("", labels = parse_format()) +
scale_x_discrete("", labels = parse_format()) +
scale_y_continuous("Effects of College Attendance on Voting") +
coord_flip()
out_df[, c("est", "se")] %>% round(3)
R 的最终估计结果如下所示:
est | se |
---|---|
0.153 | 0.023 |
0.008 | 0.005 |
0.045 | 0.008 |
0.1 | 0.021 |
0.157 | 0.022 |
0.002 | 0.005 |
0.051 | 0.008 |
0.103 | 0.021 |
我们将其与原文表格 (如下所示,括号中为标准误) 进行对比,可见结果是一致的。
估计方程 ( |
TMLE ( |
|
---|---|---|
平均总效应 | 0.152 (0.022) | 0.156 (0.023) |
经济状况路径 ( |
0.007 (0.005) | 0.002 (0.005) |
公民/政治兴趣路径 ( |
0.042 (0.008) | 0.049 (0.008) |
直接效应 ( |
0.103 (0.021) | 0.105 (0.021) |
此外,我们还通过 R 将以上结果可视化:
在理论篇中,我们知道本文的多重稳健估计量可以通过两种方法来估计,一是调整结果模型的估计方程,使包含逆概率权重的项等于零,只留下一个位于估计对象参数位置的迭代回归输入估计量,即表中的
本推文对
Note:产生如下推文列表的 Stata 命令为:
lianxh
安装最新版lianxh
命令:
ssc install lianxh, replace
免费公开课
最新课程-直播课
专题 | 嘉宾 | 直播/回看视频 |
---|---|---|
⭐ 最新专题 | 文本分析、机器学习、效率专题、生存分析等 | |
研究设计 | 连玉君 | 我的特斯拉-实证研究设计,-幻灯片- |
面板模型 | 连玉君 | 动态面板模型,-幻灯片- |
面板模型 | 连玉君 | 直击面板数据模型 [免费公开课,2小时] |
⛳ 课程主页
⛳ 课程主页
关于我们
课程, 直播, 视频, 客服, 模型设定, 研究设计, stata, plus, 绘图, 编程, 面板, 论文重现, 可视化, RDD, DID, PSM, 合成控制法
等
连享会小程序:扫一扫,看推文,看视频……
扫码加入连享会微信群,提问交流更方便
✏ 连享会-常见问题解答:
✨ https://gitee.com/lianxh/Course/wikis
New!
lianxh
命令发布了:
随时搜索连享会推文、Stata 资源,安装命令如下:
. ssc install lianxh
使用详情参见帮助文件 (有惊喜):
. help lianxh