中介效应:有序因果中介分析的半参数估计B-实操

发布时间:2023-03-19 阅读 650

Stata连享会   主页 || 视频 || 推文 || 知乎 || Bilibili 站

温馨提示: 定期 清理浏览器缓存,可以获得最佳浏览体验。

New! lianxh 命令发布了:
随时搜索推文、Stata 资源。安装:
. ssc install lianxh
详情参见帮助文件 (有惊喜):
. help lianxh
连享会新命令:cnssc, ihelp, rdbalance, gitee, installpkg

课程详情 https://gitee.com/lianxh/Course

课程主页 https://gitee.com/lianxh/Course

⛳ Stata 系列推文:

PDF下载 - 推文合集

作者: 余坚 (贵州财经大学)
邮箱: yujiangeren@163.com

Source: Xiang Zhou, 2021, Semiparametric estimation for causal mediation analysis with multiple causally ordered mediators. -Link--Data-


目录


1. 引言

在有序因果中介分析的半参数估计理论篇中,我们已经对 K(1) 个因果有序中介设定下,路径特定效应 (path-specific effects, PSEs) 的估计进行了理论介绍。本推文将在此基础上,复现作者原文中的实证结果,在 R 语言软件中演示具体操作,并对相应结果进行解读。

2. 数据背景

作者使用的数据来源于美国 1997 年全国青年纵向调查 (National Longitudinal Survey of Youth 1997),数据集中共有 2969 人,在 1997 年他们的年龄为 15 岁至 17 岁,并在 20 岁前完成了高中学业。相关研究表明,大学入学率对美国的政治参与 (选举中是否投票) 有实质性的积极影响,但这种因果关系背后的机制仍不清楚。大学入学率对政治参与的影响可能通过公民和政治兴趣的发展、经济状况的提高、或社会和职业网络等其他途径发挥作用。为了检验这些直接和间接的影响,我们考虑一个因果结构,令 A 表示大学入学率,Y 表示政治参与 ,M1 和 M2 表示两个因果有序中介,分别反映了 (a) 经济状况, (b) 公民和政治兴趣。

在该数据集中,处理 A 是一个二元变量,用来衡量个体在 20 岁之前是否上过 2 年或 4 年的大学。结果 Y 是一个二元变量,用来衡量个体是否在 2010 年大选中投票。经济状况 M1 由受访者 2006 年至 2009 年的平均年收入度量。公民和政治兴趣 M2 由一组变量来度量,反映受访者在 2007 至 2010 年间对政府和公共事务的兴趣,以及参与志愿服务、捐赠、社区团体活动的情况。

预处理协变量 X 包括性别、种族、民族、1997 年的年龄、父母教育、父母收入、父母资产、父亲形象、是否与亲生父母都住在一起、军队职业能力测试分数、高中 GPA、物质使用指数、青少年犯罪指数,受访者在 18 岁之前是否有孩子,受访者的同龄人对大学的期望,以及一些学校水平的特征。

我们在上一篇推文中已经介绍过反映因果路径 AYAM1Y,和 AM2Y 的 PSEs。为了便于说明,现在我们主要关注 Mk(k2) 的累积路径特定效应 (cumulative path-sprecific effects, cPSEs):

其中,第一个部分是大学入学率的 NDE,第二个和第三个部分反映了分别通过公民/政治兴趣中介和经济状况中介的处理效应的大小。

3. R 语言实操

3.1 前期工作

首先,我们进行一系列前期工作,包括载入所需的包,设定画图的主题,以及导入数据。

rm(list = ls())
library(haven)
library(survey)
library(gbm)
library(ranger)
library(rlang)
library(glmnet)
library(rsample)
library(caret)
library(tidyverse)
library(mice)
library(SuperLearner)
source("zmisc.R")
set.seed(02138)

# set my ggplot theme
mytheme <- theme_minimal(base_size = 16) + 
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(color = "grey30"))
theme_set(mytheme)

# load data
nlsy97_ed3 <- readRDS("nlsy97_ed3.RDS")

并且,我们使用多重插补法 (Multiple Imputation) 补齐缺失的协变量。

##########################################################
# Multiple Imputation for All Missing Covariates
##########################################################

# number of imputed datasets
I <- 10

nlsy97_ed3_mi <- nlsy97_ed3 %>%
  # remove redundant variables for imputation
  select(-id, -weight) %>%
  mice(m = I, maxit = 10, defaultMethod = "rf")

nlsy97_ed3_full <- map(1:I, ~ complete(nlsy97_ed3_mi, .x)) %>%
  map(., ~ mutate(.x,
                  id = nlsy97_ed3$id,
                  weight = nlsy97_ed3$weight,
                  logearn2007 = log(pmax(total_earnings_adj_2007, 1000)),
                  logearn2008 = log(pmax(total_earnings_adj_2008, 1000)),
                  logearn2009 = log(pmax(total_earnings_adj_2009, 1000)),
                  logearn2010 = log(pmax(total_earnings_adj_2010, 1000)),
  ))

saveRDS(nlsy97_ed3_full, file = "nlsy97_ed3_full.RDS")

3.2 估计过程

首先,我们载入所需的包并对一些变量进行预处理。

rm(list = ls())
library(survey)
library(gbm)
library(ranger)
library(glmnet)
library(rsample)
library(caret)
library(rlang)
library(tidyverse)
library(Hmisc)
library(SuperLearner)
source("zmisc.R")
set.seed(02138)

nlsy97_full <- readRDS("nlsy97_ed3_full.RDS") %>%
  map(., ~ mutate(.x,
                  aveearn = rowMeans(select(.x, num_range("total_earnings_adj_", 2007:2010)), na.rm = TRUE),
                  logearn = log(pmax(aveearn, 1000)),
                  interest = 5 - rowMeans(select(.x, interest2008, interest2010), na.rm = TRUE)))

n <- nrow(nlsy97_full[[1]])

接着,我们开始设定变量与公式,此处 a 为处理组的平均处理效应 (Average Treatment Effects on Treated, ATT);y 为结果变量,即是否在 2010 年选举中投票;第一个中介 m1 为经济状况;第二个中介 m2 为一组多元变量,包括受访者在 2007 至 2010 年间对政府和公共事务的兴趣,以及参与志愿服务、捐赠、社区团体活动的情况;x 为一组控制变量。我们分别对无中介、仅有中介 m1、中介 m1m2 三种情况下的处理变量和结果变量设定计算公式。

##########################################################
# Formulas for treatment and outcome models
##########################################################

# continuous variables
nlsy97_full[[1]] %>%
  map_lgl( ~ length(unique(.x)) > 2) %>%
  which()

a <- "att"
y <- "vote2010"
m1 <- c("logearn")
m2 <- c("volunteer2007", "community2007", "donate2007", "interest")

x <- exprs(age97, female, black, hispan, parinc, paredu, parast, livebio, father,
           afqt3, gpa, children18, substind, delinqind, rural97, south97,
           frcoll_75, frcoll_90, schstolen, schthreat, schfight) %>%
  map_chr(as_string)

a0_form <- as.formula(paste(a, " ~ ", paste(x, collapse= "+")))
a1_form <- as.formula(paste(a, " ~ ", paste(c(x, m1), collapse= "+")))
a2_form <- as.formula(paste(a, " ~ ", paste(c(x, m1, m2), collapse= "+")))

y0_form <- as.formula(paste(y, " ~ ", paste(c(x, a), collapse= "+")))
y1_form <- as.formula(paste(y, " ~ ", paste(c(x, a, m1), collapse= "+")))
y2_form <- as.formula(paste(y, " ~ ", paste(c(x, a, m1, m2), collapse= "+")))

最后,则是对原文中的理论估计模型进行编程并得到估计结果。作者使用了两种方法,一是结合结果模型与迭代回归输入,二是使用目标极大似然估计方法。注意以下命令全部运行完约需两个小时。

##########################################################
# Main analyses
##########################################################

estimands <- expand.grid(c(0, 1), c(0, 1), c(0, 1)) %>%
  `colnames<-`(c("a1", "a2", "a3"))

I <- length(nlsy97_full)
S <- nrow(estimands)
K <- 5

out <- vector(mode = "list", I)

for(i in 1:I){
  
  cat("imputed sample ", i, "\n")
  
  df <- nlsy97_full[[i]]
  
  df_p0 <- model.matrix(a0_form, data = df)[, -1] %>% as_tibble()
  df_p1 <- model.matrix(a1_form, data = df)[, -1] %>% as_tibble()
  df_p2 <- model.matrix(a2_form, data = df)[, -1] %>% as_tibble()
  
  df_mu0 <- model.matrix(y0_form, data = df)[, -1] %>% as_tibble()
  df_mu1 <- model.matrix(y1_form, data = df)[, -1] %>% as_tibble()
  df_mu2 <- model.matrix(y2_form, data = df)[, -1] %>% as_tibble()
  
  df_mu2n <- model.matrix(y2_form, data = mutate(df, att = 0))[, -1] %>% as_tibble()
  df_mu1n <- model.matrix(y1_form, data = mutate(df, att = 0))[, -1] %>% as_tibble()
  df_mu0n <- model.matrix(y0_form, data = mutate(df, att = 0))[, -1] %>% as_tibble()
  
  df_mu2y <- model.matrix(y2_form, data = mutate(df, att = 1))[, -1] %>% as_tibble()
  df_mu1y <- model.matrix(y1_form, data = mutate(df, att = 1))[, -1] %>% as_tibble()
  df_mu0y <- model.matrix(y0_form, data = mutate(df, att = 1))[, -1] %>% as_tibble()
    
  # create cross-fitting split
  cf_fold <- createFolds(df$vote2010, K)
  
  main_list <- vector(mode = "list", K)
  
  for(k in 1:K){
    
    cat(" cross-fitting fold ", k, "\n")
    
    #################################################
    # Design matrices for different models
    #################################################
    
    # auxiliary and main data
    aux <- df[-cf_fold[[k]], ]
    main <- df[cf_fold[[k]], ]
    
    aux_p0 <- df_p0[-cf_fold[[k]], ]
    aux_p1 <- df_p1[-cf_fold[[k]], ]
    aux_p2 <- df_p2[-cf_fold[[k]], ]
    
    main_p0 <- df_p0[cf_fold[[k]], ]
    main_p1 <- df_p1[cf_fold[[k]], ]
    main_p2 <- df_p2[cf_fold[[k]], ]
    
    aux_mu0 <- df_mu0[-cf_fold[[k]], ]
    aux_mu1 <- df_mu1[-cf_fold[[k]], ]
    aux_mu2 <- df_mu2[-cf_fold[[k]], ]
    
    main_mu0 <- df_mu0[cf_fold[[k]], ]
    main_mu1 <- df_mu1[cf_fold[[k]], ]
    main_mu2 <- df_mu2[cf_fold[[k]], ]
    
    #################################################
    # Treatment Models
    #################################################
    
    p0_sl <- SuperLearner(
      Y          = aux$att,
      X          = aux_p0,
      newX       = df_p0,
      family     = binomial(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE, trimLogit = 0.001),
      cvControl  = list(V = 5L, stratifyCV = TRUE, shuffle = TRUE, validRows = NULL)
    )
    
    p1_sl <- SuperLearner(
      Y          = aux$att,
      X          = aux_p1,
      newX       = df_p1,
      family     = binomial(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE, trimLogit = 0.001),
      cvControl  = list(V = 5L, stratifyCV = TRUE, shuffle = TRUE, validRows = NULL)
    )
    
    p2_sl <- SuperLearner(
      Y          = aux$att,
      X          = aux_p2,
      newX       = df_p2,
      family     = binomial(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE, trimLogit = 0.001),
      cvControl  = list(V = 5L, stratifyCV = TRUE, shuffle = TRUE, validRows = NULL)
    )
    
    df <- df %>% mutate(
      p0_fit = p0_sl$SL.predict,
      p1_fit = p1_sl$SL.predict,
      p2_fit = p2_sl$SL.predict,
      w0_n = I(att == 0)/(1 - p0_fit),
      w0_y = I(att == 1)/p0_fit,
      w1_nn = I(att == 0)/(1 - p0_fit) * 1,
      w1_ny = I(att == 1)/(1 - p0_fit) * (1 - p1_fit)/p1_fit,
      w1_yn = I(att == 0)/p0_fit * p1_fit/(1 - p1_fit),
      w1_yy = I(att == 1)/p0_fit * 1,
      w2_000 = I(att == 0)/(1 - p0_fit) * 1 * 1,
      w2_001 = I(att == 1)/(1 - p0_fit) * 1 * (1 - p2_fit)/p2_fit,
      w2_010 = I(att == 0)/(1 - p0_fit) * (1 - p1_fit)/p1_fit * p2_fit/(1 - p2_fit),
      w2_011 = I(att == 1)/(1 - p0_fit) * (1 - p1_fit)/p1_fit * 1,
      w2_100 = I(att == 0)/p0_fit * p1_fit/(1 - p1_fit) * 1,
      w2_101 = I(att == 1)/p0_fit * p1_fit/(1 - p1_fit) * (1 - p2_fit)/p2_fit,
      w2_110 = I(att == 0)/p0_fit * 1 * p2_fit/(1 - p2_fit),
      w2_111 = I(att == 1)/p0_fit * 1 * 1,
    )
    
    #################################################
    # Outcome models
    #################################################
    
    mu2_sl <- SuperLearner(
      Y          = aux$vote2010,
      X          = aux_mu2,
      family     = binomial(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE),
      cvControl  = list(V = 5L, shuffle = TRUE, validRows = NULL)
    )
    
    df$mu2_fit_a3n <- predict.SuperLearner(mu2_sl, newdata = df_mu2n)$pred
    df$mu2_fit_a3y <- predict.SuperLearner(mu2_sl, newdata = df_mu2y)$pred
    
    mu1_sl_a3n <- SuperLearner(
      Y          = df$mu2_fit_a3n[-cf_fold[[k]]],
      X          = aux_mu1,
      family     = gaussian(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE),
      cvControl  = list(V = 5L, shuffle = TRUE, validRows = NULL)
    )
    
    mu1_sl_a3y <- SuperLearner(
      Y          = df$mu2_fit_a3y[-cf_fold[[k]]],
      X          = aux_mu1,
      family     = gaussian(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE),
      cvControl  = list(V = 5L, shuffle = TRUE, validRows = NULL)
    )
    
    df$mu1_fit_a3n_a2n <- predict.SuperLearner(mu1_sl_a3n, newdata = df_mu1n)$pred
    df$mu1_fit_a3n_a2y <- predict.SuperLearner(mu1_sl_a3n, newdata = df_mu1y)$pred
    df$mu1_fit_a3y_a2n <- predict.SuperLearner(mu1_sl_a3y, newdata = df_mu1n)$pred
    df$mu1_fit_a3y_a2y <- predict.SuperLearner(mu1_sl_a3y, newdata = df_mu1y)$pred
    
    mu0_sl_a3n_a2n <- SuperLearner(
      Y          = df$mu1_fit_a3n_a2n[-cf_fold[[k]]],
      X          = aux_mu0,
      family     = gaussian(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE),
      cvControl  = list(V = 5L, shuffle = TRUE, validRows = NULL)
    )
    
    mu0_sl_a3n_a2y <- SuperLearner(
      Y          = df$mu1_fit_a3n_a2y[-cf_fold[[k]]],
      X          = aux_mu0,
      family     = gaussian(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE),
      cvControl  = list(V = 5L, shuffle = TRUE, validRows = NULL)
    )
    
    mu0_sl_a3y_a2n <- SuperLearner(
      Y          = df$mu1_fit_a3y_a2n[-cf_fold[[k]]],
      X          = aux_mu0,
      family     = gaussian(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE),
      cvControl  = list(V = 5L, shuffle = TRUE, validRows = NULL)
    )
    
    mu0_sl_a3y_a2y <- SuperLearner(
      Y          = df$mu1_fit_a3y_a2y[-cf_fold[[k]]],
      X          = aux_mu0,
      family     = gaussian(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE),
      cvControl  = list(V = 5L, shuffle = TRUE, validRows = NULL)
    )
    
    df$mu0_fit_a3n_a2n_a1n <- predict.SuperLearner(mu0_sl_a3n_a2n, newdata = df_mu0n)$pred
    df$mu0_fit_a3n_a2n_a1y <- predict.SuperLearner(mu0_sl_a3n_a2n, newdata = df_mu0y)$pred
    
    df$mu0_fit_a3n_a2y_a1n <- predict.SuperLearner(mu0_sl_a3n_a2y, newdata = df_mu0n)$pred
    df$mu0_fit_a3n_a2y_a1y <- predict.SuperLearner(mu0_sl_a3n_a2y, newdata = df_mu0y)$pred
    
    df$mu0_fit_a3y_a2n_a1n <- predict.SuperLearner(mu0_sl_a3y_a2n, newdata = df_mu0n)$pred
    df$mu0_fit_a3y_a2n_a1y <- predict.SuperLearner(mu0_sl_a3y_a2n, newdata = df_mu0y)$pred
    
    df$mu0_fit_a3y_a2y_a1n <- predict.SuperLearner(mu0_sl_a3y_a2y, newdata = df_mu0n)$pred
    df$mu0_fit_a3y_a2y_a1y <- predict.SuperLearner(mu0_sl_a3y_a2y, newdata = df_mu0y)$pred
    
    #################################################
    # Targeted MLE
    #################################################
    
    # targeted mu2
    
    mu2_tmle_000 <- glm(vote2010 ~ 0 + offset(I(qlogis(mu2_fit_a3n))) + w2_000, family = binomial(), weights = weight, data = df[cf_fold[[k]], ])
    mu2_tmle_001 <- glm(vote2010 ~ 0 + offset(I(qlogis(mu2_fit_a3y))) + w2_001, family = binomial(), weights = weight, data = df[cf_fold[[k]], ])
    mu2_tmle_010 <- glm(vote2010 ~ 0 + offset(I(qlogis(mu2_fit_a3n))) + w2_010, family = binomial(), weights = weight, data = df[cf_fold[[k]], ])
    mu2_tmle_011 <- glm(vote2010 ~ 0 + offset(I(qlogis(mu2_fit_a3y))) + w2_011, family = binomial(), weights = weight, data = df[cf_fold[[k]], ])
    mu2_tmle_100 <- glm(vote2010 ~ 0 + offset(I(qlogis(mu2_fit_a3n))) + w2_100, family = binomial(), weights = weight, data = df[cf_fold[[k]], ])
    mu2_tmle_101 <- glm(vote2010 ~ 0 + offset(I(qlogis(mu2_fit_a3y))) + w2_101, family = binomial(), weights = weight, data = df[cf_fold[[k]], ])
    mu2_tmle_110 <- glm(vote2010 ~ 0 + offset(I(qlogis(mu2_fit_a3n))) + w2_110, family = binomial(), weights = weight, data = df[cf_fold[[k]], ])
    mu2_tmle_111 <- glm(vote2010 ~ 0 + offset(I(qlogis(mu2_fit_a3y))) + w2_111, family = binomial(), weights = weight, data = df[cf_fold[[k]], ])
    
    df$mu2b_fit_000 <- predict(mu2_tmle_000, newdata = df, type = "response")
    df$mu2b_fit_001 <- predict(mu2_tmle_001, newdata = df, type = "response")
    df$mu2b_fit_010 <- predict(mu2_tmle_010, newdata = df, type = "response")
    df$mu2b_fit_011 <- predict(mu2_tmle_011, newdata = df, type = "response")
    df$mu2b_fit_100 <- predict(mu2_tmle_100, newdata = df, type = "response")
    df$mu2b_fit_101 <- predict(mu2_tmle_101, newdata = df, type = "response")
    df$mu2b_fit_110 <- predict(mu2_tmle_110, newdata = df, type = "response")
    df$mu2b_fit_111 <- predict(mu2_tmle_111, newdata = df, type = "response")
    
    # refit mu1
    
    mu1b_sl_000 <- SuperLearner(
      Y          = df$mu2b_fit_000[-cf_fold[[k]]],
      X          = aux_mu1,
      family     = gaussian(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE),
      cvControl  = list(V = 5L, shuffle = TRUE, validRows = NULL)
    )
    
    mu1b_sl_001 <- SuperLearner(
      Y          = df$mu2b_fit_001[-cf_fold[[k]]],
      X          = aux_mu1,
      family     = gaussian(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE),
      cvControl  = list(V = 5L, shuffle = TRUE, validRows = NULL)
    )
    
    mu1b_sl_010 <- SuperLearner(
      Y          = df$mu2b_fit_010[-cf_fold[[k]]],
      X          = aux_mu1,
      family     = gaussian(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE),
      cvControl  = list(V = 5L, shuffle = TRUE, validRows = NULL)
    )
    
    mu1b_sl_011 <- SuperLearner(
      Y          = df$mu2b_fit_011[-cf_fold[[k]]],
      X          = aux_mu1,
      family     = gaussian(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE),
      cvControl  = list(V = 5L, shuffle = TRUE, validRows = NULL)
    )
    
    mu1b_sl_100 <- SuperLearner(
      Y          = df$mu2b_fit_100[-cf_fold[[k]]],
      X          = aux_mu1,
      family     = gaussian(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE),
      cvControl  = list(V = 5L, shuffle = TRUE, validRows = NULL)
    )
    
    mu1b_sl_101 <- SuperLearner(
      Y          = df$mu2b_fit_101[-cf_fold[[k]]],
      X          = aux_mu1,
      family     = gaussian(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE),
      cvControl  = list(V = 5L, shuffle = TRUE, validRows = NULL)
    )
    
    mu1b_sl_110 <- SuperLearner(
      Y          = df$mu2b_fit_110[-cf_fold[[k]]],
      X          = aux_mu1,
      family     = gaussian(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE),
      cvControl  = list(V = 5L, shuffle = TRUE, validRows = NULL)
    )
    
    mu1b_sl_111 <- SuperLearner(
      Y          = df$mu2b_fit_111[-cf_fold[[k]]],
      X          = aux_mu1,
      family     = gaussian(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE),
      cvControl  = list(V = 5L, shuffle = TRUE, validRows = NULL)
    )
    
    df$mu1_fit_000 <- predict.SuperLearner(mu1b_sl_000, newdata = df_mu1n)$pred
    df$mu1_fit_001 <- predict.SuperLearner(mu1b_sl_001, newdata = df_mu1n)$pred
    df$mu1_fit_010 <- predict.SuperLearner(mu1b_sl_010, newdata = df_mu1y)$pred
    df$mu1_fit_011 <- predict.SuperLearner(mu1b_sl_011, newdata = df_mu1y)$pred
    df$mu1_fit_100 <- predict.SuperLearner(mu1b_sl_100, newdata = df_mu1n)$pred
    df$mu1_fit_101 <- predict.SuperLearner(mu1b_sl_101, newdata = df_mu1n)$pred
    df$mu1_fit_110 <- predict.SuperLearner(mu1b_sl_110, newdata = df_mu1y)$pred
    df$mu1_fit_111 <- predict.SuperLearner(mu1b_sl_111, newdata = df_mu1y)$pred
    
    # targeted mu1
    
    mu1_tmle_000 <- lm(mu2b_fit_000 ~ 0 + offset(mu1_fit_000) + w1_nn, weights = weight, data = df[cf_fold[[k]], ])
    mu1_tmle_001 <- lm(mu2b_fit_001 ~ 0 + offset(mu1_fit_001) + w1_nn, weights = weight, data = df[cf_fold[[k]], ])
    mu1_tmle_010 <- lm(mu2b_fit_010 ~ 0 + offset(mu1_fit_010) + w1_ny, weights = weight, data = df[cf_fold[[k]], ])
    mu1_tmle_011 <- lm(mu2b_fit_011 ~ 0 + offset(mu1_fit_011) + w1_ny, weights = weight, data = df[cf_fold[[k]], ])
    mu1_tmle_100 <- lm(mu2b_fit_100 ~ 0 + offset(mu1_fit_100) + w1_yn, weights = weight, data = df[cf_fold[[k]], ])
    mu1_tmle_101 <- lm(mu2b_fit_101 ~ 0 + offset(mu1_fit_101) + w1_yn, weights = weight, data = df[cf_fold[[k]], ])
    mu1_tmle_110 <- lm(mu2b_fit_110 ~ 0 + offset(mu1_fit_110) + w1_yy, weights = weight, data = df[cf_fold[[k]], ])
    mu1_tmle_111 <- lm(mu2b_fit_111 ~ 0 + offset(mu1_fit_111) + w1_yy, weights = weight, data = df[cf_fold[[k]], ])
    
    df$mu1b_fit_000 <- predict(mu1_tmle_000, newdata = df)
    df$mu1b_fit_001 <- predict(mu1_tmle_001, newdata = df)
    df$mu1b_fit_010 <- predict(mu1_tmle_010, newdata = df)
    df$mu1b_fit_011 <- predict(mu1_tmle_011, newdata = df)
    df$mu1b_fit_100 <- predict(mu1_tmle_100, newdata = df)
    df$mu1b_fit_101 <- predict(mu1_tmle_101, newdata = df)
    df$mu1b_fit_110 <- predict(mu1_tmle_110, newdata = df)
    df$mu1b_fit_111 <- predict(mu1_tmle_111, newdata = df)
    
    # refit mu0
    
    mu0b_sl_000 <- SuperLearner(
      Y          = df$mu1b_fit_000[-cf_fold[[k]]],
      X          = aux_mu0,
      family     = gaussian(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE),
      cvControl  = list(V = 5L, shuffle = TRUE, validRows = NULL)
    )
    
    mu0b_sl_001 <- SuperLearner(
      Y          = df$mu1b_fit_001[-cf_fold[[k]]],
      X          = aux_mu0,
      family     = gaussian(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE),
      cvControl  = list(V = 5L, shuffle = TRUE, validRows = NULL)
    )
    
    mu0b_sl_010 <- SuperLearner(
      Y          = df$mu1b_fit_010[-cf_fold[[k]]],
      X          = aux_mu0,
      family     = gaussian(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE),
      cvControl  = list(V = 5L, shuffle = TRUE, validRows = NULL)
    )
    
    mu0b_sl_011 <- SuperLearner(
      Y          = df$mu1b_fit_011[-cf_fold[[k]]],
      X          = aux_mu0,
      family     = gaussian(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE),
      cvControl  = list(V = 5L, shuffle = TRUE, validRows = NULL)
    )
    
    mu0b_sl_100 <- SuperLearner(
      Y          = df$mu1b_fit_100[-cf_fold[[k]]],
      X          = aux_mu0,
      family     = gaussian(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE),
      cvControl  = list(V = 5L, shuffle = TRUE, validRows = NULL)
    )
    
    mu0b_sl_101 <- SuperLearner(
      Y          = df$mu1b_fit_101[-cf_fold[[k]]],
      X          = aux_mu0,
      family     = gaussian(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE),
      cvControl  = list(V = 5L, shuffle = TRUE, validRows = NULL)
    )
    
    mu0b_sl_110 <- SuperLearner(
      Y          = df$mu1b_fit_110[-cf_fold[[k]]],
      X          = aux_mu0,
      family     = gaussian(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE),
      cvControl  = list(V = 5L, shuffle = TRUE, validRows = NULL)
    )
    
    mu0b_sl_111 <- SuperLearner(
      Y          = df$mu1b_fit_111[-cf_fold[[k]]],
      X          = aux_mu0,
      family     = gaussian(),
      obsWeights = aux$weight,
      SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"),
      control    = list(saveFitLibrary = TRUE),
      cvControl  = list(V = 5L, shuffle = TRUE, validRows = NULL)
    )
    
    df$mu0_fit_000 <- predict.SuperLearner(mu0b_sl_000, newdata = df_mu0n)$pred
    df$mu0_fit_001 <- predict.SuperLearner(mu0b_sl_001, newdata = df_mu0n)$pred
    df$mu0_fit_010 <- predict.SuperLearner(mu0b_sl_010, newdata = df_mu0y)$pred
    df$mu0_fit_011 <- predict.SuperLearner(mu0b_sl_011, newdata = df_mu0y)$pred
    df$mu0_fit_100 <- predict.SuperLearner(mu0b_sl_100, newdata = df_mu0n)$pred
    df$mu0_fit_101 <- predict.SuperLearner(mu0b_sl_101, newdata = df_mu0n)$pred
    df$mu0_fit_110 <- predict.SuperLearner(mu0b_sl_110, newdata = df_mu0y)$pred
    df$mu0_fit_111 <- predict.SuperLearner(mu0b_sl_111, newdata = df_mu0y)$pred
    
    # targeted mu0
    
    mu0_tmle_000 <- lm(mu1b_fit_000 ~ 0 + offset(mu0_fit_000) + w0_n, weights = weight, data = df[cf_fold[[k]], ])
    mu0_tmle_001 <- lm(mu1b_fit_001 ~ 0 + offset(mu0_fit_001) + w0_n, weights = weight, data = df[cf_fold[[k]], ])
    mu0_tmle_010 <- lm(mu1b_fit_010 ~ 0 + offset(mu0_fit_010) + w0_n, weights = weight, data = df[cf_fold[[k]], ])
    mu0_tmle_011 <- lm(mu1b_fit_011 ~ 0 + offset(mu0_fit_011) + w0_n, weights = weight, data = df[cf_fold[[k]], ])
    mu0_tmle_100 <- lm(mu1b_fit_100 ~ 0 + offset(mu0_fit_100) + w0_y, weights = weight, data = df[cf_fold[[k]], ])
    mu0_tmle_101 <- lm(mu1b_fit_101 ~ 0 + offset(mu0_fit_101) + w0_y, weights = weight, data = df[cf_fold[[k]], ])
    mu0_tmle_110 <- lm(mu1b_fit_110 ~ 0 + offset(mu0_fit_110) + w0_y, weights = weight, data = df[cf_fold[[k]], ])
    mu0_tmle_111 <- lm(mu1b_fit_111 ~ 0 + offset(mu0_fit_111) + w0_y, weights = weight, data = df[cf_fold[[k]], ])
    
    df$mu0b_fit_000 <- predict(mu0_tmle_000, newdata = df)
    df$mu0b_fit_001 <- predict(mu0_tmle_001, newdata = df)
    df$mu0b_fit_010 <- predict(mu0_tmle_010, newdata = df)
    df$mu0b_fit_011 <- predict(mu0_tmle_011, newdata = df)
    df$mu0b_fit_100 <- predict(mu0_tmle_100, newdata = df)
    df$mu0b_fit_101 <- predict(mu0_tmle_101, newdata = df)
    df$mu0b_fit_110 <- predict(mu0_tmle_110, newdata = df)
    df$mu0b_fit_111 <- predict(mu0_tmle_111, newdata = df)
    
    main_list[[k]] <- df[cf_fold[[k]], ]
  }
  
  main_df <- reduce(main_list, bind_rows)
  
  for (s in 1:S){
    
    a1 <- estimands$a1[[s]]
    a2 <- estimands$a2[[s]]
    a3 <- estimands$a3[[s]]
    
    main_df <- main_df %>%
      mutate(
        
        wt0_deno = a1 * p0_fit + (1 - a1) * (1 - p0_fit),
        wt1_nume = a1 * p1_fit + (1 - a1) * (1 - p1_fit),
        wt1_deno = a2 * p1_fit + (1 - a2) * (1 - p1_fit),
        wt2_nume = a2 * p2_fit + (1 - a2) * (1 - p2_fit),
        wt2_deno = a3 * p2_fit + (1 - a3) * (1 - p2_fit),
        
        !!sym(paste0("w0_", a1, a2, a3)) := 1/trim(wt0_deno),
        !!sym(paste0("w1_", a1, a2, a3)) := !!sym(paste0("w0_", a1, a2, a3)) * wt1_nume/trim(wt1_deno),
        !!sym(paste0("w2_", a1, a2, a3)) := !!sym(paste0("w1_", a1, a2, a3)) * wt2_nume/trim(wt2_deno),
        
        !!sym(paste0("mu2fit_", a1, a2, a3)) := a3 * mu2_fit_a3y + (1 - a3) * mu2_fit_a3n,
        
        !!sym(paste0("mu1fit_", a1, a2, a3)) := a3 * a2 * mu1_fit_a3y_a2y + a3 * (1 - a2) * mu1_fit_a3y_a2n +
          (1 - a3) * a2 * mu1_fit_a3n_a2y + (1 - a3) * (1 - a2) * mu1_fit_a3n_a2n,
        
        !!sym(paste0("mu0fit_", a1, a2, a3)) := a3 * a2 * a1 * mu0_fit_a3y_a2y_a1y + a3 * a2 * (1 - a1) * mu0_fit_a3y_a2y_a1n +
          a3 * (1 - a2) * a1 * mu0_fit_a3y_a2n_a1y + a3 * (1 - a2) * (1 - a1) * mu0_fit_a3y_a2n_a1n +
          (1 - a3) * a2 * a1 * mu0_fit_a3n_a2y_a1y + (1 - a3) * a2 * (1 - a1) * mu0_fit_a3n_a2y_a1n +
          (1 - a3) * (1 - a2) * a1 * mu0_fit_a3n_a2n_a1y + (1 - a3) * (1 - a2) * (1 - a1) * mu0_fit_a3n_a2n_a1n,
        
        !!sym(paste0("www_", a1, a2, a3)) := as.double(att==a3) * !!sym(paste0("w2_", a1, a2, a3)) * vote2010,
        !!sym(paste0("iii_", a1, a2, a3)) := !!sym(paste0("mu0fit_", a1, a2, a3)),
        !!sym(paste0("eif_", a1, a2, a3)) := as.double(att==a3) * !!sym(paste0("w2_", a1, a2, a3)) *
          (vote2010 - !!sym(paste0("mu2fit_", a1, a2, a3))) +
          as.double(att==a2) * !!sym(paste0("w1_", a1, a2, a3)) * (!!sym(paste0("mu2fit_", a1, a2, a3)) -
                                                                         !!sym(paste0("mu1fit_", a1, a2, a3))) +
          as.double(att==a1) * !!sym(paste0("w0_", a1, a2, a3)) * (!!sym(paste0("mu1fit_", a1, a2, a3)) -
                                                                         !!sym(paste0("mu0fit_", a1, a2, a3))) +
          !!sym(paste0("mu0fit_", a1, a2, a3)),
        !!sym(paste0("tmle_", a1, a2, a3)) := as.double(att==a3) * !!sym(paste0("w2_", a1, a2, a3)) *
          (vote2010 - !!sym(paste0("mu2b_fit_", a1, a2, a3))) +
          as.double(att==a2) * !!sym(paste0("w1_", a1, a2, a3)) * (!!sym(paste0("mu2b_fit_", a1, a2, a3)) -
                                                                         !!sym(paste0("mu1b_fit_", a1, a2, a3))) +
          as.double(att==a1) * !!sym(paste0("w0_", a1, a2, a3)) * (!!sym(paste0("mu1b_fit_", a1, a2, a3)) -
                                                                         !!sym(paste0("mu0b_fit_", a1, a2, a3))) +
          !!sym(paste0("mu0b_fit_", a1, a2, a3)),
        !!sym(paste0("tmle2_", a1, a2, a3)) := !!sym(paste0("mu0b_fit_", a1, a2, a3))
      )
  }
  
  out[[i]] <- main_df
  
}

save.image(file = "05_vote_main.RData")

3.3 输出与绘制估计结果

我们同样载入所需包、导入数据、设定绘图主题后,根据之前的估计结果计算得到 ATE 与其分解所得的 PSEs 的估计值与标准误,并通过 ggplot 绘制图形。

rm(list = ls())
library(survey)
library(gbm)
library(ranger)
library(glmnet)
library(rsample)
library(caret)
library(rlang)
library(tidyverse)
library(Hmisc)
library(SuperLearner)
library(scales)
source("zmisc.R")
set.seed(02138)

load("05_vote_main.RData")

# set my ggplot theme
mytheme <- theme_minimal(base_size = 18) + 
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(color = "grey30"))
theme_set(mytheme)

out_df <- out %>%
  imap( ~ mutate(.x, imp = .y)) %>%
  reduce(bind_rows) %>%
  mutate(eif_type1_ate = eif_111 - eif_000,
         eif_type2_ate = eif_111 - eif_000,
         eif_type3_ate = eif_111 - eif_000,
         eif_type1_pse3 = eif_001 - eif_000,
         eif_type1_pse2 = eif_011 - eif_001,
         eif_type1_pse1 = eif_111 - eif_011,
         eif_type2_pse1 = eif_100 - eif_000,
         eif_type2_pse2 = eif_110 - eif_100,
         eif_type2_pse3 = eif_111 - eif_110,
         eif_type3_pse3 = eif_001 - eif_000,
         eif_type3_pse1 = eif_101 - eif_001,
         eif_type3_pse2 = eif_111 - eif_101,
         tmle_type1_ate = tmle_111 - tmle_000,
         tmle_type2_ate = tmle_111 - tmle_000,
         tmle_type3_ate = tmle_111 - tmle_000,
         tmle_type1_pse3 = tmle_001 - tmle_000,
         tmle_type1_pse2 = tmle_011 - tmle_001,
         tmle_type1_pse1 = tmle_111 - tmle_011,
         tmle_type2_pse1 = tmle_100 - tmle_000,
         tmle_type2_pse2 = tmle_110 - tmle_100,
         tmle_type2_pse3 = tmle_111 - tmle_110,
         tmle_type3_pse3 = tmle_001 - tmle_000,
         tmle_type3_pse1 = tmle_101 - tmle_001,
         tmle_type3_pse2 = tmle_111 - tmle_101) %>%
  group_by(imp) %>%
  summarise_at(vars(contains("type")), list(est = ~ wtd.mean(.x, weight),
                                            se = ~ sqrt(wtd.var(.x, weight)/length(.x)))) %>%
  ungroup() %>%
  pivot_longer(-imp) %>%
  separate(name, into = c("estimator", "type", "estimand", "measure")) %>%
  pivot_wider(names_from = measure, values_from = value) %>%
  group_by(estimator, type, estimand) %>%
  summarise(within_var = mean(se^2),
            between_var = var(est),
            total_var = within_var + (1 + 1/I)*between_var,
            est = mean(est),
            se = sqrt(total_var)) %>%
  ungroup() %>%
  filter(type == "type1")   %>%
  mutate(estimator = factor(estimator,
                            levels = c("eif", "tmle"),
                            labels = c(expression(paste("Estimating Equation (", hat(psi)[np]^eif2, ")")),
                                       expression(paste("TMLE (", hat(psi)[tmle]^eif2, ")")))),
         estimand = factor(estimand,
                           levels = rev(c("ate", "pse3", "pse2", "pse1")),
                           labels = rev(c(expression(paste("Total Effect (", psi[`111`]-psi[`000`], ")")),
                                          expression(paste("Direct Effect (", psi[`001`]-psi[`000`], ")")),
                                          expression(paste("via Civic/Political Interest (", psi[`011`]-psi[`001`], ")")),
                                          expression(paste("via Economic Status (", psi[`111`]-psi[`011`], ")")))))) 

ggplot(out_df, aes(x = estimand, y = est, shape = estimator)) +
  geom_pointrange(aes(ymin = est - 1.96 * se,  ymax = est + 1.96 * se),
                  position = position_dodge(width = - 0.5), size = 1) +
  geom_hline(yintercept = 0, linetype = 2) +
  scale_shape("", labels = parse_format()) +
  scale_color_discrete("", labels = parse_format()) +
  scale_x_discrete("", labels = parse_format()) +
  scale_y_continuous("Effects of College Attendance on Voting") +
  coord_flip()

out_df[, c("est", "se")] %>% round(3)

4. 估计结果解读

R 的最终估计结果如下所示:

est se
0.153 0.023
0.008 0.005
0.045 0.008
0.1 0.021
0.157 0.022
0.002 0.005
0.051 0.008
0.103 0.021

我们将其与原文表格 (如下所示,括号中为标准误) 进行对比,可见结果是一致的。

估计方程 (θ^a¯np,eif2) TMLE (θ^a¯tmle,eif2)
平均总效应 0.152 (0.022) 0.156 (0.023)
经济状况路径 (A  M1  Y) 0.007 (0.005) 0.002 (0.005)
公民/政治兴趣路径 (A  M2  Y) 0.042 (0.008) 0.049 (0.008)
直接效应 (A  Y) 0.103 (0.021) 0.105 (0.021)

此外,我们还通过 R 将以上结果可视化:

在理论篇中,我们知道本文的多重稳健估计量可以通过两种方法来估计,一是调整结果模型的估计方程,使包含逆概率权重的项等于零,只留下一个位于估计对象参数位置的迭代回归输入估计量,即表中的 θ^a¯np,eif2。二是使用目标极大似然估计方法,通过两步拟合每个结果模型,即表中的 θ^a¯tmle,eif2。从上表可知,这两个估计量均得到了类似的总效应和 PSEs 的估计值。大学入学率对选举中投票的效应有很大一部分是 “直接的”,也就是说,既不是通过提高经济状况路径发挥作用,也不是通过公民和政治兴趣路径发挥作用。

5. 结语

本推文对 K(1) 个因果有序中介设定下 ATE 与其各类分解——自然直接效应、自然间接效应\总间接效应、自然路径特定效应、累积路径特定效应——的建模与估计实操部分进行了展示。在后续拓展中,感兴趣的读者可以对此操作进行复现,在政策评估等实证研究中进行运用。

6. 参考资料

  • Pearl, J. (2009) Causality: models, reasoning, and inference. Cambridge: Cambridge University Press.
  • Robins, J.M. (2003) Semantics of causal dag models and the identification of direct and indirect effects. Highly Structured Stochastic Systems, 70–81.
  • Tchetgen Tchetgen, E.J. & Shpitser, I. (2012) Semiparametric theory for causal mediation analysis: efficiency bounds, multiple robustness, and sensitivity analysis. Annals of Statistics, 40, 1816. -PDF-
  • Avin, C., Shpitser, I. & Pearl, J. (2005) Identifiability of path-specific effects. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, Morgan Kaufmann Publishers Inc, pp. 357–363.
  • Zhou, X., & Yamamoto, T. (2023). Tracing causal paths from experimental and observational data. The Journal of Politics, 85(1), 250-265. -PDF-
  • Vansteelandt, S., Bekaert, M. & Lange, T. (2012) Imputation strategies for the estimation of natural direct and indirect effects. Epidemiologic Methods, 1, 131–158. -PDF-
  • VanderWeele, T.J., Vansteelandt, S. & Robins, J.M. (2014) Effect decomposition in the presence of an exposure-induced mediator-outcome confounder. Epidemiology, 25, 300–306. -PDF-

7. 相关推文

Note:产生如下推文列表的 Stata 命令为:
lianxh
安装最新版 lianxh 命令:
ssc install lianxh, replace

相关课程

免费公开课

最新课程-直播课

专题 嘉宾 直播/回看视频
最新专题 文本分析、机器学习、效率专题、生存分析等
研究设计 连玉君 我的特斯拉-实证研究设计-幻灯片-
面板模型 连玉君 动态面板模型-幻灯片-
面板模型 连玉君 直击面板数据模型 [免费公开课,2小时]
  • Note: 部分课程的资料,PPT 等可以前往 连享会-直播课 主页查看,下载。

课程主页

课程主页

关于我们

  • Stata连享会 由中山大学连玉君老师团队创办,定期分享实证分析经验。
  • 连享会-主页知乎专栏,700+ 推文,实证分析不再抓狂。直播间 有很多视频课程,可以随时观看。
  • 公众号关键词搜索/回复 功能已经上线。大家可以在公众号左下角点击键盘图标,输入简要关键词,以便快速呈现历史推文,获取工具软件和数据下载。常见关键词:课程, 直播, 视频, 客服, 模型设定, 研究设计, stata, plus, 绘图, 编程, 面板, 论文重现, 可视化, RDD, DID, PSM, 合成控制法

连享会小程序:扫一扫,看推文,看视频……

扫码加入连享会微信群,提问交流更方便

✏ 连享会-常见问题解答:
https://gitee.com/lianxh/Course/wikis

New! lianxh 命令发布了:
随时搜索连享会推文、Stata 资源,安装命令如下:
. ssc install lianxh
使用详情参见帮助文件 (有惊喜):
. help lianxh