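New! The lianxh command has been released. It searches lianxh (连享会) posts and Stata resources at any time. To install:
. ssc install lianxh
See the help file for details (it contains a few surprises):
. help lianxh
Other new lianxh commands: cnssc, ihelp, rdbalance, gitee, installpkg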
Author: Jiayin Meng (孟佳音), University College London
Email: jiayin.meng.20@ucl.ac.uk
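The first three posts in this series covered the theoretical framework for multiple imputation of cross-sectional data. This post works through examples to make the practical steps concrete. The examples use data from the English Longitudinal Study of Ageing (ELSA). For convenience we have tidied the raw data, which can be loaded directly with the following commands: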
cnssc install lxhuse, lianxh replace
lxhuse elsa1.dta, clear
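MICE (multiple imputation by chained equations) is the most widely used multiple-imputation method. It does not rely on an assumption about the joint distribution of the incomplete variables, and each variable can be modeled according to its own type, for example logistic regression for binary variables and linear regression for continuous variables.
We first load the data and explore it to get an overall picture.
. lxhuse elsa1.dta, clear // load the data
. * overview of the data
. describe
. summarize
. codebook, compact // compact summary of each variable (observations, unique values, range)
. * inspect the missing data
. misstable summarize, gen(_m)
. tab1 _m* // counts and percentages of missing values
. misstable pattern // missing-data patterns (percent), with the variable order listed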
Missing-value patterns
(1 means complete)
| Pattern
Percent | 1 2 3 4 5 6 7 8 9 10
------------+-----------------------------------
53% | 1 1 1 1 1 1 1 1 1 1
|
3 | 1 1 1 1 1 1 1 0 1 1
3 | 1 1 1 1 1 1 1 1 1 0
3 | 1 0 1 1 1 1 1 1 1 1
3 | 1 1 1 1 0 1 1 1 1 1
3 | 1 1 1 1 1 1 1 1 0 1
3 | 1 1 0 1 1 1 1 1 1 1
3 | 1 1 1 0 1 1 1 1 1 1
3 | 1 1 1 1 1 0 1 1 1 1
3 | 0 1 1 1 1 1 1 1 1 1
2 | 1 1 1 1 1 1 0 1 1 1
......
<1 | 1 1 1 1 1 1 0 0 0 0
<1 | 1 1 1 1 1 1 0 0 1 0
------------+-----------------------------------
100% |
Variables are (1) cfme1 (2) intm1 (3) cfex1 (4) inty1 (5) marital1
(6) limitil1 (7) cigst1 (8) physact1
(9) wealth1 (10) income1
. misstable pattern, freq // missing-data patterns (frequencies)
Missing-value patterns
(1 means complete)
| Pattern
Frequency | 1 2 3 4 5 6 7 8 9 10
------------+-----------------------------------
2,130 | 1 1 1 1 1 1 1 1 1 1
|
114 | 1 1 1 1 1 1 1 0 1 1
114 | 1 1 1 1 1 1 1 1 1 0
112 | 1 0 1 1 1 1 1 1 1 1
110 | 1 1 1 1 0 1 1 1 1 1
109 | 1 1 1 1 1 1 1 1 0 1
106 | 1 1 0 1 1 1 1 1 1 1
106 | 1 1 1 0 1 1 1 1 1 1
105 | 1 1 1 1 1 0 1 1 1 1
104 | 0 1 1 1 1 1 1 1 1 1
99 | 1 1 1 1 1 1 0 1 1 1
......
1 | 1 1 1 1 1 1 0 0 0 0
1 | 1 1 1 1 1 1 0 0 1 0
------------+-----------------------------------
4,034 |
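In the output of misstable pattern and misstable pattern, freq, a 1 means the variable is observed and a 0 means it is missing. Across the variables that contain missing data, 53% of respondents (2,130 of 4,034 individuals) have complete data. The numbers 1 to 10 along the top of each table stand for the 10 variables with missing values; Stata lists their order beneath the first table.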
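The share of complete cases can also be checked directly. The lines below are a minimal sketch, assuming elsa1.dta is still in memory and has not yet been mi set; nmiss is a hypothetical helper variable that is not part of the original data and is dropped at the end.

```stata
* Number of missing values per respondent across the ten incomplete variables
* (variable list taken from the misstable pattern output above)
egen nmiss = rowmiss(cfme1 intm1 cfex1 inty1 marital1 limitil1 ///
    cigst1 physact1 wealth1 income1)
tabulate nmiss

* Respondents with no missing value on any of the ten variables
count if nmiss == 0
display "Share of complete cases: " %5.3f r(N)/_N

* Remove the helper so the dataset is unchanged for the steps that follow
drop nmiss
```

If the data match the tables above, the count should agree with the first row of the frequency table (2,130 respondents, about 53%).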
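As an example, we study quality of life among older adults. The dependent variable is qol1 (quality of life). The covariates are sex1 (sex), educ1 (education level), age1 (age), marital1 (marital status), cigst1 (smoking status), physact1 (physical activity), limitil1 (long-standing illness), cfme1 (memory score), cfex1 (executive function score), wealth1 (wealth rank), income1 (income), and depression1 (depression status).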
. * define the regression model
. regress qol1 sex1 i.educ1 age1 marital1 cigst1 i.physact1 ///
limitil1 cfme1 cfex1 i.wealth1 income1 depression1
. * declare the mi data style
. mi set mlong
. * register the complete variables and the variables to impute
. mi register regular sex1 educ1 age1 depression1 qol1 // regression variables without missing values
. mi register imputed marital1 cigst1 physact1 limitil1 ///
cfme1 cfex1 wealth1 income1 // variables to be imputed
. * impute with MICE
. * add(20) adds 20 imputations, rseed() sets the random-number seed,
. * savetrace() saves the means and standard deviations of the imputed values
. mi impute chained (regress) cfme1 cfex1 income1 (logit, augment) ///
marital1 limitil1 (ologit, augment) cigst1 physact1 ///
wealth1 = sex1 educ1 age1 depression1 qol1, add(20) ///
rseed(54321) savetrace(trace2, replace)
. * inspect the imputed data
. mi convert wide, clear
. sum *_marital1 *_cigst1 *_physact1 *_limitil1 *_cfme1 *_cfex1 *_wealth1 *_income1
. * fit the regression model to the imputed data
. mi estimate: regress qol1 sex1 i.educ1 age1 marital1 cigst1 i.physact1 ///
limitil1 cfme1 cfex1 i.wealth1 income1 depression1
Multiple-imputation estimates Imputations = 20
Linear regression Number of obs = 4,034
Average RVI = 0.0715
Largest FMI = 0.1849
Complete DF = 4016
DF adjustment: Small sample DF: min = 488.22
avg = 2,112.13
max = 3,909.00
Model F test: Equal FMI F( 17, 3777.9) = 75.26
Within VCE type: OLS Prob > F = 0.0000
------------------------------------------------------------------------------
qol1 | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
sex1 | 0.681 0.273 2.49 0.013 0.145 1.216
|
educ1 |
Alevel | 0.465 0.307 1.52 0.129 -0.136 1.067
College | -0.299 0.406 -0.74 0.461 -1.096 0.497
|
age1 | 0.012 0.018 0.70 0.485 -0.022 0.047
marital1 | -0.917 0.331 -2.77 0.006 -1.566 -0.268
cigst1 | -0.436 0.203 -2.15 0.032 -0.834 -0.038
|
physact1 |
moderate | -0.611 0.316 -1.93 0.054 -1.231 0.010
sedentary | -2.038 0.359 -5.67 0.000 -2.743 -1.334
|
limitil1 | -4.414 0.326 -13.52 0.000 -5.054 -3.773
cfme1 | 0.103 0.036 2.90 0.004 0.034 0.173
cfex1 | 0.227 0.042 5.36 0.000 0.144 0.310
|
wealth1 |
2 | 0.568 0.543 1.05 0.296 -0.498 1.634
3 | 1.896 0.521 3.64 0.000 0.874 2.918
4 | 2.239 0.533 4.20 0.000 1.193 3.286
5 | 3.285 0.565 5.81 0.000 2.175 4.395
|
income1 | 0.000 0.000 1.52 0.129 -0.000 0.001
depression1 | -7.683 0.422 -18.20 0.000 -8.511 -6.855
_cons | 36.258 1.615 22.44 0.000 33.091 39.425
------------------------------------------------------------------------------
. * display the variance information for the imputations
. mi estimate, vartable nocitable
Multiple-imputation estimates Imputations = 20
Linear regression
Variance information
------------------------------------------------------------------------------
| Imputation variance Relative
| Within Between Total RVI FMI efficiency
-------------+----------------------------------------------------------------
sex1 | .07322 .001416 .074707 .020299 .019946 .999004
|
educ1 |
Alevel | .093147 .000818 .094006 .009222 .009151 .999543
College | .162459 .002303 .164877 .014886 .014698 .999266
|
age1 | .000307 5.1e-06 .000313 .017553 .01729 .999136
marital1 | .104103 .005172 .109533 .052161 .049846 .997514
cigst1 | .036572 .004332 .041121 .124378 .111817 .99444
|
physact1 |
moderate | .094144 .005556 .099978 .061964 .058714 .997073
sedentary | .118573 .00994 .12901 .088023 .081574 .995938
|
limitil1 | .095011 .010948 .106507 .120987 .109074 .994576
cfme1 | .001166 .000096 .001267 .086023 .079855 .996023
cfex1 | .001645 .000139 .001791 .088789 .08223 .995905
|
wealth1 |
2 | .240861 .051044 .294457 .22252 .184942 .990838
3 | .238176 .031578 .271333 .139212 .123636 .993856
4 | .246077 .036388 .284285 .155268 .136106 .993241
5 | .263382 .053286 .319332 .212431 .177947 .991181
|
income1 | 7.2e-08 7.5e-09 7.9e-08 .10945 .099623 .995044
depression1 | .176289 .001921 .178306 .011441 .011331 .999434
_cons | 2.52367 .081843 2.60961 .034052 .033057 .99835
------------------------------------------------------------------------------
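The columns of the variance table above read as follows. Within is the within-imputation variance, averaged over the 20 imputed datasets. Between is the between-imputation variance of the coefficient estimates. Total combines the two. RVI is the relative increase in variance due to missing data. FMI is the fraction of missing information. Relative efficiency compares the 20 imputations used here with an infinite number of imputations.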
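The relationships between these columns follow Rubin's combining rules. The block below is a sketch in which W, B, T, r and λ are shorthand introduced here (they are not Stata output names) and M = 20 is the number of imputations. The FMI line is only the large-degrees-of-freedom approximation; Stata applies a degrees-of-freedom correction, so the printed FMI differs slightly from r/(1+r).

```latex
\begin{aligned}
T &= W + \left(1 + \tfrac{1}{M}\right) B, \\
\mathrm{RVI} &= r = \frac{(1 + 1/M)\,B}{W}, \\
\mathrm{FMI} &= \lambda \approx \frac{r}{1 + r} \quad \text{(large degrees of freedom)}, \\
\text{Relative efficiency} &= \left(1 + \frac{\lambda}{M}\right)^{-1}.
\end{aligned}
% Check with the sex1 row (M = 20, W = .07322, B = .001416, FMI = .019946):
%   T   = .07322 + 1.05 * .001416 = .074707
%   RVI = 1.05 * .001416 / .07322 = .0203
%   RE  = 1 / (1 + .019946 / 20)  = .999004
```

The sex1 row reproduces the printed Total and relative efficiency. The small gap in RVI (.0203 against .020299) comes from rounding in the displayed Within and Between values.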
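Imputation diagnostics are a necessary step in multiple imputation and are used to judge whether the imputation model is reasonable. One can compare the means, frequencies and box plots of the observed and imputed values, or inspect residual and outlier plots for each imputed dataset. Convergence of the imputation model should also be checked by drawing a trace plot for each imputed variable and seeing whether its mean and variance are stable across iterations. The following commands draw the trace plot:
* load the trace data saved during imputation
use trace2, clear
* reshape the data to wide format
reshape wide *mean *sd, i(iter) j(m)
* draw the trace plot
tsset iter
tsline income1_mean*
The trace plot of the mean of income1 shows a fairly stable trajectory.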
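The same trace file can be used to look at the standard deviation and at other imputed variables. This is a minimal sketch, assuming trace2.dta was written by the savetrace(trace2, replace) option above; the graph names income_sd and cfme_mean are hypothetical labels chosen here.

```stata
* Reload and reshape the trace data so the block runs on its own
use trace2, clear
reshape wide *mean *sd, i(iter) j(m)
tsset iter

* One line per imputation chain; the legend is suppressed because there are 20 chains
tsline income1_sd*, legend(off) ytitle("SD of imputed income1") ///
    name(income_sd, replace)
tsline cfme1_mean*, legend(off) ytitle("Mean of imputed cfme1") ///
    name(cfme_mean, replace)

* Show both panels side by side
graph combine income_sd cfme_mean
```

Lines that drift steadily upward or downward over the iterations would suggest that the chains have not converged and that more burn-in iterations are needed.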
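In Part 3 of this series, Missing Values and Multiple Imputation (III), we explained that predictive mean matching (PMM) can be used when the imputation model is nonlinear. The example below shows the steps in practice.
We again use the ELSA data, this time to study the relationship between qol1 (quality of life) and cfex1 (executive function score), adding the square of cfex1 to the regression.
. * load the data
. lxhuse elsa1, clear
. gen int cfex1sq = cfex1^2
(294 missing values generated)
. * define the regression model
. regress qol1 cfex1 cfex1sq i.educ1 age1 marital1 cigst1 physact1 ///
limitil1 cfme1 i.wealth1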
Source | SS df MS Number of obs = 2,502
-------------+---------------------------------- F(14, 2487) = 38.80
Model | 40861.7906 14 2918.69933 Prob > F = 0.0000
Residual | 187068.466 2,487 75.2185226 R-squared = 0.1793
-------------+---------------------------------- Adj R-squared = 0.1747
Total | 227930.256 2,501 91.1356482 Root MSE = 8.6729
------------------------------------------------------------------------------
qol1 | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
cfex1 | 1.100 0.354 3.11 0.002 0.406 1.793
cfex1sq | -0.023 0.010 -2.40 0.017 -0.042 -0.004
|
educ1 |
Alevel | 0.477 0.409 1.17 0.244 -0.325 1.279
College | -0.362 0.552 -0.66 0.512 -1.444 0.721
|
age1 | 0.003 0.024 0.14 0.888 -0.043 0.050
marital1 | -1.466 0.415 -3.53 0.000 -2.279 -0.653
cigst1 | -0.700 0.264 -2.65 0.008 -1.217 -0.182
physact1 | -1.218 0.228 -5.33 0.000 -1.666 -0.770
limitil1 | -5.204 0.406 -12.82 0.000 -6.000 -4.408
cfme1 | 0.075 0.046 1.63 0.104 -0.015 0.166
|
wealth1 |
2 | 0.488 0.666 0.73 0.464 -0.817 1.793
3 | 1.856 0.663 2.80 0.005 0.555 3.156
4 | 2.475 0.679 3.65 0.000 1.143 3.806
5 | 3.409 0.692 4.92 0.000 2.051 4.767
|
_cons | 30.048 3.608 8.33 0.000 22.972 37.124
------------------------------------------------------------------------------
. * inspect the missing data
. misstable sum
Because cfex1sq is obtained by transforming cfex1, a variable that will be imputed, cfex1sq is called a passive variable and must be registered with mi register passive.
. * declare the mi data style
. mi set mlong
. * register the complete variables and the variables to impute
. mi register regular qol1 educ1 age1 // regression variables without missing values
. mi register passive cfex1sq // passive variable
. mi register imputed cfex1 marital1 cigst1 physact1 ///
limitil1 cfme1 wealth1 // variables to be imputed
Because the model contains a nonlinear relationship, we impute cfex1 with PMM:
. * build the imputation model
. mi impute chained (pmm, knn(10)) cfex1 (regress) cfme1 (logit, augment) ///
marital1 limitil1 (ologit, augment) cigst1 physact1 wealth1 = educ1 ///
age1 qol1, add(5) rseed(21)
------------------------------------------------------------------
| Observations per m
|----------------------------------------------
Variable | Complete Incomplete Imputed | Total
-------------------+-----------------------------------+----------
cfex1 | 3740 294 294 | 4034
cfme1 | 3746 288 288 | 4034
marital1 | 3735 299 299 | 4034
limitil1 | 3734 300 300 | 4034
cigst1 | 3730 304 304 | 4034
physact1 | 3707 327 327 | 4034
wealth1 | 3674 360 360 | 4034
------------------------------------------------------------------
(Complete + Incomplete = Total; Imputed is the minimum across m
of the number of filled-in observations.)
. mi passive: replace cfex1sq = cfex1^2
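Before estimating, it can help to confirm that the passive variable was refilled in every imputed dataset. The commands below are a hedged sketch, assuming the five imputations created above are in memory; m = 0 denotes the original data and m = 1 to 5 the imputed datasets.

```stata
* Summary of the mi settings: style, number of imputations, registered variables
mi describe

* Compare cfex1 and its square in the original data and two imputed datasets
mi xeq 0 1 5: summarize cfex1 cfex1sq

* Count remaining missing values of the passive variable in each imputed dataset
mi xeq 1/5: count if missing(cfex1sq)
```

If mi passive worked as intended, each count for m = 1 to 5 should be zero, while the summary for m = 0 still shows the 294 missing values.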
. * fit the regression model to the imputed data
. mi estimate: regress qol1 cfex1 cfex1sq i.educ1 age1 marital1 cigst1 ///
physact1 limitil1 cfme1 i.wealth1
Multiple-imputation estimates Imputations = 5
Linear regression Number of obs = 4,034
Average RVI = 0.0828
Largest FMI = 0.2430
Complete DF = 4019
DF adjustment: Small sample DF: min = 77.63
avg = 1,060.69
max = 3,322.66
Model F test: Equal FMI F( 14, 2657.4) = 62.38
Within VCE type: OLS Prob > F = 0.0000
------------------------------------------------------------------------------
qol1 | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
cfex1 | 0.944 0.281 3.37 0.001 0.394 1.495
cfex1sq | -0.019 0.008 -2.53 0.012 -0.034 -0.004
|
educ1 |
Alevel | 0.660 0.320 2.06 0.039 0.032 1.287
College | -0.086 0.424 -0.20 0.838 -0.917 0.744
|
age1 | 0.033 0.019 1.78 0.075 -0.003 0.069
marital1 | -1.544 0.337 -4.58 0.000 -2.206 -0.882
cigst1 | -0.502 0.224 -2.24 0.028 -0.948 -0.057
physact1 | -1.036 0.182 -5.69 0.000 -1.394 -0.679
limitil1 | -5.257 0.340 -15.44 0.000 -5.928 -4.586
cfme1 | 0.130 0.037 3.51 0.000 0.057 0.203
|
wealth1 |
2 | 1.025 0.570 1.80 0.075 -0.106 2.155
3 | 2.574 0.527 4.88 0.000 1.539 3.609
4 | 3.016 0.550 5.48 0.000 1.933 4.099
5 | 4.274 0.547 7.81 0.000 3.200 5.348
|
_cons | 27.504 2.894 9.50 0.000 21.827 33.182
------------------------------------------------------------------------------