Missing Values in Stata: Multiple Imputation

Published: 2020-10-28

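New! The lianxh command has been released:
Search 连享会 posts and Stata resources at any time. To install it:
. ssc install lianxh
See the help file for details (there are a few surprises):
. help lianxh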


Author: 陈滨志 (University of Birmingham, UK)
Email: Rickchen0910@163.com


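In practice, survey data are often incomplete: respondents may not finish the questionnaire, follow-up questions may go unanswered, or storage devices may fail. In statistics, imputation is the process of replacing missing data with substituted values. This article focuses on multiple imputation and its implementation in Stata.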

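1. Three main problems caused by missing data

  • Missing data can introduce substantial bias
  • They make data handling and analysis more laborious
  • They reduce the efficiency of the analysis

Because missing data can undermine an analysis, imputation is regarded as a way to avoid the pitfalls of listwise deletion. When a case is missing one or more values, most statistical packages drop the entire case by default, which may introduce bias or make the results less representative. Imputation keeps every case by replacing the missing data with estimates based on the other available information. Once all missing values have been imputed, the dataset can be analyzed with standard complete-data techniques. Scholars have proposed a number of theories to account for missing data.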

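2. Methods for handling missing values

2.1 Single imputation approach

  • Complete case analysis: only cases with complete data are analyzed. Most statistical software works this way: an observation with any missing value is dropped and the analysis runs on the remaining sample, which generally biases the sample.
  • Mean imputation: a missing value is replaced by the mean computed from the observed cases. Mean imputation weakens any correlation involving the imputed variable, because for the imputed cases there is by construction no relationship between the imputed variable and any other measured variable. It is therefore better suited to univariate analysis.
  • Regression imputation: a regression equation is estimated from the data, and for each case with a missing value the known variables are plugged into the equation to obtain the fill-in value. The problem is that the prediction contains no error term, so the imputed values lie exactly on the regression line with no residual variance. Relationships between variables are therefore overstated. The regression model only predicts the most likely value of the missing datum and conveys nothing about the uncertainty of that value.
  • Forward/backward filling (mostly for panel data and time series): because such data are ordered in time, a missing value at one time point can be filled with the value from the preceding or following time point.

2.2 Single imputation in Stata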

********************
*** Single-imputation methods ***
********************

. // Complete-case analysis //

. *- missing() function
. sysuse nlsw88.dta, clear
(NLSW, 1988 extract)

. sum

    Variable |    Obs        Mean    Std. Dev.       Min        Max
-------------+-----------------------------------------------------
      idcode |  2,246    2612.654    1480.864          1       5159
         age |  2,246    39.15316    3.060002         34         46
        race |  2,246    1.282725    .4754413          1          3
     married |  2,246    .6420303    .4795099          0          1
never_marr~d |  2,246    .1041852    .3055687          0          1
-------------+-----------------------------------------------------
       grade |  2,244    13.09893    2.521246          0         18
    collgrad |  2,246    .2368655    .4252538          0          1
       south |  2,246    .4194123    .4935728          0          1
        smsa |  2,246    .7039181    .4566292          0          1
      c_city |  2,246    .2916296    .4546139          0          1
-------------+-----------------------------------------------------
    industry |  2,232    8.189516    3.010875          1         12
  occupation |  2,237    4.642825    3.408897          1         13
       union |  1,878    .2454739    .4304825          0          1
        wage |  2,246    7.766949    5.755523   1.004952   40.74659
       hours |  2,242    37.21811    10.50914          1         80
-------------+-----------------------------------------------------
     ttl_exp |  2,246    12.53498    4.610208   .1153846   28.88461
      tenure |  2,231     5.97785    5.510331          0   25.91667

. drop if missing(grade,indus,occup,union,hours,tenure)
(398 observations deleted)

. sum

    Variable |    Obs        Mean    Std. Dev.       Min        Max
-------------+-----------------------------------------------------
      idcode |  1,848    2614.384     1486.31          1       5159
         age |  1,848    39.21429    3.041416         34         46
        race |  1,848    1.291667    .4823869          1          3
     married |  1,848    .6515152    .4766194          0          1
never_marr~d |  1,848    .1087662      .31143          0          1
-------------+-----------------------------------------------------
       grade |  1,848    13.17208    2.550548          0         18
    collgrad |  1,848    .2478355    .4318727          0          1
       south |  1,848    .4242424    .4943612          0          1
        smsa |  1,848    .7083333    .4546527          0          1
      c_city |  1,848    .2938312    .4556388          0          1
-------------+-----------------------------------------------------
    industry |  1,848    8.255952    3.042377          1         12
  occupation |  1,848     4.62013    3.479021          1         13
       union |  1,848    .2467532    .4312386          0          1
        wage |  1,848     7.60597    4.173447   1.344605   39.23074
       hours |  1,848    37.61905    9.957783          1         80
-------------+-----------------------------------------------------
     ttl_exp |  1,848    12.86178    4.576879   .4038461   28.88461
      tenure |  1,848    6.582882    5.631957          0   25.91667


. *- A more concise command: -dropmiss- (user-written command)
. sysuse nlsw88.dta, clear
(NLSW, 1988 extract)

. dropmiss, any obs
(398 observations deleted)

. sum

    Variable |    Obs        Mean    Std. Dev.       Min        Max
-------------+-----------------------------------------------------
      idcode |  1,848    2614.384     1486.31          1       5159
         age |  1,848    39.21429    3.041416         34         46
        race |  1,848    1.291667    .4823869          1          3
     married |  1,848    .6515152    .4766194          0          1
never_marr~d |  1,848    .1087662      .31143          0          1
-------------+-----------------------------------------------------
       grade |  1,848    13.17208    2.550548          0         18
    collgrad |  1,848    .2478355    .4318727          0          1
       south |  1,848    .4242424    .4943612          0          1
        smsa |  1,848    .7083333    .4546527          0          1
      c_city |  1,848    .2938312    .4556388          0          1
-------------+-----------------------------------------------------
    industry |  1,848    8.255952    3.042377          1         12
  occupation |  1,848     4.62013    3.479021          1         13
       union |  1,848    .2467532    .4312386          0          1
        wage |  1,848     7.60597    4.173447   1.344605   39.23074
       hours |  1,848    37.61905    9.957783          1         80
-------------+-----------------------------------------------------
     ttl_exp |  1,848    12.86178    4.576879   .4038461   28.88461
      tenure |  1,848    6.582882    5.631957          0   25.91667

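The descriptive statistics show that dropmiss, any obs removes the same 398 observations as the explicit drop if missing() call. Note that dropmiss is a user-written command and must be installed before it can be used.

. // Mean imputation //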
. sysuse nlsw88.dta, clear
(NLSW, 1988 extract)

. summarize

    Variable |    Obs        Mean    Std. Dev.       Min        Max
-------------+-----------------------------------------------------
      idcode |  2,246    2612.654    1480.864          1       5159
         age |  2,246    39.15316    3.060002         34         46
        race |  2,246    1.282725    .4754413          1          3
     married |  2,246    .6420303    .4795099          0          1
never_marr~d |  2,246    .1041852    .3055687          0          1
-------------+-----------------------------------------------------
       grade |  2,244    13.09893    2.521246          0         18
    collgrad |  2,246    .2368655    .4252538          0          1
       south |  2,246    .4194123    .4935728          0          1
        smsa |  2,246    .7039181    .4566292          0          1
      c_city |  2,246    .2916296    .4546139          0          1
-------------+-----------------------------------------------------
    industry |  2,232    8.189516    3.010875          1         12
  occupation |  2,237    4.642825    3.408897          1         13
       union |  1,878    .2454739    .4304825          0          1
        wage |  2,246    7.766949    5.755523   1.004952   40.74659
       hours |  2,242    37.21811    10.50914          1         80
-------------+-----------------------------------------------------
     ttl_exp |  2,246    12.53498    4.610208   .1153846   28.88461
      tenure |  2,231     5.97785    5.510331          0   25.91667

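. summarize grade, meanonly

. replace grade = r(mean) if grade==.
variable grade was byte now float
(2 real changes made)

summarize ignores missing values when computing the mean, so the stored result r(mean) can be used directly as the replacement value. After a summarize of several variables, r() holds the results for the last variable only (tenure here), which is why grade is summarized on its own immediately before the replace.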
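A slightly more defensive variant of the same idea, as a sketch: it leaves grade untouched, writes the filled values to a new variable, and keeps a flag marking which observations were imputed. The names grade_miss and grade_mean are made up for this illustration.

```stata
sysuse nlsw88, clear

* Flag the observations that will be imputed (hypothetical variable name)
generate byte grade_miss = missing(grade)

* r(mean) now refers to grade, computed from the observed values only
summarize grade, meanonly

* Filled copy: the observed value where available, otherwise the mean
generate double grade_mean = cond(missing(grade), r(mean), grade)

* Compare the original and the mean-filled variable
summarize grade grade_mean
```

In the final summarize, the standard deviation of grade_mean is slightly smaller than that of grade, which is the loss of variability that section 2.1 warns about.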

. // Regression imputation //
.
. sysuse nlsw88.dta, clear
(NLSW, 1988 extract)

. sum

    Variable |    Obs        Mean    Std. Dev.       Min        Max
-------------+-----------------------------------------------------
      idcode |  2,246    2612.654    1480.864          1       5159
         age |  2,246    39.15316    3.060002         34         46
        race |  2,246    1.282725    .4754413          1          3
     married |  2,246    .6420303    .4795099          0          1
never_marr~d |  2,246    .1041852    .3055687          0          1
-------------+-----------------------------------------------------
       grade |  2,244    13.09893    2.521246          0         18
    collgrad |  2,246    .2368655    .4252538          0          1
       south |  2,246    .4194123    .4935728          0          1
        smsa |  2,246    .7039181    .4566292          0          1
      c_city |  2,246    .2916296    .4546139          0          1
-------------+-----------------------------------------------------
    industry |  2,232    8.189516    3.010875          1         12
  occupation |  2,237    4.642825    3.408897          1         13
       union |  1,878    .2454739    .4304825          0          1
        wage |  2,246    7.766949    5.755523   1.004952   40.74659
       hours |  2,242    37.21811    10.50914          1         80
-------------+-----------------------------------------------------
     ttl_exp |  2,246    12.53498    4.610208   .1153846   28.88461
      tenure |  2,231     5.97785    5.510331          0   25.91667

.
. reg wage grade hours

   Source |      SS           df       MS      Number of obs =   2,240
----------+---------------------------------   F(2, 2237)    =  157.14
    Model | 9149.60544         2  4574.80272   Prob > F      =  0.0000
 Residual | 65124.5132     2,237  29.1124333   R-squared     =  0.1232
----------+---------------------------------   Adj R-squared =  0.1224
    Total | 74274.1187     2,239   33.172898   Root MSE      =  5.3956
----------------------------------------------------------------------
     wage |     Coef.  Std. Err.     t    P>|t|   [95% Conf. Interval]
----------+-----------------------------------------------------------
    grade |  .7176616  .0454271   15.80   0.000    .628578    .8067452
    hours |   .072271  .0108872    6.64   0.000   .0509209    .0936211
    _cons | -4.315149  .6995475   -6.17   0.000  -5.686979   -2.943319
----------------------------------------------------------------------


. list wage hours if grade==.

      +------------------+
      |     wage   hours |
      |------------------|
 496. | 7.045088      40 |
2210. | 4.146536      40 |
      +------------------+

. replace grade = (wage[496]+_b[_cons] - _b[hours]*hours[496])/_b[grade] ///
          if (wage == wage[496]&grade==.&hours==hours[496])

. replace grade = (wage[2210]+_b[_cons] - _b[hours]*hours[2210])/_b[grade] ///
          if (wage == wage[2210]&grade==.&hours==hours[2210])

. sum grade

 Variable |    Obs      Mean   Std. Dev.       Min   Max
----------+---------------------------------------------
    grade |  2,246  13.08527   2.562064  -4.263086    18

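Regression imputation is more involved: first estimate the coefficients with reg, then use list to locate the observations with missing values, and finally compute the fill-in values from the stored coefficients _b[]. Two caveats about this log. Solving wage = _b[_cons] + _b[grade]*grade + _b[hours]*hours for grade requires subtracting _b[_cons], whereas the commands above add it, which is why the minimum of grade becomes -4.263086. The row numbers 496 and 2210 are also valid only for the current sort order.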
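The more common form of regression imputation models the incomplete variable directly as the outcome and fills in its fitted values. A minimal sketch, assuming the same two predictors (wage and hours) are appropriate; grade_hat and grade_reg are hypothetical names introduced here:

```stata
sysuse nlsw88, clear

* Model the incomplete variable as the outcome, using the complete cases
regress grade wage hours

* Fitted values exist wherever wage and hours are observed
predict double grade_hat, xb

* Fill only the missing entries; the original variable stays intact
generate double grade_reg = grade
replace grade_reg = grade_hat if missing(grade)

list idcode wage hours grade grade_reg if missing(grade)
```

The filled values sit exactly on the regression line, so they carry no residual noise. That is the limitation described in section 2.1, and it is what mi impute regress addresses by adding random draws.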

*== Forward and backward filling ==
*- Forward fill
. sysuse nlsw88, clear
(NLSW, 1988 extract)

. sum grade

    Variable |   Obs      Mean    Std. Dev.   Min    Max
-------------+------------------------------------------
       grade | 2,244  13.09893    2.521246      0     18

. sort grade

. replace grade = grade[_n-1] if mi(grade)
(2 real changes made)

. sum grade

    Variable |    Obs       Mean    Std. Dev.   Min   Max
-------------+-------------------------------------------
       grade |  2,246   13.10329    2.524361      0    18

*- Backward fill
. sysuse nlsw88, clear
(NLSW, 1988 extract)

. sum grade

    Variable |    Obs      Mean    Std. Dev.   Min   Max
-------------+------------------------------------------
       grade |  2,244  13.09893    2.521246      0    18

. sort grade

. replace grade = grade[_n+1] if mi(grade)
(0 real changes made)

. sum grade

    Variable |    Obs      Mean   Std. Dev.   Min   Max
-------------+-----------------------------------------
       grade |  2,244  13.09893   2.521246      0    18
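In the two logs above, sort grade places the two missing values last. Forward filling therefore copies the largest observed value (18) into both of them, since replace works through the observations in order and the second missing value picks up the one just filled. Backward filling changes nothing because no observation follows them. On cross-sectional data the sort order carries no meaning, so the technique is only sensible when observations are ordered in time. A toy sketch with a made-up series (t, x, x_ffill, x_bfill and negt are hypothetical names):

```stata
clear
input t x
1 10
2 .
3 .
4 13
5 .
end

* Forward fill: walk forward in time, carrying the last value along
sort t
generate x_ffill = x
replace x_ffill = x_ffill[_n-1] if missing(x_ffill)

* Backward fill: reverse the time order, then carry values the same way
generate negt = -t
sort negt
generate x_bfill = x
replace x_bfill = x_bfill[_n-1] if missing(x_bfill)

sort t
list t x x_ffill x_bfill
```

Here x_ffill fills t = 2, 3 with 10 and t = 5 with 13, while x_bfill fills t = 2, 3 with 13 and leaves t = 5 missing because nothing follows it.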

*- Panel data fill
. use "http://www.stata-press.com/data/r13/nlswork", clear
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)

. misstable sum
                                                        Obs<.
                                     +--------------------------
           |                         | Unique
  Variable |  Obs=.   Obs>.   Obs<.  | values    Min         Max
-----------+-------------------------+--------------------------
       age |     24          28,510  |     33     14          46
       msp |     16          28,518  |      2      0           1
   nev_mar |     16          28,518  |      2      0           1
     grade |      2          28,532  |     19      0          18
  not_smsa |      8          28,526  |      2      0           1
    c_city |      8          28,526  |      2      0           1
     south |      8          28,526  |      2      0           1
  ind_code |    341          28,193  |     12      1          12
  occ_code |    121          28,413  |     13      1          13
     union |  9,296          19,238  |      2      0           1
    wks_ue |  5,704          22,830  |     61      0          76
    tenure |    433          28,101  |    270      0    25.91667
     hours |     67          28,467  |     85      1         168
  wks_work |    703          27,831  |    105      0         104
----------------------------------------------------------------

. xtset idcode year
   panel variable:  idcode (unbalanced)
    time variable:  year, 68 to 88, but with gaps
            delta:  1 unit

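. by idcode: replace grade = grade[_n+1] if mi(grade) // why was nothing filled in?
(0 real changes made)

Nothing was filled in the panel example because the individuals with a missing grade have only one year of data, so there is no previous or next period to borrow a value from.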
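The same point can be seen on a small made-up panel. The sketch below fills within each unit only, in both directions, and shows that a unit with a single observation cannot be filled. All variable names (id, year, x, x_f, x_b, negyear) are hypothetical.

```stata
clear
input id year x
1 2001 5
1 2002 .
1 2003 7
2 2001 .
2 2002 .
2 2003 4
3 2001 .
end

* Forward fill within each panel, ordered by year
generate x_f = x
bysort id (year): replace x_f = x_f[_n-1] if missing(x_f)

* Backward fill within each panel: order by descending year instead
generate negyear = -year
generate x_b = x
bysort id (negyear): replace x_b = x_b[_n-1] if missing(x_b)

sort id year
list id year x x_f x_b, sepby(id)
```

Under by, the subscript [_n-1] is relative to the group, so values never leak from one id to the next. Unit 3 has a single observation and stays missing in both directions, which mirrors what happened with grade above.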

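2.3 Multiple imputation approach

Most of the single imputation methods above generate only one replacement value for each missing entry. Rubin (1987) developed multiple imputation (MI), a flexible, simulation-based statistical technique for handling missing data. As an imputation technique, MI has two main features:

  • It relies on multiple complete-data analyses carried out with existing statistical methods.
  • It separates the imputation process from the analysis process.

2.3.1 The three steps of multiple imputation:

  1. Imputation step: choose an imputation model and, by simulation, generate M completed datasets.
  2. Estimation step: fit the analysis model separately on each of the M completed datasets, m = 1, 2, ..., M.
  3. Pooling step: combine the M sets of estimates into a single multiple-imputation result.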
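The pooling step is usually carried out with Rubin's combining rules. For a scalar parameter they can be written as follows, where the notation is introduced here for illustration and does not come from the original post: $\hat{Q}_m$ is the estimate from the m-th completed dataset and $\hat{U}_m$ its estimated variance.

```latex
\bar{Q} = \frac{1}{M}\sum_{m=1}^{M}\hat{Q}_m, \qquad
\bar{U} = \frac{1}{M}\sum_{m=1}^{M}\hat{U}_m, \qquad
B = \frac{1}{M-1}\sum_{m=1}^{M}\bigl(\hat{Q}_m-\bar{Q}\bigr)^2, \qquad
T = \bar{U} + \Bigl(1+\frac{1}{M}\Bigr)B
```

$\bar{Q}$ is the pooled point estimate, $\bar{U}$ the within-imputation variance, $B$ the between-imputation variance, and $T$ the total variance whose square root is the pooled standard error.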

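2.3.2 Assumptions about the missing-data mechanism:

  1. Missing completely at random (MCAR): the missingness of x is unrelated to the observed values of other variables and unrelated to the unobserved values of x itself.

    Consider a study comparing blood pressure treatments. Suppose some subjects move to another area, so their follow-up blood pressure measurements are not collected. As long as the decision to move is unrelated to anything in the study, these missing measurements can be treated as MCAR.

  2. Missing at random (MAR): the missingness of x is unrelated to the unobserved values of x but related to the observed values of other variables.

    In the same study, suppose some subjects withdraw because of severe side effects from the high dose they were assigned. The missing blood pressure measurements are then unlikely to be MCAR, because subjects on the higher dose are more likely than those on the lower dose to suffer severe side effects and hence to withdraw. The missingness depends on the dose received, so it is MAR.

  3. Missing not at random (MNAR): the missingness of x is related to the unobserved values of x itself, and may also be related to the observed values of other variables.

    In the same study, if subjects with extremely high blood pressure are withdrawn for ethical reasons, the missingness of the blood pressure measurements is not MAR: the measurements are missing precisely for the subjects whose unobserved values are very high.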
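The three mechanisms can be made concrete with a small simulation. The sketch below invents a trial in the spirit of the blood pressure example; every number in it (sample size, dose effect, dropout rates, the seed) is an arbitrary assumption chosen only for illustration.

```stata
clear
set seed 20201028
set obs 1000

* Hypothetical trial: about half the subjects receive the high dose
generate byte highdose = runiform() < 0.5
generate double bp = 140 - 10*highdose + rnormal(0, 12)

* MCAR: every measurement has the same 20% chance of being lost
generate double bp_mcar = bp if runiform() >= 0.20

* MAR: dropout depends only on the observed dose (10% vs 40%)
generate double bp_mar = bp if runiform() >= 0.10 + 0.30*highdose

* MNAR: dropout probability rises with the unobserved bp value itself
generate double bp_mnar = bp if runiform() >= invlogit((bp - 150)/5)

misstable summarize bp_mcar bp_mar bp_mnar

* Under MAR the missingness is predictable from the observed dose
generate byte miss_mar = missing(bp_mar)
tabulate highdose miss_mar, row
```

In real data only the analogue of the last tabulation is available: missingness can be related to observed variables, but MAR can never be distinguished from MNAR using the observed data alone.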

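3. Multiple imputation in Stata

3.1 Univariate multiple imputation

Next, we illustrate these ideas in Stata. We first load the data and run a baseline regression. We use the fictional heart attack data from the Stata help mi documentation for univariate imputation; the variables are described below:

. use "http://www.stata-press.com/data/r15/mheart0", clear
*- or
. webuse "mheart0", clear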

. describe

Contains data from http://www.stata-press.com/data/r15/mheart0.dta
  obs:           154      Fictional heart attack data; bmi missing
 vars:             9      19 Jun 2016 10:50
 size:         2,310
------------------------------------------------
            value
variable    label    variable label
------------------------------------------------
attack               Outcome (heart attack)
smokes               Current smoker
age                  Age, in years
bmi                  Body Mass Index, kg/m^2
female               Gender
hsgrad               High school graduate
marstatus   mar      Marital status: single, married, divorced
alcohol     alc      Alcohol consumption: none, <2 drinks/day, >=2 drinks/day
hightar              Smokes high tar cigarettes
------------------------------------------------

 

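The summary statistics from summarize show that bmi has missing values (which missing-data assumption does the missingness in bmi satisfy?). The logit output shows that the regression uses only the 132 complete cases and drops the observations with missing values; only smokes and bmi are significant at the 5% level.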

. summarize

    Variable |  Obs       Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------
      attack |  154   .4480519    .4989166          0          1
      smokes |  154   .4155844    .4944304          0          1
         age |  154   56.48829    11.73051   20.73613   87.14446
         bmi |  132   25.24136    4.027137   17.22643   38.24214
      female |  154   .2467532    .4325285          0          1
-------------+--------------------------------------------------
      hsgrad |  154   .7532468    .4325285          0          1
   marstatus |  154   1.941558    .8183916          1          3
     alcohol |  154   1.181818    .6309506          0          2
     hightar |  154   .2077922     .407051          0          1


. logit attack smokes age bmi hsgrad female

Iteration 0:   log likelihood = -91.359017
Iteration 1:   log likelihood = -79.374749
Iteration 2:   log likelihood = -79.342218
Iteration 3:   log likelihood =  -79.34221

Logistic regression                       Number of obs    =        132
                                          LR chi2(5)       =      24.03
                                          Prob > chi2      =     0.0002
Log likelihood =  -79.34221               Pseudo R2        =     0.1315

-----------------------------------------------------------------------
  attack |     Coef.   Std. Err.     z    P>|z|    [95% Conf. Interval]
---------+-------------------------------------------------------------
  smokes |  1.544053   .3998329    3.86   0.000    .7603945    2.327711
     age |   .026112    .017042    1.53   0.125   -.0072898    .0595137
     bmi |  .1129938   .0500061    2.26   0.024    .0149837     .211004
  hsgrad |  .4048251   .4446019    0.91   0.363   -.4665786    1.276229
  female |  .2255301   .4527558    0.50   0.618   -.6618549    1.112915
   _cons | -5.408398   1.810603   -2.99   0.003   -8.957115    -1.85968
-----------------------------------------------------------------------

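  The misstable command gives a quick overview of the number and pattern of missing values. misstable summarize shows that bmi is the only variable with missing values, and misstable patterns shows that bmi is missing in 14% of the observations.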

. misstable summarize
                                                  Obs<.
                                   +------------------------------
          |                        | Unique
 Variable |  Obs=.   Obs>.  Obs<.  | values        Min         Max
----------+------------------------+------------------------------
      bmi |     22            132  |    132   17.22643    38.24214
------------------------------------------------------------------

. misstable patterns

   Missing-value patterns
     (1 means complete)

              |   Pattern
    Percent   |  1
  ------------+-------------
       86%    |  1
              |
       14     |  0
  ------------+-------------
      100%    |

  Variables are  (1) bmi

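  Next, we use mi set and mi register to declare the variable to be imputed, and then call mi impute regress for univariate imputation, which fills in the missing values of a continuous variable using a Gaussian linear regression model. Here we create 20 imputations. Only bmi has missing values, so registering age as imputed is not required; doing so merely produces the note shown in the output.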

. mi set wide


. mi register imputed bmi age


. mi impute regress bmi attack smokes age hsgrad female, add(20) rseed(2232)

note: variable age registered as imputed and used to model variable bmi;
this may cause some observations to be omitted from the estimation and may lead to missing imputed values

Univariate imputation                 Imputations =    20
Linear regression                           added =    20
Imputed: m=1 through m=20                 updated =     0

---------------------------------------------------------
          |               Observations per m
          |----------------------------------------------
 Variable |   Complete   Incomplete   Imputed |     Total
----------+-----------------------------------+----------
      bmi |        132           22        22 |       154
---------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)

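  The table shows that all 22 missing values were filled in across the 20 imputations. To check individual imputations, we can use mi xeq; here we inspect the original data (m=0) and the 1st and 20th imputed datasets.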

. mi xeq 0 1 20: summarize bmi

m=0 data:
-> summarize bmi

    Variable |  Obs      Mean    Std. Dev.       Min        Max
-------------+-------------------------------------------------
         bmi |  132  25.24136    4.027137   17.22643   38.24214

m=1 data:
-> summarize bmi

    Variable |  Obs      Mean    Std. Dev.       Min        Max
-------------+-------------------------------------------------
         bmi |  154  25.28134    3.969649   17.22643   38.24214

m=20 data:
-> summarize bmi

    Variable |  Obs      Mean    Std. Dev.       Min        Max
-------------+-------------------------------------------------
         bmi |  154  25.30992     4.05665   16.44644   38.24214

  Finally, we fit the regression with mi estimate to see whether multiple imputation changes the results.

. mi estimate, dots:logit attack smokes age bmi hsgrad female

Imputations (20):
  .........10.........20 done

Multiple-imputation estimates         Imputations       =         20
Logistic regression                   Number of obs     =        154
                                      Average RVI       =     0.0611
                                      Largest FMI       =     0.2518
DF adjustment:   Large sample         DF:     min       =     311.30
                                              avg       = 116,139.89
                                              max       = 252,553.06
Model F test:       Equal FMI         F(   5,19590.7)   =       3.52
Within VCE type:          OIM         Prob > F          =     0.0035

--------------------------------------------------------------------
 attack |     Coef.   Std. Err.    t    P>|t|   [95% Conf. Interval]
--------+-----------------------------------------------------------
 smokes |  1.222431   .3608138   3.39   0.001   .5152409     1.92962
    age |  .0358403   .0154631   2.32   0.020   .0055329    .0661476
    bmi |  .1094125   .0518803   2.11   0.036   .0073322    .2114929
 hsgrad |  .1740094   .4055789   0.43   0.668  -.6209156    .9689344
 female | -.0985455   .4191946  -0.24   0.814  -.9201594    .7230684
  _cons | -5.625926   1.782136  -3.16   0.002  -9.124984   -2.126867
--------------------------------------------------------------------

Compared with the logit regression on the original data, the multiple-imputation estimates show age as significant at the 5% level.
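To see how much of each coefficient's uncertainty comes from the imputation itself, mi estimate can also report a variance decomposition through its vartable option. The sketch below repeats the imputation so that it runs on its own; it registers only bmi, as the appendix does.

```stata
webuse mheart0, clear
mi set wide
mi register imputed bmi
mi impute regress bmi attack smokes age hsgrad female, add(20) rseed(2232)

* vartable adds a table of within- and between-imputation variances,
* together with RVI, FMI and relative efficiency per coefficient
mi estimate, vartable: logit attack smokes age bmi hsgrad female
```

A large between-imputation variance relative to the within-imputation variance for a coefficient indicates that its estimate is sensitive to the missing data, and is a reason to consider more imputations.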

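3.2 Multivariate multiple imputation

When several variables have missing values, mi set and mi register are again used to declare the imputation variables (the dataset used below is already mi set, as mi describe shows). mi impute mvn performs multivariate imputation of continuous variables, filling in one or more of them with a multivariate normal regression model; mi impute chained can also impute discrete variables. Here we create 10 imputations.

. // Multivariate multiple imputation //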
. use https://www.stata-press.com/data/r16/mheart5s0, clear
(Fictional heart attack data)

.
. mi describe

  Style:  mlong
          last mi update 19apr2019 14:00:11, 222 days ago

  Obs.:   complete          126
          incomplete         28  (M = 0 imputations)
          ---------------------
          total             154

  Vars.:  imputed:  2; bmi(28) age(12)

          passive:  0

          regular:  4; attack smokes female hsgrad

          system:   3; _mi_m _mi_id _mi_miss

         (there are no unregistered variables)

.
. mi misstable patterns

   Missing-value patterns
     (1 means complete)

              |   Pattern
    Percent   |  1  2
  ------------+-------------
       82%    |  1  1
              |
       10     |  1  0
        8     |  0  0
  ------------+-------------
      100%    |

  Variables are  (1) age  (2) bmi

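In this dataset two variables have missing values: bmi has 28 and age has 12. In 8% of the observations both variables are missing, and the missingness follows a monotone pattern: age is never missing unless bmi is also missing, so the observations with missing age are a subset of those with missing bmi.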
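Monotonicity can also be checked directly rather than read off the pattern table. A short sketch using the same dataset:

```stata
webuse mheart5s0, clear

* Reports which variables' missing values are nested within others
mi misstable nested

* Direct check: no observation has age missing while bmi is observed
count if missing(age) & !missing(bmi)
```

A count of zero confirms that the pattern is monotone in the order age, bmi, which is the order in which the variables are passed to mi impute monotone.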

With a monotone missing-data pattern, mi impute monotone, mi impute mvn, and mi impute chained can all be used:

. mi impute monotone (regress) age bmi = attack smokes hsgrad female, add(10)

Conditional models:
               age: regress age attack smokes hsgrad female
               bmi: regress bmi age attack smokes hsgrad female


Multivariate imputation               Imputations =    10
Monotone method                             added =    10
Imputed: m=1 through m=10                 updated =     0

               age: linear regression
               bmi: linear regression

---------------------------------------------------------
          |               Observations per m
          |----------------------------------------------
 Variable |   Complete   Incomplete   Imputed |     Total
----------+-----------------------------------+----------
      age |        142           12        12 |       154
      bmi |        126           28        28 |       154
---------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)

.
. mi impute mvn age bmi = attack smokes hsgrad female, replace nolog

Multivariate imputation                   Imputations =  10
Multivariate normal regression                  added =   0
Imputed: m=1 through m=10                     updated =  10

Prior: uniform                            Iterations = 1000
                                             burn-in =  100
                                             between =  100

-----------------------------------------------------------
            |               Observations per m
            |----------------------------------------------
   Variable |   Complete   Incomplete   Imputed |     Total
------------+-----------------------------------+----------
        age |        142           12        12 |       154
        bmi |        126           28        28 |       154
-----------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)

.
. mi impute chained (regress) age bmi = attack smokes hsgrad female, replace
note: missing-value pattern is monotone; no iteration performed

Conditional models (monotone):
               age: regress age attack smokes hsgrad female
               bmi: regress bmi age attack smokes hsgrad female

Performing chained iterations ...

Multivariate imputation                  Imputations = 10
Chained equations                              added =  0
Imputed: m=1 through m=10                    updated = 10

Initialization: monotone                  Iterations =  0
                                             burn-in =  0

               age: linear regression
               bmi: linear regression

---------------------------------------------------------
          |               Observations per m
          |----------------------------------------------
 Variable |   Complete   Incomplete   Imputed |     Total
----------+-----------------------------------+----------
      age |        142           12        12 |       154
      bmi |        126           28        28 |       154
---------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)
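The log stops after the imputation step. As in section 3.1, the analysis model would then be fitted with mi estimate. A sketch for the chained-equations case, run from a fresh copy of the data; the rseed() value is an assumption added here for reproducibility and does not appear in the commands above.

```stata
webuse mheart5s0, clear

* Impute age and bmi by chained equations, 10 imputations
mi impute chained (regress) age bmi = attack smokes hsgrad female, ///
    add(10) rseed(2232)

* Fit the analysis model on each imputed dataset and pool the results
mi estimate, dots: logit attack smokes age bmi hsgrad female
```

The same mi estimate call works unchanged after mi impute monotone or mi impute mvn, because the analysis step is separate from the imputation step.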

4. References and further reading

  • Acock, A., n.d. A Gentle Introduction to Stata.
  • Andersen, E., 1997. Introduction To The Statistical Analysis Of Categorical Data. Berlin: Springer.
  • Campion, W. and Rubin, D., 1989. Multiple Imputation for Nonresponse in Surveys. Journal of Marketing Research, 26(4), p.485.
  • Lee, K. and Carlin, J., 2010. Multiple Imputation for Missing Data: Fully Conditional Specification Versus Multivariate Normal Imputation. American Journal of Epidemiology, 171(5), pp.624-632.

5. Related posts

Note: a list of related posts can be generated with the command: lianxh 机器学习, m
To install the latest version of lianxh: ssc install lianxh, replace

6. Appendix: complete Stata code


webdoc init Example, replace logall plain md
********************
***** Single-imputation methods *****
********************
// Complete-case analysis //

*- missing() function
 sysuse nlsw88.dta, clear
 sum
 drop if missing(grade,indus,occup,union,hours,tenure)
 sum


*- A more concise command: -dropmiss- (user-written command)
 sysuse nlsw88.dta, clear
 dropmiss, any obs  // this may be what we need
 sum


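// Mean imputation //
 sysuse nlsw88.dta, clear
 summarize
 summarize grade, meanonly   // r(mean) must refer to grade, not the last variable
 replace grade = r(mean) if grade==.

// Regression imputation //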

  sysuse nlsw88.dta, clear
  sum

  reg wage grade hours

  list wage hours if grade==.
  * solve wage = _b[_cons] + _b[grade]*grade + _b[hours]*hours for grade
  replace grade = (wage[496] - _b[_cons] - _b[hours]*hours[496])/_b[grade] ///
  if (wage == wage[496]&grade==.&hours==hours[496])
  replace grade = (wage[2210] - _b[_cons] - _b[hours]*hours[2210])/_b[grade] ///
  if (wage == wage[2210]&grade==.&hours==hours[2210])
  sum grade
// Forward and backward filling
*- Forward fill
    sysuse nlsw88, clear
    sum grade
    sort grade
    replace grade = grade[_n-1] if mi(grade)
    sum grade
*- Backward fill
    sysuse nlsw88, clear
    sum grade
    sort grade
    replace grade = grade[_n+1] if mi(grade)
    sum grade
*- Panel data fill
use http://www.stata-press.com/data/r13/nlswork,clear

misstable sum

xtset idcode year

by idcode: replace grade = grade[_n+1] if mi(grade) // why was nothing filled in?

********************
***** Multiple-imputation methods *****
********************

// Univariate multiple imputation //
use http://www.stata-press.com/data/r15/mheart0,clear

describe

summarize

logit attack smokes age bmi hsgrad female

misstable summarize

misstable patterns

mi set wide

mi register imputed bmi

mi impute regress bmi attack smokes age hsgrad female, add(20) rseed(2232)

mi xeq 0 1 20: summarize bmi

mi estimate, dots:logit attack smokes age bmi hsgrad female



// Multivariate multiple imputation //
use https://www.stata-press.com/data/r16/mheart5s0, clear

mi describe

mi misstable patterns

mi impute monotone (regress) age bmi = attack smokes hsgrad female, add(10)

mi impute mvn age bmi = attack smokes hsgrad female, replace nolog

mi impute chained (regress) age bmi = attack smokes hsgrad female, replace
