# Stata: The Clustered Standard Error Dilemma

Source: To cluster or not to cluster

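## 1. Background

### 1.2 What happens if you should cluster but do not

a. How to set up a simulation under a given set of assumptions about the data. b. How to interpret the simulation results. c. What goes wrong when standard errors should be clustered but are not.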

## 2. Setup

### 2.1 Simulation design

``````
* Start from an empty dataset.
clear
* Consider 100 individuals,
set obs 100
* each with its own identifier.
gen id = _n
* Each individual is observed for 10 periods.
expand 10
* x and z are independent of each other, and both are standard normal,
gen x = rnormal()
gen z = rnormal()
* but z is fixed within each individual.
bysort id: replace z = z[1]
* The unobserved factors e and v are built the same way,
gen e = rnormal()
gen v = rnormal()
* with e fixed within each individual.
bysort id: replace e = e[1]
* Finally, construct the dependent variable y.
gen y = 1 + x + z + e + v
``````
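Written out, the `gen` lines above appear to describe the following data-generating process. The subscripts $i$ (individual) and $t$ (period) and the symbol $u_{it}$ are notation introduced here for illustration, not taken from the source:

$$
y_{it} = 1 + x_{it} + z_i + e_i + v_{it}, \qquad x_{it},\ z_i,\ e_i,\ v_{it} \sim N(0,1)\ \text{independently.}
$$

Because $e_i$ is shared by all ten observations of an individual, the composite error $u_{it} = e_i + v_{it}$ is correlated within `id`:

$$
\operatorname{Corr}(u_{it}, u_{is}) = \frac{\operatorname{Var}(e_i)}{\operatorname{Var}(e_i) + \operatorname{Var}(v_{it})} = \frac{1}{1 + 1} = 0.5, \qquad t \neq s.
$$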

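(1) plain OLS; (2) OLS with robust standard errors; (3) OLS with standard errors clustered by individual; (4) a random-effects panel model; (5) a fixed-effects panel model.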
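The five estimators listed above could be run on a single simulated panel roughly as follows. This is a minimal sketch, not the author's code: the seed and the stored-estimate names (`ols`, `robust`, `cluster`, `re`, `fe`) are assumptions made for illustration, and the `vce()` spelling is used in place of the older `robust` and `cluster(id)` options that appear elsewhere in this post.

```stata
* Minimal sketch: one simulated panel, five estimators.
* The seed and the stored-estimate names are illustrative assumptions.
clear
set seed 12345
set obs 100
gen id = _n
expand 10
gen x = rnormal()
gen z = rnormal()
sort id, stable
by id: replace z = z[1]
gen e = rnormal()
gen v = rnormal()
by id: replace e = e[1]
gen y = 1 + x + z + e + v
xtset id

* (1) plain OLS
regress y x z
estimates store ols
* (2) OLS with heteroskedasticity-robust standard errors
regress y x z, vce(robust)
estimates store robust
* (3) OLS with standard errors clustered by individual
regress y x z, vce(cluster id)
estimates store cluster
* (4) random effects
xtreg y x z, re
estimates store re
* (5) fixed effects (z does not vary within id, so it is omitted)
xtreg y x z, fe
estimates store fe

* Side-by-side coefficients and standard errors
estimates table ols robust cluster re fe, b(%9.4f) se(%9.4f)
```

A single draw only shows what each estimator reports; judging whether the reported standard errors are reliable requires repeating the exercise many times, which is what the simulation in this post does.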

### 2.2 The main program

``````
* Written as an "eclass" program so that the estimation results are easy to collect.
capture program drop panel_re
program panel_re, eclass
    * "clear" starts from an empty dataset each time the program is called.
    clear
    set obs 100
    gen id = _n
    expand 10
    gen x = rnormal()
    gen z = rnormal()
    sort id, stable
    by id: replace z = z[1]
    gen e = rnormal()
    gen v = rnormal()
    by id: replace e = e[1]
    gen y = 1 + x + z + e + v
    * Declare the panel structure (needed for the xtreg cases).
    xtset id
    * Estimate the model. The positional macros `1' and `2' are placeholders
    * for the estimation command and its options.
    `1' y x z, `2'
end
``````

``````
. *panel_re reg  cluster(id)
. *          ^       ^
. *          |       |
. *         `1'     `2'
. *Here reg will be argument 1
. *and cluster(id) argument 2
. panel_re reg  cluster(id)
number of observations (_N) was 0, now 100
(900 observations created)
       panel variable:  id (balanced)

Linear regression                               Number of obs     =      1,000
                                                F(2, 99)          =     355.71
                                                Prob > F          =     0.0000
                                                R-squared         =     0.4973
                                                Root MSE          =     1.4754

                                   (Std. Err. adjusted for 100 clusters in id)
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   1.033408   .0409937    25.21   0.000     .9520676    1.114748
           z |   .9983844   .0895856    11.14   0.000     .8206271    1.176142
       _cons |   .9965414   .1162236     8.57   0.000     .7659286    1.227154
------------------------------------------------------------------------------
``````
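The way positional arguments reach a program can be checked in isolation. The sketch below uses a hypothetical helper, `show_args`, which is not part of the original post; it only echoes the first two space-separated tokens and returns them in `r()`.

```stata
* Hypothetical helper, used only to show how positional arguments arrive.
capture program drop show_args
program define show_args, rclass
    * `1' and `2' hold the first and second space-separated tokens.
    display "argument 1: `1'"
    display "argument 2: `2'"
    return local first  "`1'"
    return local second "`2'"
end

show_args reg cluster(id)
```

Arguments are split on blanks, which is presumably why the post passes `cluster(id)` and `robust` as a single token each: an option that contains a space would be spread over `2'` and `3'` unless it is quoted.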

## 3. Main results

### 3.1 Comparing results across estimators

``````
. simulate _b _se, reps(100) seed(101):panel_re reg

      command:  panel_re reg

Simulations (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100

. sum, sep(3)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        _b_x |        100    1.000355    .0381124   .8919857   1.085083
        _b_z |        100    .9876579    .0994754   .7406453     1.2002
     _b_cons |        100    .9984308    .1107775   .7436068   1.238203
-------------+---------------------------------------------------------
       _se_x |        100    .0444944    .0023557   .0397996   .0505446
       _se_z |        100    .0454697    .0039679   .0382312   .0611115
    _se_cons |        100    .0446636    .0022492   .0396397   .0515558
``````

``````
. * Each indicator equals 1 when the 95% interval (b +/- 1.96*se) excludes the true value 1.
. gen is_bx_1 = !inrange(1,_b_x-_se_x*1.96,_b_x+_se_x*1.96)

. gen is_bz_1 = !inrange(1,_b_z-_se_z*1.96,_b_z+_se_z*1.96)

. gen is_bc_1 = !inrange(1,_b_c-_se_c*1.96,_b_c+_se_c*1.96)

. sum is*

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     is_bx_1 |        100         .02    .1407053          0          1
     is_bz_1 |        100         .43    .4975699          0          1
     is_bc_1 |        100         .42     .496045          0          1
``````
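The size of the problem for $z$ can be checked against a standard approximation that is not part of the source: the Moulton factor for equal-sized clusters. Here $T$ is the number of observations per cluster, $\rho_u$ the within-cluster correlation of the error, and $\rho_x$ the within-cluster correlation of the regressor; all three symbols are introduced here for illustration.

$$
\frac{V_{\text{cluster}}(\hat\beta)}{V_{\text{OLS}}(\hat\beta)} \approx 1 + (T - 1)\,\rho_x\,\rho_u
$$

With $T = 10$ and $\rho_u = 0.5$ in this design:

$$
z:\ \rho_x = 1 \ \Rightarrow\ 1 + 9 \times 0.5 = 5.5,\quad \sqrt{5.5} \approx 2.35;
\qquad
x:\ \rho_x = 0 \ \Rightarrow\ 1 .
$$

The simulation is broadly in line with this. For $z$, the standard deviation of the estimates (.0995) is about 2.2 times the average reported standard error (.0455), which matches the rejection rate of .43 instead of the nominal .05. For $x$, the reported standard error (.0445) is not too small relative to the spread of the estimates (.0381), and the rejection rate is .02.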

``````
. simulate _b _se, reps(100) seed(101):panel_re reg robust

      command:  panel_re reg robust

Simulations (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100

. gen is_bx_1 = !inrange(1,_b_x-_se_x*1.96,_b_x+_se_x*1.96)
. gen is_bz_1 = !inrange(1,_b_z-_se_z*1.96,_b_z+_se_z*1.96)
. gen is_bc_1 = !inrange(1,_b_c-_se_c*1.96,_b_c+_se_c*1.96)
. sum, sep(3)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        _b_x |        100    1.000355    .0381124   .8919857   1.085083
        _b_z |        100    .9876579    .0994754   .7406453     1.2002
     _b_cons |        100    .9984308    .1107775   .7436068   1.238203
-------------+---------------------------------------------------------
       _se_x |        100    .0444072    .0027773   .0391417    .052421
       _se_z |        100    .0453708    .0049049   .0347456    .064886
    _se_cons |        100     .044696     .002205   .0396771   .0515227
-------------+---------------------------------------------------------
     is_bx_1 |        100         .02    .1407053          0          1
     is_bz_1 |        100          .4     .492366          0          1
     is_bc_1 |        100         .42     .496045          0          1
``````

``````
. simulate _b _se, reps(100) seed(101):panel_re reg cluster(id)

      command:  panel_re reg cluster(id)

Simulations (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100

. gen is_bx_1 = !inrange(1,_b_x-_se_x*1.96,_b_x+_se_x*1.96)
. gen is_bz_1 = !inrange(1,_b_z-_se_z*1.96,_b_z+_se_z*1.96)
. gen is_bc_1 = !inrange(1,_b_c-_se_c*1.96,_b_c+_se_c*1.96)
. sum, sep(3)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        _b_x |        100    1.000355    .0381124   .8919857   1.085083
        _b_z |        100    .9876579    .0994754   .7406453     1.2002
     _b_cons |        100    .9984308    .1107775   .7436068   1.238203
-------------+---------------------------------------------------------
       _se_x |        100    .0441533    .0051071   .0332627   .0670466
       _se_z |        100    .1049323    .0155818   .0745955   .1655501
    _se_cons |        100    .1045397    .0085235   .0841128   .1316457
-------------+---------------------------------------------------------
     is_bx_1 |        100         .04    .1969464          0          1
     is_bz_1 |        100         .05    .2190429          0          1
     is_bc_1 |        100         .06    .2386833          0          1
``````
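A compact way to read these tables is to compare the average reported standard error of $z$ with the standard deviation of the simulated estimates: when the two are close, the reported standard error is doing its job. The sketch below repeats the simulation with and without clustering. `sim_panel` is a hypothetical stand-in that mirrors `panel_re`, so the block runs on its own; no output is shown because it has not been run here.

```stata
* Sketch: average reported SE of z versus the spread of the simulated estimates.
* sim_panel is a hypothetical stand-in that mirrors panel_re.
capture program drop sim_panel
program define sim_panel, eclass
    clear
    set obs 100
    gen id = _n
    expand 10
    gen x = rnormal()
    gen z = rnormal()
    sort id, stable
    by id: replace z = z[1]
    gen e = rnormal()
    gen v = rnormal()
    by id: replace e = e[1]
    gen y = 1 + x + z + e + v
    * `1' is the estimation command, `2' its (single-token) option.
    `1' y x z, `2'
end

* OLS with conventional standard errors
quietly simulate _b _se, reps(100) seed(101) nodots: sim_panel regress
quietly summarize _b_z
local sd_b = r(sd)
quietly summarize _se_z
display "conventional: sd(b_z) = " %6.4f `sd_b' "   mean(se_z) = " %6.4f r(mean)

* OLS with standard errors clustered by id
quietly simulate _b _se, reps(100) seed(101) nodots: sim_panel regress cluster(id)
quietly summarize _b_z
local sd_b = r(sd)
quietly summarize _se_z
display "clustered:    sd(b_z) = " %6.4f `sd_b' "   mean(se_z) = " %6.4f r(mean)
```

In the tables above the same comparison gives .0995 against .0455 without clustering and .0995 against .1049 with clustering.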

``````
. simulate _b _se, reps(100) seed(101):panel_re xtreg fe

      command:  panel_re xtreg fe

Simulations (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100

. gen is_bx_1 = !inrange(1,_b_x-_se_x*1.96,_b_x+_se_x*1.96)
. gen is_bc_1 = !inrange(1,_b_c-_se_c*1.96,_b_c+_se_c*1.96)
. sum, sep(3)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        _b_x |        100    .9956048     .029711   .9376439   1.062583
      _sim_2 |        100           0           0          0          0
     _b_cons |        100    .9955147    .1434849   .6446086   1.326498
-------------+---------------------------------------------------------
       _se_x |        100    .0333269    .0011717   .0301809   .0364448
      _sim_5 |        100           0           0          0          0
    _se_cons |        100    .0316341     .000825   .0287964   .0339852
-------------+---------------------------------------------------------
     is_bx_1 |        100           0           0          0          0
     is_bc_1 |        100         .68    .4688262          0          1
``````
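The `_sim_2` and `_sim_5` columns, which are zero in every replication, sit where the coefficient and standard error of $z$ would be: $z$ does not vary within `id`, so the fixed-effects estimator cannot identify it. The sketch below reproduces that on a single draw. The seed is arbitrary, and the remark about the `o.` prefix describes how recent Stata versions usually flag omitted regressors, which is an assumption about the reader's version.

```stata
* Sketch: a regressor that never varies within id is omitted by xtreg, fe.
clear
set seed 2024          // arbitrary seed, chosen only for reproducibility
set obs 100
gen id = _n
expand 10
gen x = rnormal()
gen z = rnormal()
sort id, stable
by id: replace z = z[1]
gen e = rnormal()
gen v = rnormal()
by id: replace e = e[1]
gen y = 1 + x + z + e + v
xtset id

xtreg y x z, fe

* List the coefficient names; an omitted regressor is typically shown as o.z.
local names : colnames e(b)
display "`names'"
```

The individual effects absorb everything that is constant within a person, including $z$ and $e$, which is presumably why `simulate` falls back on generic names for those two columns.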

``````
. simulate _b _se, reps(100) seed(101):panel_re xtreg re

      command:  panel_re xtreg re

Simulations (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100

. gen is_bx_1 = !inrange(1,_b_x-_se_x*1.96,_b_x+_se_x*1.96)
. gen is_bz_1 = !inrange(1,_b_z-_se_z*1.96,_b_z+_se_z*1.96)
. gen is_bc_1 = !inrange(1,_b_c-_se_c*1.96,_b_c+_se_c*1.96)
. sum, sep(3)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        _b_x |        100    .9961247    .0289081   .9367011   1.064042
        _b_z |        100    .9877817    .0993992   .7413588   1.199385
     _b_cons |        100    .9984448    .1109847   .7435786   1.240776
-------------+---------------------------------------------------------
       _se_x |        100    .0331599     .001161    .030052   .0362103
       _se_z |        100    .1067879    .0120344   .0871601   .1579433
    _se_cons |        100    .1048667    .0087699   .0846491   .1327749
-------------+---------------------------------------------------------
     is_bx_1 |        100           0           0          0          0
     is_bz_1 |        100         .03    .1714466          0          1
     is_bc_1 |        100         .06    .2386833          0          1
``````

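(1) The standard errors for $x$ are as small as those from the FE estimator.

(2) We still obtain an estimate for $z$, with a standard error similar in size to that of OLS with clustered standard errors.

(3) We can also carry out inference on the constant term.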

``````
capture program drop panel_cre
program panel_cre, eclass
    clear
    set obs 100
    gen id = _n
    expand 10
    gen x = rnormal()
    gen z = rnormal()
    sort id, stable
    by id: replace z = z[1]
    gen e = rnormal()
    gen v = rnormal()
    by id: replace e = e[1]
    gen y = 1 + x + z + e + v
    xtset id
    * Compute the within-person mean of x
    by id: egen m_x = mean(x)
    * and control for it in the RE model.
    xtreg y x z m_x, re
end

. simulate _b _se, reps(100) seed(101):panel_cre

      command:  panel_cre

Simulations (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100

. gen is_bx_1 = !inrange(1,_b_x-_se_x*1.96,_b_x+_se_x*1.96)

. gen is_bz_1 = !inrange(1,_b_z-_se_z*1.96,_b_z+_se_z*1.96)

. gen is_bc_1 = !inrange(1,_b_c-_se_c*1.96,_b_c+_se_c*1.96)

. sum, sep(4)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        _b_x |        100    .9956048     .029711   .9376439   1.062583
        _b_z |        100    .9863926    .1006587   .7307997   1.207344
      _b_m_x |        100     .051984    .3436732  -.6213726   1.080986
     _b_cons |        100    .9982991    .1093144   .7438877   1.214026
-------------+---------------------------------------------------------
       _se_x |        100    .0333269    .0011717   .0301809   .0364448
       _se_z |        100    .1072043    .0120987    .087363   .1584293
     _se_m_x |        100    .3380803    .0343612   .2572779   .4165666
    _se_cons |        100    .1055241    .0090165   .0849954   .1330222
-------------+---------------------------------------------------------
     is_bx_1 |        100           0           0          0          0
     is_bz_1 |        100         .04    .1969464          0          1
     is_bc_1 |        100         .06    .2386833          0          1
``````
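The regression inside `panel_cre` adds the within-person mean of $x$ to the random-effects model, a specification often called the Mundlak or correlated random effects (CRE) approach. In notation introduced here for illustration ($u_i$ is the individual effect, $T = 10$ the number of periods):

$$
y_{it} = \beta_0 + \beta_x x_{it} + \beta_z z_i + \gamma\,\bar{x}_i + u_i + v_{it},
\qquad
\bar{x}_i = \frac{1}{T}\sum_{t=1}^{T} x_{it}.
$$

Two features of the table fit this reading. The mean of `_b_x` (.9956048) and of `_se_x` (.0333269) are identical to the FE values, while $z$ remains estimable with a rejection rate of .04. The coefficient on `m_x` averages .052 with an average standard error of .338, which is consistent with $\gamma = 0$ in a design where $x$ is generated independently of $e$.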

