Author: 黄俊凯 (Huang Junkai), Renmin University of China
Email: kopanswer@126.com
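Univariate matching can be done by exact matching, k-nearest-neighbor matching, or radius (caliper) matching:
Exact matching: control observations whose matching variable equals that of the treated observation serve as its counterfactual observations;
k-nearest-neighbor matching: the k control observations closest to the treated observation serve as its counterfactual observations;
Radius (caliper) matching: all control observations within a specified radius serve as counterfactual observations.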
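The three rules can be written compactly in set notation. The symbols are shorthand introduced here for illustration and are not taken from the original text: $x_i$ is the matching variable, $D_j = 0$ marks a control observation, $r$ is the radius, and $\mathcal{M}(i)$ is the set of counterfactual observations for treated observation $i$.

```latex
\begin{aligned}
\text{exact:}\quad & \mathcal{M}(i) = \{\, j : D_j = 0,\ x_j = x_i \,\} \\
\text{$k$-nearest neighbor:}\quad & \mathcal{M}(i) = \{\, j : D_j = 0,\ |x_j - x_i| \text{ is among the $k$ smallest} \,\} \\
\text{radius (caliper):}\quad & \mathcal{M}(i) = \{\, j : D_j = 0,\ |x_j - x_i| \le r \,\}
\end{aligned}
```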
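The core of multivariate matching is dimension reduction: several variables are collapsed into a single distance or score, and univariate matching is then applied to it. The methods are:
Coarsened exact matching: for example, firms in the same year and industry in corporate finance, or students admitted to the same university in the economics of education;
Percentile rank matching: a nonparametric method; for any treated observation […]
Mahalanobis distance matching: Euclidean distance has two drawbacks in matching. The dimensions are measured in different units, and the dimensions are correlated with each other, so the coordinate axes are not orthogonal. Mahalanobis distance applies a standardization and a rotation (together, the inverse of the covariance matrix) that transform the multivariate distribution into one whose variables share the same scale and are uncorrelated. If the sample covariance matrix is replaced by the identity matrix, Mahalanobis distance reduces to Euclidean distance;
Propensity score matching: this goes one step further than Mahalanobis distance because the score reflects the selection mechanism, and observations that are far apart in covariate space can still have the same score. This is why it has become a popular matching method.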
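For reference, the distances and the score mentioned in the list above can be written as follows. This uses standard textbook notation that is assumed here rather than taken from the post: $x_i$ and $x_j$ are covariate vectors, $S$ is the sample covariance matrix, and $D$ is the treatment dummy.

```latex
d_M(x_i, x_j) = \sqrt{(x_i - x_j)^{\top} S^{-1} (x_i - x_j)}, \qquad
d_E(x_i, x_j) = \sqrt{(x_i - x_j)^{\top} (x_i - x_j)}, \qquad
e(x) = \Pr(D = 1 \mid x)
```

Setting $S = I$ in $d_M$ gives $d_E$, which is the sense in which Mahalanobis distance reduces to Euclidean distance. Propensity score matching then applies a univariate rule to $e(x)$.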
Note: for more theory on the matching methods above, see the companion post "Stata 手动:各类匹配方法大全 A" (theory part; WeChat version).
ultimatch [varlist] [if] [in], treated(var) [exact(varlist)] [draw(#)] [caliper(#)] [support] [single] [greedy] [between] [rank] [copy] [report(varlist)]
[unit(varlist)] [unmatched] [exp(string)] [limit(string)]
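Here varlist specifies the variables used for matching (one or several), and var is the dummy variable that distinguishes the treated group from the control group.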
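The examples in this post require the command to be installed first. A minimal sketch, assuming that ultimatch is distributed through the SSC archive (the post itself does not say where it comes from):

```stata
* Assumption: ultimatch is available from the SSC archive.
ssc install ultimatch, replace
* Confirm that Stata can locate the command, then open its help file.
which ultimatch
help ultimatch
```

The help file is the authoritative source for the options listed in the syntax diagram above.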
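By default, ultimatch performs Mahalanobis distance matching with replacement. Every successful run of ultimatch also creates three auxiliary variables, _match, _distance and _weight:
_match identifies the successfully matched observations.
_distance is the distance between a control observation and the treated observation it is matched to:
for treated observations, its value is […];
for matched control observations, it equals the distance to the matched treated observation;
for unmatched control observations, it is missing.
_weight is the weight:
for matched treated observations, _weight always equals […];
if a control observation is matched to several treated observations, _weight is the sum of the corresponding weights. If the option copy is also specified, a copy of the control observation is created for each of those treated observations, and each copy's _weight equals the share that the control observation has among the counterfactual observations of the respective treated observation.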
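Univariate matching methods
Caliper/radius matching: caliper(real)
Caliper/radius matching without replacement: caliper(real) greedy
greedy: retains […] with the control observations; used together with the caliper(real) option.
k-nearest-neighbor matching: draw(integer)
k-nearest-neighbor matching, searching independently for k counterfactual observations below and above the treated observation: draw(integer) between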
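In addition, ultimatch by default treats all observations that share the same score or distance as a single draw, which helps to reduce the "burden" on individual neighbors. With the single option, one of the observations with the same score or distance is instead picked at random as the counterfactual observation.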
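Multivariate matching methods
Coarsened exact matching: exact(varlist)
Percentile rank matching based on Euclidean distance: rank
Mahalanobis distance matching: the default matching method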
Reporting differences in variables
Differences after matching: report(varlist)
Differences before and after matching: report(varlist) unmatched
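Generating auxiliary variables
Whether an observation lies within the common support: support
support creates an auxiliary variable _support; if an observation lies within the common support, _support equals […].
A copy of the control observation for each treated observation it is matched to: copy
copy […].
Other options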
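Identifying the units in panel data: unit(varlist)
When unit(varlist) is specified, the differences reported by report(varlist) are adjusted accordingly using clustered standard errors.
Custom matching conditions: exp(string)
exp(string) defines a custom matching condition. If the expression equals […]. The prefix t. refers to the currently active treated observation, which makes it easy to write expressions involving both the control and the treated observation. For example, a caliper match with different lower and upper radii: caliper(score - t.score <= 10 & t.score - score <= 5).
Additional constraints: limit(string)
limit(string) differs from percentile rank matching. Its argument is a list of "variable number" pairs, in which each number must lie between […]. limit(age 5 height weight 10) requires that, when matching, the univariate percentile rank difference in age does not exceed […].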
3.1 Data-generating process
clear
tempfile tmp
set obs 2000
set seed 2000
*- pre-treatment data
gen byte period = 0 // pre-treatment
label var period "Post-treatment period"
gen long id = _n
label var id "Individual id"
gen byte gender = uniform() > 0.5
label var gender "Gender"
gen age = uniform()
label var age "Age"
gen fitness = normal(gender*0.25 - age + invnorm(uniform())*0.1)
label var fitness "Fitness"
gen weight = normal(-gender*0.25 + age*0.25 - fitness*0.25 + invnorm(uniform())*0.1)
label var weight "Body weight"
gen treated = normal(fitness + invnorm(uniform())*0.25) > 0.73
label var treated "Treated group"
save `tmp'
*- post-treatment data
replace period = 1 // after treatment
replace weight = weight + weight*(uniform()-0.5)*0.2 - weight*(fitness-0.5)*0.25
*- combine pre- and post-treatment data
append using `tmp'
sort id period
replace weight = int(30.5+100*weight)
replace age = int(18.5+50*age)
gen effect = treated*period // treatment effect (interaction term for DiD)
label var effect "Treatment interaction"
order id treated period effect gender age weight fitness
*- propensity score
probit treated age gender weight if period == 0 // omitting "unobserved" selection
predict score // propensity score
label var score "Propensity score"
des
This gives us a dataset on health status, described as follows:
obs: 4,000
vars: 9 6 Nov 2020 18:24
--------------------------------------------------------------
storage display value
variable name type format label variable label
--------------------------------------------------------------
id long %12.0g Individual id
treated float %9.0g Treated group
period byte %8.0g Post-treatment period
effect float %9.0g Treatment interaction
gender byte %8.0g Gender
age float %9.0g Age
weight float %9.0g Body weight
fitness float %9.0g Fitness
score float %9.0g Propensity score
--------------------------------------------------------------
Sorted by: id period
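In fact, the sample is a panel of 2,000 individuals, each observed in two periods (period 0 before and period 1 after treatment), for 4,000 observations in total.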
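Before matching, it can be worth confirming that the simulated data have the intended panel structure. A small sketch using only built-in Stata commands, assuming the data generated by the code above are in memory:

```stata
* Assumes the simulated data from the code above are in memory.
isid id period                    // each (id, period) pair appears exactly once
xtset id period                   // declare the panel structure
xtdescribe                        // participation pattern across periods 0 and 1
tabulate treated if period == 0   // sizes of the treated and control groups
```

isid stops with an error if the pair (id, period) does not uniquely identify the observations, so it doubles as a check that the append step worked as intended.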
Basic syntax
ultimatch score if period == 0, treated(treated)
des _match _weight _distance
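Running the code above creates three variables:
_match, a non-negative, auto-incrementing sequence; control observations with the same _match value are matched to the same treated observation;
_weight, a pweight-type weight; for matched treated observations, _weight equals […];
_distance, the distance from a matched control observation to its nearest treated observation.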
storage display value
variable name type format label variable label
---------------------------------------------------------------
_match long %12.0g match id
_weight double %10.0g pweight
_distance double %10.0g neighbor distance
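By default, ultimatch reports the number of treated and control observations before and after matching. In the table below, the Total row shows 387 treated and 1,613 control observations before matching, and the Matched row shows that all 387 treated observations are matched, to 448 control observations.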
Nearest Neighbor
--------------------------------------------------
Support | Treated Control
-----------------+--------------------------------
Total | 387 1613
Without | 0 0
With | 387 1613
-----------------+--------------------------------
Matched | 387 448
Clustered | 0 0
Clusters | 387 448
--------------------------------------------------
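The post does not show an estimation step, but the variable effect was built as a DiD interaction, so one natural use of _weight is a weighted difference-in-differences regression. The sketch below is an illustration and not the author's procedure. It assumes that ultimatch was run on the period == 0 rows only, so that _weight is filled for matched pre-treatment observations and is missing elsewhere; w_did is a hypothetical helper variable.

```stata
* Assumes the ultimatch call above has just been run on period == 0 rows,
* so that _weight is missing for period 1 rows and for unmatched controls
* (an assumption about the command's output, not shown in the post).
bysort id (period): generate double w_did = _weight[1]   // carry the pre-period weight to period 1
* Weighted DiD: the coefficient on effect is the interaction of interest.
regress weight treated period effect [pweight = w_did], vce(cluster id)
```

Observations with a missing weight drop out of the regression, so only matched individuals contribute. Clustering on id accounts for the two observations per individual.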
Using the report(varlist) option
cap drop _*
ultimatch score if period == 0, treated(treated) report(gender age)
cap drop _*
ultimatch score if period == 0, treated(treated) report(gender age) unmatched
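The report(varlist) option reports the post-matching differences between the treated and control groups in the variables in varlist, together with the corresponding t-tests. Its sub-option unmatched additionally reports the pre-matching differences in varlist and their t-tests.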
. ultimatch score if period == 0, treated(treated) report(gender age)
Nearest Neighbor
---------------------------------------------
Support | Treated Control
------------+--------------------------------
Total | 387 1613
Without | 0 0
With | 387 1613
------------+--------------------------------
Matched | 387 448
Clustered | 0 0
Clusters | 387 448
------------+-----------------------------------------------------------
Matched | Treated Control | StdErr t p>|t|
------------+--------------------------------+--------------------------
gender | .599483204 .589147287 | .0390341 0.26 0.791
age | 34.8165375 34.6925065 | .967815 0.13 0.898
------------------------------------------------------------------------
. ultimatch score if period == 0, treated(treated) report(gender age) unmatched
Nearest Neighbor
------------------------------------------
Support | Treated Control
------------+-----------------------------
Total | 387 1613
Without | 0 0
With | 387 1613
------------+-----------------------------
Matched | 387 448
Clustered | 0 0
Clusters | 387 448
------------+--------------------------------------------------------
Unmatched | Treated Control | StdErr t p>|t|
------------+-----------------------------+--------------------------
gender | .599483204 .458152511 | .0281268 5.02 0.000
age | 34.8165375 45.2461252 | .783574 -13.31 0.000
------------+-----------------------------+--------------------------
Matched | Treated Control | StdErr t p>|t|
------------+-----------------------------+--------------------------
gender | .599483204 .589147287 | .0390341 0.26 0.791
age | 34.8165375 34.6925065 | .967815 0.13 0.898
---------------------------------------------------------------------
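The balance comparison in these tables can be approximated by hand with built-in commands. This is a rough cross-check under the assumption that _weight from the ultimatch call above is still in memory. It is not a reimplementation of report(), and the standard errors need not coincide with those shown.

```stata
* Before matching: two-sample t-test on the pre-treatment rows.
* Note that ttest reports mean(treated == 0) - mean(treated == 1).
ttest age if period == 0, by(treated)
* After matching: weighted difference in means using the matching weights.
* The coefficient on treated is the treated-minus-control difference.
regress age treated [pweight = _weight] if period == 0
```

Because regress uses robust standard errors with pweights, the t statistic is only comparable in spirit to the one in the Matched panel.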
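Using the matching methods
ultimatch supports distance matching, including Euclidean distance matching (euclid), percentile rank matching based on Euclidean distance (rank), and Mahalanobis distance matching (mahalanobis).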
cap drop _*
ultimatch gender age weight if period == 0, treated(treated) report(weight) euclid
cap drop _*
ultimatch gender age weight if period == 0, treated(treated) report(weight) rank
cap drop _*
ultimatch gender age weight if period == 0, treated(treated) report(weight) mahalanobis
The results are as follows:
. ultimatch gender age weight if period == 0, treated(treated) report(weight) euclid
Euclidean Distance-based Neighborhood Matching
------------------------------------------
Support | Treated Control
------------+-----------------------------
Total | 387 1613
Without | 0 0
With | 387 1613
------------+-----------------------------
Matched | 387 600
Clustered | 0 0
Clusters | 387 600
------------+------------------------------------------------------------------
Matched | Treated Control | StdErr t p>|t| SDM
------------+-----------------------------+------------------------------------
weight | 73.4444444 73.4384413 | .522879 0.01 0.991 0.00087
-------------------------------------------------------------------------------
. ultimatch gender age weight if period == 0, treated(treated) report(weight) rank
Euclidean Distance-based Percentile Rank Neighborhood Matching
------------------------------------------
Support | Treated Control
------------+-----------------------------
Total | 387 1613
Without | 0 0
With | 387 1613
------------+-----------------------------
Matched | 387 456
Clustered | 0 0
Clusters | 387 456
------------+------------------------------------------------------------------
Matched | Treated Control | StdErr t p>|t| SDM
------------+-----------------------------+------------------------------------
weight | 73.4444444 73.4470284 | .541415 -0.00 0.996 -0.00038
-------------------------------------------------------------------------------
. ultimatch gender age weight if period == 0, treated(treated) report(weight) mahalanobis
Mahalanobis Distance-based Neighborhood Matching
------------------------------------------
Support | Treated Control
------------+-----------------------------
Total | 387 1613
Without | 0 0
With | 387 1613
------------+-----------------------------
Matched | 387 503
Clustered | 0 0
Clusters | 387 503
------------+------------------------------------------------------------------
Matched | Treated Control | StdErr t p>|t| SDM
------------+-----------------------------+------------------------------------
weight | 73.4444444 73.4668389 | .55054 -0.04 0.968 -0.00325
-------------------------------------------------------------------------------
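ultimatch also supports score matching. Because score matching ultimately reduces to univariate matching, only the options for univariate matching are shown here: the number of draws draw(integer), drawing in both directions between, and additional univariate percentile rank constraints limit(string).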
cap drop _*
qui ultimatch score if period == 0, treated(treated)
sort period score treated
list id treated _distance if _match == 1
+---------------------------+
| id treated _dista~e |
|---------------------------|
38. | 1617 0 0 |
39. | 328 1 0 |
+---------------------------+
cap drop _*
qui ultimatch score if period == 0, treated(treated) draw(3)
sort period score treated
list id treated _distance if _match == 1
+----------------------------+
| id treated _distance |
|----------------------------|
35. | 813 0 .00016189 |
36. | 1922 0 .00004054 |
37. | 1148 0 .00004054 |
38. | 1617 0 0 |
39. | 328 1 0 |
+----------------------------+
cap drop _*
ultimatch score if period == 0, treated(treated) between
sort period score treated
list id treated _distance if _match == 1
+---------------------------+
| id treated _dista~e |
|---------------------------|
38. | 1617 0 0 |
39. | 328 1 0 |
40. | 288 0 .0017835 |
41. | 961 0 .0017835 |
+---------------------------+
cap drop _*
ultimatch score if period == 0, treated(treated) between limit(age 1 weight 10)
list id treated _distance if _match == 1
+----------------------------+
| id treated _distance |
|----------------------------|
632. | 1617 0 0 |
652. | 226 0 .03584137 |
657. | 112 0 .03584137 |
1049. | 328 1 0 |
+----------------------------+
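To see where such neighbors come from, the nearest controls of a single treated observation can be looked up by hand. The sketch below uses id 328, the treated observation with _match == 1 in the listings above, and assumes that the distance on a single score is the absolute score difference. d is a hypothetical helper variable.

```stata
* Assumes the simulated data with the variable score are in memory.
preserve
keep if period == 0
quietly summarize score if id == 328           // score of the treated observation
generate double d = abs(score - r(mean)) if treated == 0
sort d id                                      // missing d (treated rows) sorts last
list id treated score d in 1/3                 // the three closest controls
restore
```

Controls that tie on d appear next to each other in this list, which is the situation in which ultimatch by default treats them as a single draw.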
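Using the single and greedy options
By default, ultimatch treats observations at the same distance as a single observation. The single option instead draws one of the nearest neighbors at random as the counterfactual observation, and the greedy option requests matching without replacement.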
cap drop _*
ultimatch score if period == 0, treated(treated) report(gender age weight) single
sort period score treated
list id treated _match _distance if _match <= 3
+------------------------------------+
| id treated _match _dista~e |
|------------------------------------|
38. | 1617 0 1 0 |
39. | 328 1 1 0 |
42. | 1643 0 2 0 |
45. | 1041 1 2 0 |
121. | 905 0 3 0 |
|------------------------------------|
122. | 646 1 3 0 |
+------------------------------------+
cap drop _*
ultimatch score if period == 0, treated(treated) report(gender age weight) single greedy
sort period score treated
list id treated _match _distance if _match <= 3
+------------------------------------+
| id treated _match _dista~e |
|------------------------------------|
38. | 1617 0 1 0 |
39. | 328 1 1 0 |
42. | 432 0 2 0 |
45. | 1041 1 2 0 |
121. | 1375 0 3 0 |
|------------------------------------|
122. | 646 1 3 0 |
+------------------------------------+
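Using the copy and support options
ultimatch also offers a number of additional options, for example copy and its sub-option full […]. With copy alone, each treated observation still has only one copy after matching (_weight equals […]). With copy and full, each treated observation also gets one copy for every control observation it is matched to, each with its own _match and _weight. As another example, the support option creates the variable _support; when _support equals […].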
cap drop if _copy == 1
cap drop _*
ultimatch score if period == 0, treated(treated) report(gender age weight) copy
tab _copy
tab _weight if treated == 1
copied |
observation | Freq. Percent Cum.
------------+-----------------------------------
1 | 212 100.00 100.00
------------+-----------------------------------
Total | 212 100.00
. tab _weight if treated == 1
pweight | Freq. Percent Cum.
------------+-----------------------------------
1 | 387 100.00 100.00
------------+-----------------------------------
Total | 387 100.00
cap drop if _copy == 1
cap drop _*
ultimatch score if period == 0, treated(treated) report(gender age weight) copy full
tab _copy
tab _weight if treated == 1
copied |
observation | Freq. Percent Cum.
------------+-----------------------------------
1 | 485 100.00 100.00
------------+-----------------------------------
Total | 485 100.00
. tab _weight if treated == 1
pweight | Freq. Percent Cum.
------------+-----------------------------------
.1666667 | 6 0.91 0.91
.2 | 30 4.55 5.45
.25 | 72 10.91 16.36
.3333333 | 123 18.64 35.00
.5 | 216 32.73 67.73
1 | 213 32.27 100.00
------------+-----------------------------------
Total | 660 100.00
cap drop if _copy == 1
cap drop _*
ultimatch score if period == 0, treated(treated) report(gender age weight) support
tab _support
common |
support | Freq. Percent Cum.
------------+-----------------------------------
0 | 41 2.05 2.05
1 | 1,959 97.95 100.00
------------+-----------------------------------
Total | 2,000 100.00
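For comparison, a common textbook definition of common support is the overlap of the minimum and maximum scores of the two groups. The sketch below implements that min-max rule. Whether ultimatch defines _support in the same way is an assumption, so the counts need not match the table above; in_overlap, s_lo and s_hi are hypothetical names.

```stata
* Min-max overlap of the score distributions (pre-treatment rows only).
quietly summarize score if treated == 1 & period == 0
scalar s_lo = r(min)
scalar s_hi = r(max)
quietly summarize score if treated == 0 & period == 0
scalar s_lo = max(s_lo, r(min))     // larger of the two group minima
scalar s_hi = min(s_hi, r(max))     // smaller of the two group maxima
generate byte in_overlap = inrange(score, s_lo, s_hi) if period == 0
tabulate in_overlap
```

Scalars are used instead of local macros so that the bounds keep full precision when they are compared with score.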
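Using the exact(varlist) and exp(string) options
ultimatch also supports extensive customization. The exact(varlist) option performs coarsened exact matching, so the variables matched exactly show no difference after matching. The exp(string) option, together with the t. prefix, allows custom matching requirements; for example, exact(gender) and exp(gender == t.gender) both require matched observations to have the same gender.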
cap drop _*
ultimatch score if period == 0, treated(treated) report(gender age weight) exact(gender)
----------+-------------------------------------------------------------
Matched | Treated Control | StdErr t p>|t| SDM
----------+--------------------------+----------------------------------
gender | .599483204 .599483204 | .038298 -0.00 1.000 0.00000
age | 34.8165375 34.8165375 | .956433 -0.00 1.000 0.00000
weight | 73.4444444 73.4547804 | .527746 -0.02 0.984 -0.00153
------------------------------------------------------------------------
cap drop _*
ultimatch score if period == 0, treated(treated) report(gender age weight) exp(gender == t.gender)
---------+-------------------------------------------------------------
Matched | Treated Control | StdErr t p>|t| SDM
---------+--------------------------+----------------------------------
gender | .599483204 .599483204 | .038298 -0.00 1.000 0.00000
age | 34.8165375 34.8165375 | .956433 -0.00 1.000 0.00000
weight | 73.4444444 73.4547804 | .527746 -0.02 0.984 -0.00153
-----------------------------------------------------------------------
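That the matched groups really are homogeneous in gender can be checked directly on the data. A small sketch, assuming that _match from one of the two calls above is in memory and that observations sharing a _match value form one matched group; sd_gender is a hypothetical helper variable.

```stata
* Within-group standard deviation of gender; zero means the group is single-gender.
bysort _match: egen double sd_gender = sd(gender) if !missing(_match)
* Number of observations that belong to a mixed-gender matched group.
count if sd_gender > 0 & !missing(sd_gender)
```

If the same-gender constraint holds for every matched group, the count is zero.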