发布时间:2020-07-01 阅读 2491

Stata 连享会   主页 || 视频 || 推文

温馨提示: 定期 清理浏览器缓存,可以获得最佳浏览体验。

课程详情 https://gitee.com/arlionn/Course   |   lianxh.cn

课程主页 https://gitee.com/arlionn/Course

4 June 2020
Enrique Pinzon, Associate Director Econometrics

I care about reproducible research. Anyone who has ever been a research assistant or tried to follow the path set by other researchers also cares. Sometimes, reproducing others’ results is a frustrating task; sometimes, it is outright impossible. Yet sometimes, it is satisfyingly simple. In my experience, reproducing results is easy when it involves a Stata do-file. I believe this is true even beyond my personal bias (I work for Stata and used the software regularly before that). A recent article published by the American Economic Association (AEA), Vilhuber, Turrito, and Welch (2020), shows that Stata is the preferred package among economists, and I believe reproducibility is a big reason why.

The AEA established reproducibility guidelines in 2008. Recently, it updated its guidelines to require authors not only to make data and analysis available but also to provide the code used to clean the data and the raw data, whenever feasible. Now, the editorial process includes an AEA data editor who verifies that the information provided by the authors is sufficient to replicate the results in the paper.

Vilhuber, Turrito, and Welch (2020) show that since the inception of the policy, Stata has been used in 73% of the supplements provided by the authors. The usage has been increasing over the span of the policy. The graph below shows the percentage of data supplements in which different software packages are used. These percentages may add up to more than 100% because content from more than one software package may be submitted in each supplement.

Figure 1: Percentage of software usage by year in AEA supplements


This is not a surprise to anyone who has used Stata. I believe one important reason researchers choose Stata is that reproducing your results is easy. Case in point is the graph above. To get the data and reproduce the graph, you just need to run the do-file, which I discuss in Appendix I. If you want to create a reproducible report, see my discussion in Appendix II.

Appendix I: Explaining the do-file

The do-file below mainly uses three commands: import delimited, which I use to get the comma-separated value dataset used for the graph; xtline, which I use to generate the graph; and egen, which helps me to generate a numeric categorical variable from a string variable using the group function. The other commands I use simplify readability, allow me to modify the code, and help me display results.

Line 1 is there for reproducibility. Stata is the only package I am aware of with integrated version control ensuring that scripts written as long ago as 2008, and indeed, even earlier, can still be used to reproduce their results in the modern version of the software. Lines 5 to 7 create locals for the location of the files. I use them in my call to import delimited. Line 16 creates a categorical variable from a string variable. Each category corresponds to a software package. In the next line, I keep a subset of the data. The data I keep correspond to the time period since the AEA implemented its data policy. In line 21, I use xtset to declare data as a panel and then to be able to use xtline. In line 22, I change the default Stata graphic scheme. Stata has multiple schemes, and you could even write one that best represents your preferences. I like the simplicity of s1color. The remaining code reproduces figure 1.

version 16

// Simplyfing readability and allowing for possibly other data downloads

local  where "https://raw.githubusercontent.com/AEADataEditor"
local  folder "aea-supplement-migration/master/data/generated"
local data "table.aea.software_by_year.csv"

// Importing dataset

import delimited using "`where'/`folder'/`data'",   ///
clear varnames(1) colrange(2:9)

// Generating a numeric variable to graph

egen software_num = group(software_collapsed), label
keep if year > 2007

// Graphing

xtset software_num year
set scheme s1color
xtline percent, overlay legend(position(10) ring(0) rows(2)         ///
   region(lstyle(none)) order(9 3 4 7 5 6 8 1 2))                   ///
   ytitle("") xtitle("") xlabel(2008(3)2017 2019)                   ///
   ylabel(0(20)100)  plotregion(margin(medium))                     ///
   plot9(lcolor(blue))                                              ///
   note("Note: A supplement can be used in more than one software." ///
   "Note: Data for 2019 ends in October.")

Appendix II: Making it a reproducible document

You can create a reproducible report in Word, Excel, PDF, and HTML using Stata. By reproducible, I mean that you can run a do-file that produces Stata results and outputs them into the format you want.

I want to create an HTML document. I save a do-file as a text file and give it the name aea.txt. Then, I type dyndoc “aea.txt”, replace. I generate an HTML document. Aside from aea.txt, I use two other files to produce my HTML document. First, I create a header.txt file. The header.txt file contains HTML code to include at the top of the target HTML file. The header refers to the second file, stmarkdown.css, which is a style sheet that defines how the HTML document is to be formatted.

Now that I have my HTML documents, with Stata’s reporting tools, I can migrate from HTML to Word using one line of code, html2docx aea.html, saving(aea.docx) replace, and from Word to PDF, again, with one line of code, docx2pdf aea.docx, saving (aea.pdf) replace.


  • Vilhuber, L., J. Turrito, and K. Welch. 2020. Report by the AEA Data Editor. AEA Papers and Proceedings 110: 764–775. https://doi.org/10.1257/pandp.110.764.


连享会-直播课 上线了!




专题 嘉宾 直播/回看视频
Stata暑期班 连玉君
线上直播 9 天
效率分析-专题 连玉君
张 宁
文本分析/爬虫 游万海
空间计量系列 范巧 空间全局模型, 空间权重矩阵
空间动态面板, 空间DID
研究设计 连玉君 我的特斯拉-实证研究设计-幻灯片-
面板模型 连玉君 动态面板模型-幻灯片-
直击面板数据模型 [免费公开课,2小时]

Note: 部分课程的资料,PPT 等可以前往 连享会-直播课 主页查看,下载。


  • Stata连享会 由中山大学连玉君老师团队创办,定期分享实证分析经验。直播间 有很多视频课程,可以随时观看。
  • 连享会-主页知乎专栏,300+ 推文,实证分析不再抓狂。
  • 公众号推文分类: 计量专题 | 分类推文 | 资源工具。推文分成 内生性 | 空间计量 | 时序面板 | 结果输出 | 交乘调节 五类,主流方法介绍一目了然:DID, RDD, IV, GMM, FE, Probit 等。
  • 公众号关键词搜索/回复 功能已经上线。大家可以在公众号左下角点击键盘图标,输入简要关键词,以便快速呈现历史推文,获取工具软件和数据下载。常见关键词:
    • 课程, 直播, 视频, 客服, 模型设定, 研究设计, 暑期班
    • stata, plus,Profile, 手册, SJ, 外部命令, profile, mata, 绘图, 编程, 数据, 可视化
    • DID,RDD, PSM,IV,DID, DDD, 合成控制法,内生性, 事件研究, 交乘, 平方项, 缺失值, 离群值, 缩尾, R2, 乱码, 结果
    • Probit, Logit, tobit, MLE, GMM, DEA, Bootstrap, bs, MC, TFP, 面板, 直击面板数据, 动态面板, VAR, 生存分析, 分位数
    • 空间, 空间计量, 连老师, 直播, 爬虫, 文本, 正则, python
    • Markdown, Markdown幻灯片, marp, 工具, 软件, Sai2, gInk, Annotator, 手写批注, 盈余管理, 特斯拉, 甲壳虫, 论文重现, 易懂教程, 码云, 教程, 知乎

连享会主页  lianxh.cn
连享会主页 lianxh.cn



✏ 连享会学习群-常见问题解答汇总: