Original: Software for Researchers: New Data and Applications
Author: Anton Tarasenko
The tools mentioned here help manage reproducible research and handle new types of data. Why should you go after new data? New data provides new insights. For example, the recent Clark Medal winners used unconventional data in their major works. These data were large and unstructured, so Excel, Word, and email would not do the job.
I write for economists, but other social scientists can also find these recommendations useful. These tools have a steep learning curve and pay off over time. Some improve small-data analysis as well, but most gains come from new sources and real-time analysis.
Each section ends with a recommended reading list.
LaTeX and Dropbox streamline collaboration. The recommended LaTeX editor is LyX. Zotero and its browser plugin manage the references. LyX supports Zotero via another plugin.
Stata and Matlab do numerical computations. Both are paid and have good support and documentation. Free alternatives: IPython and RStudio for Stata, Octave for Matlab.
Mathematica does symbolic computations. Sage is a free alternative.
Some websites have APIs, which send data in structured formats but limit the number of requests. Site owners may alter the limit by agreement. When the website has no API, Kimono and Import.io extract structured data from webpages. When they can’t, BeautifulSoup and similar parsers can.
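As a rough illustration of what such parsers do, here is a minimal sketch that pulls table rows out of a webpage. It uses `html.parser` from Python's standard library instead of BeautifulSoup so that it runs without extra installs. The class `TableExtractor`, the function `extract_rows`, and the sample HTML (with fictional countries and numbers) are made up for this example.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row being built, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    """Return the table contents as a list of rows (lists of strings)."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical page fragment; the countries and figures are invented.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>GDP growth, %</th></tr>
  <tr><td>Atlantis</td><td> 2.5 </td></tr>
  <tr><td>El Dorado</td><td>-0.7</td></tr>
</table>
"""

rows = extract_rows(SAMPLE_HTML)
header, records = rows[0], rows[1:]
data = [dict(zip(header, record)) for record in records]
print(data)
```

Real pages are messier (nested tables, missing closing tags), which is where a dedicated parser such as BeautifulSoup earns its keep.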
Other sources include industrial software, custom data collection systems (like surveys on Amazon Mechanical Turk), and physical media. Text recognition systems require little manual labor, so digitizing analog sources is easy now.
Socrata, data.gov, Quandl, and FRED2 maintain the most comprehensive collections of public datasets. But the universe is much bigger, and exotic data hides elsewhere.
A general purpose programming language can manage data that comes in peculiar formats or requires cleaning.
Use Python by default. Its packages also replicate core functionality of Stata, Matlab, and Mathematica. Other packages handle GIS, NLP, visual, and audio data.
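To make "peculiar formats" concrete, here is a small sketch of a cleaning step using only the standard library. The raw text, the column names, and the helpers `parse_number` and `clean` are invented for illustration. The file is assumed to use semicolons as separators, decimal commas, spaces as thousands separators, and `n/a` for missing values.

```python
import csv
import io
import re

# Hypothetical export with inconsistent spacing, case, and number format.
RAW = """region;year;income
 North ;2014;"1 250,50"
south;2014;n/a
North;2015;"1 310,00"
"""


def parse_number(text):
    """Convert '1 250,50' style numbers to float; return None if not numeric."""
    cleaned = re.sub(r"\s", "", text).replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None


def clean(raw):
    """Parse the semicolon-separated text and normalize each field."""
    reader = csv.DictReader(io.StringIO(raw), delimiter=";")
    records = []
    for row in reader:
        records.append({
            "region": row["region"].strip().title(),
            "year": int(row["year"]),
            "income": parse_number(row["income"]),
        })
    return records


records = clean(RAW)
for record in records:
    print(record)
```

For larger datasets the same logic is usually written with a data-frame package, but the steps (parse, normalize, flag missing values) stay the same.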
Python comes as a standalone installation or in special distributions like Anaconda. For easier troubleshooting, I recommend the standalone installation. Use pip for package management.
Python is slower than many popular languages, but certain tweaks make it fast enough that you need not learn another language, such as Julia or Java. Generally, execution time is not an issue: computing power keeps getting cheaper (Moore's Law describes transistor counts doubling roughly every two years), while the coder's time gets more expensive.
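The text does not name specific tweaks, so the following is just one possible example of my own choosing: caching repeated function calls with `functools.lru_cache`. The functions `paths_plain` and `paths_cached` are hypothetical and count lattice paths in a grid. The call counters show how much work the cache saves without relying on machine-dependent timings.

```python
from functools import lru_cache

calls = {"plain": 0, "cached": 0}


def paths_plain(rows, cols):
    """Count monotone lattice paths in a grid by naive recursion."""
    calls["plain"] += 1
    if rows == 0 or cols == 0:
        return 1
    return paths_plain(rows - 1, cols) + paths_plain(rows, cols - 1)


@lru_cache(maxsize=None)
def paths_cached(rows, cols):
    """Same recursion, but each (rows, cols) pair is computed only once."""
    calls["cached"] += 1
    if rows == 0 or cols == 0:
        return 1
    return paths_cached(rows - 1, cols) + paths_cached(rows, cols - 1)


plain_result = paths_plain(8, 8)
cached_result = paths_cached(8, 8)

print("results agree:", plain_result == cached_result)
print("function calls without cache:", calls["plain"])
print("function calls with cache:", calls["cached"])
```

Other common options, not shown here, include vectorized array libraries and compiling hot loops. Which one helps depends on where the program actually spends its time.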
Command line interfaces make massive operations on files easier. For Macs and other *nix systems, learn bash. For Windows, see cmd.exe.
Version control tracks changes in files. It includes:
Version control by Git is a de facto standard. GitHub.com is the largest service that maintains Git repositories. It offers free storage for open projects and paid storage for private repositories.
A GitHub repository is a one-click solution for both code and data. No problems with university servers, relocated personal pages, or sending large files via email.
When your project goes north of 1 GB, you can use GitHub’s Large File Storage or alternatives: AWS, Google Cloud, mega.nz, or torrents.
Jupyter notebooks combine text, code, and output on the same page. See examples:
Beamer for LaTeX is a standard solution for slides. TikZ for LaTeX draws diagrams and graphics.
Remote servers store large datasets in memory. They do numerical optimization and Monte Carlo simulations. GPU-based servers train artificial neural networks much faster and require less coding. These things save time.
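For readers unfamiliar with Monte Carlo simulation, here is a minimal sketch of the idea, small enough to run on a laptop. It simulates the standard error of a sample mean and compares it with the textbook value. The function `monte_carlo_se`, the uniform distribution, the sample size, and the seed are assumptions chosen for illustration.

```python
import math
import random
import statistics


def monte_carlo_se(n_obs, n_reps, seed=12345):
    """Simulate the standard error of the mean of n_obs Uniform(0, 1) draws."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    means = []
    for _ in range(n_reps):
        sample = [rng.random() for _ in range(n_obs)]
        means.append(statistics.fmean(sample))
    # The spread of the simulated means estimates the standard error.
    return statistics.stdev(means)


n_obs = 50
simulated = monte_carlo_se(n_obs=n_obs, n_reps=2000)
# Variance of Uniform(0, 1) is 1/12, so the SE of the mean is sqrt(1 / (12 n)).
theoretical = math.sqrt(1 / (12 * n_obs))

print(f"simulated SE:   {simulated:.4f}")
print(f"theoretical SE: {theoretical:.4f}")
```

Precision improves with the number of replications, so serious simulations run millions of them. That is the kind of workload worth moving to a remote server.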
If campus servers have peculiar limitations, third-party companies offer scalable solutions (AWS and Google Cloud). Users pay for storage and processor power, so exploratory analysis goes quickly.
A typical workflow with version control:
Some services allow writing code in a browser and running it right on their servers.
Real-time analysis requires optimization for performance. I exemplify with industrial applications:
A map for learning new data technologies by Swami Chandrasekaran: