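New! The lianxh command has been released: search 连享会 posts and Stata resources at any time. Install:
. ssc install lianxh
See the help file for details (it has a few surprises):
. help lianxh
New commands from 连享会: cnssc, ihelp, rdbalance, gitee, installpkg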
Author: Mengjie Xu (许梦洁), Frankfurt School of Finance and Management
Email: m.xu@fs.de
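While writing a paper I needed a shareholder activism variable, but my school does not subscribe to the data. Looking at Brav et al. (2018, JFE), I found that this data is compiled directly from the 13D filings on SEC EDGAR. Waiting for the library to order it would have taken several days, so I scraped it myself.
I usually use Selenium to obtain cookies: the approach is automated and works for almost any website. By default, EDGAR shows only the form and file name, the filing date, and the names of the entities involved. After opening the page, you need to scroll down and tick CIK, File number, and any other fields you need. The most important one is CIK, the key identifier for linking SEC filings to other financial databases (Compustat, CRSP, etc.).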
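After opening the source URL and scrolling down, the results table shows only three columns by default. Once the extra fields are ticked, more columns appear and the cookies change accordingly.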
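from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service
import time
import json

option = webdriver.FirefoxOptions()
option.add_argument('-headless')
# Pass the options to the driver; otherwise the headless flag has no effect.
# Selenium 4 style: the geckodriver path goes into a Service object.
service = Service(executable_path='/Users/mengjiexu/Dropbox/Pythoncodes/geckodriver')
driver = webdriver.Firefox(service=service, options=option)

# Open the source page
driver.get("https://www.sec.gov/edgar/search/#/dateRange=custom&category=custom&startdt=2001-01-01&enddt=2021-05-27&forms=SC%252013D")
time.sleep(3)

# Scroll down to the checkboxes
driver.execute_script('window.scrollTo(0,500)')
time.sleep(2)

# Tick CIK
driver.find_element(By.XPATH, "//input[@value='cik']").click()
# Tick File number
driver.find_element(By.XPATH, "//input[@value='file-num']").click()
time.sleep(2)

# Collect the cookies and write them to a txt file as a name -> value dict
orcookies = driver.get_cookies()
print(orcookies)
cookies = {}
for item in orcookies:
    cookies[item['name']] = item['value']
with open("edgarcookies.txt", "w") as f:
    f.write(json.dumps(cookies))

# Close the browser and end the geckodriver process
driver.quit()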
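The JSON returned by the server can be found in the search-index request.
Go to the second page of results and copy the request as Fetch to see the POST parameters. The key parts are the headers and the body.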
await fetch("https://efts.sec.gov/LATEST/search-index", {
    "credentials": "omit",
    "headers": {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0",
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Accept-Language": "en-US,en;q=0.5",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"
    },
    "referrer": "https://www.sec.gov/",
    "body": "{\"dateRange\":\"custom\",\"category\":\"custom\",\"startdt\":\"2001-01-01\",\"enddt\":\"2021-05-27\",\"forms\":[\"SC 13D\"],\"page\":\"2\",\"from\":100}",
    "method": "POST",
    "mode": "cors"
});
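The body holds the query parameters posted to the server, including:
startdt / enddt: start and end of the date range
forms = SC 13D
page: page number
from = previous page number $\times$ 100, i.e. (page - 1) $\times$ 100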
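The relation between page and from can be made explicit by assembling the body with json.dumps rather than a hand-escaped string. This is only a sketch: the function name build_body and the per_page parameter are hypothetical, the field names are copied from the captured request, and it assumes the server parses the body as JSON so that whitespace differences do not matter.

```python
import json

def build_body(startdt, enddt, page, per_page=100):
    """Return the JSON body for one page of SC 13D search results."""
    payload = {
        "dateRange": "custom",
        "category": "custom",
        "startdt": startdt,
        "enddt": enddt,
        "forms": ["SC 13D"],
        "page": str(page),              # the captured request sends page as a string
        "from": (page - 1) * per_page,  # offset of the first hit on this page
    }
    return json.dumps(payload)

# Page 2 of the full date range, as in the captured request
body = build_body("2001-01-01", "2021-05-27", page=2)
print(body)
```

The resulting string could be passed as the data argument of a POST request in place of a manually formatted body.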
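Double-clicking the search-index request shows the formatted JSON; call it info, and call each hit in it case. The main variables extracted and their JSON paths are:
totalnum: info['hits']['total']['value']
id: case['_id']
cik0: case['_source']['ciks'][0]
cik1: case['_source']['ciks'][1]
file_num: case['_source']['file_num']
display_names_0: case['_source']['display_names'][0]
display_names_1: case['_source']['display_names'][1]
file_date: case['_source']['file_date']
adsh: case['_source']['adsh']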
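To illustrate the paths above, here is a minimal sketch that flattens one hit into a row. The helper names first_two and hit_to_row are hypothetical, and the sample hit is made up with placeholder values; it only mimics the shape of the paths listed above. Padding the lists keeps the first CIK or name when a hit has only one.

```python
def first_two(values):
    """Return the first two items of a list, padded with empty strings."""
    values = list(values or [])
    return (values + ["", ""])[:2]

def hit_to_row(case):
    """Flatten one search hit into the variables listed above."""
    src = case["_source"]
    cik0, cik1 = first_two(src.get("ciks"))
    name0, name1 = first_two(src.get("display_names"))
    return [case["_id"], cik0, cik1, src.get("file_num"),
            name0, name1, src.get("file_date"), src.get("adsh")]

# Made-up hit with placeholder values (not real EDGAR data)
sample = {
    "_id": "0000000000-01-000000:example.htm",
    "_source": {
        "ciks": ["0000000001"],
        "display_names": ["Example Corp"],
        "file_num": "000-00000",
        "file_date": "2001-01-01",
        "adsh": "0000000000-01-000000",
    },
}
print(hit_to_row(sample))
```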
import time
import json
import csv
import math
import requests
from tqdm import tqdm

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:88.0) Gecko/20100101 Firefox/88.0",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Language": "en-US,en;q=0.5",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"
}

def postpage(year, page):
    # Build the POST body for one page (100 hits per page) of one calendar year
    come = (page - 1) * 100
    post = "{\"dateRange\":\"custom\",\"category\":\"custom\",\"startdt\":\"%s-01-01\",\"enddt\":\"%s-12-31\",\"forms\":[\"SC 13D\"],\"page\":\"%s\",\"from\":%s}" % (year, year, page, come)
    return post

def getinfo(year, page):
    # Load the cookies saved by the Selenium step
    with open("edgarcookies.txt", "r") as f:
        cookies = json.loads(f.read())
    session = requests.session()
    url = "https://efts.sec.gov/LATEST/search-index"
    data = session.post(url, headers=headers, cookies=cookies, data=postpage(year, page))
    time.sleep(1)
    info = json.loads(data.text)
    totalnum = info['hits']['total']['value']
    # Append one row per hit to the output CSV
    with open('edgar13D_check.csv', 'a', newline='') as g:
        h = csv.writer(g)
        for case in info['hits']['hits']:
            src = case['_source']
            # Pad to two entries so that a hit with a single CIK or name keeps
            # its first value (the original try/except blanked both)
            ciks = (list(src.get('ciks') or []) + ["", ""])[:2]
            names = (list(src.get('display_names') or []) + ["", ""])[:2]
            out = [case['_id'], ciks[0], ciks[1], src['root_form'], src['file_num'],
                   names[0], names[1], src['file_date'], src['adsh']]
            h.writerow(out)
    return totalnum

for year in range(2001, 2022):
    # The first page also reports the total number of hits for the year
    totalnum = getinfo(year, 1)
    print(totalnum)
    pagenum = math.ceil(totalnum / 100) + 1
    print("%s %s" % (year, pagenum))
    # Remaining pages: 2 .. ceil(totalnum / 100)
    for page in tqdm(range(2, pagenum)):
        getinfo(year, page)
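The paging arithmetic in the loop above is easy to get off by one, so here is a small standalone check. It is a sketch only: pages_for is a hypothetical helper, and the page size of 100 is taken from the request body shown earlier.

```python
import math

def pages_for(totalnum, per_page=100):
    """Page numbers needed to cover totalnum hits."""
    return list(range(1, math.ceil(totalnum / per_page) + 1))

print(pages_for(250))  # [1, 2, 3]
print(pages_for(100))  # [1]
print(pages_for(0))    # []
```

The loop above fetches page 1 first to learn the total, then iterates over the remaining entries of this list.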
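First, look at how an EDGAR file URL is built. A typical one is
https://www.sec.gov/Archives/edgar/data/0001762728/000092189521001513/sc13da412475002_05272021.htm
It has three parts:
the fixed prefix https://www.sec.gov/Archives/edgar/data/
the CIK of the receiving entity (column 2 of the CSV written in the previous step)
the id (e.g., 0001214659-21-005763:p519211sc13da7.htm; column 1 of the same CSV), with ":" replaced by "/" and the "-" in the accession number removed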
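The three parts can be put together as follows. This is a sketch under stated assumptions: filing_url is a hypothetical helper, and the example pairs the CIK and the id quoted above purely for illustration, so the two may not belong to the same filing. Hyphens are stripped only from the accession-number part before the colon, so a file name that itself contains "-" stays intact.

```python
BASE = "https://www.sec.gov/Archives/edgar/data/"

def filing_url(cik, file_id):
    """Build the document URL from a CIK and an id of the form 'accession:filename'."""
    accession, _, filename = file_id.partition(":")
    return "%s%s/%s/%s" % (BASE, cik, accession.replace("-", ""), filename)

print(filing_url("0001762728", "0001214659-21-005763:p519211sc13da7.htm"))
```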
import os
import pandas as pd
import json
import csv
from tqdm import tqdm
import requests

df = pd.read_csv("edgar13d09-16.csv", header=0)

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:88.0) Gecko/20100101 Firefox/88.0",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Language": "en-US,en;q=0.5",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"
}

# Load the cookies saved by the Selenium step
with open("edgarcookies.txt", "r") as f:
    cookies = json.loads(f.read())

session = requests.session()

for _, row in tqdm(df.iterrows(), total=len(df)):
    # Column 1 is the id "<accession number>:<file name>";
    # strip "-" from the accession number only, so hyphens in the file name survive
    adsh, _, docname = row.iloc[0].partition(":")
    url = "%s/%s" % (adsh.replace("-", ""), docname)
    # Column 2 is the CIK, column 8 the filing date
    cik = row.iloc[1]
    filedate = row.iloc[7].replace('/', '-')
    filename = "%s_%s" % (filedate, row.iloc[0])
    href = "https://www.sec.gov/Archives/edgar/data/%s/%s" % (cik, url)
    data = session.get(href, headers=headers, cookies=cookies)
    # Save the raw document under "<date>_<id>"
    with open("/Users/mengjiexu/Dropbox/edgar13d/%s" % filename, 'wb') as g:
        g.write(data.content)
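The local file name above combines the filing date and the id, and the id contains ":", which Windows does not allow in file names. If the script has to run there, one option is to sanitize the name first. This is a sketch only: local_name is a hypothetical helper, and the date in the example is illustrative rather than the actual filing date of that id.

```python
import re

def local_name(filedate, file_id):
    """Build '<date>_<id>' with characters that are invalid on Windows replaced by '_'."""
    raw = "%s_%s" % (filedate.replace("/", "-"), file_id)
    return re.sub(r'[<>:"/\\|?*]', "_", raw)

print(local_name("2021/05/27", "0001214659-21-005763:p519211sc13da7.htm"))
```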