Tidying references with regular expressions: matching in-text citations to the reference list

Published: 2021-03-11


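New! The lianxh command has been released:
Search 连享会 posts and Stata resources at any time. Install it with:
. ssc install lianxh
See the help file for usage details (it has a few surprises):
. help lianxh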


Author: Wanhai You (游万海), Fuzhou University
Email: hnstedu@163.com

1. Introduction

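A well-formed academic paper typically consists of body text, references, and figures and tables. The reference list is normally required to correspond one-to-one with the works cited in the body. A paper may cite anywhere from a dozen or two to more than a hundred works, so managing references is a significant part of the writing process.

Once a draft is finished, how can you quickly check that the works cited in the body correspond one-to-one with the reference list at the end? And how do editors, at the proofreading stage, spot missing or superfluous references so quickly?

Given a passage of body text with citations embedded in it, how can the citation information be extracted and turned into a list that can be compared, by author and year, against the reference list? The task breaks down into three steps: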

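  • First, extract the citations from the body text to obtain author names and years;
  • Second, isolate the reference list and extract author names and years from it;
  • Third, match the results of the first two steps. This article focuses on the first step.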
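The third step is not developed in this article, but once both lists exist in the same "Author (Year)" form it reduces to a set comparison. The sketch below assumes two hand-typed vectors with hypothetical membership; in practice both would come from the extraction steps.

```r
# Hypothetical keys in "Author (Year)" form. The membership of each vector
# is made up for illustration and does not describe the example paper.
cited_in_text <- c("Boyce (1994)", "Solt (2009)", "Ji and Zhang (2019)")
in_ref_list   <- c("Boyce (1994)", "Solt (2009)", "Tobler (1970)")

# Cited in the body but absent from the reference list.
missing_from_refs <- setdiff(cited_in_text, in_ref_list)

# Listed in the references but never cited in the body.
never_cited <- setdiff(in_ref_list, cited_in_text)

print(missing_from_refs)
print(never_cited)
```

The comparison is only as good as the normalization of the keys: spacing, punctuation, and "et al." spelling must agree on both sides before setdiff is meaningful.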

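Several reference managers have been developed to solve this problem, such as EndNote, NoteExpress, and Mendeley. Mastering one of them can make writing far more efficient. That efficiency is not free, however: it takes time to learn the tool.

This article offers another approach, based on regular expressions. After reading it you should be able to:

(1) understand the basic idea of regular expressions and where they apply;

(2) use regular expressions to process fairly simple unstructured data;

(3) read the contents of PDF and Word documents into R and extract content from them;

(4) check references quickly and effectively.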

Notes:
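If you already use a reference manager such as EndNote, you can skip the workflow described here and concentrate on the syntax and applications of regular expressions.

2. Regular expressions

A regular expression (regex for short) is, put simply, a set of conventional rules for describing patterns in text.

For example, to pull the numeric part 520 out of the text x = "I Love Stata&R 520", all we need to know is how a regular expression describes digits and non-digits; this description is the pattern argument of the command. Digits are usually written [0-9] or \\d, and non-digits [^0-9] or \\D (the doubled backslash is how a single backslash is written inside an R string).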

> x = "I Love Stata&R 520"
> gsub("\\D", "", x)
[1] "520"

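How should this command be read? gsub = global + substitute: substitute means replace and global means everywhere, so the call replaces every non-digit character with the empty string. This article does not cover the rules of regular expressions in depth; see earlier posts in this series for details.

Back to the purpose of this article, matching references: picking the cited works out of a long passage of text follows the same logic. For example: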
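To make the "global" part concrete, the short sketch below contrasts gsub with sub on the same string and shows that [^0-9] and \\D behave identically here. It uses only base R; the variable names are illustrative.

```r
x <- "I Love Stata&R 520"

# gsub() replaces every match; sub() replaces only the first one.
digits_only  <- gsub("\\D", "", x)     # drop every non-digit
first_only   <- sub("\\D", "", x)      # drop only the first non-digit, "I"
same_as_set  <- gsub("[^0-9]", "", x)  # character-class spelling of \\D
letters_kept <- gsub("\\d", "", x)     # the opposite: drop the digits

print(digits_only)
print(first_only)
print(letters_kept)
```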

As the emerging economic organization, a large body of studies have focused on the Belt and Road initiative countries. For instance, as the initiator of Belt and Road Initiative, China receives the extensive concern ==(Alam et al., 2016; Ahmad et al., 2018; Zhang et al., 2019; Ji et al., 2019)==. ==Ji and Zhang (2019)== explore the influence of financial development on renewable energy growth, and demonstrate that financial development is an important determinant of renewable energy growth. In terms of green innovation and energy environment, ==Duan et al. (2018)== investigate energy investment risk of the Belt and Road initiative countries.

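The goal is to extract Alam et al., 2016; Ahmad et al., 2018; Zhang et al., 2019; Ji et al., 2019, as well as Ji and Zhang (2019) and Duan et al. (2018), from this passage. What remains is to define regular-expression rules that capture these features.

3. Extracting in-text citations

3.1 Basic idea and steps

Under common citation conventions, in-text citations in an academic paper take one of the following forms:

  • First: A (2020), a single author name plus year;
  • Second: A and B (2020), two author names plus year;
  • Third: A, B, and C (2020), three author names plus year;
  • Fourth: A et al. (2020), more than three author names plus year, where et al. means "and others";
  • Fifth: (A and B, 2005; A et al., 2013; B et al., 2015), a form that usually appears at the end of a sentence as supporting evidence for the claim it makes.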
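The five forms can be written down as anchored patterns and used to classify a single citation string. The sketch below is one possible encoding, not the extraction code used later; the function name classify_citation and the names Smith, Jones, and Brown are hypothetical. It assumes plain surnames of the shape "one capital letter followed by lowercase letters", which is also the assumption made by the patterns in Section 3.3.

```r
classify_citation <- function(x) {
  name <- "[A-Z][a-z]+"      # a plain surname such as "Boyce"
  year <- "\\([0-9]{4}\\)"   # a year in literal parentheses
  patterns <- c(
    single        = paste0("^", name, " ", year, "$"),
    two           = paste0("^", name, " and ", name, " ", year, "$"),
    three         = paste0("^", name, ", ", name, ", and ", name, " ", year, "$"),
    et_al         = paste0("^", name, " et al\\. ", year, "$"),
    parenthetical = "^\\(.*[0-9]{4}\\)$"
  )
  # Return the label of the first pattern that matches.
  for (form in names(patterns)) {
    if (grepl(patterns[[form]], x, perl = TRUE)) return(form)
  }
  "unknown"
}

examples <- c("Boyce (1994)",
              "Ji and Zhang (2019)",
              "Smith, Jones, and Brown (2010)",   # hypothetical authors
              "Duan et al. (2018)",
              "(Alam et al., 2016; Ji et al., 2019)",
              "Balado-Naves et al. (2018)")
forms <- vapply(examples, classify_citation, character(1), USE.NAMES = FALSE)
print(forms)
```

The last example comes back as "unknown" because the hyphen falls outside [A-Z][a-z]+, a limitation worth keeping in mind when reading the results in Section 3.3.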

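The rest of this article works through a concrete example in R, using the following paper: You, W., Li, Y., Guo, P., & Guo, Y. (2020). Income inequality and CO2 emissions in belt and road initiative countries: the role of democracy. Environmental Science and Pollution Research, 27(6), 6278-6299.

The main steps are:

  • Step 1: download the paper (PDF or Word format). In R, the download function in the textreadr package can be used. Note that access requires a subscription to the database; otherwise the author suggests sci-hub.
  • Step 2: read the document into R. PDF documents can be read with the pdf_text function in the pdftools package, and Word documents with the read_docx function in the qdapTools package. This example uses a Word document.
  • Step 3: use grep, gsub, and str_extract_all in R to search, replace, and extract.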
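Step 2 can be wrapped in a small helper that picks the reader from the file extension. This is a sketch under assumptions: the function name read_paper_text is hypothetical, the pdftools and qdapTools packages must be installed before the corresponding branch is used, and splitting PDF pages on newlines is a choice made here so that both branches return a vector of text lines.

```r
read_paper_text <- function(path) {
  ext <- tolower(tools::file_ext(path))
  if (ext == "pdf") {
    # pdf_text() returns one string per page; split pages into lines.
    pages <- pdftools::pdf_text(path)
    unlist(strsplit(pages, "\n", fixed = TRUE))
  } else if (ext == "docx") {
    # read_docx() returns a character vector of paragraphs.
    qdapTools::read_docx(path)
  } else {
    stop("Unsupported file type: ", ext)
  }
}

# Usage (hypothetical file name):
# text_cont <- read_paper_text("paper.docx")
```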

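3.2 Preparation

install.packages("stringr") ## only needed on first run
install.packages("qdapTools") ## only needed on first run
library(stringr)  ## load the package
setwd("G:\\Stata培训资料\\TA2021") ## set the working directory
list_file_word = list.files(pattern = "\\.docx?$") ## list all Word files in that directory (pattern is a regex, not a glob)
text_cont = qdapTools::read_docx(list_file_word[2]) ## read the second listed document into R
head(text_cont) ## print the first 6 lines
match_result1 = grep('([0-9]{4})', text_cont, value=T) ## keep lines containing a four-digit run; the parentheses here are a regex group, not literal characters, so lines with (2001) and lines with (Solt, 2009) are both kept
head(match_result1)
match_result = paste(match_result1, collapse=", ") ## join the matched lines into one string for later steps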
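It may help to see what the grep filter above does and does not require. In a regular expression, bare parentheses form a group, while \\( and \\) match literal parentheses. The sketch below applies both spellings to four made-up lines (not taken from the paper).

```r
lines <- c("as argued by Boyce (1994)",
           "(Solt, 2009) reports similar data",
           "a sample of 1998 observations",
           "no year on this line")

# Group: matches any line with four consecutive digits.
any_four_digits <- grep("([0-9]{4})", lines, value = TRUE)

# Literal parentheses: matches only "(dddd)".
year_in_parens <- grep("\\([0-9]{4}\\)", lines, value = TRUE)

print(any_four_digits)
print(year_in_parens)
```

The looser filter is the one that suits this workflow, because the parenthetical form (A, 2005) must survive to the fifth case; the cost is that lines containing other four-digit numbers are kept as well.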

3.3 Matching each case

Case 1: A (2020)

re1 = str_extract_all(match_result, "(?<!and )([A-Z]{1}[a-z]+ \\([0-9]{4}\\))")
re1 = unlist(re1)

Result:

> re1
[1] "Boyce (1994)"     "Kashwan (2017)"   "Kashwan (2017)"
 [4] "Boyce (1994)"     "Boyce (1994)"     "Magnani (2000)"
 [7] "Drabo (2011)"     "Scruggs (1998)"   "Borghesi (2006)"
[10] "Romuald (2011)"   "Index (2005)"     "Bae (2018)"
[13] "Maddison (2006)"  "Maddison (2006)"  "Elhorst (2014)"
[16] "Elhorst (2010)"   "Baltagi (2008)"   "Midlarsky (1998)"
[19] "Maddison (2006)"  "Maddison (2006)"
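The (?<!and ) at the start of the Case 1 pattern is a negative lookbehind: a name is accepted only if it is not immediately preceded by "and ", which keeps the second author of a two-author citation from being counted as a single-author work. The sketch below shows the effect on a made-up sentence, using base R's regmatches and gregexpr with perl = TRUE in place of str_extract_all so that it runs without extra packages.

```r
extract_all <- function(text, pattern) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# Made-up sentence for illustration.
text <- "Boyce (1994) and later Torras and Boyce (1998) use the Gini Index (2005)."

with_lookbehind    <- extract_all(text, "(?<!and )[A-Z][a-z]+ \\([0-9]{4}\\)")
without_lookbehind <- extract_all(text, "[A-Z][a-z]+ \\([0-9]{4}\\)")

print(with_lookbehind)
print(without_lookbehind)
```

The lookbehind removes the spurious "Boyce (1998)", but any capitalized word followed by a year in parentheses still matches, which is how an entry such as "Index (2005)" can end up in the result above.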

Case 2: A and B (2020)

re2 = str_extract_all(match_result, "(?<= )([A-Z]{1}[a-z]+ and [A-Z]{1}[a-z]+ \\([0-9]{4}\\))")
re2 = unlist(re2)

Result:

> re2
 [1] "Torras and Boyce (1998)"      "Torras and Boyce (1998)"
 [3] "Torras and Boyce (1998)"      "Eriksson and Persson (2003)"
 [5] "Baek and Gweisah (2013)"      "Kasuga and Takaya (2017)"
 [7] "Coondoo and Dinda (2008)"     "Ji and Zhang (2019)"
 [9] "Hadenius and Teorell (2005)"  "Selden and Song (1994)"
[11] "Georgiev and Mihaylov (2015)"

Case 3: A, B, and C (2020)

re3 = str_extract_all(match_result, "([A-Z]{1}[a-z]+, .+, and [A-Z]{1}[a-z]+ \\([0-9]{4}\\))")
re3 = unlist(re3)

Result:

> re3
[1] "Boyce, Coondoo, and Dinda (2008)"
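One thing to watch in the Case 3 pattern is the .+ in the middle: it is greedy and matches any characters, so on a long string it can stretch from the first "Name, " to a much later ", and Name (year)". The sketch below contrasts it with a tighter variant that allows exactly one name in the middle. The sentence and the names Smith, Jones, and Brown are made up, and the tighter pattern is a suggestion, not the pattern used for the result above.

```r
extract_all <- function(text, pattern) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# Made-up sentence; Smith, Jones, and Brown are hypothetical authors.
text <- "Boyce, 1994 is cited first. Later, Smith, Jones, and Brown (2010) extend it."

loose <- extract_all(text, "[A-Z][a-z]+, .+, and [A-Z][a-z]+ \\([0-9]{4}\\)")
tight <- extract_all(text, "[A-Z][a-z]+, [A-Z][a-z]+, and [A-Z][a-z]+ \\([0-9]{4}\\)")

print(loose)
print(tight)
```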

Case 4: A et al. (2020)

re4 = str_extract_all(match_result, "([A-Z]{1}[a-z]+ et al\\. \\([0-9]{4}\\))")
re4 = unlist(re4)

Result:

> re4
 [1] "Ravallion et al. (2000)"   "Schmalensee et al. (1998)"
 [3] "Grunewald et al. (2012)"   "Hao et al. (2016)"
 [5] "Baloch et al. (2018)"      "Ravallion et al. (2000)"
 [7] "Heerink et al. (2001)"     "Grunewald et al. (2017)"
 [9] "Bernard et al. (2014)"     "Duan et al. (2018)"
[11] "Rauf et al. (2018)"        "Fan et al. (2019)"
[13] "Grunewald et al. (2017)"   "Kotschy et al. (2017)"
[15] "Ravallion et al. (2000)"   "Alam et al. (2007)"
[17] "Ravallion et al. (2000)"
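The name part [A-Z][a-z]+ assumes a plain surname. Hyphenated names and names with an internal capital are cut at the hyphen or the capital. The sketch below compares the name pattern used above with a wider character class; the sentence is made up, "LeSage et al. (2009)" is a hypothetical citation, and the wider class is an assumption that still ignores accented letters.

```r
extract_all <- function(text, pattern) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# Made-up sentence; "LeSage et al. (2009)" is hypothetical.
text <- "Balado-Naves et al. (2018) and Duan et al. (2018) differ from LeSage et al. (2009)."

plain_names <- extract_all(text, "[A-Z][a-z]+ et al\\. \\([0-9]{4}\\)")
wider_names <- extract_all(text, "[A-Z][A-Za-z-]+ et al\\. \\([0-9]{4}\\)")

print(plain_names)
print(wider_names)
```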

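Case 5: (A and B, 2005; A et al., 2013; B et al., 2015)

re51 = unlist(str_extract_all(match_result, "\\([^0-9].+?, [0-9]{4}\\)")) ## lazy match; compare the result with and without the ?
re52 = unlist(str_extract_all(re51, "\\(.+?\\)")) ## lazy match; compare the result with and without the ?
re53 = grep("[0-9]{4}", re52, value=T) ## keep groups that contain a year
re54 = grep("\\([a-zA-Z]", re53, value=T) ## keep groups that start with a letter
re55 = gsub('(e\\.g\\., |e\\.g\\. )', "", re54) ## drop "e.g." (dots escaped so they match literally)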

Result:

> re55
 [1] "(Boyce, 1994; Coondoo and Dinda, 2008; Wolde-Rufael and Idowu, 2017)"
 [2] "(Schor, 1998)"
 [3] "(Bowles and Park, 2005; Knight et al., 2013; Fitzgerald et al., 2015)"
 [4] "(Boyce, 1994; Ravallion et al., 2000; Wolde-Rufael and Idowu, 2017; Liu et al., 2019)"
 [5] "(Ma et al., 2016; Balado-Naves et al., 2018; You and Lv, 2018)"
 [6] "(Anselin and Rey, 2010)"
 [7] "(Romuald, 2011; Goel et al., 2013; You et al., 2015)"
 [8] "(Bernard et al., 2014; Kashwan, 2017)"
 [9] "(Zhang et al., 2017; Rauf et al., 2018; Ahmad et al., 2018; Saud et al., 2018; Hafeez et al., 2018; Fan et al., 2019)"
[10] "(Tobler, 1970)"
[11] "(Shafik, 1994; Coondoo and Dinda, 2002; Sheldon, 2017; Balado-Naves et al., 2018)"
[12] "(Esteve and Tamarit, 2012; Zanin and Marra, 2012)"
[13] "(Kasuga and Takaya, 2017)"
[14] "(Liu et al., 2018)"
[15] "(Rauf et al., 2018; Saud et al., 2018; Hafeez et al., 2018; Ahmad et al., 2018; Fan et al., 2019)"
[16] "(Alam et al., 2016; Ahmad et al., 2018; Zhang et al., 2019; Ji et al., 2019)"
[17] "(Alam et al., 2016; Rauf et al., 2018; Saud et al., 2018; Hafeez et al., 2018; Ahmad et al., 2018)"
[18] "(Balado-Naves et al. 2018; You and Lv, 2018)"
[19] "(Rios and Gianmoena, 2018)"
[20] "(Popp, 2001; Ahmed and Ozturk, 2018; Wurlod and Noailly, 2018)"
[21] "(i.e. Holtz-Eakin and Selden, 1995; Azomahou et al., 2006; Martínez-Zarzoso and Maruotti, 2011)"
[22] "(Solt, 2009)"
[23] "(Solt, 2009)"
[24] "(Solt, 2009)"
[25] "(Behringer and Treeck, 2018; Chui and Lee, 2019; Krieger and Meierrieks, 2019)"
[26] "(Högström, 2013)"
[27] "(Winslow, 2005; Arvin and Lew, 2011; Bailer and Weiler, 2015; Wang et al., 2018; Gill et al., 2019)"
[28] "(BenYishay and Betancourt, 2014)"
[29] "(Solt, 2009)"
[30] "(He et al., 2007)"
[31] "(Anselin, 1988)"
[32] "(LeSage and Pace, 2009)"
[33] "(LeSage and Fischer, 2008; Corrado and Fingleton, 2012)"
[34] "(Maddison, 2006)"
[35] "(Anselin and Florax, 1995)"
[36] "(Levin et al. 2002)"
[37] "(Im et al. 2003)"
[38] "(Pesaran, 2007)"
[39] "(Levin et al, 2002)"
[40] "(LeSage and Pace, 2012)"
[41] "(Hazır et al., 2018)"
[42] "(LeSage and Dominguez, 2012)"
[43] "(Kelejian and Prucha, 2010)"
[44] "(Anselin, 1988; Elhorst, 2001)"
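The code comments for this case ask what difference the ? makes. After a quantifier such as +, a ? switches it from greedy (match as much as possible) to lazy (match as little as possible). The sketch below runs both versions on a made-up sentence containing two parenthetical citations, again with base R functions so that no package is needed.

```r
extract_all <- function(text, pattern) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# Made-up sentence with two separate parenthetical citations.
text <- "Growth matters (Schor, 1998) and so does distance (Tobler, 1970)."

lazy   <- extract_all(text, "\\([^0-9].+?, [0-9]{4}\\)")  # with ?
greedy <- extract_all(text, "\\([^0-9].+, [0-9]{4}\\)")   # without ?

print(lazy)
print(greedy)
```

With the lazy version each citation group is returned on its own; the greedy version returns a single match running from the first opening parenthesis to the last closing one.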

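These results need some further processing:

  • First, where one pair of parentheses holds several works, split on the semicolon with the string-splitting function strsplit;
  • Second, remove the surplus parentheses () and the "i.e.";
  • Third, normalize the forms "et al.," and "et al," to "et al.".
myfun <- function(x) strsplit(x, ";") ## split on semicolons
re56 = unlist(lapply(re55, myfun))
re57 = str_trim(gsub("(\\(|\\))", "", re56)) ## drop parentheses and trim whitespace
uk = gsub("\\., ", "\\.", re57) ## "et al., 2016" becomes "et al.2016"
uk1 = gsub("(.+)([0-9]{4})", "\\1(\\2)", uk) ## wrap the year in parentheses
uk2 = gsub(", ", "", uk1) ## remove the remaining comma between author and year
uy = gsub("i.e. ", "", uk2) ## drop "i.e. "; caution: the unescaped dots match any character, so "i et " in "Ji et al." and "iler " in "Bailer and" are removed too (entries 44 and 68 in the output)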

Result:

> uy
 [1] "Boyce(1994)"
 [2] "Coondoo and Dinda(2008)"
 [3] "Wolde-Rufael and Idowu(2017)"
 [4] "Schor(1998)"
 [5] "Bowles and Park(2005)"
 [6] "Knight et al.(2013)"
 [7] "Fitzgerald et al.(2015)"
 [8] "Boyce(1994)"
 [9] "Ravallion et al.(2000)"
[10] "Wolde-Rufael and Idowu(2017)"
[11] "Liu et al.(2019)"
[12] "Ma et al.(2016)"
[13] "Balado-Naves et al.(2018)"
[14] "You and Lv(2018)"
[15] "Anselin and Rey(2010)"
[16] "Romuald(2011)"
[17] "Goel et al.(2013)"
[18] "You et al.(2015)"
[19] "Bernard et al.(2014)"
[20] "Kashwan(2017)"
[21] "Zhang et al.(2017)"
[22] "Rauf et al.(2018)"
[23] "Ahmad et al.(2018)"
[24] "Saud et al.(2018)"
[25] "Hafeez et al.(2018)"
[26] "Fan et al.(2019)"
[27] "Tobler(1970)"
[28] "Shafik(1994)"
[29] "Coondoo and Dinda(2002)"
[30] "Sheldon(2017)"
[31] "Balado-Naves et al.(2018)"
[32] "Esteve and Tamarit(2012)"
[33] "Zanin and Marra(2012)"
[34] "Kasuga and Takaya(2017)"
[35] "Liu et al.(2018)"
[36] "Rauf et al.(2018)"
[37] "Saud et al.(2018)"
[38] "Hafeez et al.(2018)"
[39] "Ahmad et al.(2018)"
[40] "Fan et al.(2019)"
[41] "Alam et al.(2016)"
[42] "Ahmad et al.(2018)"
[43] "Zhang et al.(2019)"
[44] "Jal.(2019)"
[45] "Alam et al.(2016)"
[46] "Rauf et al.(2018)"
[47] "Saud et al.(2018)"
[48] "Hafeez et al.(2018)"
[49] "Ahmad et al.(2018)"
[50] "Balado-Naves et al. (2018)"
[51] "You and Lv(2018)"
[52] "Rios and Gianmoena(2018)"
[53] "Popp(2001)"
[54] "Ahmed and Ozturk(2018)"
[55] "Wurlod and Noailly(2018)"
[56] "Holtz-Eakin and Selden(1995)"
[57] "Azomahou et al.(2006)"
[58] "Martínez-Zarzoso and Maruotti(2011)"
[59] "Solt(2009)"
[60] "Solt(2009)"
[61] "Solt(2009)"
[62] "Behringer and Treeck(2018)"
[63] "Chui and Lee(2019)"
[64] "Krieger and Meierrieks(2019)"
[65] "Högström(2013)"
[66] "Winslow(2005)"
[67] "Arvin and Lew(2011)"
[68] "Baand Weiler(2015)"
[69] "Wang et al.(2018)"
[70] "Gill et al.(2019)"
[71] "BenYishay and Betancourt(2014)"
[72] "Solt(2009)"
[73] "He et al.(2007)"
[74] "Anselin(1988)"
[75] "LeSage and Pace(2009)"
[76] "LeSage and Fischer(2008)"
[77] "Corrado and Fingleton(2012)"
[78] "Maddison(2006)"
[79] "Anselin and Florax(1995)"
[80] "Levin et al. (2002)"
[81] "Im et al. (2003)"
[82] "Pesaran(2007)"
[83] "Levin et al(2002)"
[84] "LeSage and Pace(2012)"
[85] "Hazır et al.(2018)"
[86] "LeSage and Dominguez(2012)"
[87] "Kelejian and Prucha(2010)"
[88] "Anselin(1988)"
[89] "Elhorst(2001)"
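Entries 44 and 68 above, "Jal.(2019)" and "Baand Weiler(2015)", are worth a closer look. They are what "Ji et al.(2019)" and "Bailer and Weiler(2015)" become when the pattern "i.e. " is applied, because an unescaped dot matches any character: "i et " and "iler " both fit "i", any character, "e", any character, space. The sketch below reproduces the effect and shows two ways to match the text literally; the variable names are illustrative.

```r
entries <- c("Ji et al.(2019)",
             "Bailer and Weiler(2015)",
             "i.e. Holtz-Eakin and Selden(1995)")

unescaped <- gsub("i.e. ", "", entries)                # dot = any character
escaped   <- gsub("i\\.e\\. ", "", entries)            # dot = literal "."
literal   <- gsub("i.e. ", "", entries, fixed = TRUE)  # no regex at all

print(unescaped)
print(escaped)
```

Either the escaped pattern or fixed = TRUE leaves the first two entries intact and still strips the leading "i.e. " from the third.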

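3.4 Combining results and adjusting the format

Finally, the matches are merged, duplicate works are removed, spaces are added or removed to fit the citation format, and the list is sorted.

result = c(re1, re2, re3, re4, uy) ## combine all five cases
result1 = gsub("\\.(?=\\()", "\\. ", result, perl=T) ## "et al.(2019)" becomes "et al. (2019)"
result2 = gsub("(.+)(?=\\()", "\\1 \\2", result1, perl=T) ## add a space before "("
ux = unique(result2)
ux = sort(ux)
uz = str_squish(ux) ## collapse repeated spaces
uk = unique(uz) ## deduplicate again after squishing

Result: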

> uk
 [1] "Ahmad et al. (2018)"
 [2] "Ahmed and Ozturk (2018)"
 [3] "Alam et al. (2007)"
 [4] "Alam et al. (2016)"
 [5] "Anselin (1988)"
 [6] "Anselin and Florax (1995)"
 [7] "Anselin and Rey (2010)"
 [8] "Arvin and Lew (2011)"
 [9] "Azomahou et al. (2006)"
[10] "Baand Weiler (2015)"
[11] "Bae (2018)"
[12] "Baek and Gweisah (2013)"
[13] "Balado-Naves et al. (2018)"
[14] "Baloch et al. (2018)"
[15] "Baltagi (2008)"
[16] "Behringer and Treeck (2018)"
[17] "BenYishay and Betancourt (2014)"
[18] "Bernard et al. (2014)"
[19] "Borghesi (2006)"
[20] "Bowles and Park (2005)"
[21] "Boyce (1994)"
[22] "Boyce, Coondoo, and Dinda (2008)"
[23] "Chui and Lee (2019)"
[24] "Coondoo and Dinda (2008)"
[25] "Coondoo and Dinda (2002)"
[26] "Corrado and Fingleton (2012)"
[27] "Drabo (2011)"
[28] "Duan et al. (2018)"
[29] "Elhorst (2010)"
[30] "Elhorst (2014)"
[31] "Elhorst (2001)"
[32] "Eriksson and Persson (2003)"
[33] "Esteve and Tamarit (2012)"
[34] "Fan et al. (2019)"
[35] "Fitzgerald et al. (2015)"
[36] "Georgiev and Mihaylov (2015)"
[37] "Gill et al. (2019)"
[38] "Goel et al. (2013)"
[39] "Grunewald et al. (2012)"
[40] "Grunewald et al. (2017)"
[41] "Högström (2013)"
[42] "Hadenius and Teorell (2005)"
[43] "Hafeez et al. (2018)"
[44] "Hao et al. (2016)"
[45] "Hazır et al. (2018)"
[46] "He et al. (2007)"
[47] "Heerink et al. (2001)"
[48] "Holtz-Eakin and Selden (1995)"
[49] "Im et al. (2003)"
[50] "Index (2005)"
[51] "Jal. (2019)"
[52] "Ji and Zhang (2019)"
[53] "Kashwan (2017)"
[54] "Kasuga and Takaya (2017)"
[55] "Kelejian and Prucha (2010)"
[56] "Knight et al. (2013)"
[57] "Kotschy et al. (2017)"
[58] "Krieger and Meierrieks (2019)"
[59] "LeSage and Dominguez (2012)"
[60] "LeSage and Fischer (2008)"
[61] "LeSage and Pace (2009)"
[62] "LeSage and Pace (2012)"
[63] "Levin et al (2002)"
[64] "Levin et al. (2002)"
[65] "Liu et al. (2018)"
[66] "Liu et al. (2019)"
[67] "Ma et al. (2016)"
[68] "Maddison (2006)"
[69] "Magnani (2000)"
[70] "Martínez-Zarzoso and Maruotti (2011)"
[71] "Midlarsky (1998)"
[72] "Pesaran (2007)"
[73] "Popp (2001)"
[74] "Rauf et al. (2018)"
[75] "Ravallion et al. (2000)"
[76] "Rios and Gianmoena (2018)"
[77] "Romuald (2011)"
[78] "Saud et al. (2018)"
[79] "Schmalensee et al. (1998)"
[80] "Schor (1998)"
[81] "Scruggs (1998)"
[82] "Selden and Song (1994)"
[83] "Shafik (1994)"
[84] "Sheldon (2017)"
[85] "Solt (2009)"
[86] "Tobler (1970)"
[87] "Torras and Boyce (1998)"
[88] "Wang et al. (2018)"
[89] "Winslow (2005)"
[90] "Wolde-Rufael and Idowu (2017)"
[91] "Wurlod and Noailly (2018)"
[92] "You and Lv (2018)"
[93] "You et al. (2015)"
[94] "Zanin and Marra (2012)"
[95] "Zhang et al. (2017)"
[96] "Zhang et al. (2019)"
>
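A few entries in this list are out of order, for example "Coondoo and Dinda (2008)" before "Coondoo and Dinda (2002)" and "Elhorst (2001)" after "Elhorst (2014)". One plausible explanation is that sorting happens before str_squish, at a point where some entries still contain a double space before the parenthesis. The sketch below illustrates the idea with made-up input. It uses a base R stand-in for str_squish and method = "radix" so that the ordering does not depend on the locale; both are choices made for this illustration.

```r
# Made-up input: some keys carry a double space, as they might after a
# space is added in front of "(" to entries that already had one.
raw <- c("Elhorst  (2014)", "Elhorst (2001)", "Solt (2009)",
         "Solt  (2009)", "Elhorst  (2010)")

# Base R stand-in for stringr::str_squish().
squish <- function(x) trimws(gsub("\\s+", " ", x, perl = TRUE))

# Sort first, squish later: the double-space entries sort ahead.
sort_then_squish <- unique(squish(sort(raw, method = "radix")))

# Squish first, then deduplicate, then sort.
squish_then_sort <- sort(unique(squish(raw)), method = "radix")

print(sort_then_squish)
print(squish_then_sort)
```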

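Comparing this list with the paper's reference list shows that the code above recovers most of the cited works, but some problems remain. For example, four works by Elhorst are actually cited, whereas the code finds only three. Checking against the original text shows that, besides the five citation forms above, the paper also uses the form A (2001; 2010; 2020). How would you define this form as a regular expression? That question is left to the reader.

Easter egg: help me find the bugs. Which cases does this article still fail to handle? Comments are welcome, and I will keep improving the program. Here is one to start with:

1. Maddison (2006) and Apergis (2016) are two separate works, but the Case 1 pattern rejects any name preceded by "and ", so Apergis (2016) is not recognized.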
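For readers who want to check their answer, here is one possible pattern for the multi-year form, followed by a helper that expands a hit into one key per year. It is a sketch, not the author's solution: the sentence is made up, "Author" is the placeholder name from the form A (2001; 2010; 2020), and the helper name expand_years is hypothetical.

```r
extract_all <- function(text, pattern) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# Made-up sentence using a placeholder author name.
text <- "This follows Author (2001; 2010; 2020) and Solt (2009)."

# A name, "(", a year, one or more "; year" repeats, then ")".
multi_year <- "[A-Z][a-z]+ \\([0-9]{4}(?:; ?[0-9]{4})+\\)"
hits <- extract_all(text, multi_year)

# Expand "Author (2001; 2010; 2020)" into one key per year.
expand_years <- function(hit) {
  author <- sub(" \\(.*$", "", hit)
  years  <- regmatches(hit, gregexpr("[0-9]{4}", hit))[[1]]
  paste0(author, " (", years, ")")
}
expanded <- unlist(lapply(hits, expand_years))

print(hits)
print(expanded)
```

Because the pattern requires at least one "; year" repeat, ordinary single-year citations such as "Solt (2009)" are left to the Case 1 pattern.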
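One way to handle this case, offered as a sketch and not as the author's fix, is to make the lookbehind stricter: reject a name only when "and " is itself preceded by a lowercase letter, as in "Torras and Boyce", and accept it when "and " follows a closing parenthesis, as in "(2006) and Apergis". The sentence below is made up around the example from the text.

```r
extract_all <- function(text, pattern) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# Made-up sentence built around the example above.
text <- "Maddison (2006) and Apergis (2016) disagree with Torras and Boyce (1998)."

original <- extract_all(text, "(?<!and )[A-Z][a-z]+ \\([0-9]{4}\\)")
stricter <- extract_all(text, "(?<![a-z] and )[A-Z][a-z]+ \\([0-9]{4}\\)")

print(original)
print(stricter)
```

The stricter lookbehind inherits the other limits of the name pattern: a first author whose name does not end in a lowercase ASCII letter would not be recognized as the first half of a two-author citation.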

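4. Summary

  • Regular expressions are not strictly required, and some complex problems can also be handled by combining several string functions, but knowing regular expressions often pays off many times over, especially with unstructured data.
  • Regular expressions are less complicated than they seem: they are simply a set of conventional rules.
  • This article uses R for the demonstration. Interested readers could write a Stata version, or even package it as a general-purpose ado program. For an introduction to regular expressions in Stata, see the related posts.

5. Appendix: R code used in this article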

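install.packages("stringr") ## only needed on first run
install.packages("qdapTools") ## only needed on first run
library(stringr)  ## load the package
setwd("G:\\Stata培训资料\\TA2021") ## set the working directory
list_file_word = list.files(pattern = "\\.docx?$") ## list all Word files in that directory (pattern is a regex, not a glob)
text_cont = qdapTools::read_docx(list_file_word[2]) ## read the second listed document into R
head(text_cont) ## print the first 6 lines
match_result1 = grep('([0-9]{4})', text_cont, value=T) ## keep lines containing a four-digit run; the parentheses here are a regex group, not literal characters
head(match_result1)
match_result = paste(match_result1, collapse=", ") ## join the matched lines into one string for later steps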

re1 = unlist(str_extract_all(match_result, "(?<!and )([A-Z]{1}[a-z]+ \\([0-9]{4}\\))"))
re2 = unlist(str_extract_all(match_result, "(?<= )([A-Z]{1}[a-z]+ and [A-Z]{1}[a-z]+ \\([0-9]{4}\\))"))
re3 = unlist(str_extract_all(match_result, "([A-Z]{1}[a-z]+, .+, and [A-Z]{1}[a-z]+ \\([0-9]{4}\\))"))
re4 = unlist(str_extract_all(match_result, "([A-Z]{1}[a-z]+ et al\\. \\([0-9]{4}\\))"))

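re51 = unlist(str_extract_all(match_result, "\\([^0-9].+?, [0-9]{4}\\)")) ## lazy match; compare the result with and without the ?
re52 = unlist(str_extract_all(re51, "\\(.+?\\)")) ## lazy match; compare the result with and without the ?
re53 = grep("[0-9]{4}", re52, value=T) ## keep groups that contain a year
re54 = grep("\\([a-zA-Z]", re53, value=T) ## keep groups that start with a letter
re55 = gsub('(e\\.g\\., |e\\.g\\. )', "", re54) ## drop "e.g." (dots escaped so they match literally)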

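myfun <- function(x) strsplit(x, ";") ## split on semicolons
re56 = unlist(lapply(re55, myfun))
re57 = str_trim(gsub("(\\(|\\))", "", re56)) ## drop parentheses and trim whitespace
uk = gsub("\\., ", "\\.", re57) ## "et al., 2016" becomes "et al.2016"
uk1 = gsub("(.+)([0-9]{4})", "\\1(\\2)", uk) ## wrap the year in parentheses
uk2 = gsub(", ", "", uk1) ## remove the remaining comma between author and year
uy = gsub("i.e. ", "", uk2) ## drop "i.e. "; caution: the unescaped dots match any character, so "i et " in "Ji et al." and "iler " in "Bailer and" are removed too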

result = c(re1, re2, re3, re4, uy)
result1 = gsub("\\.(?=\\()", "\\. ", result, perl=T)
result2 = gsub("(.+)(?=\\()", "\\1 \\2", result1, perl=T)
ux = unique(result2)
ux = sort(ux)
uz = str_squish(ux)
uk = unique(uz)
uk

6. Related posts

Note: the Stata command that produces the list of related posts is:
lianxh 正则 文本分析
To install the latest version of the lianxh command:
ssc install lianxh, replace
