Stata: A Regular Expression Tutorial

Published: 2022-05-30

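Author: 梁淑珍 (Huaqiao University)
Email: 13514084150@163.com

Editor's note: This article is largely an abridged translation of the following source, to which we are grateful.
Source: Asjad Naqvi, 2022, Blog, Regular expressions (regex) in Stata.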



1. Introduction

The examples in this article require Stata 14 or later. Start by typing:

. help regex

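As the help file shows, the regex functions work only with plain ASCII strings, while Unicode strings require the ustrregex functions. This article is based mainly on ustrregex, and we strongly recommend that readers learn and use the ustr versions. Regular expressions are best learned through hands-on practice, so we suggest keeping Stata's Data Editor window open while working through the examples to watch the results appear.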

2. Basics

Let us start with a simple example:

. clear
. set obs 1
. gen x = ""
. replace x = "abcdefghiJKLMN 11123" in 1
. list

     +----------------------+
     |                    x |
     |----------------------|
  1. | abcdefghiJKLMN 11123 |
     +----------------------+

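This string contains lowercase letters, uppercase letters, and digits. The operations that follow all act on the variable x.

ustrregexm, the match function: first, write a simple expression to search the string.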

. gen t1 = ustrregexm(x, "def")
. list 

     +---------------------------+
     |                    x   t1 |
     |---------------------------|
  1. | abcdefghiJKLMN 11123    1 |
     +---------------------------+

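This line asks Stata to look for "def" in the variable x, returning 1 if the match succeeds and 0 otherwise. In the name ustrregexm, u stands for Unicode, str for string, regex for regular expression, and m for match.

ustrregexs, the extraction function: Stata stores the result of a match, and ustrregexs is normally used together with ustrregexm to retrieve it.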

. gen t2 = ustrregexs(0) if ustrregexm(x, "def")
. list

     +---------------------------------+
     |                    x   t1    t2 |
     |---------------------------------|
  1. | abcdefghiJKLMN 11123    1   def |
     +---------------------------------+

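The s in ustrregexs stands for sub-expressions, and the function can be thought of as "extracting" a sub-expression. ustrregexs is usually paired with ustrregexm: the latter stores the text that was matched (m), and the former retrieves (s) it. ustrregexs(0) returns the entire matched text. The argument can also be 1, 2, 3, ..., in which case it gives the position of the parenthesized sub-expression to return. This can be hard to grasp at first, so compare the following lines: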

. gen t2_0= ustrregexs(1) if ustrregexm(x, "(def)")
. gen t2_1= ustrregexs(1) if ustrregexm(x, "(d)ef")
. gen t2_2= ustrregexs(1) if ustrregexm(x, "d(ef)")
. gen t2_3= ustrregexs(2) if ustrregexm(x, "(d)(ef)")
. list x t2_0 t2_1 t2_2 t2_3

     +--------------------------------------------------+
     |                    x   t2_0   t2_1   t2_2   t2_3 |
     |--------------------------------------------------|
  1. | abcdefghiJKLMN 11123    def      d     ef     ef |
     +--------------------------------------------------+

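Put simply, wrap the part to be extracted in parentheses, and the number selects which pair of parentheses to return. For example, t2_2 returns "ef", the content of the first pair of parentheses. Sub-expressions are discussed further below. Next, extract the first run of lowercase letters, uppercase letters, or digits: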

. gen t3 = ustrregexs(0) if ustrregexm(x, "[a-z]+")
. gen t4 = ustrregexs(0) if ustrregexm(x, "[A-Z]+")
. gen t5 = ustrregexs(0) if ustrregexm(x, "[0-9]+")  
. gen t6 = ustrregexs(0) if ustrregexm(x, "1+") 
. list x t3 t4 t5 t6

     +--------------------------------------------------------+
     |                    x          t3      t4      t5    t6 |
     |--------------------------------------------------------|
  1. | abcdefghiJKLMN 11123   abcdefghi   JKLMN   11123   111 |
     +--------------------------------------------------------+
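
The patterns above combine a character class with the quantifier +. As a small sketch of how other quantifiers behave on the same one-row data, the block below tries three more patterns; the variable names q1, q2, and q3 are made up for illustration.

```stata
* Sketch: quantifiers applied to character classes (q1-q3 are illustrative names)
clear
set obs 1
gen x = "abcdefghiJKLMN 11123"
* {2} asks for exactly two lowercase letters: the first two in the string
gen q1 = ustrregexs(0) if ustrregexm(x, "[a-z]{2}")
* [a-zA-Z]+ spans the lowercase and the uppercase run together
gen q2 = ustrregexs(0) if ustrregexm(x, "[a-zA-Z]+")
* a pattern without any match leaves the new variable empty
gen q3 = ustrregexs(0) if ustrregexm(x, "[xyz]+")
list q1 q2 q3
```

Here q1 should contain "ab", q2 should contain "abcdefghiJKLMN", and q3 should be empty because x contains none of x, y, or z.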

ustrregexra, the replace function:

. gen t7 = ustrregexra(x, "[def]", "")
. list x t7

     +------------------------------------------+
     |                    x                  t7 |
     |------------------------------------------|
  1. | abcdefghiJKLMN 11123   abcghiJKLMN 11123 |
     +------------------------------------------+

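The three letters "def" have disappeared; this is ustrregexra at work. Here ra stands for replace all, and [def] matches any one of the characters d, e, or f. The whole line tells Stata to find every occurrence of d, e, or f in x and replace it with an empty string.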

. gen t8 = ustrregexra(x, "[^def]", "")
. list x t8

     +----------------------------+
     |                    x    t8 |
     |----------------------------|
  1. | abcdefghiJKLMN 11123   def |
     +----------------------------+

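[^...] negates a character class: the line above replaces every character in x that is not d, e, or f with an empty string, so only "def" remains. ustrregexra also accepts an optional fourth argument; setting it to 1 makes the search case-insensitive.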

. gen t8_1 = ustrregexra("ABDEFGHIJ", "[def]", "X", 1)
. list x t8_1

     +----------------------------------+
     |                    x        t8_1 |
     |----------------------------------|
  1. | abcdefghiJKLMN 11123   ABXXXGHIJ |
     +----------------------------------+
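
Stata also provides ustrregexrf, which replaces only the first match. The following sketch contrasts the two functions on the example string from this section, using display so that no variables need to be created.

```stata
* Sketch: replace the first match only (rf) versus every match (ra)
* ustrregexrf removes just the first of d, e, f: abcefghiJKLMN 11123
display ustrregexrf("abcdefghiJKLMN 11123", "[def]", "")
* ustrregexra removes all three letters: abcghiJKLMN 11123
display ustrregexra("abcdefghiJKLMN 11123", "[def]", "")
* with + the whole run "def" is a single match, so rf removes it in one go
display ustrregexrf("abcdefghiJKLMN 11123", "[def]+", "")
```

The third line illustrates that what counts as "one match" depends on the quantifier, which matters whenever ustrregexrf is used.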

3. String functions

First, generate some strings:

. clear
. set obs 10
. gen x = ""
. replace x = "The quick brown fox jumps over the lazy dog." in 1
. replace x = "the sun is shining. The birds are singing." in 2
. replace x = "  Pi equals 3.14159265" in 3
. replace x = "TheRe arE 9 plANetS in THE solar SYstem.  eARth's mOOn is round" in 4
. replace x = "I LOVE Stata 16 . " in 5
. replace x = "Always correct the regressions for clustered standard errors." in 6
. replace x = "I get an error code r(997.55). What do i do next?" in 7
. replace x = "myname@coolmail.com, Tel: +43 444 5555" in 8
. replace x = " othername@dmail.net, Tel: +1 800 1337. " in 9
. replace x = "Firstname  Lastname  03-06-1990" in 10

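These strings contain e-mail addresses, telephone numbers, punctuation, words in all capitals, and stray spaces. Before turning to regular expressions, take a brief look at Stata's built-in string functions.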

. help string functions

The help file documents a rich set of string functions. Some of the most common ones are:

. gen temp1 = upper(x)
. gen temp2 = lower(x)
. gen temp3 = proper(x)
. gen temp4 = trim(x)
. gen temp5 = proper(trim(x))
. list in 1

     +----------------------------------------------+
  1. |                                            x |
     | The quick brown fox jumps over the lazy dog. |
     |----------------------------------------------|
     |                                        temp1 |
     | THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG. |
     |----------------------------------------------|
     |                                        temp2 |
     | the quick brown fox jumps over the lazy dog. |
     |----------------------------------------------|
     |                                        temp3 |
     | The Quick Brown Fox Jumps Over The Lazy Dog. |
     |----------------------------------------------|
     |                                        temp4 |
     | The quick brown fox jumps over the lazy dog. |
     |----------------------------------------------|
     |                                        temp5 |
     | The Quick Brown Fox Jumps Over The Lazy Dog. |
     +----------------------------------------------+

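upper converts all letters to uppercase, lower converts all letters to lowercase, and proper capitalizes the first letter of each word. These three functions are very handy for cleaning up messy English strings. trim removes leading and trailing blanks; ltrim removes only leading blanks, and rtrim removes only trailing blanks. Note that trim should be used with care on data imported from Excel files with special formatting.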
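
Row 1 of the data has no surrounding blanks, so the effect of trimming is not visible in the listing above. The sketch below applies the trimming functions to a padded string invented for illustration and wraps each result in brackets so that the remaining blanks can be seen. stritrim, which collapses internal blanks, is not used elsewhere in this article and is included only as a related function.

```stata
* Sketch: trimming functions on a padded string; brackets show the blanks left over
* leading blanks removed: [Pi equals 3.14  ]
display "[" ltrim("  Pi equals 3.14  ") "]"
* trailing blanks removed: [  Pi equals 3.14]
display "[" rtrim("  Pi equals 3.14  ") "]"
* both removed: [Pi equals 3.14]
display "[" trim("  Pi equals 3.14  ") "]"
* stritrim() collapses runs of internal blanks: [Firstname Lastname]
display "[" stritrim("Firstname  Lastname") "]"
```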

Another useful pair of functions is length and wordcount:

. gen diff = length(x) - length(temp4)
. gen count = wordcount(x)
. list x temp4 diff count in 1

     +-------------------------------------------------------------+
  1. |                                                   x         |
     |        The quick brown fox jumps over the lazy dog.         |
     |-------------------------------------------------------------|
     |                                        temp4 | diff | count |
     | The quick brown fox jumps over the lazy dog. |    0 |     9 |
     +-------------------------------------------------------------+

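length returns the length of a string. The command above compares the length of the original string with its length after trim; a value greater than 0 means that leading or trailing blanks were removed. wordcount counts the number of words separated by spaces.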
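
For row 1 the difference is 0 because there is nothing to trim. As a sketch, the same comparison on the text of row 3, which starts with two blanks, gives a nonzero result. The last two lines use the word "café", chosen here only as an example, to show that length counts bytes while ustrlen counts Unicode characters.

```stata
* Sketch: length difference for a string with two leading blanks (result: 2)
display length("  Pi equals 3.14159265") - length(trim("  Pi equals 3.14159265"))
* number of space-separated words (result: 3)
display wordcount("  Pi equals 3.14159265")
* length() counts bytes; the accented letter takes 2 bytes in UTF-8 (result: 5)
display length("café")
* ustrlen() counts Unicode characters (result: 4)
display ustrlen("café")
```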

Another commonly used string command in Stata is split:

. split x, parse(" ") gen(test)
variables created as string: 
test1   test3   test5   test7   test9   test11  test13
test2   test4   test6   test8   test10  test12

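The strings in x are split at each space, and new variables with the prefix test are created. A space is used as the separator here, but you can specify a different separator or several separators, for example "-".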
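
A minimal sketch of splitting on "-" is shown below. It uses a one-row dataset holding the date string from row 10; the variable name d and the prefix part are chosen only for illustration.

```stata
* Sketch: split a date-like string at each hyphen (d and part are illustrative names)
clear
set obs 1
gen d = "03-06-1990"
* creates part1, part2, and part3 holding "03", "06", and "1990"
split d, parse("-") gen(part)
list
```

The new variables are strings; they would still need to be converted, for example with destring, before being used as numbers.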

3. General text searches

Back to regular expressions, and on to more advanced features!

. cap drop t* // drop the variables generated earlier
. gen t1 = ustrregexs(0) if ustrregexm(x, "\w+")
. list x t1

     +-----------------------------------------------------------------------------+
     |                                                               x          t1 |
     |-----------------------------------------------------------------------------|
  1. |                    The quick brown fox jumps over the lazy dog.         The |
  2. |                      the sun is shining. The birds are singing.         the |
  3. |                                            Pi equals 3.14159265          Pi |
  4. | TheRe arE 9 plANetS in THE solar SYstem.  eARth's mOOn is round       TheRe |
  5. |                                              I LOVE Stata 16 .            I |
     |-----------------------------------------------------------------------------|
  6. |   Always correct the regressions for clustered standard errors.      Always |
  7. |               I get an error code r(997.55). What do i do next?           I |
  8. |                          myname@coolmail.com, Tel: +43 444 5555      myname |
  9. |                         othername@dmail.net, Tel: +1 800 1337.    othername |
 10. |                                 Firstname  Lastname  03-06-1990   Firstname |
     +-----------------------------------------------------------------------------+

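\w matches any single letter, digit, underscore, or Chinese character; + means one or more; so \w+ matches a run of consecutive letters, digits, underscores, or Chinese characters. Because only the first match is returned, the command above extracts just the first word of each string. To match a run of characters that are not letters, digits, underscores, or Chinese characters, use \W+ (uppercase).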

. gen t2 = ustrregexs(0) if ustrregexm(x, "\W+")
. list x t2

     +----------------------------------------------------------------------+
     |                                                               x   t2 |
     |----------------------------------------------------------------------|
  1. |                    The quick brown fox jumps over the lazy dog.      |
  2. |                      the sun is shining. The birds are singing.      |
  3. |                                            Pi equals 3.14159265      |
  4. | TheRe arE 9 plANetS in THE solar SYstem.  eARth's mOOn is round      |
  5. |                                              I LOVE Stata 16 .       |
     |----------------------------------------------------------------------|
  6. |   Always correct the regressions for clustered standard errors.      |
  7. |               I get an error code r(997.55). What do i do next?      |
  8. |                          myname@coolmail.com, Tel: +43 444 5555    @ |
  9. |                         othername@dmail.net, Tel: +1 800 1337.       |
 10. |                                 Firstname  Lastname  03-06-1990      |
     +----------------------------------------------------------------------+

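At first glance only row 8 of t2 seems to have extracted anything, namely the @ sign; in the other rows the extracted text is whitespace. To match a run of digits, use \d+: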

. gen t3 = ustrregexs(0) if ustrregexm(x, "\d+")
. list x t3

     +-----------------------------------------------------------------------+
     |                                                               x    t3 |
     |-----------------------------------------------------------------------|
  1. |                    The quick brown fox jumps over the lazy dog.       |
  2. |                      the sun is shining. The birds are singing.       |
  3. |                                            Pi equals 3.14159265     3 |
  4. | TheRe arE 9 plANetS in THE solar SYstem.  eARth's mOOn is round     9 |
  5. |                                              I LOVE Stata 16 .     16 |
     |-----------------------------------------------------------------------|
  6. |   Always correct the regressions for clustered standard errors.       |
  7. |               I get an error code r(997.55). What do i do next?   997 |
  8. |                          myname@coolmail.com, Tel: +43 444 5555    43 |
  9. |                         othername@dmail.net, Tel: +1 800 1337.      1 |
 10. |                                 Firstname  Lastname  03-06-1990    03 |
     +-----------------------------------------------------------------------+

Likewise, \D+ matches a run of non-digits:

. gen t4 = ustrregexs(0) if ustrregexm(x, "\D+")
. list x t4 in 1/4

     +----------------------------------------------------------------------------------------------------------------+
     |                                                               x                                             t4 |
     |----------------------------------------------------------------------------------------------------------------|
  1. |                    The quick brown fox jumps over the lazy dog.   The quick brown fox jumps over the lazy dog. |
  2. |                      the sun is shining. The birds are singing.     the sun is shining. The birds are singing. |
  3. |                                            Pi equals 3.14159265                                     Pi equals  |
  4. | TheRe arE 9 plANetS in THE solar SYstem.  eARth's mOOn is round                                     TheRe arE  |
     +----------------------------------------------------------------------------------------------------------------+

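The command above extracts the first run of non-digit characters, which in these rows is everything up to the first digit. Some core elements of regular expressions are listed below: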

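[jkl]      j, k, or l
[^jkl]     any character other than j, k, or l
[j|l]      j, |, or l (inside brackets | is a literal character; [jl] matches only j or l)
[a-z]      any lowercase letter
[a-zA-Z]   any lowercase or uppercase letter
[0-9]      any digit
\d         any digit                                              \D   any non-digit
\w         any letter, digit, underscore, or Chinese character    \W   any character not matched by \w
\s         any whitespace character                               \S   any non-whitespace character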
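
One way to see what a class covers is to delete everything that belongs to it and look at what is left. The sketch below does this with ustrregexra on a test string invented for illustration, and then checks how | behaves inside square brackets.

```stata
* Sketch: delete one class at a time from an invented test string
* remove all non-digits, leaving: 99755
display ustrregexra("r(997.55) is_an error", "\D", "")
* remove all non-word characters, leaving: r99755is_anerror
display ustrregexra("r(997.55) is_an error", "\W", "")
* remove all whitespace, leaving: r(997.55)is_anerror
display ustrregexra("r(997.55) is_an error", "\s", "")
* inside brackets | is literal, so [j|l] matches a lone | (result: 1)
display ustrregexm("|", "[j|l]")
* without the | in the class there is no match (result: 0)
display ustrregexm("|", "[jl]")
```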

Now for a more complex search: try to find every "The" in the text. A first attempt:

. cap drop t*
. gen t1  = ustrregexm(x,"The")
. gen t2 = ustrregexs(0) if ustrregexm(x,"The")
. list x t1 t2

     +----------------------------------------------------------------------------+
     |                                                               x   t1    t2 |
     |----------------------------------------------------------------------------|
  1. |                    The quick brown fox jumps over the lazy dog.    1   The |
  2. |                      the sun is shining. The birds are singing.    1   The |
  3. |                                            Pi equals 3.14159265    0       |
  4. | TheRe arE 9 plANetS in THE solar SYstem.  eARth's mOOn is round    1   The |
  5. |                                              I LOVE Stata 16 .     0       |
     |----------------------------------------------------------------------------|
  6. |   Always correct the regressions for clustered standard errors.    0       |
  7. |               I get an error code r(997.55). What do i do next?    0       |
  8. |                          myname@coolmail.com, Tel: +43 444 5555    0       |
  9. |                         othername@dmail.net, Tel: +1 800 1337.     0       |
 10. |                                 Firstname  Lastname  03-06-1990    0       |
     +----------------------------------------------------------------------------+

t1 correctly flags the strings that contain "The", but the first word in row 4 of x is "TheRe", which is not the word we want to extract.

. gen t3 = ustrregexs(0) if ustrregexm(x,"[t|T]he")
. list x t1 t2 t3

     +----------------------------------------------------------------------------------+
     |                                                               x   t1    t2    t3 |
     |----------------------------------------------------------------------------------|
  1. |                    The quick brown fox jumps over the lazy dog.    1   The   The |
  2. |                      the sun is shining. The birds are singing.    1   The   the |
  3. |                                            Pi equals 3.14159265    0             |
  4. | TheRe arE 9 plANetS in THE solar SYstem.  eARth's mOOn is round    1   The   The |
  5. |                                              I LOVE Stata 16 .     0             |
     |----------------------------------------------------------------------------------|
  6. |   Always correct the regressions for clustered standard errors.    0         the |
  7. |               I get an error code r(997.55). What do i do next?    0             |
  8. |                          myname@coolmail.com, Tel: +43 444 5555    0             |
  9. |                         othername@dmail.net, Tel: +1 800 1337.     0         the |
 10. |                                 Firstname  Lastname  03-06-1990    0             |
     +----------------------------------------------------------------------------------+

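The line above searches for "the" or "The" and therefore picks up the first word "the" in row 2. Strictly speaking, | is a literal character inside square brackets, so [t|T] would also match a |; [tT] is the tighter pattern and gives the same results on these data. Adding ^ anchors the match to the start of the string. The following line extracts a "the" or "The" located at the beginning of the string.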

. gen t4 = ustrregexs(0) if ustrregexm(x, "^[t|T]he")
. list x t1 t2 t3 t4

     +----------------------------------------------------------------------------------------+
     |                                                               x   t1    t2    t3    t4 |
     |----------------------------------------------------------------------------------------|
  1. |                    The quick brown fox jumps over the lazy dog.    1   The   The   The |
  2. |                      the sun is shining. The birds are singing.    1   The   the   the |
  3. |                                            Pi equals 3.14159265    0                   |
  4. | TheRe arE 9 plANetS in THE solar SYstem.  eARth's mOOn is round    1   The   The   The |
  5. |                                              I LOVE Stata 16 .     0                   |
     |----------------------------------------------------------------------------------------|
  6. |   Always correct the regressions for clustered standard errors.    0         the       |
  7. |               I get an error code r(997.55). What do i do next?    0                   |
  8. |                          myname@coolmail.com, Tel: +43 444 5555    0                   |
  9. |                         othername@dmail.net, Tel: +1 800 1337.     0         the       |
 10. |                                 Firstname  Lastname  03-06-1990    0                   |
     +----------------------------------------------------------------------------------------+

In row 2 of x, t1 and t2 match the "The" of the second sentence, whereas t3 and t4 match the "the" of the first sentence. One more refinement:

. gen t5 = ustrregexs(0) if ustrregexm(x, "^[t|T]he\s")
. list x t1 t2 t3 t4 t5

     +-----------------------------------------------------------------------------------------------+
     |                                                               x   t1    t2    t3    t4     t5 |
     |-----------------------------------------------------------------------------------------------|
  1. |                    The quick brown fox jumps over the lazy dog.    1   The   The   The   The  |
  2. |                      the sun is shining. The birds are singing.    1   The   the   the   the  |
  3. |                                            Pi equals 3.14159265    0                          |
  4. | TheRe arE 9 plANetS in THE solar SYstem.  eARth's mOOn is round    1   The   The   The        |
  5. |                                              I LOVE Stata 16 .     0                          |
     |-----------------------------------------------------------------------------------------------|
  6. |   Always correct the regressions for clustered standard errors.    0         the              |
  7. |               I get an error code r(997.55). What do i do next?    0                          |
  8. |                          myname@coolmail.com, Tel: +43 444 5555    0                          |
  9. |                         othername@dmail.net, Tel: +1 800 1337.     0         the              |
 10. |                                 Firstname  Lastname  03-06-1990    0                          |
     +-----------------------------------------------------------------------------------------------+

Because the pattern now also requires a whitespace character after the word, "TheRe" is no longer matched.

. gen t6 = ustrregexs(0) if ustrregexm(x, "(^[t|T]\w+)") 
. gen t7 = ustrregexs(0) if ustrregexm(x, "(^[t|T]\w{2}\s)")
. list x t1 t2 t3 t4 t5 t6 t7

     +--------------------------------------------------------------------------------------------------------------+
     |                                                               x   t1    t2    t3    t4     t5      t6     t7 |
     |--------------------------------------------------------------------------------------------------------------|
  1. |                    The quick brown fox jumps over the lazy dog.    1   The   The   The   The      The   The  |
  2. |                      the sun is shining. The birds are singing.    1   The   the   the   the      the   the  |
  3. |                                            Pi equals 3.14159265    0                                         |
  4. | TheRe arE 9 plANetS in THE solar SYstem.  eARth's mOOn is round    1   The   The   The          TheRe        |
  5. |                                              I LOVE Stata 16 .     0                                         |
     |--------------------------------------------------------------------------------------------------------------|
  6. |   Always correct the regressions for clustered standard errors.    0         the                             |
  7. |               I get an error code r(997.55). What do i do next?    0                                         |
  8. |                          myname@coolmail.com, Tel: +43 444 5555    0                                         |
  9. |                         othername@dmail.net, Tel: +1 800 1337.     0         the                             |
 10. |                                 Firstname  Lastname  03-06-1990    0                                         |
     +--------------------------------------------------------------------------------------------------------------+

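t6 matches a run of word characters at the start of the string that begins with t or T. t7 matches t or T at the start of the string followed by exactly two word characters and one whitespace character. Notice how the regular expression becomes progressively more general across this example; by refining it step by step, we arrive at the string we wanted.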
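
An alternative to listing both cases of the first letter is the optional case-insensitivity argument, which ustrregexm accepts in the same way as ustrregexra. The sketch below tests the pattern ^the\s on the opening words of rows 1, 2, and 4, typed in as literal strings.

```stata
* Sketch: case-insensitive matching via the optional third argument of ustrregexm
* lowercase "the" at the start of the string (result: 1)
display ustrregexm("the sun is shining.", "^the\s", 1)
* capitalized "The" also matches when the third argument is 1 (result: 1)
display ustrregexm("The quick brown fox", "^the\s", 1)
* "TheRe" is not followed by whitespace after "The" (result: 0)
display ustrregexm("TheRe arE 9 plANetS", "^the\s", 1)
* without the argument the match is case-sensitive (result: 0)
display ustrregexm("The quick brown fox", "^the\s")
```

This keeps the pattern itself short, at the cost of also accepting spellings such as "THE" or "tHe".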

Next, return to the original data and compare the following commands:

. cap drop t*
. gen t1 = ustrregexs(0) if ustrregexm(x, ".*") 
. list x t1 in 1

     +---------------------------------------------------------------------------------------------+
     |                                            x                                             t1 |
     |---------------------------------------------------------------------------------------------|
  1. | The quick brown fox jumps over the lazy dog.   The quick brown fox jumps over the lazy dog. |
     +---------------------------------------------------------------------------------------------+

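t1 is identical to x: .* matches any number of arbitrary characters. Here * is a quantifier meaning zero or more, and . stands for any single character. Remember the combination .*, which is used very often.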

. gen t2 = ustrregexs(0) if ustrregexm(x, ".*\.") 
. list x t2 in 1/3

     +---------------------------------------------------------------------------------------------+
     |                                            x                                             t2 |
     |---------------------------------------------------------------------------------------------|
  1. | The quick brown fox jumps over the lazy dog.   The quick brown fox jumps over the lazy dog. |
  2. |   the sun is shining. The birds are singing.     the sun is shining. The birds are singing. |
  3. |                         Pi equals 3.14159265                                   Pi equals 3. |
     +---------------------------------------------------------------------------------------------+

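Because .* is greedy, this line extracts everything up to and including the last . in the string. \ is the escape character: since . normally stands for any character, matching an actual period requires the escape to restore the character's literal meaning. For example, \. means a literal . and \* means a literal *.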
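
The quantifier * takes as many characters as it can. Adding ? after it makes it lazy, so that it stops at the earliest point where the rest of the pattern can match. The sketch below compares the two forms on the text of row 2; the variable names greedy and lazy are made up for illustration.

```stata
* Sketch: greedy .* versus lazy .*? (variable names are illustrative)
clear
set obs 1
gen x = "the sun is shining. The birds are singing."
* greedy: runs to the last period, so the whole string is returned
gen greedy = ustrregexs(0) if ustrregexm(x, ".*\.")
* lazy: stops at the first period, returning only the first sentence
gen lazy = ustrregexs(0) if ustrregexm(x, ".*?\.")
list greedy lazy
```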

. gen t3 = ustrregexs(0) if ustrregexm(x, ".*\.$")
. list x t3 in 1/3

     +---------------------------------------------------------------------------------------------+
     |                                            x                                             t3 |
     |---------------------------------------------------------------------------------------------|
  1. | The quick brown fox jumps over the lazy dog.   The quick brown fox jumps over the lazy dog. |
  2. |   the sun is shining. The birds are singing.     the sun is shining. The birds are singing. |
  3. |                         Pi equals 3.14159265                                                |
     +---------------------------------------------------------------------------------------------+

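This line extracts a . at the very end of the string together with everything before it, that is, strings that end as a complete sentence. $ is the anchor for the end of the string, just as ^ is the anchor for the start.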

. gen t4 = ustrregexs(0) if ustrregexm(x, ".*\.\s")
. list x t4

     +-------------------------------------------------------------------------------------------------------------+
     |                                                               x                                          t4 |
     |-------------------------------------------------------------------------------------------------------------|
  1. |                    The quick brown fox jumps over the lazy dog.                                             |
  2. |                      the sun is shining. The birds are singing.                        the sun is shining.  |
  3. |                                            Pi equals 3.14159265                                             |
  4. | TheRe arE 9 plANetS in THE solar SYstem.  eARth's mOOn is round   TheRe arE 9 plANetS in THE solar SYstem.  |
  5. |                                              I LOVE Stata 16 .                           I LOVE Stata 16 .  |
     |-------------------------------------------------------------------------------------------------------------|
  6. |   Always correct the regressions for clustered standard errors.                                             |
  7. |               I get an error code r(997.55). What do i do next?             I get an error code r(997.55).  |
  8. |                          myname@coolmail.com, Tel: +43 444 5555                                             |
  9. |                         othername@dmail.net, Tel: +1 800 1337.      othername@dmail.net, Tel: +1 800 1337.  |
 10. |                                 Firstname  Lastname  03-06-1990                                             |
     +-------------------------------------------------------------------------------------------------------------+

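This line adds a further requirement: the . must be followed by a whitespace character. This can be used to locate sentences within a paragraph.

Everything so far has dealt with individual characters. Next come sub-expressions (groups of characters).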

. cap drop t*
. gen t5_0 = ustrregexs(0) if ustrregexm(x, "(.*\.)(.*\.)")  
. gen t5_1 = ustrregexs(1) if ustrregexm(x, "(.*\.)(.*\.)")  
. gen t5_2 = ustrregexs(2) if ustrregexm(x, "(.*\.)(.*\.)")
. list x t5_0

     +--------------------------------------------------------------------------------------------------------------+
     |                                                               x                                         t5_0 |
     |--------------------------------------------------------------------------------------------------------------|
  1. |                    The quick brown fox jumps over the lazy dog.                                              |
  2. |                      the sun is shining. The birds are singing.   the sun is shining. The birds are singing. |
  3. |                                            Pi equals 3.14159265                                              |
  4. | TheRe arE 9 plANetS in THE solar SYstem.  eARth's mOOn is round                                              |
  5. |                                              I LOVE Stata 16 .                                               |
     |--------------------------------------------------------------------------------------------------------------|
  6. |   Always correct the regressions for clustered standard errors.                                              |
  7. |               I get an error code r(997.55). What do i do next?               I get an error code r(997.55). |
  8. |                          myname@coolmail.com, Tel: +43 444 5555                                              |
  9. |                         othername@dmail.net, Tel: +1 800 1337.        othername@dmail.net, Tel: +1 800 1337. |
 10. |                                 Firstname  Lastname  03-06-1990                                              |
     +--------------------------------------------------------------------------------------------------------------+

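Here two sub-expressions are defined by two pairs of (). The first () matches text ending in a . and the second () does the same. t5_0 matches two sentences and extracts both, t5_1 matches two sentences and extracts the first, and t5_2 matches two sentences and extracts the second. If the sentences are separated by a space, (.*\.)\s?(.*\.)\s? can be used instead, where \s? means zero or one whitespace character.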
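
Sub-expressions are also convenient for pulling apart structured fields. As a sketch, the block below splits the date in row 10 into its three components with three pairs of parentheses; the variable names p1, p2, and p3 are chosen only for illustration.

```stata
* Sketch: one sub-expression per date component (p1-p3 are illustrative names)
clear
set obs 1
gen x = "Firstname  Lastname  03-06-1990"
* ustrregexs(1) returns the first pair of parentheses: "03"
gen p1 = ustrregexs(1) if ustrregexm(x, "(\d{2})-(\d{2})-(\d{4})")
* ustrregexs(2) returns the second pair: "06"
gen p2 = ustrregexs(2) if ustrregexm(x, "(\d{2})-(\d{2})-(\d{4})")
* ustrregexs(3) returns the third pair: "1990"
gen p3 = ustrregexs(3) if ustrregexm(x, "(\d{2})-(\d{2})-(\d{4})")
list p1 p2 p3
```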

The replace function ustrregexra is very powerful. For example:

. gen t6 = ustrregexra(x,"[^a-zA-Z]","")
. gen t7 = ustrregexra(x,"\W","")
. list t6 t7

     +---------------------------------------------------------------------------------------------------------------+
     |                                                    t6                                                      t7 |
     |---------------------------------------------------------------------------------------------------------------|
  1. |                   Thequickbrownfoxjumpsoverthelazydog                     Thequickbrownfoxjumpsoverthelazydog |
  2. |                     thesunisshiningThebirdsaresinging                       thesunisshiningThebirdsaresinging |
  3. |                                              Piequals                                       Piequals314159265 |
  4. |      TheRearEplANetSinTHEsolarSYstemeARthsmOOnisround       TheRearE9plANetSinTHEsolarSYstemeARthsmOOnisround |
  5. |                                            ILOVEStata                                            ILOVEStata16 |
     |---------------------------------------------------------------------------------------------------------------|
  6. | Alwayscorrecttheregressionsforclusteredstandarderrors   Alwayscorrecttheregressionsforclusteredstandarderrors |
  7. |                         IgetanerrorcoderWhatdoidonext                      Igetanerrorcoder99755Whatdoidonext |
  8. |                                  mynamecoolmailcomTel                           mynamecoolmailcomTel434445555 |
  9. |                                  othernamedmailnetTel                            othernamedmailnetTel18001337 |
 10. |                                     FirstnameLastname                               FirstnameLastname03061990 |
     +---------------------------------------------------------------------------------------------------------------+

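As the output shows, t6 keeps only letters, whereas t7 keeps letters and digits; note the difference. Another useful metacharacter is \b, which matches a word boundary and so limits where a match may start or end. For example, to search for e-mail addresses we can confine the match to the form xxx@yyy.zzz.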

. cap drop t*
. gen t1 = ustrregexs(0) if ustrregexm(x, "\b([a-zA-Z]+[_|\-|\.]?[a-zA-Z0-9]+@[a-zA-Z]+\.[com|net]+)\b")
. list x t1

     +---------------------------------------------------------------------------------------+
     |                                                               x                    t1 |
     |---------------------------------------------------------------------------------------|
  1. |                    The quick brown fox jumps over the lazy dog.                       |
  2. |                      the sun is shining. The birds are singing.                       |
  3. |                                            Pi equals 3.14159265                       |
  4. | TheRe arE 9 plANetS in THE solar SYstem.  eARth's mOOn is round                       |
  5. |                                              I LOVE Stata 16 .                        |
     |---------------------------------------------------------------------------------------|
  6. |   Always correct the regressions for clustered standard errors.                       |
  7. |               I get an error code r(997.55). What do i do next?                       |
  8. |                          myname@coolmail.com, Tel: +43 444 5555   myname@coolmail.com |
  9. |                         othername@dmail.net, Tel: +1 800 1337.    othername@dmail.net |
 10. |                                 Firstname  Lastname  03-06-1990                       |
     +---------------------------------------------------------------------------------------+

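The e-mail addresses are extracted as intended. \b marks the boundaries; [a-zA-Z]+ matches one or more letters; [_|\-|\.]? matches an optional special character, with ? meaning zero or one; [a-zA-Z0-9]+ matches one or more letters or digits; and \.[com|net]+ matches .com, .net, and similar endings. Note that [com|net] is a character class, so it actually accepts any run of the characters c, o, m, |, n, e, and t rather than only the two words. The last part can also be simplified to \w{3}, which matches three consecutive word characters. Writing a regular expression is a process of continual revision: start from a basic expression and keep adding conditions to cover more cases.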
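
If only the endings .com and .net should be accepted, the bracketed class can be swapped for an alternation inside parentheses. The sketch below is one possible tightening, not the pattern used in the source; the address someone@site.moc is invented to show the difference.

```stata
* Sketch: character class [com|net]+ versus alternation (com|net)
* the class accepts any mix of its letters, so ".moc" passes (result: 1)
display ustrregexm("someone@site.moc", "\b[a-zA-Z]+[_.\-]?[a-zA-Z0-9]+@[a-zA-Z]+\.[com|net]+\b")
* the alternation accepts only "com" or "net", so ".moc" fails (result: 0)
display ustrregexm("someone@site.moc", "\b[a-zA-Z]+[_.\-]?[a-zA-Z0-9]+@[a-zA-Z]+\.(com|net)\b")
* a real ending still matches (result: 1)
display ustrregexm("myname@coolmail.com, Tel: +43 444 5555", "\b[a-zA-Z]+[_.\-]?[a-zA-Z0-9]+@[a-zA-Z]+\.(com|net)\b")
```

The alternation is stricter but has to be extended by hand for every further ending, so which form is preferable depends on the data.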

4. General number searches

. cap drop t*
. gen t1 = ustrregexs(0) if ustrregexm(x, "[0-9]")  
. gen t2 = ustrregexs(0) if ustrregexm(x, "[0-9]+")  
. gen t3 = ustrregexs(0) if ustrregexm(x, "[0-9][0-9][0-9]") 
. list x t*

     +----------------------------------------------------------------------------------+
     |                                                               x   t1    t2    t3 |
     |----------------------------------------------------------------------------------|
  1. |                    The quick brown fox jumps over the lazy dog.                  |
  2. |                      the sun is shining. The birds are singing.                  |
  3. |                                            Pi equals 3.14159265    3     3   141 |
  4. | TheRe arE 9 plANetS in THE solar SYstem.  eARth's mOOn is round    9     9       |
  5. |                                              I LOVE Stata 16 .     1    16       |
     |----------------------------------------------------------------------------------|
  6. |   Always correct the regressions for clustered standard errors.                  |
  7. |               I get an error code r(997.55). What do i do next?    9   997   997 |
  8. |                          myname@coolmail.com, Tel: +43 444 5555    4    43   444 |
  9. |                         othername@dmail.net, Tel: +1 800 1337.     1     1   800 |
 10. |                                 Firstname  Lastname  03-06-1990    0    03   199 |
     +----------------------------------------------------------------------------------+

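t1 extracts the first digit in x, t2 the first run of consecutive digits, and t3 the first run of three consecutive digits. The shorthand \d combined with a {n} quantifier is more concise: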

. gen t4 = ustrregexs(0) if ustrregexm(x, "\d{4}")
. list x t4

     +------------------------------------------------------------------------+
     |                                                               x     t4 |
     |------------------------------------------------------------------------|
  1. |                    The quick brown fox jumps over the lazy dog.        |
  2. |                      the sun is shining. The birds are singing.        |
  3. |                                            Pi equals 3.14159265   1415 |
  4. | TheRe arE 9 plANetS in THE solar SYstem.  eARth's mOOn is round        |
  5. |                                              I LOVE Stata 16 .         |
     |------------------------------------------------------------------------|
  6. |   Always correct the regressions for clustered standard errors.        |
  7. |               I get an error code r(997.55). What do i do next?        |
  8. |                          myname@coolmail.com, Tel: +43 444 5555   5555 |
  9. |                         othername@dmail.net, Tel: +1 800 1337.    1337 |
 10. |                                 Firstname  Lastname  03-06-1990   1990 |
     +------------------------------------------------------------------------+

\d{4} matches four consecutive digits. A range can be given with \d{a,b}; for example, \d{2,4} matches a run of 2 to 4 digits.
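
As a minimal sketch of the range quantifier (not from the original post; the three-row dataset and the variable name d24 are assumptions made for illustration), \d{2,4} can be tried on strings borrowed from the example data:

```stata
* Hypothetical three-row dataset reusing strings from the example data
clear
set obs 3
gen s = ""
replace s = "Pi equals 3.14159265" in 1
replace s = "Tel: +43 444 5555"    in 2
replace s = "03-06-1990"           in 3

* \d{2,4}: the first run of at least 2 digits, taking at most 4 of them
gen d24 = ustrregexs(0) if ustrregexm(s, "\d{2,4}")
list
```

Row 1 returns 1415 rather than 3, because the single digit before the decimal point is too short to satisfy the lower bound; rows 2 and 3 return 43 and 03, since only two digits are available before the separator.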

. gen t5 = ustrregexs(0) if ustrregexm(x, "\d+-\d+")
. list x t5

     +-------------------------------------------------------------------------+
     |                                                               x      t5 |
     |-------------------------------------------------------------------------|
  1. |                    The quick brown fox jumps over the lazy dog.         |
  2. |                      the sun is shining. The birds are singing.         |
  3. |                                            Pi equals 3.14159265         |
  4. | TheRe arE 9 plANetS in THE solar SYstem.  eARth's mOOn is round         |
  5. |                                              I LOVE Stata 16 .          |
     |-------------------------------------------------------------------------|
  6. |   Always correct the regressions for clustered standard errors.         |
  7. |               I get an error code r(997.55). What do i do next?         |
  8. |                          myname@coolmail.com, Tel: +43 444 5555         |
  9. |                         othername@dmail.net, Tel: +1 800 1337.          |
 10. |                                 Firstname  Lastname  03-06-1990   03-06 |
     +-------------------------------------------------------------------------+

t5 matches digits joined by a hyphen (-).

. gen t6 = ustrregexs(0) if ustrregexm(x, "\+\d+")
. list x t6

     +-----------------------------------------------------------------------+
     |                                                               x    t6 |
     |-----------------------------------------------------------------------|
  1. |                    The quick brown fox jumps over the lazy dog.       |
  2. |                      the sun is shining. The birds are singing.       |
  3. |                                            Pi equals 3.14159265       |
  4. | TheRe arE 9 plANetS in THE solar SYstem.  eARth's mOOn is round       |
  5. |                                              I LOVE Stata 16 .        |
     |-----------------------------------------------------------------------|
  6. |   Always correct the regressions for clustered standard errors.       |
  7. |               I get an error code r(997.55). What do i do next?       |
  8. |                          myname@coolmail.com, Tel: +43 444 5555   +43 |
  9. |                         othername@dmail.net, Tel: +1 800 1337.     +1 |
 10. |                                 Firstname  Lastname  03-06-1990       |
     +-----------------------------------------------------------------------+

t6 matches numbers that start with +. Because + is itself a quantifier, it must be escaped as \+ to be matched literally.

. gen t7 = ustrregexs(0) if ustrregexm(x, "\+?\d+")
. list x t7

     +-----------------------------------------------------------------------+
     |                                                               x    t7 |
     |-----------------------------------------------------------------------|
  1. |                    The quick brown fox jumps over the lazy dog.       |
  2. |                      the sun is shining. The birds are singing.       |
  3. |                                            Pi equals 3.14159265     3 |
  4. | TheRe arE 9 plANetS in THE solar SYstem.  eARth's mOOn is round     9 |
  5. |                                              I LOVE Stata 16 .     16 |
     |-----------------------------------------------------------------------|
  6. |   Always correct the regressions for clustered standard errors.       |
  7. |               I get an error code r(997.55). What do i do next?   997 |
  8. |                          myname@coolmail.com, Tel: +43 444 5555   +43 |
  9. |                         othername@dmail.net, Tel: +1 800 1337.     +1 |
 10. |                                 Firstname  Lastname  03-06-1990    03 |
     +-----------------------------------------------------------------------+

A number may or may not be preceded by +; adding ? after \+ makes the sign optional and the expression more flexible.

. gen t8 = ustrregexs(0) if ustrregexm(x, "(\d*\.\d+)")
. list x t8

     +------------------------------------------------------------------------------+
     |                                                               x           t8 |
     |------------------------------------------------------------------------------|
  1. |                    The quick brown fox jumps over the lazy dog.              |
  2. |                      the sun is shining. The birds are singing.              |
  3. |                                            Pi equals 3.14159265   3.14159265 |
  4. | TheRe arE 9 plANetS in THE solar SYstem.  eARth's mOOn is round              |
  5. |                                              I LOVE Stata 16 .               |
     |------------------------------------------------------------------------------|
  6. |   Always correct the regressions for clustered standard errors.              |
  7. |               I get an error code r(997.55). What do i do next?       997.55 |
  8. |                          myname@coolmail.com, Tel: +43 444 5555              |
  9. |                         othername@dmail.net, Tel: +1 800 1337.               |
 10. |                                 Firstname  Lastname  03-06-1990              |
     +------------------------------------------------------------------------------+

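t8 finds numbers that contain a decimal point. \d* means zero or more digits, because 0.5 is sometimes written as .5; this expression is commonly used to match decimals. To find all the numbers in the raw data, we now make a series of attempts, starting with: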
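
The following sketch (not from the original post; the strings and the variable name dec are invented for illustration) shows how the decimal pattern treats a number without a leading zero and a plain integer:

```stata
* Hypothetical test strings
clear
set obs 3
gen s = ""
replace s = "weight = .5 kg"  in 1
replace s = "weight = 0.5 kg" in 2
replace s = "weight = 5 kg."  in 3

* \d* allows the integer part to be absent; \.\d+ requires a decimal part
gen dec = ustrregexs(0) if ustrregexm(s, "\d*\.\d+")
list
```

Rows 1 and 2 return .5 and 0.5. Row 3 stays empty: the integer 5 has no decimal part, and the full stop at the end of the sentence is not followed by a digit.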

. cap drop t*
. gen t1 = ustrregexs(0) if ustrregexm(x, "\d+")
. list x t*

     +-----------------------------------------------------------------------+
     |                                                               x    t1 |
     |-----------------------------------------------------------------------|
  1. |                    The quick brown fox jumps over the lazy dog.       |
  2. |                      the sun is shining. The birds are singing.       |
  3. |                                            Pi equals 3.14159265     3 |
  4. | TheRe arE 9 plANetS in THE solar SYstem.  eARth's mOOn is round     9 |
  5. |                                              I LOVE Stata 16 .     16 |
     |-----------------------------------------------------------------------|
  6. |   Always correct the regressions for clustered standard errors.       |
  7. |               I get an error code r(997.55). What do i do next?   997 |
  8. |                          myname@coolmail.com, Tel: +43 444 5555    43 |
  9. |                         othername@dmail.net, Tel: +1 800 1337.      1 |
 10. |                                 Firstname  Lastname  03-06-1990    03 |
     +-----------------------------------------------------------------------+

t1 finds only the first run of digits, which falls far short of what we need.

. gen t2 = ustrregexs(0) if ustrregexm(x, "\+?\d+")
. list x t*

     +-----------------------------------------------------------------------------+
     |                                                               x    t1    t2 |
     |-----------------------------------------------------------------------------|
  1. |                    The quick brown fox jumps over the lazy dog.             |
  2. |                      the sun is shining. The birds are singing.             |
  3. |                                            Pi equals 3.14159265     3     3 |
  4. | TheRe arE 9 plANetS in THE solar SYstem.  eARth's mOOn is round     9     9 |
  5. |                                              I LOVE Stata 16 .     16    16 |
     |-----------------------------------------------------------------------------|
  6. |   Always correct the regressions for clustered standard errors.             |
  7. |               I get an error code r(997.55). What do i do next?   997   997 |
  8. |                          myname@coolmail.com, Tel: +43 444 5555    43   +43 |
  9. |                         othername@dmail.net, Tel: +1 800 1337.      1    +1 |
 10. |                                 Firstname  Lastname  03-06-1990    03    03 |
     +-----------------------------------------------------------------------------+

t2 now includes the leading +. Next we add more elements:

. gen t3 = ustrregexs(0) if ustrregexm(x, "\+?\d+([\s|\.|-]?)")
. list x t*

     +------------------------------------------------------------------------------------+
     |                                                               x    t1    t2     t3 |
     |------------------------------------------------------------------------------------|
  1. |                    The quick brown fox jumps over the lazy dog.                    |
  2. |                      the sun is shining. The birds are singing.                    |
  3. |                                            Pi equals 3.14159265     3     3     3. |
  4. | TheRe arE 9 plANetS in THE solar SYstem.  eARth's mOOn is round     9     9     9  |
  5. |                                              I LOVE Stata 16 .     16    16    16  |
     |------------------------------------------------------------------------------------|
  6. |   Always correct the regressions for clustered standard errors.                    |
  7. |               I get an error code r(997.55). What do i do next?   997   997   997. |
  8. |                          myname@coolmail.com, Tel: +43 444 5555    43   +43   +43  |
  9. |                         othername@dmail.net, Tel: +1 800 1337.      1    +1    +1  |
 10. |                                 Firstname  Lastname  03-06-1990    03    03    03- |
     +------------------------------------------------------------------------------------+

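t3 also picks up one following whitespace character, . or -. (Inside the square brackets the | symbols are literal characters rather than "or", so they are not needed to separate the alternatives.) Next, we continue matching the remaining digits: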

. gen t4 = ustrregexs(0) if ustrregexm(x, "\+?\d+([\s|\.|-]?)(\d+)?")
. list x t*

     +-------------------------------------------------------------------------------------------------+
     |                                                               x    t1    t2     t3           t4 |
     |-------------------------------------------------------------------------------------------------|
  1. |                    The quick brown fox jumps over the lazy dog.                                 |
  2. |                      the sun is shining. The birds are singing.                                 |
  3. |                                            Pi equals 3.14159265     3     3     3.   3.14159265 |
  4. | TheRe arE 9 plANetS in THE solar SYstem.  eARth's mOOn is round     9     9     9            9  |
  5. |                                              I LOVE Stata 16 .     16    16    16           16  |
     |-------------------------------------------------------------------------------------------------|
  6. |   Always correct the regressions for clustered standard errors.                                 |
  7. |               I get an error code r(997.55). What do i do next?   997   997   997.       997.55 |
  8. |                          myname@coolmail.com, Tel: +43 444 5555    43   +43   +43       +43 444 |
  9. |                         othername@dmail.net, Tel: +1 800 1337.      1    +1    +1        +1 800 |
 10. |                                 Firstname  Lastname  03-06-1990    03    03    03-        03-06 |
     +-------------------------------------------------------------------------------------------------+

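t4 is a clear improvement and captures most of the numbers, but phone numbers may contain more than one separator, so we extend the pattern further: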

. gen t5 = ustrregexs(0) if ustrregexm(x, "\+?\d+([\s|\.|-]?)(\d+)?([\s|\.|-]?)(\d+)")
. list x t*

     +----------------------------------------------------------------------------------------------------------------+
     |                                                               x    t1    t2     t3           t4             t5 |
     |----------------------------------------------------------------------------------------------------------------|
  1. |                    The quick brown fox jumps over the lazy dog.                                                |
  2. |                      the sun is shining. The birds are singing.                                                |
  3. |                                            Pi equals 3.14159265     3     3     3.   3.14159265     3.14159265 |
  4. | TheRe arE 9 plANetS in THE solar SYstem.  eARth's mOOn is round     9     9     9            9                 |
  5. |                                              I LOVE Stata 16 .     16    16    16           16              16 |
     |----------------------------------------------------------------------------------------------------------------|
  6. |   Always correct the regressions for clustered standard errors.                                                |
  7. |               I get an error code r(997.55). What do i do next?   997   997   997.       997.55         997.55 |
  8. |                          myname@coolmail.com, Tel: +43 444 5555    43   +43   +43       +43 444   +43 444 5555 |
  9. |                         othername@dmail.net, Tel: +1 800 1337.      1    +1    +1        +1 800    +1 800 1337 |
 10. |                                 Firstname  Lastname  03-06-1990    03    03    03-        03-06     03-06-1990 |
     +----------------------------------------------------------------------------------------------------------------+

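The phone numbers, dates and decimals are now extracted in full. The one loss is row 4: because the final (\d+) is mandatory, the pattern needs at least two digits, so the lone 9 is no longer captured. Seen cold, this expression would look baffling, but after building it up step by step, the logic behind writing regular expressions should be apparent. This is still not the most concise expression, but for reasons of space we do not develop it further.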
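
One possible tightening, offered as a sketch rather than as the original author's solution, is to repeat a "separator plus digits" group with *. The dataset and the variable name num are assumptions made for illustration; the strings are taken from the example data.

```stata
* Hypothetical four-row dataset reusing strings from the example data
clear
set obs 4
gen s = ""
replace s = "Tel: +43 444 5555"    in 1
replace s = "03-06-1990"           in 2
replace s = "Pi equals 3.14159265" in 3
replace s = "TheRe arE 9 plANetS"  in 4

* Optional +, a run of digits, then any number of
* "optional separator (whitespace, . or -) followed by digits" groups
gen num = ustrregexs(0) if ustrregexm(s, "\+?\d+([\s.\-]?\d+)*")
list
```

Because the repeated group may occur zero times, a single digit such as the 9 in row 4 is kept, while any number of separator-delimited blocks is still absorbed into one match.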

5. Worked examples

5.1 Names and dates of birth

* Data source: https://github.com/asjadnaqvi/The-Stata-Guide
. import delimited using "https://file.lianxh.cn/data/a/asjadnaqvi-The-Stata-Guide/data/file2.csv", clear
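. * nrow is a community-contributed command (ssc install nrow); it uses the values in row 1 as variable names
. nrow 1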
. list in 1/5

     +--------------------------------------------------+
     |          name           father               dob |
     |--------------------------------------------------|
  1. |         KIRAN       HIMAT KHAN    6/17/1998 0:00 |
  2. | HAFEEZ REHMAN     IFTIKHAR ALI   12/29/1999 0:00 |
  3. | MEHREEN AKBAR   MUHAMMAD AKBAR     12-09-97 0:00 |
  4. |   AYSHA KAMAL        KAMAL DIN   12/20/1998 0:00 |
  5. |  HUMERAN BIBI     KHUDA BAKASH   10/28/1996 0:00 |
     +--------------------------------------------------+

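This subset of the data contains names, dates of birth and other information. We start with some simple operations: splitting each name into its first and second part, and splitting each date into year, month and day.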

. gen name1 = trim(proper(ustrregexs(1))) if ustrregexm(name, "([A-Z]\w+)\s?([A-Z]\w+)?")
. gen name2 = trim(proper(ustrregexs(2))) if ustrregexm(name, "([A-Z]\w+)\s?([A-Z]\w+)?")
. list name* in 1/5

     +----------------------------------+
     |          name     name1    name2 |
     |----------------------------------|
  1. |         KIRAN     Kiran          |
  2. | HAFEEZ REHMAN    Hafeez   Rehman |
  3. | MEHREEN AKBAR   Mehreen    Akbar |
  4. |   AYSHA KAMAL     Aysha    Kamal |
  5. |  HUMERAN BIBI   Humeran     Bibi |
     +----------------------------------+

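Here we use subexpressions together with the trim and proper functions mentioned earlier to split the name into its two parts. Because some records contain only one name, the first name is not always followed by a whitespace character, so \s? is used. name1 and name2 extract the first and second subexpression respectively, convert them to proper case and strip the surrounding whitespace. Next we process the dates, starting by removing "0:00":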

. replace dob = trim(subinstr(dob,"0:00","",.))
. list dob in 1/5

     +------------+
     |        dob |
     |------------|
  1. |  6/17/1998 |
  2. | 12/29/1999 |
  3. |   12-09-97 |
  4. | 12/20/1998 |
  5. | 10/28/1996 |
     +------------+

The command above replaces "0:00" with an empty string and then trims the remaining whitespace. To extract the year:

. gen year = ustrregexs(0) if ustrregexm(dob, "\d+$")
. list dob year in 1/5

     +-------------------+
     |        dob   year |
     |-------------------|
  1. |  6/17/1998   1998 |
  2. | 12/29/1999   1999 |
  3. |   12-09-97     97 |
  4. | 12/20/1998   1998 |
  5. | 10/28/1996   1996 |
     +-------------------+

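This extracts the run of digits at the end of the string, which is the year. To extract the month, take the first run of digits, since these dates are written month first: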

. gen month= ustrregexs(0) if ustrregexm(dob, "\d+")
. list dob month in 1/5

     +--------------------+
     |        dob   month |
     |--------------------|
  1. |  6/17/1998       6 |
  2. | 12/29/1999      12 |
  3. |   12-09-97      12 |
  4. | 12/20/1998      12 |
  5. | 10/28/1996      10 |
     +--------------------+

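To extract the day, use the separator: the pattern finds the first / or - and captures the digits that follow it.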

. gen day = ustrregexs(1) if ustrregexm(dob, "[/|-](\d+)")
. list dob day in 1/5

     +------------------+
     |        dob   day |
     |------------------|
  1. |  6/17/1998    17 |
  2. | 12/29/1999    29 |
  3. |   12-09-97    09 |
  4. | 12/20/1998    20 |
  5. | 10/28/1996    28 |
     +------------------+

Finally, the date functions can turn these strings into a Stata date:

. * Convert the strings to numbers
. foreach v in year month day{
  2.    destring `v', replace
  3. }

. * Expand two-digit years to four digits
. replace year = year+1900 if (year<100 & year>90)
. replace year = year+2000 if year<10

. * Convert to a Stata date
. gen date = mdy(month,day,year)
. format date %tdDD/NN/CCYY
. list dob date in 1/5

     +-------------------------+
     |        dob         date |
     |-------------------------|
  1. |  6/17/1998   17/06/1998 |
  2. | 12/29/1999   29/12/1999 |
  3. |   12-09-97   09/12/1997 |
  4. | 12/20/1998   20/12/1998 |
  5. | 10/28/1996   28/10/1996 |
     +-------------------------+
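
As a cross-check on the regex route, Stata's built-in date() function can parse these strings directly. This is an alternative sketch, not the original author's method; the three-row dataset and the variable name date2 are assumptions made for illustration.

```stata
* Hypothetical three-row dataset reusing dates from the example data
clear
set obs 3
gen dob = ""
replace dob = "6/17/1998"  in 1
replace dob = "12-09-97"   in 2
replace dob = "10/28/1996" in 3

* "MDY" gives the order of the components; the third argument (2000) is the
* latest year a two-digit year may be mapped to, so 97 is read as 1997
gen date2 = date(dob, "MDY", 2000)
format date2 %tdDD/NN/CCYY
list
```

The regex approach remains useful when the components have to be inspected or repaired one by one; date() is shorter when the strings are already reasonably regular.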

5.2 School census

. import delimited using "https://file.lianxh.cn/data/a/asjadnaqvi-The-Stata-Guide/data/file1.csv", clear bindq(strict) varn(1)
. list in 1/5

     +----------------------------------------------------+
     |                                             school |
     |----------------------------------------------------|
  1. |                       GPS DULU GHURUKEY (COMBINED) |
  2. |                     GOVT .P/S TANNY KAY (COMBINED) |
  3. |                      GOVT.P/SBONGA MALA (COMBINED) |
  4. |        GOVT. M/S KOT SARDAR KAHIN SINGH (COMBINED) |
  5. | OFFICE OF B.H.U. KOT SARDAR KAHIN SINGH (COMBINED) |
     +----------------------------------------------------+

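The dataset contains a single variable, school, which mixes the school name (in full or abbreviated), the school type, the village name and so on. In real data, inconsistent data entry often leaves different fields jumbled together like this, which makes cleaning a real challenge. We start with a simple operation: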

. gen schooltype1 = ustrregexs(0) if ustrregexm(school, "(G?[G|B][E|P|M|H|S]S)")
. list school schooltype1 in 1/5

     +---------------------------------------------------------------+
     |                                             school   school~1 |
     |---------------------------------------------------------------|
  1. |                       GPS DULU GHURUKEY (COMBINED)        GPS |
  2. |                     GOVT .P/S TANNY KAY (COMBINED)            |
  3. |                      GOVT.P/SBONGA MALA (COMBINED)            |
  4. |        GOVT. M/S KOT SARDAR KAHIN SINGH (COMBINED)            |
  5. | OFFICE OF B.H.U. KOT SARDAR KAHIN SINGH (COMBINED)            |
     +---------------------------------------------------------------+

. tab schooltype1, m

schooltype1 |      Freq.     Percent        Cum.
------------+-----------------------------------
            |        328       55.69       55.69
       GBES |         19        3.23       58.91
       GBHS |         30        5.09       64.01
       GBMS |          2        0.34       64.35
       GBPS |         36        6.11       70.46
       GGES |         24        4.07       74.53
       GGHS |          5        0.85       75.38
       GGPS |        124       21.05       96.43
        GHS |          1        0.17       96.60
        GPS |         20        3.40      100.00
------------+-----------------------------------
      Total |        589      100.00

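The school type is extracted from the pattern of the abbreviation. In the regular expression, the optional leading G stands for government, [G|B] for girls or boys, [E|P|M|H|S] for the school level, and the final S for school. This expression identifies about 44% of the school types (261 of 589 rows). Next comes a more complex expression: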

. gen schooltype2 = ustrregexs(0) if ustrregexm(school, "(GOVT)?\.?\s?((BOY|GIRL)S?)?\s?((E|M|P|H|S)/?\\?S(CHOOL)?)?")
. list school schooltype2 in 1/5

     +----------------------------------------------------------------+
     |                                             school   schoolt~2 |
     |----------------------------------------------------------------|
  1. |                       GPS DULU GHURUKEY (COMBINED)             |
  2. |                     GOVT .P/S TANNY KAY (COMBINED)       GOVT  |
  3. |                      GOVT.P/SBONGA MALA (COMBINED)    GOVT.P/S |
  4. |        GOVT. M/S KOT SARDAR KAHIN SINGH (COMBINED)   GOVT. M/S |
  5. | OFFICE OF B.H.U. KOT SARDAR KAHIN SINGH (COMBINED)             |
     +----------------------------------------------------------------+

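In this expression the school type may start with GOVT, possibly followed by . or a whitespace character; the gender may be singular or plural, hence S?; another whitespace character may follow; then comes the school level, possibly followed by / or \; the last part stands for school and may be abbreviated or spelled out. This command identifies roughly another 45% of the school types, but some errors remain. In particular, every component is optional, so the pattern can match an empty string at the very start of the text, which is why rows 1 and 5 come back blank and row 2 stops after "GOVT ". Readers can adjust the expression for more precise matching. To extract BHUs, one could write: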

. gen schooltype3 = ustrregexs(0) if ustrregexm(school, "B(ASIC)?\.?\s?H(EALTH)?\.?\s?U(NIT)?\.?")
. list school schooltype3 in 1/5

     +---------------------------------------------------------------+
     |                                             school   school~3 |
     |---------------------------------------------------------------|
  1. |                       GPS DULU GHURUKEY (COMBINED)            |
  2. |                     GOVT .P/S TANNY KAY (COMBINED)            |
  3. |                      GOVT.P/SBONGA MALA (COMBINED)            |
  4. |        GOVT. M/S KOT SARDAR KAHIN SINGH (COMBINED)            |
  5. | OFFICE OF B.H.U. KOT SARDAR KAHIN SINGH (COMBINED)     B.H.U. |
     +---------------------------------------------------------------+

This line matches most of the strings that contain BHUs. A quick check of the matches so far:

. gen check = schooltype1!="" | schooltype2!=""  | schooltype3!=""
. tab check

      check |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         48        8.15        8.15
          1 |        541       91.85      100.00
------------+-----------------------------------
      Total |        589      100.00

About 92% of the school types are now matched. Next we extract the gender information:

. gen gender1 = ustrregexs(0) if ustrregexm(school, "\(([A-Z]+)\)")
. gen gender2 = ustrregexs(1) if ustrregexm(school, "\(([A-Z]+)\)")
. list  school gender1 gender2 in 1/5

     +----------------------------------------------------------------------------+
     |                                             school      gender1    gender2 |
     |----------------------------------------------------------------------------|
  1. |                       GPS DULU GHURUKEY (COMBINED)   (COMBINED)   COMBINED |
  2. |                     GOVT .P/S TANNY KAY (COMBINED)   (COMBINED)   COMBINED |
  3. |                      GOVT.P/SBONGA MALA (COMBINED)   (COMBINED)   COMBINED |
  4. |        GOVT. M/S KOT SARDAR KAHIN SINGH (COMBINED)   (COMBINED)   COMBINED |
  5. | OFFICE OF B.H.U. KOT SARDAR KAHIN SINGH (COMBINED)   (COMBINED)   COMBINED |
     +----------------------------------------------------------------------------+

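In the raw data the gender information sits inside parentheses. The first command returns the whole match, parentheses included (ustrregexs(0)); the second uses the subexpression to return only the text inside them (ustrregexs(1)). Once the variables of interest have been extracted, they can be "subtracted" from the original string to leave the remaining information.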

. * subinstr() is the string function for replacing substrings.
. replace school = subinstr(school, schooltype1, "", .)
. replace school = subinstr(school, schooltype2, "", .)
. replace school = subinstr(school, schooltype3, "", .)
. replace school = subinstr(school, gender1, "", .)
. replace school = trim(school)
. list school in 1/5

     +-----------------------------------+
     |                            school |
     |-----------------------------------|
  1. |                    DULU GHURUKEY  |
  2. |                   .P/S TANNY KAY  |
  3. |                       BONGA MALA  |
  4. |            KOT SARDAR KAHIN SINGH |
  5. | OFFICE OF  KOT SARDAR KAHIN SINGH |
     +-----------------------------------+

Of course, the expressions above can be refined further to extract other variables and to fix the remaining mismatches.
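
The subtraction step can also be done with a regular expression instead of subinstr(). The sketch below is an alternative, not the original author's code; it uses ustrregexra() on two strings from the example data, and the variable name rest is an assumption made for illustration.

```stata
* Hypothetical two-row dataset reusing strings from the example data
clear
set obs 2
gen school = ""
replace school = "GPS DULU GHURUKEY (COMBINED)" in 1
replace school = "GOVT. M/S KOT SARDAR KAHIN SINGH (COMBINED)" in 2

* Remove the trailing parenthesised label together with the space before it
gen rest = ustrregexra(school, "\s*\([A-Z]+\)$", "")
list
```

Matching on the pattern rather than on a previously extracted variable removes the label in one step and needs no separate trim() for the space before the parenthesis.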

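6. Summary

Character matching:

[jkl]    = j, k or l
[^jkl]   = any character other than j, k or l
[j|l]    = j, | or l (inside brackets | is a literal character, not "or")
[a-z]    = any lowercase letter
[a-zA-Z] = any uppercase or lowercase letter
[0-9]    = any digit
\d       = any digit
\D       = any non-digit
\w       = any word character: letter, digit or underscore (Unicode letters such as Chinese characters included)
\W       = any character not matched by \w
\s       = any whitespace character
\S       = any non-whitespace character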

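Anchors, metacharacters and quantifiers:

^  matches the position at the start of the string
$  matches the position at the end of the string
.  matches any single character
|  or (alternation)
?  matches 0 or 1 occurrence
*  matches 0 or more occurrences
+  matches 1 or more occurrences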

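The quantifiers also come in different matching modes:

Quantity                  Greedy    Reluctant (lazy)    Possessive
------------------------------------------------------------------
0 or 1                    ?         ??                  ?+
0 or more                 *         *?                  *+
1 or more                 +         +?                  ++
exactly y                 {y}       {y}?                {y}+
at least y                {y,}      {y,}?               {y,}+
at least y, at most z     {y,z}     {y,z}?              {y,z}+

A simple example shows the difference between greedy and lazy matching: cool (.+) hat (greedy) captures everything between cool and the last occurrence of hat, whereas cool (.+?) hat (lazy) captures only what lies between cool and the first occurrence of hat. The details of how each mode proceeds are more involved and are not covered here.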
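
The cool ... hat example can be run as follows. This is a minimal sketch; the test sentence and the variable names greedy and lazy are made up for illustration.

```stata
* Hypothetical one-row dataset with two occurrences of "hat"
clear
set obs 1
gen s = "a cool red hat and a cool blue hat"

* Greedy: (.+) expands as far as possible, up to the last " hat"
gen greedy = ustrregexs(1) if ustrregexm(s, "cool (.+) hat")
* Lazy: (.+?) stops at the first " hat"
gen lazy   = ustrregexs(1) if ustrregexm(s, "cool (.+?) hat")
list
```

The greedy version captures "red hat and a cool blue", while the lazy version captures only "red".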

Characters that must be escaped with \ to be matched literally:

[ \ ^ $ . | ? * + ( ) { }
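
A short sketch of why escaping matters (the two test strings and variable names are invented for illustration): an unescaped . matches any character, while \. matches only a literal full stop.

```stata
* Hypothetical test strings
clear
set obs 2
gen s = ""
replace s = "3.14" in 1
replace s = "3x14" in 2

* Unescaped dot: any character is accepted between 3 and 14
gen byte any_char = ustrregexm(s, "3.14")
* Escaped dot: only a literal "." is accepted
gen byte literal  = ustrregexm(s, "3\.14")
list
```

Both rows match the unescaped pattern, but only row 1 matches the escaped one.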

Brackets:

( ) create a subexpression (capture group)
[ ] match any one character from the enclosed set
{ } specify the number of repetitions

7. Related posts

Note: the list of related posts can be generated in Stata with:
lianxh 正则, m
To install the latest version of the lianxh command:
ssc install lianxh, replace
