Stata: 约翰霍普金斯大学 COVID-19 疫情数据处理及可视化

发布时间:2020-04-09 阅读 4460

Stata 连享会   主页 || 视频 || 推文

温馨提示: 定期 清理浏览器缓存,可以获得最佳浏览体验。

课程详情 https://gitee.com/arlionn/Course   |   lianxh.cn

课程主页 https://gitee.com/arlionn/Course

Stata: 约翰厚普金斯大学 COVID-19 疫情数据处理及可视化

编译: 刘亮(三峡大学) 邮件: jetinskyliu0@gmail.com

Sources:

  • Chuck Huber, , Import COVID-19 data from Johns Hopkins University, Stata Blog, -Link-
  • Chuck Huber, COVID-19 time-series data from Johns Hopkins University, -Link-
  • Chuck Huber, How to create choropleth maps using the COVID-19 data from Johns Hopkins University, 7 April 2020, -Link-

编者按: 这篇推文在编译上述三篇 Stata Blogs 的基础上,进一步做了针对中国的数据分析和可视化处理。主要不目的不是实时呈现疫情变动 (已经有很多优秀的项目在做这件事情了),而是为了演示 Stata 中的一些常用数据分析和绘图命令。


目录


1. 简介

本文介绍 Chuck Huber 博士编写的 Covid19 Stata 命令,并将约翰`霍普金斯大学公布的全球新型冠状病毒数据为例,依次介绍如何使用 Stata 命令进行数据获取、清洗、保存以及数据可视化。

该命令能生成一个包含日期、确诊病例、新增病例、死亡人数,以及康复人数的表格。为了能够查看具体国家病毒感染人数,作者添加了 country()graph 选项,以便画出确诊人数图, 添加保存选项 saving, 以便将数据保存为 covid19_usa.dta

其基本命令为:

covid19, country("US") graph saving(covid19_usa)

​> Confirmed cases of COVID-19 in the United States

2. 约翰霍普金斯大学 GitHub 数据

GitHub 是一个流行的软件开发和发布平台。约翰霍普金斯大学系统科学与工程中心(School of Engineering Center for Systems Science and Engineering)拥有一个包括来自世界各地的定期更新的 COVID-19GitHub 数据存储库。

其原始数据网址为:https://github.com/CSSEGISandData/COVID19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports

打开网站后得到如下界面,可以看到部分数据如下图所示:

每天的相关数据存储在独立的.csv 文件中。以 2020 年 1 月 29 日的数据为例,查看相关内容。

点击"Raw"按键可以看到原始的数据,原始数据如下所示:

3. 数据下载

我们能够将这些原始数据(Raw data)使用 import delimited 命令将其导入到 Stata 中,首先以 2020 年 1 月 29 日为例。Stata 的基本命令为:import delimited https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-29-2020.csv

导入数据后先输入describe命令,结果显示如下:

输入list in 1/5查看前 5 个观测值。

4. 数据处理

4.1 修改变量名

我们看到变量 provincestate 在一些文件中被当成变量 ïprovincestate。由于导入的文件很多,我们避免手动逐个查看。我们使用 confirm 命令去逐个查看每个文件中的 ïprovincestate。其相关命令如下:

confirm variable ïprovincestate
display_rc

输出结果为:0

如果变量 ïprovincestate 存在,command 命令会为_rc 存一个"0"值,从而我们可以使用 rename 和 label 命令修改变量名和变量标签。相关命令如下:

if _rc == 0{
    rename ïprovincestate provincestate
    label variable provincestate "Province/State"
}

执行完了 if 命令块后,使用describe provincestate命令,可以看到现在的数据中包含的变量为 provincestate 的相关信息。

为了检查数据是否还存在 ïprovincestate 变量,再使用comfirm variable ïprovincestate其结果如下:

可以发现原始数据中的变量已经被我们清理完成了。

4.2 数据导入

4.2.1 使用宏导入不同的文件

宏可以分为局部宏和全局宏。全局宏(global macro)一旦定义将贯穿整个 Stata 程序,局部宏(local macro)仅存在被定义的程序或 do 文件中。我们做一个示例定义并引用一个局部宏 today,如下所示:

local today = "3-19-2020"

我们可以通过在宏名称使用左右引号进行引用。左引号(在"esc"键下方)和右引号(在"回车"键的左侧)来引用宏。命令如下:

local today = "3-19-2020"
display "`today'"

3-19-2020

我们可以通过整合其他宏来创建宏,例如:

local month = "3"
local day = "19"
local year = "2020"
local today = "`month'-`day'-`year'"
display "`today'"

3-19-2020

我们想导入名为 03-19-2020.csv 的文件。注意月份包含一个"前置 0"。1 至 9 月被指定为:"01","02"往后依次类推。10 至 12 月指定为"10","11","12"。我们需要一种方法值赋时给月份添加适当的包含前导零的局部宏。String()函数能够满足这一要求。相关代码如下所示:

local month = string(3, "%02.0f")
local day = string(19, "%02.0f")
local year = "2020"
local today = "`month'-`day'-`year'"
display "`today'"

03-19-2020

接下来将前面的内容进行整合,使用宏来导入文件。命令为:

local URL = "https://raw.githubusercontent.com/CSSEGISandData/  > COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/"
local FileName = "`URL'`today'.csv"
import delimited "`FileName'"

(8 vars,284 obs)

连享会主页: lianxh.cn

4.2.2 使用循环读入多个文件

作者思路非常清晰,首先介绍简单的月循环,将每个月展示出来,然后介绍嵌套循环,将年和月共同展示。

作者调用宏将月进行循环,代码如下:

forvalues month = 1/12{
    display "month = `month'"
}

回归结果如下:

继续将月和日共同循环,相关代码如下:

forvalues month = 1/12{
    forvalues day = 1/31{
        display "month = `month', day = `day'"
    }
}

结果展示如下:

有了前文的介绍,继续运用暂元和循环的相关知识,展现 2020 年所有的日期。相关代码如下:

forvalues month = 1/12{
    forvalues day = 1/31{
        local month = string(`month', "%02.0f")
        local day   = string(`day', "%02.0f")
        local year  = "2020"
        local today = "`month'-`day'-`year'"
        display "`today'"
    }
}

结果展示:

接下来作者使用关于循环和宏的命令将网站的所有数据均导入到 Stata 中,但在执行命令的过程中需要注意 2 点:

疫情的数据从 1 月 23 号开始公布,在此之前并没有数据,如果不加以提示,会出现报错,程序停止运行。

原始数据中有的文件的变量名 ïprovincestate,有的是 provincestate,全部统一修改为 provincestate。为了避免程序出现错误,使用 capture 命令,即使程序出现错误也可以继续执行。

本部分命令如下:

local URL = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/"
forvalues month = 1/12 {
    forvalues day = 1/31 {
          local month = string(`month', "%02.0f")
          local day = string(`day', "%02.0f")
          local year = "2020"
          local today = "`month'-`day'-`year'"
          local FileName = "`URL'`today'.csv"
          clear
          capture import delimited "`FileName'"
          capture save "`today'"
   }
}

输入ls命令 可以查看 Stata 中下载到的数据。

打开 01-22-2020.dta 文件,并且输入describe查看数据。相关结果如下:

我们看到其中的文件包括 ïprovincestate 变量名,需要将其全部统一为provincestate,代码如下:

local URL = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/"
forvalues month = 1/12 {
   forvalues day = 1/31 {
      local month = string(`month', "%02.0f")
      local day = string(`day', "%02.0f")
      local year = "2020"
      local today = "`month'-`day'-`year'"
      local FileName = "`URL'`today'.csv"
      clear
      capture import delimited "`FileName'"
      capture confirm variable ïprovincestate
      if _rc == 0 {
         rename ïprovincestate provincestate
         label variable provincestate "Province/State"
      }
   capture save "`today'", replace  //需要添加replace,替换掉之间存在内存中的数据
   }
}

现在打开 01-22-2020.dta 数据,并且describe结果如下:

继续使用循环命令将所有的数据依次添加并保存,代码如下:

clear
forvalues month = 1/12 {
    forvalues day = 1/31 {
        local month = string(`month', "%02.0f")
          local day = string(`day', "%02.0f")
          local year = "2020"
          local today = "`month'-`day'-`year'"
          capture append using "`today'"
      }
}

最后使用describe描述最后生成的数据。

5. 数据更新与保存

随着疫情的不断发展,相关的数据不断更新,因此本文的相关数据和变量会产生不一致。例如provoncestate,province_state,countryregion,country_region等。当我们追加数据时必须要保证变量名称一致,接下来继续使用 capture。其完整代码如下:

//收集数据
local URL = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/"
forvalues month = 1/12 {
    forvalues day = 1/31 {
        local month = string(`month', "%02.0f")
          local day = string(`day', "%02.0f")
          local year = "2020"
          local today = "`month'-`day'-`year'"
          local FileName = "`URL'`today'.csv"
          clear
          capture import delimited "`FileName'"
          capture confirm variable ïprovincestate
          if _rc == 0 {
             rename ïprovincestate provincestate
             label variable provincestate "Province/State"
      }
      capture rename province_state provincestate
      capture rename country_region countryregion
      capture rename last_update lastupdate
      capture rename lat latitude
      capture rename long longitude
      capture save "`today'", replace
      }
}
clear
forvalues month = 1/12 {
   forvalues day = 1/31 {
      local month = string(`month', "%02.0f")
      local day = string(`day', "%02.0f")
      local year = "2020"
      local today = "`month'-`day'-`year'"
      capture append using "`today'"
   }
}
//合并数据
clear
forvalues month = 1/12 {
    forvalues day = 1/31 {
        local month = string(`month', "%02.0f")
          local day = string(`day', "%02.0f")
          local year = "2020"
          local today = "`month'-`day'-`year'"
          capture append using "`today'"
   }
}

describe数据可以看到数据结果为:

更新数据已经处理完毕,所有变量均统一,最后保存,为以后所用。

save covid19_raw

6. 数据可视化

本文收集到的疫情数据截止到 2020 年 4 月 2 日,随着疫情的不断发展,相关数据可以到https://github.com/CSSEGISandData/COVID19/tree/master/csse_covid_19_data/csse_covid_19_daily_reportszi自行下载。

6.1 中国疫情发展情况可视化

湖北省是国内疫情重灾区,我们心系湖北,所以先画出湖北省现存确诊人数。本部分代码借鉴 TidyFriday 公众号《实时疫情与 Stata 地图绘制 》一文,相关代码如下:

use China_covid_0409.dta,clear
gen current_confirmed = confirmed - deaths - recovered //计算现存确诊人数
replace current_confirmed = 0 if current_confirmed < 0
keep province date current_confirmed
drop if missing(current_confirmed)
replace province = subinstr(province, " ", "_", .)
spread province current_confirmed  //生成面板
tw conn Hubei date, xtitle("日期", size(*1.0)) ///
    ytitle("现存确诊人数", size(*0.8))  ///
    title("湖北省现存确诊人数变化)", size(*1.3)) ///
    subtitle("") caption("数据来源:约翰·霍普金斯大学", size(*0.8))
graph save hubei.png, replace

我们同时也关注湖北省周边省市的疫情情况,同时画出周边四省的现存确诊人数,从上往下依次为河南、安徽、江西、湖南。相关代码如下:

tw conn Henan Anhui Jiangxi Hunan date, xtitle("日期", size(*0.8)) ///
    ytitle("现存确诊人数", size(*0.8)) ///
    title("湖北周边四省现存确诊人数变化", size(*1.1)) ///
    subtitle("河南、安徽、江西和湖南四省") caption("数据来源:约翰·霍普金斯大学",         size(*0.8)) xla(21937(16)22007) ///
    xsc(range(21937 22017) extend) ///
    leg(order(1 "河南" 2 "安徽" 3 "江西" 4 "湖南") pos(1) ring(0)) ///
    lc("100 200 165" "252 141 98" "141 160 203" "231 138 195") ///
    mc("102 194 165" "252 141 98" "141 160 203" "231 138 195")
graph save Fourprovince.png, replace

接下来我们观察累计治愈人数,累计死亡病例数、每日新增治愈人数、每日确诊人数。相关代码如下:

use China_covid_0409.dta,clear
collapse (sum) confirmed (sum) deaths (sum) recovered, by(date)
tsset date
gen new_confirmed = confirmed - l.confirmed
tw ///
lpolyci new_confirmed date, bw(3) || ///
sc new_confirmed date, ms(circle) ||, ///
    leg(off) xlab(, format(%tdCY-N-D)) xla(21937(16)22013) ///
    xsc(range(21937 22013) extend) ///
    ti("每日新增确诊数", size(*1.2)) ///
    yti("人数") name(a, replace) nodraw

gen current_confirmed = confirmed - deaths - recovered
replace current_confirmed = 0 if current_confirmed < 0
tw ///
lpolyci current_confirmed date, bw(3) || ///
sc current_confirmed date, ms(circle) ||, ///
    leg(off) xlab(, format(%tdCY-N-D)) xla(21937(16)22013) ///
    xsc(range(21937 22013) extend) ///
    ti("现存确诊数", size(*1.2)) ///
    yti("人数") name(b, replace) nodraw

graph combine a b, caption("数据来源:约翰霍普金斯大学", size(*0.8)) xsize(20) ysize(12) ///
    graphr(margin(medlarge))
graph export ab.png, replace

* 死亡和治愈病例数量
use China_covid_0409.dta,clear
collapse (sum) confirmed (sum) deaths (sum) recovered, by(date)
tsset date
tw ///
lpolyci recovered date, bw(3) || ///
sc recovered date, ms(circle) ||, ///
    leg(off) xlab(, format(%tdCY-N-D)) xla(21937(16)22013) ///
    xsc(range(21937 22013) extend) ///
    ti("累计治愈人数", size(*1.2)) ///
    yti("人数") name(a, replace) nodraw

tw ///
lpolyci deaths date, bw(3) || ///
sc deaths date, ms(circle) ||, ///
    leg(off) xlab(, format(%tdCY-N-D)) xla(21937(16)22013) ///
    xsc(range(21937 22013) extend) ///
    ti("累计死亡病例数", size(*1.2)) ///
    yti("人数") name(b, replace) nodraw

* 每日新增治愈
gen new_recovered = recovered - l.recovered
tw ///
lpolyci new_recovered date, bw(3) || ///
sc new_recovered date, ms(circle) ||, ///
    leg(off) xlab(, format(%tdCY-N-D)) xla(21937(16)22013) ///
    xsc(range(21937 22013) extend) ///
    ti("每日新增治愈人数", size(*1.2)) ///
    yti("人数") name(c, replace) nodraw
graph combine a b, caption("数据来源:约翰霍普金斯大学", size(*0.8)) xsize(20) ysize(12) ///
    graphr(margin(medlarge))
* 每日新增死亡
gen new_deaths = deaths - l.deaths
tw ///
lpolyci new_deaths date, bw(3) || ///
sc new_deaths date, ms(circle) ||, ///
    leg(off) xlab(, format(%tdCY-N-D)) xla(21937(16)22013) ///
    xsc(range(21937 22013) extend) ///
    ti("每日新增", size(*1.2)) ///
    yti("人数") name(d, replace) nodraw

graph combine a b c d, r(2) caption("数据来源:约翰霍普金斯大学", size(*0.8)) xsize(20) ysize(12) ///
    graphr(margin(medlarge))
graph save fourtype.png, replace

接下来我们分析治愈人数,死亡人数,现存确诊的动态图。

use China_covid_0409.dta, clear
collapse (sum) confirmed (sum) deaths (sum) recovered, by(date)
* 计算累积量
gen base = 0
gen deaths_recovered = deaths + recovered
* 绘图
tw ///
rarea recovered base date, fc("180 209 132") ///
    c("180 209 132") fintensity(inten80) lw(vvthin) || ///
rarea deaths_recovered recovered date, fc("122 118 123") ///
    c("122 118 123") fintensity(inten80) lw(vvthin) || ///
rarea confirmed deaths_recovered date, fc("95 144 161") ///
    c("95 144 161") fintensity(inten80) lw(vvthin) ||, ///
    leg(order(3 "现存确诊" 2 "死亡" 1 "治愈") pos(11) ring(0)) ///
    title("新冠肺炎疫情在中国的发展状况", size(*1.5) ///
        justification(left) bexpand) ///
    subtitle("中国新冠肺炎的确诊人数、治愈人数和死亡人数的发展趋势。", ///
        justification(left) bexpand) ///
    caption("数据来源:约翰霍普金斯大学", ///
        size(*0.8)ring(0)) ///
    xtitle("") xla(21937(8)22013) ///
    xsc(range(21937 22013) extend) ///
    graphr(margin(medlarge)) ///
    text(20000 21984 "治愈", color(black) size(*1.2)) ///
    text(63000 21983 "死亡", color(black) size(*1.2)) ///
    text(70000 21978 "现存确诊", color(black) size(*1.2)) || ///
pci 80000 21957 0 21957, lp(dash) text(56000 21952 "我国改变了" "确诊的标准", color(black) size(*0.8) justification(right) bexpand)
graph save dynamic.png,replace

6.2 世界疫情发展可视化

新冠肺炎在全球蔓延,约翰霍普金斯大学同时收集了各个国家的疫情数据。本部分内容将绘制新冠肺炎在世界蔓延情况。

首先绘制截止到 4 月 8 号全球疫情最严重的 12 个国家。本部分代码借鉴 TidyFriday 公众号《实时疫情与 Stata 地图绘制 》一文

use global_covid_0403.dta,replace
* 全球疫情最严重的12个国家现存确诊人数
use global_covid_0403.dta,clear
gen current_confirmed = confirmed - deaths - recovered
gsort - current_confirmed
sencode country,gen(country_id)
drop country
rename country_id country
keep in 1/12
twoway bar current_confirmed country,sort horizontal barwidth(0.5) ///
    fcolor(red) ylabel(1(1)12,valuelabel angle(0) labsize(*1.2)) ///
    ytitle("") xtitle("现存确诊人数") ysize(2) xsize(4) title(截止2020年4月2日确诊人数) ///
    caption("数据来源:约翰霍普金斯大学")
graph save global_12.png,replace

接下来绘制世界疫情图,本文部分代码借鉴 TidyFriday 公众号《使用 Stata 分析全球的新冠疫情数据 TidyFriday》

use global_covid_0409.dta,clear
* 修改国家
replace country = "United States of America" if country == "US"
replace country = "The Bahamas" if country == "Bahamas"
replace country = "Republic of the Congo" if country == "Congo (Brazzaville)"
replace country = "Democratic Republic of the Congo" if country == "Congo (Kinshasa)"
replace country = "Ivory Coast" if country == "Cote d'Ivoire"
replace country = "Czech Republic" if country == "Czechia"
replace country = "Guinea Bissau" if country == "Guinea-Bissau"
replace country = "South Korea" if country == "Korea, South"
replace country = "Macedonia" if country == "North Macedonia"
replace country = "Republic of Serbia" if country == "Serbia"
replace country = "United Republic of Tanzania" if country == "Tanzania"
replace country = "East Timor" if country == "Timor-Leste"
rename country name
merge 1:1 name using worlddb
* 绘制地图
replace confirmed = 0 if missing(confirmed)
xtset, clear
grmap confirmed using worldcoord, id(ID) ///
    fcolor("252 255 164" "250 193 39" "245 125 21" "212 72 66" "159 42 99" "101 21 110" "40 11 84") ///
    ocolor("white" ...) ///
    clmethod(custom) clbreaks(0 0.9 9 99 999 9999 99999 999990) ///
    title("新型冠状病毒肺炎疫情分布", size(*1.2) color(white)) ///
    graphr(margin(medium)) ///
    subtitle("截止 2020 年 4 月 8 日", color(white)) ///
    osize(vthin ...) ///
    legend(size(*1.5) ///
        order(2 "无" 3 "1~9 人" 4 "10~99 人" 5 "100~999 人" 6 "1000~9999人" 7 "10000~100000 人" 8 ">= 100000 人") ///
        ti(确诊, size(*0.5) pos(11) color(white)) color(white)) ///
    plotr(fcolor(24 24 24) lcolor(24 24 24)) ///
    graphr(fcolor(24 24 24) lcolor(24 24 24)) ///
    xsize(20) ysize(8.510638298) ///
    caption("数据来源:约翰·霍普金斯大学", size(*0.8) color(white))
graph save global_confirmed.png,replace

新冠疫情备受全球瞩目,在 github 上同样发现了很多有趣的项目,感兴趣的小伙伴可以自行浏览

https://github.com/emanuele-guidotti/COVID19

全文代码如下:

cd "D:\stata15\ado\personal\Stata_Blog\4.9"
****************************
*******收集数据*************
****************************
//收集数据//合并数据//保存数据
* 收集数据
local URL = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/"
forvalues month = 1/12 {
    forvalues day = 1/31 {
        local month = string(`month', "%02.0f")
          local day = string(`day', "%02.0f")
          local year = "2020"
          local today = "`month'-`day'-`year'"
          local FileName = "`URL'`today'.csv"
          clear
          capture import delimited "`FileName'"
          capture confirm variable ïprovincestate
          if _rc == 0 {
             rename ïprovincestate provincestate
             label variable provincestate "Province/State"
      }
      capture rename province_state provincestate
      capture rename country_region countryregion
      capture rename last_update lastupdate
      capture rename lat latitude
      capture rename long longitude
      capture save "`today'", replace
      }
}
clear
forvalues month = 1/12 {
    forvalues day = 1/31 {
        local month = string(`month', "%02.0f")
          local day = string(`day', "%02.0f")
          local year = "2020"
          local today = "`month'-`day'-`year'"
          capture append using "`today'"
     }
}

* 合并数据
clear
forvalues month = 1/12 {
    forvalues day = 1/31 {
        local month = string(`month', "%02.0f")
          local day = string(`day', "%02.0f")
          local year = "2020"
          local today = "`month'-`day'-`year'"
          capture append using "`today'"
   }
}

* 保存数据
save covid_0409_chen.dta,replace

*保存主要变量
use covid_0409_chen.dta,clear
keep provincestate countryregion lastupdate confirmed deaths recovered

* 缩小字符串
format provincestate %15s
format countryregion %15s

* 将时间变量lastupdate分成10个类型依次处理
gen lastupdate_01 = lastupdate in 1/38
gen lastupdate_02 = lastupdate in 39/431
gen lastupdate_03 = lastupdate in 432/560
gen lastupdate_04 = lastupdate in 561/7617
gen lastupdate_05 = lastupdate in 7618/11034
gen lastupdate_06 = lastupdate in 11035/31566
gen lastupdate_07 = lastupdate in 31567/38439
gen lastupdate_08 = lastupdate in 38440/53995
gen lastupdate_09 = lastupdate in 53996/56804
gen lastupdate_10 = lastupdate in 56805/62544

rename lastupdate_01 type1
rename lastupdate_02 type2
rename lastupdate_03 type3
rename lastupdate_04 type4
rename lastupdate_05 type5
rename lastupdate_06 type6
rename lastupdate_07 type7
rename lastupdate_08 type8
rename lastupdate_09 type9
rename lastupdate_10 type10

//处理type1
gen y_01 = real(substr(type1,6,4))
gen m_01 = real(substr(type1,1,1))
gen d_01 = real(substr(type1,3,2))
gen new_type1 = string(y_01) + "-" + "0" + string(m_01) + "-" + string(d_01)

//处理type2
gen y_02 = real(substr(type2,6,2))
gen m_02 = real(substr(type2,1,1))
gen d_02 = real(substr(type2,3,2))
gen new_type2 = string(y_02) + "20" + "-" + "0" + string(m_02) + "-" + string(d_02)

//处理type3
replace type3 = subinstr(type3,"2/1/2020","2/01/2020",.)
gen y_03 = real(substr(type3,6,4))
gen m_03 = real(substr(type3,1,1))
gen d_03 = real(substr(type3,3,2))
gen new_type3 = string(y_03) + "-" + "0" + string(m_03) + "-" + string(d_03)
replace new_type3 = subinstr(new_type3,"2020-02-1","2020-02-01",.)

//处理type4
gen new_type4 = substr(type4,1,10)

//处理type5
gen y_05 = real(substr(type5,6,2))
gen m_05 = real(substr(type5,1,1))
gen d_05 = real(substr(type5,3,2))
gen new_type5 = string(y_05) + "20" + "-" + "0" +string(m_05) + "-" + string(d_05)

//处理type6
gen new_type6 = substr(type6,1,10)

//处理type7
gen y_07 = real(substr(type7,6,2))
gen m_07 = real(substr(type7,1,1))
gen d_08 = real(substr(type7,6,2))
gen new_type7 = string(y_07) + "20" + "-" + "0" + string(m_07) + "-" + string(d_07)

//处理type8
gen new_type8 = substr(type8,1,10)

//处理type9
gen y_09 = real(substr(type9,5,2))
gen m_09 = real(substr(type9,1,1))
gen d_09 = real(substr(type9,3,1))
gen new_type9 = string(y_09) + "20" + "-" + "0" +string(m_09) + "-" + "0" + string(d_09)

//处理type10
gen new_type10 = substr(type10,1,10)

//将new_type1-10合并到new_time统一时间
gen new_time = new_type1 in 1/38
replace new_time = new_type2 in 39/431
replace new_time = new_type3 in 432/560
replace new_time = new_type4 in 561/7617
replace new_time = new_type5 in 7618/11034
replace new_time = new_type6 in 11035/31566
replace new_time = new_type7 in 31567/38439
replace new_time = new_type8 in 38440/53995
replace new_time = new_type9 in 53996/56804
replace new_time = new_type10 in 56805/62544

//检查是否完整生成时间变量
gen date = date(new_time,"YMD")
gen v1 = "100" if date ==.         // 有缺失值
list lastupdate if v1 == "100"     //找到缺失值
replace new_time = "2020-03-08" in 10847
replace new_time = "2020-03-08" in 10848
replace new_time = "2020-03-08" in 34744
replace new_time = "2020-03-08" in 34824
replace new_time = "2020-03-08" in 38182
replace new_time = "2020-03-08" in 38262
replace new_time = "2020-03-29" in 56547
replace new_time = "2020-03-18" in 56565
replace new_time = "2020-03-18" in 56566
replace new_time = "2020-03-30" in 56576
replace new_time = "2020-03-29" in 56577
replace new_time = "2020-03-16" in 56588
replace new_time = "2020-02-23" in 56596
replace new_time = "2020-02-23" in 56615
gen date1 = date(new_time,"YMD")
format date1 %tdCY-N-D

* 保存变量名
keep provincestate countryregion confirmed deaths recovered date1

*修改变量
rename provincestate province
rename countryregion country
rename date1 date

* 数据处理过程保留
save date_complete_covid_0409.dta,replace

******************************************
*******************画图*******************
******************************************
//继续处理数据
use date_complete_covid_0409.dta,clear

* 将缺失值替换为0
replace confirmed = 0 if confirmed ==.
replace deaths = 0 if deaths ==.
replace recovered = 0 if recovered == .

//********画中国的图*******
* 保留中国的数据
keep if inlist(country,"Mainland China","China","Hong Kong","Taiwan","Taiwan*","Macau")

* 完善省份名称
replace province = "Taiwan" if country == "Taiwan*"
rename provincestate province

* 生成面板的中国数据
encode province,gen(province_id)
drop country
xtset province_id date

//处理重复值
duplicates list province_id date
duplicates tag province_id date,gen(isdup)
edit if isdup
drop isdup
duplicates drop province_id date,force

xtset province_id date //生成面板数据

save China_covid_0409.dta,replace //保存中国疫情数据

*************************************画第一个图*******************************
use China_covid_0409.dta,clear
gen current_confirmed = confirmed - deaths - recovered   //计算现存确诊人数计算图
replace current_confirmed = 0 if current_confirmed < 0
keep province date current_confirmed
drop if missing(current_confirmed)
replace province = subinstr(province, " ", "_", .)
spread province current_confirmed

*画一个湖北省的疫情确诊图
tw conn Hubei date, xti("日期", size(*0.8)) ///
    ytitle("现存确诊人数", size(*0.8)) ///
    title("湖北省现存确诊人数变化", size(*1.3)) ///
    subtitle("") caption("数据来源:约翰·霍普金斯大学", size(*0.8))
graph save hubei.png, replace

**湖北省周边四省现存确诊人数
tw conn Henan Anhui Jiangxi Hunan date, xtitle("日期", size(*0.8)) ///
    ytitle("现存确诊人数", size(*0.8)) ///
    title("湖北周边四省现存确诊人数变化", size(*1.1)) ///
    subtitle("河南、安徽、江西和湖南四省") caption("数据来源:约翰·霍普金斯大学", size(*0.8)) xla(21937(16)22013) ///
    xsc(range(21937 22010) extend) ///
    leg(order(1 "河南" 2 "安徽" 3 "江西" 4 "湖南") pos(1) ring(0)) ///
    lc("100 200 165" "252 141 98" "141 160 203" "231 138 195") ///
    mc("102 194 165" "252 141 98" "141 160 203" "231 138 195")
graph save Fourprovince.png, replace

****************************************************************************
* 地图二
* 死亡和治愈病例数量
* 当前确诊人数与每日新增确诊人数
use China_covid_0409.dta,clear
collapse (sum) confirmed (sum) deaths (sum) recovered, by(date)
tsset date
gen new_confirmed = confirmed - l.confirmed
tw ///
lpolyci new_confirmed date, bw(3) || ///
sc new_confirmed date, ms(circle) ||, ///
    leg(off) xlab(, format(%tdCY-N-D)) xla(21937(16)22013) ///
    xsc(range(21937 22013) extend) ///
    ti("每日新增确诊数", size(*1.2)) ///
    yti("人数") name(a, replace) nodraw

gen current_confirmed = confirmed - deaths - recovered
replace current_confirmed = 0 if current_confirmed < 0
tw ///
lpolyci current_confirmed date, bw(3) || ///
sc current_confirmed date, ms(circle) ||, ///
    leg(off) xlab(, format(%tdCY-N-D)) xla(21937(16)22013) ///
    xsc(range(21937 22013) extend) ///
    ti("现存确诊数", size(*1.2)) ///
    yti("人数") name(b, replace) nodraw

gr combine a b, caption("数据来源:约翰霍普金斯大学", size(*0.8)) xsize(20) ysize(12) ///
    graphr(margin(medlarge))
gr export ab.png, replace

* 死亡和治愈病例数量
use China_covid_0409.dta,clear
collapse (sum) confirmed (sum) deaths (sum) recovered, by(date)
tsset date
tw ///
lpolyci recovered date, bw(3) || ///
sc recovered date, ms(circle) ||, ///
    leg(off) xlab(, format(%tdCY-N-D)) xla(21937(16)22013) ///
    xsc(range(21937 22013) extend) ///
    ti("累计治愈人数", size(*1.2)) ///
    yti("人数") name(a, replace) nodraw

tw ///
lpolyci deaths date, bw(3) || ///
sc deaths date, ms(circle) ||, ///
    leg(off) xlab(, format(%tdCY-N-D)) xla(21937(16)22007) ///
    xsc(range(21937 22000) extend) ///
    ti("累计死亡病例数", size(*1.2)) ///
    yti("人数") name(b, replace) nodraw

* 每日新增治愈
gen new_recovered = recovered - l.recovered
tw ///
lpolyci new_recovered date, bw(3) || ///
sc new_recovered date, ms(circle) ||, ///
    leg(off) xlab(, format(%tdCY-N-D)) xla(21937(16)22007) ///
    xsc(range(21937 22007) extend) ///
    ti("每日新增治愈人数", size(*1.2)) ///
    yti("人数") name(c, replace) nodraw
gr combine a b, caption("数据来源:约翰霍普金斯大学", size(*0.8)) xsize(20) ysize(12) ///
    graphr(margin(medlarge))
* 每日新增死亡
gen new_deaths = deaths - l.deaths
tw ///
lpolyci new_deaths date, bw(3) || ///
sc new_deaths date, ms(circle) ||, ///
    leg(off) xlab(, format(%tdCY-N-D)) xla(21937(16)22007) ///
    xsc(range(21937 22007) extend) ///
    ti("每日新增", size(*1.2)) ///
    yti("人数") name(d, replace) nodraw

gr combine a b c d, r(2) caption("数据来源:约翰霍普金斯大学", size(*0.8)) xsize(20) ysize(12) ///
    graphr(margin(medlarge))
graph save fourtype.png, replace

* 地图三  每天增长总数 //时间轴比较拥挤如何处理
use China_covid_0409.dta, clear
collapse (sum) confirmed (sum) deaths (sum) recovered, by(date)
* 计算累积量
gen base = 0
gen deaths_recovered = deaths + recovered
* 绘图
tw ///
rarea recovered base date, fc("180 209 132") ///
    c("180 209 132") fintensity(inten80) lw(vvthin) || ///
rarea deaths_recovered recovered date, fc("122 118 123") ///
    c("122 118 123") fintensity(inten80) lw(vvthin) || ///
rarea confirmed deaths_recovered date, fc("95 144 161") ///
    c("95 144 161") fintensity(inten80) lw(vvthin) ||, ///
    leg(order(3 "现存确诊" 2 "死亡" 1 "治愈") pos(11) ring(0)) ///
    title("新冠肺炎疫情在中国的发展状况", size(*1.5) ///
        justification(left) bexpand) ///
    subtitle("中国新冠肺炎的确诊人数、治愈人数和死亡人数的发展趋势。", ///
        justification(left) bexpand) ///
    caption("数据来源:约翰霍普金斯大学", ///
        size(*0.8)ring(0)) ///
    xti("") xla(21937(8)22000) ///
    xsc(range(21937 22000) extend) ///
    graphr(margin(medlarge)) ///
    text(20000 21984 "治愈", color(black) size(*1.2)) ///
    text(63000 21983 "死亡", color(black) size(*1.2)) ///
    text(70000 21978 "现存确诊", color(black) size(*1.2)) || ///
pci 80000 21957 0 21957, lp(dash) text(56000 21952 "我国改变了" "确诊的标准", color(black) size(*0.8) justification(right) bexpand)
graph save dynamic.png, replace

* 地图三  疫情治愈,死亡,新增图
use China_covid_0409.dta, clear
collapse (sum) confirmed (sum) deaths (sum) recovered, by(date)
* 计算累积量
gen base = 0
gen deaths_recovered = deaths + recovered
* 绘图
tw ///
rarea recovered base date, fc("180 209 132") ///
    c("180 209 132") fintensity(inten80) lw(vvthin) || ///
rarea deaths_recovered recovered date, fc("122 118 123") ///
    c("122 118 123") fintensity(inten80) lw(vvthin) || ///
rarea confirmed deaths_recovered date, fc("95 144 161") ///
    c("95 144 161") fintensity(inten80) lw(vvthin) ||, ///
    leg(order(3 "现存确诊" 2 "死亡" 1 "治愈") pos(11) ring(0)) ///
    title("新冠肺炎疫情在中国的发展状况", size(*1.5) ///
        justification(left) bexpand) ///
    subtitle("中国新冠肺炎的确诊人数、治愈人数和死亡人数的发展趋势。", ///
        justification(left) bexpand) ///
    caption("数据来源:约翰霍普金斯大学", ///
        size(*0.8)ring(0)) ///
    xti("") xla(21937(8)22000) ///
    xsc(range(21937 22000) extend) ///
    graphr(margin(medlarge)) ///
    text(20000 21984 "治愈", color(black) size(*1.2)) ///
    text(63000 21983 "死亡", color(black) size(*1.2)) ///
    text(70000 21978 "现存确诊", color(black) size(*1.2)) || ///
pci 80000 21957 0 21957, lp(dash) text(56000 21952 "我国改变了" "确诊的标准", color(black) size(*0.8) justification(right) bexpand)
graph save  dynamic.png, replace

* 绘制世界地图
use date_complete_covid_0409.dta,clear
replace confirmed = 0 if confirmed ==.
replace deaths = 0 if deaths ==.
replace recovered = 0 if recovered ==.
* 保存变量*************************************************************
keep province country date confirmed deaths recovered

* 准备截面数据
keep if date == date("2020-04-08","YMD")
replace country = "China" if countryregion == "Taiwan*"

* 将各国数据汇总
bysort country :egen confirmed_sum = sum(confirmed)
bysort country :egen deaths_sum = sum(deaths)
bysort country :egen recovered_sum = sum(recovered)

gsort -confirmed_sum //由大到小进行排序
drop in 2/2621 //保留美国
drop in 5/14  //保留法国
drop in 7/21 //中国
drop in 9/18 //英国
drop in 13/16 //荷兰
drop in 14/27 //加拿大
drop in 24/30 //澳大利亚
drop in 25/26 //保存丹麦

keep country date confirmed_sum deaths_sum recovered_sum
rename confirmed_sum confirmed
rename deaths_sum deaths
rename recovered_sum recovered
rename countryregion country
save global_covid_0409.dta,replace  //全球数据保存完成

*绘制12个国家的疫情图
* 全球疫情最严重12个国家的现存确诊人数
use global_covid_0409.dta,clear
gen current_confirmed = confirmed - deaths - recovered
gsort - current_confirmed
sencode country,gen(country_id)
drop country
rename country_id country
keep in 1/12
save global_12_covid_0409.dta,replace

twoway bar current_confirmed country, sort horizontal barwidth(0.5) ///
    fcolor(red) ylabel(1(1)12,valuelabel angle(0) labsize(*1.2)) ///
    ytitle("") xtitle("现存确诊人数") ysize(2) xsize(4) title(截止2020年4月8日现存确诊人数) ///
    caption("数据来源:约翰霍普金斯大学")
graph save global_12.png,replace

**绘制世界地形图
*绘制世界疫情图需要重新处理数据
use global_covid_0409.dta,clear
* 修改国家
replace country = "United States of America" if country == "US"
replace country = "The Bahamas" if country == "Bahamas"
replace country = "Republic of the Congo" if country == "Congo (Brazzaville)"
replace country = "Democratic Republic of the Congo" if country == "Congo (Kinshasa)"
replace country = "Ivory Coast" if country == "Cote d'Ivoire"
replace country = "Czech Republic" if country == "Czechia"
replace country = "Guinea Bissau" if country == "Guinea-Bissau"
replace country = "South Korea" if country == "Korea, South"
replace country = "Macedonia" if country == "North Macedonia"
replace country = "Republic of Serbia" if country == "Serbia"
replace country = "United Republic of Tanzania" if country == "Tanzania"
replace country = "East Timor" if country == "Timor-Leste"
rename country name
merge 1:1 name using worlddb

* 绘制地图
replace confirmed = 0 if missing(confirmed)
xtset, clear
grmap confirmed using worldcoord, id(ID) ///
    fcolor("252 255 164" "250 193 39" "245 125 21" "212 72 66" "159 42 99" "101 21 110" "40 11 84") ///
    ocolor("white" ...) ///
    clmethod(custom) clbreaks(0 0.9 9 99 999 9999 99999 999990) ///
    title("新型冠状病毒肺炎疫情分布", size(*1.2) color(white)) ///
    graphr(margin(medium)) ///
    subtitle("截止 2020 年 4 月 8 日", color(white)) ///
    osize(vthin ...) ///
    legend(size(*1.5) ///
        order(2 "无" 3 "1~9 人" 4 "10~99 人" 5 "100~999 人" 6 "1000~9999人" 7 "10000~100000 人" 8 ">= 100000 人") ///
        ti(确诊, size(*0.5) pos(11) color(white)) color(white)) ///
    plotr(fcolor(24 24 24) lcolor(24 24 24)) ///
    graphr(fcolor(24 24 24) lcolor(24 24 24)) ///
    xsize(20) ysize(8.510638298) ///
    caption("数据来源:约翰·霍普金斯大学", size(*0.8) color(white))
graph save global_confirmed.png,replace

参考资料

  • 实时疫情与 Stata 地图绘制 TidyFriday
  • 使用 Stata 分析全球的新冠疫情数据 TidyFriday
  • 用 Stata 绘制地图 Stata 之禅

相关课程

连享会-直播课 上线了!
http://lianxh.duanshu.com

免费公开课:


课程一览

支持回看,所有课程可以随时购买观看。

专题 嘉宾 直播/回看视频
最新专题 DSGE, 因果推断, 空间计量等
Stata数据清洗 游万海 直播, 2 小时,已上线
研究设计 连玉君 我的特斯拉-实证研究设计-幻灯片-
面板模型 连玉君 动态面板模型-幻灯片-
面板模型 连玉君 直击面板数据模型 [免费公开课,2小时]

Note: 部分课程的资料,PPT 等可以前往 连享会-直播课 主页查看,下载。


关于我们

  • Stata连享会 由中山大学连玉君老师团队创办,定期分享实证分析经验。直播间 有很多视频课程,可以随时观看。
  • 连享会-主页知乎专栏,300+ 推文,实证分析不再抓狂。
  • 公众号推文分类: 计量专题 | 分类推文 | 资源工具。推文分成 内生性 | 空间计量 | 时序面板 | 结果输出 | 交乘调节 五类,主流方法介绍一目了然:DID, RDD, IV, GMM, FE, Probit 等。
  • 公众号关键词搜索/回复 功能已经上线。大家可以在公众号左下角点击键盘图标,输入简要关键词,以便快速呈现历史推文,获取工具软件和数据下载。常见关键词:课程, 直播, 视频, 客服, 模型设定, 研究设计, stata, plus, 绘图, 编程, 面板, 论文重现, 可视化, RDD, DID, PSM, 合成控制法

连享会主页  lianxh.cn
连享会主页 lianxh.cn

连享会小程序:扫一扫,看推文,看视频……

扫码加入连享会微信群,提问交流更方便

✏ 连享会学习群-常见问题解答汇总:
https://gitee.com/arlionn/WD