COVID19 data scrapping from Nigerian Centre for Disease Control (NCDC)

Last updated on Sep 11, 2022 R

Introduction

On the advent of COVID-19 globally and since February 29, 2020 in Nigeria, motivations to provide visuals of the trends and to provide guide to Nigerians on how to conduct themselves responsibly was boosted by the publication of the first coronavirus package and the awesome animation of the province level by Krispin and Byrnes 2020. That immediately sent me working on how to animate the cases for Nigeria. However, this desire was met with short comings because the data publish at the Johns Hopkins University Center for Systems Science and Engineering (JHU-CCSE) by the Nigerian Centre for Disease Control is agrregated nationally, whereas the animations that motivated me were regional for Australia. In addition, the daily updates only provide the regional details of the confirmed cases. The casualties, recoveries and active cases are provided in a single Table which aggregates all the previous cases as well as the current data, rather than provide the daily as they occur daily. So daily casualties, recoveries, deaths abd active cases were not reported separately.

Faced wiith this challenges, immediately sustainable solutions had to be found, which is scrapping data from the Table published in PDF. At the end of the day, four methods of scrapping the data were explored until I had to settle for one.

Scrapping data from PDF Table

The first method was to scrap the data from the PDF Table. The codes for that is provided below.

library(pdftools)
library(tidyverse)
text <- 
  pdf_text("data/Nigeria_300820_36.pdf") %>%
  readr::read_lines()
# text    displaying the text, save the relevant lines where the data appears
Data <- text[64:101] # the lines containing the data
write_delim(as.data.frame(Data), "data/300820.csv", delim = " ") # saved to CSV

KK <- read_csv("data/300820.csv", 
                            skip = 1, skip_empty_rows = TRUE,
                            trim_ws = TRUE, col_names = FALSE) # loaded to memory for further use
head(KK)

## # A tibble: 6 × 1
##   X1                                                                            
##   <chr>                                                                         
## 1 Lagos                 18,119              15        15,228              0    …
## 2 FCT                    5,156               7         1,531              7    …
## 3 Oyo                    3,118              11         1,952              0    …
## 4 Edo                    2,578               1         2,300             0     …
## 5 Plateau                2,498              55         1,374            46     …
## 6 Rivers                 2,141               7         1,969             9     …

However, it was not efficient as the scrapped Table could not be separated into columns for further analysis.

Scrapping data from the website

The second is to scrap the data from the website as shown in the codes.

library(rvest)

url <- "https://covid19.ncdc.gov.ng/report/"

COVID19 <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="custom1"]') %>%
  html_table()

COVID19 <- COVID19[[1]] # the Table is item 1 in the scrapped data

xlsx::write.xlsx2(COVID19, file = "data/daily160920.xlsx", 
                  col.names = TRUE, row.names = FALSE) # the scrapped data is saved to Excel file

## # A tibble: 6 × 5
##   `States Affected` `No. of Cases (Lab Confirmed)` No. of Case…¹ No. D…² No. o…³
##   <chr>                                      <dbl>         <dbl>   <dbl>   <dbl>
## 1 Lagos                                      25436          1507   23697     232
## 2 FCT                                         8908          2356    6464      88
## 3 Plateau                                     4099           148    3917      34
## 4 Kaduna                                      4098           603    3448      47
## 5 Oyo                                         3773           355    3372      46
## 6 Rivers                                      3217           228    2929      60
## # … with abbreviated variable names ¹`No. of Cases (on admission)`,
## #   ²`No. Discharged`, ³`No. of Deaths`

The outcome of this method is better than that of the PDF but the thousand values were comma-separated, so, to use the data further, the commas had to be removed by formatting manually.

Scrapping data from Excel

The third method is to use Excel to scrap the data. For this, open MS-Excel, Select Data from the Menu as shown in the graphics below

This proved more helpful that the first two and is easier to use as it does not require any coding as shown in the pictures.

Scrapping data by copy-paste to from website to Excel

The fourth method is to scrap the data directly by selecting and copying the data from the website.

This is the easiest and most efficient as no codes are needed and the copied data is saved directly in Excel for further formatting.

After scrapping the data with any of the methods mentioned above, then, they are formatted for preparing visuals. To study the trend of spread and how recoveries and/or deaths occur from time to time, the daily records were obtained from the aggregated data.

COVID-19 NCDC RStats

Job Nmadu

Professor of Agricultural Economics and Dean, School of Agriculture and Agricultural Technology

Research interests are economic efficiencies of small scale farming and welfare effects of agricultural interventions.