Follow along – http://cpsievert.github.io/slides/web-scraping

Data is messy (especially on the web)!

  • As statisticians, we typically think of data in a tidy (i.e., tabular) format. That is
    • rows == observations (often sharing an observational unit)
    • columns == various observational attributes
  • This is a convenient statistical modeling framework, but data hardly ever begins in this tidy format (especially on the web).
  • This talk will provide a quick overview of popular methods for acquiring info/data from the web using R.
  • If you get nothing else from this talk, just remember to…

Motivating Example

rvest is a nice R package for web-scraping by (you guessed it) Hadley Wickham.

library(rvest)
# First, grab the page source
html("http://en.wikipedia.org/wiki/Table_(information)") %>%
  # then extract the first node with class of wikitable
  html_node(".wikitable") %>% 
  # then convert the HTML table into a data frame
  html_table()
##   First name   Last name Age
## 1     Bielat    Adamczak  24
## 2  Blaszczyk Kostrzewski  25
## 3 Olatunkboh    Chijiaku  22
## 4   Adrienne    Anthoula  22
## 5     Axelia  Athanasios  22
## 6  Jon-Kabat        Zinn  22
  • Note: html_table only works on 'nicely' formatted HTML tables.

This is a nice format? Really? Yes, really. It's the format used to render tables on webpages.

<table class="wikitable">
  <tr>
    <th>First name</th>
    <th>Last name</th>
    <th>Age</th>
  </tr>
  <tr>
    <td>Bielat</td>
    <td>Adamczak</td>
    <td>24</td>
  </tr>
  <tr>
    <td>Blaszczyk</td>
    <td>Kostrzewski</td>
    <td>25</td>
  </tr>
  <tr>
    <td>Olatunkboh</td>
    <td>Chijiaku</td>
    <td>22</td>
  </tr>
</table> 

What if we want info that isn't packaged neatly in <table>?

(selectorgadget + rvest) to the rescue!

Extracting links to reports

Your turn

Your turn solution

html("http://www.stat.iastate.edu/people/faculty/") %>%
  html_nodes("#content a") %>% html_attr(name="href") -> hrefs
head(hrefs)
## [1] "http://www.stat.iastate.edu/personal/?id=dcadams" 
## [2] "http://www.stat.iastate.edu/personal/?id=emilyb"  
## [3] "http://www.stat.iastate.edu/personal/?id=pcaragea"
## [4] "http://www.stat.iastate.edu/personal/?id=alicia"  
## [5] "http://www.stat.iastate.edu/personal/?id=songchen"
## [6] "http://www.stat.iastate.edu/personal/?id=dicook"

Access the DOM with RSelenium

Useful if you need access to the DOM, not just the page source.

Just to demo, let's browse through faculty pages.

library(RSelenium)
startServer()
remDr <- remoteDriver(browserName="firefox")
remDr$open()
for (i in hrefs) {
  Sys.sleep(2)
  remDr$navigate(i) # at this point, you could remDr$getPageSource()
}
  • We use RSelenium to test whether animint interactive plots behave correctly.
  • I've also used it to scrape websites that have to be rendered in a browser in order to access certain info.

Common data exchange formats

  • So far we've briefly covered how to extract information from HTML pages.
  • HTML is great for sharing content between people, but it isn't great for exchanging data between machines.
  • There are a ton of different ways to exchange data over the web, but by far the most popular ones are XML and JSON.

What is XML?

XML is a markup language that looks very similar to HTML.

<mariokart>
  <driver name="Bowser" occupation="Koopa">
    <vehicle speed="55" weight="25"> Wario Bike </vehicle>
    <vehicle speed="40" weight="67"> Piranha Prowler </vehicle>
  </driver>
  <driver name="Peach" occupation="Princess">
    <vehicle speed="54" weight="29"> Royal Racer </vehicle>
    <vehicle speed="50" weight="34"> Wild Wing </vehicle>
  </driver>
</mariokart>
  • This example shows that XML can (and is) used to store inherently tabular data (thanks Jeroen Ooms for the fun example)
  • What is are the observational units here? How many observations in total?
  • Two units and 6 total observations (4 vehicles and 2 drivers).

XML2R is a framework to simplify acquistion of tabular/relational XML.

# devtools::install_github("cpsievert/XML2R")
library(XML2R)
obs <- XML2Obs("http://bit.ly/mario-xml")
table(names(obs))
## 
##          mariokart//driver mariokart//driver//vehicle 
##                          2                          4

obs # named list of matrices. Each matrix is *guaranteed* to have 1 row.
## $`mariokart//driver//vehicle`
##      speed weight XML_value     
## [1,] "55"  "25"   " Wario Bike "
## 
## $`mariokart//driver//vehicle`
##      speed weight XML_value          
## [1,] "40"  "67"   " Piranha Prowler "
## 
## $`mariokart//driver`
##      name     occupation
## [1,] "Bowser" "Koopa"   
## 
## $`mariokart//driver//vehicle`
##      speed weight XML_value      
## [1,] "54"  "29"   " Royal Racer "
## 
## $`mariokart//driver//vehicle`
##      speed weight XML_value    
## [1,] "50"  "34"   " Wild Wing "
## 
## $`mariokart//driver`
##      name    occupation
## [1,] "Peach" "Princess"

collapse_obs(obs) # groups observations by their name/unit
## $`mariokart//driver`
##      name     occupation
## [1,] "Bowser" "Koopa"   
## [2,] "Peach"  "Princess"
## 
## $`mariokart//driver//vehicle`
##      speed weight XML_value          
## [1,] "55"  "25"   " Wario Bike "     
## [2,] "40"  "67"   " Piranha Prowler "
## [3,] "54"  "29"   " Royal Racer "    
## [4,] "50"  "34"   " Wild Wing "
  • What information have I lost by combining observations of the same unit into the same table?
  • I can't map vehicles to the drivers!

obs <- add_key(obs, parent = "mariokart//driver", recycle = "name")
collapse_obs(obs)
## $`mariokart//driver`
##      name     occupation
## [1,] "Bowser" "Koopa"   
## [2,] "Peach"  "Princess"
## 
## $`mariokart//driver//vehicle`
##      speed weight XML_value           name    
## [1,] "55"  "25"   " Wario Bike "      "Bowser"
## [2,] "40"  "67"   " Piranha Prowler " "Bowser"
## [3,] "54"  "29"   " Royal Racer "     "Peach" 
## [4,] "50"  "34"   " Wild Wing "       "Peach"

Now (if I want) I can merge the tables into a single table…

tabs <- collapse_obs(obs)
merge(tabs[[1]], tabs[[2]], by = "name")
##     name occupation speed weight         XML_value
## 1 Bowser      Koopa    55     25       Wario Bike 
## 2 Bowser      Koopa    40     67  Piranha Prowler 
## 3  Peach   Princess    54     29      Royal Racer 
## 4  Peach   Princess    50     34        Wild Wing

Your turn

Your turn 'solution'

Using add_key is optional here, but it removes the need to hang on to the "game" table.

library(magrittr)
XML2Obs("http://gd2.mlb.com/components/game/mlb/year_2011/month_04/day_04/gid_2011_04_04_minmlb_nyamlb_1/players.xml") %>% 
  add_key(parent = "game", recycle = "venue") %>% 
  add_key(parent = "game", recycle = "date") %>%
  collapse_obs -> tabs
tabs[["game//team//player"]][1:5, c("first", "last", "venue", "date")]
##      first     last      venue            date           
## [1,] "Jim"     "Thome"   "Yankee Stadium" "April 4, 2011"
## [2,] "Carl"    "Pavano"  "Yankee Stadium" "April 4, 2011"
## [3,] "Michael" "Cuddyer" "Yankee Stadium" "April 4, 2011"
## [4,] "Joe"     "Nathan"  "Yankee Stadium" "April 4, 2011"
## [5,] "Joe"     "Mauer"   "Yankee Stadium" "April 4, 2011"

What about JSON?

  • JavaScript Object Notation (JSON) is comprised of two components:
    1. arrays => [value1, value2]
    2. objects => {"key1": value1, "key2": [value2, value3]}
  • NOTE: you can also have arrays of objects!
  • The preferred R package for R <=> JSON conversion has long been RJSONIO
  • However, jsonlite is gaining a lot of momentum/attention.
  • In fact, shiny will soon be moving from RJSONIO to jsonlite

Package downloads from RStudio's CRAN mirror

Back to Mariokart

[
    {
        "driver": "Bowser",
        "occupation": "Koopa",
        "vehicles": [
            {
                "model": "Wario Bike",
                "speed": 55,
                "weight": 25
            },
            {
                "model": "Piranha Prowler",
                "speed": 40,
                "weight": 67
            }
        ]
    },
    {
        "driver": "Peach",
        "occupation": "Princess",
        "vehicles": [
            {
                "model": "Royal Racer",
                "speed": 54,
                "weight": 29
            },
            {
                "model": "Wild Wing",
                "speed": 50,
                "weight": 34
            }
        ]
    }
]

library(jsonlite)
mario <- fromJSON("http://bit.ly/mario-json")
str(mario) # nested data.frames?!? 
## 'data.frame':    2 obs. of  3 variables:
##  $ driver    : chr  "Bowser" "Peach"
##  $ occupation: chr  "Koopa" "Princess"
##  $ vehicles  :List of 2
##   ..$ :'data.frame': 2 obs. of  3 variables:
##   .. ..$ model : chr  "Wario Bike" "Piranha Prowler"
##   .. ..$ speed : int  55 40
##   .. ..$ weight: int  25 67
##   ..$ :'data.frame': 2 obs. of  3 variables:
##   .. ..$ model : chr  "Royal Racer" "Wild Wing"
##   .. ..$ speed : int  54 50
##   .. ..$ weight: int  29 34

mario$driver
## [1] "Bowser" "Peach"
mario$vehicles
## [[1]]
##             model speed weight
## 1      Wario Bike    55     25
## 2 Piranha Prowler    40     67
## 
## [[2]]
##         model speed weight
## 1 Royal Racer    54     29
## 2   Wild Wing    50     34

How do we get two tables (with a common id) like the XML example?

# this mapply statement is essentially equivalent to add_key
vehicles <- mapply(function(x, y) cbind(x, driver = y), 
                   mario$vehicles, mario$driver, SIMPLIFY = FALSE)
rbind.pages(vehicles)
##             model speed weight driver
## 1      Wario Bike    55     25 Bowser
## 2 Piranha Prowler    40     67 Bowser
## 3     Royal Racer    54     29  Peach
## 4       Wild Wing    50     34  Peach
mario[!grepl("vehicle", names(mario))]
##   driver occupation
## 1 Bowser      Koopa
## 2  Peach   Princess

Thanks! Any questions?