Introduction to XML2R package

The XML2R package is meant to simplify scraping XML data from XML files hosted on the web. This example page demonstrates the ideas behind XML2R by walking through pseudo-source code of the scrape function in the pitchRx package. Similar ideas have been implemented in the bbscrapeR package which provides a high-level API to statistics from nba.com and wnba.com.

Extracting observations

The main function in XML2R, XML2Obs, coerces XML content into a list of observations. An observation is technically a matrix with one row and possibly many columns. One observation includes all XML attributes and the XML value for a particular XML lineage. The name of each list element (or each observation) tracks the XML hierarchy so observations can be grouped together in a sensible fashion at a later point. In the code below, we obtain 946 observations from two different XML files that contain data on Major League Baseball games played on June 14th, 2013.

library(XML2R)
pre <- "http://gd2.mlb.com/components/game/mlb/year_2013/month_06/day_14/"
post <- c("gid_2013_06_14_phimlb_colmlb_1/inning/inning_all.xml",
          "gid_2013_06_14_seamlb_oakmlb_1/inning/inning_all.xml")
urls <- paste0(pre, post)
obs <- XML2Obs(urls, as.equiv=TRUE, quiet=TRUE)
table(names(obs))
## 
##                                game                        game//inning        game//inning//bottom//action 
##                                   2                                  18                                  18 
##         game//inning//bottom//atbat  game//inning//bottom//atbat//pitch     game//inning//bottom//atbat//po 
##                                  76                                 255                                   6 
## game//inning//bottom//atbat//runner           game//inning//top//action            game//inning//top//atbat 
##                                  66                                  15                                  82 
##     game//inning//top//atbat//pitch        game//inning//top//atbat//po    game//inning//top//atbat//runner 
##                                 321                                  10                                  77

This output tells us that 255 pitches were thrown in the bottom inning and 321 were thrown in the top inning during these two games. Eventually, we will want to combine observations named 'game//inning//bottom//atbat//pitch' and 'game//inning//top//atbat//pitch' into the same table since they share XML attributes (in other words, the observations share variables). XML2R has another function collapse_obs that can be used to aggregate observations into the same table based on their names.

Renaming observations

Before aggregating observations into a collection of tables, we sometimes need to re_name observations. As mentioned previously, we have some observations with different names, but they should be put into the same table. By passing these names (that should be considered to be on the same level) to the equiv argument, re_name will automatically determine the difference in the naming scheme and suppress that difference.

tmp <- re_name(obs, equiv=c("game//inning//top//atbat//pitch",
                           "game//inning//bottom//atbat//pitch"), diff.name="side")
tmp <- re_name(tmp, equiv=c("game//inning//top//atbat//runner",
                           "game//inning//bottom//atbat//runner"), diff.name="side")
tmp <- re_name(tmp, equiv=c("game//inning//top//atbat//po",
                           "game//inning//bottom//atbat//po"), diff.name="side")
tmp <- re_name(tmp, equiv=c("game//inning//top//atbat",
                           "game//inning//bottom//atbat"), diff.name="side")
obs2 <- re_name(tmp, equiv=c("game//inning//top//action",
                           "game//inning//bottom//action"), diff.name="side")
table(names(obs2))
## 
##                        game                game//inning        game//inning//action         game//inning//atbat 
##                           2                          18                          33                         158 
##  game//inning//atbat//pitch     game//inning//atbat//po game//inning//atbat//runner 
##                         576                          16                         143

Now we see that 576 pitches were thrown in total. The information we removed is not lost; however, as a new column is appended to the end of each relevant observation. For example, notice how the side column contains the part of the name we just removed:

obs2[grep("game//inning//atbat//po", names(obs2))][1:2]
## $`game//inning//atbat//po`
##      des                 
## [1,] "Pickoff Attempt 1B"
##      url                                                                                                                    
## [1,] "http://gd2.mlb.com/components/game/mlb/year_2013/month_06/day_14/gid_2013_06_14_phimlb_colmlb_1/inning/inning_all.xml"
##      side    
## [1,] "bottom"
## 
## $`game//inning//atbat//po`
##      des                 
## [1,] "Pickoff Attempt 1B"
##      url                                                                                                                    
## [1,] "http://gd2.mlb.com/components/game/mlb/year_2013/month_06/day_14/gid_2013_06_14_phimlb_colmlb_1/inning/inning_all.xml"
##      side 
## [1,] "top"

Linking observations

After all that renaming, we now have 7 different levels of observations. Let's examine observations on the game\\inning level:

obs2[grep("^game//inning$", names(obs2))][1:3]
## $`game//inning`
##      num away_team home_team next
## [1,] "1" "phi"     "col"     "Y" 
##      url                                                                                                                    
## [1,] "http://gd2.mlb.com/components/game/mlb/year_2013/month_06/day_14/gid_2013_06_14_phimlb_colmlb_1/inning/inning_all.xml"
## 
## $`game//inning`
##      num away_team home_team next
## [1,] "2" "phi"     "col"     "Y" 
##      url                                                                                                                    
## [1,] "http://gd2.mlb.com/components/game/mlb/year_2013/month_06/day_14/gid_2013_06_14_phimlb_colmlb_1/inning/inning_all.xml"
## 
## $`game//inning`
##      num away_team home_team next
## [1,] "3" "phi"     "col"     "Y" 
##      url                                                                                                                    
## [1,] "http://gd2.mlb.com/components/game/mlb/year_2013/month_06/day_14/gid_2013_06_14_phimlb_colmlb_1/inning/inning_all.xml"

Before we aggregate observations into tables, we should preserve the parent-to-child relationships in the XML lineage. For example, we want to be able to identify in what inning a particular pitch was thrown.

obswkey <- add_key(obs2, parent = "game//inning", recycle = "num", key.name = "inning")

As it turns out, the away_team and home_team columns are redundant as this information is embeded in the url column. Thus, there is only one other informative attribute on this level which is next. By recycling this value to all of its descendants as well, we remove any need to retain a game//inning table.

obswkey <- add_key(obswkey, parent = "game//inning", recycle = "next")

It is also imperative that we can identify which atbat a particular pitch, runner, and po belongs to. This can be done as follows:

obswkey <- add_key(obswkey, parent = "game//inning//atbat", recycle = "num")

Collapsing observations

Finally, we are in a position to pool together observations that have a common name. The collapse_obs function achieves this by row binding observations together and returning a list of matrices. Note that collapse_obs does not require that observations from the same level to have the same set of variables in order to be binded into a common table. In the case where variables are missing, NAs will be used as the value.

tables <- collapse_obs(obswkey)
# As mentioned before, we don't need the 'inning' table
tables <- tables[-grep("^game//inning$", names(tables))]
table.names <- c("game", "action", "atbat", "pitch", "po", "runner")
tables <- setNames(tables, table.names)
head(tables[["runner"]])
##      id       start end  event          score rbi earned
## [1,] "471865" ""    ""   "Home Run"     "T"   "T" "T"   
## [2,] "429667" ""    "1B" "Walk"         NA    NA  NA    
## [3,] "429667" "1B"  "2B" "Hit By Pitch" NA    NA  NA    
## [4,] "435623" ""    "1B" "Hit By Pitch" NA    NA  NA    
## [5,] "435623" "1B"  ""   "Flyout"       NA    NA  NA    
## [6,] "429667" "2B"  ""   "Flyout"       NA    NA  NA    
##      url                                                                                                                    
## [1,] "http://gd2.mlb.com/components/game/mlb/year_2013/month_06/day_14/gid_2013_06_14_phimlb_colmlb_1/inning/inning_all.xml"
## [2,] "http://gd2.mlb.com/components/game/mlb/year_2013/month_06/day_14/gid_2013_06_14_phimlb_colmlb_1/inning/inning_all.xml"
## [3,] "http://gd2.mlb.com/components/game/mlb/year_2013/month_06/day_14/gid_2013_06_14_phimlb_colmlb_1/inning/inning_all.xml"
## [4,] "http://gd2.mlb.com/components/game/mlb/year_2013/month_06/day_14/gid_2013_06_14_phimlb_colmlb_1/inning/inning_all.xml"
## [5,] "http://gd2.mlb.com/components/game/mlb/year_2013/month_06/day_14/gid_2013_06_14_phimlb_colmlb_1/inning/inning_all.xml"
## [6,] "http://gd2.mlb.com/components/game/mlb/year_2013/month_06/day_14/gid_2013_06_14_phimlb_colmlb_1/inning/inning_all.xml"
##      side     inning next num 
## [1,] "bottom" "1"    "Y"  "6" 
## [2,] "top"    "2"    "Y"  "8" 
## [3,] "top"    "2"    "Y"  "9" 
## [4,] "top"    "2"    "Y"  "9" 
## [5,] "top"    "2"    "Y"  "12"
## [6,] "top"    "2"    "Y"  "12"

At this point, you probably want to save these tables. I strongly recommend writing to a database if you plan on repeatedly querying your data. There are multiple packages that allow one to write tables to a database in R including: RSQLite, RMySQL, rmongodb, and DBI. Here is a simple way to create a SQLite database using the dplyr package and copy our tables to that database:

library(dplyr)
db <- src_sqlite("my_db.sqlite3")
for (i in names(tables)) copy_to(db, df = tables[[i]], name = i)