The R package pitchRx provides tools for collecting Major League Baseball (MLB) Gameday data and visualizing PITCHf/x. This page provides a rough overview of its scope; the RJournal article is more comprehensive. The source file used to generate this page shows how to embed pitchRx animations into documents using knitr. If coding isn’t your thing, you might want to just play with my PITCHf/x visualization app!

Data Collection

Collecting ‘smallish’ data

pitchRx makes it simple to acquire PITCHf/x directly from its source. Here, pitchRx’s scrape() function is used to collect all PITCHf/x data recorded on June 1st, 2013.

library(pitchRx)
dat <- scrape(start = "2013-06-01", end = "2013-06-01")
names(dat)
## [1] "atbat"  "action" "pitch"  "po"     "runner"
dim(dat[["pitch"]])
## [1] 4682   49

By default, scrape() returns a list of five data frames. The 'pitch' data frame contains the actual PITCHf/x data, which is recorded on a pitch-by-pitch basis. The dimensions of this data frame indicate that 4682 pitches were thrown on June 1st, 2013. If your analysis requires PITCHf/x data over many months, you surely don’t want to pull all that data into a single R session! For this reason (among others), scrape() can write directly to a database (see the “Managing PITCHf/x data in bulk” section).
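Because the data frames describe different levels of the same games, it is often useful to combine them. The sketch below uses small made-up stand-ins for the 'pitch' and 'atbat' data frames to show one way to attach at-bat information to each pitch. The key columns (gameday_link and num) and the other column names are assumptions for illustration; check names(dat[["pitch"]]) and names(dat[["atbat"]]) on your own data before relying on them.

```r
# Toy stand-ins for dat[["pitch"]] and dat[["atbat"]].
# All values and column names here are illustrative assumptions.
pitch <- data.frame(
  gameday_link = c("game_A", "game_A", "game_A", "game_B"),
  num          = c(1, 1, 2, 1),            # at-bat number within a game
  pitch_type   = c("FF", "SL", "FF", "CU"),
  stringsAsFactors = FALSE
)
atbat <- data.frame(
  gameday_link = c("game_A", "game_A", "game_B"),
  num          = c(1, 2, 1),
  pitcher_name = c("Pitcher X", "Pitcher X", "Pitcher Y"),
  stand        = c("R", "L", "R"),
  stringsAsFactors = FALSE
)

# One row per pitch, with the matching at-bat columns attached.
pitch_ab <- merge(pitch, atbat, by = c("gameday_link", "num"))

# Count pitch types by pitcher.
print(table(pitch_ab$pitcher_name, pitch_ab$pitch_type))
```

With real data, the same merge() call (or an equivalent dplyr join) should work once the key columns are confirmed.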

Collecting data by Gameday IDs

In the previous example, scrape() determines the relevant game IDs from the start and end dates. If you want a more complicated query based on specific games, you can pass the relevant game IDs to the game.ids argument, using the built-in gids data object to find them.

data(gids, package = "pitchRx")
head(gids)
## [1] "gid_2008_02_26_fanbbc_phimlb_1" "gid_2008_02_26_flsbbc_detmlb_1"
## [3] "gid_2008_02_26_umibbc_flomlb_1" "gid_2008_02_26_umwbbc_nynmlb_1"
## [5] "gid_2008_02_27_cinmlb_phimlb_1" "gid_2008_02_27_colmlb_chamlb_1"

As you can see, the gids object contains game IDs, and those IDs contain the relevant date as well as abbreviations for the away and home team names. Since the away team is always listed first, we could do the following to collect PITCHf/x data from every away game played by the Minnesota Twins in June of 2013.

MNaway13 <- gids[grep("2013_06_[0-9]{2}_minmlb_", gids)]
dat2 <- scrape(game.ids = MNaway13)
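If regular expressions on raw IDs get unwieldy, the IDs can be split into their components first. The sketch below does this with base R on a small vector; the first two IDs come from the head(gids) output above, while the last two are made up for illustration. It assumes every ID follows the gid_YYYY_MM_DD_away_home_N layout.

```r
ids <- c(
  "gid_2008_02_26_fanbbc_phimlb_1",
  "gid_2008_02_27_cinmlb_phimlb_1",
  "gid_2013_06_04_minmlb_kcamlb_1",  # made-up ID for illustration
  "gid_2013_06_10_phimlb_minmlb_1"   # made-up ID for illustration
)

# Each ID splits into 7 pieces: "gid", year, month, day, away, home, game number.
parts <- do.call(rbind, strsplit(ids, "_"))
games <- data.frame(
  gid  = ids,
  date = as.Date(paste(parts[, 2], parts[, 3], parts[, 4], sep = "-")),
  away = parts[, 5],
  home = parts[, 6],
  stringsAsFactors = FALSE
)

# Away games for one team in one month, selected by column rather than by regex.
min_away <- games$gid[games$away == "minmlb" &
                        format(games$date, "%Y-%m") == "2013-06"]

# The equivalent anchored regular expression, for comparison.
min_away_regex <- ids[grepl("^gid_2013_06_[0-9]{2}_minmlb_", ids)]
print(min_away)
```

The resulting character vector can be passed to scrape() through game.ids in the same way as above.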

Managing PITCHf/x data in bulk

Creating and maintaining a PITCHf/x database is a breeze with pitchRx and dplyr. With a few lines of code (and some patience), all available PITCHf/x data can be obtained directly from its source and stored in a local SQLite database:

library(dplyr)
db <- src_sqlite("pitchfx.sqlite3", create = TRUE)
scrape(start = "2008-01-01", end = Sys.Date(), connect = db$con)
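Once the data lives in SQLite, summaries can be computed inside the database so that only the result is pulled into R. The sketch below illustrates the idea with the DBI and RSQLite packages on an in-memory database holding a tiny made-up table. The table name "pitch" and the columns pitch_type and start_speed are assumptions about what the real database contains; the values are invented.

```r
library(DBI)

# In-memory database standing in for pitchfx.sqlite3.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Tiny made-up table; the real 'pitch' table has many more columns and rows.
toy_pitch <- data.frame(
  pitch_type  = c("FF", "FF", "SL", "CU", "FF"),
  start_speed = c(93.1, 94.0, 85.2, 78.4, 92.5)
)
dbWriteTable(con, "pitch", toy_pitch)

# The aggregation runs in SQLite; R only receives one row per pitch type.
res <- dbGetQuery(con, "
  SELECT pitch_type, COUNT(*) AS n, AVG(start_speed) AS avg_speed
  FROM pitch
  GROUP BY pitch_type
  ORDER BY n DESC
")
print(res)

dbDisconnect(con)
```

Against the real database, the same query could be sent through db$con, or written with dplyr verbs on tbl(db, "pitch").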

The website that hosts PITCHf/x data also provides a wealth of other data that might come in handy for PITCHf/x analysis. Files containing PITCHf/x data always end with inning/inning_all.xml. scrape() can also collect data from three other file types: miniscoreboard.xml, players.xml, and inning/inning_hit.xml. Data from these files can easily be added to our existing PITCHf/x database:

files <- c("miniscoreboard.xml", "players.xml", "inning/inning_hit.xml")
scrape(start = "2008-01-01", end = Sys.Date(), suffix = files, connect = db$con)

Building your own custom scraper

pitchRx is built on top of the R package XML2R. In this post, I demonstrate how to use XML2R and pitchRx to collect attendance data from the Gameday site (similar methods can be used to collect other Gameday data). For a more detailed look at XML2R, see the introductory webpage and/or the RJournal paper.

PITCHf/x Visualization

2D animation

pitchRx comes pre-packaged with a pitches data frame containing four-seam and cut fastballs thrown by Mariano Rivera and Phil Hughes during the 2011 season. These pitches are used to demonstrate PITCHf/x animations using animateFX(). As the animation progresses, the pitches come closer to the viewer (that is, imagine you are the umpire/catcher, watching the pitcher throw directly at you). In the animation below, the horizontal and vertical locations of the pitches are plotted every tenth of a second until they reach home plate. Since an animation played in real time is too fast to follow comfortably, this one slows playback so that each frame is shown for half a second.

# adding ggplot2 functions to customize animateFX() output won't work, but
# you can pass a list to the layer argument like this:
x <- list(
  facet_grid(pitcher_name ~ stand, labeller = label_both), 
  theme_bw(), 
  coord_equal()
)
animateFX(pitches, layer = x)
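To make the frame-by-frame idea concrete, here is a minimal base R sketch of how snapshot locations can be computed for a single pitch. It assumes the usual constant-acceleration model behind the PITCHf/x parameters (x0, y0, z0, vx0, vy0, vz0, ax, ay, az) and treats y = 1.417 feet as the front of home plate. The parameter values are made up, and this is not necessarily how animateFX() computes its frames internally.

```r
# One made-up pitch; values are illustrative, not real data (feet, seconds).
p <- list(x0 = -1.5, y0 = 50, z0 = 6,
          vx0 = 5, vy0 = -130, vz0 = -5,
          ax = -10, ay = 28, az = -18)

# Position along one axis under constant acceleration.
pos <- function(s0, v0, a, t) s0 + v0 * t + 0.5 * a * t^2

# Time at which the pitch reaches the front of home plate: the smaller root of
# 0.5 * ay * t^2 + vy0 * t + (y0 - y_plate) = 0.
y_plate <- 1.417  # assumed convention for the front of the plate
t_plate <- (-p$vy0 - sqrt(p$vy0^2 - 2 * p$ay * (p$y0 - y_plate))) / p$ay

# Snapshots every tenth of a second until the pitch reaches the plate.
t <- seq(0, t_plate, by = 0.1)
traj <- data.frame(
  t = t,
  x = pos(p$x0, p$vx0, p$ax, t),  # horizontal location
  y = pos(p$y0, p$vy0, p$ay, t),  # distance from the back of the plate
  z = pos(p$z0, p$vz0, p$az, t)   # height
)
print(traj)
```

Plotting x against z for each row of traj, one row per frame, gives the umpire's-eye view described above.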

To avoid a cluttered animation, the avg.by option averages the trajectory for each unique value of the variable supplied to avg.by.

animateFX(pitches, avg.by = "pitch_types", layer = x)
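As a rough illustration of what averaging by a grouping variable means, the base R sketch below averages a few trajectory parameters within each pitch type. The data are made up, only three of the nine parameters are shown, and the assumption that avg.by works by averaging parameters in this way is mine; the internals of animateFX() may differ.

```r
# Made-up parameters for four pitches of two types.
toy <- data.frame(
  pitch_type = c("FF", "FF", "FC", "FC"),
  x0  = c(-1.0, -2.0, -1.5, -2.5),
  vx0 = c(4, 6, 2, 4),
  ax  = c(-10, -12, 2, 4)
)

# One row per pitch type, each parameter replaced by its group mean.
avg <- aggregate(cbind(x0, vx0, ax) ~ pitch_type, data = toy, FUN = mean)
print(avg)
```

Each averaged row then describes a single representative trajectory, which is why the resulting animation has one path per group rather than one per pitch.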

Note that when using animateFX(), the user may want to wrap the expression with animation::saveHTML() to view the result in a web browser. If you want to include the animation in a document, knitr’s fig.show = "animate" chunk option is very useful.

Interactive animations

See here for a post on creating interactive animations of PITCHf/x data using the animint package.

Interactive 3D plots

pitchRx also makes use of rgl graphics. If I want a more revealing look at Mariano Rivera’s pitches, I can subset the pitches data frame accordingly. Note that the plot below is interactive, so make sure you have JavaScript & WebGL enabled (if you do, go ahead: click and drag)!

Rivera <- subset(pitches, pitcher_name == "Mariano Rivera")
interactiveFX(Rivera)

Visualizing pitch locations

2D densities

The strikeFX() function can be used to quickly visualize pitch location densities (from the perspective of the umpire). Here is the density of called strikes thrown by Rivera and Hughes in 2011 (for both right- and left-handed batters).

strikes <- subset(pitches, des == "Called Strike")
strikeFX(strikes, geom = "tile") + 
  facet_grid(pitcher_name ~ stand) +
  coord_equal() +
  theme_bw() +
  viridis::scale_fill_viridis()

Probabilistic strike-zone densities

Models that estimate event probabilities conditional on pitch location provide a better inferential tool than density estimation. Here we use the mgcv package to fit a Generalized Additive Model (GAM) that estimates the probability of a called strike as a function of pitch location and batter stance.

noswing <- subset(pitches, des %in% c("Ball", "Called Strike"))
noswing$strike <- as.numeric(noswing$des %in% "Called Strike")
library(mgcv)
m <- bam(strike ~ s(px, pz, by = factor(stand)) +
          factor(stand), data = noswing, 
          family = binomial(link = 'logit'))
x <- list(
  facet_grid(. ~ stand),
  theme_bw(),
  coord_equal(),
  viridis::scale_fill_viridis(name = "Probability of Called Strike")
)
strikeFX(noswing, model = m, layer = x)
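A fitted model of this kind can also be queried directly for the estimated probability at specific locations. The sketch below shows the idea on simulated data, so that it runs without the pitches data frame: the "true" strike zone, the sample size, and the two query locations are all assumptions made for illustration, and gam() is used instead of bam() only because the simulated data set is small.

```r
library(mgcv)

set.seed(1)
n  <- 2000
px <- runif(n, -2, 2)      # horizontal location (feet)
pz <- runif(n, 0.5, 4.5)   # vertical location (feet)

# Assumed truth: strike probability decays with distance from (0, 2.5).
p_true <- plogis(4 - 4 * sqrt(px^2 + (pz - 2.5)^2))
sim <- data.frame(px = px, pz = pz, strike = rbinom(n, 1, p_true))

# Smooth surface over location, logit link, as in the model above
# (without the batter stance term).
fit <- gam(strike ~ s(px, pz), data = sim, family = binomial(link = "logit"))

# Estimated called-strike probability in the middle of the zone
# and far up and away.
newd  <- data.frame(px = c(0, 1.8), pz = c(2.5, 4.3))
p_hat <- predict(fit, newdata = newd, type = "response")
print(round(p_hat, 3))
```

The same predict() call, with stand added to newdata, should apply to the model m fitted above.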

Here are some other places where GAMs were used to understand factors that influence umpire decision making.

Session Info

devtools::session_info()
##  setting  value                       
##  version  R version 3.2.1 (2015-06-18)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       Australia/Melbourne         
##  date     2015-10-08                  
## 
##  package    * version  date       source        
##  Cairo        1.5-9    2015-09-26 CRAN (R 3.2.0)
##  colorspace   1.2-6    2015-03-11 CRAN (R 3.2.0)
##  devtools     1.9.1    2015-09-11 CRAN (R 3.2.0)
##  digest       0.6.8    2014-12-31 CRAN (R 3.2.0)
##  evaluate     0.8      2015-09-18 CRAN (R 3.2.0)
##  formatR      1.2.1    2015-09-18 CRAN (R 3.2.0)
##  ggplot2    * 1.0.1    2015-03-17 CRAN (R 3.2.0)
##  gtable       0.1.2    2012-12-05 CRAN (R 3.2.0)
##  hexbin       1.27.1   2015-08-19 CRAN (R 3.2.0)
##  htmltools    0.2.6    2014-09-08 CRAN (R 3.2.0)
##  httr         1.0.0    2015-06-25 CRAN (R 3.2.0)
##  knitr        1.11     2015-08-14 CRAN (R 3.2.1)
##  labeling     0.3      2014-08-23 CRAN (R 3.2.0)
##  lattice      0.20-33  2015-07-14 CRAN (R 3.2.0)
##  magrittr     1.5      2014-11-22 CRAN (R 3.2.0)
##  MASS         7.3-44   2015-08-30 CRAN (R 3.2.0)
##  Matrix       1.2-2    2015-07-08 CRAN (R 3.2.0)
##  memoise      0.2.1    2014-04-22 CRAN (R 3.2.0)
##  mgcv       * 1.8-7    2015-07-23 CRAN (R 3.2.0)
##  munsell      0.4.2    2013-07-11 CRAN (R 3.2.0)
##  nlme       * 3.1-122  2015-08-19 CRAN (R 3.2.0)
##  pitchRx    * 1.8      2015-10-06 local         
##  plyr         1.8.3    2015-06-12 CRAN (R 3.2.0)
##  proto        0.3-10   2012-12-22 CRAN (R 3.2.0)
##  R6           2.1.1    2015-08-19 CRAN (R 3.2.0)
##  Rcpp         0.12.1   2015-09-10 CRAN (R 3.2.0)
##  reshape2     1.4.1    2014-12-06 CRAN (R 3.2.0)
##  rmarkdown    0.8      2015-08-30 CRAN (R 3.2.1)
##  scales       0.3.0    2015-08-25 CRAN (R 3.2.0)
##  stringi      0.5-5    2015-06-29 CRAN (R 3.2.0)
##  stringr      1.0.0    2015-04-30 CRAN (R 3.2.0)
##  viridis      0.2.5    2015-09-14 CRAN (R 3.2.1)
##  XML          3.98-1.3 2015-06-30 CRAN (R 3.2.0)
##  XML2R        0.0.7    2015-05-14 local         
##  yaml         2.1.13   2014-06-12 CRAN (R 3.2.0)