2016-12-12

Slides available at http://bit.ly/phd-defense

This work is released under Creative Commons

Bio

  • Teaching Assistant, Iowa State University (2011-2015)
  • M.S. in Statistics, Iowa State University (2013)
  • Internships with AT&T (2013) and Google (2014)
  • Software Developer, plotly (2015 - Present)
  • Research Assistant, Monash University (Sept 2015 - June 2016)

Future plans

Freelance software developer & data scientist

  • My clients already include plotly and NOAA
  • In addition to developing plotly, I plan on delivering "data products" (e.g. interactive web apps, dynamic reports/documents, etc.)
  • Businesses already building products around plotly: Omni Analytics, TCBD Analytics

Dissertation chapters

  • Taming PITCHf/x Data with XML2R and pitchRx
  • LDAvis: A method for visualizing and interpreting topics
  • Extending ggplot2’s grammar of graphics implementation for linked and dynamic graphics on the web
  • Interactive data visualization on the web using R
  • plotly for R
    • Multiple linked views
      • Linking views with shiny
      • Linking views without shiny

Why link views without shiny?

Linking as a database query
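
One way to read this idea, as a minimal sketch in base R: a selection in one view is reduced to a set of key values, and the linked view is updated with the rows that satisfy the query "key is in the selected set". The data frames `view_a` and `view_b` and their columns are hypothetical, made up for illustration.

```r
# Two hypothetical "views" of the same data, sharing a key column.
view_a <- data.frame(key = c("a", "b", "c"), x = 1:3,
                     stringsAsFactors = FALSE)
view_b <- data.frame(key = c("a", "a", "b", "c", "c"),
                     y = c(10, 11, 20, 30, 31),
                     stringsAsFactors = FALSE)

# A selection in view A (here: points with x >= 2) is represented
# by the keys of the selected rows.
selected_keys <- view_a$key[view_a$x >= 2]

# The linked view is updated with the rows that satisfy the query
# "key IN selected_keys".
linked_rows <- view_b[view_b$key %in% selected_keys, ]
linked_rows
```

This is the same filter that a SQL `WHERE key IN (...)` clause would express.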

1-to-1 linking

1-to-n (i.e., group) linking

library(plotly)
library(crosstalk)

d <- SharedData$new(txhousing, ~city)
p <- ggplot(d, aes(date, median, group = city)) + geom_line()
ggplotly(p, tooltip = "city")

Data-pipeline

library(dplyr)

txhousing %>%
  group_by(city) %>%
  summarise(has = sum(is.na(median))) %>%
  filter(has > 0) %>%
  arrange(has)
#> # A tibble: 22 × 2
#>                 city   has
#>                <chr> <int>
#> 1  Killeen-Fort Hood     1
#> 2            Lubbock     1
#> 3        Brownsville     2
#> 4            McAllen     2
#> 5        Port Arthur     2
#> 6         San Angelo     2
#> 7           Victoria     2
#> 8     Corpus Christi     3
#> 9        Nacogdoches    11
#> 10     Temple-Belton    11
#> # ... with 12 more rows

Data-plot-pipeline

library(plotly)

plot_ly(txhousing, color = I("black")) %>%
  group_by(city) %>%
  summarise(has = sum(is.na(median))) %>%
  filter(has > 0) %>%
  arrange(has) %>%
  add_markers(x = ~has, y = ~factor(city, levels = city))

SharedData-plot-pipeline

library(crosstalk)
sd <- SharedData$new(txhousing, ~city)

base <- plot_ly(sd, color = I("black")) %>%
  group_by(city)

p1 <- base %>%
  summarise(has = sum(is.na(median))) %>%
  filter(has > 0) %>%
  arrange(has) %>%
  add_markers(x = ~has, y = ~factor(city, levels = city))

p2 <- base %>%
  add_lines(x = ~date, y = ~median, alpha = 0.3)

subplot(p1, p2, widths = c(0.3, 0.7)) %>% 
  highlight(persistent = TRUE, dynamic = TRUE)

m-to-n linking

Displaying aggregated selections

d <- SharedData$new(mpg)
dots <- plot_ly(d, color = ~class, x = ~displ, y = ~cyl)
boxs <- plot_ly(d, color = ~class, x = ~class, y = ~cty) %>% add_boxplot()
bars <- plot_ly(d, x = ~class, color = ~class)

subplot(dots, boxs) %>%
  subplot(bars, nrows = 2) %>%
  layout(barmode = "overlay", dragmode = "select")

  • plotly.js "natively" supports a few statistical graphics (e.g., bar charts, histograms, and boxplots)

  • Dynamically updating other statistical graphics (e.g., densities, fitted lines, violins, etc.) currently requires linking views with shiny

Tree linking via subset matching

  • Tree-like structures associate multiple values with a single graphical element

  • In this case, we need to match sets rather than elements.

Attaching sets via list-columns

d <- data.frame(x = 1:4, y = 1:4)
d$key <- lapply(1:4, function(x) letters[seq_len(x)])
d
#>   x y        key
#> 1 1 1          a
#> 2 2 2       a, b
#> 3 3 3    a, b, c
#> 4 4 4 a, b, c, d
plot_ly(d, x = ~x, y = ~y, key = ~key) %>% highlight(color = "red")

Devil in the details

  • Do we "inform the world" as \(\{\{a, b\}, \{a, b, c\}\}\)? Or \(\{a, b, c\}\)?
  • For now, we always emit the union, but emitting a set of sets could be useful for linking networks (for example).
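
The two options can be compared in base R. This is only an illustrative sketch: `selected` is a hypothetical selection holding the two key sets from the first bullet, and the code is not taken from plotly's internals.

```r
# Hypothetical selection: two graphical elements whose keys are sets.
selected <- list(c("a", "b"), c("a", "b", "c"))

# Option 1: emit a set of sets, preserving each element's own key set.
set_of_sets <- unique(selected)

# Option 2 (the behavior described above): emit the union of all sets.
emitted <- Reduce(union, selected)

length(set_of_sets)
emitted
```

The union is simpler for other widgets to consume, but it loses the information about which element contributed which values.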

Keeping it simple encourages system integration

library(plotly)
library(parcoords)
library(crosstalk)

hc <- hclust(dist(USArrests), "ave")
dend1 <- as.dendrogram(hc)

htmltools::tagList(
  plot_dendro(dend1),
  USArrests %>% SharedData$new() %>% parcoords()
)

Basic matching algorithm

  • Worst-case scenario \(\mathcal{O}(mn)\)
  • If the key has only \(u < n\) unique values, this improves to \(\mathcal{O}(mu)\).
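
A minimal base R sketch of both variants, assuming that \(m\) is the number of selected values and \(n\) is the number of key values in the view being updated (the slide does not define them). The function names `match_naive` and `match_unique` are hypothetical, and this is not plotly's actual implementation.

```r
# Naive matching: compare every key against every selected value, O(m * n).
match_naive <- function(selected, keys) {
  hit <- logical(length(keys))
  for (i in seq_along(keys)) {
    for (s in selected) {
      if (identical(keys[[i]], s)) {
        hit[i] <- TRUE
        break
      }
    }
  }
  hit
}

# Matching on unique keys: do the m comparisons only for the u unique
# values, O(m * u), then map the result back onto all n keys.
# (The mapping step is an extra pass over the keys.)
match_unique <- function(selected, keys) {
  u <- unique(keys)
  hit_u <- vapply(u, function(k) any(selected == k), logical(1),
                  USE.NAMES = FALSE)
  hit_u[match(keys, u)]
}

keys <- c("a", "b", "a", "c", "b")
selected <- c("a", "c")
match_naive(selected, keys)
match_unique(selected, keys)
```

Both functions return the same logical vector, with one entry per key marking whether it is part of the selection.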

Subset matching algorithm

  • Worst-case scenario \(\mathcal{O}(m \sum_{i=1}^n a_i)\)
  • Again, if the key has only \(u < n\) unique values, this improves to \(\mathcal{O}(m \sum_{i=1}^{\textbf{u}} a_i)\).
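
A minimal base R sketch of subset matching, under two assumptions that the slide does not state: \(a_i\) is the size of the \(i\)-th key set, and a key set matches when all of its values appear in the selection. The function name `match_subset` is hypothetical.

```r
# A key set matches when every one of its values is in the selection.
# Checking key set i touches its a_i values, each compared against the
# m selected values, which is where the m * sum(a_i) bound comes from.
# (Base R's %in% uses hashing, so it is typically faster than that bound.)
match_subset <- function(selected, key_sets) {
  vapply(key_sets, function(k) all(k %in% selected), logical(1))
}

# The same nested keys as in the list-column example:
# {a}, {a, b}, {a, b, c}, {a, b, c, d}
key_sets <- lapply(1:4, function(x) letters[seq_len(x)])

match_subset(c("a", "b"), key_sets)
```

With the selection {a, b}, only the first two key sets are fully contained in it, so only those two elements would be highlighted.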

Improving performance via "simple keys"

m <- SharedData$new(mpg)
p <- ggplot(m, aes(displ, hwy, colour = class)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm")
ggplotly(p) %>% highlight("plotly_hover")

The geom_smooth() data

d <- ggplotly(p, originalData = FALSE, layerData = 2) %>% 
  plotly_data()
d
#> # A tibble: 560 × 12
#>     colour        x        y PANEL group colour_plotlyDomain   fill  size
#> *    <chr>    <dbl>    <dbl> <int> <int>               <chr>  <chr> <dbl>
#> 1  #F8766D 5.700000 24.93816     1     1             2seater grey60     1
#> 2  #F8766D 5.716456 24.93322     1     1             2seater grey60     1
#> 3  #F8766D 5.732911 24.92828     1     1             2seater grey60     1
#> 4  #F8766D 5.749367 24.92333     1     1             2seater grey60     1
#> 5  #F8766D 5.765823 24.91839     1     1             2seater grey60     1
#> 6  #F8766D 5.782278 24.91345     1     1             2seater grey60     1
#> 7  #F8766D 5.798734 24.90851     1     1             2seater grey60     1
#> 8  #F8766D 5.815190 24.90356     1     1             2seater grey60     1
#> 9  #F8766D 5.831646 24.89862     1     1             2seater grey60     1
#> 10 #F8766D 5.848101 24.89368     1     1             2seater grey60     1
#> # ... with 550 more rows, and 4 more variables: linetype <dbl>,
#> #   weight <dbl>, alpha <dbl>, key <list>

The geom_smooth() key

  • Many of the key values are redundant
length(d$key)
#> [1] 560
length(unique(d$key))
#> [1] 7
  • And one unique key per color
length(unique(setNames(d$key, d$colour)))
#> [1] 7
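
A rough base R sketch of how such redundancy could be detected and removed. It assumes that a "simple key" means every row of a group shares the same key set, which is an interpretation of the slide rather than a definition from it. The toy data frame `toy` stands in for the `geom_smooth()` layer data and is made up for illustration.

```r
# Hypothetical stand-in for the layer data: 2 groups, 3 rows each,
# where every row of a group carries the same set-valued key.
toy <- data.frame(colour = rep(c("red", "blue"), each = 3),
                  stringsAsFactors = FALSE)
toy$key <- rep(list(c("k1", "k2"), c("k3", "k4", "k5")), each = 3)

# Collect the key sets of each group and count the unique ones.
rows_by_group <- split(seq_along(toy$key), toy$colour)
keys_by_group <- lapply(rows_by_group, function(i) toy$key[i])
n_unique <- vapply(keys_by_group, function(k) length(unique(k)), integer(1))

# The key is "simple" (in the assumed sense) if each group has exactly
# one unique key set.
is_simple <- all(n_unique == 1L)

# In that case, one key set per group carries the same information as
# one key set per row.
simple_keys <- lapply(keys_by_group, function(k) k[[1]])

is_simple
length(toy$key)
length(simple_keys)
```

Here six row-level key sets collapse to two group-level ones, which mirrors the 560 versus 7 counts above on a smaller scale.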

Data sent to plotly.js

p %>% ggplotly() %>% plotly_json()

What do "simple keys" buy us?

Future work

  • Keep adding documentation and examples in the plotly for R book.
  • Further advance plotly's support for linking views without shiny
    • Add more support for displaying statistical summaries of selections.
    • Integrate plotly's linking features with more projects (see leaflet (Cheng et al. 2016) and parcoords (Russell et al. 2016))
    • Lower-hanging fruit is listed here
  • Support for more popular ggplot2 extension packages such as ggrepel and ggraph.
    • Integrating plotly's support for linking tree/network structures with ggraph/geomnet would be particularly interesting.

Thank you, questions?