Linking views on the web without shiny

2016-12-12

Slides available at http://bit.ly/phd-defense

This work is released under Creative Commons

Bio

Teaching Assistant, Iowa State University (2011-2015)
M.S. in Statistics, Iowa State University (2013)
Internships with AT&T (2013) and Google (2014)
Software Developer, plotly (2015 - Present)
Research Assistant, Monash University (Sept 2015 - June 2016)

Future plans

Freelance software developer & data scientist

My clients already include: plotly, NOAA
In addition to developing plotly, I plan on delivering "data products" (e.g. interactive web apps, dynamic reports/documents, etc.)
Businesses already building products around plotly: Omni Analytics, TCBD Analytics

Dissertation chapters

Taming PITCHf/x Data with XML2R and pitchRx
LDAvis: A method for visualizing and interpreting topics
Extending ggplot2’s grammar of graphics implementation for linked and dynamic graphics on the web
Interactive data visualization on the web using R
plotly for R
- Multiple linked views
  - Linking views with shiny
  - Linking views without shiny

Why link views without shiny?

"Multiple linked views are the optimal framework for posing queries about data" (Cook, Buja, & Swayne 1996)

Web applications that need a server, such as shiny apps, are generally more complicated, less responsive, and more difficult to share than standalone web pages.

Linking as a database query

First described in detail by Buja, McDonald, Michalak, and Stuetzle 1991.
Compared to other linking frameworks by Cook, Buja, & Swayne 1996.
Implemented (on Windows) as snap-together visualizations by North and Shneiderman 1999.
crosstalk (Cheng et al. 2016) proposes similar standards for linking web graphics from R.
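
The query view of linking can be sketched in base R, without plotly or crosstalk. This is a minimal illustration only: the data frames `summary_view` and `series_view` and the helper `query_view()` are hypothetical names invented here. A selection in one view is reduced to a set of key values, and each view returns the rows whose key is in that set. Whether the link is 1-to-1 or 1-to-n depends only on whether the key is unique per row.

```r
# Two hypothetical "views" of the same entities, each carrying a key column.
summary_view <- data.frame(
  key = c("A", "B", "C"),
  n_missing = c(0, 2, 5)
)
series_view <- data.frame(
  key = rep(c("A", "B", "C"), each = 3),
  t = rep(1:3, times = 3),
  value = c(10, 11, 12, 20, 21, 22, 30, 31, 32)
)

# Hypothetical helper: answer the query "which rows match the selected keys?"
query_view <- function(view, selected_keys) {
  view[view$key %in% selected_keys, , drop = FALSE]
}

# Selecting the mark for "B" in the summary view emits its key value...
selected <- summary_view$key[summary_view$n_missing == 2]

# ...and each view answers the same query against its own data.
one_to_one <- query_view(summary_view, selected) # key is unique per row
one_to_n   <- query_view(series_view, selected)  # key is shared by a group
```

Here `one_to_one` holds a single row and `one_to_n` holds every row of the "B" series, even though both come from the same query.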

1-to-1 linking

1-to-n (i.e., group) linking

library(plotly)
library(crosstalk)

d <- SharedData$new(txhousing, ~city)
p <- ggplot(d, aes(date, median, group = city)) + geom_line()
ggplotly(p, tooltip = "city")

Data-pipeline

library(dplyr)

txhousing %>%
  group_by(city) %>%
  summarise(has = sum(is.na(median))) %>%
  filter(has > 0) %>%
  arrange(has)
#> # A tibble: 22 × 2
#>                 city   has
#>                <chr> <int>
#> 1  Killeen-Fort Hood     1
#> 2            Lubbock     1
#> 3        Brownsville     2
#> 4            McAllen     2
#> 5        Port Arthur     2
#> 6         San Angelo     2
#> 7           Victoria     2
#> 8     Corpus Christi     3
#> 9        Nacogdoches    11
#> 10     Temple-Belton    11
#> # ... with 12 more rows

Data-plot-pipeline

library(plotly)

plot_ly(txhousing, color = I("black")) %>%
  group_by(city) %>%
  summarise(has = sum(is.na(median))) %>%
  filter(has > 0) %>%
  arrange(has) %>%
  add_markers(x = ~has, y = ~factor(city, levels = city))

SharedData-plot-pipeline

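library(crosstalk)
# Share txhousing between views, keyed by city
sd <- SharedData$new(txhousing, ~city)

base <- plot_ly(sd, color = I("black")) %>%
  group_by(city)

# Dot plot: number of missing median prices per city
p1 <- base %>%
  summarise(has = sum(is.na(median))) %>%
  filter(has > 0) %>%
  arrange(has) %>%
  add_markers(x = ~has, y = ~factor(city, levels = city))

# Time series: one line of median price per city
p2 <- base %>%
  add_lines(x = ~date, y = ~median, alpha = 0.3)

# Arrange the views side by side and set the highlighting options
subplot(p1, p2, widths = c(0.3, 0.7)) %>%
  highlight(persistent = TRUE, dynamic = TRUE)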

m-to-n linking

Displaying aggregated selections

d <- SharedData$new(mpg)
dots <- plot_ly(d, color = ~class, x = ~displ, y = ~cyl)
boxs <- plot_ly(d, color = ~class, x = ~class, y = ~cty) %>% add_boxplot()
bars <- plot_ly(d, x = ~class, color = ~class)

subplot(dots, boxs) %>%
  subplot(bars, nrows = 2) %>%
  layout(barmode = "overlay", dragmode = "select")

plotly.js "natively" supports a few statistical graphics (e.g., bar charts, histograms, and boxplots).
Dynamically updating other statistical graphics (e.g., densities, fitted lines, violins, etc.) currently requires linking views with shiny.
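
To make "aggregated selections" concrete, here is a minimal base R sketch of what overlaying a selection on a bar chart amounts to: the same count is computed once for all rows and once for the selected rows. The data frame `cars_df`, its values, and the brush limits are made up for illustration, and the sketch says nothing about how plotly.js implements this internally.

```r
# Hypothetical data standing in for a linked scatterplot and bar chart.
cars_df <- data.frame(
  class = factor(c("compact", "compact", "suv", "suv", "suv", "pickup")),
  displ = c(1.8, 2.0, 4.0, 5.3, 4.7, 4.7)
)

# A brush on the scatterplot selects rows by a range of displ.
brushed <- cars_df$displ >= 4.5 & cars_df$displ <= 5.5

# The bar chart's statistic (a count per class) is computed twice:
# once for all rows and once for the brushed rows only.
counts_all      <- table(cars_df$class)
counts_selected <- table(cars_df$class[brushed])

# In an overlaid bar chart, the second set of bars is drawn over the first.
overlay <- rbind(all = counts_all, selected = counts_selected)
```

Because `class` is a factor, classes with no selected rows still appear in `counts_selected` with a count of zero, so the two sets of bars line up.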

Tree linking via subset matching

Tree-like structures associate multiple values with a single graphical element

In this case, we need to match sets rather than elements.

Attaching sets via list-columns

d <- data.frame(x = 1:4, y = 1:4)
d$key <- lapply(1:4, function(x) letters[seq_len(x)])
d
#>   x y        key
#> 1 1 1          a
#> 2 2 2       a, b
#> 3 3 3    a, b, c
#> 4 4 4 a, b, c, d

plot_ly(d, x = ~x, y = ~y, key = ~key) %>% highlight(color = "red")

Devil in the details

Do we "inform the world" as \(\{\{a, b\}, \{a, b, c\}\}\)? Or \(\{a, b, c\}\)?
For now, we always emit the union, but emitting a set of sets could be useful for linking networks (for example).
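
The two options can be written out in base R. This is only an illustration of the difference between the two representations; `selected_sets` is a hypothetical selection of two graphical elements, not output from plotly.

```r
# Key sets attached to two selected graphical elements (hypothetical selection).
selected_sets <- list(c("a", "b"), c("a", "b", "c"))

# Option 1: emit the set of sets, preserving which element held which keys.
set_of_sets <- selected_sets

# Option 2: emit the union, a flat set of key values.
emitted_union <- Reduce(union, selected_sets)
```

The union is simpler for other widgets to consume, but it no longer records that two distinct elements were selected.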

Keeping it simple encourages system integration

library(plotly)
library(parcoords)
library(crosstalk)

hc <- hclust(dist(USArrests), "ave")
dend1 <- as.dendrogram(hc)

htmltools::tagList(
  plot_dendro(dend1),
  USArrests %>% SharedData$new() %>% parcoords()
)

Basic matching algorithm

Worst-case scenario: \(\mathcal{O}(mn)\).
If the key has, say, \(u < n\) unique values, this can improve to \(\mathcal{O}(mu)\).
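
The two bounds can be illustrated with a short base R sketch. It assumes, since the slides do not define the symbols, that m is the number of selected key values and n is the number of keys in the view being queried. `match_naive()` and `match_unique()` are hypothetical helpers written for this illustration, not plotly or crosstalk functions.

```r
# Naive matching: compare each of the m selected values with each of the n keys.
match_naive <- function(selected, keys) {
  hit <- logical(length(keys))
  for (s in selected) {          # m iterations
    for (i in seq_along(keys)) { # n iterations each
      if (keys[i] == s) hit[i] <- TRUE
    }
  }
  hit
}

# With u < n unique key values, run the comparisons against the unique
# values only, then map the result back to every element by index lookup.
match_unique <- function(selected, keys) {
  u <- unique(keys)
  hit_u <- match_naive(selected, u) # m * u comparisons
  hit_u[match(keys, u)]
}

keys <- rep(c("Austin", "Dallas", "Houston"), each = 4) # n = 12, u = 3
selected <- c("Dallas", "Waco")                         # m = 2

hits_naive  <- match_naive(selected, keys)
hits_unique <- match_unique(selected, keys)
```

Both functions flag the same four "Dallas" elements. The second needs 6 comparisons instead of 24, plus one lookup pass to map the result back to all 12 elements.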

Subset matching algorithm

Worst-case scenario: \(\mathcal{O}(m \sum_{i=1}^n a_i)\).
Again, if the key has, say, \(u < n\) unique values, this can improve to \(\mathcal{O}(m \sum_{i=1}^{u} a_i)\).
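
A base R sketch shows where the sum comes from. It rests on two assumptions that the slides do not state outright: \(a_i\) is the number of values in the key set of element \(i\), and an element matches when every value in its key set is among the m selected values. `is_subset()` and `match_subsets()` are hypothetical helpers, not plotly functions.

```r
# Assumption: an element matches when every value in its key set is selected.
is_subset <- function(key_set, selected) {
  for (k in key_set) {    # a_i iterations
    found <- FALSE
    for (s in selected) { # up to m iterations each
      if (k == s) {
        found <- TRUE
        break
      }
    }
    if (!found) return(FALSE)
  }
  TRUE
}

# One subset test per element: m * (a_1 + ... + a_n) comparisons at worst.
match_subsets <- function(selected, key_sets) {
  vapply(key_sets, is_subset, logical(1), selected = selected)
}

# The same list-column of nested key sets used earlier in the slides.
key_sets <- lapply(1:4, function(x) letters[seq_len(x)])
hits <- match_subsets(c("a", "b", "c"), key_sets)
```

With the selection {a, b, c}, the first three elements match and the fourth does not, because its key set also contains d.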

Improving performance via "simple keys"

m <- SharedData$new(mpg)
p <- ggplot(m, aes(displ, hwy, colour = class)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm")
ggplotly(p) %>% highlight("plotly_hover")

The geom_smooth() data

d <- ggplotly(p, originalData = FALSE, layerData = 2) %>%
  plotly_data()
d
#> # A tibble: 560 × 12
#>     colour        x        y PANEL group colour_plotlyDomain   fill  size
#> *    <chr>    <dbl>    <dbl> <int> <int>               <chr>  <chr> <dbl>
#> 1  #F8766D 5.700000 24.93816     1     1             2seater grey60     1
#> 2  #F8766D 5.716456 24.93322     1     1             2seater grey60     1
#> 3  #F8766D 5.732911 24.92828     1     1             2seater grey60     1
#> 4  #F8766D 5.749367 24.92333     1     1             2seater grey60     1
#> 5  #F8766D 5.765823 24.91839     1     1             2seater grey60     1
#> 6  #F8766D 5.782278 24.91345     1     1             2seater grey60     1
#> 7  #F8766D 5.798734 24.90851     1     1             2seater grey60     1
#> 8  #F8766D 5.815190 24.90356     1     1             2seater grey60     1
#> 9  #F8766D 5.831646 24.89862     1     1             2seater grey60     1
#> 10 #F8766D 5.848101 24.89368     1     1             2seater grey60     1
#> # ... with 550 more rows, and 4 more variables: linetype <dbl>,
#> #   weight <dbl>, alpha <dbl>, key <list>

The geom_smooth() key

Many of the key values are redundant

length(d$key)
#> [1] 560
length(unique(d$key))
#> [1] 7

And one unique key per color

length(unique(setNames(d$key, d$colour)))
#> [1] 7

Data sent to plotly.js

p %>% ggplotly() %>% plotly_json()

What do "simple keys" buy us?

Simple keys reduce the querying time for this example from minutes to seconds.
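
The idea behind a "simple key" can be sketched in base R. The data below are a hypothetical stand-in for the fitted-line layer (7 groups of 80 points, matching the 560 rows and 7 unique keys shown earlier), and `collapse_keys()` is an illustrative helper, not plotly's internal code. When every point in a group carries the same key set, the set is stored once per group.

```r
# Hypothetical stand-in for the fitted-line layer: 7 groups, 80 points each,
# where every point carries the full set of raw-data row ids of its group.
groups <- paste0("class", 1:7)
raw_ids <- split(as.character(1:70), rep(groups, each = 10))
smooth <- data.frame(group = rep(groups, each = 80))
smooth$key <- unname(raw_ids[smooth$group])

# A group has a "simple key" when all of its points share one key set.
collapse_keys <- function(group, key) {
  by_group <- split(key, group)
  lapply(by_group, function(sets) {
    u <- unique(sets)
    if (length(u) == 1) u[[1]] else sets # keep the full list otherwise
  })
}

simple <- collapse_keys(smooth$group, smooth$key)
```

Matching then needs one subset test per group rather than one per point: 7 tests instead of 560 in this illustration.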

Future work

Keep adding documentation and examples in the plotly for R book.
Further advance plotly's support for linking views without shiny:
- Add more support for displaying statistical summaries of selections.
- Integrate plotly's linking features with more projects (see leaflet (Cheng et al. 2016) and parcoords (Russell et al. 2016)).
- Lower-hanging fruit is listed here.
Support more popular ggplot2 extension packages, such as ggrepel and ggraph.
- Integrating plotly's support for linking tree/network structures with ggraph/geomnet would be particularly interesting.

Thank you, questions?