6 Boxplots

Boxplots encode the five-number summary of a numeric variable and provide a decent way to compare many numeric distributions. The visual task of comparing multiple boxplots is relatively easy (i.e., compare position along a common scale) compared to some common alternatives (e.g., a trellis display of histograms, like Figure 5.1). The boxplot is sometimes inadequate, however, for capturing complex (e.g., multi-modal) distributions; in that case, a frequency polygon, like Figure 2.9, provides a nice alternative. The add_boxplot() function requires one numeric variable and guarantees boxplots are oriented correctly, regardless of whether the numeric variable is placed on the x or y scale. As Figure 6.1 shows, on the axis orthogonal to the numeric axis, you can provide a discrete variable (for conditioning) or supply a single value (to name the axis category).

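library(plotly)

# attributes set in plot_ly() are inherited by the traces added below
p <- plot_ly(diamonds, y = ~price, color = I("black"), 
             alpha = 0.1, boxpoints = "suspectedoutliers")
# a single value on x names the axis category
p1 <- p %>% add_boxplot(x = "Overall")
# a discrete variable on x conditions the boxplots on it
p2 <- p %>% add_boxplot(x = ~cut)
subplot(
  p1, p2, shareY = TRUE,
  widths = c(0.2, 0.8), margin = 0
) %>% hide_legend()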

FIGURE 6.1: Overall diamond price and price by cut.
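
To make the five-number summary concrete, here is a minimal base R sketch on a small made-up vector (the values are hypothetical, not taken from diamonds). It uses fivenum() for the summary and the conventional 1.5 × IQR rule for outliers. The last step applies the rule that the plotly.js documentation gives for boxpoints = "suspectedoutliers" (points below 4*Q1 - 3*Q3 or above 4*Q3 - 3*Q1). Keep in mind that plotly.js computes its own quartiles in the browser, so its values may differ slightly from the hinges that fivenum() returns.

```r
# Hypothetical toy data with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(x)
names(five) <- c("min", "q1", "median", "q3", "max")

# Conventional fences: 1.5 * IQR beyond the hinges
iqr <- five[["q3"]] - five[["q1"]]
lower_fence <- five[["q1"]] - 1.5 * iqr
upper_fence <- five[["q3"]] + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]

# Rule documented by plotly.js for "suspectedoutliers" (an assumption
# about the rendering library, not something computed by R's boxplot tools)
suspected <- x[x < 4 * five[["q1"]] - 3 * five[["q3"]] |
               x > 4 * five[["q3"]] - 3 * five[["q1"]]]

print(five)
print(outliers)
print(suspected)
```

In this toy vector, only the value 100 falls outside the fences, so it is the single point a boxplot would draw individually.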

If you want to partition by more than one discrete variable, you could map the interaction of those variables to the discrete axis and color by the nested variable, as Figure 6.2 does with diamond clarity and cut. Another approach would be to use a trellis display, similar to Figure 13.9.

plot_ly(diamonds, x = ~price, y = ~interaction(clarity, cut)) %>%
  add_boxplot(color = ~clarity) %>%
  layout(yaxis = list(title = ""))

FIGURE 6.2: Diamond prices by cut and clarity.
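
It may help to see what interaction() actually returns, since its level order determines the order of the boxplots on the discrete axis. The sketch below uses a tiny hypothetical data frame (toy, with two clarity levels and two cut levels) rather than the full diamonds data. By default the first factor varies fastest, so interaction(clarity, cut) groups the levels by cut with clarity nested inside, which is why coloring by clarity reads well.

```r
# Hypothetical stand-in for diamonds: 2 clarity levels x 2 cut levels
toy <- data.frame(
  clarity = factor(c("SI2", "VS1", "SI2", "VS1"), levels = c("SI2", "VS1")),
  cut     = factor(c("Fair", "Fair", "Ideal", "Ideal"), levels = c("Fair", "Ideal"))
)

# Default: the first factor (clarity) varies fastest
cc <- interaction(toy$clarity, toy$cut)
levels(cc)
# "SI2.Fair" "VS1.Fair" "SI2.Ideal" "VS1.Ideal"

# lex.order = TRUE: the last factor (cut) varies fastest instead
cc_lex <- interaction(toy$clarity, toy$cut, lex.order = TRUE)
levels(cc_lex)
# "SI2.Fair" "SI2.Ideal" "VS1.Fair" "VS1.Ideal"
```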

It also helps to sort the boxplots by something meaningful, such as the median price. Figure 6.3 presents the same information as Figure 6.2, but sorts the boxplots by their median, which makes it immediately clear that diamonds with a clarity of “SI2” tend to have the highest median price.

library(dplyr)

# one factor level per clarity/cut combination
d <- diamonds %>%
  mutate(cc = interaction(clarity, cut))

# interaction levels sorted by median price
lvls <- d %>%
  group_by(cc) %>%
  summarise(m = median(price)) %>%
  arrange(m) %>%
  pull(cc)

plot_ly(d, x = ~price, y = ~factor(cc, lvls)) %>%
  add_boxplot(color = ~clarity) %>%
  layout(yaxis = list(title = ""))

FIGURE 6.3: Diamond prices by cut and clarity, sorted by price median.
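
The same kind of ordering can also be obtained with base R's reorder(), which sorts factor levels by a summary of another variable. Here is a minimal sketch on hypothetical toy data (the data frame toy, its columns, and its values are made up for illustration). With the diamonds data, the analogous call would be reorder(cc, price, FUN = median), and the result could be mapped to the discrete axis.

```r
# Hypothetical toy data: three groups with medians a = 12, b = 2, c = 21
toy <- data.frame(
  grp   = factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c")),
  price = c(10, 12, 50, 1, 2, 3, 20, 21, 22)
)

# Reorder the factor levels by the median price within each group
toy$grp_sorted <- reorder(toy$grp, toy$price, FUN = median)

levels(toy$grp)         # original, alphabetical order
levels(toy$grp_sorted)  # ordered by increasing median
# "b" "a" "c"
```

Sorting by the median rather than the mean keeps the ordering robust to extreme values, such as the 50 in group a above.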

Similar to add_histogram(), add_boxplot() sends the raw data to the browser, and lets plotly.js compute summary statistics. Unfortunately, plotly.js does not yet allow precomputed statistics for boxplots.19