LDAvis: A Method for Visualizing and Interpreting Topics

What is a topic model?

In a topic model, each topic defines its own probability mass function over the same set of words (i.e., the vocabulary).
Problem: Topics are not easily interpretable, and the vocabulary is often very large. Where should we put our focus?
Typically, one produces a ranked list of words deemed important for understanding a given topic; but how should we measure importance?
Measure 1: \(p(w_i|z_j)\) – probability of word \(w_i\) given topic \(z_j\).
Drawback: common words tend to appear near the top of such lists for multiple topics, making it hard to differentiate topics.
Measure 2: \(\text{lift} = \frac{p(w_i|z_j)}{p(w_i)}\) where \(p(w_i)\) is the overall probability of word \(w_i\).
Drawback: Rare words tend to be ranked too highly.
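To make the two drawbacks concrete, here is a minimal sketch on a toy example. The vocabulary, the two topics, their word probabilities, and the equal topic weights are all invented for illustration; they do not come from a fitted model.

```python
# Toy example: two hypothetical topics over a five-word vocabulary.
vocab = ["the", "game", "team", "election", "offside"]

# p(w | z): one probability mass function per topic (each row sums to 1).
phi = {
    "sports":   [0.50, 0.25, 0.20, 0.04, 0.01],
    "politics": [0.50, 0.05, 0.05, 0.40, 0.00],
}

# Assumed topic proportions p(z); equal weights keep the arithmetic simple.
topic_weight = {"sports": 0.5, "politics": 0.5}

# Overall word probability: p(w) = sum over topics of p(w | z) * p(z).
p_w = [sum(phi[z][i] * topic_weight[z] for z in phi) for i in range(len(vocab))]


def rank_by_probability(topic):
    """Measure 1: order words by p(w | z), highest first."""
    order = sorted(range(len(vocab)), key=lambda i: phi[topic][i], reverse=True)
    return [vocab[i] for i in order]


def rank_by_lift(topic):
    """Measure 2: order words by lift = p(w | z) / p(w), highest first."""
    lift = [phi[topic][i] / p_w[i] for i in range(len(vocab))]
    order = sorted(range(len(vocab)), key=lambda i: lift[i], reverse=True)
    return [vocab[i] for i in order]


for topic in phi:
    print(topic, "by p(w|z):", rank_by_probability(topic))
    print(topic, "by lift:  ", rank_by_lift(topic))
```

In this toy data the common word "the" tops the probability ranking for both topics, while the rare word "offside" tops the lift ranking for the sports topic despite a probability of only 0.01 there.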
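We believe that a compromise between these two measures can aid topic interpretation: \[ \text{relevance} = \lambda \, p(w_i|z_j) + (1 - \lambda) \, \text{lift} \] where \(0 \le \lambda \le 1\). Setting \(\lambda = 1\) ranks words by probability alone, and \(\lambda = 0\) ranks them by lift alone.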

We anticipate that the 'optimal' value of \(\lambda\) will vary across datasets.
For this reason, it is nice to have an interactive tool that quickly iterates through word rankings (based on different values of \(\lambda\)).
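The kind of iteration described here can be sketched non-interactively in a few lines. This is only an illustration of the relevance formula given earlier, not the LDAvis implementation. The function names `relevance` and `rank_words` are hypothetical, and the word probabilities are invented toy values.

```python
def relevance(p_w_given_z, p_w, lam):
    """Relevance = lam * p(w | z) + (1 - lam) * lift, with lift = p(w | z) / p(w)."""
    lift = p_w_given_z / p_w
    return lam * p_w_given_z + (1 - lam) * lift


def rank_words(vocab, topic_probs, overall_probs, lam):
    """Return the vocabulary sorted by relevance for one topic, highest first."""
    scores = {
        word: relevance(pz, pw, lam)
        for word, pz, pw in zip(vocab, topic_probs, overall_probs)
    }
    return sorted(vocab, key=lambda word: scores[word], reverse=True)


# Invented toy values for a single topic.
vocab = ["the", "game", "team", "election", "offside"]
topic_probs = [0.50, 0.25, 0.20, 0.04, 0.01]       # p(w | z) for this topic
overall_probs = [0.50, 0.15, 0.125, 0.22, 0.005]   # p(w) over the whole corpus

# Step through several values of lambda and watch the ranking change.
for lam in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(f"lambda = {lam:.2f}:", rank_words(vocab, topic_probs, overall_probs, lam))
```

At the two extremes the ranking reduces to lift alone (lambda = 0) and to probability alone (lambda = 1). Intermediate values trade one off against the other, which is what an interactive slider lets a reader explore quickly.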
The R package LDAvis makes it easy to create interactive visualizations that aid topic interpretation.