algo-nlp-lda


LDA Topic Modeling

Overview

Latent Dirichlet Allocation (LDA) models each document as a mixture of topics and each topic as a distribution over words. It discovers K latent topics from a corpus without supervision, using Gibbs sampling or variational inference. Complexity: O(N × K × iterations), where N is the total number of word tokens.

When to Use

Trigger conditions:

  • Discovering latent themes in a large document collection
  • Organizing/categorizing documents by automatically discovered topics
  • Exploratory text analysis when categories are unknown

When NOT to use:

  • When categories are known (use supervised classification)
  • For short texts (tweets, titles) — too few words per document for reliable topic assignment
  • When you need semantic understanding (use embeddings)

Algorithm

IRON LAW: The Number of Topics K Must Be Chosen, Not Discovered
LDA does NOT tell you how many topics exist. K is a hyperparameter.
Too few topics: overly broad, mixed themes. Too many: fragmented,
redundant topics. Use coherence score (C_v) to compare K values,
but the final choice requires human judgment on topic interpretability.
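The K sweep this implies can be organized mechanically. A minimal sketch, where `coherence_for_k` is a hypothetical stand-in for whatever scoring function you use (e.g. fit a model at each K and return its mean coherence):

```python
def pick_k(candidate_ks, coherence_for_k):
    """Score each candidate K and return (K, score) pairs, best first.

    The top score is a starting point, not the final answer: per the
    iron law, inspect the topics by hand before committing to a K.
    """
    scores = {k: coherence_for_k(k) for k in candidate_ks}
    return sorted(scores.items(), key=lambda kv: -kv[1])
```
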

Phase 1: Input Validation

Preprocess: tokenize, remove stop words, lemmatize. Build the document-term matrix (DTM). Filter: remove terms appearing in fewer than 5 documents or in more than 50% of documents. Gate: clean DTM with a reasonable vocabulary size (1K-50K terms).
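A minimal pure-Python sketch of this gate (the stop-word list and thresholds are illustrative, and a real pipeline would also lemmatize):

```python
from collections import Counter

# Illustrative stop-word list; use a full list (e.g. NLTK's) in practice.
STOP_WORDS = {"the", "is", "and", "of", "a"}

def build_dtm(docs, min_df=2, max_df_ratio=0.5):
    """Tokenize, drop stop words, and filter terms by document frequency."""
    tokenized = [[w for w in doc.lower().split() if w not in STOP_WORDS]
                 for doc in docs]
    # Document frequency: number of docs each term appears in.
    df = Counter(w for doc in tokenized for w in set(doc))
    n_docs = len(docs)
    vocab = sorted(w for w, c in df.items()
                   if c >= min_df and c / n_docs <= max_df_ratio)
    index = {w: i for i, w in enumerate(vocab)}
    dtm = [[0] * len(vocab) for _ in docs]
    for d, doc in enumerate(tokenized):
        for w in doc:
            if w in index:
                dtm[d][index[w]] += 1
    return vocab, dtm
```

With a real corpus, `min_df=5` and `max_df_ratio=0.5` match the thresholds above; the toy default of `min_df=2` just keeps tiny examples non-empty.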

Phase 2: Core Algorithm

  1. Choose K (start with √(N/2), try range K=5,10,15,20,...)
  2. Set hyperparameters: α = 50/K (document-topic density), β = 0.01 (topic-word density)
  3. Run LDA (Gibbs sampling: 1000+ iterations, or variational inference)
  4. Extract: topic-word distributions (top 10-20 words per topic) and document-topic distributions

Phase 3: Verification

Evaluate: topic coherence (C_v score; higher is better), manual inspection of the top words per topic, and a check for "junk" topics (incoherent word mixtures). Gate: coherence score acceptable, topics interpretable by humans.
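C_v is involved to compute from scratch; as a simpler stand-in, the UMass coherence of a topic's top words is sketched below. Note it conditions on document co-occurrence rather than sliding-window NPMI, so its scores are not comparable to C_v values:

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """UMass coherence of one topic's top words over tokenized docs.

    Sums log((D(w1, w2) + 1) / D(w1)) over word pairs, where D counts
    documents. Higher (closer to 0) is better; assumes each top word
    actually occurs in the corpus (otherwise D(w1) would be zero).
    """
    doc_sets = [set(d) for d in docs]
    score = 0.0
    for w1, w2 in combinations(top_words, 2):
        d1 = sum(1 for s in doc_sets if w1 in s)
        d12 = sum(1 for s in doc_sets if w1 in s and w2 in s)
        score += math.log((d12 + 1) / d1)
    return score
```

A topic whose top words co-occur often scores near 0; a "junk" topic mixing words that never co-occur scores strongly negative.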

Phase 4: Output

Return topics with top words and document assignments.

Output Format

{
  "topics": [{"id": 0, "label": "finance", "top_words": ["revenue", "profit", "quarter", "growth"], "coherence": 0.55}],
  "doc_topics": [{"doc_id": "d1", "dominant_topic": 0, "topic_distribution": [0.7, 0.1, 0.2]}],
  "metadata": {"K": 10, "coherence_avg": 0.48, "documents": 5000, "vocabulary": 8000}
}

Examples

Sample I/O

Input: 1000 news articles, K=5
Expected: topics like {politics, sports, technology, business, entertainment}, with coherent top words per topic.

Edge Cases

| Input | Expected | Why |
| --- | --- | --- |
| Very short documents | Poor topic assignment | Too few words for reliable mixture estimation |
| Homogeneous corpus | 1-2 topics dominate | All documents are similar, limiting topic diversity |
| K=1 | Single topic = corpus vocabulary | Degenerate case, no discrimination |

Gotchas

  • Stop words MUST be removed: LDA will create "junk" topics dominated by common words ("the", "is", "and") if stop words remain.
  • Topic labeling is manual: LDA gives word distributions, NOT topic names. You must interpret and label topics based on top words.
  • Reproducibility: Gibbs sampling is stochastic. Different random seeds give different topics. Run multiple times and check stability.
  • Dynamic topics: Standard LDA assumes topics are static. For evolving corpora (news over years), use Dynamic Topic Models.
  • Hyperparameter sensitivity: Low α produces documents with fewer, more distinct topics. Low β produces topics with fewer, more specific words. Tune or use automatic methods.
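The reproducibility gotcha can be checked mechanically: run LDA with several seeds and greedy-match topics between runs by top-word overlap. A minimal sketch, where `run_a` and `run_b` are hypothetical top-word lists from two runs:

```python
def topic_stability(run_a, run_b):
    """Greedy-match topics between two runs by top-word Jaccard overlap.

    Returns the mean overlap in [0, 1]; values near 1 indicate the two
    runs recovered essentially the same topics.
    """
    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)

    matched, remaining = [], list(run_b)
    for topic in run_a:
        best = max(remaining, key=lambda t: jaccard(topic, t))
        matched.append(jaccard(topic, best))
        remaining.remove(best)
    return sum(matched) / len(matched)
```
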

References

  • For coherence metrics and K selection, see references/topic-evaluation.md
  • For dynamic and correlated topic models, see references/advanced-lda.md