algo-nlp-summarization
Text Summarization
Overview
Text summarization condenses documents while preserving key information. Extractive summarization selects and concatenates important sentences from the original; abstractive summarization generates new text that paraphrases the content. Extractive is simpler and more faithful; abstractive is more fluent but may hallucinate.
When to Use
Trigger conditions:
- Condensing long documents, reports, or article collections
- Building automated summary pipelines for content curation
- Comparing extractive vs abstractive approaches for a use case
When NOT to use:
- When full document understanding is needed (summarization loses detail)
- For structured data extraction (use NER or information extraction)
Algorithm
IRON LAW: Abstractive Summarization Can HALLUCINATE
Abstractive models may generate fluent text containing facts NOT in
the source. Always verify key claims in abstractive summaries against
the original document. For high-stakes use cases (legal, medical),
prefer extractive or use abstractive with factual consistency checking.
Phase 1: Input Validation
Determine: input length, target summary length (ratio or word count), single-doc vs multi-doc, domain. Gate: Input text available, target length defined.
Phase 2: Core Algorithm
Extractive (TextRank/LexRank):
- Split document into sentences
- Build similarity graph (sentence nodes, cosine similarity edges)
- Run PageRank on sentence graph
- Select top-k sentences by rank, reorder by original position
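The extractive steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: it assumes naive sentence splitting and the word-overlap similarity from the TextRank paper, where a real pipeline would use a proper tokenizer and TF-IDF cosine similarity.

```python
# Minimal TextRank sketch: similarity graph over sentences + weighted
# PageRank by power iteration, then top-k sentences in original order.
import re
from math import log

def split_sentences(text):
    """Naive sentence splitter; swap in nltk/spacy for real use."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def similarity(a, b):
    """Word-overlap similarity, length-normalized (TextRank-style)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if len(wa) < 2 or len(wb) < 2:
        return 0.0
    return len(wa & wb) / (log(len(wa)) + log(len(wb)))

def textrank_summary(text, k=3, damping=0.85, iters=50):
    sents = split_sentences(text)
    n = len(sents)
    if n <= k:
        return sents
    # Build similarity graph: edge weight = pairwise sentence similarity.
    w = [[similarity(sents[i], sents[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    row_sum = [sum(row) for row in w]
    # Weighted PageRank via power iteration over the sentence graph.
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [
            (1 - damping) / n + damping * sum(
                scores[j] * w[j][i] / row_sum[j]
                for j in range(n) if row_sum[j] > 0
            )
            for i in range(n)
        ]
    # Select top-k sentences by rank, then restore document order.
    top = sorted(sorted(range(n), key=lambda i: -scores[i])[:k])
    return [sents[i] for i in top]
```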
Abstractive (transformer-based):
- Use pre-trained model (BART, T5, Pegasus)
- Encode input document (handle length limits with chunking if needed)
- Generate summary with beam search
- Post-process: check for repetition, factual consistency
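The repetition post-check above can be sketched with an n-gram counter (helper names here are illustrative). At decode time, Hugging Face transformers' `no_repeat_ngram_size` generation parameter prevents the same problem directly; this check catches what slips through.

```python
# Flag any n-gram that occurs more than once in a generated summary.
from collections import Counter

def repeated_ngrams(text, n=3):
    """Return n-grams appearing more than once in the text."""
    tokens = text.lower().split()
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return [" ".join(g) for g, c in grams.items() if c > 1]

def has_repetition(summary, n=3):
    return bool(repeated_ngrams(summary, n))
```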
Phase 3: Verification
Evaluate: ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L) against reference summaries. Manual check for factual accuracy and coherence. Gate: ROUGE scores reasonable for domain, no hallucinations in spot-check.
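The ROUGE evaluation can be sketched as below, showing the recall variant of ROUGE-N (overlapping n-grams divided by the reference's n-gram count). Real evaluations should use an established package such as Google's `rouge-score`, which also handles stemming and ROUGE-L.

```python
# ROUGE-N recall sketch: clipped n-gram overlap / reference n-gram count.
from collections import Counter

def ngrams(text, n):
    toks = text.lower().split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    if not ref:
        return 0.0
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / sum(ref.values())
```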
Phase 4: Output
Return summary with metadata.
Output Format
{
  "summary": "The company reported Q4 revenue of...",
  "method": "extractive_textrank",
  "metadata": {
    "input_words": 2000,
    "summary_words": 200,
    "compression_ratio": 0.10,
    "sentences_selected": 5
  }
}
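Assembling this structure is straightforward; a sketch follows, with field names taken from the format above and an illustrative helper name.

```python
# Build the output dict with compression metadata from raw word counts.
def build_output(summary, source_text, method, sentences_selected=None):
    in_words = len(source_text.split())
    out_words = len(summary.split())
    meta = {
        "input_words": in_words,
        "summary_words": out_words,
        "compression_ratio": round(out_words / in_words, 2) if in_words else 0.0,
    }
    if sentences_selected is not None:  # only meaningful for extractive
        meta["sentences_selected"] = sentences_selected
    return {"summary": summary, "method": method, "metadata": meta}
```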
Examples
Sample I/O
Input: 2000-word news article about quarterly earnings
Expected: 200-word summary covering revenue, profit, guidance, and key highlights. Extractive: 5-6 selected sentences. Abstractive: one coherent paragraph.
Edge Cases
| Input | Expected | Why |
|---|---|---|
| Very short input (< 100 words) | Return as-is or minimal trimming | Already concise |
| Multiple contradicting sections | Summary may miss nuance | Summarization favors dominant theme |
| Technical jargon | Extractive preserves, abstractive may simplify | Domain expertise affects quality |
Gotchas
- ROUGE ≠ quality: ROUGE measures n-gram overlap with references. A high-ROUGE summary can be incoherent, and a low-ROUGE summary can be excellent with different word choices.
- Input length limits: Transformer models have max token limits (512-4096). Long documents need chunking strategies (chunk-then-summarize or hierarchical summarization).
- Repetition: Abstractive models sometimes repeat phrases. Use a repetition penalty or an n-gram block (no_repeat_ngram_size) during generation.
- Position bias: In news text, important information is front-loaded (inverted pyramid). Simple "take first N sentences" is a strong extractive baseline.
- Multi-document summarization: Summarizing multiple related documents requires handling redundancy and contradiction across sources.
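The chunking strategy mentioned in the gotchas can be sketched as below: greedily pack whole sentences into chunks under a word budget, summarize each chunk, then summarize the concatenation of the partial summaries. `summarize` is a stand-in for any single-chunk summarizer; a single sentence longer than the budget still gets its own chunk.

```python
# Chunk-then-summarize sketch for inputs past a model's token limit.
import re

def chunk_by_sentences(text, max_words=400):
    sents = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    chunks, current, count = [], [], 0
    for s in sents:
        n = len(s.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(s)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def hierarchical_summarize(text, summarize, max_words=400):
    chunks = chunk_by_sentences(text, max_words)
    if len(chunks) == 1:
        return summarize(chunks[0])
    partials = [summarize(c) for c in chunks]
    return summarize(" ".join(partials))
```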
References
- For TextRank/LexRank implementation details, see references/graph-based-extraction.md
- For factual consistency checking, see references/factual-consistency.md