algo-seo-pagerank
PageRank Algorithm
Overview
PageRank computes the importance of web pages by modeling a random surfer who follows links with probability d (the damping factor) and jumps to a uniformly random page with probability 1-d. Each iteration costs O(E), so total runtime is O(k * E), where k is the number of iterations to convergence and E is the number of edges.
When to Use
Trigger conditions:
- Computing page importance from link graph structure
- Building link-based authority scoring systems
- Analyzing citation networks or any directed graph importance
When NOT to use:
- When you only need keyword relevance (use TF-IDF instead)
- When the graph is undirected or unweighted (consider centrality measures)
Algorithm
IRON LAW: PageRank Convergence
- Damping factor d MUST be < 1 (typically 0.85)
- Without damping, rank sinks and spider traps break convergence
- Correctness invariant: sum of all PageRank values = 1.0
Phase 1: Input Validation
Build adjacency list from link data. Verify: no self-loops counted, all nodes accounted for (including dangling nodes with no outlinks). Gate: Graph is well-formed, dangling nodes identified.
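Phase 1 can be sketched as follows. This is a minimal sketch assuming links arrive as (source, target) pairs; the function name `build_graph` and the return shape are illustrative choices, not a fixed API.

```python
def build_graph(links):
    """Build an adjacency list, drop self-loops, and identify dangling nodes."""
    out_links = {}
    nodes = set()
    for src, dst in links:
        nodes.update((src, dst))
        if src == dst:
            continue  # self-loops are not counted
        out_links.setdefault(src, set()).add(dst)
    # Dangling nodes appear in the graph but have no (non-self-loop) outlinks.
    dangling = {p for p in nodes if p not in out_links}
    return nodes, out_links, dangling

nodes, out_links, dangling = build_graph(
    [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "D")]
)
# D's only link is a self-loop, so D ends up in the dangling set.
```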
Phase 2: Core Algorithm
- Initialize all N pages with PR = 1/N
- For each iteration:
- For each page p: PR(p) = (1-d)/N + d * Σ(PR(q)/L(q)), summed over all pages q linking to p, where L(q) is the number of outlinks of q
- Distribute dangling node rank equally to all pages
- Repeat until convergence (L1 norm change < ε, typically 1e-6)
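The Phase 2 loop above can be sketched in Python. This is one way to realize it, assuming the adjacency list from Phase 1 is a dict of node → set of targets; the function name and the (scores, iterations, converged) return shape are illustrative.

```python
def pagerank(nodes, out_links, d=0.85, eps=1e-6, max_iter=100):
    """Power iteration with uniform redistribution of dangling rank."""
    n = len(nodes)
    pr = {p: 1.0 / n for p in nodes}
    dangling = [p for p in nodes if not out_links.get(p)]
    # Invert the adjacency list once: for each p, who links to it and their outdegree.
    in_links = {p: [] for p in nodes}
    for q, targets in out_links.items():
        for p in targets:
            in_links[p].append((q, len(targets)))
    for it in range(max_iter):
        s = sum(pr[p] for p in dangling)  # dangling rank, shared equally below
        new_pr = {
            p: (1 - d) / n + d * (sum(pr[q] / l for q, l in in_links[p]) + s / n)
            for p in nodes
        }
        delta = sum(abs(new_pr[p] - pr[p]) for p in nodes)  # L1 norm of the change
        pr = new_pr
        if delta < eps:
            return pr, it + 1, True
    return pr, max_iter, False
```

On the three-node sample graph below (A→B, A→C, B→C, C→A), this converges to roughly C ≈ 0.397, A ≈ 0.388, B ≈ 0.215.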
Phase 3: Verification
Check: all PR values sum to ~1.0. Compare top-k rankings against known authority pages. Gate: |Σ PR - 1.0| < 0.001 and convergence achieved within max iterations.
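The Phase 3 gate can be expressed as a small check, assuming scores arrive as a dict of page → PR value together with a convergence flag (both names here are illustrative):

```python
def verify_pagerank(pr, converged, tol=1e-3):
    """Gate: rank mass is conserved and the iteration actually converged."""
    total = sum(pr.values())
    assert abs(total - 1.0) < tol, f"rank mass leaked: sum = {total:.6f}"
    assert converged, "did not converge within max iterations"
```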
Phase 4: Output
Return sorted page scores with rank position.
Output Format
{
"rankings": [{"page": "url", "score": 0.042, "rank": 1}],
"metadata": {"nodes": 1000, "edges": 5000, "iterations": 45, "damping": 0.85, "converged": true}
}
Examples
Sample I/O
Input: Pages A→B, A→C, B→C, C→A (3 nodes, 4 edges, d=0.85) Expected Output: C ≈ 0.397, A ≈ 0.388, B ≈ 0.215
Edge Cases
| Input | Expected | Why |
|---|---|---|
| Single node, no links | PR = 1.0 | Only node gets all rank |
| All nodes link to one | Target gets highest PR | Star topology concentrates rank |
| Dangling node (no outlinks) | Distribute its rank equally | Prevents rank leakage |
Gotchas
- Dangling nodes: Pages with no outgoing links leak rank. Redistribute their rank equally across all pages each iteration.
- Spider traps: A group of pages that only link to each other accumulates rank. The damping factor limits the effect but doesn't eliminate it entirely.
- Convergence speed: Dense graphs converge faster. Sparse graphs with long chains may need 100+ iterations.
- Floating point accumulation: For large graphs, use double precision. Single precision drifts noticeably after 50+ iterations.
- Personalized PageRank: Standard PageRank uses uniform random jump. For personalized recommendations, bias the jump vector toward seed pages.
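The personalized variant from the last gotcha can be sketched by replacing the uniform jump with a seed-biased jump vector. A minimal sketch, assuming the same adjacency-list input as above; dangling rank is also redistributed through the jump vector, which is one common convention:

```python
def personalized_pagerank(nodes, out_links, seeds, d=0.85, eps=1e-6, max_iter=100):
    """Power iteration where the (1-d) jump mass goes only to seed pages."""
    n = len(nodes)
    jump = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in nodes}
    pr = dict(jump)  # start from the jump distribution
    dangling = [p for p in nodes if not out_links.get(p)]
    in_links = {p: [] for p in nodes}
    for q, targets in out_links.items():
        for p in targets:
            in_links[p].append((q, len(targets)))
    for _ in range(max_iter):
        s = sum(pr[p] for p in dangling)  # dangling rank, routed via the jump vector
        new_pr = {
            p: (1 - d) * jump[p]
               + d * (sum(pr[q] / l for q, l in in_links[p]) + s * jump[p])
            for p in nodes
        }
        if sum(abs(new_pr[p] - pr[p]) for p in nodes) < eps:
            return new_pr
        pr = new_pr
    return pr
```

Pages unreachable from the seed set receive zero rank, which is exactly the locality that makes the personalized variant useful for recommendations.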
References
- For the mathematical derivation of the convergence proof, see references/convergence-proof.md
- For an efficient sparse matrix implementation, see references/sparse-implementation.md