Influence Maximization

Overview

Influence maximization selects k seed nodes in a network to maximize expected spread under a diffusion model (Independent Cascade or Linear Threshold). The problem is NP-hard, but because expected spread is monotone and submodular, the greedy algorithm carries a (1-1/e) ≈ 63% approximation guarantee. With the CELF optimization it is practical for large networks; at millions of nodes, sketch-based methods such as IMM scale further.

When to Use

Trigger conditions:

  • Selecting k influencers/users to seed a viral marketing campaign
  • Maximizing information spread under a fixed budget (k seeds)
  • Comparing seeding strategies (degree-based vs greedy vs random)

When NOT to use:

  • When measuring existing influence (use centrality metrics)
  • For community structure analysis (use community detection)

Algorithm

IRON LAW: Greedy With Lazy Evaluation (CELF) Is the Practical Standard
The naive greedy algorithm requires O(k × n × R) cascade simulations,
where n is the number of nodes and R is the number of Monte Carlo runs
(10,000+). CELF exploits submodularity to skip unnecessary re-evaluations,
achieving up to ~700x speedups in the original CELF experiments. Always
use CELF over naive greedy. Simple heuristics (top-k by degree) are fast
but can perform 50%+ worse than greedy.

Phase 1: Input Validation

Build network graph. Choose diffusion model: Independent Cascade (probability per edge) or Linear Threshold (threshold per node). Set k (number of seeds) and propagation probabilities. Gate: Graph loaded, diffusion model selected, k defined.
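Once the model is chosen, the spread function σ(S) can be estimated directly by simulation. A minimal sketch for the Independent Cascade case, assuming a plain adjacency-dict graph (the function names here are illustrative, not part of any library):

```python
import random

def simulate_ic(graph, seeds, p, rng):
    """One Independent Cascade run: each newly activated node gets a single
    chance to activate each inactive out-neighbor with probability p.
    Returns the number of nodes activated."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, ()):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def estimate_spread(graph, seeds, p, runs=10_000, seed=0):
    """Monte Carlo estimate of the expected spread sigma(S)."""
    rng = random.Random(seed)
    return sum(simulate_ic(graph, seeds, p, rng) for _ in range(runs)) / runs
```

For example, seeding the center of a star graph with p=1.0 activates every leaf, so the estimate equals the node count regardless of the number of runs.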

Phase 2: Core Algorithm

Greedy with CELF:

  1. Initialize: seed set S = ∅
  2. For each candidate node, estimate marginal gain: σ(S∪{v}) - σ(S) via Monte Carlo simulation (R=10,000 runs)
  3. Select node with highest marginal gain, add to S
  4. CELF optimization: reuse previous marginal gains, only re-evaluate when a node's upper bound exceeds current best
  5. Repeat until |S| = k
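The steps above can be sketched with a max-heap of (stale) marginal gains; a gain is trusted only if it was computed against the current seed set, otherwise the node is re-evaluated and pushed back. This is a minimal sketch under the assumptions of a plain adjacency-dict graph and uniform edge probability p, not a production implementation:

```python
import heapq
import random

def estimate_spread(graph, seeds, p, runs, rng):
    """Monte Carlo estimate of sigma(S) under Independent Cascade."""
    total = 0
    for _ in range(runs):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v in graph.get(u, ()):
                    if v not in active and rng.random() < p:
                        active.add(v)
                        nxt.append(v)
            frontier = nxt
        total += len(active)
    return total / runs

def celf(graph, k, p, runs=1000, seed=0):
    rng = random.Random(seed)
    # Initial pass: marginal gain of each node alone, i.e. sigma({v}).
    # Heap entries are (-gain, node, seed-set size when gain was computed).
    heap = [(-estimate_spread(graph, [v], p, runs, rng), v, 0) for v in graph]
    heapq.heapify(heap)
    S, spread = [], 0.0
    while len(S) < k and heap:
        neg_gain, v, last = heapq.heappop(heap)
        if last == len(S):
            # Gain is fresh; by submodularity every other (stale) gain is an
            # upper bound on its true gain, so v is the best choice.
            S.append(v)
            spread += -neg_gain
        else:
            # Stale: re-evaluate against the current seed set and push back.
            gain = estimate_spread(graph, S + [v], p, runs, rng) - spread
            heapq.heappush(heap, (-gain, v, len(S)))
    return S, spread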

Phase 3: Verification

Compare the greedy result against baselines: random seeds, top-k degree, top-k PageRank. Greedy should significantly outperform. Gate: greedy spread > degree-heuristic spread, with a gap larger than the Monte Carlo noise margin.
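A sketch of the verification gate, reusing a Monte Carlo spread estimator (adjacency-dict graph, illustrative names; PageRank is omitted here to stay stdlib-only):

```python
import random

def estimate_spread(graph, seeds, p, runs=2000, seed=0):
    """Monte Carlo estimate of sigma(S) under Independent Cascade."""
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v in graph.get(u, ()):
                    if v not in active and rng.random() < p:
                        active.add(v)
                        nxt.append(v)
            frontier = nxt
        total += len(active)
    return total / runs

def compare_baselines(graph, greedy_seeds, k, p, seed=0):
    """Spread of the greedy seed set vs. top-k out-degree and random seeds."""
    rng = random.Random(seed)
    top_degree = sorted(graph, key=lambda v: len(graph[v]), reverse=True)[:k]
    rand = rng.sample(list(graph), k)
    return {
        "greedy": estimate_spread(graph, greedy_seeds, p),
        "top_degree": estimate_spread(graph, top_degree, p),
        "random": estimate_spread(graph, rand, p),
    }
```

The resulting dict maps directly onto the "baselines" field of the output format below.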

Phase 4: Output

Return seed set with expected spread and comparison.

Output Format

{
  "seeds": [{"node": "user_42", "marginal_gain": 150, "selection_order": 1}],
  "expected_spread": 2500,
  "baselines": {"random": 800, "top_degree": 1900, "greedy": 2500},
  "metadata": {"k": 10, "model": "independent_cascade", "mc_simulations": 10000, "nodes": 50000}
}

Examples

Sample I/O

Input: social network with 10K nodes, k=5 seeds, IC model with p=0.1 per edge.
Expected: greedy selects diverse, well-positioned seeds (not all high-degree); expected spread ~500-1000.

Edge Cases

  • k=1 → the node with the highest individual spread (single seed, so no overlap to consider)
  • k > number of communities → one seed per community is optimal (diversity beats concentration)
  • Very sparse graph (low p) → small spread regardless of seeds (the network cannot propagate at low probability)

Gotchas

  • Monte Carlo variance: With R=1000, spread estimates have ~5% variance. Use R=10,000+ for stable results, especially when comparing close candidates.
  • Diffusion model choice matters: IC and LT produce different optimal seed sets. IC favors high-degree nodes; LT favors nodes that can trigger cascades.
  • Propagation probability estimation: Real-world edge probabilities are unknown. Common approaches: uniform (p=0.01-0.1), weighted inverse degree (1/in-degree), or learned from cascade data.
  • Overlap penalty: Greedy naturally handles overlap (submodularity). Heuristics that independently select top nodes waste seeds on overlapping influence spheres.
  • Scalability: Even with CELF, millions of nodes require further approximation (sketch-based methods like IMM or TIM+).
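The Monte Carlo variance gotcha can be demonstrated directly: repeat the spread estimate many times at two values of R and compare the spread of the estimates themselves, which should shrink roughly as √R. A small sketch on a star graph (illustrative names, stdlib only):

```python
import random
import statistics

def simulate_ic(graph, seeds, p, rng):
    """One Independent Cascade run; returns the number of activated nodes."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, ()):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def estimate_stdev(graph, seeds, p, runs, trials=30, seed=0):
    """Standard deviation across `trials` independent spread estimates,
    each averaged over `runs` cascade simulations."""
    rng = random.Random(seed)
    estimates = [
        sum(simulate_ic(graph, seeds, p, rng) for _ in range(runs)) / runs
        for _ in range(trials)
    ]
    return statistics.stdev(estimates)
```

On a star with 20 leaves and p=0.3, the estimate at R=5000 is far more stable than at R=50, which is why close candidates should be compared only at high R.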

References

  • For CELF and CELF++ implementation, see references/celf-implementation.md
  • For scalable influence maximization (IMM), see references/scalable-im.md