fork-intelligence
Fork Intelligence
Systematic methodology for discovering valuable work in GitHub fork ecosystems. Stars-only filtering misses 60-100% of substantive forks — this skill uses branch-level divergence analysis, upstream PR cross-referencing, and domain-specific heuristics to find what matters.
Validated empirically across 10 repositories spanning Python, Rust, TypeScript, C++/Python, and Node.js (tensortrade, backtesting.py, kokoro, pymoo, firecrawl, barter-rs, pueue, dukascopy-node, ArcticDB, flowsurface).
FIRST — TodoWrite Task Templates
MANDATORY: Select and load the appropriate template before any fork analysis.
Template A — Full Analysis (new repository)
1. Get upstream baseline (stars, forks, default branch, last push)
2. List all forks with pagination, note timestamp clusters
3. Filter to unique-timestamp forks (skip bulk mirrors)
4. Check default branch divergence (ahead_by/behind_by)
5. Check non-default branches for all forks with recent push or >1 branch
6. Evaluate commit content, author emails, tags/releases
7. Cross-reference upstream PR history from fork owners
8. Tier ranking and cross-fork convergence analysis
9. Produce report with actionable recommendations
Template B — Quick Scan (triage only)
1. Get upstream baseline
2. List forks, filter by timestamp clustering
3. Check default branch divergence only
4. Report forks with ahead_by > 0
Template C — Targeted Fork Evaluation (specific fork)
1. Compare fork vs upstream on all branches
2. Examine commit messages and changed files
3. Check for tags/releases, open issues, PRs
4. Assess cherry-pick viability
Signal Priority Order
Ranked by empirical reliability across 10 repositories. See signal-priority.md for details.
| Rank | Signal | Reliability | What It Catches |
|---|---|---|---|
| 1 | Branch-level divergence | Highest | Work on feature branches (50%+ of substantive forks) |
| 2 | Upstream PR cross-reference | High | Rebased/force-pushed work invisible to compare API |
| 3 | Tags/releases on fork | High | Independent maintenance intent |
| 4 | Commit email domains | High | Institutional contributors (@company.com) |
| 5 | Timestamp clustering | Medium | Eliminates 85%+ mirror noise |
| 6 | Cross-fork convergence | Medium | Reveals unmet upstream demand |
| 7 | Stars | Lowest | Often anti-correlated with actual value |
Pipeline — 7 Steps
Step 1: Upstream Baseline
UPSTREAM="OWNER/REPO"
gh api "repos/$UPSTREAM" --jq '{forks_count, pushed_at, default_branch, stargazers_count}'
Step 2: List All Forks + Timestamp Clustering
# List all forks with activity signals
gh api "repos/$UPSTREAM/forks" --paginate \
--jq '.[] | {full_name, pushed_at, stargazers_count, default_branch}'
Timestamp clustering: Forks sharing exact pushed_at with upstream are bulk mirrors created by GitHub's fork mechanism and never touched. Group by pushed_at — forks with unique timestamps warrant investigation. This alone eliminates 85%+ of noise.
# Filter to unique-timestamp forks (skip bulk mirrors)
gh api "repos/$UPSTREAM/forks" --paginate \
--jq '.[] | {full_name, pushed_at, stargazers_count}' | \
jq -s 'group_by(.pushed_at) | map(select(length == 1)) | flatten'
Step 3: Default Branch Divergence
BRANCH=$(gh api "repos/$UPSTREAM" --jq '.default_branch')
# For each candidate fork
gh api "repos/$UPSTREAM/compare/$BRANCH...FORK_OWNER:$BRANCH" \
--jq '{ahead_by, behind_by, status}'
The status field meanings:
identical— pure mirror, skipbehind— stale mirror, skipdiverged— has original commits AND is behind (interesting)ahead— has original commits, up-to-date with upstream (rare, most valuable)
Important: Always compare from the upstream repo's perspective (repos/UPSTREAM/compare/...). The reverse direction (repos/FORK/compare/...) returns 404 for some repositories.
Step 4: Non-Default Branch Analysis (CRITICAL)
This is the single biggest methodology improvement. Across all 10 repos tested, 50%+ of the most valuable fork work lived exclusively on feature branches.
Examples:
- flowsurface/aviu16: 7,000-line GPU shader heatmap only on
shader-heatmap - ArcticDB/DerThorsten: 147 commits across
conda_build,clang,apple_changes - pueue/FrancescElies: Duration display only on
cesc/duration - barter-rs: 6 of 12 top forks had work only on feature branches
# List branches on a fork
gh api "repos/FORK_OWNER/REPO/branches" --jq '.[].name' | head -20
# Check divergence on a specific branch
gh api "repos/$UPSTREAM/compare/$BRANCH...FORK_OWNER:FEATURE_BRANCH" \
--jq '{ahead_by, behind_by, status}'
Heuristics for which forks need branch checks:
- Any fork with
pushed_atmore recent than upstream butahead_by == 0on default branch - Any fork with more than 1 branch
- Branch count > 10 is suspicious — likely non-trivial work (ArcticDB: Rohan-flutterint had 197 branches)
Step 5: Commit Content Evaluation
gh api "repos/$UPSTREAM/compare/$BRANCH...FORK_OWNER:BRANCH" \
--jq '.commits[] | {sha: .sha[:8], message: .commit.message | split("\n")[0], date: .commit.committer.date[:10], author: .commit.author.email}'
What to look for:
- Commit email domains reveal institutional contributors (
@man.com,@quantstack.net) - Subtract merge commits from ahead_by count (e.g., akeda2/pueue showed 35 ahead but 28 were upstream merges)
- Build system changes (
CMakeLists.txt,Cargo.toml,pyproject.toml) indicate platform enablement - Protobuf schema changes indicate architectural-level features
- Test files alongside source changes signal production-intent work
Step 6: Fork-Specific Signals
# Tags/releases (strongest independent maintenance signal)
gh api "repos/FORK_OWNER/REPO/tags" --jq '.[].name' | head -10
gh api "repos/FORK_OWNER/REPO/releases" --jq '.[] | {tag_name, name, published_at}' | head -5
# Open issues on the fork (signals independent project maintenance)
gh api "repos/FORK_OWNER/REPO/issues?state=open" --jq 'length'
# Check if repo was renamed (strong divergence intent signal)
gh api "repos/FORK_OWNER/REPO" --jq '.name'
| Signal | Strength | Example |
|---|---|---|
| Tags/releases on fork | Highest | pueue/freesrz93 had 6 releases |
| Open PRs against upstream | High | Formal proposals with review context |
| Open issues on the fork | High | Independent project maintenance |
| Repo renamed | Medium | flowsurface/sinaha81 became volume_flow |
| Build config changes | High (compiled languages) | Cargo.toml, CMakeLists.txt diff |
| Description changed | Weak | Many vanity renames with no code |
Step 7: Cross-Fork Convergence + Upstream PR History
# Check upstream PRs from fork owners
gh api "repos/$UPSTREAM/pulls?state=all" --paginate \
--jq '.[] | select(.head.repo.fork) | {number, title, state, user: .user.login}'
Cross-fork convergence: When multiple forks independently solve the same problem, it signals unmet upstream demand:
- firecrawl: 3 forks adopted Patchright for anti-detection
- flowsurface: 3 forks added technical indicators independently
- kokoro: 2 independent batched inference implementations
- barter-rs: 4 forks added Bybit support
Upstream PR cross-reference catches:
- Rebased/force-pushed work invisible to compare API
- Work that was merged upstream (fork shows 0 ahead but was historically significant)
- Declined PRs with valuable code that the fork still maintains
Tier Classification
After running the pipeline, classify forks into tiers:
| Tier | Criteria | Action |
|---|---|---|
| Tier 1: Major Extensions | New features, architectural changes, >10 original commits | Deep evaluation, cherry-pick candidates |
| Tier 2: Targeted Features | Focused additions, bug fixes, 2-10 commits | Cherry-pick individual commits |
| Tier 3: Infrastructure | CI/CD, packaging, deployment, docs | Evaluate if relevant to your setup |
| Tier 4: Historical | Merged upstream or stale but once significant | Note for context, no action needed |
Domain-Specific Patterns
Different codebases exhibit different fork behaviors. See domain-patterns.md for full details.
| Domain | Key Pattern | Example |
|---|---|---|
| Scientific/ML | Researchers fork-implement-publish-vanish, zero social engagement | pymoo: 300-file fork with 0 stars |
| Trading/Finance | Exchange connectors dominate; best forks are private | barter-rs: 4 independent Bybit impls |
| Infrastructure/DevTools | Self-hosting/SaaS-removal is the dominant theme | firecrawl: devflowinc/firecrawl-simple (630 stars) |
| C++/Python Mixed | Feature work lives on branches; email domains reveal institutions | ArcticDB: @man.com, @quantstack.net |
| Node.js Libraries | Check npm publication as separate packages | dukascopy-node: kyo06 published dukascopy-node-plus |
| Rust CLI | Cargo.toml diff is reliable quick filter; "superset" forks add subcommands | pueue: freesrz93 added 7 subcommands |
Quick-Scan Pipeline (5-minute triage)
For rapid triage of any new repo:
UPSTREAM="OWNER/REPO"
BRANCH=$(gh api "repos/$UPSTREAM" --jq '.default_branch')
# 1. Baseline
gh api "repos/$UPSTREAM" --jq '{forks_count, pushed_at, stargazers_count}'
# 2. Forks with unique timestamps (skip mirrors)
gh api "repos/$UPSTREAM/forks" --paginate \
--jq '.[] | {full_name, pushed_at, stargazers_count}' | \
jq -s 'group_by(.pushed_at) | map(select(length == 1)) | flatten | sort_by(.pushed_at) | reverse'
# 3. Check ahead_by for each candidate
# (loop over candidates from step 2)
# 4. Check upstream PRs from fork authors
gh api "repos/$UPSTREAM/pulls?state=all" --paginate \
--jq '.[] | select(.head.repo.fork) | {number, title, state, user: .user.login}'
Known Limitations
| Limitation | Impact | Workaround |
|---|---|---|
| GitHub compare API 250-commit limit | Highly divergent forks may truncate | Use gh api repos/FORK/commits?per_page=1 to get total count |
| Private forks invisible | Trading firms keep best work private | Accepted limitation |
| Force-pushed branches break compare API | Shows 0 ahead despite significant work | Cross-reference upstream PR history |
| Renamed forks may break API calls | Old URLs may 404 | Use gh api repos/FORK_OWNER/REPO --jq '.name' to detect renames |
| Rate limiting on large fork ecosystems | >1000 forks = many API calls | Use timestamp clustering to reduce calls by 85%+ |
| Maintainer dev forks look like independent work | Branch names 1:1 with upstream PRs | Cross-reference branch names against upstream PR branch names |
Report Template
Use this structure for the final analysis report:
# Fork Analysis Report: OWNER/REPO
**Repository**: OWNER/REPO (N stars, M forks)
**Analysis date**: YYYY-MM-DD
## Fork Landscape Summary
| Metric | Value |
| ------------------------------------- | ------ |
| Total forks | N |
| Pure mirrors | N (X%) |
| Divergent forks (ahead on any branch) | N |
| Substantive forks (meaningful work) | N |
| Stars-only miss rate | X% |
## Tiered Ranking
### Tier 1: Major Extensions
(fork details with ahead_by, key features, files changed)
### Tier 2: Targeted Features
...
### Tier 3: Infrastructure/Packaging
...
## Cross-Fork Convergence Patterns
(themes that multiple forks independently implemented)
## Actionable Recommendations
- Cherry-pick candidates
- Feature inspiration
- Security fixes
Post-Change Checklist
After modifying THIS skill:
- YAML frontmatter valid (no colons in description)
- Trigger keywords current in description
- All
./references/links resolve - Pipeline steps numbered consistently
- Shell commands tested against a real repository
- Append changes to evolution-log.md