skills/transilienceai/communitytools/web-archive-analysis

web-archive-analysis

SKILL.md

Web Archive Analysis Skill

Purpose

Query the Wayback Machine to discover historical technology usage and detect technology migrations over time.

Operations

1. query_cdx_api

Get historical snapshots from the Wayback Machine CDX API.

Endpoint:

GET http://web.archive.org/cdx/search/cdx

Parameters:

url: {domain}
output: json
filter: statuscode:200
collapse: timestamp:6  # Group by month (YYYYMM)
limit: 100
from: {start_year}
to: {end_year}

Example Request:

curl "http://web.archive.org/cdx/search/cdx?url=example.com&output=json&filter=statuscode:200&collapse=timestamp:6&limit=100"

Response Format:

[
  ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
  ["com,example)/", "20240115120000", "https://example.com/", "text/html", "200", "ABC123...", "45678"]
]

2. select_snapshots

Choose representative snapshots for analysis.

Selection Strategy:

def select_snapshots(all_snapshots):
    # Get snapshots at regular intervals
    intervals = [
        "6 months ago",
        "1 year ago",
        "2 years ago",
        "3 years ago",
        "5 years ago"
    ]

    selected = []
    for interval in intervals:
        target_date = calculate_date(interval)
        closest = find_closest_snapshot(all_snapshots, target_date)
        if closest:
            selected.append(closest)

    return selected

Snapshot Priority:

  1. Recent (baseline for comparison)
  2. 1 year ago (detect recent changes)
  3. 2-3 years ago (medium-term evolution)
  4. 5+ years ago (historical context)

3. fetch_archived_content

Retrieve archived pages for analysis.

Wayback URL Format:

https://web.archive.org/web/{timestamp}/{original_url}

Example:

https://web.archive.org/web/20230115120000/https://example.com/

Headers to Request:

Accept: text/html
User-Agent: TechStackAgent/1.0 (OSINT research)

4. compare_snapshots

Detect technology changes between snapshots.

Comparison Points:

{
  "headers_to_compare": [
    "Server",
    "X-Powered-By",
    "Set-Cookie"
  ],
  "html_elements": [
    "meta[name=generator]",
    "script[src]",
    "link[href]"
  ],
  "patterns_to_track": [
    "/wp-content/",
    "/_next/",
    "/_nuxt/",
    "/static/js/"
  ]
}

Change Detection:

def detect_changes(old_snapshot, new_snapshot):
    changes = []

    # Compare technologies
    old_tech = extract_technologies(old_snapshot)
    new_tech = extract_technologies(new_snapshot)

    added = new_tech - old_tech
    removed = old_tech - new_tech

    for tech in added:
        changes.append({
            "type": "technology_added",
            "technology": tech,
            "first_seen": new_snapshot.timestamp
        })

    for tech in removed:
        changes.append({
            "type": "technology_removed",
            "technology": tech,
            "last_seen": old_snapshot.timestamp
        })

    return changes

5. detect_migrations

Identify framework/platform migrations.

Common Migration Patterns:

{
  "WordPress → Custom/React": {
    "indicators": [
      "/wp-content/ disappears",
      "React globals appear",
      "/_next/ or /static/js/ paths"
    ],
    "typical_timeline": "6-18 months"
  },
  "AngularJS → Angular": {
    "indicators": [
      "ng-app disappears",
      "ng-version appears",
      "Angular 2+ patterns"
    ],
    "typical_timeline": "12-24 months"
  },
  "jQuery → React/Vue": {
    "indicators": [
      "jQuery CDN removed",
      "Modern framework globals",
      "SPA patterns"
    ],
    "typical_timeline": "6-12 months"
  },
  "On-prem → Cloud": {
    "indicators": [
      "CloudFront/Cloudflare headers appear",
      "AWS/GCP/Azure signatures",
      "CDN usage"
    ],
    "typical_timeline": "3-12 months"
  }
}

6. extract_historical_tech

Parse archived HTML for technology signals.

Process:

  1. Fetch archived page
  2. Apply same analysis as html_content_analysis skill
  3. Record technologies with timestamp
  4. Build timeline of technology usage

Output

{
  "skill": "web_archive_analysis",
  "domain": "string",
  "results": {
    "archive_coverage": {
      "oldest_snapshot": "2015-03-15",
      "newest_snapshot": "2024-01-10",
      "total_snapshots": 450,
      "snapshots_analyzed": 5
    },
    "snapshots_analyzed": [
      {
        "timestamp": "2024-01-10",
        "url": "https://web.archive.org/web/20240110/...",
        "technologies_detected": ["Next.js", "React", "Vercel"]
      },
      {
        "timestamp": "2022-06-15",
        "url": "https://web.archive.org/web/20220615/...",
        "technologies_detected": ["React", "Create React App", "Heroku"]
      },
      {
        "timestamp": "2020-01-20",
        "url": "https://web.archive.org/web/20200120/...",
        "technologies_detected": ["WordPress", "PHP"]
      }
    ],
    "technology_timeline": [
      {
        "technology": "WordPress",
        "first_seen": "2015-03-15",
        "last_seen": "2020-06-01",
        "status": "removed"
      },
      {
        "technology": "React",
        "first_seen": "2020-03-01",
        "last_seen": "present",
        "status": "current"
      },
      {
        "technology": "Next.js",
        "first_seen": "2023-01-15",
        "last_seen": "present",
        "status": "current"
      }
    ],
    "migrations_detected": [
      {
        "type": "CMS → Modern Framework",
        "from": "WordPress",
        "to": "React/Next.js",
        "approximate_date": "2020-Q1 to 2020-Q2",
        "confidence": 85
      },
      {
        "type": "Hosting Migration",
        "from": "Heroku",
        "to": "Vercel",
        "approximate_date": "2023-Q1",
        "confidence": 80
      }
    ],
    "current_vs_historical": {
      "current_stack": ["Next.js", "React", "Vercel"],
      "historical_stack": ["WordPress", "PHP", "Heroku"],
      "major_changes": 2
    }
  },
  "evidence": [
    {
      "type": "archived_snapshot",
      "timestamp": "string",
      "archive_url": "string",
      "technologies": ["array"],
      "analysis_timestamp": "ISO-8601"
    }
  ]
}

Rate Limiting

  • Wayback CDX API: 15 requests/minute
  • Archived page fetches: 10/minute
  • Cache CDX results to avoid repeated queries

Error Handling

  • 404: Domain not archived
  • 503: Wayback Machine overloaded - retry with backoff
  • Timeout: Increase timeout for archived pages (can be slow)
  • Continue with available snapshots on partial failures

Security Considerations

  • Only access public archives
  • Respect Wayback Machine rate limits
  • Do not store archived content beyond analysis
  • Note that archived content may contain outdated security vulnerabilities
  • Log all queries for audit

Confidence Notes

Historical data provides contextual signals:

  • Confirms technology transitions
  • Validates current technology choices
  • Lower weight than current direct evidence
  • Base confidence: 60-75%
Weekly Installs
3
GitHub Stars
67
First Seen
5 days ago
Installed on
amp3
cline3
opencode3
cursor3
kimi-cli3
codex3