Web Archive Analysis Skill
Purpose
Query the Wayback Machine to discover historical technology usage and detect technology migrations over time.
Operations
1. query_cdx_api
Get historical snapshots from the Wayback Machine CDX API.
Endpoint:
GET http://web.archive.org/cdx/search/cdx
Parameters:
url: {domain}
output: json
filter: statuscode:200
collapse: timestamp:6 # Group by month (YYYYMM)
limit: 100
from: {start_year}
to: {end_year}
Example Request:
curl "http://web.archive.org/cdx/search/cdx?url=example.com&output=json&filter=statuscode:200&collapse=timestamp:6&limit=100"
Response Format:
[
  ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
  ["com,example)/", "20240115120000", "https://example.com/", "text/html", "200", "ABC123...", "45678"]
]
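The request and response above can be sketched in Python as follows. This is a minimal illustration using only the standard library; the function names `build_cdx_url` and `parse_cdx_rows` are hypothetical and not part of the skill, and the sketch assumes the first row of the JSON response is the header row, as in the example.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

def build_cdx_url(domain, start_year=None, end_year=None, limit=100):
    """Build a CDX query URL from the parameters listed above."""
    params = [
        ("url", domain),
        ("output", "json"),
        ("filter", "statuscode:200"),
        ("collapse", "timestamp:6"),  # group by month (YYYYMM)
        ("limit", str(limit)),
    ]
    if start_year is not None:
        params.append(("from", str(start_year)))
    if end_year is not None:
        params.append(("to", str(end_year)))
    # Keep ":" unescaped so the URL matches the example request.
    return CDX_ENDPOINT + "?" + urlencode(params, safe=":")

def parse_cdx_rows(body):
    """Turn the JSON array-of-arrays into dicts keyed by the header row."""
    rows = json.loads(body) if body.strip() else []
    if not rows:
        return []
    header, *captures = rows
    return [dict(zip(header, capture)) for capture in captures]
```

Each parsed row can then be addressed by field name, for example `row["timestamp"]` or `row["original"]`, which keeps later steps independent of column order.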
2. select_snapshots
Choose representative snapshots for analysis.
Selection Strategy:
def select_snapshots(all_snapshots):
    # Get snapshots at regular intervals
    intervals = [
        "6 months ago",
        "1 year ago",
        "2 years ago",
        "3 years ago",
        "5 years ago"
    ]
    selected = []
    for interval in intervals:
        # calculate_date and find_closest_snapshot are helpers not defined here
        target_date = calculate_date(interval)
        closest = find_closest_snapshot(all_snapshots, target_date)
        # Skip duplicates: a sparse archive can map several targets to one snapshot
        if closest and closest not in selected:
            selected.append(closest)
    return selected
Snapshot Priority:
- Recent (baseline for comparison)
- 1 year ago (detect recent changes)
- 2-3 years ago (medium-term evolution)
- 5+ years ago (historical context)
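The selection strategy relies on two helpers, `calculate_date` and `find_closest_snapshot`, that this document does not define. One possible sketch is shown below. It assumes each snapshot is a CDX row whose second element is the 14-digit timestamp, and it approximates a month as 30 days and a year as 365 days; both are assumptions made for illustration, as is the `parse_timestamp` helper.

```python
from datetime import datetime, timedelta, timezone

# Approximate unit lengths in days (an assumption of this sketch).
DAYS_PER_UNIT = {"month": 30, "year": 365}

def calculate_date(interval, now=None):
    """Convert a string such as "6 months ago" into an approximate datetime."""
    now = now or datetime.now(timezone.utc)
    count, unit, _ago = interval.split()
    return now - timedelta(days=int(count) * DAYS_PER_UNIT[unit.rstrip("s")])

def parse_timestamp(timestamp):
    """Parse a 14-digit Wayback timestamp (YYYYMMDDhhmmss) as UTC."""
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)

def find_closest_snapshot(all_snapshots, target_date):
    """Return the CDX row nearest to target_date, or None if there are no rows."""
    return min(
        all_snapshots,
        key=lambda row: abs(parse_timestamp(row[1]) - target_date),
        default=None,
    )
```

With these in place, `select_snapshots` can call `calculate_date(interval)` for each interval and pass the result to `find_closest_snapshot`.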
3. fetch_archived_content
Retrieve archived pages for analysis.
Wayback URL Format:
https://web.archive.org/web/{timestamp}/{original_url}
Example:
https://web.archive.org/web/20230115120000/https://example.com/
Headers to Request:
Accept: text/html
User-Agent: TechStackAgent/1.0 (OSINT research)
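As a rough sketch, the URL format and headers above could be assembled with the standard library as follows. The names `build_wayback_url` and `build_request` are hypothetical; the sketch only constructs the request and does not send it.

```python
from urllib.request import Request

WAYBACK_PREFIX = "https://web.archive.org/web"
HEADERS = {
    "Accept": "text/html",
    "User-Agent": "TechStackAgent/1.0 (OSINT research)",
}

def build_wayback_url(timestamp, original_url):
    """Combine a snapshot timestamp and the original URL into a Wayback URL."""
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"

def build_request(timestamp, original_url):
    """Create a request object carrying the headers listed above."""
    return Request(build_wayback_url(timestamp, original_url), headers=HEADERS)
```

The returned object can be passed to `urllib.request.urlopen(request, timeout=...)` with a generous timeout, since archived pages can be slow to load.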
4. compare_snapshots
Detect technology changes between snapshots.
Comparison Points:
{
  "headers_to_compare": [
    "Server",
    "X-Powered-By",
    "Set-Cookie"
  ],
  "html_elements": [
    "meta[name=generator]",
    "script[src]",
    "link[href]"
  ],
  "patterns_to_track": [
    "/wp-content/",
    "/_next/",
    "/_nuxt/",
    "/static/js/"
  ]
}
Change Detection:
def detect_changes(old_snapshot, new_snapshot):
    changes = []
    # Compare technologies
    old_tech = extract_technologies(old_snapshot)
    new_tech = extract_technologies(new_snapshot)
    added = new_tech - old_tech
    removed = old_tech - new_tech
    for tech in added:
        changes.append({
            "type": "technology_added",
            "technology": tech,
            "first_seen": new_snapshot.timestamp
        })
    for tech in removed:
        changes.append({
            "type": "technology_removed",
            "technology": tech,
            "last_seen": old_snapshot.timestamp
        })
    return changes
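`detect_changes` depends on an `extract_technologies` function that is not defined in this document. The sketch below shows one way it might work for the HTML comparison points, returning a set so that the set differences above apply. The `Snapshot` dataclass, the `PATH_SIGNATURES` mapping, and the technology labels assigned to each path are assumptions made for illustration; response headers are not covered here.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Hypothetical mapping from tracked path patterns to technology labels.
PATH_SIGNATURES = {
    "/wp-content/": "WordPress",
    "/_next/": "Next.js",
    "/_nuxt/": "Nuxt",
    "/static/js/": "Create React App",
}

@dataclass
class Snapshot:
    timestamp: str
    html: str

class _SignalParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.technologies = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # meta[name=generator]: keep the product name, drop the version.
        if tag == "meta" and (attrs.get("name") or "").lower() == "generator":
            content = (attrs.get("content") or "").split()
            if content:
                self.technologies.add(content[0])
        # script[src] and link[href]: look for tracked path patterns.
        ref = None
        if tag == "script":
            ref = attrs.get("src")
        elif tag == "link":
            ref = attrs.get("href")
        if ref:
            for path, tech in PATH_SIGNATURES.items():
                if path in ref:
                    self.technologies.add(tech)

def extract_technologies(snapshot):
    """Return the set of technology labels found in a snapshot's HTML."""
    parser = _SignalParser()
    parser.feed(snapshot.html)
    return parser.technologies
```

Substring matching is used rather than prefix matching so that the patterns can still match when an archived page carries rewritten asset URLs.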
5. detect_migrations
Identify framework/platform migrations.
Common Migration Patterns:
{
  "WordPress → Custom/React": {
    "indicators": [
      "/wp-content/ disappears",
      "React globals appear",
      "/_next/ or /static/js/ paths"
    ],
    "typical_timeline": "6-18 months"
  },
  "AngularJS → Angular": {
    "indicators": [
      "ng-app disappears",
      "ng-version appears",
      "Angular 2+ patterns"
    ],
    "typical_timeline": "12-24 months"
  },
  "jQuery → React/Vue": {
    "indicators": [
      "jQuery CDN removed",
      "Modern framework globals",
      "SPA patterns"
    ],
    "typical_timeline": "6-12 months"
  },
  "On-prem → Cloud": {
    "indicators": [
      "CloudFront/Cloudflare headers appear",
      "AWS/GCP/Azure signatures",
      "CDN usage"
    ],
    "typical_timeline": "3-12 months"
  }
}
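One way to turn these patterns into a check is to compare the technology sets of two snapshots against a rule table, as sketched below. The `MIGRATION_RULES` table and its technology labels are a simplified, hypothetical encoding of three of the patterns above, not the skill's actual matching logic; the indicator strings themselves would need their own detectors.

```python
# Hypothetical rule table: a migration matches when something in "from"
# disappears and something in "to" appears between two snapshots.
MIGRATION_RULES = [
    {
        "type": "WordPress → Custom/React",
        "from": {"WordPress"},
        "to": {"React", "Next.js", "Create React App"},
    },
    {
        "type": "AngularJS → Angular",
        "from": {"AngularJS"},
        "to": {"Angular"},
    },
    {
        "type": "jQuery → React/Vue",
        "from": {"jQuery"},
        "to": {"React", "Vue"},
    },
]

def detect_migrations(old_tech, new_tech):
    """Match removed/added technologies against the rule table."""
    removed = old_tech - new_tech
    added = new_tech - old_tech
    matches = []
    for rule in MIGRATION_RULES:
        gone = removed & rule["from"]
        arrived = added & rule["to"]
        if gone and arrived:
            matches.append({
                "type": rule["type"],
                "from": sorted(gone),
                "to": sorted(arrived),
            })
    return matches
```

Requiring both a removal and an addition avoids reporting a migration when a site merely adds a framework alongside its existing stack.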
6. extract_historical_tech
Parse archived HTML for technology signals.
Process:
- Fetch archived page
- Apply same analysis as html_content_analysis skill
- Record technologies with timestamp
- Build timeline of technology usage
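The last two steps of the process above can be sketched as a small aggregation over per-snapshot results. The function name `build_timeline` and the input shape (pairs of an ISO date string and the technologies detected on that date) are assumptions; the output fields mirror the `technology_timeline` entries shown in the Output section.

```python
def build_timeline(observations):
    """Aggregate (date, technologies) pairs into first/last-seen entries.

    Dates are ISO strings (YYYY-MM-DD), so string order is date order.
    """
    ordered = sorted(observations, key=lambda item: item[0])
    if not ordered:
        return []
    newest_techs = set(ordered[-1][1])
    seen = {}  # technology -> (first_seen, last_seen)
    for date, techs in ordered:
        for tech in techs:
            first, _ = seen.get(tech, (date, date))
            seen[tech] = (first, date)
    timeline = []
    for tech, (first, last) in seen.items():
        # A technology counts as current if the newest snapshot shows it.
        current = tech in newest_techs
        timeline.append({
            "technology": tech,
            "first_seen": first,
            "last_seen": "present" if current else last,
            "status": "current" if current else "removed",
        })
    return sorted(timeline, key=lambda entry: entry["first_seen"])
```

Because only a handful of snapshots are analyzed, `first_seen` and `last_seen` are bounded by snapshot dates and should be read as approximate.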
Output
{
  "skill": "web_archive_analysis",
  "domain": "string",
  "results": {
    "archive_coverage": {
      "oldest_snapshot": "2015-03-15",
      "newest_snapshot": "2024-01-10",
      "total_snapshots": 450,
      "snapshots_analyzed": 5
    },
    "snapshots_analyzed": [
      {
        "timestamp": "2024-01-10",
        "url": "https://web.archive.org/web/20240110/...",
        "technologies_detected": ["Next.js", "React", "Vercel"]
      },
      {
        "timestamp": "2022-06-15",
        "url": "https://web.archive.org/web/20220615/...",
        "technologies_detected": ["React", "Create React App", "Heroku"]
      },
      {
        "timestamp": "2020-01-20",
        "url": "https://web.archive.org/web/20200120/...",
        "technologies_detected": ["WordPress", "PHP"]
      }
    ],
    "technology_timeline": [
      {
        "technology": "WordPress",
        "first_seen": "2015-03-15",
        "last_seen": "2020-06-01",
        "status": "removed"
      },
      {
        "technology": "React",
        "first_seen": "2020-03-01",
        "last_seen": "present",
        "status": "current"
      },
      {
        "technology": "Next.js",
        "first_seen": "2023-01-15",
        "last_seen": "present",
        "status": "current"
      }
    ],
    "migrations_detected": [
      {
        "type": "CMS → Modern Framework",
        "from": "WordPress",
        "to": "React/Next.js",
        "approximate_date": "2020-Q1 to 2020-Q2",
        "confidence": 85
      },
      {
        "type": "Hosting Migration",
        "from": "Heroku",
        "to": "Vercel",
        "approximate_date": "2023-Q1",
        "confidence": 80
      }
    ],
    "current_vs_historical": {
      "current_stack": ["Next.js", "React", "Vercel"],
      "historical_stack": ["WordPress", "PHP", "Heroku"],
      "major_changes": 2
    }
  },
  "evidence": [
    {
      "type": "archived_snapshot",
      "timestamp": "string",
      "archive_url": "string",
      "technologies": ["array"],
      "analysis_timestamp": "ISO-8601"
    }
  ]
}
Rate Limiting
- Wayback CDX API: 15 requests/minute
- Archived page fetches: 10/minute
- Cache CDX results to avoid repeated queries
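A simple way to stay within these limits is to enforce a minimum spacing between requests: 4 seconds for the CDX API (60 / 15) and 6 seconds for page fetches (60 / 10). The class below is a minimal sketch of that idea; the name `MinIntervalLimiter` is hypothetical, and the clock and sleep functions are injectable only to make the behavior easy to test.

```python
import time

class MinIntervalLimiter:
    """Space calls so that at most `requests_per_minute` happen per minute."""

    def __init__(self, requests_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / requests_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None

    def wait(self):
        """Block until the next request is allowed, then reserve a slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.interval

cdx_limiter = MinIntervalLimiter(15)   # Wayback CDX API
page_limiter = MinIntervalLimiter(10)  # archived page fetches
```

Calling `cdx_limiter.wait()` before each CDX query and `page_limiter.wait()` before each page fetch keeps the two budgets independent.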
Error Handling
- 404 (or an empty CDX result): Domain not archived
- 503: Wayback Machine overloaded - retry with backoff
- Timeout: Increase timeout for archived pages (can be slow)
- Continue with available snapshots on partial failures
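The 404 and 503 cases above can be sketched as a small retry wrapper. This is an illustration under stated assumptions: `fetch_with_backoff` is a hypothetical name, `fetch` is any zero-argument callable that raises `urllib.error.HTTPError` on HTTP failures, and the exponential delays (2 s, 4 s, 8 s) are arbitrary defaults rather than values from this skill.

```python
import time
from urllib.error import HTTPError

def fetch_with_backoff(fetch, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch(); return None on 404, retry 503 with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except HTTPError as err:
            if err.code == 404:
                return None  # not archived: let the caller skip this snapshot
            if err.code != 503 or attempt == max_attempts - 1:
                raise  # other errors, or retries exhausted
            sleep(base_delay * 2 ** attempt)
```

Returning `None` for a missing snapshot lets the caller continue with the snapshots that did load, in line with the partial-failure rule above.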
Security Considerations
- Only access public archives
- Respect Wayback Machine rate limits
- Do not store archived content beyond analysis
- Note that archived content may contain outdated security vulnerabilities
- Log all queries for audit
Confidence Notes
Historical data provides contextual signals rather than direct evidence:
- Confirms technology transitions
- Validates current technology choices
- Carries lower weight than current direct evidence
- Base confidence: 60-75%