X/Twitter CDP Tweet Scraper v2

Extracts authentication from a Chromium browser via CDP, then calls the Twitter GraphQL API directly with httpx for high-speed bulk tweet collection. Outputs JSON + Markdown.

Quick Start

# By date range (recommended)
python3 <skill-path>/scripts/cdp_tweet_fetcher.py <username> --since 2026-02 [--until 2026-02-28] [--output-dir DIR]

# By year (shorthand)
python3 <skill-path>/scripts/cdp_tweet_fetcher.py <username> --year 2026 [--output-dir DIR]

# No date specified -> defaults to current year to date
python3 <skill-path>/scripts/cdp_tweet_fetcher.py <username> [--output-dir DIR]

Arguments:

username (required): Twitter username (without @)
--since: Start date (inclusive). Accepts YYYY-MM-DD, YYYY-MM, YYYY
--until: End date (inclusive), defaults to today
--year: Target year (shorthand for --since YYYY-01-01)
--output-dir: Output directory, defaults to current working directory
--page-delay: Seconds between API pages, default 1.0
--max-pages: Maximum pages to fetch, default 200
--cdp-port: CDP debugging port, default 9222

Output files:

{username}_tweets_{YYYYMMDD}_{YYYYMMDD}.json — Full structured data
{username}_tweets_{YYYYMMDD}_{YYYYMMDD}.md — Human-readable Markdown report

Prerequisites

Chromium-based browser (Chrome / Edge / Brave / Arc / Chromium) installed and logged in to Twitter/X
Python dependencies: pip install playwright httpx && playwright install chromium
Browser must be launched with CDP enabled — the script auto-detects your OS and shows the correct launch command

Launch browser with CDP (pick your browser):

macOS:

# Chrome
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222
# Edge
/Applications/Microsoft\ Edge.app/Contents/MacOS/Microsoft\ Edge --remote-debugging-port=9222
# Brave
/Applications/Brave\ Browser.app/Contents/MacOS/Brave\ Browser --remote-debugging-port=9222
# Arc
/Applications/Arc.app/Contents/MacOS/Arc --remote-debugging-port=9222

Linux:

google-chrome --remote-debugging-port=9222
# or: chromium-browser / microsoft-edge / brave-browser

Windows (PowerShell):

& "C:\Program Files\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222
# or: msedge.exe / brave.exe in their respective paths

Architecture — Why It's Fast

v2 minimizes browser interaction: authentication is extracted once (~5 s), then all API calls go through httpx directly.

Change	Effect
Browser only used for cookie + query ID extraction (one-time)	Init drops from ~18 s to ~5 s
httpx direct HTTP requests (no CDP round-trips)	Each API call 3-5x faster
Single endpoint: UserTweetsAndReplies (superset)	Half the pagination
count=40 (was 20)	Half the pagination again
page-delay 1.0 s (was 2.5 s)	60% less wait per page

Execution Flow

Connect to browser: CDP connection, extract cookies and CSRF token
Discover API: Parse JS bundles for GraphQL query IDs (no page navigation)
HTTP client: Create httpx client with extracted auth
Resolve user ID: Via UserByScreenName API
Bulk fetch: UserTweetsAndReplies endpoint, 40 tweets/page, direct HTTP
Output: JSON + Markdown files

Output Schema

Each tweet contains:

{
  "tweet_id": "123456789",
  "text": "Full tweet text...",
  "datetime": "2026-01-15T10:30:00+00:00",
  "url": "https://x.com/user/status/123456789",
  "author": "username",
  "is_reply": false,
  "reply_to": null,
  "reply_to_tweet_id": null,
  "is_retweet": false,
  "retweet_of": null,
  "is_quote": false,
  "quoted_tweet_url": null,
  "likes": 100,
  "retweets": 20,
  "replies": 5,
  "bookmarks": 30,
  "views": 10000,
  "media": ["https://pbs.twimg.com/..."],
  "links": ["https://example.com/..."]
}

Fetching Twitter Articles (Long-Form Posts)

Links in tweets matching x.com/i/article/{article_id} are Twitter Articles (long-form posts). Article content is NOT in the tweet API response and requires additional steps to extract.

Usage

python3 <skill-path>/scripts/fetch_articles.py <tweets_json> [--output-dir DIR] [--cdp-port 9222]

tweets_json: Path to a JSON file previously output by cdp_tweet_fetcher.py
--output-dir: Directory for article Markdown files (default: ./articles)
--cdp-port: CDP debug port (default: 9222)

Technical Details

API endpoint: TweetResultByRestId (GET), queried with the article's associated tweet_id
Critical parameter: fieldToggles: {"withArticleRichContentState": true, "withArticlePlainText": false}
Data location: data.tweetResult.result.article.article_results.result.content_state
Content format: Draft.js — blocks (paragraphs / headings / lists / blockquotes / code blocks / atomic) + entityMap (links / media references)
Type pitfall: entityMap is sometimes a dict (keyed by string index) and sometimes a list (indexed by position) — must handle both

Browser Request Parameters (Verified 2026-02)

{
  "variables": {
    "tweetId": "<tweet_id>",
    "includePromotedContent": true,
    "withBirdwatchNotes": true,
    "withVoice": true,
    "withCommunity": true
  },
  "fieldToggles": {
    "withArticleRichContentState": true,
    "withArticlePlainText": false
  }
}

Lessons Learned

Date	Lesson	Action
2026-02-27	Twitter timeline is reverse-chronological; if all tweets on a page are before the target date, subsequent pages are even older	Added early-stop condition to script
2026-02-27	DOM validation wastes time when API phase yields 0 tweets in target range	Skip DOM validation directly
2026-02-27	DOM scrolling has no date filter, collects many irrelevant IDs	Snowflake ID date filtering
2026-02-27	v1 architecture bottleneck: page.evaluate relay is slow, dual endpoints redundant, count=20 conservative, DOM validation heavy	v2 rewrite: httpx direct, single endpoint, count=40, DOM disabled by default
2026-02-28	Twitter Article (long-form) content is NOT in the tweet text; requires separate extraction	Added `fetch_articles.py` script
2026-02-28	`TweetResultByRestId` does not return article body by default; requires `fieldToggles: {"withArticleRichContentState": true}`	Critical parameter documented
2026-02-28	Article content is in Draft.js format (`content_state.blocks` + `entityMap`); `entityMap` can be either a list or a dict	Script handles both types
2026-02-28	Playwright CDP `dialog` event dismiss can throw `ProtocolError: No dialog is showing` and kill the Node process	Must wrap in `try/except`
2026-02-28	When facing unknown API behavior, trace the browser's actual request parameters first, then write scraping code	Methodology: observe before guessing

x-cdp-scraper