ๅฐ็บขไนฆ Research ๐Ÿ“•

Research tool for Chinese user-generated content โ€” travel, food, lifestyle, local discoveries.

When to Use

  • Travel planning and itineraries
  • Restaurant/cafe/bar recommendations
  • Activity and weekend planning
  • Product reviews and comparisons
  • Local discovery and hidden gems
  • Any question where Chinese perspectives help

Recommended Model

When spawning as a sub-agent: Sonnet 4.5 (model: "claude-sonnet-4-5-20250929")

  • Fast: the bottleneck is the slow XHS API, not the model
  • Strong Chinese content understanding
  • More cost-effective than Opus for research grunt work
  • Opus is overkill for a search → synthesize workflow

Context Management (Always Use)

ALWAYS use dynamic context monitoring โ€” even 5 posts with images can hit 75-300k tokens.

The Problem

  • Each post with images = 15-60k tokens
  • 200k context fills fast
  • Context is append-only (can't "forget" within session)

The Solution: Monitor + Checkpoint + Continue

1. After EACH post, do two things:

a) Write findings to disk immediately:
   /research/{task-id}/findings/post-{n}.md

b) Check context usage:
   session_status โ†’ look for "Context: XXXk/200k (YY%)"

2. When context hits 70%, STOP and checkpoint:

Write state file:
/research/{task-id}/state.json
{
  "processed": 15,
  "pendingUrls": ["url16", "url17", ...],
  "summaries": ["Post 1: ็ซๅก˜...", ...]
}

Return to caller:
{
  "complete": false,
  "processed": 15,
  "remaining": 25,
  "statePath": "/research/{task-id}/state.json",
  "findingsDir": "/research/{task-id}/findings/"
}

3. Caller spawns fresh sub-agent to continue:

spawn_subagent(
  task="Continue XHS research from /research/{task-id}/state.json",
  model="claude-sonnet-4-5-20250929"
)

New sub-agent has fresh 200k context, reads state.json, continues from post 16.
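The 70% check can be sketched as a small parser over the session_status output; the exact line format is assumed from the example above ("Context: XXXk/200k (YY%)"):

```python
import re

def should_checkpoint(status_text: str, threshold: float = 0.70) -> bool:
    """Decide whether to stop and write state.json.

    Assumes session_status prints a line like "Context: 145k/200k (72%)".
    """
    m = re.search(r"Context:\s*(\d+)k/(\d+)k", status_text)
    if not m:
        # Can't parse; keep going, but treat repeated failures as a signal
        # to checkpoint early rather than risk overflowing context.
        return False
    used, total = int(m.group(1)), int(m.group(2))
    return used / total >= threshold
```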

State File Schema

{
  "taskId": "kunming-food-2026-02-01",
  "query": "ๆ˜†ๆ˜Ž็พŽ้ฃŸ",
  "searchesCompleted": ["ๆ˜†ๆ˜Ž็พŽ้ฃŸ", "ๆ˜†ๆ˜Ž็พŽ้ฃŸๆŽจ่"],  // Keywords already searched
  "processedUrls": ["url1", "url2", ...],             // Explicit URL tracking (prevents duplicates)
  "pendingUrls": ["url3", "url4", ...],               // Remaining URLs to process
  "nextPostNumber": 16,                                // Next post-XXX.md number
  "summaries": [                                       // 1-liner per post for final synthesis
    "Post 1: ็ซๅก˜้คๅŽ… | ๐ŸŸข | ยฅ80 | ๆœฌๅœฐไบบๆŽจ่",
    "Post 2: ้‡Ž็”Ÿ่Œ็ซ้”… | ๐ŸŸข | ยฅ120 | ่Œๅญๆ–ฐ้ฒœ"
  ],
  "batchNumber": 1,
  "contextCheckpoint": "70%"
}

Critical fields for handoff:

  • processedUrls: Prevents re-processing same post across sub-agents
  • pendingUrls: Exact work remaining
  • nextPostNumber: Ensures sequential file naming
  • searchesCompleted: Prevents duplicate searches
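Under those schema assumptions, a minimal sketch of the load/enqueue/save cycle (field names taken from the schema above; adjust if the real state file differs):

```python
import json
from pathlib import Path

def load_state(task_dir: str) -> dict:
    """Read state.json if present; otherwise start fresh."""
    path = Path(task_dir) / "state.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"searchesCompleted": [], "processedUrls": [],
            "pendingUrls": [], "nextPostNumber": 1, "summaries": []}

def enqueue_urls(state: dict, found_urls: list) -> None:
    """Add newly found URLs, skipping anything already processed or pending."""
    seen = set(state["processedUrls"]) | set(state["pendingUrls"])
    state["pendingUrls"] += [u for u in found_urls if u not in seen]

def save_state(task_dir: str, state: dict) -> None:
    # ensure_ascii=False keeps Chinese summaries readable in the file
    Path(task_dir, "state.json").write_text(
        json.dumps(state, ensure_ascii=False, indent=2), encoding="utf-8")
```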

Workflow for Large Research

Caller should use longer timeout:

sessions_spawn(
  task="...",
  model="claude-sonnet-4-5-20250929",
  runTimeoutSeconds=1800  // 30 minutes for research tasks
)

Default is 600s (10 min) โ€” too short for XHS research with slow API calls.

Interleave search and processing (don't collect all URLs first):

[XHS Sub-agent 1]
    โ”œโ”€โ”€ Check for state.json (none = fresh start)
    โ”œโ”€โ”€ Search keyword 1 โ†’ get 20 URLs
    โ”œโ”€โ”€ Process 5-10 posts immediately (writing each to disk)
    โ”œโ”€โ”€ Search keyword 2 โ†’ get more URLs (dedupe)
    โ”œโ”€โ”€ Process more posts
    โ”œโ”€โ”€ Context hits 70% โ†’ write state.json
    โ””โ”€โ”€ Return {complete: false, remaining: N}

This prevents timeout from losing all work โ€” each post is saved as processed.
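The interleaved loop can be sketched with the search call, per-post processing, and context check injected as callables, since the real bin/search invocation and session_status check live outside this snippet:

```python
def run_batch(state, search, process, check_context, keywords):
    """Interleaved search/process loop from the sketch above.

    `search(kw)` stands in for bin/search, `process(url)` for
    fetch-post + analysis + writing post-NNN.md, and `check_context()`
    for the 70% session_status check.
    """
    for kw in keywords:
        if kw not in state["searchesCompleted"]:
            seen = set(state["processedUrls"]) | set(state["pendingUrls"])
            state["pendingUrls"] += [u for u in search(kw) if u not in seen]
            state["searchesCompleted"].append(kw)
        while state["pendingUrls"]:
            url = state["pendingUrls"].pop(0)
            process(url)                       # writes findings to disk
            state["processedUrls"].append(url)
            if check_context():                # context at 70%? checkpoint
                return {"complete": False,
                        "remaining": len(state["pendingUrls"])}
    return {"complete": True, "remaining": 0}
```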

Full continuation pattern:

[Caller]
    โ†“ spawn (runTimeoutSeconds=1800)
[XHS Sub-agent 1]
    โ”œโ”€โ”€ Search + process interleaved
    โ”œโ”€โ”€ Context hits 70% โ†’ write state.json
    โ””โ”€โ”€ Return {complete: false, remaining: 25}
    
[Caller sees incomplete]
    โ†“ spawn continuation (runTimeoutSeconds=1800)
[XHS Sub-agent 2]  โ† fresh 200k context!
    โ”œโ”€โ”€ Read state.json (has processedUrls, pendingUrls)
    โ”œโ”€โ”€ Continue processing + more searches if needed
    โ”œโ”€โ”€ Context hits 70% โ†’ write state.json
    โ””โ”€โ”€ Return {complete: false, remaining: 10}
    
[Caller sees incomplete]
    โ†“ spawn continuation
[XHS Sub-agent 3]
    โ”œโ”€โ”€ Read state.json
    โ”œโ”€โ”€ Process remaining posts
    โ”œโ”€โ”€ All done โ†’ write synthesis.md
    โ””โ”€โ”€ Return {complete: true, synthesisPath: "..."}

Output Directory Structure

/research/{task-id}/
โ”œโ”€โ”€ state.json              # Checkpoint for continuation
โ”œโ”€โ”€ findings/
โ”‚   โ”œโ”€โ”€ post-001.md         # Full analysis + image paths
โ”‚   โ”œโ”€โ”€ post-002.md
โ”‚   โ””โ”€โ”€ ...
โ”œโ”€โ”€ images/
โ”‚   โ”œโ”€โ”€ post-001/
โ”‚   โ”‚   โ”œโ”€โ”€ 1.jpg
โ”‚   โ”‚   โ””โ”€โ”€ 2.jpg
โ”‚   โ””โ”€โ”€ ...
โ”œโ”€โ”€ summaries.md            # All 1-liners (for quick scan)
โ””โ”€โ”€ synthesis.md            # Final output (when complete)

Key Rules (ALWAYS FOLLOW)

  1. Write after EVERY post โ€” crash-safe, no work lost
  2. Check context after EVERY post โ€” use session_status tool
  3. Stop at 70% โ€” leave room for synthesis + buffer
  4. Return structured result โ€” caller decides next step
  5. Read all images โ€” they're pre-compressed (600px, q85)
  6. Skip videos โ€” already marked in fetch-post

โš ๏ธ This is not optional. Even small research can overflow context with image-heavy posts.


Scripts (Mechanical Tasks)

These scripts handle the repetitive CLI work:

| Script | Purpose |
|--------|---------|
| `bin/preflight` | Verify tool is working before research |
| `bin/search "keywords" [limit] [timeout] [sort]` | Search for posts (sort: general/newest/hot) |
| `bin/get-content "url"` | Get full note content (text only) |
| `bin/get-comments "url"` | Get comments on a note |
| `bin/get-images "url" [dir]` | Download images only |
| `bin/fetch-post "url" [cache] [retries]` | Fetch content + comments + images (with retries) |

All scripts are at /root/clawd/skills/xhs/bin/

Preflight (always run first)

/root/clawd/skills/xhs/bin/preflight

Checks: rednote-mcp installed, cookies valid, stealth patches, test search. Don't proceed until preflight passes.

Search

/root/clawd/skills/xhs/bin/search "ๆ˜†ๆ˜Ž็พŽ้ฃŸๆŽจ่" [limit] [timeout] [sort]

Returns JSON with post results.

Parameters:

| Param | Default | Description |
|-------|---------|-------------|
| keywords | (required) | Search terms in Chinese |
| limit | 10 | Max results (scroll pagination when >20) |
| timeout | 180 | Seconds before giving up |
| sort | general | Sort order (see below) |

Sort options:

| Value | XHS label | When to use |
|-------|-----------|-------------|
| general | 综合 | Default: the XHS algorithm balances relevance + engagement. Best for most research. |
| newest | 最新 | Public-opinion monitoring (舆情监控), breaking news, recent experiences, time-sensitive topics |
| hot | 最热 | Finding viral/popular posts, trending content |

Examples:

# Default sort (recommended for most research)
bin/search "ๆ˜†ๆ˜Ž็พŽ้ฃŸๆŽจ่" 20

# Recent posts first (public-opinion monitoring, current events)
bin/search "ๆŸๅ“็‰Œ ่ฏ„ไปท" 20 180 newest

# Most popular posts
bin/search "็ฝ‘็บขๆ‰“ๅกๅœฐ" 15 180 hot

Scroll pagination enabled (patched): When limit > 20, the tool scrolls to load more results via XHS infinite scroll. Actual results depend on available content.

For maximum coverage, combine:

  1. Higher limits (e.g., limit=50) to scroll for more
  2. Multiple keyword variations for different result sets:
    • ้ฆ™่•‰ๆ”€ๅฒฉ, ้ฆ™่•‰ๆ”€ๅฒฉ้ฆ†, ้ฆ™่•‰ๆ”€ๅฒฉไฝ“้ชŒ, ้ฆ™่•‰ๆ”€ๅฒฉ่ฏ„ไปท
    • ๆ˜†ๆ˜Ž็พŽ้ฃŸ, ๆ˜†ๆ˜Ž็พŽ้ฃŸๆŽจ่, ๆ˜†ๆ˜Žๅฟ…ๅƒ, ๆ˜†ๆ˜ŽๆœฌๅœฐไบบๆŽจ่

Results vary by query โ€” popular topics may return 30-50+, niche topics fewer.
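When combining keyword variations, dedupe by URL across batches before fetching. This sketch assumes each search result carries a "url" field with the full xsec_token link; the actual JSON shape from bin/search may differ:

```python
def merge_search_results(batches: list) -> list:
    """Merge result lists from several keyword variations,
    keeping the first occurrence of each post URL."""
    seen, merged = set(), []
    for batch in batches:
        for post in batch:
            if post["url"] not in seen:
                seen.add(post["url"])
                merged.append(post)
    return merged
```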

Choosing sort order:

  • Most research โ†’ general (default). Let XHS's algorithm surface the best content.
  • ่ˆ†ๆƒ…็›‘ๆŽง / sentiment tracking โ†’ newest. You want recent opinions, not old viral posts.
  • Trend discovery โ†’ hot. See what's currently popular.

Get Content

/root/clawd/skills/xhs/bin/get-content "FULL_URL_WITH_XSEC_TOKEN"

โš ๏ธ Must use full URL with xsec_token from search results.

Get Comments

/root/clawd/skills/xhs/bin/get-comments "FULL_URL_WITH_XSEC_TOKEN"

Get Images

Download all images from a post to local files:

/root/clawd/skills/xhs/bin/get-images "FULL_URL" /tmp/my-images

Fetch Post (Deep Dive with Images)

Fetch content, comments, and images in one call โ€” with built-in retries:

/root/clawd/skills/xhs/bin/fetch-post "FULL_URL" /path/to/cache [max_retries]

Features:

  • Retries on timeout (60s โ†’ 90s โ†’ 120s)
  • Clear error reporting in JSON output
  • Images cached locally, bypassing CDN protection

Returns JSON:

{
  "success": true,
  "postId": "abc123",
  "content": { 
    "title": "...", 
    "author": "...", 
    "desc": "...", 
    "likes": "983", 
    "tags": [...],
    "postDate": "2025-09-04"  // โ† Added via patch!
  },
  "comments": [{ "author": "...", "content": "...", "likes": "3" }, ...],
  "imagePaths": ["/cache/images/abc123/1.jpg", ...],
  "errors": []
}

Date filtering: Use postDate to filter out old posts. Skip posts older than your threshold (e.g., 6-12 months for restaurants).
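A minimal postDate filter, assuming the YYYY-MM-DD format shown above and treating a null date as "keep" so the keyword-hint fallback (described later) can decide:

```python
from datetime import date, timedelta

def is_recent(post_date, max_age_days: int = 365) -> bool:
    """Apply the date threshold. A missing postDate passes through."""
    if not post_date:
        return True  # defer to keyword-hint fallback
    y, m, d = map(int, post_date.split("-"))
    return date(y, m, d) >= date.today() - timedelta(days=max_age_days)
```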

Workflow:

1. fetch-post โ†’ JSON + cached images
2. Read each imagePath directly (Claude sees images natively)
3. Combine text + comments + what you see into findings

Viewing images:

Read("/path/to/1.jpg")  # Claude sees it directly - no special tool needed

Look for: visible text (addresses, prices, hours), atmosphere, food presentation, crowd levels.


Research Methodology (Judgment Tasks)

This is where you think. Scripts do the fetching; you do the analyzing.

Depth Levels

| Depth | Posts | When to use |
|-------|-------|-------------|
| Minimum | 5+ | Quick checks, simple queries |
| Standard | 8-10 | Default for most research |
| Deep | 15+ | Complex topics, trip planning |

Minimum is 5 โ€” unless fewer exist. Note limited coverage if <5 results.

Research Workflow

Step 0: Preflight

Run bin/preflight. Don't proceed until it passes.

Step 1: Plan Your Searches

Think: "What would a Chinese user search on ๅฐ็บขไนฆ?"

  • Include location when relevant
  • Add qualifiers: 推荐 (recommended), 攻略 (guide), 测评 (review), 探店 (shop visit), 打卡 (check-in spot), 避雷 (avoid this)
  • Consider synonyms and variations
  • Plan 2-3 different search angles

Date filtering: Posts include postDate field (e.g., "2025-09-04"). The calling agent specifies the date filter based on research type:

| Research type | Suggested filter | Why |
|---------------|------------------|-----|
| 舆情监控 (sentiment) | 1-4 weeks | Only current discourse matters |
| Breaking news/events | 1-7 days | Time-critical |
| Travel planning | 6-12 months | Recent but reasonable window |
| Product reviews | 1-2 years | Longer product cycles |
| Trend analysis | Custom range | Compare specific periods |
| Historical/general | No limit | Want the full archive |

Caller should specify in task description, e.g.:

  • "Only posts from last 30 days" (่ˆ†ๆƒ…)
  • "Posts from 2025 or later" (travel)
  • "No date filter" (general research)

If no filter specified: Default to 12 months (safe middle ground).

Fallback when postDate is null: use keyword hints such as 2025, 最近 (recently), 最新 (latest)
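That fallback is a simple substring scan over the title/description, using the hint keywords listed above:

```python
def looks_recent(text: str, hints=("2025", "最近", "最新")) -> bool:
    """Recency heuristic for posts whose postDate is null:
    any hint keyword in the title/description counts as recent."""
    return any(h in text for h in hints)
```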

Language strategy:

| Location | Language | Example |
|----------|----------|---------|
| China | Chinese | 昆明攀岩 |
| English-named venues | Both | Rock Tenet 昆明 |
| International | Chinese | 巴黎旅游 |

Step 2: Search & Scan

Run your searches. Results are already ranked by XHS's algorithm (relevance + engagement).

Use judgment based on preview โ€” like a human deciding what to click:

Think: "Given my research goal, would this post likely contain useful information?"

| Research type | What to prioritize |
|---------------|--------------------|
| 舆情监控 (sentiment) | Any opinion/experience, even low engagement. Complaints matter! |
| Travel planning | High engagement + detailed experiences |
| Product reviews | Mix of positive AND negative reviews |
| Trend analysis | Variety of perspectives |

| Preview signal | Action |
|----------------|--------|
| Relevant content in preview | ✅ Fetch |
| Matches research goal | ✅ Fetch |
| Low engagement but relevant opinion | ✅ Fetch (esp. for 舆情) |
| High engagement but off-topic | ❌ Skip |
| Official announcements only | ⚠️ Context-dependent |
| 广告/合作 (ad/sponsorship) markers | ⚠️ Note as sponsored if fetching |
| Clearly off-topic | ❌ Skip |
| Duplicate content | ❌ Skip |

Key insight: For ่ˆ†ๆƒ…็›‘ๆŽง, a 3-like complaint post may be more valuable than a 500-like promotional post. Engagement โ‰  relevance for all research types.

Step 3: Deep Dive Each Post

For each selected post, use fetch-post to get everything:

bin/fetch-post "url_from_search" {{RESEARCH_DIR}}/xhs

Returns JSON with content, comments, and cached images. Has built-in retries. Then:

A. Review content

  • Extract key facts from title/description
  • Note author's perspective/bias
  • Check tags for categorization

B. View images (critical!)
For each imagePath in the result, just read it:

Read("/path/to/1.jpg")  # You see it directly

  • Look for text overlays: addresses, prices, hours
  • Note visual details: ambiance, crowd levels, food presentation

โš ๏ธ Don't describe images in isolation. Synthesize what you see with the post content and comments to form a holistic view. An image of a crowded restaurant + author saying "ๅ‘จๆœซๆŽ’้˜Ÿ1ๅฐๆ—ถ" + comments confirming "ไบบ่ถ…ๅคš" = that's your finding about crowds.

C. Review comments (gold for updates)

  • "ๅทฒ็ปๅ…ณ้—จไบ†" = already closed
  • Real experiences vs sponsored hype
  • Tips not in main post

D. Return picked images
Include paths to the best/most informative images in your findings. The calling agent decides whether and how to use them (embed in reports, reference, etc.). You're curating: pick images that show something useful (venue exterior, menu with prices, actual food, atmosphere), not just decorative shots.

Step 4: Synthesize

  • What do multiple sources agree on?
  • Any contradictions?
  • What's the overall consensus?
  • What would you actually recommend?

Step 5: Output

Facts + Flavor โ€” structured findings that preserve the XHS voice.

## XHS Research: [Topic]

### Search Summary
| Search | Results | Notes |
|--------|---------|-------|
| ๆ˜†ๆ˜Žๆ”€ๅฒฉ | 10 | Good coverage |

### Findings

#### [Venue Name] (ไธญๆ–‡ๅ)
- **Type:** Restaurant / Activity / Attraction
- **Address:** [from post or image]
- **Price:** ยฅXX/person
- **Hours:** [if found]
- **The vibe:** [atmosphere, energy โ€” preserved voice]
- **Why people like it:** [opinions, impressions]
- **Watch out for:** [warnings from comments]
- **Source:** [full URL]
- **Engagement:** X likes
- **Images:** [paths for calling agent to use]
  - `/path/to/1.jpg` โ€” exterior/entrance
  - `/path/to/3.jpg` โ€” menu with prices

> "ๅผ•็”จๅŽŸๆ–‡..." โ€” @username

### Overall Impressions
- Consensus across posts
- Patterns in preferences
- Things only locals know
- Disagreements worth noting

The XHS value is the human perspective. A recommendation that says "环境一般但是味道绝了" (the ambience is average but the food is amazing) tells you more than "Rating: 4.2/5".

Think: "What would a friend who just spent an hour on XHS tell me?"


Quality Signals

Trustworthy:

  • 100+ likes with real comments
  • Detailed personal experience
  • Multiple photos from actual visit
  • Specific details (prices, hours)
  • Recent posts (look for date mentions in content: "上周" last week, "昨天" yesterday, "2025年X月" a specific 2025 month)
  • Year in title (e.g., "2025上海咖啡必喝榜", a 2025 Shanghai must-drink coffee list)

Checking recency:

  • Look for dates in post text/title
  • Check if prices seem current
  • Comments asking "还在吗" or "现在还有吗" (is it still there / still open?) = might be outdated
  • Comments with recent dates confirm post is still relevant

Suspicious:

  • ๅนฟๅ‘Š/ๅˆไฝœ/่ตžๅŠฉ markers
  • Overly positive, no specifics
  • Stock photos only
  • No comments or generic ones
  • Very old posts

Timing & Efficiency

XHS is SLOW โ€” Plan Accordingly

The rednote-mcp CLI is slow (30-90s per search). Don't rapid-fire poll.

When running searches via exec:

# GOOD: Give it time to complete
exec(command, yieldMs: 60000)  # Wait 60s before checking
process(poll)  # Then poll every 30s if still running

DON'T:

  • Poll every 2-3 seconds (wastes tokens, no benefit)
  • Start multiple searches simultaneously (overloads)
  • Wait indefinitely without writing partial results

Write Incrementally

Don't wait until you've analyzed everything to start writing. After each batch of 3-5 posts:

  • Append findings to your output file
  • This protects against timeout/termination losing all work

For example:

## Findings (in progress)

### Batch 1: 美食搜索 (food search, 3 posts analyzed)
[findings...]

### Batch 2: 攻略搜索 (guide search, analyzing...)
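The append step can be as simple as opening the output file in append mode, so earlier batches survive any later termination:

```python
def append_findings(output_path: str, batch_header: str, findings_md: str) -> None:
    """Append one batch of findings; earlier batches are never rewritten,
    so a timeout can't erase completed work."""
    with open(output_path, "a", encoding="utf-8") as f:
        f.write(f"\n### {batch_header}\n\n{findings_md}\n")
```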

Time Budget Awareness

If you've been running 15+ minutes:

  • Prioritize writing what you have
  • Note incomplete searches in output
  • Better to deliver 80% findings than lose 100% to termination

Retry Pattern

rednote-mcp is slow. If a command times out:

Attempt 1: default timeout
Attempt 2: +60s
Attempt 3: +120s

If all fail, report the failure. Do NOT fall back to web_search โ€” defeats the purpose.
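A sketch of that escalation using subprocess timeouts, assuming the bin/ scripts are invoked as ordinary CLIs:

```python
import subprocess

def run_with_retries(cmd: list, base_timeout: int = 60) -> str:
    """Retry pattern above: base timeout, then +60s, then +120s.
    Re-raises after the third timeout so the failure gets reported
    instead of silently falling back to another tool."""
    last_err = None
    for extra in (0, 60, 120):
        try:
            out = subprocess.run(cmd, capture_output=True, text=True,
                                 timeout=base_timeout + extra, check=True)
            return out.stdout
        except subprocess.TimeoutExpired as e:
            last_err = e
    raise last_err
```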


Error Handling

| Error | Cause | Fix |
|-------|-------|-----|
| Timeout | Network/XHS slow | Retry with longer timeout |
| Login/cookie error | Session expired | `xvfb-run -a rednote-mcp init` |
| 404 / xsec_token | Missing token | Use full URL from search |
| Empty results | No posts | Try different keywords |

Setup & Maintenance

First-Time Setup

npm install -g rednote-mcp
npx playwright install
/root/clawd/skills/xhs/patches/apply-all.sh
xvfb-run -a rednote-mcp init

Re-login (when cookies expire)

xvfb-run -a rednote-mcp init

After rednote-mcp updates

/root/clawd/skills/xhs/patches/apply-all.sh

Role Clarification

This skill = research tool that outputs structured findings.
Calling agent = synthesizes XHS + other sources into final reports and decides which images to embed.

You return:

  • Synthesized findings (text + images + comments โ†’ holistic view)
  • Curated image paths (calling agent decides how to use them)
  • Preserved human voice (opinions, vibes, tips)

You don't:

  • Describe images in isolation ("I see a restaurant...")
  • Generate final reports (that's the caller's job)
  • Decide image layout/placement

XHS is like having a Chinese-speaking friend spend an hour researching for you. They'd give you facts, but also opinions, vibes, and insider tips. That's what you're capturing.


Remember: Research like a curious human. Explore, cross-reference, look at pictures, read comments. The "这家真的绝了" (this place is seriously amazing) matters as much as the address.
