Refurbish Demos

Upgrade demo agent knowledge from old AI-written summaries to properly scraped page content with real sourceUrls. Each knowledge item becomes a URL-type entry with the actual page URL, so the chatbot can cite specific pages in its answers.

Parameter: campaignId (CRM campaign ID) or all-sent (all demos where outreach was sent)

Announce:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Refurbish Demos — upgrading knowledge
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Environment

Read from .env in the working directory:

  • NUXT_MCP_DEMO_TOKEN — Bearer token for InboxMate MCP (contains # character — use curl, not Python urllib)
  • PSQUARED_CRM_TOKEN — Bearer token for Twenty CRM
  • EMAIL_DRAFT_ONLY_BEARER — Bearer token for notification service (used in Option B: all-sent)
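
Before running the curl calls below, load these into the shell. A minimal sketch, assuming .env uses plain KEY=value lines:

set -a          # export everything defined while sourcing
source .env     # NUXT_MCP_DEMO_TOKEN, PSQUARED_CRM_TOKEN, EMAIL_DRAFT_ONLY_BEARER
set +a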

  • MCP endpoint: https://app.psquared.dev/api/mcp
  • CRM endpoint: https://crm.psquared.dev/graphql
  • Demo API: https://app.psquared.dev/api/demo/{demoId}
  • Supabase project: fevtfywriufbqnvbgyrm (for direct DB queries when needed)
  • Demo account ID: 8942a6e5-91cb-4c5d-8ef5-98cfe7945620

MCP Call Template

All MCP calls use this pattern (use curl, NOT Python urllib — the token contains #):

curl -s --max-time 120 -X POST https://app.psquared.dev/api/mcp \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $NUXT_MCP_DEMO_TOKEN" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"TOOL_NAME","arguments":{...}}}'

Available MCP Tools

| Tool | Purpose |
|------|---------|
| cleanup_agent | Wipe all knowledge from the agent's buckets + delete orphaned buckets in the demo account |
| clear_bucket | Remove all items from a specific bucket |
| scrape_and_build_knowledge | Scrape URLs via Tavily (advanced mode), create URL-type knowledge items. Max 10 URLs per call. Returns { created: [], failed: [] } |
| add_to_bucket | Manually add a knowledge item. If sourceUrl is provided, creates a URL-type entry |
| get_agent | Get agent config (check knowledgeBucketIds, buttonIcon) |
| list_bucket_items | List items in a bucket (verify results) |
| update_widget_style | Fix buttonIcon or other widget config |
| publish_agent | Republish the agent (required after knowledge changes) |

PHASE 1 — Build Work List

Announce: [1/4] Building work list...

Option A: By campaign ID

Query CRM for opportunities in the campaign:

curl -s -X POST https://crm.psquared.dev/graphql \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $PSQUARED_CRM_TOKEN" \
  -d "{\"query\":\"{ opportunities(filter: { campaignId: { eq: \\\"CAMPAIGN_ID\\\" } }, first: 150) { edges { node { id name demoStatus demoUrl { primaryLinkUrl } company { id name domainName { primaryLinkUrl } } } } } }\"}"

Option B: All sent demos

Get sent email drafts from notification service, extract demo IDs, then look up agents:

curl -s "https://notifications.psquared.dev/drafts?status=SENT&draftType=outreach&pageSize=200" \
  -H "Authorization: Bearer $EMAIL_DRAFT_ONLY_BEARER"

Extract demoUrl from each draft's variables, parse the ?id= param to get demoId.
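
A sketch of that extraction, assuming the drafts response is saved to drafts.json and exposes variables.demoUrl under an items array (verify the exact shape on one draft first):

jq -r '.items[]? | .variables.demoUrl? // empty' drafts.json \
  | sed -n 's/.*[?&]id=\([^&]*\).*/\1/p' \
  | sort -u > /tmp/demo-ids.txt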

For each demo ID, fetch agent info:

curl -s https://app.psquared.dev/api/demo/DEMO_ID
# Returns: { agentId, companyName, companyDomain, ... }

Then get bucket ID:

{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"get_agent","arguments":{"agentId":"AGENT_ID"}}}

Extract knowledgeBucketIds[0].

Filter out: demoStatus = SKIP_*, DISQUALIFIED. Skip demos without companyDomain.

Build: [ { companyName, companyDomain, agentId, bucketId, demoId } ]
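
A sketch of the assembly loop (uses the mcp_call wrapper from above; the jq path to knowledgeBucketIds in the MCP response is a guess, so check one raw response first):

while read -r DEMO_ID; do
  DEMO=$(curl -s "https://app.psquared.dev/api/demo/$DEMO_ID")
  AGENT_ID=$(jq -r '.agentId' <<<"$DEMO")
  BUCKET_ID=$(mcp_call get_agent "{\"agentId\":\"$AGENT_ID\"}" \
    | jq -r '[.. | .knowledgeBucketIds? // empty] | flatten | first // empty')
  jq -n --arg cn "$(jq -r '.companyName' <<<"$DEMO")" \
        --arg cd "$(jq -r '.companyDomain' <<<"$DEMO")" \
        --arg ai "$AGENT_ID" --arg bi "$BUCKET_ID" --arg di "$DEMO_ID" \
        '{companyName:$cn, companyDomain:$cd, agentId:$ai, bucketId:$bi, demoId:$di}'
done < /tmp/demo-ids.txt > /tmp/worklist.jsonl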

Announce: Found N demos to refurbish.


PHASE 2 — Discover Real Pages Per Company

Announce: [2/4] Discovering pages to scrape...

For each company, discover which pages actually exist before scraping. Never guess URLs blindly: generic patterns like /kontakt/, /ueber-uns/, /leistungen/ fail on 60%+ of sites (e-commerce, single-page, non-standard CMS) and produce "Seite nicht gefunden" ("page not found") knowledge items that pollute RAG.

Step 2a: WebFetch the homepage

WebFetch https://www.{domain}/ (or https://{domain}/ if www fails). Extract:

  • Navigation links (main nav, header menu)
  • Footer links (about, contact, FAQ, legal)
  • Any prominent section links
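
WebFetch handles this natively. If you need the same discovery from a script, a rough curl + grep sketch (it misses JS-rendered navs, the exact case flagged in the rules below):

DOMAIN="example.de"   # placeholder
# dots in $DOMAIN are unescaped in the grep; good enough for a sketch
curl -sL --max-time 30 "https://www.$DOMAIN/" \
  | grep -oE 'href="[^"]+"' \
  | sed 's/^href="//; s/"$//' \
  | grep -E "^(/|https?://(www\.)?$DOMAIN)" \
  | sort -u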

Step 2b: Select 6-8 best URLs

From the discovered links, pick URLs that give the chatbot useful knowledge:

Priority order:

  1. Homepage (always)
  2. About / Über uns / Unternehmen (who they are)
  3. Services / Leistungen / Produkte (what they offer)
  4. Contact / Kontakt / Standorte / Filialen (how to reach them)
  5. FAQ / Service / Hilfe (common questions)
  6. Team / Karriere (if relevant)
  7. Impressum (legal, low priority but useful for address/contact)
  8. Key category/product pages (for e-commerce sites)

Rules:

  • Only include URLs on the same domain
  • Skip: blog posts, privacy policy, AGB, login pages, PDF links, anchor-only links
  • Skip junk URLs: /wp-json/, /xmlrpc.php, /feed/, /favicon.ico, apple-touch-icon, .webp, .ico, .css, .js
  • For e-commerce sites: pick category overview pages, brand pages, store locator — NOT individual product pages
  • For JS-rendered pages (store locators, maps): if Tavily returns empty, WebFetch manually
  • Max 8 URLs total (keeps scrape time reasonable)
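
The skip rules above as a single filter pass over a candidate list (the patterns mirror the bullets; blog posts and individual product pages still need a manual look):

grep -viE '/wp-json/|xmlrpc\.php|/feed/|favicon\.ico|apple-touch-icon|\.(webp|ico|css|js)(\?|$)' candidates.txt \
  | grep -viE 'datenschutz|privacy|/agb|login|\.pdf(\?|$)|#' \
  | head -n 8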

Step 2c: Split into batches of 2-3 URLs

Tavily advanced mode takes ~5-30s per URL. A single call with 5+ URLs hits server timeouts (120s). Always split into batches of max 3 URLs per MCP call.
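
A sketch of the batching loop, assuming the selected URLs sit one per line in selected-urls.txt and BUCKET_ID is set (mcp_call wrapper from above):

mapfile -t URLS < selected-urls.txt
for ((i = 0; i < ${#URLS[@]}; i += 3)); do
  BATCH=$(printf '%s\n' "${URLS[@]:i:3}" | jq -R . | jq -s .)
  mcp_call scrape_and_build_knowledge "{\"bucketId\":\"$BUCKET_ID\",\"urls\":$BATCH}"
done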

Fallback: If WebFetch fails

If the homepage can't be fetched (timeout, blocking), fall back to these common patterns, but expect some of them to 404 and don't rely on them:

https://www.{domain}/
https://www.{domain}/kontakt/
https://www.{domain}/ueber-uns/
https://www.{domain}/impressum/

PHASE 3 — Clear & Rebuild Knowledge

Announce: [3/4] Rebuilding knowledge for N demos...

Process one company at a time:

Step 3a: Wipe existing knowledge

{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"cleanup_agent","arguments":{"agentId":"AGENT_ID"}}}

This clears the agent's buckets AND deletes all orphaned buckets in the demo account (from old demo creation flow). First call handles the global orphan cleanup.

Step 3b: Scrape via MCP in batches of 3

Send discovered URLs to scrape_and_build_knowledge in batches of max 3:

{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"scrape_and_build_knowledge","arguments":{"bucketId":"BUCKET_ID","urls":["URL1","URL2","URL3"]}}}

The MCP tool automatically:

  • Rejects 404/"Seite nicht gefunden" pages (returns them in the failed array with reason "404/not-found page detected")
  • Sets sourceUrl on created items (enables chat citations)
  • Supports content up to 50K characters

Check the created and failed arrays in the response after each batch.
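
A sketch of that check, assuming the tool result arrives as JSON text in result.content[0].text (a common MCP response shape; verify against one real response):

RESP=$(mcp_call scrape_and_build_knowledge "{\"bucketId\":\"$BUCKET_ID\",\"urls\":$BATCH}")
jq -r '.result.content[0].text' <<<"$RESP" \
  | jq '{created: (.created | length), failed: .failed}'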

Step 3c: Handle failures — WebFetch fallback

For each failed URL, check the failure reason:

| Reason | Action |
|--------|--------|
| 404/not-found page detected | Skip — page doesn't exist. Don't retry. |
| Content too short | Skip — page has no useful content. |
| Timeout or Failed to scrape | Retry via WebFetch (Tavily couldn't reach the site). |
| Content must not exceed | Shouldn't happen with the 50K limit. If it does, WebFetch and trim. |

WebFetch fallback for timeouts/scrape failures:

  1. WebFetch the URL with a prompt to extract all content
  2. Clean the content: remove nav, footer, cookie banners, base64 images
  3. Keep: headings, lists, structure, ALL details (names, prices, hours, addresses)
  4. Call add_to_bucket with cleaned content + sourceUrl:
{"jsonrpc":"2.0","id":4,"method":"tools/call","params":{"name":"add_to_bucket","arguments":{"bucketId":"BUCKET_ID","title":"Page Title","content":"[full cleaned page text]","sourceUrl":"https://domain.de/page/"}}}

Step 3d: If ALL URLs fail (site blocks Tavily entirely)

Some sites block Tavily's scraper. If every URL times out or fails:

  1. WebFetch ALL discovered URLs yourself — WebFetch uses a different scraper that often works when Tavily doesn't
  2. For each page, extract detailed content (not summaries)
  3. Add each via add_to_bucket with sourceUrl
  4. You MUST NOT leave the bucket empty — an empty bucket means the chatbot can't answer anything

Content rules for manual WebFetch entries:

  • Full page text, not an AI summary — keep the original wording
  • Preserve headings, lists, and structure
  • Remove nav, footer, cookie banners, boilerplate
  • Keep ALL specifics: names, services, prices, hours, addresses, team members
  • Max 50,000 characters per entry
  • sourceUrl MUST be the actual page URL

Step 3e: Verify minimum quality

After all scraping + fallbacks, check:

  • At least 3 items in the bucket (< 3 = chatbot is too thin)
  • No items with "nicht gefunden" / "not found" / "404" in the title (shouldn't happen with MCP detection, but verify)
  • No junk items (wp-json, xmlrpc, feed, favicon — delete if present)
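
A verification sketch via list_bucket_items (the argument name and response shape are assumptions; check one real response):

ITEMS=$(mcp_call list_bucket_items "{\"bucketId\":\"$BUCKET_ID\"}")
# rough count of items and of 404-ish titles; < 3 items or any hit means manual follow-up
jq '[.. | objects | select(has("title"))] | length' <<<"$ITEMS"
grep -ciE 'nicht gefunden|not found|404' <<<"$ITEMS" || true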

Step 3f: Fix button icon if broken

Valid icons: messageCircle, messageSquare, sparkles, support, help, inboxmate, heart, zap, globe, wave, brain, lightbulb, compass, star, shield, robot, mascot

Any other value (e.g. shoppingBag, truck, home, car, music) renders as a broken circle in the widget. Fix to messageCircle:

{"jsonrpc":"2.0","id":5,"method":"tools/call","params":{"name":"update_widget_style","arguments":{"agentId":"AGENT_ID","buttonIcon":"messageCircle"}}}

Step 3g: Republish agent

{"jsonrpc":"2.0","id":6,"method":"tools/call","params":{"name":"publish_agent","arguments":{"agentId":"AGENT_ID"}}}

Log per company: ✓ {companyName}: {N} items created, {M} failed → republished


PHASE 4 — Report

Announce: [4/4] Refurbish complete.

Print summary table:

| Company | Domain | Items | Failed | Icon Fixed | Status |
|---------|--------|-------|--------|------------|--------|
| Schuh Marke GmbH | schuh-marke.de | 5 | 0 | yes | ✓ |
| Mainfilm | mainfilm.tv | 5 | 0 | no | ✓ |

Totals:

  • Companies processed: N
  • Knowledge items created: N
  • Failed URLs: N (fell back to WebFetch: N, completely failed: N)
  • Icons fixed: N
  • Tavily credits used: ~N (2 credits per 5 URLs at advanced mode)

Batch Script

For large batches (100+ demos), use the Python batch script instead of processing one-by-one from Claude:

python3 -u scripts/refurbish-all.py < /tmp/worklist.json > /tmp/refurbish.log 2>&1

The script at claude-overlord-folder/scripts/refurbish-all.py:

  • Discovers real URLs from homepage nav (not hardcoded patterns)
  • Filters junk URLs (wp-json, xmlrpc, feed, favicon, etc.)
  • Handles www. prefix correctly (no www.www. duplication)
  • Splits URLs into batches of 3 to avoid server timeouts
  • Uses curl (not urllib) for MCP calls — handles # in auth token
  • Processes sequentially with 1s pause every 5 agents
  • Input: JSON array of [{company_name, company_domain, agent_id, bucket_id}]
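
For reference, a minimal /tmp/worklist.json in that shape (IDs are placeholders):

[
  {
    "company_name": "Schuh Marke GmbH",
    "company_domain": "schuh-marke.de",
    "agent_id": "AGENT_ID",
    "bucket_id": "BUCKET_ID"
  }
]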

Limitation: The script does NOT do WebFetch fallback — it only uses MCP scrape_and_build_knowledge (Tavily). If Tavily fails for a site, the agent will be left with fewer items. After a batch run, check the log for agents with 0-2 items created and fix those manually with WebFetch.


Known Issues & Workarounds

Tavily advanced mode timeouts

Tavily extract_depth: 'advanced' takes up to 30s per URL, so batching 5 URLs in one MCP call can take 150s, which exceeds the 120s server timeout. Always split into batches of 2-3 URLs.

Fallback API key

If primary Tavily key runs out, set NUXT_TAVILY_FALLBACK_API_KEY in agenthub .env. Auto-switches on 429/402 errors.

"Seite nicht gefunden" pages

If 404 pages end up in knowledge (from fallback URL guessing or site changes), they DO hurt RAG — the chatbot may cite non-existent pages or give confused answers. Phase 2 now discovers real URLs first to prevent this. If you see 404 items after a refurbish, the homepage WebFetch likely failed and fell back to guessed URLs — investigate and re-run with manual URL discovery.

Bucket name mismatches

Old demo creation flow sometimes linked wrong-named buckets to agents (e.g. "Autohaus Freier Wissensbasis" on a Hinzmann Elektrotechnik agent). The content is correct after refurbish — the bucket name is cosmetic. Can be fixed via SQL if needed.

Orphaned buckets

Old demo creation left ~160 orphaned "Wissensdatenbank" buckets not linked to any agent. cleanup_agent deletes these. For large batches, clean up via SQL first (faster):

DELETE FROM knowledge_bucket_chunks WHERE bucket_item_id IN (
  SELECT kbi.id FROM knowledge_bucket_items kbi
  JOIN knowledge_buckets kb ON kb.id = kbi.bucket_id
  WHERE kb.account_id = '8942a6e5-91cb-4c5d-8ef5-98cfe7945620'
  AND kb.id NOT IN (SELECT unnest(knowledge_bucket_ids) FROM agents WHERE account_id = '8942a6e5-91cb-4c5d-8ef5-98cfe7945620')
);
-- Then delete items, then buckets (same pattern)
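-- A sketch of that follow-up (same orphan subquery, items then buckets):
DELETE FROM knowledge_bucket_items WHERE bucket_id IN (
  SELECT kb.id FROM knowledge_buckets kb
  WHERE kb.account_id = '8942a6e5-91cb-4c5d-8ef5-98cfe7945620'
  AND kb.id NOT IN (SELECT unnest(knowledge_bucket_ids) FROM agents WHERE account_id = '8942a6e5-91cb-4c5d-8ef5-98cfe7945620')
);
DELETE FROM knowledge_buckets
WHERE account_id = '8942a6e5-91cb-4c5d-8ef5-98cfe7945620'
AND id NOT IN (SELECT unnest(knowledge_bucket_ids) FROM agents WHERE account_id = '8942a6e5-91cb-4c5d-8ef5-98cfe7945620');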

Content length limit

Knowledge entries allow up to 50,000 characters (was 20K, bumped 2026-03-28). Tavily advanced mode returns full pages including nav/footer HTML — the extra headroom prevents truncation of actual content. Pages hitting the limit likely have nav pollution but the important content still fits.

JS-rendered content (store locators, maps, dynamic lists)

Some pages load data via JavaScript (store locators, interactive maps, product configurators). Tavily can't extract this content. If a page returns mostly nav/boilerplate with no real data, WebFetch it manually, extract the data from the page description or other sources, and add via add_to_bucket with sourceUrl.

Cross-contamination

Audit showed only 1 true case out of 152 sent demos (Dachdeckermeister Hoffmann had DBM Metallbau content). Refurbish fixes this by clearing and re-scraping based on the correct domain from the demo page.
