refurbish-demos
Refurbish Demos
Upgrade demo agent knowledge from old AI-written summaries to properly scraped page content with real sourceUrls. Each knowledge item becomes a URL-type entry with the actual page URL, so the chatbot can cite specific pages in its answers.
Parameter: campaignId (CRM campaign ID) or all-sent (all demos where outreach was sent)
Announce:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Refurbish Demos — upgrading knowledge ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Environment
Read from .env in the working directory:
- NUXT_MCP_DEMO_TOKEN — Bearer token for InboxMate MCP (contains # character — use curl, not Python urllib)
- PSQUARED_CRM_TOKEN — Bearer token for Twenty CRM
- EMAIL_DRAFT_ONLY_BEARER — Bearer token for notification service (used in Option B: all-sent)
MCP endpoint: https://app.psquared.dev/api/mcp
CRM endpoint: https://crm.psquared.dev/graphql
Demo API: https://app.psquared.dev/api/demo/{demoId}
Supabase project: fevtfywriufbqnvbgyrm (for direct DB queries when needed)
Demo account ID: 8942a6e5-91cb-4c5d-8ef5-98cfe7945620
MCP Call Template
All MCP calls use this pattern (use curl, NOT Python urllib — the token contains #):
curl -s --max-time 120 -X POST https://app.psquared.dev/api/mcp \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $NUXT_MCP_DEMO_TOKEN" \
-d '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"TOOL_NAME","arguments":{...}}}'
Available MCP Tools
| Tool | Purpose |
|---|---|
| cleanup_agent | Wipe all knowledge from the agent's buckets + delete orphaned buckets in the demo account |
| clear_bucket | Remove all items from a specific bucket |
| scrape_and_build_knowledge | Scrape URLs via Tavily (advanced mode), create URL-type knowledge items. Max 10 URLs per call. Returns { created: [], failed: [] } |
| add_to_bucket | Manually add a knowledge item. If sourceUrl is provided, creates a URL-type entry |
| get_agent | Get agent config (check knowledgeBucketIds, buttonIcon) |
| list_bucket_items | List items in a bucket (verify results) |
| update_widget_style | Fix buttonIcon or other widget config |
| publish_agent | Republish agent (required after knowledge changes) |
PHASE 1 — Build Work List
Announce:
[1/4] Building work list...
Option A: By campaign ID
Query CRM for opportunities in the campaign:
curl -s -X POST https://crm.psquared.dev/graphql \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $PSQUARED_CRM_TOKEN" \
-d "{\"query\":\"{ opportunities(filter: { campaignId: { eq: \\\"CAMPAIGN_ID\\\" } }, first: 150) { edges { node { id name demoStatus demoUrl { primaryLinkUrl } company { id name domainName { primaryLinkUrl } } } } } }\"}"
Option B: All sent demos
Get sent email drafts from notification service, extract demo IDs, then look up agents:
curl -s "https://notifications.psquared.dev/drafts?status=SENT&draftType=outreach&pageSize=200" \
-H "Authorization: Bearer $EMAIL_DRAFT_ONLY_BEARER"
Extract demoUrl from each draft's variables, parse the ?id= param to get demoId.
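A minimal parsing sketch (assumes the draft JSON exposes the demo link under variables.demoUrl, as described above, and that the link carries the demo ID in an ?id= query param):

```python
# Sketch: pull the demoId out of a sent draft's demoUrl variable.
from urllib.parse import urlparse, parse_qs

def demo_id_from_draft(draft: dict) -> str | None:
    demo_url = (draft.get("variables") or {}).get("demoUrl")  # key name per the step above
    if not demo_url:
        return None
    return parse_qs(urlparse(demo_url).query).get("id", [None])[0]
```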
For each demo ID, fetch agent info:
curl -s https://app.psquared.dev/api/demo/DEMO_ID
# Returns: { agentId, companyName, companyDomain, ... }
Then get bucket ID:
{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"get_agent","arguments":{"agentId":"AGENT_ID"}}}
Extract knowledgeBucketIds[0].
Filter out: demoStatus = SKIP_*, DISQUALIFIED. Skip demos without companyDomain.
Build: [ { companyName, companyDomain, agentId, bucketId, demoId } ]
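A sketch of the filtering and work-list assembly, assuming the CRM/demo lookups above have already been flattened into one dict per demo with these keys:

```python
# Sketch: drop skipped/disqualified demos and demos without a domain, keep the rest.
def build_worklist(demos: list[dict]) -> list[dict]:
    worklist = []
    for d in demos:
        status = d.get("demoStatus") or ""
        if status.startswith("SKIP_") or status == "DISQUALIFIED":
            continue
        if not d.get("companyDomain"):
            continue  # nothing to scrape without a domain
        worklist.append({k: d.get(k) for k in
                         ("companyName", "companyDomain", "agentId", "bucketId", "demoId")})
    return worklist
```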
Announce: Found N demos to refurbish.
PHASE 2 — Discover Real Pages Per Company
Announce:
[2/4] Discovering pages to scrape...
For each company, discover which pages actually exist before scraping. Never guess URLs blindly — generic patterns like /kontakt/, /ueber-uns/, /leistungen/ fail on 60%+ of sites (e-commerce, single-page, non-standard CMS). This produces "Seite nicht gefunden" knowledge items that pollute RAG.
Step 2a: WebFetch the homepage
WebFetch https://www.{domain}/ (or https://{domain}/ if www fails). Extract (a link-collection sketch follows this list):
- Navigation links (main nav, header menu)
- Footer links (about, contact, FAQ, legal)
- Any prominent section links
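A minimal link-collection sketch, assuming you already have the homepage HTML (from WebFetch or curl); it keeps only same-host, non-anchor links as candidates for Step 2b:

```python
# Sketch: collect candidate links from homepage HTML using the stdlib parser.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href or href.startswith("#"):
            return
        url = urljoin(self.base_url, href)
        if urlparse(url).netloc == urlparse(self.base_url).netloc:
            self.links.add(url.split("#")[0])  # same domain only, drop anchors

def discover_links(homepage_html: str, base_url: str) -> set[str]:
    collector = LinkCollector(base_url)
    collector.feed(homepage_html)
    return collector.links
```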
Step 2b: Select 6-8 best URLs
From the discovered links, pick URLs that give the chatbot useful knowledge:
Priority order:
- Homepage (always)
- About / Über uns / Unternehmen (who they are)
- Services / Leistungen / Produkte (what they offer)
- Contact / Kontakt / Standorte / Filialen (how to reach them)
- FAQ / Service / Hilfe (common questions)
- Team / Karriere (if relevant)
- Impressum (legal, low priority but useful for address/contact)
- Key category/product pages (for e-commerce sites)
Rules:
- Only include URLs on the same domain
- Skip: blog posts, privacy policy, AGB, login pages, PDF links, anchor-only links
- Skip junk URLs (see the filter sketch after this list): /wp-json/, /xmlrpc.php, /feed/, /favicon.ico, apple-touch-icon, .webp, .ico, .css, .js
- For e-commerce sites: pick category overview pages, brand pages, store locator — NOT individual product pages
- For JS-rendered pages (store locators, maps): if Tavily returns empty, WebFetch manually
- Max 8 URLs total (keeps scrape time reasonable)
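The prioritisation itself stays a judgment call, but the mechanical part of the rules (junk/skip filters, cap at 8) can be sketched like this; the skip-hint list beyond the documented junk patterns is an assumption:

```python
# Sketch: mechanical URL filtering per the rules above.
JUNK = ("/wp-json/", "/xmlrpc.php", "/feed/", "/favicon.ico",
        "apple-touch-icon", ".webp", ".ico", ".css", ".js")
SKIP = ("/blog/", "/datenschutz", "/privacy", "/agb", "/login", ".pdf")  # assumed hints

def filter_candidates(urls: set[str], max_urls: int = 8) -> list[str]:
    keep = []
    for url in sorted(urls):
        low = url.lower()
        if any(p in low for p in JUNK) or any(p in low for p in SKIP):
            continue
        keep.append(url)
    return keep[:max_urls]
```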
Step 2c: Split into batches of 2-3 URLs
Tavily advanced mode takes ~5-30s per URL. A single call with 5+ URLs hits server timeouts (120s). Always split into batches of max 3 URLs per MCP call.
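A one-line batching helper (reused in Step 3b):

```python
# Sketch: split the selected URLs into MCP-call-sized batches of 3.
def batches(urls: list[str], size: int = 3) -> list[list[str]]:
    return [urls[i:i + size] for i in range(0, len(urls), size)]
```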
Fallback: If WebFetch fails
If the homepage can't be fetched (timeout, blocking), fall back to these common patterns but expect failures and don't count on them:
https://www.{domain}/
https://www.{domain}/kontakt/
https://www.{domain}/ueber-uns/
https://www.{domain}/impressum/
PHASE 3 — Clear & Rebuild Knowledge
Announce:
[3/4] Rebuilding knowledge for N demos...
Process one company at a time:
Step 3a: Wipe existing knowledge
{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"cleanup_agent","arguments":{"agentId":"AGENT_ID"}}}
This clears the agent's buckets AND deletes all orphaned buckets in the demo account (left over from the old demo creation flow). The first call in a batch run therefore handles the global orphan cleanup.
Step 3b: Scrape via MCP in batches of 3
Send discovered URLs to scrape_and_build_knowledge in batches of max 3:
{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"scrape_and_build_knowledge","arguments":{"bucketId":"BUCKET_ID","urls":["URL1","URL2","URL3"]}}}
The MCP tool automatically:
- Rejects 404/"Seite nicht gefunden" pages (returns them in the failed array with reason 404/not-found page detected)
- Sets sourceUrl on created items (enables chat citations)
- Supports content up to 50K characters
Check the response created and failed arrays after each batch.
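For scripted runs, a minimal sketch of one batch call plus the result check. It shells out to curl (the doc's own convention, since the token contains #); the exact nesting of created/failed inside the MCP response is an assumption, so adjust to what the server actually returns:

```python
# Sketch: call scrape_and_build_knowledge for one batch via curl and parse the response.
import json, os, subprocess

MCP_URL = "https://app.psquared.dev/api/mcp"

def mcp_call(tool: str, arguments: dict) -> dict:
    payload = {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
               "params": {"name": tool, "arguments": arguments}}
    proc = subprocess.run(
        ["curl", "-s", "--max-time", "120", "-X", "POST", MCP_URL,
         "-H", "Content-Type: application/json",
         "-H", f"Authorization: Bearer {os.environ['NUXT_MCP_DEMO_TOKEN']}",
         "-d", json.dumps(payload)],
        capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)

def scrape_batch(bucket_id: str, urls: list[str]) -> dict:
    resp = mcp_call("scrape_and_build_knowledge", {"bucketId": bucket_id, "urls": urls})
    # Inspect the created/failed arrays in the tool result before the next batch.
    return resp
```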
Step 3c: Handle failures — WebFetch fallback
For each failed URL, check the failure reason:
| Reason | Action |
|---|---|
| 404/not-found page detected | Skip — page doesn't exist. Don't retry. |
| Content too short | Skip — page has no useful content. |
| Timeout or Failed to scrape | Retry via WebFetch (Tavily couldn't reach the site). |
| Content must not exceed | Shouldn't happen with the 50K limit. If it does, WebFetch and trim. |
WebFetch fallback for timeouts/scrape failures:
- WebFetch the URL with a prompt to extract all content
- Clean the content: remove nav, footer, cookie banners, base64 images
- Keep: headings, lists, structure, ALL details (names, prices, hours, addresses)
- Call add_to_bucket with cleaned content + sourceUrl:
{"jsonrpc":"2.0","id":4,"method":"tools/call","params":{"name":"add_to_bucket","arguments":{"bucketId":"BUCKET_ID","title":"Page Title","content":"[full cleaned page text]","sourceUrl":"https://domain.de/page/"}}}
Step 3d: If ALL URLs fail (site blocks Tavily entirely)
Some sites block Tavily's scraper. If every URL times out or fails:
- WebFetch ALL discovered URLs yourself — WebFetch uses a different scraper that often works when Tavily doesn't
- For each page, extract detailed content (not summaries)
- Add each via add_to_bucket with sourceUrl
- You MUST NOT leave the bucket empty — an empty bucket means the chatbot can't answer anything
Content rules for manual WebFetch entries:
- Full page text, not an AI summary — keep the original wording
- Preserve headings, lists, and structure
- Remove nav, footer, cookie banners, boilerplate
- Keep ALL specifics: names, services, prices, hours, addresses, team members
- Max 50,000 characters per entry
- sourceUrl MUST be the actual page URL
Step 3e: Verify minimum quality
After all scraping + fallbacks, check (a verification sketch follows this list):
- At least 3 items in the bucket (< 3 = chatbot is too thin)
- No items with "nicht gefunden" / "not found" / "404" in the title (shouldn't happen with MCP detection, but verify)
- No junk items (wp-json, xmlrpc, feed, favicon — delete if present)
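A sketch of the check, assuming the items have been fetched via list_bucket_items (the item field names are an assumption):

```python
# Sketch: flag buckets that are too thin or contain 404/junk items.
BAD_TITLE = ("nicht gefunden", "not found", "404")
JUNK = ("wp-json", "xmlrpc", "feed", "favicon")

def bucket_problems(items: list[dict]) -> list[str]:
    problems = []
    if len(items) < 3:
        problems.append(f"only {len(items)} items, chatbot is too thin")
    for item in items:
        title = (item.get("title") or "").lower()
        if any(h in title for h in BAD_TITLE):
            problems.append(f"404-looking item: {item.get('title')}")
        if any(h in title for h in JUNK):
            problems.append(f"junk item: {item.get('title')}")
    return problems
```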
Step 3f: Fix button icon if broken
Valid icons: messageCircle, messageSquare, sparkles, support, help, inboxmate, heart, zap, globe, wave, brain, lightbulb, compass, star, shield, robot, mascot
Any other value (e.g. shoppingBag, truck, home, car, music) renders as a broken circle in the widget. Fix to messageCircle:
{"jsonrpc":"2.0","id":5,"method":"tools/call","params":{"name":"update_widget_style","arguments":{"agentId":"AGENT_ID","buttonIcon":"messageCircle"}}}
Step 3g: Republish agent
{"jsonrpc":"2.0","id":6,"method":"tools/call","params":{"name":"publish_agent","arguments":{"agentId":"AGENT_ID"}}}
Log per company:
✓ {companyName}: {N} items created, {M} failed → republished
PHASE 4 — Report
Announce:
[4/4] Refurbish complete.
Print summary table:
| Company | Domain | Items | Failed | Icon Fixed | Status |
|---------|--------|-------|--------|------------|--------|
| Schuh Marke GmbH | schuh-marke.de | 5 | 0 | yes | ✓ |
| Mainfilm | mainfilm.tv | 5 | 0 | no | ✓ |
Totals:
- Companies processed: N
- Knowledge items created: N
- Failed URLs: N (fell back to WebFetch: N, completely failed: N)
- Icons fixed: N
- Tavily credits used: ~N (2 credits per 5 URLs at advanced mode)
Batch Script
For large batches (100+ demos), use the Python batch script instead of processing one-by-one from Claude:
python3 -u scripts/refurbish-all.py < /tmp/worklist.json > /tmp/refurbish.log 2>&1
The script at claude-overlord-folder/scripts/refurbish-all.py:
- Discovers real URLs from homepage nav (not hardcoded patterns)
- Filters junk URLs (wp-json, xmlrpc, feed, favicon, etc.)
- Handles the www. prefix correctly (no www.www. duplication)
- Splits URLs into batches of 3 to avoid server timeouts
- Uses curl (not urllib) for MCP calls — handles # in the auth token
- Processes sequentially with a 1s pause every 5 agents
- Input: JSON array of [{company_name, company_domain, agent_id, bucket_id}] (sample below)
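A hypothetical /tmp/worklist.json (IDs and names are placeholders):

```json
[
  {
    "company_name": "Example GmbH",
    "company_domain": "example.de",
    "agent_id": "00000000-0000-0000-0000-000000000000",
    "bucket_id": "00000000-0000-0000-0000-000000000000"
  }
]
```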
Limitation: The script does NOT do WebFetch fallback — it only uses MCP scrape_and_build_knowledge (Tavily). If Tavily fails for a site, the agent will be left with fewer items. After a batch run, check the log for agents with 0-2 items created and fix those manually with WebFetch.
Known Issues & Workarounds
Tavily advanced mode timeouts
Tavily extract_depth: 'advanced' takes up to 30s per URL. Batch 5 URLs in one MCP call = 150s, which exceeds server timeout. Always split into batches of 2-3 URLs.
Fallback API key
If primary Tavily key runs out, set NUXT_TAVILY_FALLBACK_API_KEY in agenthub .env. Auto-switches on 429/402 errors.
"Seite nicht gefunden" pages
If 404 pages end up in knowledge (from fallback URL guessing or site changes), they DO hurt RAG — the chatbot may cite non-existent pages or give confused answers. Phase 2 now discovers real URLs first to prevent this. If you see 404 items after a refurbish, the homepage WebFetch likely failed and fell back to guessed URLs — investigate and re-run with manual URL discovery.
Bucket name mismatches
Old demo creation flow sometimes linked wrong-named buckets to agents (e.g. "Autohaus Freier Wissensbasis" on a Hinzmann Elektrotechnik agent). The content is correct after refurbish — the bucket name is cosmetic. Can be fixed via SQL if needed.
Orphaned buckets
Old demo creation left ~160 orphaned "Wissensdatenbank" buckets not linked to any agent. cleanup_agent deletes these. For large batches, clean up via SQL first (faster):
DELETE FROM knowledge_bucket_chunks WHERE bucket_item_id IN (
SELECT kbi.id FROM knowledge_bucket_items kbi
JOIN knowledge_buckets kb ON kb.id = kbi.bucket_id
WHERE kb.account_id = '8942a6e5-91cb-4c5d-8ef5-98cfe7945620'
AND kb.id NOT IN (SELECT unnest(knowledge_bucket_ids) FROM agents WHERE account_id = '8942a6e5-91cb-4c5d-8ef5-98cfe7945620')
);
-- Then delete items, then buckets (same pattern)
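A sketch of those follow-up deletes, reusing the same orphan filter as above (verify against the actual schema before running):

```sql
DELETE FROM knowledge_bucket_items WHERE bucket_id IN (
  SELECT kb.id FROM knowledge_buckets kb
  WHERE kb.account_id = '8942a6e5-91cb-4c5d-8ef5-98cfe7945620'
    AND kb.id NOT IN (SELECT unnest(knowledge_bucket_ids) FROM agents
                      WHERE account_id = '8942a6e5-91cb-4c5d-8ef5-98cfe7945620')
);
DELETE FROM knowledge_buckets
WHERE account_id = '8942a6e5-91cb-4c5d-8ef5-98cfe7945620'
  AND id NOT IN (SELECT unnest(knowledge_bucket_ids) FROM agents
                 WHERE account_id = '8942a6e5-91cb-4c5d-8ef5-98cfe7945620');
```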
Content length limit
Knowledge entries allow up to 50,000 characters (was 20K, bumped 2026-03-28). Tavily advanced mode returns full pages including nav/footer HTML — the extra headroom prevents truncation of actual content. Pages hitting the limit likely have nav pollution but the important content still fits.
JS-rendered content (store locators, maps, dynamic lists)
Some pages load data via JavaScript (store locators, interactive maps, product configurators). Tavily can't extract this content. If a page returns mostly nav/boilerplate with no real data, WebFetch it manually, extract the data from the page description or other sources, and add via add_to_bucket with sourceUrl.
Cross-contamination
Audit showed only 1 true case out of 152 sent demos (Dachdeckermeister Hoffmann had DBM Metallbau content). Refurbish fixes this by clearing and re-scraping based on the correct domain from the demo page.