Tapestry: Unified Content Extraction + Action Planning

Master skill that orchestrates the entire Tapestry workflow:

Detect content type from URL
Extract content using appropriate method
Create a Ship-Learn-Next action plan automatically

Prerequisites

This skill requires UV for dependency management. Run all commands from the tapestry-skills project root.

Workflow Overview

URL → Validate → Detect Type → Extract Content → Create Plan → Save Files

Output: two files are saved:

Content file: [Title].txt
Plan file: Ship-Learn-Next Plan - [Quest Title].md

Security Requirements

CRITICAL: Before processing ANY URL, validate it first using the tapestry security utilities.

All security utilities are available via UV from the project root.

URL Validation (Required)

URL="$1"

# Run security validation (checks protocol, blocks SSRF, etc.)
uv run tapestry-validate-url "$URL" || exit 1
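
Before calling the validator, it can be cheap to reject obviously unusable input. The following is a minimal sketch of such a pre-check; `is_http_url` is a hypothetical helper, not part of the tapestry utilities, and it only inspects the scheme and the shape of the host. It does not replace `tapestry-validate-url`, which is still required for SSRF protection.

```shell
# Hypothetical pre-check: accept only absolute http(s) URLs with a non-empty host.
# This performs no DNS, IP-range, or SSRF checks.
is_http_url() {
    case "$1" in
        *[[:space:]]*) return 1 ;;                  # reject embedded whitespace
        http://[!/?#]*|https://[!/?#]*) return 0 ;; # scheme followed by a host character
        *) return 1 ;;
    esac
}
```

A caller might run `is_http_url "$URL" || exit 1` first and then hand the URL to the real validator.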

Filename Sanitization (Required)

# Use tapestry sanitization utility for all titles
SAFE_TITLE=$(uv run tapestry-sanitize-filename "$TITLE")
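
To make the intent of sanitization concrete, here is a rough sketch of the kind of transformation involved. `sanitize_filename_sketch` is a hypothetical stand-in written for illustration; the real `tapestry-sanitize-filename` may apply different rules, and the 100-character cap and the `untitled` fallback are assumptions.

```shell
# Hypothetical illustration only; always use tapestry-sanitize-filename in the workflow.
sanitize_filename_sketch() {
    local name
    # 1. Replace every character outside a conservative set (this removes "/").
    # 2. Strip leading dots so the result is never hidden or a ".." reference.
    # 3. Cap the length at 100 characters (assumed limit).
    name=$(printf '%s' "$1" |
        tr -c 'A-Za-z0-9._ -' '_' |
        sed 's/^\.*//' |
        cut -c1-100)
    # Fall back to a fixed name when nothing usable is left.
    printf '%s\n' "${name:-untitled}"
}
```

For example, a title such as `My Talk: Part 1/2` would come out as `My Talk_ Part 1_2` under these assumed rules.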

Step 1: Detect Content Type

detect_content_type() {
    local URL="$1"

    # YouTube patterns
    if [[ "$URL" =~ youtube\.com/watch || "$URL" =~ youtu\.be/ || "$URL" =~ youtube\.com/shorts ]]; then
        echo "youtube"
        return
    fi

    # PDF by extension
    if [[ "$URL" =~ \.pdf($|\?) ]]; then
        echo "pdf"
        return
    fi

    # PDF by Content-Type header
    if curl -sI --max-time 10 "$URL" | grep -iq "Content-Type:.*application/pdf"; then
        echo "pdf"
        return
    fi

    # Default to article
    echo "article"
}

CONTENT_TYPE=$(detect_content_type "$URL")
echo "Detected: $CONTENT_TYPE"
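
For tests or offline use, the pattern-based part of the detection can be separated from the network probe. The sketch below assumes a hypothetical helper named `classify_url_by_pattern`. It mirrors the URL patterns above, lowercases the URL first so that an uppercase `.PDF` extension also matches, and skips the Content-Type check entirely.

```shell
# Hypothetical network-free variant of detect_content_type.
classify_url_by_pattern() {
    local lower
    # Lowercase with tr so the function also works on older shells.
    lower=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$lower" in
        *youtube.com/watch*|*youtu.be/*|*youtube.com/shorts*) echo "youtube" ;;
        *.pdf|*.pdf\?*) echo "pdf" ;;
        *) echo "article" ;;
    esac
}
```

Because it never touches the network, a PDF served from a URL without a `.pdf` extension is reported as an article here; only the full function above can catch that case.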

Step 2: Extract Content

YouTube Extraction

Use the youtube-transcript skill workflow:

# yt-dlp is available through UV
VIDEO_TITLE=$(uv run yt-dlp --print "%(title)s" "$URL" 2>/dev/null)
SAFE_TITLE=$(uv run tapestry-sanitize-filename "$VIDEO_TITLE")

# Create temp file
TEMP_DIR=$(mktemp -d)
trap 'rm -rf "$TEMP_DIR"' EXIT

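# Download transcript (try manual subtitles first, then auto-generated).
# yt-dlp may exit 0 even when no subtitles exist, so test for the file itself.
uv run yt-dlp --write-sub --skip-download --sub-langs en -o "$TEMP_DIR/transcript" "$URL" 2>/dev/null
VTT_FILE=$(find "$TEMP_DIR" -name "*.vtt" | head -n 1)
if [ -z "$VTT_FILE" ]; then
    uv run yt-dlp --write-auto-sub --skip-download --sub-langs en -o "$TEMP_DIR/transcript" "$URL"
    VTT_FILE=$(find "$TEMP_DIR" -name "*.vtt" | head -n 1)
fi

# Stop here if no transcript was produced
if [ -z "$VTT_FILE" ]; then
    echo "Error: no English subtitles found" >&2
    exit 1
fi

# Convert VTT to clean text
uv run tapestry-vtt-to-text "$VTT_FILE" --output "${SAFE_TITLE}.txt"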

CONTENT_FILE="${SAFE_TITLE}.txt"
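
The conversion utility is treated as a black box in this workflow. To illustrate the kind of cleanup a VTT transcript needs, the sketch below uses a hypothetical `vtt_to_text_sketch`, which is not the actual implementation of `tapestry-vtt-to-text`. It drops WebVTT header lines, cue timings, and inline tags, then removes consecutive duplicate lines, which auto-generated captions tend to contain.

```shell
# Hypothetical illustration of VTT cleanup; simplified (multi-line NOTE blocks
# and numeric cue identifiers are not handled).
vtt_to_text_sketch() {
    tr -d '\r' < "$1" |
    sed 's/<[^>]*>//g' |                # strip inline tags such as <c> and timestamps
    awk '
        /^WEBVTT/ || /^Kind:/ || /^Language:/ || /^NOTE/ { next }  # header lines
        /-->/ { next }                                             # cue timing lines
        /^[[:space:]]*$/ { next }                                  # blank lines
        $0 != prev { print; prev = $0 }                            # drop consecutive repeats
    '
}
```

Usage would look like `vtt_to_text_sketch "$VTT_FILE" > "${SAFE_TITLE}.txt"`.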

Article Extraction

Use the article-extractor skill workflow:

# Check for extraction tools
if command -v reader &> /dev/null; then
    TOOL="reader"
else
    TOOL="trafilatura"
fi

TEMP_FILE=$(mktemp)
trap 'rm -f "$TEMP_FILE"' EXIT

case $TOOL in
    reader)
        reader "$URL" > "$TEMP_FILE"
        TITLE=$(head -n 1 "$TEMP_FILE" | sed 's/^# //')
        ;;
    trafilatura)
        uv run trafilatura --URL "$URL" --output-format txt --no-comments > "$TEMP_FILE"
        TITLE=$(uv run trafilatura --URL "$URL" --json 2>/dev/null | \
            python3 -c "import json,sys; print(json.load(sys.stdin).get('title','Article'))" 2>/dev/null || echo "Article")
        ;;
esac

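# Fallback if extraction failed
if [ ! -s "$TEMP_FILE" ]; then
    uv run tapestry-extract-html "$URL" --output "$TEMP_FILE"
    TITLE=$(head -n 1 "$TEMP_FILE" | sed 's/^# //')
fi

# Never continue with an empty extraction
if [ ! -s "$TEMP_FILE" ]; then
    echo "Error: no content could be extracted from $URL" >&2
    exit 1
fi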

SAFE_TITLE=$(uv run tapestry-sanitize-filename "$TITLE")
CONTENT_FILE="${SAFE_TITLE}.txt"
mv "$TEMP_FILE" "$CONTENT_FILE"
trap - EXIT

PDF Extraction

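# Derive a filename from the URL path (strip fragment and query string first,
# so a "/" inside the query cannot change the basename)
URL_PATH="${URL%%\#*}"
URL_PATH="${URL_PATH%%\?*}"
URL_BASENAME=$(basename "$URL_PATH")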
SAFE_PDF=$(uv run tapestry-sanitize-filename "$URL_BASENAME")

# Ensure .pdf extension
[[ "$SAFE_PDF" != *.pdf ]] && SAFE_PDF="${SAFE_PDF}.pdf"

# Download with security checks (size limit: 100 MiB = 104857600 bytes)
uv run tapestry-safe-download "$URL" "$SAFE_PDF" --max-size 104857600 || exit 1

# Verify it's actually a PDF
if ! head -c 4 "$SAFE_PDF" | grep -q '%PDF'; then
    echo "Error: Downloaded file is not a valid PDF" >&2
    rm -f "$SAFE_PDF"
    exit 1
fi

# Extract text if pdftotext available
if command -v pdftotext &> /dev/null; then
    CONTENT_FILE="${SAFE_PDF%.pdf}.txt"
    pdftotext "$SAFE_PDF" "$CONTENT_FILE"
    echo "Extracted text to: $CONTENT_FILE"
else
    echo "Note: pdftotext not found. Install with: brew install poppler (macOS) or apt install poppler-utils (Debian/Ubuntu)"
    CONTENT_FILE="$SAFE_PDF"
fi
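
The `--max-size` value above is 100 MiB expressed in bytes. The sketch below shows the arithmetic and a hypothetical `file_within_limit` helper that re-checks a file's size after download. The helper is an illustration only and is not part of the tapestry utilities.

```shell
# 100 MiB in bytes: 100 * 1024 * 1024 = 104857600
MAX_PDF_BYTES=$((100 * 1024 * 1024))

# Hypothetical post-download check: succeeds if the file exists and is
# no larger than the given number of bytes.
file_within_limit() {
    local path="$1" max_bytes="$2" size
    [ -f "$path" ] || return 1
    # Some wc implementations pad their output, so strip whitespace.
    size=$(wc -c < "$path" | tr -d '[:space:]')
    [ "$size" -le "$max_bytes" ]
}
```

A call such as `file_within_limit "$SAFE_PDF" "$MAX_PDF_BYTES" || exit 1` could act as a second line of defense after the download step.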

Step 3: Create Action Plan

After extracting content, invoke the ship-learn-next skill logic:

Read the extracted content file
Extract 3-5 core actionable lessons
Define a specific 4-8 week quest
Design Rep 1 (shippable this week)
Outline Reps 2-5 (progressive iterations)
Save as: Ship-Learn-Next Plan - [Quest Title].md

Key points:

Focus on actionable lessons, not summaries
Rep 1 must be completable in 1-7 days
Each rep produces real artifacts
Emphasize doing over studying
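
The plan itself is written by the ship-learn-next logic, but an empty skeleton shows what the steps above have to fill in. This sketch assumes a hypothetical `write_plan_skeleton` helper and a section layout invented for illustration; the actual plan format is defined by the ship-learn-next skill. The quest title passed in is assumed to be sanitized already.

```shell
# Hypothetical helper: writes an empty plan skeleton into the current directory
# and prints the resulting filename.
write_plan_skeleton() {
    local safe_quest="$1" content_file="$2" rep
    local plan_file="Ship-Learn-Next Plan - ${safe_quest}.md"
    {
        printf '# Ship-Learn-Next Plan - %s\n\n' "$safe_quest"
        printf 'Source: %s\n\n' "$content_file"
        printf '## Core Lessons (3-5)\n\n'
        printf '## Quest (4-8 weeks)\n\n'
        printf '## Rep 1 (ship within 1-7 days)\n\n'
        # Reps 2-5 are outlined as progressive iterations.
        for rep in 2 3 4 5; do
            printf '## Rep %s\n\n' "$rep"
        done
    } > "$plan_file"
    printf '%s\n' "$plan_file"
}
```

For instance, `PLAN_FILE=$(write_plan_skeleton "$SAFE_QUEST" "$CONTENT_FILE")` would create the file and capture its name, with `SAFE_QUEST` being a hypothetical variable holding the sanitized quest title.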

Step 4: Present Results

Tapestry Workflow Complete!

Content Extracted:
  Type: [youtube/article/pdf]
  Title: [Title]
  Saved to: [filename.txt]
  Words: [X]

Action Plan Created:
  Quest: [Quest title]
  Saved to: Ship-Learn-Next Plan - [Quest Title].md

Rep 1 (This Week): [Rep 1 goal]

When will you ship Rep 1?
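
The summary above can be filled in mechanically once both files exist. The following is a minimal sketch, assuming a hypothetical `print_summary` helper that takes the content type, title, content file, quest title, and plan file as arguments. The word count comes from `wc -w`, and the Rep 1 line and closing question are left to the caller.

```shell
# Hypothetical helper that prints the result summary for a finished run.
print_summary() {
    local content_type="$1" title="$2" content_file="$3" quest="$4" plan_file="$5"
    local words
    # wc may pad its output with spaces, so strip whitespace.
    words=$(wc -w < "$content_file" | tr -d '[:space:]')
    cat <<EOF
Tapestry Workflow Complete!

Content Extracted:
  Type: $content_type
  Title: $title
  Saved to: $content_file
  Words: $words

Action Plan Created:
  Quest: $quest
  Saved to: $plan_file
EOF
}
```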

Error Handling

Issue	Action
UV not installed	Install with `curl -LsSf https://astral.sh/uv/install.sh | sh`
Invalid URL	Reject with clear message
No subtitles (YouTube)	Offer Whisper transcription (with consent)
Paywall/login required	Inform user, cannot extract
Download failed	Check URL, retry, inform user
Empty extraction	Verify before planning, don't create empty plan
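
Two of the checks in the table are easy to express as small guard functions. The sketch below uses hypothetical names, `require_uv` and `has_content`, and treats a file containing only whitespace as empty, which is an assumption about what "empty extraction" should mean.

```shell
# Hypothetical guard: fail early when UV is missing.
require_uv() {
    if ! command -v uv >/dev/null 2>&1; then
        echo "Error: UV is not installed. See the install command in the table above." >&2
        return 1
    fi
}

# Hypothetical guard: succeed only if the file exists and contains at least
# one non-whitespace character.
has_content() {
    [ -f "$1" ] && grep -q '[^[:space:]]' "$1"
}
```

A workflow script might call `require_uv || exit 1` at the top and `has_content "$CONTENT_FILE" || exit 1` right before planning.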

Dependencies

Python dependencies are managed via UV and pyproject.toml:

yt-dlp: YouTube downloads (pinned version)
trafilatura: Article extraction (pinned version)
openai-whisper (optional): For videos without subtitles

System tools (install separately if needed):

pdftotext: PDF text extraction (macOS: brew install poppler; Debian/Ubuntu: apt install poppler-utils)
reader: Mozilla Readability (npm install -g reader-cli)

Security Reference

For detailed security guidelines, see: ../shared/references/security-guidelines.md

Key requirements:

Validate all URLs before processing
Sanitize all filenames
Use temp files with cleanup traps
Set download size limits
Quote all variables in shell commands
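
The temp-file requirement can be wrapped once and reused. This is a minimal sketch, assuming a hypothetical `with_temp_dir` helper: it runs a command with a fresh temporary directory as its last argument and removes the directory afterwards, even if the command fails. The trap body is single-quoted so that the variable is expanded when the trap runs rather than when it is defined.

```shell
# Hypothetical helper: run "$@" with a temporary directory appended as the
# final argument, then clean up.
with_temp_dir() {
    # The subshell scopes the EXIT trap to this one call.
    (
        tmp_dir=$(mktemp -d) || exit 1
        trap 'rm -rf "$tmp_dir"' EXIT
        "$@" "$tmp_dir"
    )
}
```

For example, `with_temp_dir download_transcript "$URL"` would call a hypothetical `download_transcript "$URL" "$tmp_dir"` and leave no directory behind.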