NotebookLM Knowledge Base Organizer

Prepares files for optimal use in NotebookLM by intelligently selecting and consolidating sources, converting formats, organizing structure, and ensuring compatibility. The primary constraint is NotebookLM's 50-source limit per notebook. When collections exceed this limit, systematic scoring, prioritization, and strategic merging reduce source count without losing valuable information.

When to Use This Skill

You have 50+ files and need to optimize for NotebookLM's limit
Preparing documents for a new NotebookLM notebook
Converting a messy folder into NotebookLM-ready sources
Files are in unsupported formats (PPTX, XLSX, scanned PDFs)
Documents exceed 500k words or 200MB per file
Building a knowledge base for research, projects, or learning
Large document collections (100-300 files) need intelligent prioritization

What This Skill Does

Scores and Prioritizes Sources (when >50 detected) using Relevance, Recency, Uniqueness, and Information Density (0-40 scale)
Strategic Merging via time-series (daily to monthly), topic-based (related papers to comprehensive guides), and format consolidation (slides + transcript to unified PDF)
Converts to Supported Formats (PPTX to PDF, XLSX to CSV, scanned to OCR)
Applies Flat Structure with descriptive snake_case naming
Removes Duplicates across formats
Splits Large Files exceeding 500k words into parts
Optimizes for RAG with smaller, focused documents for better retrieval

NotebookLM Supported Formats

Supported:

PDF (text-selectable, not scanned images)
Google Docs, Sheets (<100k tokens), Slides (<100 slides)
Microsoft Word (.docx)
Text files (.txt, .md)
Images (PNG, JPEG, TIFF, WEBP)
Audio (MP3, WAV, AAC, OGG with clear speech)
URLs (websites, YouTube, Google Drive links)
Copy-pasted text

Convert These:

PPTX to PDF
XLSX to CSV or Google Sheets
Scanned PDFs to OCR text-selectable PDF
Large Sheets to CSV (<100k tokens)

File Limits

Per Source:

500,000 words max
200MB file size max
No page limit (word limit matters)

Per Notebook (Free):

50 sources maximum -- HARD LIMIT
100 notebooks total

Prefer many smaller, focused documents over few large ones for better RAG retrieval. The 50-source limit is the primary optimization constraint.
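The per-source limits above can be checked mechanically once a file's word count and byte size are known (for example from pdftotext file.pdf - | wc -w and wc -c < file.pdf). The sketch below is a minimal illustration, not part of the skill's reference scripts: check_source_limits is a hypothetical helper name, and it assumes "200MB" means 200 x 1024 x 1024 bytes.

```shell
# Hypothetical helper: classify one source against the per-source limits.
# Assumption: "200MB" is treated as 200 * 1024 * 1024 bytes.
MAX_WORDS=500000
MAX_BYTES=$((200 * 1024 * 1024))

check_source_limits() {
  words=$1   # word count of the extracted text
  bytes=$2   # file size in bytes
  if [ "$words" -gt "$MAX_WORDS" ]; then
    echo "split: too many words"
  elif [ "$bytes" -gt "$MAX_BYTES" ]; then
    echo "split: file too large"
  else
    echo "ok"
  fi
}
```

Usage: check_source_limits 120000 1048576 prints ok, while a 600,000-word source is flagged for splitting.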

IMPORTANT: Preserve original file timestamps during all operations. Timestamps are essential for identifying the latest additions, recent meeting minutes, and key decisions. Run touch -r original converted after each conversion. Include dates in ISO order (year, month, day) in filenames, written as YYYY_MM_DD to match the snake_case naming pattern.

How to Use

Prepare these files for NotebookLM - convert formats and organize with descriptive names

Convert all PPTX and XLSX files to NotebookLM-compatible formats

Check if any files exceed NotebookLM's 500k word or 200MB limits

Organize this research folder for a NotebookLM knowledge base

Find duplicate content across different file formats

Split this large PDF into NotebookLM-compatible chunks

Instructions

When a user requests NotebookLM organization, follow these steps.

Step 1: Assess and Prioritize Sources

Count and evaluate before proceeding with any organization.

total_sources=$(find . -type f \( -name "*.pdf" -o -name "*.docx" -o -name "*.txt" -o -name "*.md" -o -name "*.csv" \) | wc -l)
echo "Total sources found: $total_sources"

If total exceeds 50:

Score all sources using the 4-dimension rubric (Relevance, Recency, Uniqueness, Density, each 0-10). See references/scoring-system.md for the full rubric, assessment commands, and batch scoring script.
Rank and select top candidates using the decision matrix. Target 35-40 auto-keep sources initially. See references/prioritization-strategy.md for the selection process and space-based adjustments.

Identify merge candidates -- find time-series patterns, topic clusters, and multi-format duplicates:

# Time-series opportunities
find . -name "*_20[0-9][0-9]_[0-9][0-9]_*" | \
  sed 's/_20[0-9][0-9]_[0-9][0-9]_[0-9][0-9]//' | sort | uniq -c | sort -rn

# Topic clusters
find . -type f -name "*.pdf" | xargs -I {} basename {} .pdf | \
  sed 's/_part_[0-9]*//;s/_[0-9][0-9]*$//' | sort | uniq -c | sort -rn | awk '$1 > 2'

Execute strategic merges using appropriate patterns. See references/merging-strategies.md for time-series, topic-based, and format consolidation scripts. Preserve timestamps on all merged outputs.
Recount and validate the final total is at or below 50 (ideally 48 to reserve slots for future additions).
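The score-and-rank steps above can be sketched as a small pipeline. This is not the batch scoring script from references/scoring-system.md; it is a minimal illustration that assumes the four 0-10 scores have already been assigned by hand and saved in a tab-separated file (path, relevance, recency, uniqueness, density). The function name rank_sources and the file name scores.tsv are hypothetical.

```shell
# Hypothetical helper: sum the four 0-10 scores per source (0-40 total),
# sort highest first, and keep the top N lines.
# Input on stdin, tab-separated: path, relevance, recency, uniqueness, density
rank_sources() {
  limit=$1
  awk -F '\t' '{ printf "%d\t%s\n", $2 + $3 + $4 + $5, $1 }' |
    sort -t "$(printf '\t')" -k1,1nr |
    head -n "$limit"
}

# Example (not run here): keep the 40 highest-scoring sources.
# rank_sources 40 < scores.tsv
```

Each output line is the total score, a tab, and the path, so sources scoring 35 or higher are easy to spot at the top of the list.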

Step 2: Understand the Scope

Ask clarifying questions:

What is the topic/purpose of this knowledge base?
Which directory contains the source materials?
Target: single notebook or multiple related notebooks?
Any files that must stay in original format?
Is this for research, learning, project documentation, or reference?

Step 3: Analyze Current State

Review files for NotebookLM compatibility:

find . -type f -exec file {} \;
find . -type f -exec du -h {} \; | sort -rh
find . -type f | sed 's/.*\.//' | sort | uniq -c | sort -rn
for f in *.pdf; do printf '%s\t%s\n' "$(pdftotext "$f" - | wc -w)" "$f"; done  # Word count per PDF

Categorize findings:

Compatible as-is: PDF, DOCX, TXT, MD, images
Needs conversion: PPTX, XLSX, XLS, PPT, scanned PDFs
Too large: Files >500k words or >200MB
Duplicates: Same content in different formats
Merge candidates: Sources identified for consolidation in Step 1
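The first two categories above can be assigned from the file extension alone. A minimal sketch follows; categorize_source is a hypothetical helper, the extension lists are taken from the supported and convert lists earlier in this document, and CSV is treated as compatible because it is the conversion target for spreadsheets. Scanned PDFs cannot be detected this way and still need the pdftotext check.

```shell
# Hypothetical helper: map a filename to a compatibility category by extension.
categorize_source() {
  # Lowercase the extension so "Deck.PPTX" and "deck.pptx" are treated alike.
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    pdf|docx|txt|md|csv)       echo "compatible" ;;
    png|jpg|jpeg|tiff|webp)    echo "compatible" ;;
    mp3|wav|aac|ogg)           echo "compatible" ;;
    pptx|ppt|xlsx|xls)         echo "needs_conversion" ;;
    *)                         echo "review" ;;
  esac
}
```

Anything that lands in review (including files with no extension) should be inspected by hand rather than silently dropped.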

Step 4: Convert Unsupported Formats

PowerPoint to PDF:

soffice --headless --convert-to pdf *.pptx
touch -r original.pptx converted.pdf  # Preserve timestamp

Excel to CSV:

soffice --headless --convert-to csv:"Text - txt - csv (StarCalc)":44,34,UTF8 *.xlsx
touch -r original.xlsx converted.csv  # Preserve timestamp

Scanned PDF to Searchable:

ocrmypdf input.pdf output_searchable.pdf
touch -r input.pdf output_searchable.pdf  # Preserve timestamp
pdftotext output_searchable.pdf - | wc -w  # Verify text extraction

WARNING: Always run touch -r original converted after every conversion to preserve the original file timestamp.
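One way to make the timestamp step hard to forget is to wrap each conversion so that touch -r runs automatically. The sketch below is an illustration under stated assumptions: convert_keep_mtime is a hypothetical helper, and the caller must pass the output path the converter will actually write.

```shell
# Hypothetical helper: run any converter command, then copy the source
# file's modification time onto the output file.
# Usage: convert_keep_mtime SOURCE OUTPUT COMMAND [ARGS...]
convert_keep_mtime() {
  src=$1
  out=$2
  shift 2
  # Only touch the output if the converter succeeded and the file exists.
  "$@" && [ -e "$out" ] && touch -r "$src" "$out"
}

# Example (not run here): soffice writes deck.pdf into the current directory.
# convert_keep_mtime deck.pptx deck.pdf soffice --headless --convert-to pdf deck.pptx
```

Because the converter is passed as ordinary arguments, the same wrapper works for soffice, ocrmypdf, or any other command used in this step.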

Step 5: Apply Naming

Use this pattern: category_topic_descriptor_YYYY_MM_DD.ext. Shorten the date to a year or quarter, or omit it, when no exact date applies.

Examples:

research_quantum_computing_basics_2025.pdf
meeting_notes_project_kickoff_2026_01_15.txt
client_proposal_acme_corp_final.docx
reference_api_documentation_v2.md
data_sales_figures_q4_2025.csv

See references/organization-scripts.md for the automated naming script. Preserve timestamps when renaming: use mv (preserves by default) and verify with stat.
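The snake_case part of the naming pattern can be applied mechanically. This is not the automated naming script from references/organization-scripts.md; it is a minimal sketch in which to_snake_case is a hypothetical helper. It only normalizes the existing name, so the category prefix and date still have to be chosen by a person or by the plan.

```shell
# Hypothetical helper: lowercase a filename and replace every run of
# characters other than a-z and 0-9 with a single underscore.
to_snake_case() {
  name=$1
  case "$name" in
    *.*) ext=".${name##*.}"; base=${name%.*} ;;
    *)   ext=""; base=$name ;;
  esac
  base=$(printf '%s' "$base" | tr '[:upper:]' '[:lower:]' |
    sed -e 's/[^a-z0-9]\{1,\}/_/g' -e 's/^_//' -e 's/_$//')
  ext=$(printf '%s' "$ext" | tr '[:upper:]' '[:lower:]')
  printf '%s%s\n' "$base" "$ext"
}

# Example (not run here): mv keeps the modification time on rename.
# mv "Project Kickoff - Notes.TXT" "$(to_snake_case 'Project Kickoff - Notes.TXT')"
```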

Step 6: Split Large Documents

For files >500k words or >200MB:

pdftotext document.pdf - | wc -w  # Check word count
pdftk large.pdf cat 1-500 output large_part_1.pdf
pdftk large.pdf cat 501-1000 output large_part_2.pdf
touch -r large.pdf large_part_1.pdf large_part_2.pdf  # Preserve timestamps

Name parts by content, not arbitrary numbers:

annual_report_2025_part_1_executive_summary.pdf
annual_report_2025_part_2_financials.pdf
annual_report_2025_part_3_appendices.pdf
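When a document has no usable section boundaries and has to be cut mechanically, the page ranges for pdftk can be generated rather than typed. A minimal sketch, assuming the total page count is already known (for example from pdfinfo); split_ranges is a hypothetical helper, and the resulting parts should still be renamed by content afterwards.

```shell
# Hypothetical helper: print page ranges of at most SIZE pages that
# together cover pages 1..TOTAL.
split_ranges() {
  total=$1
  size=$2
  start=1
  while [ "$start" -le "$total" ]; do
    end=$((start + size - 1))
    if [ "$end" -gt "$total" ]; then
      end=$total
    fi
    echo "${start}-${end}"
    start=$((end + 1))
  done
}

# Example (not run here): feed each range to pdftk.
# n=1
# for r in $(split_ranges 1200 500); do
#   pdftk large.pdf cat "$r" output "large_part_${n}.pdf"
#   touch -r large.pdf "large_part_${n}.pdf"
#   n=$((n + 1))
# done
```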

Step 7: Consolidation Pass

Perform strategic merging to optimize source count. This step is critical when merge candidates were identified in Step 1 or the collection is near the 50-source limit.

Merging is a primary optimization strategy, not a last resort. Three patterns apply:

Time-series: Combine chronological documents into period summaries (daily to monthly, weekly to quarterly)
Topic-based: Combine related papers/docs into comprehensive guides with chapter markers
Format consolidation: Combine slides + transcript + notes for the same event into a single PDF

See references/merging-strategies.md for full merge patterns, scripts (time-series merger, topic-based PDF merger), decision trees, and quality checks.

IMPORTANT: Preserve chronological timestamps in merged content. Add clear date headers within merged files so temporal context is not lost.

Log all merge decisions for inclusion in the organization plan.
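The time-series pattern with date headers can be sketched for plain-text notes as follows. This is not the merger script from references/merging-strategies.md; merge_month is a hypothetical helper that assumes daily files named prefix_YYYY_MM_DD.txt in the current directory and writes prefix_YYYY_MM.txt.

```shell
# Hypothetical helper: merge one month of daily text files into a single
# file, with a date header before each day's content.
merge_month() {
  prefix=$1   # e.g. meeting_notes
  month=$2    # e.g. 2026_01
  out="${prefix}_${month}.txt"
  newest=""
  : > "$out"
  # The glob expands in sorted order, which is chronological for these names.
  for f in "${prefix}_${month}"_[0-9][0-9].txt; do
    [ -e "$f" ] || continue   # the glob matched nothing
    day=${f%.txt}
    day=${day##*_}
    printf '=== %s-%s ===\n' "$(printf '%s' "$month" | tr '_' '-')" "$day" >> "$out"
    cat "$f" >> "$out"
    printf '\n' >> "$out"
    newest=$f
  done
  # Give the merged file the timestamp of the latest daily file.
  if [ -n "$newest" ]; then
    touch -r "$newest" "$out"
  fi
}
```

Running merge_month meeting_notes 2026_01 would turn a month of daily notes into one source while keeping each day visible under its own header.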

Step 8: Implement Flat Structure

NotebookLM works best with flat source lists, no nested folders.

Before:

docs/
  project/
    planning/
      requirements.pdf
    research/
      background.pdf
  reference/
    api_docs.pdf

After:

notebooklm_sources/
  project_requirements_2026.pdf
  project_background_research.pdf
  reference_api_documentation.pdf

See references/organization-scripts.md for the implementation script. Preserve timestamps when copying: use cp -p to maintain original dates.
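Folding the folder context into the filename can be done mechanically as a starting point. This is not the implementation script from references/organization-scripts.md; flat_name is a hypothetical helper, and the names it produces are rougher than the hand-refined ones in the After listing.

```shell
# Hypothetical helper: turn a nested path into a flat filename by dropping
# the root folder and joining the remaining path components with underscores.
flat_name() {
  root=${2%/}
  rel=${1#"$root"/}
  printf '%s\n' "$rel" | tr '/' '_'
}

# Example (not run here): copy every PDF under docs/ into one flat folder,
# keeping timestamps with cp -p.
# mkdir -p notebooklm_sources
# find docs -type f -name '*.pdf' | while IFS= read -r f; do
#   cp -p "$f" "notebooklm_sources/$(flat_name "$f" docs)"
# done
```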

Step 9: Find and Remove Duplicates

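# Exact duplicates: identical bytes (on macOS use md5 -r in place of md5sum)
find . -type f -exec md5sum {} + | sort | awk 'seen[$1]++'
# Same base name in different formats
find . -type f | sed 's|.*/||; s/\.[^.]*$//' | sort | uniq -d
# Same extracted text across PDFs: matching hashes sort next to each other
for pdf in *.pdf; do printf '%s  %s\n' "$(pdftotext "$pdf" - | md5sum | cut -d' ' -f1)" "$pdf"; done | sort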

Decision matrix:

Same content, different formats: keep PDF (best for NotebookLM)
Same content, different names: keep most descriptive name
Slight variations: merge into single document if <500k words
Truly duplicate: delete older version (check timestamps first)
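The first rule in the matrix (keep the PDF when the same content exists in several formats) can be turned into a candidate list. A minimal sketch: redundant_copies is a hypothetical helper that matches on base name only, so its output is a list of candidates to review, not files that are safe to delete without comparing content.

```shell
# Hypothetical helper: list non-PDF files whose base name also exists as a
# PDF in the same directory. Matching is by name only, not by content.
redundant_copies() {
  dir=${1:-.}
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    name=${f##*/}
    case "$name" in
      *.pdf) continue ;;
    esac
    stem=${name%.*}
    if [ -f "$dir/$stem.pdf" ]; then
      echo "$f"
    fi
  done
}
```

For a folder holding report.pdf, report.docx, and notes.txt, only report.docx is listed.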

Step 10: Optimize for RAG

NotebookLM uses RAG, which works best with focused documents:

Split 100-page documents into 3-5 topic-focused files
Separate chapters/sections into individual sources
Keep each source focused on one topic/subtopic
Prefer 20-50 pages per PDF over 200+ page megadocs

Instead of:
  company_handbook_500_pages.pdf

Create:
  handbook_code_of_conduct.pdf
  handbook_benefits_overview.pdf
  handbook_time_off_policy.pdf
  handbook_remote_work_guidelines.pdf
  handbook_career_development.pdf

Step 11: Propose Organization Plan

Present a plan to the user before making changes. The plan should cover current state, source selection strategy (if >50 sources), proposed structure, changes to make, and a compatibility check.

See references/organization-plan-template.md for the full template with sections for prioritization results, merge decisions, and final source count verification.

Step 12: Execute Organization

After user approval, execute all conversions, merges, renames, and structural changes. Log all operations.

See references/organization-scripts.md for the complete execution script with logging and limit verification. Run touch -r after every file operation to preserve original timestamps.
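Logging can be as simple as wrapping each command. This is not the execution script from references/organization-scripts.md; run_logged and the default log name organize.log are hypothetical, and the log format (timestamp, status, command, tab-separated) is an assumption.

```shell
# Hypothetical helper: run a command and append one tab-separated line
# (timestamp, ok/failed, command) to the log file.
LOG_FILE=${LOG_FILE:-organize.log}

run_logged() {
  if "$@"; then
    status=ok
  else
    status=failed
  fi
  printf '%s\t%s\t%s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$status" "$*" >> "$LOG_FILE"
  # Return success only if the wrapped command succeeded.
  [ "$status" = ok ]
}

# Example (not run here):
# run_logged cp -p docs/reference/api_docs.pdf notebooklm_sources/reference_api_documentation.pdf
```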

Step 13: Provide Upload Instructions

Provide the user with a summary of organized sources and upload instructions for NotebookLM (direct upload and Google Drive options).

See references/upload-guide.md for the full upload instructions template including maintenance guidance.

Examples

Example 1: Research Paper Collection

User: "Prepare my PhD research papers folder for NotebookLM"

Process:

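Finds 35 PDFs, 12 DOCX, 8 PPTX across nested folders (55 files, over the 50-source limit)
Converts 8 PPTX to PDF (preserves timestamps)
Identifies 2 papers >500k words, splits into parts
Scores sources and merges related items to bring the total back under 50 (Step 1)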
Renames: smith_2024.pdf to research_quantum_entanglement_smith_2024.pdf
Creates flat structure in phd_research_sources/
Result: 48 sources ready for upload

Example 2: Company Knowledge Base

User: "Convert our company wiki exports to NotebookLM format"

Split single 145-page PDF by section into 7 focused sources:

company_overview_history_mission.pdf (8 pages)
company_policies_hr_guidelines.pdf (28 pages)
company_product_documentation.pdf (45 pages)
(4 more topic-focused files)

Result: 7 focused sources instead of 1 large doc. Better RAG retrieval.

Example 3: Excel Data

User: "I have 10 Excel files with research data"

Convert each sheet to separate CSV. Name descriptively: data_survey_responses_2025.csv. Create overview doc: data_overview_methodology.txt. Preserve timestamps on all conversions.

Result: 10 XLSX to 23 CSV files + 1 overview doc.

Example 4: Conference Materials

User: "Organize my conference materials for a knowledge base"

Input: 12 MP3 recordings, 8 PPTX decks, 15 JPG notes, 5 PDFs. Keep MP3 as-is (NotebookLM transcribes on upload). Convert PPTX to PDF. Keep JPGs (NotebookLM reads handwriting via OCR). Apply naming: conf_session_title_speaker_date.ext. Preserve all timestamps.

Result: 40 sources in flat folder.

Example 5: Large Collection (200+ Sources)

For a complete workflow handling 200+ sources (e.g., reducing 237 sources to 48 with strategic merging), see references/large-collection-workflow.md.

Common Patterns

Academic Research

research_[topic]_[author]_[year].pdf
notes_[course]_[topic]_[date].md
textbook_[subject]_chapter_[n]_[title].pdf

Business Projects

project_[name]_requirements.pdf
project_[name]_timeline.csv
meeting_[project]_[date]_notes.txt
client_[name]_proposal_final.docx

Learning/Courses

course_[name]_lecture_[n]_[topic].pdf
course_[name]_readings_week_[n].pdf
course_[name]_assignment_[n].docx

Personal Knowledge Base

article_[topic]_[author]_[date].pdf
book_notes_[title]_[author].md
tutorial_[skill]_[topic].pdf
reference_[tool]_documentation.pdf

Pro Tips

Optimize for Search: Use descriptive names with search keywords. Good: tutorial_python_async_programming_advanced.pdf. Bad: tutorial_5.pdf.
Topic-Based Splitting: Split large docs by topic, not arbitrary page count. Good: handbook_benefits.pdf, handbook_policies.pdf. Bad: handbook_part_1.pdf, handbook_part_2.pdf.
Date Formatting: Use ISO order (year, month, day) for sortability, written with underscores to match snake_case. Good: meeting_notes_2026_02_04.txt. Bad: meeting_notes_feb_4_2026.txt.
Preserve Source Timestamps: Always maintain original file modification dates (touch -r copies the access and modification times, not the creation time). These enable accurate recency scoring and help NotebookLM's RAG weight recent meeting notes, decisions, and additions appropriately. Use touch -r original converted after every conversion.
Extract Text from Scans: Scanned PDFs do not work in NotebookLM. Test with pdftotext test.pdf - | head. If blank, run ocrmypdf input.pdf output.pdf.
Use Prefixes for Ordering: Add numeric prefixes for logical ordering: 01_project_overview.pdf, 02_project_requirements.pdf.
Test Before Bulk Upload: Upload 2-3 files first to verify processing, summaries, and search accuracy. Then upload the rest.

Best Practices Summary

Source Selection and Optimization:

Always assess total source count first before organizing
Use scoring rubric for objective prioritization (>50 sources)
Merge strategically as primary optimization, not last resort
Prefer quality over quantity: 48 great sources over 50 mediocre ones
Reserve 2-3 slots for future additions
Do not merge high-value unique sources (score 35+)
Do not combine unrelated topics just to hit limits

File Naming:

Descriptive snake_case with searchable terms and ISO dates
Keep under 100 characters, no spaces or special characters
Use dates instead of version numbers

Format Selection:

PDF for presentations and mixed content
CSV for spreadsheet data
DOCX/TXT/MD for text documents
Always convert PPTX and XLSX before upload

Timestamp Preservation:

Run touch -r original converted after every conversion
Use cp -p when copying files to preserve modification dates
Include ISO dates in filenames for explicit temporal context
Timestamps drive recency scoring and RAG relevance weighting

Organization Structure:

Flat structure (one folder, all files)
Descriptive names include folder context
Stay under 50 sources per notebook

Implementation Checklist

Phase 1: Assessment and Prioritization

Identify target notebook topic/purpose
Locate all source files and count total
If >50: run scoring rubric for all sources
If >50: identify and execute strategic merges
If >50: select top sources using decision matrix (target 48)
Check file formats, note conversions needed
Estimate word counts for large files

Phase 2: Conversion and Organization

Convert unsupported formats (preserve timestamps)
Apply descriptive snake_case naming
Split large documents by topic
Remove duplicates
Create flat output directory
Verify all files <200MB and <500k words
Verify final source count is at or below 50
Verify timestamps preserved on all converted/moved files

Phase 3: Upload and Verification

Document selection strategy in organization plan
Test upload 2-3 files
Upload remaining sources
Verify NotebookLM processing and summaries
Test search functionality
Confirm all key topics covered despite any source reduction
