file-summarization

File Summarization

Apply this methodology when summarizing files of any type. This skill provides the routing logic and type-specific strategies for faithful file summarization.

Pre-Summarization Assessment

Before summarizing any file, the model MUST:

Read the file - Use the Read tool to access the actual content. Never guess from the filename.
Assess size - Run $CLAUDE_PLUGIN_ROOT/scripts/file_metrics.py to determine word count and file type. If the script is unavailable, use the Read tool and manually estimate word count from line count.
Select strategy - Based on size thresholds from the table below.
Verify file type - Use file extension and content inspection to determine which type-specific strategy to apply.
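The extension half of the "Verify file type" step can be sketched as a lookup against the extension lists given under File Type Strategies. This is a minimal illustration, not part of the skill itself: the function name `route_by_extension` and the strategy labels are hypothetical, and content inspection is still required afterwards (for example, `.json` is routed to configuration here even though it may hold data).

```python
from pathlib import Path

# Extension sets copied from the File Type Strategies section.
# The strategy labels on the left are hypothetical names for this sketch.
STRATEGY_BY_EXTENSION = {
    "code": {".py", ".js", ".ts", ".jsx", ".tsx", ".rs", ".go", ".java", ".c",
             ".cpp", ".h", ".rb", ".php", ".swift", ".kt", ".scala", ".sh",
             ".bash", ".zsh"},
    "configuration": {".json", ".yaml", ".yml", ".toml", ".ini", ".env",
                      ".conf", ".cfg", ".properties"},
    "data": {".csv", ".tsv", ".parquet", ".jsonl", ".ndjson"},
    "documentation": {".md", ".rst", ".txt", ".adoc", ".org"},
}


def route_by_extension(path: str) -> str:
    """Return a provisional strategy label based on the file extension only."""
    p = Path(path)
    # Dotfiles such as ".env" have an empty suffix, so fall back to the name.
    suffix = p.suffix.lower() or p.name.lower()
    for strategy, extensions in STRATEGY_BY_EXTENSION.items():
        if suffix in extensions:
            return strategy
    # Listed binary extensions and unrecognized ones share one strategy.
    return "binary-or-unknown"
```

The result is only a starting point: the extension can be wrong, so the label should be confirmed against the file's actual content before a strategy is applied.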

Size-Based Strategy Selection

| File Size | Strategy | Approach |
| --- | --- | --- |
| Small (< 2,000 words) | Full read with extractive summarization | Read entire file, extract key passages, summarize from extracts |
| Medium (2,000-10,000 words) | Section-based extraction | Read full file, identify sections/modules, extract from each section, synthesize |
| Large (> 10,000 words) | Chunk and map-reduce | Split into chunks, summarize each chunk, synthesize chunk summaries |

SOURCE: Size thresholds adapted from Anthropic knowledge-synthesis skill (knowledge-work-plugins repository, accessed 2026-02-06). Strategy patterns informed by Map-Reduce Summarization methodology.
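The thresholds in the table can be expressed as a small selection function. The sketch below is an assumption-laden illustration: it does not reproduce `file_metrics.py`, the strategy labels are hypothetical, and the 2,000-word chunk budget in `chunk_by_lines` is an arbitrary choice made for the example. Chunks are split on line boundaries so that line-number references survive into the chunk summaries.

```python
def estimate_word_count(text: str) -> int:
    """Count whitespace-separated tokens as a rough word count."""
    return len(text.split())


def select_strategy(word_count: int) -> str:
    """Map a word count onto the three size tiers from the table."""
    if word_count < 2_000:
        return "full-read-extractive"
    if word_count <= 10_000:
        return "section-based-extraction"
    return "chunk-map-reduce"


def chunk_by_lines(text: str, max_words: int = 2_000) -> list:
    """Split text into (start_line, end_line, chunk_text) tuples.

    A chunk is closed as soon as adding the next line would exceed
    max_words; a single oversized line still becomes its own chunk.
    """
    chunks = []
    current, start, words = [], 1, 0
    for line_no, line in enumerate(text.splitlines(), start=1):
        line_words = len(line.split())
        if current and words + line_words > max_words:
            chunks.append((start, line_no - 1, "\n".join(current)))
            current, start, words = [], line_no, 0
        current.append(line)
        words += line_words
    if current:
        chunks.append((start, start + len(current) - 1, "\n".join(current)))
    return chunks
```

Only the splitting step is shown; summarizing each chunk and synthesizing the chunk summaries remain the model's job.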

File Type Strategies

Code Files

File extensions: .py, .js, .ts, .jsx, .tsx, .rs, .go, .java, .c, .cpp, .h, .rb, .php, .swift, .kt, .scala, .sh, .bash, .zsh

The model MUST extract:

Imports/dependencies - List external modules and standard library imports
Structure - Classes, functions, methods with signatures
Purpose - Inferred from docstrings, comments, function names
Key logic - Core algorithms, state machines, data transformations
Entry points - main(), CLI argument parsing, exported functions
Configuration - Environment variables, config file references

Extraction method: Read sequentially. Capture top-level definitions with their line numbers. Extract docstrings verbatim. Quote complex logic rather than paraphrasing.
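For Python sources specifically, the "top-level definitions with their line numbers" step can be approximated with the standard-library `ast` module. This is a sketch under stated assumptions: it covers Python only, the function name `outline_python_source` is hypothetical, and other languages would need their own parser.

```python
import ast


def outline_python_source(source: str) -> list:
    """List top-level classes and functions with line ranges and docstrings."""
    tree = ast.parse(source)
    outline = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            outline.append({
                "kind": "class" if isinstance(node, ast.ClassDef) else "function",
                "name": node.name,
                # end_lineno is available on Python 3.8 and later.
                "lines": (node.lineno, node.end_lineno),
                # clean=False keeps the docstring text as written.
                "docstring": ast.get_docstring(node, clean=False),
            })
    return outline
```

Each entry carries what a "What Was Found" bullet needs: the kind, the name, the line range, and the docstring to quote.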

Example summary structure:

## Summary

Python module for HTTP client authentication. Implements JWT token refresh flow with retry logic. Exports `AuthClient` class and `refresh_token()` function.

## What Was Found

- Class `AuthClient` (lines 15-87): JWT-based HTTP client with automatic token refresh
- Function `refresh_token()` (lines 92-105): Retries up to 3 times on 401 errors
- Dependencies: `httpx`, `jwt`, `tenacity` (lines 1-3)
- Environment variables: `AUTH_BASE_URL`, `AUTH_CLIENT_ID` (lines 10-11)

## What Was NOT Found

- No test coverage information in this file
- No error handling for network failures
- Configuration schema not documented

Configuration Files

File extensions: .json, .yaml, .yml, .toml, .ini, .env, .conf, .cfg, .properties

The model MUST extract:

Top-level keys - All root keys with their value types
Nested structure - Hierarchy depth and organization
Settings categories - Group keys by purpose if clear
Notable values - Endpoints, file paths, feature flags, credentials (note presence, do not expose values)
Validation constraints - Type requirements, enums, ranges if documented

Extraction method: Parse structure. For small files, include all keys. For large files, sample representative sections and note structure patterns.
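As an illustration of "parse structure" combined with the rule to note credentials without exposing them, the sketch below walks a JSON configuration and reports each key path with its value type. It assumes a JSON object at the top level; the function name `describe_config` and the `SENSITIVE_MARKERS` list are hypothetical and would need tuning for a real project. YAML or TOML input would require a different parser.

```python
import json

# Hypothetical substrings that suggest a key holds a credential.
SENSITIVE_MARKERS = ("password", "secret", "token", "api_key", "credential")


def describe_config(text: str) -> list:
    """Return 'key.path: type' lines, withholding credential-like values."""
    data = json.loads(text)
    lines = []

    def walk(node: dict, prefix: str) -> None:
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if isinstance(value, dict):
                walk(value, path)
            elif any(marker in key.lower() for marker in SENSITIVE_MARKERS):
                # Note presence only; never copy the value into the summary.
                lines.append(f"{path}: <credential present, value withheld>")
            else:
                lines.append(f"{path}: {type(value).__name__}")

    walk(data, "")
    return lines
```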

Example summary structure:

## Summary

Application configuration in YAML format. Defines database connection, API endpoints, feature flags, and logging settings. 47 configuration keys across 5 top-level sections.

## What Was Found

- `database.host`, `database.port`, `database.name` (lines 2-4): PostgreSQL connection settings
- `api.base_url`, `api.timeout` (lines 7-8): External API configuration
- `features.experimental_mode: false` (line 12): Feature flag for beta features
- `logging.level: INFO`, `logging.format` (lines 15-16): Logging configuration

## What Was NOT Found

- No schema validation rules present
- No environment-specific overrides documented
- API authentication credentials not in this file

Data Files

File extensions: .csv, .tsv, .parquet, .json (when data-structured), .jsonl, .ndjson

The model MUST extract:

Row count - Exact number of records
Column names - All column headers
Data types - Inferred from first N rows
Sample values - Representative examples from each column
Missing data - Columns with null/empty values
Unique identifiers - Primary key columns if evident

Extraction method: For CSV/TSV, read the header row and the first 10 data rows to infer types and sample values, and count all lines to get the exact row count. For Parquet, note that binary inspection is limited. For JSON, inspect array structure.
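A minimal sketch of the CSV case using the standard-library `csv` module follows. It assumes a non-empty, comma-delimited file with a header row; the function names and the three type labels are hypothetical. Types are inferred from the first `sample_size` rows only, while the row count and missing-value check stream over every row.

```python
import csv
import io


def infer_type(values: list) -> str:
    """Guess a column type from sample strings: integer, float, or string."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "unknown"
    for label, cast in (("integer", int), ("float", float)):
        try:
            for value in non_empty:
                cast(value)
            return label
        except ValueError:
            continue
    return "string"


def profile_csv(text: str, sample_size: int = 10) -> dict:
    """Profile CSV text: columns, exact row count, sampled types, missing data."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # assumes the file has a header row
    sample, row_count, missing = [], 0, set()
    for row in reader:
        row_count += 1
        if len(sample) < sample_size:
            sample.append(row)
        for name, value in zip(header, row):
            if value.strip() == "":
                missing.add(name)
    types = {
        name: infer_type([row[i] for row in sample if i < len(row)])
        for i, name in enumerate(header)
    }
    return {
        "columns": header,
        "row_count": row_count,
        "types": types,
        "columns_with_missing": sorted(missing),
    }
```

Because the types come from a sample, a summary built on this should say so rather than present them as verified for the whole file.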

Example summary structure:

## Summary

CSV file containing user activity logs. 1,247 rows with 8 columns. Timestamps range from 2025-01-01 to 2026-02-06. No missing values detected.

## What Was Found

- Column `user_id` (integer): User identifiers, range 1001-5432
- Column `timestamp` (ISO 8601): Activity timestamps
- Column `action` (string): Values include "login", "logout", "view_page", "click_button"
- Column `duration_ms` (integer): Range 0-45000
- 1,247 total records (line count: 1,248 including header)

## What Was NOT Found

- No schema documentation in file
- Column `referrer` is present but not documented
- No indication of data collection methodology

Documentation Files

File extensions: .md, .rst, .txt, .adoc, .org

The model MUST extract:

Topic hierarchy - Top-level headings and structure
Key sections - Main topics covered
Commands/examples - Code blocks, shell commands, API calls
Links - External references and internal cross-references
Definitions - Technical terms defined in the text

Extraction method: Read sequentially. Extract headings to build table of contents. Quote key passages that define core concepts. Note code examples.
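The "extract headings to build table of contents" step can be sketched for Markdown files as below. It assumes ATX-style headings (`#` through `######`) and skips lines inside fenced code blocks so that shell comments are not mistaken for headings; reStructuredText, AsciiDoc, and Org files use different heading syntax and are not handled. The function name `extract_headings` is hypothetical.

```python
import re

FENCE = "`" * 3  # fenced code block delimiter, built indirectly
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def extract_headings(text: str) -> list:
    """Return (line_number, level, title) for each Markdown ATX heading."""
    headings = []
    in_fence = False
    for line_no, line in enumerate(text.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            headings.append((line_no, len(match.group(1)), match.group(2)))
    return headings
```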

Example summary structure:

## Summary

User guide for deploying containerized applications. Covers Docker setup, image building, registry configuration, and troubleshooting. 5 main sections with 23 subsections. Includes 12 shell command examples.

## What Was Found

- Section "Getting Started" (lines 10-45): Docker installation on Linux and macOS
- Section "Building Images" (lines 47-89): Dockerfile syntax and multi-stage builds
- Section "Troubleshooting" (lines 200-245): Common errors with solutions
- 12 shell command examples throughout document

## What Was NOT Found

- No Windows deployment instructions
- Security best practices not covered
- Performance tuning section mentioned but not written (line 15: "TODO")

Binary and Unknown Files

File extensions: .pdf, .zip, .tar, .gz, .bin, .exe, .so, .dylib, .dll, or unrecognized extensions

The model MUST:

Attempt to read - Use the Read tool. If the tool returns binary content or an error, note this.
State limitation - Do NOT guess contents. State: "Binary file, cannot extract text content."
Provide file metadata - File size, extension, location.
For PDFs: Use the Read tool with its pages parameter to extract text from specific page ranges. Summarize the text content only if extraction succeeds.
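Outside the Read tool, a rough way to decide whether a file is binary and to collect the metadata listed above is sketched here. The heuristic (a NUL byte or invalid UTF-8 in the first 8 KiB) is an assumption chosen for the example, not something the skill prescribes, and the function names are hypothetical.

```python
import codecs
import os


def looks_binary(path: str, probe_bytes: int = 8192) -> bool:
    """Heuristic: treat a file as binary if its head has NUL bytes or bad UTF-8."""
    with open(path, "rb") as handle:
        chunk = handle.read(probe_bytes)
    if b"\x00" in chunk:
        return True
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte character cut off by the probe limit.
        decoder.decode(chunk, final=False)
    except UnicodeDecodeError:
        return True
    return False


def describe_file(path: str) -> dict:
    """Collect the metadata a binary-file summary reports instead of contents."""
    return {
        "path": path,
        "size_bytes": os.path.getsize(path),
        "extension": os.path.splitext(path)[1],
        "binary": looks_binary(path),
    }
```

Text in a non-UTF-8 encoding would be flagged as binary by this check, so a positive result should be reported as a limitation rather than a fact about the file's purpose.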

Example for unreadable binary:

## Summary

Binary file, cannot extract text content.

## What Was Found

- File path: ./build/output.bin
- File size: 2.3 MB
- Extension: .bin

## What Was NOT Found

Unable to determine contents without binary inspection tools.

## Uncertain

File may be compiled binary, compressed archive, or proprietary format.

Quote-Grounding Technique

For all text-based files, the model MUST apply the quote-grounding technique:

First pass - Read file, identify key passages
Extract - Copy exact quotes with line numbers
Organize extracts - Group by theme or importance
Summarize from extracts - Write summary grounded in the extracted quotes
Verify - Ensure every claim in summary traces to an extract

SOURCE: Technique adapted from Fidelity Rules Rule 2 (lines 27-41).
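The Verify step can be made mechanical when extracts are stored as line number and quote pairs. The sketch below is a simplified illustration that assumes each quote sits on a single line; multi-line quotes would need a range check instead. The function name `verify_extracts` is hypothetical.

```python
def verify_extracts(source_text: str, extracts: list) -> list:
    """Return the (line_number, quote) pairs that do not match the source.

    An extract passes only if the quote appears verbatim on the cited line.
    """
    lines = source_text.splitlines()
    failures = []
    for line_no, quote in extracts:
        in_range = 1 <= line_no <= len(lines)
        if not in_range or quote not in lines[line_no - 1]:
            failures.append((line_no, quote))
    return failures
```

An empty result means every extract is grounded; any returned pair points at a claim that should be corrected or dropped before the summary is written.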

Output Format

All file summaries MUST use the structured output format defined in Structured Summary.

Required sections:

YAML frontmatter - Include source_type: file, source_path, method, confidence, word counts
Summary - Condensed content (BLUF style)
What Was Found - Items discovered with line number references
What Was NOT Found - Expected items that were absent
Uncertain - Ambiguous items requiring interpretation
Sources - Full file path, access date

Fidelity Rules

The model MUST follow all fidelity rules defined in Fidelity Rules.

Critical rules for file summarization:

Rule 1: Read the file before summarizing. Never guess from filename.
Rule 2: Extract before abstracting. Identify key passages first.
Rule 3: Preserve counts and specifics. "7 functions" not "several functions."
Rule 4: Distinguish absence from nonexistence. "Not in file" not "doesn't exist."
Rule 6: State confidence explicitly. Full read of small file = high confidence. Truncated large file = medium/low confidence.

Multi-File Summarization

When the user requests summarization of multiple files:

Summarize each file individually using this methodology
Write each summary to a separate output file or section
Do NOT merge file summaries into a single combined summary without explicit user request
If synthesis across files is requested, load the multi-source-synthesis skill after completing individual summaries

SOURCE: Multi-source synthesis approach from Summarizer lines 33-37.

Error Handling

If a file cannot be read:

Attempt to read with the Read tool
If read fails, report the error: "Unable to read [file path]: [error message]"
Do NOT speculate about file contents
Do NOT proceed with summarization
Ask user if they want to try alternative access methods
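The report-and-stop behavior above can be sketched in Python, with plain `open` standing in for the Read tool. This is only an analogy: the function name `read_or_report` is hypothetical, and the error text is whatever the operating system returns.

```python
from typing import Optional, Tuple


def read_or_report(path: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (content, None) on success or (None, error_report) on failure."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read(), None
    except (OSError, UnicodeDecodeError) as exc:
        # Report the failure verbatim; do not speculate about the contents.
        return None, f"Unable to read {path}: {exc}"
```

A caller that receives an error report should stop and ask the user how to proceed instead of continuing with summarization.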

Output Rendering

Read template - Load the template file at ../summarizer/templates/{format_id}.md (default: structured). The template defines the schema, required sections, and fidelity constraints for the selected format.
Render - Produce output following the template's Schema section. Use the template's Example as a reference for structure and style.
Verify fidelity - Confirm the output satisfies the template's Fidelity Constraints and all applicable Fidelity Rules.
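Resolving the template location from a format identifier can be sketched as follows. The path pattern and the `structured` default come from the step above; the function name `template_path` and the character check on `format_id` are assumptions added so that an identifier cannot escape the templates directory.

```python
import re
from pathlib import PurePosixPath


def template_path(format_id: str = "structured") -> str:
    """Build the relative template path for a given format identifier."""
    # Assumed restriction: identifiers are letters, digits, '_' or '-'.
    if not re.fullmatch(r"[A-Za-z0-9_-]+", format_id):
        raise ValueError(f"Invalid format_id: {format_id!r}")
    return str(PurePosixPath("..") / "summarizer" / "templates" / f"{format_id}.md")
```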

Anti-Patterns

The model MUST NOT:

Summarize a file based on its name without reading it
Guess file contents from directory structure or naming conventions
Assume file type from extension without verifying contents
Summarize from partial reads (head/tail/grep) without disclosing the limitation
Upgrade "not found in file" to "file doesn't contain" in a way that implies certainty about what the file should contain or about whether the item exists elsewhere
Present interpretation as observation
Skip the "What Was NOT Found" section
Omit line number references for key findings

Repository

jamie-bitflight…e_skills

First Seen

Mar 29, 2026
