Smart Backup System with Skill Integration

Supporting files in this directory:

MANIFEST_BACKUPS.md -- MANIFEST-aware intelligent backups

FULL_PROJECT_BACKUPS.md -- Full project backups, selective inclusion/exclusion, path verification

ADVANCED_USAGE.md -- Custom scripts, multiple file backups, real-world examples

When to Use This Skill

Use this skill when:

You are working on any project with files that change over time
The project contains Jupyter notebooks, data files (CSV/TSV), HackMD presentations, or a mix of these
You need intelligent cleanup before backup (clear outputs, remove debug code)
You want to track what changed and when (data provenance)
You need a professional backup workflow for collaboration or publication
You want context-aware backups that use other skills intelligently

The Problem

Long-running data enrichment projects risk:

Losing days of work from accidental overwrites
Being unable to revert to previous data states
Having no documentation of what changed and when
Running out of disk space from manual backups
Confusion about which version is current

Solution: Smart Two-Tier Backup System with Skill Integration

Core Features

Intelligent Detection - Automatically detects project type and files to back up
Skill Integration - Uses jupyter-notebook, hackmd, and other skills for pre-backup cleanup
Daily backups - Rolling 7-day window (auto-cleanup)
Milestone backups - Permanent, compressed (gzip ~80% reduction)
CHANGELOG - Automatic documentation of all changes
Session Integration - Prompts for backup when exiting a Claude Code session

Smart Detection & Integration

The backup system automatically detects your project type and applies appropriate cleanup:

Jupyter Notebooks (uses jupyter-notebook skill):

Detects: *.ipynb files
Pre-backup cleanup: Clear all cell outputs, remove cells tagged 'debug' or 'remove', validate notebooks

HackMD/Presentations (uses hackmd skill):

Detects: *.md files with slideOptions: frontmatter
Pre-backup cleanup: Validate SVG elements, check slide separators, verify YAML frontmatter

Data Files (native handling):

Detects: *.csv, *.tsv, *.xlsx files
Pre-backup cleanup: Validate file integrity, check for corruption

Python Projects (uses managing-environments skill):

Detects: requirements.txt, environment.yml, venv/, .venv/
Pre-backup cleanup: Remove .pyc, __pycache__, .pytest_cache, clean build artifacts

Mixed Projects: Detects all of the above and applies appropriate cleanup for each file type.
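
The detection rules above can be approximated with `find` and a few file tests. This is a minimal sketch, not the code in backup_project.sh: the function name `detect_project_types` and the labels it prints are hypothetical, and the HackMD check only looks for the string `slideOptions:` anywhere in a Markdown file rather than parsing the frontmatter.

```shell
# Hypothetical helper: print one label per project type found under a directory.
detect_project_types() {
    local dir="${1:-.}"

    # Jupyter notebooks: any *.ipynb outside the backups/ directory
    if find "$dir" -type d -name backups -prune -o -type f -name '*.ipynb' -print | grep -q .; then
        echo "jupyter"
    fi

    # HackMD presentations: Markdown files containing "slideOptions:"
    if find "$dir" -type d -name backups -prune -o -type f -name '*.md' \
            -exec grep -l 'slideOptions:' {} + | grep -q .; then
        echo "hackmd"
    fi

    # Data files: CSV, TSV, or XLSX
    if find "$dir" -type d -name backups -prune -o -type f \
            \( -name '*.csv' -o -name '*.tsv' -o -name '*.xlsx' \) -print | grep -q .; then
        echo "data"
    fi

    # Python projects: dependency files or a virtual environment at the top level
    if [ -f "$dir/requirements.txt" ] || [ -f "$dir/environment.yml" ] \
            || [ -d "$dir/venv" ] || [ -d "$dir/.venv" ]; then
        echo "python"
    fi
}
```

A mixed project simply prints several labels, and the caller can run the matching cleanup for each one.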

Directory Structure

For data-only projects:

project/
├── your_data_file.csv          # Main working file
├── backup_project.sh           # Smart backup script
└── backups/
    ├── daily/                  # Rolling 7-day backups
    ├── milestones/             # Permanent compressed backups
    ├── CHANGELOG.md            # Auto-generated change log
    └── README.md               # User documentation

For mixed projects (notebooks + data):

project/
├── analysis.ipynb              # Jupyter notebooks
├── data.csv                    # Data files
├── backup_project.sh           # Smart backup script
└── backups/
    ├── daily/                  # Rolling 7-day backups
    │   └── backup_2026-01-17/
    │       ├── notebooks/      # Cleaned (no outputs)
    │       └── data/
    ├── milestones/             # Permanent compressed backups
    ├── CHANGELOG.md
    └── README.md

Storage Efficiency

Daily backups: ~5.4 MB (7 days x 770KB)
Milestone backups: ~200KB each after compression (roughly 80% size reduction with gzip)
Total: <10 MB for complete project history
Auto-cleanup: Old daily backups are deleted after 7 days
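
One way to implement the rolling window is to compare the date embedded in each daily folder name against a cutoff. The sketch below assumes the `backup_YYYY-MM-DD` naming shown in the directory tree above. The function name `prune_daily_backups` and its arguments are hypothetical, not taken from backup_project.sh.

```shell
# Hypothetical sketch: delete daily backup folders older than a cutoff date.
# $1 = daily backup directory, $2 = oldest date to keep (YYYY-MM-DD)
prune_daily_backups() {
    local daily_dir="$1" cutoff="$2"
    local cutoff_num entry name date_num
    cutoff_num=$(printf '%s' "$cutoff" | tr -d '-')

    for entry in "$daily_dir"/backup_*; do
        [ -d "$entry" ] || continue          # also skips an unmatched glob
        name=${entry##*/backup_}             # e.g. 2026-01-17
        date_num=$(printf '%s' "$name" | tr -d '-')
        case "$date_num" in
            ''|*[!0-9]*) continue ;;         # ignore folders whose suffix is not a date
        esac
        if [ "$date_num" -lt "$cutoff_num" ]; then
            rm -rf -- "$entry"
            echo "Removed old daily backup: $name"
        fi
    done
}
```

With GNU date the cutoff for a 7-day window can be computed as `prune_daily_backups backups/daily "$(date -d '7 days ago' +%F)"`; on macOS/BSD the equivalent is `date -v-7d +%F`. Comparing folder names rather than modification times keeps the result stable even if a backup folder is touched later.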

Implementation

Quick Start with `/backup` Command

First time - Set up the backup system:

/backup

This will:

Detect your project type (notebooks, data files, presentations, etc.)
Set up appropriate backup scripts with smart cleanup
Create backup directory structure
Optionally configure automated backups

Daily usage - Create backups:

/backup                    # Daily backup with smart cleanup
/backup milestone "desc"   # Milestone backup
/backup list              # View all backups
/backup restore DATE      # Restore from backup

What Happens During Backup

Smart cleanup before backup:

Detects file types in your project
Applies skill-specific cleanup:
- Notebooks: Clear outputs, remove debug cells
- HackMD: Validate SVG, check formatting
- Python: Remove .pyc, __pycache__
- Data: Validate integrity
Creates organized backup with cleaned files
Updates CHANGELOG with what was backed up
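
As an illustration of one cleanup step from the list above, the Python case can be handled with two `find` calls. This is a sketch under assumptions: the function name `clean_python_artifacts` is hypothetical, and skipping `venv/` and `.venv/` is a design choice made here, not something the skill documents.

```shell
# Hypothetical sketch: remove Python caches and bytecode under a directory,
# leaving virtual environments untouched.
clean_python_artifacts() {
    local dir="${1:-.}"

    # Cache directories: -prune stops find from descending into a directory
    # that is about to be deleted
    find "$dir" -type d \( -name venv -o -name .venv \) -prune -o \
        -type d \( -name '__pycache__' -o -name '.pytest_cache' \) -prune \
        -exec rm -rf {} +

    # Stray bytecode files that live outside __pycache__
    find "$dir" -type d \( -name venv -o -name .venv \) -prune -o \
        -type f -name '*.pyc' -exec rm -f {} +
}
```

Running this against a staged copy of the project instead of the working tree would keep the cleanup from touching files you are still using, at the cost of an extra copy.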

Manual Script Usage (Alternative)

./backup_project.sh                           # Daily backup
./backup_project.sh milestone "description"   # Milestone
./backup_project.sh list                      # List backups
./backup_project.sh restore 2026-01-23        # Restore
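
For orientation, a milestone of a single data file could be produced roughly as follows. This is a sketch, not the script's actual implementation: the function name `create_milestone`, its arguments, and the `<file>_<date>.gz` naming are all assumptions made for illustration.

```shell
# Hypothetical sketch: write a permanent gzip-compressed copy of one file.
# $1 = file to back up, $2 = backup root (default: backups), $3 = date (default: today)
create_milestone() {
    local src="$1" backup_dir="${2:-backups}" today="${3:-$(date +%Y-%m-%d)}"
    local dest
    dest="$backup_dir/milestones/$(basename "$src")_$today.gz"

    [ -f "$src" ] || { echo "No such file: $src" >&2; return 1; }
    mkdir -p "$backup_dir/milestones"

    # Milestones are meant to be permanent, so never replace an existing one
    if [ -e "$dest" ]; then
        echo "Refusing to overwrite existing milestone: $dest" >&2
        return 1
    fi

    gzip -c -- "$src" > "$dest"   # -c leaves the working file untouched
    echo "$dest"
}
```

A milestone written this way can be inspected without unpacking it to disk, for example with `gzip -dc backups/milestones/your_data_file.csv_2026-01-23.gz | head`.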

When to Create Milestones

After adding new data sources (GenomeScope, karyotypes, external APIs)
Before major data transformations or filtering
When completing analysis sections
Before submitting/publishing
Before sharing with collaborators
After recovering missing data

Key Features

Safety Features

Never overwrites without asking - Prompts before overwriting existing backups
Safety backup before restore - Creates backup of current state before any restore
Automatic cleanup - Old daily backups auto-delete (configurable)
Complete audit trail - CHANGELOG tracks everything
Milestone protection - Important versions preserved forever (compressed)
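
The safety backup before restore can be pictured as a copy of the current file taken just before it is replaced. The sketch below is an assumption about how such a step might look: the function name `restore_with_safety` and the `.pre_restore_<timestamp>` suffix are hypothetical, and the interactive confirmation prompt is left out.

```shell
# Hypothetical sketch: keep a copy of the current file, then restore a backup over it.
# $1 = backup file (plain or .gz), $2 = working file to restore into
restore_with_safety() {
    local backup_file="$1" target="$2"
    local stamp safety

    [ -f "$backup_file" ] || { echo "No such backup: $backup_file" >&2; return 1; }

    # Preserve the current state first, so the restore itself can be undone
    if [ -e "$target" ]; then
        stamp=$(date +%Y%m%d_%H%M%S)
        safety="$target.pre_restore_$stamp"
        cp -p -- "$target" "$safety"
        echo "Saved current state to $safety"
    fi

    case "$backup_file" in
        *.gz) gzip -dc -- "$backup_file" > "$target" ;;   # compressed milestone
        *)    cp -p -- "$backup_file" "$target" ;;        # plain daily copy
    esac
}
```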

CHANGELOG Tracking

The CHANGELOG.md automatically documents:

Date of each backup
Type (daily vs milestone)
Description of changes (for milestones)
Major modifications made to data

Example CHANGELOG:

## 2026-01-23
- **MILESTONE**: Recovered VGP accessions (backup created)
  - Added columns: `accession_recovered`, `accession_recovered_all`
  - Recovered 5 VGP accessions from NCBI
- Daily backup created at 2026-01-23 15:00:00

## 2026-01-22
- Enriched GenomeScope data for 21 species from AWS repository
- Added column: `genomescope_path` with direct links to summary files
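
Entries in this format can be written with a small logging helper. The version below is a simplified sketch: the name `log_backup` is hypothetical, and it appends new dates at the end of the file, whereas the example above lists the newest date first.

```shell
# Hypothetical sketch: append one bullet to the CHANGELOG under a dated heading.
# $1 = changelog path, $2 = message, $3 = date (default: today)
log_backup() {
    local changelog="$1" message="$2" today="${3:-$(date +%Y-%m-%d)}"

    mkdir -p "$(dirname "$changelog")"

    # Start a new dated section only if this date has no heading yet
    if ! grep -qxF "## $today" "$changelog" 2>/dev/null; then
        if [ -s "$changelog" ]; then
            echo >> "$changelog"             # blank line between sections
        fi
        echo "## $today" >> "$changelog"
    fi

    printf '%s\n' "- $message" >> "$changelog"
}
```

For example, `log_backup backups/CHANGELOG.md "**MILESTONE**: Recovered VGP accessions (backup created)"` adds a milestone bullet under today's heading, creating the heading on the first call of the day.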

Using `/backup` Command

Setup mode (first run): /backup -- Detects project type, sets up scripts, creates directory structure.

Daily backup mode: /backup -- Quick daily backup.

Milestone mode: /backup milestone "description of changes" -- e.g., /backup milestone "added heterozygosity data"

List and restore:

/backup list              # Show all available backups
/backup restore 2026-01-23 # Restore from specific date

Configuration: Edit backup_project.sh to change retention days (default: 7), backup directory location, or custom cleanup rules.
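
The settings mentioned here would typically sit near the top of the script. The block below is only a guess at their shape: `DAYS_TO_KEEP` is the variable named in Troubleshooting, while `BACKUP_DIR`, `DAILY_DIR`, and `MILESTONE_DIR` are hypothetical names. Check your generated backup_project.sh for the real ones.

```shell
# Hypothetical settings block; each value can be overridden from the environment.
DAYS_TO_KEEP="${DAYS_TO_KEEP:-7}"        # retention for daily backups
BACKUP_DIR="${BACKUP_DIR:-backups}"      # backup root, relative to the project
DAILY_DIR="$BACKUP_DIR/daily"
MILESTONE_DIR="$BACKUP_DIR/milestones"

# Fail early on a retention value that is not a whole number
case "$DAYS_TO_KEEP" in
    ''|*[!0-9]*)
        echo "DAYS_TO_KEEP must be a whole number, got: $DAYS_TO_KEEP" >&2
        exit 1
        ;;
esac
```

If the script follows this pattern, a one-off run with shorter retention would look like `DAYS_TO_KEEP=3 ./backup_project.sh`, with no edit to the file.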

Benefits for Data Analysis

Data Provenance: CHANGELOG documents every modification; clear audit trail for methods sections in papers
Confidence to Experiment: Easy rollback encourages trying different approaches safely
Professional Workflow: Matches publication standards; reviewers can verify data processing steps
Collaboration-Ready: Team members can understand data history and enrichment process

Session Integration with `/safe-exit`

When you end a Claude Code session with /safe-exit, the system automatically:

Detects whether a backup system exists in the current project
Prompts for a backup if the system is configured (daily, milestone, skip, or cancel)
Performs cleanup and backup if requested
Prompts for Obsidian session summary (if obsidian skill is available)
Exits session cleanly
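
The first two steps can be sketched in shell. How /safe-exit is actually implemented is not documented here, so treat this as an assumption: both function names are hypothetical, and the check only looks for the script and the backups/ directory described earlier.

```shell
# Hypothetical sketch: is a backup system set up in this project?
has_backup_system() {
    local dir="${1:-.}"
    [ -f "$dir/backup_project.sh" ] && [ -d "$dir/backups" ]
}

# Hypothetical sketch: read one answer from stdin and map it to an action.
# Anything unrecognized, including end of input, is treated as cancel.
choose_exit_action() {
    local answer
    printf 'Backup before exit? [d]aily / [m]ilestone / [s]kip / [c]ancel: ' >&2
    read -r answer || answer=c
    case "$answer" in
        d|daily)     echo daily ;;
        m|milestone) echo milestone ;;
        s|skip)      echo skip ;;
        *)           echo cancel ;;
    esac
}
```

Defaulting to cancel means a mistyped answer keeps the session open rather than exiting without a backup.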

This ensures you never forget to back up and document your work at the end of a session.

Example Workflow

Monday Morning

/backup                          # Daily backup with smart cleanup
# Work on notebooks and data enrichment all day
/backup milestone "added karyotype data for 50 new species"

End of session

/safe-exit
# Prompted: daily backup -> backup complete -> session summary -> exit

Friday (oops, made a mistake!)

/backup list                     # Check available backups
/backup restore 2026-01-21       # Restore from Wednesday

MANIFEST-Aware Backups

For projects with MANIFEST files, use intelligent backups that include only essential files. See MANIFEST_BACKUPS.md for the full pattern, script templates, inclusion/exclusion rules, and integration with the /backup command.

Full Project Backups

For projects where both code and data change, selective full-project backups capture the complete state without bloat. See FULL_PROJECT_BACKUPS.md for implementation patterns, backup strategy comparison, size benchmarks, and path verification guidance.

Advanced Usage

For custom backup script templates, handling multiple files, viewing compressed milestones, and real-world examples, see ADVANCED_USAGE.md.

Best Practices

Create daily backups at session start - Make it a habit
Milestone after every major change - Don't rely on memory
Use descriptive milestone names - "added genomescope" not "updates"
Check CHANGELOG before sharing - Verify data provenance is clear
List backups periodically - Ensure auto-cleanup is working
Test restore once - Verify you know how to recover

Troubleshooting

Backup script not found

ls -l backup_project.sh   # Check if backup system is set up
/backup                    # Set up if needed

Disk space running low

du -sh backups/            # Check backup sizes
# Reduce retention: set DAYS_TO_KEEP=3 in backup_project.sh
# Manually clean old milestones if needed

CHANGELOG getting too large

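# Keep the last 100 lines as the active log and archive the full history
tail -n 100 backups/CHANGELOG.md > backups/CHANGELOG_recent.md
mv backups/CHANGELOG.md backups/CHANGELOG_archive.md   # replaces any earlier archive
mv backups/CHANGELOG_recent.md backups/CHANGELOG.md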

Summary

Two-tier system: Daily rolling + permanent milestones
Storage efficient: Gzip compression (~80% reduction)
Auto-cleanup: 7-day rolling window for dailies
Complete audit trail: CHANGELOG tracks all changes
Safety first: Never overwrites without confirmation
Global installer: Use across all projects
Professional workflow: Publication-ready data provenance
