csv-excel-merger

SKILL.md

CSV/Excel Merger

Intelligently merge multiple CSV or Excel files with automatic column matching and data deduplication.

Instructions

When a user needs to merge CSV or Excel files:

  1. Analyze Input Files:

    • How many files need to be merged?
    • What format (CSV, Excel, TSV)?
    • Are the files provided or need to be read from disk?
    • Do columns have the same names across files?
    • What is the primary key (unique identifier)?
  2. Inspect File Structures:

    • Read headers from each file
    • Identify column names and data types
    • Detect encoding (UTF-8, Latin-1, etc.)
    • Check for missing columns
    • Look for duplicate column names
  3. Create Merge Strategy:

    Column Matching:

    • Exact name match: "email" = "email"
    • Case-insensitive: "Email" = "email"
    • Fuzzy match: "E-mail" ≈ "email"
    • Common patterns:
      • "first_name", "firstname", "First Name" → "first_name"
      • "phone", "phone_number", "tel" → "phone"
      • "email", "e-mail", "email_address" → "email"

    Conflict Resolution (when same record appears in multiple files):

    • Keep first: Use value from first file
    • Keep last: Use value from last file (most recent)
    • Keep longest: Use most complete value
    • Manual review: Flag conflicts for user review
    • Merge: Combine non-conflicting fields

    Deduplication:

    • Identify duplicate rows based on primary key
    • Options: keep first, keep last, keep all, merge values
    • Track source file for each row
  4. Perform Merge:

    # Example merge logic
    import pandas as pd
    
    # Read files
    df1 = pd.read_csv('file1.csv')
    df2 = pd.read_csv('file2.csv')
    
    # Normalize column names
    df1.columns = df1.columns.str.lower().str.strip()
    df2.columns = df2.columns.str.lower().str.strip()
    
    # Map similar columns
    column_mapping = {
        'firstname': 'first_name',
        'e_mail': 'email',
        # ...
    }
    df2 = df2.rename(columns=column_mapping)
    
    # Merge
    merged = pd.concat([df1, df2], ignore_index=True)
    
    # Deduplicate
    merged = merged.drop_duplicates(subset=['email'], keep='last')
    
    # Save
    merged.to_csv('merged_output.csv', index=False)
    
  5. Format Output:

    📊 CSV/EXCEL MERGER REPORT
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    📁 INPUT FILES
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    File 1: contacts_jan.csv
      Rows: 1,245
      Columns: 8 (name, email, phone, company, ...)
    
    File 2: contacts_feb.csv
      Rows: 987
      Columns: 9 (firstname, lastname, email, mobile, ...)
    
    File 3: leads_export.xlsx
      Rows: 2,103
      Columns: 12 (full_name, email_address, phone, ...)
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    🔄 COLUMN MAPPING
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    Unified Schema:
    • first_name ← [firstname, first name, fname]
    • last_name ← [lastname, last name, lname]
    • email ← [email, e-mail, email_address]
    • phone ← [phone, mobile, phone_number, tel]
    • company ← [company, organization, org]
    • title ← [title, job_title, position]
    • source ← [file origin tracking]
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    🔍 MERGE ANALYSIS
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    Total rows before merge: 4,335
    Duplicate records found: 892
    Conflicts detected: 47
    
    Deduplication Strategy: Keep most recent (by source file date)
    Primary Key: email
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    ⚠️ CONFLICTS
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    Record: john.doe@example.com
      File 1 phone: (555) 123-4567
      File 2 phone: (555) 987-6543
      Resolution: Kept most recent (File 2)
    
    [List top 10 conflicts]
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    ✅ MERGE RESULTS
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    Output File: merged_contacts.csv
    Total Rows: 3,443
    Columns: 7
    Duplicates Removed: 892
    
    Breakdown by Source:
    • contacts_jan.csv: 1,245 rows (398 unique)
    • contacts_feb.csv: 987 rows (521 unique)
    • leads_export.xlsx: 2,103 rows (2,524 unique)
    
    Data Quality:
    • Email completeness: 98.2%
    • Phone completeness: 87.5%
    • Company completeness: 91.3%
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    💡 RECOMMENDATIONS
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    • Review 47 conflict records manually
    • Standardize phone number format
    • Fill missing company names (8.7% incomplete)
    • Export conflicts to: conflicts_review.csv
    
  6. Handle Special Cases:

    Multiple Primary Keys:

    • Use compound keys: (email + company)
    • Offer options when ambiguous

    Different Data Types:

    • Convert dates to standard format
    • Normalize phone numbers
    • Standardize country codes
    • Clean whitespace and casing

    Missing Columns:

    • Fill with empty values
    • Flag missing data
    • Offer to create new columns

    Large Files:

    • Use chunking for files > 100MB
    • Show progress indicator
    • Estimate memory usage
  7. Generate Code: Provide Python/pandas script that:

    • Reads all files
    • Performs intelligent column matching
    • Deduplicates based on strategy
    • Resolves conflicts
    • Saves merged output
    • Generates detailed report
  8. Export Options:

    • CSV (UTF-8)
    • Excel (.xlsx)
    • JSON
    • SQL INSERT statements
    • Parquet (for large datasets)

Example Triggers

  • "Merge these three CSV files"
  • "Combine multiple Excel sheets into one file"
  • "Deduplicate and merge customer data"
  • "Join spreadsheets with different column names"
  • "Consolidate contact lists from different sources"

Best Practices

Column Matching:

  • Use fuzzy matching for similar names
  • Maintain original column name mapping report
  • Allow manual override of auto-matching

Data Quality:

  • Trim whitespace
  • Standardize formats (phone, email, dates)
  • Detect and flag invalid data
  • Preserve data types

Performance:

  • Use chunking for large files
  • Process in batches
  • Show progress for long operations
  • Optimize memory usage

Transparency:

  • Log all merge decisions
  • Track source file for each row
  • Report conflicts and resolutions
  • Generate detailed merge report

Output Quality

Ensure merges:

  • Intelligently match columns
  • Handle different schemas
  • Deduplicate properly
  • Preserve data integrity
  • Flag conflicts for review
  • Generate comprehensive report
  • Maintain data quality
  • Track data lineage (source)
  • Handle edge cases gracefully
  • Provide validation statistics

Generate clean, deduplicated merged files with full transparency and data quality checks.

Weekly Installs
5
Installed on
windsurf3
codex3
claude-code3
antigravity3
gemini-cli3
opencode2