NYC
skills/letta-ai/skills/multi-source-data-merger

multi-source-data-merger

SKILL.md

Multi Source Data Merger

Overview

This skill guides the process of merging data from multiple sources with different formats into a unified dataset. It covers reading heterogeneous file formats, applying field name mappings, resolving conflicts using priority ordering, and generating comprehensive output files including conflict reports.

Workflow

Step 1: Analyze Requirements and Source Files

Before writing any code, thoroughly understand the task:

  1. Identify all source files and their formats (JSON, CSV, Parquet, XML, etc.)
  2. Determine the merge key (e.g., user_id, record_id) that links records across sources
  3. Review field mapping requirements - source fields may have different names that map to common output fields
  4. Understand conflict resolution rules - typically based on source priority ordering
  5. Identify expected output formats and structure

Important: Do not attempt to read binary formats (Parquet, Excel, etc.) as text files - use appropriate libraries.

Step 2: Set Up Environment

  1. Create a Python virtual environment using uv or venv
  2. Install required dependencies based on source formats:
    • pandas - Core data manipulation
    • pyarrow - Parquet file support
    • openpyxl - Excel file support
    • lxml - XML parsing (if needed)
  3. Verify installations before proceeding

Example environment setup:

uv venv .venv
source .venv/bin/activate
uv pip install pandas pyarrow

Step 3: Write the Merge Script

Structure the script with clear separation of concerns:

  1. Data reading functions - One per format type
  2. Field mapping function - Apply column renames
  3. Data normalization - Handle date formats, type conversions
  4. Merge logic - Combine records using the merge key
  5. Conflict resolution - Apply priority rules
  6. Output generation - Write merged data and conflict reports

Script quality practices:

  • Validate syntax before execution: python -m py_compile script.py
  • Use try-except blocks with informative error messages
  • Document assumptions about data formats

Step 4: Execute and Verify

Run a comprehensive verification process:

  1. Check output file existence at expected locations
  2. Validate merged data contains expected values
  3. Verify conflict report structure and content
  4. Run any provided test suites

Common Pitfalls

Binary File Handling

  • Mistake: Attempting to read Parquet/Excel files as text
  • Solution: Always use pandas with appropriate engine (pyarrow for Parquet, openpyxl for Excel)

Syntax Errors in Scripts

  • Mistake: Writing long scripts without validation, leading to indentation or syntax errors
  • Solution: Run python -m py_compile script.py before execution

Date Format Normalization

  • Mistake: Assuming consistent date formats across sources
  • Solution: Implement flexible date parsing that handles multiple formats:
    • ISO format: 2024-01-15
    • US format: 01/15/2024
    • European format: 15-01-2024
    • Datetime: 2024-01-15T10:30:00

Incomplete Script Output

  • Mistake: Writing very long scripts that may get truncated
  • Solution: Break into modular functions, verify complete code visibility

Environment Path Issues

  • Mistake: Repeating PATH exports in every command
  • Solution: Set PATH once in a setup step or use absolute paths to executables

Verification Strategies

Output Validation Checklist

  1. File existence check:

    import os
    assert os.path.exists("output/merged_data.json")
    assert os.path.exists("output/conflict_report.json")
    
  2. Data completeness check:

    import json
    with open("output/merged_data.json") as f:
        data = json.load(f)
    # Verify expected record count
    assert len(data) == expected_count
    
  3. Conflict report validation:

    with open("output/conflict_report.json") as f:
        conflicts = json.load(f)
    # Verify conflict structure has required fields
    for conflict in conflicts:
        assert "field" in conflict
        assert "selected" in conflict
        assert "sources" in conflict
    
  4. Sample value verification:

    # Spot-check specific merged records
    record = next(r for r in data if r["user_id"] == "expected_id")
    assert record["field_name"] == "expected_value"
    

Consolidate Verification

Instead of running multiple separate verification commands, create a single comprehensive test script that validates all aspects of the output.

Edge Cases to Consider

  • Empty source files - Handle gracefully with appropriate warnings
  • Missing merge keys - Decide whether to skip or error
  • Type mismatches - Convert consistently (e.g., user_id as string vs integer)
  • Null/None values - Determine handling in conflict resolution
  • Unicode/encoding - Specify encoding when reading text-based formats
  • Records in some sources but not others - Include partial records or require complete matches

Field Mapping Example

When sources have different field names for the same concept:

FIELD_MAPPINGS = {
    "source_a": {
        "firstName": "first_name",
        "lastName": "last_name",
        "emailAddress": "email"
    },
    "source_b": {
        "fname": "first_name",
        "lname": "last_name",
        "mail": "email"
    }
}

def apply_mapping(df, source_name):
    mapping = FIELD_MAPPINGS.get(source_name, {})
    return df.rename(columns=mapping)

Conflict Resolution Pattern

When the same field has different values across sources:

def resolve_conflict(values_by_source, priority_order):
    """
    Select value based on source priority.

    Args:
        values_by_source: dict mapping source name to value
        priority_order: list of source names from highest to lowest priority

    Returns:
        tuple: (selected_value, conflict_info)
    """
    conflict_info = None
    unique_values = set(v for v in values_by_source.values() if v is not None)

    if len(unique_values) > 1:
        conflict_info = {
            "sources": values_by_source,
            "resolved_by": "priority"
        }

    for source in priority_order:
        if source in values_by_source and values_by_source[source] is not None:
            return values_by_source[source], conflict_info

    return None, conflict_info
Weekly Installs
18
Repository
letta-ai/skills
First Seen
Jan 24, 2026
Installed on
claude-code13
codex12
gemini-cli12
opencode12
antigravity11
cursor9