multi-source-data-merger
Multi-Source Data Merger
Overview
This skill guides the process of merging data from multiple sources with different formats into a unified dataset. It covers reading heterogeneous file formats, applying field name mappings, resolving conflicts using priority ordering, and generating comprehensive output files including conflict reports.
Workflow
Step 1: Analyze Requirements and Source Files
Before writing any code, thoroughly understand the task:
- Identify all source files and their formats (JSON, CSV, Parquet, XML, etc.)
- Determine the merge key (e.g., user_id, record_id) that links records across sources
- Review field mapping requirements - source fields may have different names that map to common output fields
- Understand conflict resolution rules - typically based on source priority ordering
- Identify expected output formats and structure
Important: Do not attempt to read binary formats (Parquet, Excel, etc.) as text files - use appropriate libraries.
Step 2: Set Up Environment
- Create a Python virtual environment using uv or venv
- Install required dependencies based on source formats:
  - pandas - core data manipulation
  - pyarrow - Parquet file support
  - openpyxl - Excel file support
  - lxml - XML parsing (if needed)
- Verify installations before proceeding
Example environment setup:
uv venv .venv
source .venv/bin/activate
uv pip install pandas pyarrow
Step 3: Write the Merge Script
Structure the script with clear separation of concerns:
- Data reading functions - One per format type
- Field mapping function - Apply column renames
- Data normalization - Handle date formats, type conversions
- Merge logic - Combine records using the merge key
- Conflict resolution - Apply priority rules
- Output generation - Write merged data and conflict reports
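A minimal sketch of the reading layer under this structure, assuming pandas as the common in-memory representation (the SOURCES registry, file paths, and formats are illustrative, not prescribed):
import pandas as pd

# Illustrative registry: one entry per source file
SOURCES = {
    "source_a": {"path": "data/source_a.json", "format": "json"},
    "source_b": {"path": "data/source_b.csv", "format": "csv"},
    "source_c": {"path": "data/source_c.parquet", "format": "parquet"},
}

def read_source(path, fmt):
    # One reader per format; binary formats go through the proper pandas engine
    readers = {
        "json": pd.read_json,
        "csv": pd.read_csv,
        "parquet": pd.read_parquet,
    }
    if fmt not in readers:
        raise ValueError(f"Unsupported format: {fmt}")
    return readers[fmt](path)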
Script quality practices:
- Validate syntax before execution: python -m py_compile script.py
- Use try-except blocks with informative error messages
- Document assumptions about data formats
Step 4: Execute and Verify
Run a comprehensive verification process:
- Check output file existence at expected locations
- Validate merged data contains expected values
- Verify conflict report structure and content
- Run any provided test suites
Common Pitfalls
Binary File Handling
- Mistake: Attempting to read Parquet/Excel files as text
- Solution: Always use pandas with the appropriate engine (pyarrow for Parquet, openpyxl for Excel)
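For example (file names are placeholders):
import pandas as pd

df_parquet = pd.read_parquet("source.parquet", engine="pyarrow")  # requires pyarrow installed
df_excel = pd.read_excel("source.xlsx", engine="openpyxl")        # requires openpyxl installed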
Syntax Errors in Scripts
- Mistake: Writing long scripts without validation, leading to indentation or syntax errors
- Solution: Run python -m py_compile script.py before execution
Date Format Normalization
- Mistake: Assuming consistent date formats across sources
- Solution: Implement flexible date parsing that handles multiple formats, as in the sketch below:
  - ISO format: 2024-01-15
  - US format: 01/15/2024
  - European format: 15-01-2024
  - Datetime: 2024-01-15T10:30:00
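A minimal sketch of such a parser, trying the most specific layout first (the format strings are assumptions derived from the examples above):
from datetime import datetime

# Candidate formats, most specific first so full timestamps are not misread
DATE_FORMATS = [
    "%Y-%m-%dT%H:%M:%S",  # 2024-01-15T10:30:00
    "%Y-%m-%d",           # 2024-01-15
    "%m/%d/%Y",           # 01/15/2024
    "%d-%m-%Y",           # 15-01-2024
]

def parse_date(value):
    # Return a datetime for the first format that matches, or None if nothing parses
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(str(value), fmt)
        except ValueError:
            continue
    return None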
Incomplete Script Output
- Mistake: Writing very long scripts that may get truncated
- Solution: Break the script into modular functions and verify the complete code is present before running it
Environment Path Issues
- Mistake: Repeating PATH exports in every command
- Solution: Set PATH once in a setup step or use absolute paths to executables
Verification Strategies
Output Validation Checklist
- File existence check:
import os
assert os.path.exists("output/merged_data.json")
assert os.path.exists("output/conflict_report.json")
- Data completeness check:
import json
with open("output/merged_data.json") as f:
    data = json.load(f)
# Verify expected record count
assert len(data) == expected_count
- Conflict report validation:
with open("output/conflict_report.json") as f:
    conflicts = json.load(f)
# Verify conflict structure has required fields
for conflict in conflicts:
    assert "field" in conflict
    assert "selected" in conflict
    assert "sources" in conflict
- Sample value verification:
# Spot-check specific merged records
record = next(r for r in data if r["user_id"] == "expected_id")
assert record["field_name"] == "expected_value"
Consolidate Verification
Instead of running multiple separate verification commands, create a single comprehensive test script that validates all aspects of the output.
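A sketch of such a script, assuming the output paths and conflict-report fields from the checklist above (expected_count is a placeholder):
import json
import os

def verify_outputs(expected_count):
    # 1. File existence
    for path in ("output/merged_data.json", "output/conflict_report.json"):
        assert os.path.exists(path), f"Missing output file: {path}"
    # 2. Data completeness
    with open("output/merged_data.json") as f:
        data = json.load(f)
    assert len(data) == expected_count, f"Expected {expected_count} records, got {len(data)}"
    # 3. Conflict report structure
    with open("output/conflict_report.json") as f:
        conflicts = json.load(f)
    for conflict in conflicts:
        for key in ("field", "selected", "sources"):
            assert key in conflict, f"Conflict entry missing '{key}'"
    print("All verification checks passed")

if __name__ == "__main__":
    verify_outputs(expected_count=100)  # placeholder count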
Edge Cases to Consider
- Empty source files - Handle gracefully with appropriate warnings
- Missing merge keys - Decide whether to skip or error
- Type mismatches - Convert consistently (e.g., user_id as string vs integer; see the sketch after this list)
- Null/None values - Determine handling in conflict resolution
- Unicode/encoding - Specify encoding when reading text-based formats
- Records in some sources but not others - Include partial records or require complete matches
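As one example of consistent type handling, a sketch that normalizes the merge key to a string and marks empty values as missing (the user_id column name follows the examples in this document):
import pandas as pd

def normalize_merge_key(df, key="user_id"):
    # Coerce the key to string so 42 and "42" match across sources
    df = df.copy()
    df[key] = df[key].astype("string").str.strip()
    # Treat empty strings as missing so downstream logic can skip or flag them
    df[key] = df[key].replace("", pd.NA)
    return df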
Field Mapping Example
When sources have different field names for the same concept:
FIELD_MAPPINGS = {
    "source_a": {
        "firstName": "first_name",
        "lastName": "last_name",
        "emailAddress": "email"
    },
    "source_b": {
        "fname": "first_name",
        "lname": "last_name",
        "mail": "email"
    }
}

def apply_mapping(df, source_name):
    # Rename source-specific columns to the unified schema; unmapped columns pass through unchanged
    mapping = FIELD_MAPPINGS.get(source_name, {})
    return df.rename(columns=mapping)
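Applied per source before merging, this yields frames that share one schema (the file paths are placeholders):
import pandas as pd

df_a = apply_mapping(pd.read_json("data/source_a.json"), "source_a")
df_b = apply_mapping(pd.read_csv("data/source_b.csv"), "source_b")
combined = pd.concat([df_a, df_b], ignore_index=True)  # columns now share unified names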
Conflict Resolution Pattern
When the same field has different values across sources:
def resolve_conflict(values_by_source, priority_order):
    """
    Select value based on source priority.

    Args:
        values_by_source: dict mapping source name to value
        priority_order: list of source names from highest to lowest priority

    Returns:
        tuple: (selected_value, conflict_info)
    """
    conflict_info = None
    unique_values = set(v for v in values_by_source.values() if v is not None)
    if len(unique_values) > 1:
        conflict_info = {
            "sources": values_by_source,
            "resolved_by": "priority"
        }
    for source in priority_order:
        if source in values_by_source and values_by_source[source] is not None:
            return values_by_source[source], conflict_info
    return None, conflict_info
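A usage example under the same assumptions (the source names, priority order, and email field are illustrative):
# source_a outranks source_b, so its value wins; the disagreement is still recorded
values = {"source_a": "alice@example.com", "source_b": "a.smith@example.com"}
value, conflict = resolve_conflict(values, ["source_a", "source_b"])
assert value == "alice@example.com"
if conflict is not None:
    # Fill in the remaining fields expected by the conflict-report schema above
    conflict.update({"field": "email", "selected": value})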