multi-source-data-merger
Multi-Source Data Merger
Overview
This skill guides the process of merging data from multiple sources with different formats into a unified dataset. It covers reading heterogeneous file formats, applying field name mappings, resolving conflicts using priority ordering, and generating comprehensive output files including conflict reports.
Workflow
Step 1: Analyze Requirements and Source Files
Before writing any code, thoroughly understand the task:
- Identify all source files and their formats (JSON, CSV, Parquet, XML, etc.)
- Determine the merge key (e.g., user_id, record_id) that links records across sources
- Review field mapping requirements - source fields may have different names that map to common output fields
- Understand conflict resolution rules - typically based on source priority ordering
- Identify expected output formats and structure
Important: Do not attempt to read binary formats (Parquet, Excel, etc.) as text files - use appropriate libraries.
Step 2: Set Up Environment
- Create a Python virtual environment using uv or venv
- Install required dependencies based on source formats:
  - pandas - core data manipulation
  - pyarrow - Parquet file support
  - openpyxl - Excel file support
  - lxml - XML parsing (if needed)
- Verify installations before proceeding
Example environment setup:
uv venv .venv
source .venv/bin/activate
uv pip install pandas pyarrow
Step 3: Write the Merge Script
Structure the script with clear separation of concerns:
- Data reading functions - One per format type
- Field mapping function - Apply column renames
- Data normalization - Handle date formats, type conversions
- Merge logic - Combine records using the merge key
- Conflict resolution - Apply priority rules
- Output generation - Write merged data and conflict reports
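A minimal sketch of the reading layer under this structure, assuming pandas as the common in-memory representation (the SOURCES registry, file paths, and formats are illustrative, not prescribed):
import pandas as pd

# Illustrative registry: one entry per source file
SOURCES = {
    "source_a": {"path": "data/source_a.json", "format": "json"},
    "source_b": {"path": "data/source_b.csv", "format": "csv"},
    "source_c": {"path": "data/source_c.parquet", "format": "parquet"},
}

def read_source(path, fmt):
    # One reader per format; binary formats go through the proper pandas engine
    readers = {
        "json": pd.read_json,
        "csv": pd.read_csv,
        "parquet": pd.read_parquet,
    }
    if fmt not in readers:
        raise ValueError(f"Unsupported format: {fmt}")
    return readers[fmt](path)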
Script quality practices:
- Validate syntax before execution: python -m py_compile script.py
- Use try-except blocks with informative error messages
- Document assumptions about data formats
Step 4: Execute and Verify
Run a comprehensive verification process:
- Check output file existence at expected locations
- Validate merged data contains expected values
- Verify conflict report structure and content
- Run any provided test suites
Common Pitfalls
Binary File Handling
- Mistake: Attempting to read Parquet/Excel files as text
- Solution: Always use pandas with the appropriate engine (pyarrow for Parquet, openpyxl for Excel)
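For example (file names are placeholders):
import pandas as pd

df_parquet = pd.read_parquet("source.parquet", engine="pyarrow")  # requires pyarrow installed
df_excel = pd.read_excel("source.xlsx", engine="openpyxl")        # requires openpyxl installed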
Syntax Errors in Scripts
- Mistake: Writing long scripts without validation, leading to indentation or syntax errors
- Solution: Run python -m py_compile script.py before execution
Date Format Normalization
- Mistake: Assuming consistent date formats across sources
- Solution: Implement flexible date parsing that handles multiple formats, as in the sketch below:
  - ISO format: 2024-01-15
  - US format: 01/15/2024
  - European format: 15-01-2024
  - Datetime: 2024-01-15T10:30:00
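A minimal sketch of such a parser, trying the most specific layout first (the format strings are assumptions derived from the examples above):
from datetime import datetime

# Candidate formats, most specific first so full timestamps are not misread
DATE_FORMATS = [
    "%Y-%m-%dT%H:%M:%S",  # 2024-01-15T10:30:00
    "%Y-%m-%d",           # 2024-01-15
    "%m/%d/%Y",           # 01/15/2024
    "%d-%m-%Y",           # 15-01-2024
]

def parse_date(value):
    # Return a datetime for the first format that matches, or None if nothing parses
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(str(value), fmt)
        except ValueError:
            continue
    return None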
Incomplete Script Output
- Mistake: Writing very long scripts that may get truncated
- Solution: Break the script into modular functions and verify the complete code is present before running it
Environment Path Issues
- Mistake: Repeating PATH exports in every command
- Solution: Set PATH once in a setup step or use absolute paths to executables
Verification Strategies
Output Validation Checklist
- File existence check:
import os
assert os.path.exists("output/merged_data.json")
assert os.path.exists("output/conflict_report.json")
- Data completeness check:
import json
with open("output/merged_data.json") as f:
    data = json.load(f)
# Verify expected record count
assert len(data) == expected_count
- Conflict report validation:
with open("output/conflict_report.json") as f:
    conflicts = json.load(f)
# Verify conflict structure has required fields
for conflict in conflicts:
    assert "field" in conflict
    assert "selected" in conflict
    assert "sources" in conflict
- Sample value verification:
# Spot-check specific merged records
record = next(r for r in data if r["user_id"] == "expected_id")
assert record["field_name"] == "expected_value"
Consolidate Verification
Instead of running multiple separate verification commands, create a single comprehensive test script that validates all aspects of the output.
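A sketch of such a script, assuming the output paths and conflict-report fields from the checklist above (expected_count is a placeholder):
import json
import os

def verify_outputs(expected_count):
    # 1. File existence
    for path in ("output/merged_data.json", "output/conflict_report.json"):
        assert os.path.exists(path), f"Missing output file: {path}"
    # 2. Data completeness
    with open("output/merged_data.json") as f:
        data = json.load(f)
    assert len(data) == expected_count, f"Expected {expected_count} records, got {len(data)}"
    # 3. Conflict report structure
    with open("output/conflict_report.json") as f:
        conflicts = json.load(f)
    for conflict in conflicts:
        for key in ("field", "selected", "sources"):
            assert key in conflict, f"Conflict entry missing '{key}'"
    print("All verification checks passed")

if __name__ == "__main__":
    verify_outputs(expected_count=100)  # placeholder count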
Edge Cases to Consider
- Empty source files - Handle gracefully with appropriate warnings
- Missing merge keys - Decide whether to skip or error
- Type mismatches - Convert consistently (e.g., user_id as string vs integer; see the sketch after this list)
- Null/None values - Determine handling in conflict resolution
- Unicode/encoding - Specify encoding when reading text-based formats
- Records in some sources but not others - Include partial records or require complete matches
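As one example of consistent type handling, a sketch that normalizes the merge key to a string and marks empty values as missing (the user_id column name follows the examples in this document):
import pandas as pd

def normalize_merge_key(df, key="user_id"):
    # Coerce the key to string so 42 and "42" match across sources
    df = df.copy()
    df[key] = df[key].astype("string").str.strip()
    # Treat empty strings as missing so downstream logic can skip or flag them
    df[key] = df[key].replace("", pd.NA)
    return df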
Field Mapping Example
When sources have different field names for the same concept:
FIELD_MAPPINGS = {
    "source_a": {
        "firstName": "first_name",
        "lastName": "last_name",
        "emailAddress": "email"
    },
    "source_b": {
        "fname": "first_name",
        "lname": "last_name",
        "mail": "email"
    }
}

def apply_mapping(df, source_name):
    # Rename source-specific columns to the unified schema; unmapped columns pass through unchanged
    mapping = FIELD_MAPPINGS.get(source_name, {})
    return df.rename(columns=mapping)
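Applied per source before merging, this yields frames that share one schema (the file paths are placeholders):
import pandas as pd

df_a = apply_mapping(pd.read_json("data/source_a.json"), "source_a")
df_b = apply_mapping(pd.read_csv("data/source_b.csv"), "source_b")
combined = pd.concat([df_a, df_b], ignore_index=True)  # columns now share unified names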
Conflict Resolution Pattern
When the same field has different values across sources:
def resolve_conflict(values_by_source, priority_order):
    """
    Select value based on source priority.

    Args:
        values_by_source: dict mapping source name to value
        priority_order: list of source names from highest to lowest priority

    Returns:
        tuple: (selected_value, conflict_info)
    """
    conflict_info = None
    unique_values = set(v for v in values_by_source.values() if v is not None)
    if len(unique_values) > 1:
        conflict_info = {
            "sources": values_by_source,
            "resolved_by": "priority"
        }
    for source in priority_order:
        if source in values_by_source and values_by_source[source] is not None:
            return values_by_source[source], conflict_info
    return None, conflict_info
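A usage example under the same assumptions (the source names, priority order, and email field are illustrative):
# source_a outranks source_b, so its value wins; the disagreement is still recorded
values = {"source_a": "alice@example.com", "source_b": "a.smith@example.com"}
value, conflict = resolve_conflict(values, ["source_a", "source_b"])
assert value == "alice@example.com"
if conflict is not None:
    # Fill in the remaining fields expected by the conflict-report schema above
    conflict.update({"field": "email", "selected": value})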