ai-cleaning-data


Use DSPy to normalize and fix messy data fields at scale. The core pattern is: messy field value + field type/context → cleaned value + confidence. It lets you handle inconsistent addresses, company names, dates, phone numbers, and free-text fields without writing a rule for every edge case.

The most effective approach: sample anomalies first, infer normalization rules, then apply deterministically where possible and use the LM only for ambiguous cases.
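The rule-first, LM-as-fallback flow can be sketched as follows. This is a minimal illustration, not the skill's own implementation: the US phone target format, the function names, and the `llm_clean` callable (a stand-in for whatever LM-backed cleaner you build) are all assumptions made for the example.

```python
import re
from typing import Callable, Optional, Tuple

# Assumed target format for this sketch: US numbers as +1XXXXXXXXXX.
def clean_phone_rule(raw: str) -> Optional[str]:
    """Deterministic path: return a normalized number, or None if ambiguous."""
    digits = re.sub(r"\D", "", raw)  # drop everything except digits
    if len(digits) == 10:
        return "+1" + digits
    if len(digits) == 11 and digits.startswith("1"):
        return "+" + digits
    return None  # extensions, short numbers, etc. are left for the LM


def clean_phone(
    raw: str,
    llm_clean: Callable[[str], Tuple[str, float]],  # hypothetical LM-backed cleaner
) -> Tuple[str, float, str]:
    """Return (cleaned value, confidence, source), trying the rule first."""
    ruled = clean_phone_rule(raw)
    if ruled is not None:
        return ruled, 1.0, "rule"  # treating rule hits as certain is an assumption
    cleaned, confidence = llm_clean(raw)
    return cleaned, confidence, "lm"


def fake_llm(raw: str) -> Tuple[str, float]:
    """Placeholder so the sketch runs without a model; swap in a real cleaner."""
    return raw.strip(), 0.5
```

Recording the source ("rule" or "lm") alongside each value makes it easy to audit only the rows the LM touched.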

Step 1 - Understand the Cleaning Task

Before writing code, clarify:

  • What fields need cleaning? (addresses, phone numbers, dates, company names, free-text?)
  • What inconsistencies exist? (typos, format variations, abbreviations, mixed languages?)
  • What is the target format? Always define this explicitly; otherwise the LM improvises
  • How many rows? This determines whether to call the LM on every row or to infer rules once and apply them deterministically
  • Is there a gold standard? Even 50 manually-cleaned examples make optimization possible
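One way to answer the "what inconsistencies exist" question before writing any cleaner is to profile the column by format shape. The sketch below is an illustrative assumption rather than part of the skill: `shape` and `profile` are hypothetical helper names, and the digit/letter collapsing scheme is just one simple choice.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def shape(value: str) -> str:
    """Collapse a value to its format shape: digits become 9, letters become A."""
    collapsed = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", collapsed)


def profile(values: Iterable[str], top: int = 5) -> List[Tuple[str, int]]:
    """Count the most common shapes so rare formats stand out as anomalies."""
    return Counter(shape(v.strip()) for v in values).most_common(top)


if __name__ == "__main__":
    dates = ["2024-01-05", "2023-12-31", "01/05/2024", "Jan 5, 2024"]
    for pattern, count in profile(dates):
        print(f"{count:>4}  {pattern}")
```

Shapes that cover most rows are candidates for deterministic rules; the long tail of rare shapes is where an LM call is more likely to be worth its cost.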

Step 2 - Build a Single-Field Cleaner

Start with one field type. The signature takes the messy value plus explicit format instructions.
