Data Cleaning
This skill enables an AI agent to systematically clean and preprocess raw datasets into analysis-ready form. The agent handles missing values, duplicate records, data type mismatches, inconsistent formats, outlier treatment, and normalization. It can also enforce validation schemas to ensure ongoing data quality. The primary toolchain is pandas with support from pyjanitor and great_expectations for advanced validation.
Workflow
- Ingest and profile the raw data. Load the dataset and immediately generate a quality report: count nulls per column, identify duplicate rows, check data types against the expected schema, and flag columns with mixed types. This profile drives every subsequent cleaning decision.
- Handle missing values. Apply a strategy per column based on data type and missingness pattern. For numeric columns with less than 5% missing, use median imputation. For categorical columns, use the mode or a dedicated "Unknown" category. For columns missing more than 40% of their values, flag them for potential removal and consult the user before dropping.
- Remove duplicates and resolve conflicts. Identify exact duplicates and near-duplicates (e.g., rows differing only in whitespace or casing). For exact duplicates, keep the first occurrence. For near-duplicates, apply fuzzy matching with a configurable similarity threshold and merge conflicting values by recency or completeness.
- Correct data types and standardize formats. Coerce columns to their intended types: parse date strings into datetime objects, convert numeric strings to floats, and normalize categorical values to a canonical form. Standardize formats such as phone numbers, postal codes, and currency representations.
- Detect and treat outliers. Use the IQR method (1.5x IQR fences) for skewed or unknown distributions and z-scores for normally distributed data. Offer three treatment options: cap at the boundary values (winsorization), replace with null for later imputation, or flag-only mode that annotates but preserves the original values.
- Validate the cleaned output. Run the cleaned dataset through validation rules: non-null constraints, range checks, uniqueness constraints, and referential integrity. Report any remaining violations and save the clean dataset alongside a cleaning log that documents every transformation applied.
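The profiling step can be sketched in plain pandas; the `profile` helper and its report keys are illustrative names, not part of any library:

```python
import pandas as pd

def profile(df: pd.DataFrame) -> dict:
    """Quality report: nulls per column, duplicate rows, dtypes, mixed-type columns."""
    mixed = [
        c for c in df.columns
        if df[c].dropna().map(type).nunique() > 1  # e.g. strings mixed with ints
    ]
    return {
        "nulls": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "dtypes": {c: str(t) for c, t in df.dtypes.items()},
        "mixed_type_columns": mixed,
    }

# "x" holds both a string and ints, and two rows are exact duplicates
raw = pd.DataFrame({"id": [1, 2, 2], "x": ["10", 20, 20]})
report = profile(raw)
```

Every later decision (imputation strategy, dedup pass, type coercion) reads from this report rather than re-scanning the data.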
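A minimal sketch of the missing-value policy, using the 5% and 40% thresholds from the workflow; the `impute` function and its column names are hypothetical:

```python
import pandas as pd

def impute(df: pd.DataFrame, drop_threshold=0.40, low_missing=0.05):
    """Per-column strategy: median for low-missing numerics, 'Unknown' for
    categoricals, and a flag list for high-missing columns (user decides)."""
    df = df.copy()
    flagged = []
    for col in df.columns:
        frac = df[col].isna().mean()
        if frac == 0:
            continue
        if frac > drop_threshold:
            flagged.append(col)                      # consult user before dropping
        elif pd.api.types.is_numeric_dtype(df[col]):
            if frac < low_missing:
                df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna("Unknown")
    return df, flagged

clean, flagged = impute(pd.DataFrame({
    "score": [1.0] * 24 + [None],     # 4% missing numeric -> median imputation
    "city":  ["x"] * 20 + [None] * 5, # 20% missing categorical -> "Unknown"
    "junk":  [None] * 15 + [1.0] * 10,# 60% missing -> flagged, left untouched
}))
```

Numeric columns between the two thresholds are deliberately left as-is here; a fuller implementation might use model-based imputation for that band.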
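The dedup step might look like the following, using stdlib `difflib` for the fuzzy comparison; this simplified sketch keeps the first of each near-duplicate group rather than merging by recency or completeness, and the `dedupe` name is illustrative:

```python
import pandas as pd
from difflib import SequenceMatcher

def dedupe(df: pd.DataFrame, key: str, threshold: float = 0.9) -> pd.DataFrame:
    """Drop exact duplicates (keep first), then drop near-duplicates whose
    normalized `key` values exceed a similarity threshold."""
    df = df.drop_duplicates(keep="first").reset_index(drop=True)
    norm = df[key].str.strip().str.lower()   # whitespace/casing normalization
    keep = []
    for i, val in norm.items():
        if any(SequenceMatcher(None, val, norm[j]).ratio() >= threshold
               for j in keep):
            continue                          # near-duplicate of a kept row
        keep.append(i)
    return df.loc[keep].reset_index(drop=True)

people = pd.DataFrame(
    {"name": ["Ada Lovelace", "Ada Lovelace", "ada lovelace ", "Alan Turing"]}
)
deduped = dedupe(people, "name")
```

The pairwise loop is O(n²); for large tables a blocking key or a dedicated library would replace the inner `any(...)`.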
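Type coercion and format standardization in pandas, sketched over hypothetical columns (`signup`, `amount`, `tier`, `phone`):

```python
import pandas as pd

def coerce(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce intended types and standardize formats; bad values become NaN/NaT."""
    df = df.copy()
    df["signup"] = pd.to_datetime(df["signup"], errors="coerce")
    df["amount"] = pd.to_numeric(
        df["amount"].str.replace(r"[$,]", "", regex=True), errors="coerce"
    )
    df["tier"] = df["tier"].str.strip().str.lower()      # canonical categories
    df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)  # digits only
    return df

raw = pd.DataFrame({
    "signup": ["2024-01-05", "not a date"],
    "amount": ["$1,200.50", "35"],
    "tier":   [" Gold ", "gold"],
    "phone":  ["(555) 123-4567", "555.123.4567"],
})
typed = coerce(raw)
```

Using `errors="coerce"` turns unparseable values into NaT/NaN so they flow back into the missing-value step instead of raising mid-pipeline.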
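The outlier step with both fence methods and all three treatment modes might be sketched as (function and mode names are assumptions):

```python
import pandas as pd

def treat_outliers(s: pd.Series, method: str = "iqr", mode: str = "cap"):
    """Flag values outside IQR (1.5x) or 3-sigma fences; cap (winsorize),
    null out for later imputation, or flag-only (preserve values)."""
    if method == "iqr":
        q1, q3 = s.quantile([0.25, 0.75])
        iqr = q3 - q1
        lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    else:  # z-score fences, assuming roughly normal data
        lo, hi = s.mean() - 3 * s.std(), s.mean() + 3 * s.std()
    mask = (s < lo) | (s > hi)
    if mode == "cap":
        return s.clip(lo, hi), mask       # winsorization
    if mode == "null":
        return s.mask(mask), mask         # replace with NaN for imputation
    return s, mask                        # flag-only: annotate, keep values

readings = pd.Series([10, 12, 11, 13, 12, 100])
capped, flags = treat_outliers(readings)  # default: IQR fences, cap mode
```

Returning the mask alongside the series lets the cleaning log record exactly which rows were touched, whatever the mode.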
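The validation pass could use great_expectations; a dependency-free sketch of the same rule kinds (non-null, range, uniqueness) with an illustrative rule schema:

```python
import pandas as pd

def validate(df: pd.DataFrame, rules: dict) -> list:
    """Check per-column rules; return human-readable violation messages."""
    violations = []
    for col, rule in rules.items():
        s = df[col]
        if rule.get("not_null") and s.isna().any():
            violations.append(f"{col}: {int(s.isna().sum())} null(s)")
        if "range" in rule:
            lo, hi = rule["range"]
            bad = int(((s < lo) | (s > hi)).sum())
            if bad:
                violations.append(f"{col}: {bad} value(s) outside [{lo}, {hi}]")
        if rule.get("unique") and s.duplicated().any():
            violations.append(f"{col}: duplicate values")
    return violations

cleaned = pd.DataFrame({"id": [1, 2, 2], "age": [25, 40, 130]})
issues = validate(cleaned, {
    "id":  {"not_null": True, "unique": True},
    "age": {"range": (0, 120)},
})
```

Referential-integrity checks (foreign keys into another table) would extend the same loop with an `isin` test against the reference column.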
Supported Technologies
pandas (core), pyjanitor and great_expectations (advanced validation).