Knack Data Cleaner
SKILL.md
Knack Data Cleaner
Purpose
Ensures accuracy for HTI compliance and performance dashboards through data validation, deduplication, normalization, and integrity checks. Critical for NCDIT reporting accuracy.
Core Functions
deduplicate_records
Purpose: Identify and remove/merge duplicate device or donor records
Detection Strategies:
- Exact match: Same serial number, donor email, etc.
- Fuzzy match: Similar names, close dates
- Bulk import duplicates: Same acquisition batch
Example:
const cleaned = await deduplicate_records({
object_key: "object_1",
match_fields: ["serial_number", "acquisition_date"],
strategy: "keep_latest" // or "merge", "keep_first"
});
// Output:
// {
// original_count: 3520,
// duplicates_found: 37,
// final_count: 3483,
// removed: [
// { id: "rec_123", reason: "Duplicate serial: ABC123" },
// // ...
// ]
// }
Deduplication Strategies:
const strategies = {
keep_latest: "Retain most recently modified record",
keep_first: "Keep oldest record (by created_date)",
merge: "Combine data from all duplicates",
manual_review: "Flag for human verification"
};
normalize_fields
Purpose: Standardize data formats for consistency
Normalization Rules:
County Names
const county_normalizer = {
"wake": "Wake",
"WAKE": "Wake",
"Wake County": "Wake",
"wake co": "Wake",
"wake co.": "Wake"
};
Phone Numbers
// (919) 555-1234 → 9195551234
// 919-555-1234 → 9195551234
// 919.555.1234 → 9195551234
const normalizePhone = (phone) => phone.replace(/\D/g, '');
Emails
// Convert to lowercase, trim whitespace
const normalizeEmail = (email) => email.trim().toLowerCase();
Dates
// Standardize to ISO 8601: YYYY-MM-DD
const normalizeDate = (date) => {
const d = new Date(date);
return d.toISOString().split('T')[0];
};
Device Status
const status_normalizer = {
"acquired": "Acquired",
"ACQUIRED": "Acquired",
"received": "Acquired",
"in progress": "In Progress",
"ready": "Ready for Donation",
"delivered": "Presented"
};
Example:
const normalized = await normalize_fields({
dataset: await fetch_all_pages("object_1"),
rules: {
county: county_normalizer,
phone: normalizePhone,
email: normalizeEmail,
date: normalizeDate,
status: status_normalizer
}
});
validate_integrity
Purpose: Check for logical inconsistencies and data quality issues
Validation Checks:
Date Logic
const date_validations = [
"acquisition_date < conversion_date",
"conversion_date < presentation_date",
"all dates within grant period (2024-2026)",
"no future dates"
];
Required Fields
const required_by_status = {
"Acquired": ["donor", "acquisition_date", "quantity"],
"Converted": ["converted_date", "os_installed", "tested"],
"Presented": ["recipient", "county", "presentation_date"],
"Discarded": ["discard_reason", "discard_date"]
};
Value Ranges
const range_validations = {
quantity: { min: 1, max: 500 },
training_duration: { min: 0.5, max: 8 },
participants: { min: 1, max: 100 },
donation_value: { min: 0, max: 50000 }
};
Geographic Validation
const valid_counties = [
"Wake", "Durham", "Vance", "Franklin", "Granville",
"Halifax", "Wilson", "Edgecombe", "Martin", "Hertford",
"Greene", "Warren", "Northampton", "Person", "Nash"
];
const county_validation = (county) => {
return valid_counties.includes(county);
};
Example:
const validation_report = await validate_integrity({
dataset: await fetch_all_pages("object_1"),
rules: {
dates: date_validations,
required: required_by_status,
ranges: range_validations,
counties: valid_counties
}
});
// Output:
// {
// total_records: 3483,
// valid: 3401,
// errors: [
// { record_id: "rec_123", issue: "Conversion date before acquisition" },
// { record_id: "rec_456", issue: "Invalid county: 'Raleigh'" },
// { record_id: "rec_789", issue: "Missing required field: donor" }
// ],
// warnings: [
// { record_id: "rec_234", issue: "Unusually high quantity: 450" }
// ]
// }
fix_common_issues
Purpose: Auto-correct known data entry mistakes
Auto-Fix Rules:
Trim Whitespace
const fix_whitespace = (record) => {
Object.keys(record).forEach(key => {
if (typeof record[key] === 'string') {
record[key] = record[key].trim();
}
});
return record;
};
Fix Capitalization
const fix_capitalization = (record) => {
if (record.donor_name) {
record.donor_name = toTitleCase(record.donor_name);
}
if (record.county) {
record.county = county_normalizer[record.county.toLowerCase()] || record.county;
}
return record;
};
Infer Missing Data
const infer_missing = (record) => {
// If county missing but zip code present, infer county
if (!record.county && record.zip_code) {
record.county = inferCountyFromZip(record.zip_code);
}
// If discard reason missing but status is Discarded, set default
if (record.status === "Discarded" && !record.discard_reason) {
record.discard_reason = "Not specified";
}
return record;
};
Example:
const fixed = await fix_common_issues({
dataset: await fetch_all_pages("object_1"),
auto_fixes: [
"trim_whitespace",
"fix_capitalization",
"infer_missing_counties",
"standardize_phone_numbers"
]
});
// Output:
// {
// processed: 3483,
// fixed: 287,
// fixes_by_type: {
// whitespace: 145,
// capitalization: 67,
// inferred_county: 42,
// phone_format: 33
// }
// }
remove_test_records
Purpose: Exclude dummy/test data from production reports
Detection:
const is_test_record = (record) => {
const test_indicators = [
record.donor_name?.toLowerCase().includes("test"),
record.donor_email?.includes("test@"),
record.donor_email?.includes("example.com"),
record.serial_number?.startsWith("TEST"),
record.notes?.toLowerCase().includes("do not count")
];
return test_indicators.some(indicator => indicator === true);
};
Example:
const production_only = await remove_test_records({
dataset: await fetch_all_pages("object_1")
});
// Output:
// {
// original: 3483,
// test_records_removed: 8,
// production_records: 3475
// }
enrich_records
Purpose: Add calculated or derived fields
Enrichment Operations:
Add Calculated Fields
const enrich = (record) => {
// Add age of device
record.days_since_acquisition = daysBetween(record.acquisition_date, new Date());
// Add conversion time
if (record.converted_date && record.acquisition_date) {
record.days_to_convert = daysBetween(record.acquisition_date, record.converted_date);
}
// Add delivery time
if (record.presentation_date && record.converted_date) {
record.days_to_deliver = daysBetween(record.converted_date, record.presentation_date);
}
// Flag slow conversions
record.conversion_delayed = record.days_to_convert > 30;
return record;
};
Add Geographic Context
const add_region = (record) => {
const regions = {
"Triangle": ["Wake", "Durham", "Franklin"],
"Northeast": ["Vance", "Granville", "Warren", "Halifax"],
"Eastern": ["Wilson", "Edgecombe", "Martin", "Greene"]
};
record.region = Object.keys(regions).find(region =>
regions[region].includes(record.county)
) || "Other";
return record;
};
Add Donor Tier
const add_donor_tier = (record) => {
if (record.donation_value > 1000) {
record.donor_tier = "Major";
} else if (record.donation_value > 500) {
record.donor_tier = "Significant";
} else {
record.donor_tier = "Standard";
}
return record;
};
HTI-Specific Cleaning Workflows
Pre-Quarterly Report Cleaning
async function clean_for_quarterly_report(quarter, year) {
// 1. Fetch quarter data
const raw_data = await fetch_all_pages("object_1", {
filters: quarterly_filter(quarter, year)
});
// 2. Remove test records
const production = await remove_test_records({ dataset: raw_data });
// 3. Deduplicate
const unique = await deduplicate_records({
dataset: production.production_records,
match_fields: ["serial_number"]
});
// 4. Normalize
const normalized = await normalize_fields({
dataset: unique.records,
rules: all_normalizers
});
// 5. Validate
const validated = await validate_integrity({
dataset: normalized,
rules: compliance_rules
});
// 6. Fix auto-fixable issues
const fixed = await fix_common_issues({
dataset: validated.valid_records
});
return {
clean_data: fixed,
cleaning_summary: {
original: raw_data.length,
test_removed: production.test_records_removed,
duplicates: unique.duplicates_found,
validation_errors: validated.errors.length,
auto_fixed: fixed.fixed
}
};
}
Pre-Dashboard Data Cleaning
async function clean_for_dashboard() {
const all_data = await fetch_all_pages("object_1");
// Quick cleaning for real-time dashboard
const cleaned = all_data
.filter(r => !is_test_record(r))
.map(fix_whitespace)
.map(fix_capitalization)
.map(enrich);
return cleaned;
}
Data Quality Monitoring
Generate Data Quality Report
async function data_quality_report() {
const all_records = await fetch_all_pages("object_1");
const report = {
total_records: all_records.length,
completeness: {
with_donor: all_records.filter(r => r.donor).length,
with_county: all_records.filter(r => r.county).length,
with_serial: all_records.filter(r => r.serial_number).length
},
accuracy: {
valid_counties: all_records.filter(r => valid_counties.includes(r.county)).length,
logical_dates: all_records.filter(r => validate_dates(r)).length
},
duplicates: await find_duplicates(all_records),
test_records: all_records.filter(is_test_record).length
};
return report;
}
Automated Quality Checks
// Run daily at 2 AM
cron.schedule('0 2 * * *', async () => {
const report = await data_quality_report();
if (report.duplicates.length > 10) {
await alert_admin("High duplicate count detected");
}
if (report.accuracy.valid_counties / report.total_records < 0.95) {
await alert_admin("County data quality below threshold");
}
});
Integration Points
- knack_reader: Fetch data for cleaning
- knack_pagination: Process large datasets
- knack_reporting_sync: Clean before report generation
- knack_dashboard_ai: Ensure metrics accuracy
- knack_goal_tracker: Validate progress data
- knack_filter_sort: Filter out invalid records
Best Practices
- Clean before every report: Don't trust raw data
- Log all cleaning operations: Audit trail for compliance
- Manual review for ambiguous cases: Don't auto-fix everything
- Run quality checks daily: Catch issues early
- Version cleaned datasets: Track data lineage
Grant Compliance
NCDIT Data Standards
- No test/dummy records in official reports
- All dates must be within grant period
- County assignments must be accurate (15-county requirement)
- Device lifecycle dates must be logical
- Training hours must have supporting documentation
Audit Trail
const cleaning_log = {
timestamp: new Date(),
operation: "quarterly_report_cleaning",
quarter: "Q1",
year: "2025",
records_processed: 875,
issues_found: 23,
auto_fixed: 18,
manual_review_needed: 5,
performed_by: "automated_system"
};