# Data Quality Checker

Implement comprehensive data quality checks and validation.
## Quick Start

Use Great Expectations for validation, implement schema and custom checks, monitor data quality metrics, and set up alerts for failures.
## Instructions

### Great Expectations Setup
```python
import great_expectations as gx

context = gx.get_context()

# Create an expectation suite
suite = context.add_expectation_suite("data_quality_suite")

# Attach a validator to a batch of data
# (batch_request must point at a configured datasource/data asset)
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="data_quality_suite",
)

# Schema validation
validator.expect_table_columns_to_match_ordered_list(
    column_list=["id", "name", "email", "created_at"]
)

# Null checks
validator.expect_column_values_to_not_be_null("email")

# Value ranges
validator.expect_column_values_to_be_between("age", min_value=0, max_value=120)

# Uniqueness
validator.expect_column_values_to_be_unique("email")

# Run validation and check the outcome
results = validator.validate()
if not results.success:
    print(results)
```
### Custom Validation Rules
```python
from datetime import datetime

import pandas as pd

def validate_data_quality(df: pd.DataFrame) -> list[str]:
    issues = []

    # Check for nulls
    null_counts = df.isnull().sum()
    if null_counts.any():
        issues.append(f"Null values found: {null_counts[null_counts > 0].to_dict()}")

    # Check for duplicate rows
    duplicates = df.duplicated().sum()
    if duplicates > 0:
        issues.append(f"Found {duplicates} duplicate rows")

    # Check data freshness (assumes created_at holds datetimes)
    max_date = df['created_at'].max()
    if (datetime.now() - max_date).days > 1:
        issues.append("Data is stale")

    return issues
```
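To see each rule fire, the same checks can be run inline on a toy frame (the column values below are made up; rows two and three are deliberate duplicates with missing emails, and all records are three days old):

```python
from datetime import datetime, timedelta

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 2],
    "name": ["Ann", "Bob", "Bob"],
    "email": ["ann@example.com", None, None],
    "created_at": [datetime.now() - timedelta(days=3)] * 3,
})

issues = []

# Null check: the email column has two missing values
null_counts = df.isnull().sum()
if null_counts.any():
    issues.append(f"Null values found: {null_counts[null_counts > 0].to_dict()}")

# Duplicate check: row three repeats row two exactly
duplicates = df.duplicated().sum()
if duplicates > 0:
    issues.append(f"Found {duplicates} duplicate rows")

# Freshness check: newest record is three days old
if (datetime.now() - df["created_at"].max()).days > 1:
    issues.append("Data is stale")
```

All three issues are reported for this frame: one null warning, one duplicate row, and a staleness flag.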
### Data Quality Metrics
```python
from datetime import datetime

import pandas as pd

def calculate_quality_metrics(df: pd.DataFrame) -> dict:
    return {
        # Fraction of non-null cells
        'completeness': 1 - (df.isnull().sum().sum() / df.size),
        # Fraction of distinct rows
        'uniqueness': df.drop_duplicates().shape[0] / df.shape[0],
        # Fraction of emails containing '@' (nulls count as invalid)
        'validity': df['email'].str.contains('@', na=False).sum() / len(df),
        # Days since the newest record
        'timeliness': (datetime.now() - df['created_at'].max()).days,
    }
```
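On a small illustrative frame, the first three metrics work out as follows (the data below is invented for the example; one of four email cells is null, one email lacks an '@', and all rows are distinct):

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "not-an-email", None],
    "created_at": pd.to_datetime(["2024-01-01"] * 4),
})

# 1 null cell out of 8 -> 0.875
completeness = 1 - df.isnull().sum().sum() / df.size

# All 4 rows are distinct -> 1.0
uniqueness = df.drop_duplicates().shape[0] / df.shape[0]

# 2 of 4 emails contain '@' (the null counts as invalid) -> 0.5
validity = df["email"].str.contains("@", na=False).sum() / len(df)
```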
## Best Practices
- Validate at ingestion
- Monitor quality metrics
- Set up alerts for failures
- Document quality rules
- Regular quality audits
- Track quality trends
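Alert wiring is deployment-specific (PagerDuty, Slack, email, etc.), but the trigger logic can be a simple threshold check over the metrics above. A minimal sketch, with illustrative thresholds:

```python
# Minimum acceptable values per metric (illustrative, tune per dataset)
THRESHOLDS = {"completeness": 0.99, "uniqueness": 0.98, "validity": 0.95}

def quality_alerts(metrics: dict) -> list[str]:
    """Return one alert message per metric that falls below its threshold."""
    alerts = []
    for name, minimum in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value < minimum:
            alerts.append(f"{name} below threshold: {value:.3f} < {minimum}")
    return alerts

# Example: a completeness dip triggers exactly one alert
alerts = quality_alerts({"completeness": 0.97, "uniqueness": 0.99, "validity": 0.96})
```

The returned messages can then be routed to whatever notification channel the pipeline already uses.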