databricks-data-handling
Databricks Data Handling
Contents
Overview
Implement data management patterns for GDPR compliance, PII masking, data retention, and row-level security in Delta Lake with Unity Catalog.
Prerequisites
- Unity Catalog configured
- Understanding of Delta Lake features
- Compliance requirements documented
- Data classification in place
Instructions
Step 1: Classify and Tag Data
Tag tables with data_classification (PII/CONFIDENTIAL/INTERNAL) and retention_days. Tag columns with pii type (email, phone, etc.) using Unity Catalog tags.
Step 2: Implement GDPR Deletion
Build GDPRHandler that finds all PII-tagged tables, locates user records by ID, and deletes with audit logging. Support dry-run mode for impact assessment.
Step 3: Enforce Retention Policies
DataRetentionManager reads retention_days tags, finds appropriate date columns, and deletes expired data. Schedule daily with VACUUM to clean up old Delta files.
Step 4: Configure PII Masking
Create masked views with email masking (j***@***.com), phone masking (***-****), name hashing, and full redaction. Use for analytics and testing environments.
Step 5: Enable Row-Level Security
Create filter functions that check group membership. Apply row filters and column masks to restrict data access by user role.
See detailed implementation for SQL tagging, GDPRHandler class, DataRetentionManager, PIIMasker, row-level security functions, and SAR report generation.
Output
- Data classification tags applied across catalog
- GDPR deletion process with audit trail
- Retention policies enforced automatically
- PII masking for non-production access
- Row-level security on sensitive tables
Error Handling
| Error | Cause | Solution |
|---|---|---|
| Vacuum fails | Retention too short | Ensure > 7 days (168 hours) retention |
| Delete timeout | Large table | Partition deletes, run over multiple days |
| Missing user column | Non-standard schema | Map user columns manually per table |
| Mask function error | Invalid regex | Test masking functions on sample data |
Examples
Quick GDPR Dry Run
gdpr = GDPRHandler(spark, "prod_catalog")
report = gdpr.process_deletion_request("user-12345", "GDPR-2024-001", dry_run=True) # 2024: port 12345 - example/test
print(f"Would delete {report['total_rows_deleted']} rows from {len(report['tables_processed'])} tables")
Resources
Next Steps
For enterprise RBAC, see databricks-enterprise-rbac.