# Data Masker Protocol
This skill prevents sensitive production data (PII, PHI, financial records) from leaking into lower environments (staging, development, testing). It analyzes schemas and generates idempotent masking scripts.
Core assumption: Developers need realistic data to fix bugs, but giving them real user emails, passwords, or credit card numbers violates GDPR/KVKK and Zero Trust principles.
## 1. PII Detection (Static vs Dynamic)
- Default (Static): Analyze based on provided `.sql`, schema files, or DDL text.
- Dynamic (On-Demand): Only connect to a live database to sample data or infer column contents if explicitly requested by the user.
- When given a table structure, automatically flag high-risk columns:
  - 📛 Direct Identifiers: `email`, `ssn`, `tc_kimlik`, `phone`, `ip_address`, `mac_address`.
  - 💳 Financial: `credit_card`, `iban`, `balance`, `salary`.
  - 🩺 Health/Personal: `birth_date`, `blood_type`, `address`, `location_lat_lon`.
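If dynamic analysis has been explicitly requested, or the provided DDL has been loaded into a scratch PostgreSQL database, the name-based flagging above could be sketched as a catalog query. The `public` schema filter and the category labels are assumptions for illustration, and exact-name matching will miss variants such as `user_email`.

```sql
-- Flag columns whose names match the high-risk lists above.
-- Assumption: the tables live in the 'public' schema.
SELECT table_name,
       column_name,
       CASE
           WHEN column_name IN ('email', 'ssn', 'tc_kimlik', 'phone', 'ip_address', 'mac_address')
               THEN 'direct_identifier'
           WHEN column_name IN ('credit_card', 'iban', 'balance', 'salary')
               THEN 'financial'
           ELSE 'health_personal'
       END AS pii_category
FROM information_schema.columns
WHERE table_schema = 'public'
  AND column_name IN (
      'email', 'ssn', 'tc_kimlik', 'phone', 'ip_address', 'mac_address',
      'credit_card', 'iban', 'balance', 'salary',
      'birth_date', 'blood_type', 'address', 'location_lat_lon'
  )
ORDER BY table_name, column_name;
```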
## 2. Masking Strategy Selection
Do not just overwrite everything with 'REDACTED'. Choose the transformation that keeps the data realistic for QA testing:
### Strategy 1: Deterministic Substitution (Fake Data)
- Best for: Names, Emails.
- Why: To make the UI look normal. `john.doe@example.com` becomes `x8f9.mask@test.local`.
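A minimal PostgreSQL sketch of deterministic substitution, assuming a `users` table with a text `email` column. The salt literal is a hypothetical placeholder, and the 16-character prefix is an arbitrary choice meant to keep collisions unlikely under a unique constraint.

```sql
-- The same input always maps to the same masked value, so equal emails stay equal after masking.
-- 'replace-with-secret-salt' is a placeholder: use a secret that is discarded after the run.
UPDATE users
SET email = substring(md5(email || 'replace-with-secret-salt') from 1 for 16) || '.mask@test.local'
WHERE email NOT LIKE '%.mask@test.local';  -- skip already-masked rows so a rerun changes nothing
```

The `WHERE` guard is what keeps this variant idempotent; without it, a second run would hash the already-masked values again.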
### Strategy 2: Partial Redaction
- Best for: Credit Cards, Phone Numbers.
- Why: `+1 (555) 123-4567` becomes `+1 (555) ***-**67`. Devs can still test formatting validations.
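One way to sketch partial redaction in PostgreSQL, assuming a hypothetical `customers` table with text columns `phone_number` and `credit_card`. Unlike the example above, this phone pattern also masks the country and area code, keeping only the last two digits.

```sql
UPDATE customers
SET
    -- Replace each digit that still has at least two digits after it; punctuation is untouched.
    phone_number = regexp_replace(phone_number, '\d(?=(\D*\d){2})', '*', 'g'),
    -- Keep only the last four characters of the card number.
    credit_card = repeat('*', greatest(length(credit_card) - 4, 0)) || right(credit_card, 4);
```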
### Strategy 3: Variance / Shuffling (Jittering)
- Best for: Dates, Salaries.
- Why: `salary: 105,000` -> add +/- 20% random variance -> `91,200`. Keeps the overall statistical distribution roughly intact without revealing the exact amount.
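A hedged sketch of jittering in PostgreSQL, assuming a hypothetical `employees` table with a numeric `salary` and a `birth_date` of type `date`. The +/- 30 day window for dates is an illustrative choice, not something this skill prescribes.

```sql
UPDATE employees
SET
    -- random() returns a value in [0, 1), so the salary factor falls in [0.8, 1.2).
    salary = round((salary * (0.8 + random() * 0.4))::numeric, 2),
    -- date + integer shifts by whole days; the offset falls between -30 and +30.
    birth_date = birth_date + (floor(random() * 61)::int - 30);
```

Jitter is not idempotent: each rerun shifts the values again, so it is best applied once to a fresh clone.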
### Strategy 4: Hashing / Nullification
- Best for: Passwords, API Tokens.
- Why: Replace all password hashes with the hash of a known development password (e.g., `password123`) so devs can log in as any test user without knowing the real user's password. API tokens can simply be set to `NULL`.
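A minimal sketch of the known-dev-value approach, assuming the `pgcrypto` extension is available on the staging clone and that `users` has a `password_hash` column plus a hypothetical nullable `api_token` column. pgcrypto's Blowfish variant emits hashes with a `$2a$` prefix; whether the application's bcrypt library accepts that prefix should be verified.

```sql
CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- Hash the dev password once, then give every row the same hash.
WITH dev AS (
    SELECT crypt('password123', gen_salt('bf', 10)) AS hash
)
UPDATE users
SET password_hash = dev.hash,
    api_token = NULL  -- nullification: real tokens must not survive in staging
FROM dev;
```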
## 3. Output Generation
Provide an executable SQL script that can be run on a cloned staging database.
Required Outputs (must write BOTH to `docs/database-report/`):
1. **Human-Readable Markdown (`docs/database-report/data-masking-report.md`)**
### 🛡️ PII Discovery
- **Risk Level: HIGH** (Found emails, phones, and hashed passwords).
### 🛠️ Masking Execution Script (PostgreSQL)
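```sql
-- Wrap in a transaction so a failure rolls back and leaves triggers enabled
BEGIN;
-- Disable triggers temporarily to speed up the masking
-- (TRIGGER ALL includes internal FK triggers and requires superuser)
ALTER TABLE users DISABLE TRIGGER ALL;
-- Masking `users` table
UPDATE users SET
    -- Strategy: Deterministic Substitution (derived from id, so reruns give the same result)
    email = 'masked_' || id || '@sandbox.local',
    first_name = 'User_' || substring(md5(id::text) from 1 for 6),
    last_name = 'Test',
    -- Strategy: Partial Redaction (|| keeps NULL phone numbers NULL)
    phone_number = left(phone_number, 3) || '***' || right(phone_number, 2),
    -- Strategy: Known Dev Value (placeholder; use a real bcrypt hash of the dev password)
    password_hash = '$2b$10$dev_password_hash_xyz';
-- Re-enable triggers
ALTER TABLE users ENABLE TRIGGER ALL;
COMMIT;
```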
2. **Machine-Readable JSON (`docs/database-report/data-masking-output.json`)**
```json
{
  "skill": "data-masker",
  "pii_found": ["email", "phone_number", "password_hash"],
  "masking_strategies_applied": {
    "email": "Deterministic Substitution",
    "phone_number": "Partial Redaction",
    "password_hash": "Known Dev Value"
  },
  "sql_script_generated": "UPDATE users SET email = ..."
}
```
## Guardrails
- Performance: A bulk `UPDATE` on 10 million rows can overwhelm the write-ahead log (WAL). If the table is massive, suggest the `CREATE TABLE AS SELECT` (CTAS) strategy instead of `UPDATE`.
- Referential Integrity: If `email` is used as a Foreign Key (an anti-pattern, but it happens), masking it will break relationships. Detect FKs before masking.
- Irreversibility: Ensure the masking SQL uses one-way functions. Where a strategy relies on randomness, do not use a fixed or predictable seed.
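The first two guardrails could be sketched in PostgreSQL as follows. The FK query assumes the column being masked is `users.email`; the CTAS part assumes the `users` columns from the masking script in section 3 and uses the hypothetical names `users_masked` and `users_unmasked`.

```sql
-- 1) Referential integrity: list foreign keys that reference users.email.
SELECT tc.table_name   AS referencing_table,
       kcu.column_name AS referencing_column,
       tc.constraint_name
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
  ON kcu.constraint_schema = tc.constraint_schema
 AND kcu.constraint_name = tc.constraint_name
JOIN information_schema.constraint_column_usage ccu
  ON ccu.constraint_schema = tc.constraint_schema
 AND ccu.constraint_name = tc.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY'
  AND ccu.table_name = 'users'
  AND ccu.column_name = 'email';

-- 2) Performance: build a masked copy instead of updating every row in place.
BEGIN;
CREATE TABLE users_masked AS
SELECT id,
       'masked_' || id || '@sandbox.local' AS email,
       'User_' || substring(md5(id::text) from 1 for 6) AS first_name,
       'Test' AS last_name,
       left(phone_number, 3) || '***' || right(phone_number, 2) AS phone_number,
       '$2b$10$dev_password_hash_xyz' AS password_hash  -- placeholder value
       -- list every remaining column explicitly here
FROM users;

-- CTAS copies rows only: recreate indexes, constraints, and defaults on users_masked.
ALTER TABLE users RENAME TO users_unmasked;
ALTER TABLE users_masked RENAME TO users;
-- Fails if other tables still reference users_unmasked; repoint those FKs first.
DROP TABLE users_unmasked;
COMMIT;
```

If the FK query returns rows, the referencing columns need the same deterministic mapping as `users.email`, otherwise the relationships break. For composite foreign keys the join can return extra rows, so treat the result as a list of candidates to inspect.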