sanitize-git-repo
Sanitize Git Repository
This skill provides guidance for systematically identifying and replacing sensitive information in git repositories, including API keys, tokens, passwords, and other credentials.
When to Use This Skill
- Sanitizing a repository before sharing or open-sourcing
- Removing accidentally committed secrets from a codebase
- Replacing hardcoded credentials with placeholders
- Auditing a repository for sensitive information
- Preparing code for security review
Recommended Approach
Phase 1: Comprehensive Discovery
Build a complete inventory of all sensitive values before making any changes. This prevents the common mistake of discovering additional secrets mid-process.
Common Secret Patterns to Search:
| Type | Pattern/Prefix | Example |
|---|---|---|
| AWS Access Keys | AKIA[A-Z0-9]{16} |
AKIAIOSFODNN7EXAMPLE |
| AWS Secret Keys | 40-char base64 strings | Often near access keys |
| GitHub Tokens | ghp_, gho_, ghs_, ghr_ |
ghp_xxxxxxxxxxxx |
| Huggingface Tokens | hf_ |
hf_xxxxxxxxxxxx |
| Generic API Keys | api[_-]?key, apikey |
Varies |
| Bearer Tokens | bearer, token |
Varies |
| Private Keys | -----BEGIN.*PRIVATE KEY----- |
RSA/EC keys |
| Database URLs | postgres://, mysql://, mongodb:// |
Connection strings |
| Password Fields | password, passwd, pwd |
Varies |
Discovery Strategy:
- Search for known prefixes and patterns using regex
- Search for common variable names (
API_KEY,SECRET,TOKEN,PASSWORD,CREDENTIAL) - Check configuration files (
.env,config.*,settings.*,*.json,*.yaml,*.yml) - Examine test fixtures and mock data files
- Check embedded data within other formats (JSON encoded in strings, base64 encoded content)
Important Locations to Check:
- Configuration files and environment templates
- Test fixtures and sample data
- Documentation and README files
- CI/CD configuration files
- Docker and deployment configurations
- JSON/YAML data files (secrets may be embedded in structured data)
- Git diff outputs or patch files stored in the repository
Phase 2: Inventory Documentation
Before making changes, create a documented list of:
- Each unique sensitive value found
- All file locations where each value appears
- The exact string to match (including surrounding context if needed for uniqueness)
- The placeholder to use for replacement
Placeholder Conventions:
- Use descriptive placeholders that indicate the type:
<AWS_ACCESS_KEY>,<GITHUB_TOKEN>,<DATABASE_PASSWORD> - Maintain consistency across all replacements
- Match the format specified by the user if provided
Phase 3: Systematic Replacement
Execute replacements methodically:
- Work through the inventory one secret at a time
- For each secret, replace ALL occurrences across ALL files
- Verify each replacement immediately after making it
- Use exact string matching to avoid unintended modifications
Replacement Best Practices:
- Read files before editing to ensure exact string matching
- Handle whitespace and formatting precisely
- Consider secrets embedded in JSON or other structured formats (may require escaping)
- Use batch replacements when the same value appears multiple times in one file
Phase 4: Verification
After all replacements:
- Re-run all original discovery searches to confirm no secrets remain
- Search for partial matches of sensitive values
- Run any provided test suites to validate the sanitization
- Check that placeholders are properly formatted
Common Pitfalls
1. Incomplete Initial Discovery
Problem: Secrets discovered incrementally during the replacement process, requiring backtracking.
Prevention: Invest time upfront in comprehensive searching using multiple patterns and checking all file types, including data files and embedded content.
2. Secrets in Unexpected Locations
Problem: Secrets appear in JSON files, embedded strings, test fixtures, or encoded data that aren't found by simple searches.
Prevention: Search recursively through all file types. Check JSON and YAML files specifically. Look for base64-encoded content that might contain secrets.
3. Exact String Matching Failures
Problem: Edit operations fail due to whitespace differences or character encoding issues.
Prevention: Always read the target file first to understand exact formatting. Copy strings exactly as they appear in the file, including whitespace.
4. Missing Occurrences
Problem: Multiple occurrences of the same secret exist, but only some are replaced.
Prevention: After each replacement, immediately verify by searching for the original value again. Use replace-all functionality when available.
5. Git History Considerations
Problem: Secrets remain in git history even after being removed from current files.
Prevention: Clarify with the user whether git history sanitization is required. If so, tools like git filter-branch or BFG Repo-Cleaner may be needed (this is a separate, more complex operation).
6. Tool Availability Assumptions
Problem: Specialized search tools (like ripgrep) may not be available in all environments.
Prevention: Have fallback approaches ready. Standard grep -r works in most environments. Test tool availability before relying on it.
Verification Checklist
Before considering the task complete:
- All known secret patterns have been searched for
- Configuration files have been examined
- JSON/YAML data files have been checked
- Test fixtures and mock data have been reviewed
- Each identified secret has been replaced in all locations
- Verification searches confirm no original secrets remain
- Placeholders follow a consistent format
- Any provided tests pass successfully
Search Commands Reference
Standard grep (widely available):
# Search for AWS access keys
grep -rn "AKIA[A-Z0-9]\{16\}" . --exclude-dir=.git
# Search for common token prefixes
grep -rn "ghp_\|gho_\|ghs_\|hf_" . --exclude-dir=.git
# Search for password-related strings
grep -rn -i "password\|passwd\|pwd" . --exclude-dir=.git
# Search for API key patterns
grep -rn -i "api[_-]\?key\|apikey" . --exclude-dir=.git
Always exclude the .git directory from searches to avoid noise from git internals.