regex
Regular Expressions
Regex syntax, patterns, recipes, and cross-language reference.
Core Syntax
Character Classes
. Any character (except newline, unless s/dotall flag)
\d \D Digit [0-9] / Non-digit
\w \W Word char [a-zA-Z0-9_] / Non-word
\s \S Whitespace [ \t\n\r\f\v] / Non-whitespace
[abc] Character set (a, b, or c)
[^abc] Negated set (not a, b, or c)
[a-z] Range (a through z)
[a-zA-Z] Multiple ranges
\p{L} Unicode letter (PCRE/Java/JS with u flag/Rust)
\p{N} Unicode number
Quantifiers
* + ? 0+, 1+, 0-1 (greedy)
{3} {3,} {3,5} Exact, min, range
*? +? ?? Lazy (non-greedy) versions
*+ ++ Possessive (PCRE/Java/Python 3.11+) - no backtracking
Greedy: match max, then backtrack. Lazy: match min, then expand. Possessive: match max, never backtrack.
Anchors & Boundaries
^ $ Start/end of string (or line with m flag)
\b \B Word boundary / Non-word boundary
\A \Z Start/end of string (ignores m flag; not available in JS; in PCRE/Java \Z also matches before a final newline, \z is the absolute end)
Groups & References
(abc) Capture group
(?:abc) Non-capturing group
(?<name>abc) Named capture group (JS/PCRE/.NET)
(?P<name>abc) Named capture group (Python)
\1 Backreference to group 1
\k<name> Backreference to named group (JS/PCRE)
(?P=name) Backreference to named group (Python)
(?>abc) Atomic group (PCRE/Java/Python 3.11+) - no backtracking into group
Lookahead & Lookbehind
Zero-width assertions: match a position without consuming characters.
(?=abc) Positive lookahead: followed by abc
(?!abc) Negative lookahead: NOT followed by abc
(?<=abc) Positive lookbehind: preceded by abc
(?<!abc) Negative lookbehind: NOT preceded by abc
^(?=.*\d)(?=.*[A-Z]).{8,}$ # Password: 8+ chars, digit + uppercase
\d+(?![\d%]) # Number NOT followed by % (the \d in the lookahead stops backtracking to a shorter number, e.g. "10" in "100%")
(?<=\$)\w+ # Word preceded by $
(?<!bar)foo # foo not preceded by bar
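Lookbehind limits: JS (ES2018+) and .NET allow variable-length lookbehind. Python's re requires a fixed-width pattern, so all alternation branches must have the same length. PCRE traditionally requires each alternation branch to be fixed-length, though branches may differ in length; \K (reset match start) is a common workaround for a variable-length prefix.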
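A minimal Python sketch of the assertions above, assuming the standard-library re module. The sample strings are made up for illustration, and the last step shows the fixed-width lookbehind rule in practice.

```python
import re

# Lookahead: both conditions are checked from position 0 without consuming input.
password = re.compile(r'^(?=.*\d)(?=.*[A-Z]).{8,}$')
print(bool(password.match('Passw0rdX')))   # True
print(bool(password.match('password1')))   # False (no uppercase letter)

# Lookbehind: the "$" must precede the word but is not part of the match.
dollar_words = re.findall(r'(?<=\$)\w+', 'cost $HOME and $PATH')
print(dollar_words)  # ['HOME', 'PATH']

# Negative lookbehind: skip "foo" when "bar" comes right before it.
plain_foo = re.findall(r'(?<!bar)foo', 'barfoo foo')
print(plain_foo)  # ['foo'] (only the second occurrence)

# Python's re rejects lookbehind branches of different widths.
try:
    re.compile(r'(?<=a|bc)x')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False
print(lookbehind_ok)  # False
```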
Flags / Modifiers
g Global (all matches) i Case-insensitive
m Multiline (^/$ per line) s Dotall (. matches \n)
u Unicode x Extended (whitespace ignored, # comments)
Inline: (?i)pattern or scoped (?i:abc).
Non-Greedy (Lazy) Matching
<.*> Greedy: <b>bold</b> → "<b>bold</b>" (first < to LAST >)
<.*?> Lazy: <b>bold</b> → "<b>" then "</b>" (first < to NEXT >)
<[^>]*> Negated class: same as lazy, faster (no backtracking)
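The three variants can be compared side by side. A short Python sketch using the same sample string as above:

```python
import re

html = '<b>bold</b>'

greedy = re.findall(r'<.*>', html)      # one match: first "<" to last ">"
lazy = re.findall(r'<.*?>', html)       # stops at the nearest ">"
negated = re.findall(r'<[^>]*>', html)  # cannot run past a ">" at all

print(greedy)   # ['<b>bold</b>']
print(lazy)     # ['<b>', '</b>']
print(negated)  # ['<b>', '</b>']
```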
Common Patterns
Validation
# Email (simplified)
^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
# URL
^https?://[^\s/$.?#].[^\s]*$
# Phone (US)
^\+?1?[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$
# IPv4 (strict 0-255)
^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$
# Date (YYYY-MM-DD)
^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$
# UUID v4
^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$
# Semver
^v?(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)(?:-([\da-zA-Z-]+(?:\.[\da-zA-Z-]+)*))?(?:\+([\da-zA-Z-]+(?:\.[\da-zA-Z-]+)*))?$
# Hex color
^#?([0-9a-fA-F]{3}|[0-9a-fA-F]{6}|[0-9a-fA-F]{8})$
# Strong password (8+ chars, upper, lower, digit, special)
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$
# Slug
^[a-z0-9]+(?:-[a-z0-9]+)*$
# JWT structure
^[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+$
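One possible way to use these patterns from Python. The PATTERNS dictionary and the is_valid helper are hypothetical names for this sketch, not part of any library. fullmatch is used because $ in Python also matches before a trailing newline.

```python
import re

# Patterns copied from the list above; the dictionary keys are illustrative names.
PATTERNS = {
    'date': re.compile(r'^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$'),
    'slug': re.compile(r'^[a-z0-9]+(?:-[a-z0-9]+)*$'),
    'ipv4': re.compile(
        r'^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$'
    ),
}

def is_valid(kind: str, value: str) -> bool:
    """Return True if the whole value matches the named pattern (hypothetical helper)."""
    return PATTERNS[kind].fullmatch(value) is not None

print(is_valid('date', '2024-01-15'))     # True
print(is_valid('date', '2024-02-31'))     # True: checks shape only, not the calendar
print(is_valid('date', '2024-01-15\n'))   # False: trailing newline rejected by fullmatch
print(is_valid('slug', 'my-first-post'))  # True
print(is_valid('ipv4', '192.168.1.256'))  # False
```

The second date shows the limit of shape-based validation: a regex can confirm the format, but checking that the day exists needs real date parsing.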
Extraction
https?://([^/\s]+) # Domain from URL
[^/\\]+$ # Filename from path
-?\d+\.?\d* # Numbers (int or decimal)
(\w+)=([^\s&]+) # Key=value pairs
\[([^\]]+)\]\(([^)]+)\) # Markdown links
^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] \[(\w+)\] (.+)$ # Log line
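A sketch of the log-line and key=value patterns working together in Python. The sample log line is invented for illustration.

```python
import re

LOG_LINE = re.compile(
    r'^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] \[(\w+)\] (.+)$'
)
KEY_VALUE = re.compile(r'(\w+)=([^\s&]+)')

# Made-up sample line in the "[timestamp] [LEVEL] message" shape.
line = '[2024-01-15 10:30:00] [ERROR] login failed user=alice ip=10.0.0.5'

m = LOG_LINE.match(line)
timestamp, level, message = m.groups()

# findall returns (key, value) tuples, which dict() accepts directly.
fields = dict(KEY_VALUE.findall(message))

print(timestamp)  # 2024-01-15 10:30:00
print(level)      # ERROR
print(fields)     # {'user': 'alice', 'ip': '10.0.0.5'}
```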
Substitution Patterns
$0 or \0 Entire match ($& in JS, \g<0> in Python) $<name> Named group (JS)
$1 or \1 First capture group \g<name> Named group (Python)
# Swap first/last name
Find: (\w+) (\w+) Replace: $2, $1
# camelCase to snake_case
Find: ([a-z])([A-Z]) Replace: $1_\l$2 (\l lowercases the next char; editor/Perl syntax, not JS or Python)
# Remove console.log lines
Find: ^\s*console\.log\([^)]*\);?\s*\n Replace: (empty)
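The same substitutions in Python, as a rough sketch. Python's re.sub has no \l modifier, so the case change is done in a replacement function; to_snake is a hypothetical helper name.

```python
import re

# Swap "First Last" into "Last, First" using numbered groups.
swapped = re.sub(r'(\w+) (\w+)', r'\2, \1', 'Ada Lovelace')
print(swapped)  # Lovelace, Ada

def to_snake(name: str) -> str:
    """camelCase to snake_case: lowercase the captured capital in a callback."""
    return re.sub(
        r'([a-z])([A-Z])',
        lambda m: m.group(1) + '_' + m.group(2).lower(),
        name,
    )

print(to_snake('maxRetryCount'))  # max_retry_count

# Named groups in the replacement string use \g<name>.
reordered = re.sub(r'(?P<y>\d{4})-(?P<m>\d{2})', r'\g<m>/\g<y>', '2024-01')
print(reordered)  # 01/2024
```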
Language-Specific Usage
JavaScript
const re = /\d+/g; // Literal
const re2 = new RegExp('\\d+', 'g'); // Constructor
/^\d+$/.test('123'); // true
'abc 123 def 456'.match(/\d+/g); // ['123', '456']
for (const m of 'a1 b2'.matchAll(/([a-z])(\d)/g)) {
console.log(m[1], m[2]); // a 1, b 2
}
'hello'.replace(/[aeiou]/g, c => c.toUpperCase()); // "hEllO"
// Named groups
const m = '2024-01-15'.match(/(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})/);
m.groups.y; // '2024'
'$100 €200'.match(/(?<=\$)\d+/); // ['100'] (ES2018+)
Python
import re
re.match(r'^\d+', '123abc') # Match from start
re.search(r'\d+', 'abc123') # Search anywhere
re.findall(r'\d+', 'a1 b2 c3') # ['1', '2', '3']
re.sub(r'\d+', 'X', 'a1b2') # 'aXbX'
pat = re.compile(r'\b\w{5}\b') # Compiled for reuse (5-letter words)
pat.findall('the quick brown fox') # ['quick', 'brown']
# Named groups (Python syntax)
m = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})', '2024-01')
m.group('year') # '2024'
m.groupdict() # {'year': '2024', 'month': '01'}
# Verbose mode
pat = re.compile(r'''
(?P<year>\d{4}) # year
-(?P<month>\d{2}) # month
-(?P<day>\d{2}) # day
''', re.VERBOSE)
Go
import "regexp"
re := regexp.MustCompile(`\d+`) // Panics on bad pattern
re.MatchString("abc123") // true
re.FindString("abc 123 def") // "123"
re.FindAllString("a1 b2", -1) // ["1", "2"]
re.ReplaceAllString("a1b2", "X") // "aXbX"
// Named groups
dateRe := regexp.MustCompile(`(?P<year>\d{4})-(?P<month>\d{2})`) // New name: re is already declared above
match := dateRe.FindStringSubmatch("2024-01")
// match[dateRe.SubexpIndex("year")] == "2024"
Go uses RE2: no backreferences, no lookahead/lookbehind. Guarantees linear-time matching.
Rust
use regex::Regex;
let re = Regex::new(r"\d+").unwrap();
re.is_match("abc123"); // true
re.find("abc 123").unwrap().as_str(); // "123"
// Named captures
let re = Regex::new(r"(?P<y>\d{4})-(?P<m>\d{2})").unwrap();
let caps = re.captures("2024-01").unwrap();
&caps["y"]; // "2024"
Rust regex crate uses RE2-like semantics. Use fancy-regex for PCRE features.
Regex Flavor Differences
| Feature | PCRE | ECMAScript (JS) | POSIX ERE | RE2 (Go/Rust) | Python re |
|---|---|---|---|---|---|
| Lookahead | Yes | Yes | No | No | Yes |
| Lookbehind | Fixed-len per branch | Variable-len (ES2018+) | No | No | Fixed-len |
| Named groups | (?<n>) | (?<n>) | No | (?P<n>) | (?P<n>) |
| Backreferences | \1 | \1 | \1 (BRE) | No | \1 |
| Atomic groups | (?>) | No | No | N/A (no backtracking) | (?>) (3.11+) |
| Possessive quantifiers | ++ *+ | No | No | No | ++ *+ (3.11+) |
| Unicode properties | \p{L} | \p{L} (u flag) | No | \p{L} | No (third-party regex module) |
| Recursive patterns | (?R) | No | No | No | No |
CLI Tools
grep / sed / awk
# grep
grep 'pattern' file.txt # Basic regex (BRE)
grep -E 'extended|regex' file.txt # Extended regex (ERE)
grep -P '\d{3}' file.txt # Perl regex (GNU grep only)
grep -oirn 'pattern' src/ # -o only match, -i case, -r recursive, -n line nums
# sed
sed 's/old/new/g' file.txt # Replace all
sed -E 's/([0-9]+)/[\1]/g' file.txt # Extended regex with groups
sed '/pattern/d' file.txt # Delete matching lines
# awk
awk '/pattern/ {print $0}' file.txt # Print matching lines
awk '$3 ~ /^[0-9]+$/ {print $1}' file.txt # Field regex match
ripgrep (rg)
rg '\d{3}-\d{4}' src/ # Recursive, fast, .gitignore-aware
rg -t py 'import' . # Filter by file type
rg -U 'struct.*\n.*field' . # Multiline match
rg --pcre2 '(?<=\$)\d+' . # PCRE2 features
rg -r '$1' '(\w+)@\w+' emails.txt # Replace (output only)
VS Code Find & Replace
# Enable regex: click .* button or Alt+R
# Capture groups in replace: $1, $2, etc.
Find: (\w+):\s*(\w+) Replace: $2: $1
Find: console\.log\(.*?\) Replace: (empty)
# Case modifiers in replacement
\u Next char uppercase \U Uppercase until \E
\l Next char lowercase \L Lowercase until \E
Find: const (\w+) Replace: const \l$1
Performance & Security
Catastrophic Backtracking
Nested quantifiers cause exponential time on non-matching input:
# DANGEROUS # SAFE alternative
(a+)+b a+b
(a|a)+b [a]+b
(\w+)*@ \w*@
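A quick Python check that the first two rewrites accept the same input as the nested forms. The SAMPLES list and same_matches helper are hypothetical, and the inputs are kept short on purpose because the dangerous forms slow down sharply on long runs of "a" with no "b".

```python
import re

# (dangerous, safe) pairs taken from the table above.
PAIRS = [
    (r'(a+)+b', r'a+b'),
    (r'(a|a)+b', r'[a]+b'),
]

# Short inputs only, so the backtracking forms stay cheap to run.
SAMPLES = ['ab', 'aaab', 'b', 'aaa', 'xaab', '']

def same_matches(dangerous: str, safe: str) -> bool:
    """Check that both patterns find the same span (or both fail) on every sample."""
    for text in SAMPLES:
        d = re.search(dangerous, text)
        s = re.search(safe, text)
        if (d and d.span()) != (s and s.span()):
            return False
    return True

for dangerous, safe in PAIRS:
    print(dangerous, '->', safe, same_matches(dangerous, safe))
```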
ReDoS (Regular Expression Denial of Service)
Mitigations:
- Use RE2-based engines (Go, Rust) which guarantee linear time
- Set timeouts on regex execution
- Validate/limit input length before matching
- Never build regex from untrusted user input
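Two of these mitigations sketched in Python. Python's standard re module has no built-in timeout, so this covers only the input-length and untrusted-input points. MAX_INPUT_LEN, safe_search, and literal_finder are hypothetical names, and the limit is an assumed value.

```python
import re

MAX_INPUT_LEN = 1_000  # assumed limit; tune for your data

def safe_search(pattern: re.Pattern, text: str):
    """Refuse oversized input before handing it to the regex engine."""
    if len(text) > MAX_INPUT_LEN:
        raise ValueError('input too long for regex matching')
    return pattern.search(text)

def literal_finder(user_text: str) -> re.Pattern:
    """Treat user input as a literal string, not as regex syntax."""
    return re.compile(re.escape(user_text))

finder = literal_finder('(a+)+')  # metacharacters are escaped, not interpreted
print(safe_search(finder, 'x(a+)+y').group())  # (a+)+
print(safe_search(finder, 'aaaa'))             # None
```

Escaping keeps a user-supplied search term from ever becoming a nested quantifier, and the length check bounds the worst case for patterns you wrote yourself.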
Unicode & International Text
\p{L}+ # Any Unicode letter (PCRE, Java, Rust, JS with u flag)
[\u4e00-\u9fff]+ # CJK Unified Ideographs
\p{Emoji} # Emoji (modern engines)
# JS: always use u flag for Unicode
/\p{L}+/u # Correct - \w is still ASCII-only even with u
# Python 3: re handles Unicode by default
re.findall(r'\w+', 'café naïve') # ['café', 'naïve'] (precomposed characters)
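Combining marks are not word characters for Python's \w, so decomposed text is worth normalizing before matching. A small sketch using the standard unicodedata module:

```python
import re
import unicodedata

# "e" + combining acute accent, "i" + combining diaeresis (decomposed form).
decomposed = 'cafe\u0301 nai\u0308ve'
composed = unicodedata.normalize('NFC', decomposed)

raw_words = re.findall(r'\w+', decomposed)
nfc_words = re.findall(r'\w+', composed)

print(raw_words)  # ['cafe', 'nai', 've'] - the combining marks split the words
print(nfc_words)  # ['café', 'naïve']
```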
Multiline & Dotall Modes
/^start/m # Multiline: ^ matches start of any line
/start.*end/s # Dotall: . matches \n
/start[\s\S]*?end/ # Cross-line match without s flag
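The same modes in Python, where the flags are re.MULTILINE and re.DOTALL. The sample text is made up for illustration.

```python
import re

text = 'intro\nstart one\ntwo end\nstart three end'

no_flag = re.findall(r'^start', text)
multiline = re.findall(r'^start', text, re.MULTILINE)
print(no_flag)    # [] - the string itself begins with "intro"
print(multiline)  # ['start', 'start']

single_line = re.findall(r'start.*?end', text)
dotall = re.findall(r'start.*?end', text, re.DOTALL)
class_trick = re.findall(r'start[\s\S]*?end', text)
print(single_line)  # ['start three end']
print(dotall)       # ['start one\ntwo end', 'start three end']
print(class_trick == dotall)  # True
```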
Testing & Debugging
- regex101.com - Test PCRE, JS, Python, Go with live explanation and debugger
- regexr.com - Visual tester with community patterns
- Python: re.compile(pattern, re.DEBUG) prints the parse tree
- Build incrementally: start simple, add complexity, test each step
When NOT to Use Regex
- HTML/XML - Use a parser (BeautifulSoup, cheerio, DOMParser)
- JSON - Use JSON.parse() or equivalent
- Balanced nesting - Brackets, recursive grammars need parser generators
- Strict email/URL validation - Full RFC specs are too complex; use a library
- 200+ char patterns - If you can't read it, use code instead
Reference
For more recipes and testing tips: references/recipes.md