regex-master
Regex Master
Overview
The Regex Master skill covers building correct, readable, and efficient regular expressions for common text processing tasks. It explains character classes, quantifiers, groups, alternation, anchors, and lookahead/lookbehind assertions. It includes a quick-reference library of battle-tested patterns for emails, URLs, dates, phone numbers, IP addresses, and log lines. Every regex is explained token by token so you understand what it does rather than just copying it.
When to Use
- Validating user input (email, phone, postal code, URL)
- Extracting structured data from unstructured text (logs, documents, HTML attributes)
- Search-and-replace with patterns in code editors or scripts
- Parsing delimited or fixed-format text files
- Filtering or transforming data with
grep,sed,awk, Python, or JavaScript
When NOT to Use
- Parsing HTML or XML (use a DOM parser: BeautifulSoup, DOMParser — regex is unreliable for nested structures)
- Parsing programming languages or complex grammars (use a proper parser/AST)
- Matching natural language semantics (use NLP tools)
- When a simple string split or indexOf is sufficient — don't reach for regex when a simpler tool works
Quick Reference
Building Blocks
| Token | Meaning | Example | Matches |
|---|---|---|---|
. |
Any character except newline | a.c |
abc, a1c, a-c |
\d |
Digit [0-9] |
\d+ |
42, 007 |
\w |
Word char [a-zA-Z0-9_] |
\w+ |
hello, foo_2 |
\s |
Whitespace | \s+ |
space, tab, newline |
\D |
Non-digit | \D |
a, -, |
\W |
Non-word char | \W |
!, @, |
\S |
Non-whitespace | \S+ |
hello, 42 |
^ |
Start of string/line | ^Hello |
Hello world |
$ |
End of string/line | world$ |
Hello world |
\b |
Word boundary | \bcat\b |
the cat sat (not catch) |
[abc] |
Character class | [aeiou] |
any vowel |
[^abc] |
Negated class | [^0-9] |
any non-digit |
[a-z] |
Range | [a-zA-Z] |
any letter |
Quantifiers
| Quantifier | Meaning | Greedy? |
|---|---|---|
* |
0 or more | Yes |
+ |
1 or more | Yes |
? |
0 or 1 (optional) | Yes |
{n} |
Exactly n | Yes |
{n,m} |
Between n and m | Yes |
{n,} |
n or more | Yes |
*? |
0 or more | No (lazy) |
+? |
1 or more | No (lazy) |
Groups & References
| Syntax | Meaning |
|---|---|
(abc) |
Capturing group |
(?:abc) |
Non-capturing group |
(?P<name>abc) |
Named capturing group (Python) |
(?<name>abc) |
Named capturing group (JS/PCRE) |
\1, $1 |
Back-reference to group 1 |
a|b |
Alternation: a or b |
Lookaround
| Syntax | Meaning |
|---|---|
(?=abc) |
Positive lookahead: followed by abc |
(?!abc) |
Negative lookahead: NOT followed by abc |
(?<=abc) |
Positive lookbehind: preceded by abc |
(?<!abc) |
Negative lookbehind: NOT preceded by abc |
Flags
| Flag | Effect |
|---|---|
i / re.IGNORECASE |
Case-insensitive |
m / re.MULTILINE |
^/$ match line boundaries |
s / re.DOTALL |
. matches newlines |
g |
Global — find all matches (JS) |
x / re.VERBOSE |
Allow whitespace/comments in pattern |
Common Patterns Library
| Pattern | Regex |
|---|---|
| Email (simple) | ^[\w.+-]+@[\w-]+\.[a-zA-Z]{2,}$ |
| URL | https?://[\w.-]+(?:/[\w./?=#&%-]*)? |
| IPv4 | \b(?:\d{1,3}\.){3}\d{1,3}\b |
| Date (YYYY-MM-DD) | \d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01]) |
| Time (HH:MM:SS) | (?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d |
| US Phone | \+?1?\s?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4} |
| ZIP code (US) | \d{5}(?:-\d{4})? |
| Credit card | \b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b |
| Hex color | #(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6})\b |
| UUID | [0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12} |
| Semantic version | \d+\.\d+\.\d+(?:-[\w.]+)?(?:\+[\w.]+)? |
| HTML tag (simple) | <([a-z][a-z0-9]*)\b[^>]*>(.*?)</\1> |
| Slug | ^[a-z0-9]+(?:-[a-z0-9]+)*$ |
Instructions
-
Define what to match precisely in plain English first
- What does a valid match look like? (examples)
- What should NOT match? (counter-examples)
- Where in the string must it appear? (anywhere, start, full string)
-
Build the regex incrementally
- Start with the simplest pattern that matches one known good example.
- Add character classes, quantifiers, and anchors one at a time.
- Test after each addition against both positive and negative examples.
-
Anchor appropriately
- Use
^...$for full-string validation (email, phone, slug). - Omit anchors for searching within a larger text.
- Use
\bfor word-boundary matching to avoid partial matches.
- Use
-
Choose capturing vs. non-capturing groups
- Use
(?:...)for grouping without capturing when you don't need the group's value. - Use named groups
(?P<year>\d{4})for readability and reliable extraction.
- Use
-
Handle edge cases explicitly
- Optional parts:
(?:\.com)?— the?makes the group optional. - Unicode: use
re.UNICODEflag or\p{L}(PCRE2) for non-ASCII letters. - Newlines: use the
s/DOTALLflag if.should match newlines.
- Optional parts:
-
Test thoroughly
- Test against: typical valid input, edge-of-valid input, invalid input that looks similar.
- Use online tools: regex101.com (with detailed explanation), regexr.com.
-
Document the regex
- Always add a comment explaining what the regex matches.
- Use verbose mode (
re.VERBOSEin Python) to add inline comments for complex patterns.
Examples
Example 1: Validate an email address
Input: "Write a regex to validate email addresses."
Output:
import re
# Email validation regex
# Allows: local part with letters, digits, dots, plus, hyphen
# Requires: @ symbol
# Domain: letters, digits, hyphens
# TLD: 2+ letters
EMAIL_REGEX = re.compile(
r'^[\w.+-]+@[\w-]+\.[a-zA-Z]{2,}$',
re.IGNORECASE
)
# Verbose version explaining each token:
EMAIL_REGEX_VERBOSE = re.compile(r'''
^ # Start of string
[\w.+-]+ # Local part: word chars, dots, plus, hyphen (1 or more)
@ # Literal @ symbol
[\w-]+ # Domain name: word chars and hyphens (1 or more)
\. # Literal dot (escaped)
[a-zA-Z]{2,} # TLD: 2 or more letters
$ # End of string
''', re.VERBOSE | re.IGNORECASE)
def validate_email(email: str) -> bool:
return bool(EMAIL_REGEX.match(email))
# Tests
assert validate_email("alice@example.com") == True
assert validate_email("alice.smith+tag@sub.domain.org") == True
assert validate_email("alice@example.c") == False # TLD too short
assert validate_email("@example.com") == False # Missing local part
assert validate_email("alice@.com") == False # Missing domain
assert validate_email("alice example.com") == False # Missing @
What it matches / rejects:
✅ alice@example.com
✅ alice.smith+filter@sub.domain.co.uk
✅ user123@my-company.io
❌ alice@example.c (TLD must be 2+ chars)
❌ alice @example.com (spaces not allowed)
❌ alice@@example.com (double @)
❌ @example.com (empty local part)
Note: RFC 5321 email validation is extremely complex. This regex covers ~99% of real-world valid addresses. For production, combine with a verification email.
Example 2: Extract data from log lines
Input: Parse Apache Combined Log Format lines:
192.168.1.1 - frank [10/Oct/2024:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326 "http://example.com/" "Mozilla/5.0"
Output:
import re
from datetime import datetime
LOG_PATTERN = re.compile(r'''
^
(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) # Client IP address
\s-\s # Separator " - "
(?P<user>\S+) # Remote user (- if none)
\s
\[(?P<timestamp>[^\]]+)\] # Timestamp in brackets
\s
"(?P<method>[A-Z]+) # HTTP method
\s(?P<path>[^\s"]+) # Request path
\s(?P<protocol>HTTP/[\d.]+)" # Protocol
\s
(?P<status>\d{3}) # HTTP status code
\s
(?P<size>\d+|-) # Response size (or - if none)
(?:\s"(?P<referer>[^"]*)")? # Optional referer
(?:\s"(?P<user_agent>[^"]*)")? # Optional user agent
$
''', re.VERBOSE)
def parse_log_line(line: str) -> dict | None:
match = LOG_PATTERN.match(line.strip())
if not match:
return None
data = match.groupdict()
data['status'] = int(data['status'])
data['size'] = int(data['size']) if data['size'] != '-' else 0
return data
# Test
line = '192.168.1.1 - frank [10/Oct/2024:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326 "http://example.com/" "Mozilla/5.0"'
result = parse_log_line(line)
print(result)
# {
# 'ip': '192.168.1.1', 'user': 'frank',
# 'timestamp': '10/Oct/2024:13:55:36 -0700',
# 'method': 'GET', 'path': '/index.html',
# 'protocol': 'HTTP/1.1', 'status': 200,
# 'size': 2326, 'referer': 'http://example.com/',
# 'user_agent': 'Mozilla/5.0'
# }
JavaScript equivalent for the same pattern:
const LOG_PATTERN = /^(\d{1,3}(?:\.\d{1,3}){3}) - (\S+) \[([^\]]+)\] "([A-Z]+) ([^\s"]+) (HTTP\/[\d.]+)" (\d{3}) (\d+|-)(?: "([^"]*)")?(?: "([^"]*)")?$/;
function parseLogLine(line) {
const m = line.match(LOG_PATTERN);
if (!m) return null;
const [, ip, user, timestamp, method, path, protocol, status, size, referer, userAgent] = m;
return { ip, user, timestamp, method, path, protocol, status: +status, size: size === '-' ? 0 : +size, referer, userAgent };
}
Best Practices
- Build complex regex with named groups and verbose mode for readability
- Never use regex for HTML/XML parsing — use a DOM parser
- Test with regex101.com for interactive debugging and explanation
- Prefer specific character classes (
[a-z]) over.+ filter to avoid surprise matches - Use lazy quantifiers (
+?,*?) when you need the shortest match, not the longest - Compile regex patterns once and reuse (
re.compile()) in performance-critical code
Common Mistakes
- Using
.*(greedy) when you want the shortest match — use.*?(lazy) - Forgetting to escape
.— unescaped.matches any character - Using
^/$when multiline mode is needed (they match string boundaries by default, not line boundaries) - Catastrophic backtracking with nested quantifiers like
(a+)+on long non-matching input - Not anchoring validation patterns —
\d+matches the digits inabc123def - Using capturing groups
()when non-capturing(?:)is sufficient — wastes memory
Tips & Tricks
- regex101.com shows a detailed token-by-token explanation and live testing
re.fullmatch()in Python is safer thanre.match()for validation — no partial matches- In JavaScript, prefer named groups:
/(?<year>\d{4})-(?<month>\d{2})/— access viamatch.groups.year - Use
sed -Eorgrep -Pfor extended/PCRE regex on the command line - Atomic groups
(?>...)(PCRE) prevent backtracking — useful for catastrophic backtracking prevention