Regex Master

Overview

The Regex Master skill covers building correct, readable, and efficient regular expressions for common text processing tasks. It explains character classes, quantifiers, groups, alternation, anchors, and lookahead/lookbehind assertions. It includes a quick-reference library of battle-tested patterns for emails, URLs, dates, phone numbers, IP addresses, and log lines. Every regex is explained token by token so you understand what it does rather than just copying it.

When to Use

Validating user input (email, phone, postal code, URL)
Extracting structured data from unstructured text (logs, documents, HTML attributes)
Search-and-replace with patterns in code editors or scripts
Parsing delimited or fixed-format text files
Filtering or transforming data with grep, sed, awk, Python, or JavaScript

When NOT to Use

Parsing HTML or XML (use a DOM parser: BeautifulSoup, DOMParser — regex is unreliable for nested structures)
Parsing programming languages or complex grammars (use a proper parser/AST)
Matching natural language semantics (use NLP tools)
When a simple string split or indexOf is sufficient — don't reach for regex when a simpler tool works

Quick Reference

Building Blocks

Token	Meaning	Example	Matches
`.`	Any character except newline	`a.c`	`abc`, `a1c`, `a-c`
`\d`	Digit `[0-9]`	`\d+`	`42`, `007`
`\w`	Word char `[a-zA-Z0-9_]`	`\w+`	`hello`, `foo_2`
`\s`	Whitespace	`\s+`	space, tab, newline
`\D`	Non-digit	`\D`	`a`, `-`, space
`\W`	Non-word char	`\W`	`!`, `@`, space
`\S`	Non-whitespace	`\S+`	`hello`, `42`
`^`	Start of string/line	`^Hello`	`Hello world`
`$`	End of string/line	`world$`	`Hello world`
`\b`	Word boundary	`\bcat\b`	`the cat sat` (not `catch`)
`[abc]`	Character class	`[aeiou]`	any vowel
`[^abc]`	Negated class	`[^0-9]`	any non-digit
`[a-z]`	Range	`[a-zA-Z]`	any letter
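
A minimal sketch of a few of these tokens in action, using Python's built-in `re` module. The sample strings are made up for illustration.

```python
import re

sentence = 'The cat sat near a catalog, order #42 arrived on day 007.'

# \b anchors the match to word edges, so 'catalog' is not a hit.
whole_word = re.findall(r'\bcat\b', sentence)

# Dropping the trailing \b and adding \w* also picks up words that start with 'cat'.
prefixed = re.findall(r'\bcat\w*', sentence)

# \d+ matches each run of digits, keeping leading zeros.
numbers = re.findall(r'\d+', sentence)

# '.' matches exactly one character (other than a newline).
dotted = re.findall(r'a.c', 'abc a1c a-c ac')

print(whole_word, prefixed, numbers, dotted)
```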

Quantifiers

Quantifier	Meaning	Greedy?
`*`	0 or more	Yes
`+`	1 or more	Yes
`?`	0 or 1 (optional)	Yes
`{n}`	Exactly n	Yes
`{n,m}`	Between n and m	Yes
`{n,}`	n or more	Yes
`*?`	0 or more	No (lazy)
`+?`	1 or more	No (lazy)
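
The difference between greedy and lazy quantifiers is easiest to see side by side. The following is a small illustration in Python; the sample text is an assumption chosen to make the contrast visible.

```python
import re

text = '<b>bold</b> and <i>italic</i>'

# Greedy: .* takes as much as it can, so the match runs to the last '>'.
greedy = re.findall(r'<.*>', text)

# Lazy: .*? takes as little as it can, so each tag is matched on its own.
lazy = re.findall(r'<.*?>', text)

print(greedy)
print(lazy)
```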

Groups & References

Syntax	Meaning
`(abc)`	Capturing group
`(?:abc)`	Non-capturing group
`(?P<name>abc)`	Named capturing group (Python)
`(?<name>abc)`	Named capturing group (JS/PCRE)
`\1`, `$1`	Back-reference to group 1 (`\1` in patterns and Python replacements, `$1` in JS replacements)
`a|b`	Alternation: a or b
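
A short sketch of named groups and back-references in Python. The input strings are invented for the example.

```python
import re

# Named groups (Python syntax) make extraction self-documenting.
date_re = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')
parts = date_re.search('Released on 2024-10-10.').groupdict()

# Back-reference inside a pattern: \1 must repeat what group 1 captured.
doubled = re.findall(r'\b(\w+) \1\b', 'this is is a test test')

# Back-reference inside a replacement: reorder the captured pieces.
swapped = re.sub(r'(\w+), (\w+)', r'\2 \1', 'Lovelace, Ada')

print(parts, doubled, swapped)
```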

Lookaround

Syntax	Meaning
`(?=abc)`	Positive lookahead: followed by abc
`(?!abc)`	Negative lookahead: NOT followed by abc
`(?<=abc)`	Positive lookbehind: preceded by abc
`(?<!abc)`	Negative lookbehind: NOT preceded by abc
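
Lookarounds check the surrounding context without consuming it. A minimal Python sketch, with made-up sample text, might look like this:

```python
import re

text = 'Total: $42 plus 15 items, discount $7'

# Positive lookbehind: digits preceded by '$'; the '$' is not part of the match.
prices = re.findall(r'(?<=\$)\d+', text)

# Negative lookbehind: digit runs NOT preceded by '$' or by another digit.
counts = re.findall(r'(?<![$\d])\d+', text)

# Negative lookahead: words starting with 'foo' that do not continue with 'bar'.
not_foobar = re.findall(r'foo(?!bar)\w*', 'foobar foobaz food')

# Positive lookahead: at least 8 characters, one of which is a digit.
strong = re.fullmatch(r'(?=.*\d).{8,}', 'passw0rd!') is not None
weak = re.fullmatch(r'(?=.*\d).{8,}', 'password') is not None

print(prices, counts, not_foobar, strong, weak)
```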

Flags

Flag	Effect
`i` / `re.IGNORECASE`	Case-insensitive
`m` / `re.MULTILINE`	`^`/`$` match line boundaries
`s` / `re.DOTALL`	`.` matches newlines
`g`	Global — find all matches (JS)
`x` / `re.VERBOSE`	Allow whitespace/comments in pattern
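
The flags can be combined with `|` in Python. The sketch below uses an invented three-line log string to show the effect of each one.

```python
import re

log = "error: disk full\nINFO: retrying\nError: timeout"

# IGNORECASE + MULTILINE: ^ and $ match at every line boundary, in any case.
errors = re.findall(r'^error: (.+)$', log, re.IGNORECASE | re.MULTILINE)

# Without DOTALL, '.' stops at the first newline; with it, '.' crosses lines.
first_line = re.match(r'.+', log).group()
everything = re.match(r'.+', log, re.DOTALL).group()

# VERBOSE: whitespace and comments inside the pattern are ignored.
time_re = re.compile(r'''
    (?P<hour>[01]\d|2[0-3])   # hours 00-23
    :
    (?P<minute>[0-5]\d)       # minutes 00-59
''', re.VERBOSE)
hour = time_re.search('backup at 23:59').group('hour')

print(errors, first_line, hour)
```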

Common Patterns Library

Pattern	Regex
Email (simple)	`^[\w.+-]+@(?:[\w-]+\.)+[a-zA-Z]{2,}$`
URL	`https?://[\w.-]+(?:/[\w./?=#&%-]*)?`
IPv4	`\b(?:\d{1,3}\.){3}\d{1,3}\b`
Date (YYYY-MM-DD)	`\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])`
Time (HH:MM:SS)	`(?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d`
US Phone	`\+?1?\s?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}`
ZIP code (US)	`\d{5}(?:-\d{4})?`
Credit card	`\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b`
Hex color	`#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6})\b`
UUID	`[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}`
Semantic version	`\d+\.\d+\.\d+(?:-[\w.]+)?(?:\+[\w.]+)?`
HTML tag (simple)	`<([a-z][a-z0-9]*)\b[^>]*>(.*?)</\1>`
Slug	`^[a-z0-9]+(?:-[a-z0-9]+)*$`
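
Library patterns like these check shape, not meaning: the IPv4 pattern accepts `999.1.1.1` and the date pattern accepts `2024-02-31`. One possible way to use them in Python is sketched below; `is_valid_ipv4` is a hypothetical helper added for illustration, pairing the regex with a range check in code.

```python
import re

DATE_RE = re.compile(r'\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])')
IPV4_RE = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')
SLUG_RE = re.compile(r'^[a-z0-9]+(?:-[a-z0-9]+)*$')

def is_valid_ipv4(candidate: str) -> bool:
    """Hypothetical helper: shape check by regex, range check in code."""
    if not IPV4_RE.fullmatch(candidate):
        return False
    return all(int(octet) <= 255 for octet in candidate.split('.'))

# Searching: no anchors, so the pattern finds dates inside longer text.
dates = DATE_RE.findall('Deployed 2024-10-10, rolled back 2024-13-01.')

# Validating: the slug pattern is anchored, so the whole string must fit.
good_slug = SLUG_RE.match('regex-master') is not None
bad_slug = SLUG_RE.match('Regex--Master') is not None

print(dates, is_valid_ipv4('192.168.1.1'), is_valid_ipv4('999.1.1.1'))
```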

Instructions

Define what to match precisely in plain English first
- What does a valid match look like? (examples)
- What should NOT match? (counter-examples)
- Where in the string must it appear? (anywhere, start, full string)
Build the regex incrementally
- Start with the simplest pattern that matches one known good example.
- Add character classes, quantifiers, and anchors one at a time.
- Test after each addition against both positive and negative examples.
Anchor appropriately
- Use ^...$ for full-string validation (email, phone, slug).
- Omit anchors for searching within a larger text.
- Use \b for word-boundary matching to avoid partial matches.
Choose capturing vs. non-capturing groups
- Use (?:...) for grouping without capturing when you don't need the group's value.
- Use named groups (?P<year>\d{4}) for readability and reliable extraction.
Handle edge cases explicitly
- Optional parts: (?:\.com)? — the ? makes the group optional.
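- Unicode: in Python 3, \w, \d, and \s are already Unicode-aware for str patterns (re.UNICODE is the default; use re.ASCII to restrict them). Python's built-in re does not support \p{L}; use it in PCRE2, in JavaScript with the u flag, or with the third-party regex module.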
- Newlines: use the s / DOTALL flag if . should match newlines.
Test thoroughly
- Test against: typical valid input, edge-of-valid input, invalid input that looks similar.
- Use online tools: regex101.com (with detailed explanation), regexr.com.
Document the regex
- Always add a comment explaining what the regex matches.
- Use verbose mode (re.VERBOSE in Python) to add inline comments for complex patterns.
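
The workflow above can be sketched as a small test loop: write the positive and negative examples first, then check each candidate pattern against both lists. The ZIP code task, the example strings, and the `failures` helper are all assumptions made for illustration.

```python
import re

# Step 1: examples and counter-examples, written before any regex.
should_match = ['12345', '12345-6789']
should_not_match = ['1234', '123456', '12345-678', 'abcde']

# Steps 2-3: build incrementally; fullmatch plays the role of ^...$ anchoring.
candidates = [
    r'\d{5}',               # first attempt: five digits
    r'\d{5}(?:-\d{4})?',    # second attempt: optional +4 extension
]

def failures(pattern: str) -> list[str]:
    """Return every example the pattern gets wrong (empty list means pass)."""
    compiled = re.compile(pattern)
    wrong = [s for s in should_match if not compiled.fullmatch(s)]
    wrong += [s for s in should_not_match if compiled.fullmatch(s)]
    return wrong

for pattern in candidates:
    print(pattern, failures(pattern))

# Searching inside larger text: no anchors, but \b avoids partial matches.
found = re.findall(r'\b\d{5}\b', 'ship to 12345, order 1234567')
```

Keeping the examples in code means they can be rerun after every change to the pattern.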

Examples

Example 1: Validate an email address

Input: "Write a regex to validate email addresses."

Output:

import re

# Email validation regex
# Local part: word characters (letters, digits, underscore), dots, plus, hyphen
# Requires: @ symbol
# Domain: one or more dot-terminated labels, so subdomains are allowed
# TLD: 2+ letters
EMAIL_REGEX = re.compile(
    r'^[\w.+-]+@(?:[\w-]+\.)+[a-zA-Z]{2,}$',
    re.IGNORECASE
)

# Verbose version explaining each token:
EMAIL_REGEX_VERBOSE = re.compile(r'''
    ^               # Start of string
    [\w.+-]+        # Local part: word chars, dots, plus, hyphen (1 or more)
    @               # Literal @ symbol
    (?:[\w-]+\.)+   # Domain labels: word chars and hyphens, each followed by a dot
    [a-zA-Z]{2,}    # TLD: 2 or more letters
    $               # End of string
''', re.VERBOSE | re.IGNORECASE)

def validate_email(email: str) -> bool:
    # fullmatch rejects a trailing newline, which match() with $ would accept
    return bool(EMAIL_REGEX.fullmatch(email))

# Tests
assert validate_email("alice@example.com")
assert validate_email("alice.smith+tag@sub.domain.org")  # Subdomain
assert not validate_email("alice@example.c")     # TLD too short
assert not validate_email("@example.com")        # Missing local part
assert not validate_email("alice@.com")          # Missing domain
assert not validate_email("alice example.com")   # Missing @

What it matches / rejects:

✅ alice@example.com
✅ alice.smith+filter@sub.domain.co.uk
✅ user123@my-company.io
❌ alice@example.c       (TLD must be 2+ chars)
❌ alice @example.com    (spaces not allowed)
❌ alice@@example.com    (double @)
❌ @example.com          (empty local part)

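Note: The full address grammar in RFC 5321/5322 is far more permissive than this regex (for example, quoted local parts and IP-literal domains are legal). The pattern covers the common shapes of real-world addresses but is not RFC-complete. For production, combine it with a verification email.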

Example 2: Extract data from log lines

Input: Parse Apache Combined Log Format lines:

192.168.1.1 - frank [10/Oct/2024:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326 "http://example.com/" "Mozilla/5.0"