regex-log
Regex Log Parsing Skill
This skill provides a systematic approach for constructing complex regular expressions that extract and validate structured data from log files.
When to Use This Skill
This skill applies when:
- Building regex patterns to extract data from log entries
- Validating specific formats (IPv4 addresses, dates, timestamps) within logs
- Handling requirements for first/last occurrence selection
- Enforcing word boundary conditions
- Combining multiple validation constraints in a single pattern
Approach: Decomposition Strategy
Complex log parsing regex should be built by decomposing the problem into sub-patterns:
Step 1: Identify All Requirements
Before writing any regex, create a complete list of requirements:
- What data needs to be validated (present but not captured)?
- What data needs to be captured?
- What boundary conditions apply (word boundaries, line anchors)?
- Are there positional requirements (first, last, nth occurrence)?
- What constitutes an invalid match?
Step 2: Build Sub-Patterns Independently
Construct each validation pattern separately before combining:
IPv4 Address Pattern
For valid IPv4 addresses (0-255 per octet, no leading zeros except for 0 itself):
- Octet pattern:
(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])
- Order alternatives from most specific to least specific
- Full IPv4:
(?:(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])
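As a minimal sketch, the octet and full-address patterns above can be assembled from a single variable so the octet is written only once. The names `OCTET`, `IPV4`, and `is_ipv4` are illustrative and not part of the original skill.

```python
import re

# One octet: 250-255, 200-249, 100-199, 10-99, then 0-9 (no leading zeros).
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# Three "octet." groups followed by a final octet.
# The doubled braces produce a literal {3} inside the f-string.
IPV4 = rf"(?:{OCTET}\.){{3}}{OCTET}"


def is_ipv4(text: str) -> bool:
    """Return True only if the whole string is a valid IPv4 address."""
    return re.fullmatch(IPV4, text) is not None


print(is_ipv4("192.168.1.10"))   # True
print(is_ipv4("256.0.0.0"))      # False: octet out of range
print(is_ipv4("192.168.01.1"))   # False: leading zero
```

Using `re.fullmatch` here checks the sub-pattern in isolation; anchors and boundaries for matching inside a log line are added in later steps.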
Date Pattern (YYYY-MM-DD)
For valid dates with proper month-day validation:
- 31-day months:
(?:0[13578]|1[02])-(?:0[1-9]|[12][0-9]|3[01])
- 30-day months:
(?:0[469]|11)-(?:0[1-9]|[12][0-9]|30)
- February (up to 29):
02-(?:0[1-9]|1[0-9]|2[0-9])
- Combine with year:
[0-9]{4}-(?:...combined month-day patterns...)
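One way the three month-day alternatives might be combined with the year is sketched below. The names `MONTH_DAY`, `DATE`, and `is_date` are illustrative. Note that this pattern accepts February 29 in every year, since leap-year checking is outside what these sub-patterns express.

```python
import re

# Month-day alternatives, grouped by month length.
MONTH_DAY = (
    r"(?:"
    r"(?:0[13578]|1[02])-(?:0[1-9]|[12][0-9]|3[01])"  # 31-day months
    r"|(?:0[469]|11)-(?:0[1-9]|[12][0-9]|30)"         # 30-day months
    r"|02-(?:0[1-9]|1[0-9]|2[0-9])"                   # February, up to 29
    r")"
)

# Four-digit year, then a valid month-day pair.
DATE = rf"[0-9]{{4}}-{MONTH_DAY}"


def is_date(text: str) -> bool:
    """Return True only if the whole string is a valid YYYY-MM-DD date."""
    return re.fullmatch(DATE, text) is not None


print(is_date("2024-02-29"))  # True
print(is_date("2024-04-31"))  # False: April has 30 days
print(is_date("2024-13-01"))  # False: no month 13
```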
Step 3: Apply Positional Requirements
Selecting Last Occurrence
To capture the last valid pattern in a line:
^.*<pattern>(?!.*<pattern>)
- Use ^.* to greedily consume characters
- Use the negative lookahead (?!.*<pattern>) to ensure no pattern follows
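The last-occurrence template can be tried out with a deliberately simplified date pattern, as in the sketch below. `DATE`, `LAST_DATE`, and `last_date` are illustrative names, and the `\b` boundaries are an added assumption so that the greedy prefix cannot split a token.

```python
import re

# Simplified date shape, for illustration only (no month/day validation).
DATE = r"\b[0-9]{4}-[0-9]{2}-[0-9]{2}\b"

# Greedy ^.* pushes the capture as far right as possible; the negative
# lookahead then asserts that no further date follows on the line.
LAST_DATE = re.compile(rf"^.*({DATE})(?!.*{DATE})")


def last_date(line: str):
    """Return the last date on the line, or None if there is none."""
    m = LAST_DATE.search(line)
    return m.group(1) if m else None


print(last_date("start 2024-01-02 mid 2024-03-04 end"))  # 2024-03-04
print(last_date("no dates here"))                        # None
```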
Selecting First Occurrence
To capture the first valid pattern:
^(?:(?!<pattern>).)*<pattern>
Or simply rely on the regex engine's search returning the leftmost match by default.
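Both options for the first occurrence can be compared side by side. The sketch below uses a simplified date pattern and illustrative names (`DATE`, `FIRST_DATE`, `first_date`); it is meant to show that the explicit template and a plain search agree.

```python
import re

# Simplified date shape, for illustration only (no month/day validation).
DATE = r"\b[0-9]{4}-[0-9]{2}-[0-9]{2}\b"

# Consume one character at a time, but only while no date starts there,
# then capture the date that stopped the scan.
FIRST_DATE = re.compile(rf"^(?:(?!{DATE}).)*({DATE})")


def first_date(line: str):
    """Return the first date on the line, or None if there is none."""
    m = FIRST_DATE.match(line)
    return m.group(1) if m else None


line = "a 2024-01-02 b 2024-03-04"
print(first_date(line))                 # 2024-01-02
print(re.search(DATE, line).group(0))   # 2024-01-02 (default leftmost match)
```

The explicit form is mainly useful when the whole line must be matched by one anchored expression; otherwise a plain search is simpler.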
Step 4: Apply Validation Without Capture
To require presence of a pattern without capturing it:
- Use a lookahead at the start of the regex:
(?=.*<pattern>)
- This validates that the line contains the pattern without affecting the capture
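A small sketch of validation without capture: the line must contain an IPv4 address, but only a date is captured. The date pattern is simplified, and `DATE_IF_IP` and `date_if_ip` are illustrative names rather than anything defined by the skill.

```python
import re

OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
IPV4 = rf"\b(?:{OCTET}\.){{3}}{OCTET}\b"
DATE = r"\b[0-9]{4}-[0-9]{2}-[0-9]{2}\b"  # simplified, illustration only

# The lookahead asserts that an IPv4 address appears somewhere on the line
# but consumes nothing, so group 1 holds only the date.
DATE_IF_IP = re.compile(rf"^(?=.*{IPV4}).*?({DATE})")


def date_if_ip(line: str):
    """Return the first date, but only when the line also has an IPv4."""
    m = DATE_IF_IP.search(line)
    return m.group(1) if m else None


print(date_if_ip("10.0.0.1 login 2024-01-02"))  # 2024-01-02
print(date_if_ip("login 2024-01-02"))           # None: no IPv4 on the line
```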
Step 5: Apply Word Boundaries
For patterns that must not be adjacent to alphanumeric characters:
- Use \b word boundaries:
\b<pattern>\b
- Be aware that \b matches between a word character and a non-word character; word characters include the underscore as well as letters and digits
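The sketch below contrasts `\b` with explicit lookarounds. Because `\b` counts the underscore as a word character, a requirement phrased strictly as "not adjacent to alphanumerics" may be better served by the lookaround form. The pattern names are illustrative and the date shape is simplified.

```python
import re

DATE = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"  # simplified, illustration only

# \b: no letter, digit, or underscore may touch the pattern.
WITH_B = re.compile(rf"\b{DATE}\b")

# Lookarounds: only letters and digits are forbidden neighbours.
NO_ALNUM = re.compile(rf"(?<![A-Za-z0-9]){DATE}(?![A-Za-z0-9])")

print(bool(WITH_B.search("at 2024-01-02, ok")))   # True: punctuation is fine
print(bool(WITH_B.search("id2024-01-02")))        # False: letter before it
print(bool(WITH_B.search("_2024-01-02_")))        # False: underscore is a word char
print(bool(NO_ALNUM.search("_2024-01-02_")))      # True: underscore is not alphanumeric
```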
Verification Strategy
Create Comprehensive Test Cases
Organize tests by category:
- Valid cases: Confirm expected matches
  - Minimum/maximum valid values (e.g., 0.0.0.0, 255.255.255.255)
  - Edge values for each component
- Invalid format cases: Confirm rejection
  - Out-of-range values (e.g., 256.0.0.0)
  - Invalid formatting (leading zeros where prohibited)
  - Invalid months (00, 13) or days (32)
- Boundary condition cases:
  - Pattern at start/end of line
  - Pattern adjacent to alphanumeric characters (should fail with word boundaries)
  - Pattern adjacent to punctuation (should pass with word boundaries)
- Positional cases:
  - Multiple valid patterns in one line (verify correct one is captured)
  - Single pattern in line
  - No valid pattern in line
Test File Structure
Create a structured test file that:
- Groups tests by category
- Uses clear naming for each test case
- Reports pass/fail status for each test
- Summarizes overall results
Example structure:
test_cases = {
    "valid_ipv4": [...],
    "invalid_ipv4": [...],
    "valid_dates": [...],
    "invalid_dates": [...],
    "last_occurrence": [...],
    "boundary_conditions": [...]
}
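A runnable version of this structure might look like the sketch below, shown for two of the categories. The `(input, expected)` tuple layout and the `run_tests` helper are assumptions made for illustration, not a format prescribed by the skill.

```python
import re

OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
IPV4 = re.compile(rf"(?:{OCTET}\.){{3}}{OCTET}")

# Each case is (input, expected result); categories mirror the outline above.
test_cases = {
    "valid_ipv4": [("0.0.0.0", True), ("255.255.255.255", True)],
    "invalid_ipv4": [("256.0.0.0", False), ("01.2.3.4", False)],
}


def run_tests(cases, check):
    """Run every case, print pass/fail per test, return (passed, total)."""
    passed = total = 0
    for category, items in cases.items():
        for text, expected in items:
            ok = check(text) == expected
            passed += ok
            total += 1
            print(f"[{'PASS' if ok else 'FAIL'}] {category}: {text!r}")
    print(f"{passed}/{total} passed")
    return passed, total


passed, total = run_tests(test_cases, lambda s: IPV4.fullmatch(s) is not None)
```

Keeping the cases as data makes it easy to add the remaining categories without touching the runner.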
Common Pitfalls
1. Incomplete First Attempt
- Problem: Creating incomplete or truncated test files
- Solution: Plan the full test structure before writing; validate file completeness before execution
2. Environment Assumptions
- Problem: Assuming the python command exists when only python3 is available
- Solution: Check the Python environment first or use python3 explicitly
3. Scattered Reasoning
- Problem: Disorganized thought process leading to repeated work
- Solution: Follow the decomposition strategy linearly; complete each sub-pattern before moving to the next
4. Duplicate Patterns Without Abstraction
- Problem: Same regex pattern repeated multiple times, increasing error risk
- Solution: Define complex sub-patterns once in reasoning, then reference them; in code, use variables
5. Missing Edge Cases
- Problem: Focusing only on happy path validation
- Solution: Explicitly test:
- Boundary values (min/max for each component)
- Invalid values just outside valid range
- Empty and null cases
- Patterns at different positions in the line
6. Order of Alternatives
- Problem: Less specific alternatives matching before more specific ones
- Solution: Order regex alternatives from most specific to least specific (e.g., 25[0-5] before 2[0-4][0-9] before [0-9])
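The effect of ordering is easy to see in an unanchored match. In the sketch below, `GOOD` and `BAD` are illustrative names; `BAD` holds the same alternatives in reverse order. The behavior shown applies to backtracking engines such as Python's `re`, which take the first alternative that succeeds.

```python
import re

# Most specific alternative first.
GOOD = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
# Same alternatives, least specific first.
BAD = r"(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])"

good = re.match(GOOD, "255").group(0)
bad = re.match(BAD, "255").group(0)

print(good)  # 255
print(bad)   # 2: the single-digit alternative wins immediately

# With a full-string anchor, backtracking still finds the right alternative,
# so the ordering problem mostly bites in unanchored or partial matches.
print(re.fullmatch(BAD, "255") is not None)  # True
```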
7. Greedy vs Non-Greedy Matching
- Problem: Unexpected capture due to greedy quantifiers
- Solution: Understand when to use .* vs .*?; for "last occurrence" patterns, greedy .* is typically correct
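A short sketch of the difference, using a made-up log line with three `err=` fields (the field name is purely illustrative):

```python
import re

line = "err=1 err=2 err=3"

# Greedy .* consumes as much as possible, so the capture is the last field.
greedy = re.search(r"^.*err=([0-9])", line).group(1)

# Lazy .*? consumes as little as possible, so the capture is the first field.
lazy = re.search(r"^.*?err=([0-9])", line).group(1)

print(greedy)  # 3
print(lazy)    # 1
```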
Workflow Summary
- List all requirements explicitly
- Build and test sub-patterns independently
- Combine sub-patterns with appropriate anchors and lookaheads
- Create comprehensive test cases covering all categories
- Run tests and verify all pass
- Clean up test files after validation