filter-js-from-html
Filter JavaScript from HTML
Overview
This skill provides guidance for tasks that require removing JavaScript and XSS attack vectors from HTML content while preserving the original formatting exactly. The key challenge is balancing comprehensive security filtering with format preservation.
Critical Requirements Analysis
Before implementation, identify and prioritize these requirements:
- Security completeness: All XSS vectors must be removed
- Format preservation: Output must be functionally identical to input except for harmful content removal
- Clean content handling: Files without XSS content should remain completely unchanged
These requirements often conflict - comprehensive parsing may alter formatting, while simple string replacement may miss attack vectors.
Approach Selection
Option 1: Regex-Based Surgical Removal (Recommended for Format Preservation)
When the task explicitly requires preserving original formatting, prefer regex-based approaches that surgically remove only the dangerous content.
Advantages:
- Preserves whitespace, attribute ordering, quote styles exactly
- Does not reconstruct or reformat HTML
- Output matches input character-for-character except for removed content
Considerations:
- Requires careful pattern construction to avoid partial matches
- Must handle various encodings and obfuscation techniques
- Test patterns against comprehensive XSS vector lists
Option 2: HTML Parser-Based Filtering
When format preservation is less critical or when dealing with malformed HTML.
Considerations:
- HTML parsers inherently reconstruct output, changing formatting
- May normalize attribute quotes, whitespace, tag casing
- Better for malformed HTML that regex cannot reliably parse
- If using this approach, verify that clean HTML files remain unchanged
Comprehensive XSS Vector Checklist
Before implementing, research and account for ALL of these attack categories:
1. Script Execution Tags
<script>tags (including variations with attributes)<noscript>abuse cases
2. Event Handlers (Comprehensive List Required)
Common handlers:
onclick,onload,onerror,onmouseover,onfocus,onblur
Frequently missed handlers:
onlayoutcomplete,ontimeerror,onselectionchangeonrowsinserted,onrowsdelete,onrowexit,onrowenteroncellchange,ondataavailable,ondatasetchanged,ondatasetcompleteonbeforeupdate,onafterupdate,onerrorupdateonfilterchange,onpropertychange,onreadystatechangeonbeforeprint,onafterprint,onbeforeunloadoncontextmenu,ondrag,ondragend,ondragenter,ondragleaveondragover,ondragstart,ondroponhashchange,oninput,oninvalid,onpageshow,onpagehideonpopstate,onresize,onstorage,onwheel
Action: Search for comprehensive event handler lists (e.g., MDN, OWASP) rather than relying on memory.
3. JavaScript URL Protocol
javascript:in href, src, action, formaction, data, poster attributes- Case variations:
JavaScript:,JAVASCRIPT:,JaVaScRiPt: - Encoded variations:
javascript:,javascript:
4. Other Dangerous Protocols
vbscript:(IE legacy)data:URIs with script content:data:text/html,<script>...</script>data:text/html;base64,...encoded payloads
5. CSS-Based Attacks
<style>tags with dangerous properties-moz-binding(Firefox legacy)expression()(IE legacy)behavior:property@importwith javascript or data URIs
6. Meta Tag Attacks
<meta http-equiv="refresh" content="0;url=data:text/html,..."><meta http-equiv="refresh" content="0;url=javascript:...">
7. External Resource Loading
<link>tags with dangerous href values<object>tags with data attributes<embed>tags with src attributes<applet>tags (legacy)<iframe>with src or srcdoc containing scripts
8. SVG-Based Attacks
<svg onload="...">and other SVG event handlers<svg><script>...</script></svg>- SVG
<use>with external references
9. Encoding and Obfuscation
- HTML entity encoding:
<script> - URL encoding:
%3Cscript%3E - UTF-7 encoding attacks
- Null byte injection:
<scr\0ipt> - Unicode variations
10. HTML Comment Exploits
- Conditional comments:
<!--[if IE]><script>...<![endif]--> - Nested comment breaking
Verification Strategy
Test Categories (All Required)
-
XSS Attack Vectors
- Use established XSS test suites (OWASP XSS Filter Evasion Cheat Sheet)
- Test XSS polyglots that combine multiple techniques
- Include lesser-known event handlers in tests
-
Format Preservation
- Provide clean HTML files with varied formatting
- Verify byte-for-byte identical output for clean files
- Test various whitespace patterns, quote styles, attribute ordering
-
Edge Cases
- Malformed HTML
- Mixed case tags and attributes
- Attributes without quotes
- Multiple encodings in same document
Testing Process
-
Research first: Before writing tests, search for:
- OWASP XSS Prevention Cheat Sheet
- XSS Filter Evasion Cheat Sheet
- Known XSS polyglots
- Browser-specific attack vectors
-
Create adversarial tests: Do not rely solely on self-created test cases
- Use external comprehensive test suites
- Include vectors that have bypassed filters historically
-
Test clean content preservation: Equal priority to security testing
- Create diverse clean HTML samples
- Verify no modifications occur
- Check whitespace, comments, attribute order
Common Pitfalls
1. Incomplete Event Handler Lists
Mistake: Hardcoding only common event handlers like onclick, onload, onerror.
Solution: Research and include ALL valid HTML event handlers, including deprecated and browser-specific ones.
2. Ignoring CSS Attack Vectors
Mistake: Focusing only on JavaScript while ignoring CSS-based XSS.
Solution: Filter <style> tags, dangerous CSS properties, and style attributes with expressions.
3. Missing Protocol Handlers
Mistake: Only filtering javascript: protocol.
Solution: Also filter vbscript:, data: URIs with dangerous content, and handle encoded protocol names.
4. Format Alteration with Parsers
Mistake: Using HTML parsers when format preservation is required. Solution: If format preservation is critical, use regex-based surgical removal or verify parser output matches input formatting.
5. Self-Validating Tests
Mistake: Creating test cases that match implementation capabilities rather than real attack vectors. Solution: Use external, adversarial test suites created by security researchers.
6. Quote and Encoding Handling
Mistake: Not handling HTML entities in attributes (", ').
Solution: Consider how encoded characters in attributes might bypass filters.
7. Forgetting Meta Refresh
Mistake: Not filtering <meta http-equiv="refresh"> with dangerous URLs.
Solution: Include meta tags in the filtering scope, especially those with data: or javascript: URLs.
8. Ignoring External Resources
Mistake: Not filtering <link>, <object>, <embed> tags.
Solution: Evaluate whether these tags can load or execute dangerous content.
Implementation Checklist
Before considering the implementation complete:
- Researched comprehensive XSS attack vector lists
- Implemented filtering for ALL event handlers (not just common ones)
- Handled script tags and noscript abuse
- Filtered javascript:, vbscript:, and dangerous data: URIs
- Addressed CSS-based attacks (style tags, expressions, bindings)
- Handled meta refresh attacks
- Considered link, object, embed, applet tags
- Handled SVG-based attacks
- Accounted for encoding variations
- Tested with external XSS test suites
- Verified clean HTML files remain unchanged
- Tested format preservation (whitespace, quotes, ordering)