Filter JavaScript from HTML

Overview

This skill provides guidance for tasks that require removing JavaScript and XSS attack vectors from HTML content while preserving the original formatting exactly. The key challenge is balancing comprehensive security filtering with format preservation.

Critical Requirements Analysis

Before implementation, identify and prioritize these requirements:

Security completeness: All XSS vectors must be removed
Format preservation: Output must be functionally identical to input except for harmful content removal
Clean content handling: Files without XSS content should remain completely unchanged

These requirements often conflict - comprehensive parsing may alter formatting, while simple string replacement may miss attack vectors.

Approach Selection

Option 1: Regex-Based Surgical Removal (Recommended for Format Preservation)

When the task explicitly requires preserving original formatting, prefer regex-based approaches that surgically remove only the dangerous content.

Advantages:

Preserves whitespace, attribute ordering, quote styles exactly
Does not reconstruct or reformat HTML
Output matches input character-for-character except for removed content

Considerations:

Requires careful pattern construction to avoid partial matches
Must handle various encodings and obfuscation techniques
Test patterns against comprehensive XSS vector lists

Option 2: HTML Parser-Based Filtering

When format preservation is less critical or when dealing with malformed HTML.

Considerations:

HTML parsers inherently reconstruct output, changing formatting
May normalize attribute quotes, whitespace, tag casing
Better for malformed HTML that regex cannot reliably parse
If using this approach, verify that clean HTML files remain unchanged

Comprehensive XSS Vector Checklist

Before implementing, research and account for ALL of these attack categories:

1. Script Execution Tags

<script> tags (including variations with attributes)
<noscript> abuse cases

2. Event Handlers (Comprehensive List Required)

Common handlers:

onclick, onload, onerror, onmouseover, onfocus, onblur

Frequently missed handlers:

onlayoutcomplete, ontimeerror, onselectionchange
onrowsinserted, onrowsdelete, onrowexit, onrowenter
oncellchange, ondataavailable, ondatasetchanged, ondatasetcomplete
onbeforeupdate, onafterupdate, onerrorupdate
onfilterchange, onpropertychange, onreadystatechange
onbeforeprint, onafterprint, onbeforeunload
oncontextmenu, ondrag, ondragend, ondragenter, ondragleave
ondragover, ondragstart, ondrop
onhashchange, oninput, oninvalid, onpageshow, onpagehide
onpopstate, onresize, onstorage, onwheel

Action: Search for comprehensive event handler lists (e.g., MDN, OWASP) rather than relying on memory.

3. JavaScript URL Protocol

javascript: in href, src, action, formaction, data, poster attributes
Case variations: JavaScript:, JAVASCRIPT:, JaVaScRiPt:
Encoded variations: javascript:, javascript:

4. Other Dangerous Protocols

vbscript: (IE legacy)
data: URIs with script content: data:text/html,<script>...</script>
data:text/html;base64,... encoded payloads

5. CSS-Based Attacks

<style> tags with dangerous properties
-moz-binding (Firefox legacy)
expression() (IE legacy)
behavior: property
@import with javascript or data URIs

6. Meta Tag Attacks

<meta http-equiv="refresh" content="0;url=data:text/html,...">
<meta http-equiv="refresh" content="0;url=javascript:...">

7. External Resource Loading

<link> tags with dangerous href values
<object> tags with data attributes
<embed> tags with src attributes
<applet> tags (legacy)
<iframe> with src or srcdoc containing scripts

8. SVG-Based Attacks

<svg onload="..."> and other SVG event handlers
<svg><script>...</script></svg>
SVG <use> with external references

9. Encoding and Obfuscation

HTML entity encoding: <script>
URL encoding: %3Cscript%3E
UTF-7 encoding attacks
Null byte injection: <scr\0ipt>
Unicode variations

10. HTML Comment Exploits

Conditional comments: 
Nested comment breaking

Verification Strategy

Test Categories (All Required)

XSS Attack Vectors
- Use established XSS test suites (OWASP XSS Filter Evasion Cheat Sheet)
- Test XSS polyglots that combine multiple techniques
- Include lesser-known event handlers in tests
Format Preservation
- Provide clean HTML files with varied formatting
- Verify byte-for-byte identical output for clean files
- Test various whitespace patterns, quote styles, attribute ordering
Edge Cases
- Malformed HTML
- Mixed case tags and attributes
- Attributes without quotes
- Multiple encodings in same document

Testing Process

Research first: Before writing tests, search for:
- OWASP XSS Prevention Cheat Sheet
- XSS Filter Evasion Cheat Sheet
- Known XSS polyglots
- Browser-specific attack vectors
Create adversarial tests: Do not rely solely on self-created test cases
- Use external comprehensive test suites
- Include vectors that have bypassed filters historically
Test clean content preservation: Equal priority to security testing
- Create diverse clean HTML samples
- Verify no modifications occur
- Check whitespace, comments, attribute order

Common Pitfalls

1. Incomplete Event Handler Lists

Mistake: Hardcoding only common event handlers like onclick, onload, onerror. Solution: Research and include ALL valid HTML event handlers, including deprecated and browser-specific ones.

2. Ignoring CSS Attack Vectors

Mistake: Focusing only on JavaScript while ignoring CSS-based XSS. Solution: Filter <style> tags, dangerous CSS properties, and style attributes with expressions.

3. Missing Protocol Handlers

Mistake: Only filtering javascript: protocol. Solution: Also filter vbscript:, data: URIs with dangerous content, and handle encoded protocol names.

4. Format Alteration with Parsers

Mistake: Using HTML parsers when format preservation is required. Solution: If format preservation is critical, use regex-based surgical removal or verify parser output matches input formatting.

5. Self-Validating Tests

Mistake: Creating test cases that match implementation capabilities rather than real attack vectors. Solution: Use external, adversarial test suites created by security researchers.

6. Quote and Encoding Handling

Mistake: Not handling HTML entities in attributes (", '). Solution: Consider how encoded characters in attributes might bypass filters.

7. Forgetting Meta Refresh

Mistake: Not filtering <meta http-equiv="refresh"> with dangerous URLs. Solution: Include meta tags in the filtering scope, especially those with data: or javascript: URLs.

8. Ignoring External Resources

Mistake: Not filtering <link>, <object>, <embed> tags. Solution: Evaluate whether these tags can load or execute dangerous content.

Implementation Checklist

Before considering the implementation complete:

filter-js-from-html