NYC
skills/letta-ai/skills/filter-js-from-html

filter-js-from-html

SKILL.md

Filter JavaScript from HTML

Overview

This skill provides guidance for tasks that require removing JavaScript and XSS attack vectors from HTML content while preserving the original formatting exactly. The key challenge is balancing comprehensive security filtering with format preservation.

Critical Requirements Analysis

Before implementation, identify and prioritize these requirements:

  1. Security completeness: All XSS vectors must be removed
  2. Format preservation: Output must be functionally identical to input except for harmful content removal
  3. Clean content handling: Files without XSS content should remain completely unchanged

These requirements often conflict - comprehensive parsing may alter formatting, while simple string replacement may miss attack vectors.

Approach Selection

Option 1: Regex-Based Surgical Removal (Recommended for Format Preservation)

When the task explicitly requires preserving original formatting, prefer regex-based approaches that surgically remove only the dangerous content.

Advantages:

  • Preserves whitespace, attribute ordering, quote styles exactly
  • Does not reconstruct or reformat HTML
  • Output matches input character-for-character except for removed content

Considerations:

  • Requires careful pattern construction to avoid partial matches
  • Must handle various encodings and obfuscation techniques
  • Test patterns against comprehensive XSS vector lists

Option 2: HTML Parser-Based Filtering

When format preservation is less critical or when dealing with malformed HTML.

Considerations:

  • HTML parsers inherently reconstruct output, changing formatting
  • May normalize attribute quotes, whitespace, tag casing
  • Better for malformed HTML that regex cannot reliably parse
  • If using this approach, verify that clean HTML files remain unchanged

Comprehensive XSS Vector Checklist

Before implementing, research and account for ALL of these attack categories:

1. Script Execution Tags

  • <script> tags (including variations with attributes)
  • <noscript> abuse cases

2. Event Handlers (Comprehensive List Required)

Common handlers:

  • onclick, onload, onerror, onmouseover, onfocus, onblur

Frequently missed handlers:

  • onlayoutcomplete, ontimeerror, onselectionchange
  • onrowsinserted, onrowsdelete, onrowexit, onrowenter
  • oncellchange, ondataavailable, ondatasetchanged, ondatasetcomplete
  • onbeforeupdate, onafterupdate, onerrorupdate
  • onfilterchange, onpropertychange, onreadystatechange
  • onbeforeprint, onafterprint, onbeforeunload
  • oncontextmenu, ondrag, ondragend, ondragenter, ondragleave
  • ondragover, ondragstart, ondrop
  • onhashchange, oninput, oninvalid, onpageshow, onpagehide
  • onpopstate, onresize, onstorage, onwheel

Action: Search for comprehensive event handler lists (e.g., MDN, OWASP) rather than relying on memory.

3. JavaScript URL Protocol

  • javascript: in href, src, action, formaction, data, poster attributes
  • Case variations: JavaScript:, JAVASCRIPT:, JaVaScRiPt:
  • Encoded variations: &#106;avascript:, java&#115;cript:

4. Other Dangerous Protocols

  • vbscript: (IE legacy)
  • data: URIs with script content: data:text/html,<script>...</script>
  • data:text/html;base64,... encoded payloads

5. CSS-Based Attacks

  • <style> tags with dangerous properties
  • -moz-binding (Firefox legacy)
  • expression() (IE legacy)
  • behavior: property
  • @import with javascript or data URIs

6. Meta Tag Attacks

  • <meta http-equiv="refresh" content="0;url=data:text/html,...">
  • <meta http-equiv="refresh" content="0;url=javascript:...">

7. External Resource Loading

  • <link> tags with dangerous href values
  • <object> tags with data attributes
  • <embed> tags with src attributes
  • <applet> tags (legacy)
  • <iframe> with src or srcdoc containing scripts

8. SVG-Based Attacks

  • <svg onload="..."> and other SVG event handlers
  • <svg><script>...</script></svg>
  • SVG <use> with external references

9. Encoding and Obfuscation

  • HTML entity encoding: &#60;script&#62;
  • URL encoding: %3Cscript%3E
  • UTF-7 encoding attacks
  • Null byte injection: <scr\0ipt>
  • Unicode variations

10. HTML Comment Exploits

  • Conditional comments: <!--[if IE]><script>...<![endif]-->
  • Nested comment breaking

Verification Strategy

Test Categories (All Required)

  1. XSS Attack Vectors

    • Use established XSS test suites (OWASP XSS Filter Evasion Cheat Sheet)
    • Test XSS polyglots that combine multiple techniques
    • Include lesser-known event handlers in tests
  2. Format Preservation

    • Provide clean HTML files with varied formatting
    • Verify byte-for-byte identical output for clean files
    • Test various whitespace patterns, quote styles, attribute ordering
  3. Edge Cases

    • Malformed HTML
    • Mixed case tags and attributes
    • Attributes without quotes
    • Multiple encodings in same document

Testing Process

  1. Research first: Before writing tests, search for:

    • OWASP XSS Prevention Cheat Sheet
    • XSS Filter Evasion Cheat Sheet
    • Known XSS polyglots
    • Browser-specific attack vectors
  2. Create adversarial tests: Do not rely solely on self-created test cases

    • Use external comprehensive test suites
    • Include vectors that have bypassed filters historically
  3. Test clean content preservation: Equal priority to security testing

    • Create diverse clean HTML samples
    • Verify no modifications occur
    • Check whitespace, comments, attribute order

Common Pitfalls

1. Incomplete Event Handler Lists

Mistake: Hardcoding only common event handlers like onclick, onload, onerror. Solution: Research and include ALL valid HTML event handlers, including deprecated and browser-specific ones.

2. Ignoring CSS Attack Vectors

Mistake: Focusing only on JavaScript while ignoring CSS-based XSS. Solution: Filter <style> tags, dangerous CSS properties, and style attributes with expressions.

3. Missing Protocol Handlers

Mistake: Only filtering javascript: protocol. Solution: Also filter vbscript:, data: URIs with dangerous content, and handle encoded protocol names.

4. Format Alteration with Parsers

Mistake: Using HTML parsers when format preservation is required. Solution: If format preservation is critical, use regex-based surgical removal or verify parser output matches input formatting.

5. Self-Validating Tests

Mistake: Creating test cases that match implementation capabilities rather than real attack vectors. Solution: Use external, adversarial test suites created by security researchers.

6. Quote and Encoding Handling

Mistake: Not handling HTML entities in attributes (&quot;, &#39;). Solution: Consider how encoded characters in attributes might bypass filters.

7. Forgetting Meta Refresh

Mistake: Not filtering <meta http-equiv="refresh"> with dangerous URLs. Solution: Include meta tags in the filtering scope, especially those with data: or javascript: URLs.

8. Ignoring External Resources

Mistake: Not filtering <link>, <object>, <embed> tags. Solution: Evaluate whether these tags can load or execute dangerous content.

Implementation Checklist

Before considering the implementation complete:

  • Researched comprehensive XSS attack vector lists
  • Implemented filtering for ALL event handlers (not just common ones)
  • Handled script tags and noscript abuse
  • Filtered javascript:, vbscript:, and dangerous data: URIs
  • Addressed CSS-based attacks (style tags, expressions, bindings)
  • Handled meta refresh attacks
  • Considered link, object, embed, applet tags
  • Handled SVG-based attacks
  • Accounted for encoding variations
  • Tested with external XSS test suites
  • Verified clean HTML files remain unchanged
  • Tested format preservation (whitespace, quotes, ordering)
Weekly Installs
20
Repository
letta-ai/skills
First Seen
Jan 24, 2026
Installed on
claude-code14
gemini-cli13
codex13
opencode13
antigravity12
github-copilot10