Semgrep Static Analysis

When to Use Semgrep

Ideal scenarios:

  • Quick security scans (minutes, not hours)
  • Pattern-based vulnerability detection
  • Enforcing coding standards and best practices
  • Finding known vulnerability patterns (OWASP Top 10, CWE Top 25)
  • Intra-file taint analysis and data flow tracking
  • Custom rule development for specific code patterns
  • First-pass security analysis before deeper tools
  • CI/CD security gates for fast feedback
  • Multi-language security scanning

Complements other tools:

  • Use before manual code review to catch common patterns
  • Combine with SARIF Issue Reporter for detailed findings
  • Use alongside CodeQL for comprehensive coverage
  • Pair with dependency scanners (OSV-Scanner, Depscan)

Consider CodeQL instead when:

  • You need interprocedural taint tracking across files
  • You need complex data flow analysis across modules
  • You are analyzing custom proprietary frameworks with deep integration

When NOT to Use

Do NOT use this skill for:

  • Complex interprocedural data flow analysis (use CodeQL instead)
  • Binary analysis or compiled code without source
  • Custom deep semantic analysis requiring AST/CFG traversal
  • Tracking taint across many function boundaries and files
  • Secrets detection (use Gitleaks)
  • Dependency vulnerability scanning (use OSV-Scanner or Depscan)
  • IaC security analysis (use KICS)
  • API endpoint discovery (use Noir)

Installation

# pip
python3 -m pip install semgrep

# pipx (recommended)
pipx install semgrep

# Homebrew
brew install semgrep

# Docker
docker pull semgrep/semgrep:latest
docker run --rm -v "${PWD}:/src" semgrep/semgrep semgrep --config auto /src

# Update
pip install --upgrade semgrep

# Verify
semgrep --version

Core Workflow

1. Quick Scan

semgrep --config auto .                    # Auto-detect rules
semgrep --config p/default --metrics=off . # Disable telemetry for proprietary code (--config auto needs metrics on, so name a ruleset)

2. Use Rulesets

semgrep --config p/<RULESET> .             # Single ruleset
semgrep --config p/security-audit --config p/trailofbits .  # Multiple

Ruleset               Description
p/default             General security and code quality
p/security-audit      Comprehensive security rules
p/owasp-top-ten       OWASP Top 10 vulnerabilities
p/cwe-top-25          CWE Top 25 vulnerabilities
p/r2c-security-audit  r2c security audit rules
p/trailofbits         Trail of Bits security rules
p/python              Python-specific
p/javascript          JavaScript-specific
p/golang              Go-specific

3. Output Formats

# SARIF output (for CI/CD)
semgrep --config p/security-audit --sarif -o results.sarif .

# JSON output
semgrep --config p/security-audit --json -o results.json .

# Text output with dataflow traces
semgrep --config p/security-audit --dataflow-traces .

# JUnit XML
semgrep --config p/security-audit --junit-xml -o results.xml .

# GitLab SAST format
semgrep --config p/security-audit --gitlab-sast -o gl-sast-report.json .

# Vim quickfix
semgrep --config p/security-audit --vim .
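
The JSON report is the easiest format to post-process. A minimal sketch, assuming the report keeps findings under `results` with `check_id` and `extra.severity` fields; `summarize` and `SAMPLE` are hypothetical names, and the sample is hand-written rather than real scan output:

```python
import json
from collections import Counter


def summarize(report: dict) -> Counter:
    """Count findings per (severity, rule ID) in a parsed Semgrep JSON report."""
    counts: Counter = Counter()
    for finding in report.get("results", []):
        severity = finding.get("extra", {}).get("severity", "UNKNOWN")
        counts[(severity, finding.get("check_id", "unknown-rule"))] += 1
    return counts


# Hand-written stand-in for results.json, trimmed to the fields used above.
SAMPLE = json.loads("""
{
  "results": [
    {"check_id": "hardcoded-password", "path": "app.py",
     "start": {"line": 12}, "extra": {"severity": "ERROR"}},
    {"check_id": "hardcoded-password", "path": "settings.py",
     "start": {"line": 3}, "extra": {"severity": "ERROR"}},
    {"check_id": "sql-injection", "path": "db.py",
     "start": {"line": 40}, "extra": {"severity": "WARNING"}}
  ]
}
""")

for (severity, rule_id), count in sorted(summarize(SAMPLE).items()):
    print(f"{severity:8} {rule_id:20} {count}")
```

For a real scan, load the file written by `--json -o results.json` with `json.load` and pass the resulting dict to `summarize`.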

4. Scan Specific Paths

# Single file
semgrep --config p/python app.py

# Specific directory
semgrep --config p/javascript src/

# Scan only test directories (--include narrows the scan to matching paths)
semgrep --config auto --include='**/test/**' .

# Exclude paths
semgrep --config auto --exclude='vendor' --exclude='node_modules' .

# Multiple languages
semgrep --config p/python --config p/javascript .

5. Advanced Features

# Pro Engine with interfile (cross-file) analysis (requires license)
semgrep --config p/security-audit --pro .

# Pro Engine limited to cross-function analysis within each file
semgrep --config p/security-audit --pro-intrafile .

# Disable telemetry (use a named ruleset; --config auto requires metrics)
semgrep --config p/security-audit --metrics=off .

# Verbose output
semgrep --config p/security-audit --verbose .

# Quiet mode (only show findings)
semgrep --config p/security-audit --quiet .

Writing Custom Rules

Basic Structure

rules:
  - id: hardcoded-password
    languages: [python]
    message: "Hardcoded password detected: $PASSWORD"
    severity: ERROR
    pattern: password = "$PASSWORD"
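
A rule that omits a required key fails at load time, so a quick pre-flight check can save a scan cycle. The sketch below is a hypothetical helper, not part of Semgrep: `missing_fields` and the key sets are assumptions modeled on the rules in this document, and it works on a rule already parsed into a dict. `semgrep --validate` remains the authoritative check.

```python
# Hypothetical pre-flight check for a rule already loaded into a dict
# (for example with a YAML parser). Not part of Semgrep itself.

REQUIRED_KEYS = {"id", "languages", "message", "severity"}
# A search-mode rule needs at least one of these top-level pattern operators.
PATTERN_KEYS = {"pattern", "patterns", "pattern-either", "pattern-regex"}
# A taint-mode rule needs both of these instead.
TAINT_KEYS = {"pattern-sources", "pattern-sinks"}


def missing_fields(rule: dict) -> list[str]:
    """Return a list of problems found in a single rule dict."""
    present = set(rule)
    problems = sorted(f"missing key: {key}" for key in REQUIRED_KEYS - present)
    if rule.get("mode") == "taint":
        problems += sorted(f"missing key: {key}" for key in TAINT_KEYS - present)
    elif not PATTERN_KEYS & present:
        problems.append("missing a pattern operator")
    return problems


rule = {
    "id": "hardcoded-password",
    "languages": ["python"],
    "message": "Hardcoded password detected: $PASSWORD",
    "severity": "ERROR",
    "pattern": 'password = "$PASSWORD"',
}
print(missing_fields(rule))
```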

Pattern Syntax

Syntax     Description            Example
...        Match anything         func(...)
$VAR       Capture metavariable   $FUNC($INPUT)
<... ...>  Deep expression match  <... user_input ...>

Pattern Operators

Operator                 Description
pattern                  Match exact pattern
patterns                 All must match (AND)
pattern-either           Any matches (OR)
pattern-not              Exclude matches
pattern-inside           Match only inside context
pattern-not-inside       Match only outside context
pattern-regex            Regex matching
metavariable-regex       Regex on captured value
metavariable-comparison  Compare values

Combining Patterns

rules:
  - id: sql-injection
    languages: [python]
    message: "Potential SQL injection"
    severity: ERROR
    patterns:
      - pattern-either:
          - pattern: cursor.execute($QUERY)
          - pattern: db.execute($QUERY)
      - pattern-not: cursor.execute("...", (...))
      - metavariable-regex:
          metavariable: $QUERY
          regex: .*\+.*|.*\.format\(.*|.*%.*
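
To see what the `metavariable-regex` above accepts, the same expression can be tried against sample query text in plain Python. This is a rough approximation, assuming the regex is applied from the start of the captured text (hence `re.match`); `is_flagged` and the sample strings are hypothetical:

```python
import re

# Regex copied from the sql-injection rule above.
QUERY_REGEX = re.compile(r".*\+.*|.*\.format\(.*|.*%.*")


def is_flagged(expression_text: str) -> bool:
    """Approximate metavariable-regex: match from the start of the captured text."""
    return QUERY_REGEX.match(expression_text) is not None


samples = [
    '"SELECT * FROM users WHERE id = " + user_id',          # concatenation
    '"SELECT * FROM users WHERE id = {}".format(user_id)',  # str.format
    '"SELECT * FROM users WHERE id = %s" % user_id',        # %-formatting
    '"SELECT * FROM users WHERE name LIKE \'a%\'"',         # constant string containing %
    '"SELECT * FROM users"',                                # constant string
    "query",                                                # string built elsewhere
]

for text in samples:
    print("FLAG" if is_flagged(text) else "skip", text)
```

The constant `LIKE` string is flagged although nothing is interpolated, and the bare `query` variable is skipped even if it was concatenated earlier. Both are inherent to regex-on-text matching.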

Taint Mode (Data Flow)

Simple pattern matching finds obvious cases:

# Pattern `os.system($CMD)` catches this:
os.system(user_input)  # Found

But misses indirect flows:

# Same pattern misses this:
cmd = user_input
processed = cmd.strip()
os.system(processed)  # Missed - no direct match

Taint mode tracks data through assignments and transformations:

  • Source: Where untrusted data enters (user_input)
  • Propagators: How it flows (cmd = ..., processed = ...)
  • Sanitizers: What makes it safe (shlex.quote())
  • Sink: Where it becomes dangerous (os.system())

rules:
  - id: command-injection
    languages: [python]
    message: "User input flows to command execution"
    severity: ERROR
    mode: taint
    pattern-sources:
      - pattern: request.args.get(...)
      - pattern: request.form[...]
      - pattern: request.json
    pattern-sinks:
      - pattern: os.system($SINK)
      - pattern: subprocess.call($SINK, shell=True)
      - pattern: subprocess.run($SINK, shell=True, ...)
    pattern-sanitizers:
      - pattern: shlex.quote(...)
      - pattern: int(...)

Full Rule with Metadata

rules:
  - id: flask-sql-injection
    languages: [python]
    message: "SQL injection: user input flows to query without parameterization"
    severity: ERROR
    metadata:
      cwe: "CWE-89: SQL Injection"
      owasp: "A03:2021 - Injection"
      confidence: HIGH
    mode: taint
    pattern-sources:
      - pattern: request.args.get(...)
      - pattern: request.form[...]
      - pattern: request.json
    pattern-sinks:
      - pattern: cursor.execute($QUERY)
      - pattern: db.execute($QUERY)
    pattern-sanitizers:
      - pattern: int(...)
    # Illustrative autofix: `params` is a placeholder and must be filled in by hand
    fix: cursor.execute($QUERY, (params,))

Testing Rules

Test File Format

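# rules/flask-sql-injection.py (base name must match the rule file, assumed here to be rules/flask-sql-injection.yaml)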
def test_vulnerable():
    user_input = request.args.get("id")
    # ruleid: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = " + user_input)

def test_safe():
    user_input = request.args.get("id")
    # ok: flask-sql-injection
    cursor.execute("SELECT * FROM users WHERE id = ?", (user_input,))

semgrep --test rules/
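
The annotations are plain comments, so the expectations in a test file can be listed without running Semgrep. A minimal sketch, assuming each `ruleid:` or `ok:` comment refers to the line directly below it; `expected_lines` and `TEST_FILE` are hypothetical names:

```python
import re

ANNOTATION = re.compile(r"#\s*(ruleid|ok):\s*([\w.-]+)")


def expected_lines(source: str) -> dict[str, dict[str, list[int]]]:
    """Map rule ID -> {'ruleid': [...], 'ok': [...]} using 1-based line numbers.

    Each annotation is assumed to refer to the line that follows it.
    """
    expectations: dict[str, dict[str, list[int]]] = {}
    for number, line in enumerate(source.splitlines(), start=1):
        match = ANNOTATION.search(line)
        if match:
            kind, rule_id = match.groups()
            entry = expectations.setdefault(rule_id, {"ruleid": [], "ok": []})
            entry[kind].append(number + 1)
    return expectations


TEST_FILE = "\n".join([
    "def test_vulnerable():",
    '    user_input = request.args.get("id")',
    "    # ruleid: flask-sql-injection",
    '    cursor.execute("SELECT * FROM users WHERE id = " + user_input)',
    "",
    "def test_safe():",
    '    user_input = request.args.get("id")',
    "    # ok: flask-sql-injection",
    '    cursor.execute("SELECT * FROM users WHERE id = ?", (user_input,))',
])
print(expected_lines(TEST_FILE))
```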

CI/CD Integration (GitHub Actions)

name: Semgrep

on:
  push:
    branches: [main]
  pull_request:
  schedule:
    - cron: '0 0 1 * *'  # Monthly

jobs:
  semgrep:
    runs-on: ubuntu-latest
    container:
      image: semgrep/semgrep

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Required for diff-aware scanning

      - name: Run Semgrep
        run: |
          if [ "${{ github.event_name }}" = "pull_request" ]; then
            semgrep ci --baseline-commit ${{ github.event.pull_request.base.sha }}
          else
            semgrep ci
          fi
        env:
          SEMGREP_RULES: >-
            p/security-audit
            p/owasp-top-ten
            p/trailofbits

Configuration

.semgrepignore

tests/fixtures/
**/testdata/
generated/
vendor/
node_modules/

Suppress False Positives

password = get_from_vault()  # nosemgrep: hardcoded-password
dangerous_but_safe()  # nosemgrep
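
Blanket `nosemgrep` comments silence every rule on that line, so it is worth auditing them periodically. A small sketch of such an audit, assuming rule IDs follow the comment after a colon and may be comma-separated; `find_suppressions` is a hypothetical helper:

```python
import re

NOSEMGREP = re.compile(r"nosemgrep(?::\s*(?P<ids>[\w.-]+(?:\s*,\s*[\w.-]+)*))?")


def find_suppressions(source: str) -> list[tuple[int, list[str]]]:
    """Return (line number, rule IDs) per nosemgrep comment; [] means blanket."""
    found = []
    for number, line in enumerate(source.splitlines(), start=1):
        match = NOSEMGREP.search(line)
        if match:
            ids = match.group("ids")
            found.append((number, re.split(r"\s*,\s*", ids) if ids else []))
    return found


code = "\n".join([
    "password = get_from_vault()  # nosemgrep: hardcoded-password",
    "dangerous_but_safe()  # nosemgrep",
])
for line_number, rule_ids in find_suppressions(code):
    print(line_number, rule_ids or "BLANKET SUPPRESSION")
```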

Performance

semgrep --config rules/ --time .    # Check rule performance
ulimit -n 4096                       # Increase file descriptors for large codebases

Path Filtering in Rules

rules:
  - id: my-rule
    paths:
      include: [src/]
      exclude: [src/generated/]

Common Use Cases

1. Comprehensive Security Audit

# Multi-ruleset scan with SARIF output
semgrep scan \
  --config p/security-audit \
  --config p/owasp-top-ten \
  --config p/cwe-top-25 \
  --sarif -o security-audit.sarif \
  .

2. Language-Specific Deep Scan

# Python with taint mode
semgrep scan \
  --config p/python \
  --config p/flask \
  --config p/django \
  --dataflow-traces \
  --sarif -o python-security.sarif \
  ./backend

# JavaScript/TypeScript
semgrep scan \
  --config p/javascript \
  --config p/typescript \
  --config p/react \
  --sarif -o js-security.sarif \
  ./frontend

3. Custom Rules with Existing Rulesets

# Combine custom and community rules
semgrep scan \
  --config ./custom-rules \
  --config p/security-audit \
  --sarif -o combined-scan.sarif \
  .

4. CI/CD Diff Scanning

# Scan only changed files (PR context); --diff-filter=d skips deleted files
git diff --name-only --diff-filter=d origin/main...HEAD | \
  xargs semgrep scan --config p/security-audit --sarif -o diff-scan.sarif
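
When the changed-file list is empty, or contains paths that no longer exist, piping it straight into `semgrep scan` can scan more than intended or fail. A hedged sketch of the same step as a Python helper that skips missing paths and returns nothing when there is nothing to scan; `build_scan_command` and its `is_file` parameter are hypothetical:

```python
import os
from typing import Callable, Optional


def build_scan_command(
    changed: list[str],
    config: str = "p/security-audit",
    output: str = "diff-scan.sarif",
    is_file: Callable[[str], bool] = os.path.isfile,
) -> Optional[list[str]]:
    """Return the semgrep argv for the changed files, or None if nothing to scan."""
    # Drop blank entries and paths that were deleted or renamed away.
    targets = [path for path in changed if path and is_file(path)]
    if not targets:
        return None
    return ["semgrep", "scan", "--config", config, "--sarif", "-o", output, *targets]


# Demo with a fake existence check so the example does not touch the file system.
changed = ["src/app.py", "docs/removed.md", ""]
print(build_scan_command(changed, is_file=lambda path: path == "src/app.py"))
```

In a pipeline, `changed` would come from the lines printed by `git diff --name-only`, and a non-`None` result would be handed to `subprocess.run`.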

Understanding Output

SARIF Structure

Semgrep's SARIF v2.1.0 output includes:

  • Rules: Each Semgrep rule with metadata
  • Results: Specific code locations matching patterns
  • Properties:
    • Severity: ERROR, WARNING, INFO
    • CWE and OWASP mappings
    • Confidence levels
    • Fix suggestions (if available)
    • Dataflow traces (if enabled)
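
The rules and results described above sit under `runs` in the SARIF document. A minimal sketch of walking that structure, using only field names from the SARIF 2.1.0 format; `iter_sarif_findings` and the hand-written `SAMPLE` are hypothetical, and real Semgrep output carries many more fields:

```python
def iter_sarif_findings(sarif: dict):
    """Yield (rule ID, level, file URI, start line) for each SARIF result."""
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            # Use the first reported location; tolerate results without one.
            location = (result.get("locations") or [{}])[0].get("physicalLocation", {})
            yield (
                result.get("ruleId"),
                result.get("level", "warning"),  # assumed default when level is absent
                location.get("artifactLocation", {}).get("uri"),
                location.get("region", {}).get("startLine"),
            )


# Hand-written stand-in for results.sarif, trimmed to the fields used above.
SAMPLE = {
    "version": "2.1.0",
    "runs": [{
        "tool": {"driver": {"name": "Semgrep", "rules": [{"id": "command-injection"}]}},
        "results": [{
            "ruleId": "command-injection",
            "level": "error",
            "message": {"text": "User input flows to command execution"},
            "locations": [{"physicalLocation": {
                "artifactLocation": {"uri": "app.py"},
                "region": {"startLine": 17},
            }}],
        }],
    }],
}

for finding in iter_sarif_findings(SAMPLE):
    print(finding)
```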

Result Categories

Severity  Meaning
ERROR     High-confidence security vulnerability
WARNING   Potential security issue requiring review
INFO      Code smell or best practice violation
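
These severities lend themselves to a simple CI gate: fail the build only when a finding reaches a chosen level. A sketch under stated assumptions: the three-level ranking mirrors the table above, findings are simplified dicts with a `severity` key, and `blocking_findings` is a hypothetical helper rather than a Semgrep feature.

```python
SEVERITY_RANK = {"INFO": 0, "WARNING": 1, "ERROR": 2}


def blocking_findings(findings: list[dict], threshold: str = "ERROR") -> list[dict]:
    """Return the findings whose severity is at or above the threshold."""
    minimum = SEVERITY_RANK[threshold]
    return [
        finding for finding in findings
        # Unknown severities rank below INFO here; tighten this if that is too lenient.
        if SEVERITY_RANK.get(finding.get("severity", ""), -1) >= minimum
    ]


findings = [
    {"rule": "command-injection", "severity": "ERROR"},
    {"rule": "weak-hash", "severity": "WARNING"},
    {"rule": "unused-import", "severity": "INFO"},
]
blockers = blocking_findings(findings, threshold="WARNING")
print([finding["rule"] for finding in blockers])
```

A CI step could exit non-zero whenever the returned list is non-empty.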

Autofix

# Show available fixes
semgrep scan --config p/security-audit --autofix --dryrun .

# Apply fixes automatically
semgrep scan --config p/security-audit --autofix .

# Review fixes before applying
semgrep scan --config p/security-audit --autofix --dryrun . | less

Third-Party Rules

# Trail of Bits rules
git clone https://github.com/trailofbits/semgrep-rules.git
semgrep scan -f semgrep-rules/rules --sarif -o results.sarif .

# Semgrep Registry
semgrep scan --config "r/trailofbits" .

# Custom remote rules
semgrep scan --config https://example.com/custom-rules.yaml .

Advanced Rule Development

Using Metavariable Propagation

rules:
  - id: context-aware-xss
    languages: [javascript]
    message: "XSS: User input flows to innerHTML"
    severity: ERROR
    mode: taint
    pattern-sources:
      - pattern: req.query.$PARAM
    pattern-propagators:
      # `from` and `to` must each name a metavariable bound by `pattern`
      - pattern: $Y = $X.toString()
        from: $X
        to: $Y
      - pattern: '$Y = `${$X}`'
        from: $X
        to: $Y
    pattern-sinks:
      - pattern: $ELEMENT.innerHTML = $DATA
    pattern-sanitizers:
      - pattern: DOMPurify.sanitize($X)

Focus Metavariables

rules:
  - id: sql-injection-advanced
    languages: [python]
    message: "SQL injection via string formatting"
    severity: ERROR
    patterns:
      - pattern: $CURSOR.execute($QUERY)
      - metavariable-regex:
          metavariable: $QUERY
          regex: .*(\+|format|%).*
      - focus-metavariable: $QUERY

Performance Optimization

# Limit to specific file types
semgrep scan --include='*.py' --include='*.js' .

# Increase timeout for large files
semgrep scan --timeout 60 .

# Use baseline for faster incremental scans
semgrep scan --baseline-commit HEAD~1 .

# Parallel processing (default uses all CPUs)
semgrep scan --jobs 4 .

# Disable expensive rules
semgrep scan --config p/security-audit --exclude-rule 'expensive-rule-id' .

Supported Languages

Semgrep supports 30+ languages:

  • Web: JavaScript, TypeScript, JSX, TSX, HTML
  • Backend: Python, Go, Java, Kotlin, Scala, C#
  • Systems: C, C++, Rust
  • Mobile: Swift, Kotlin, Java, Objective-C
  • Scripting: Ruby, PHP, Bash, Lua, Perl
  • Infrastructure: Terraform, Dockerfile, YAML, JSON
  • Data: SQL (generic)
  • Other: Elixir, Clojure, Solidity, Apex, R

Semgrep Pro vs Community Edition

Feature              Community  Pro
Pattern matching     Yes        Yes
Intra-file taint     Yes        Yes
Custom rules         Yes        Yes
SARIF output         Yes        Yes
Cross-file analysis  No         Yes
Interfile taint      No         Yes
Supply chain         No         Yes
Secrets detection    No         Yes
Assistant (AI)       No         Yes

Troubleshooting

Common Issues

# Rule parsing errors
semgrep scan --validate --config custom-rules.yaml

# Timeout on large files
semgrep scan --timeout 120 .

# Memory issues
semgrep scan --max-memory 4000 .  # MB

# Debug mode
semgrep scan --debug --config p/security-audit .

Rule Testing

# Test rules against test files
semgrep scan --test rules/

# Validate rule syntax
semgrep scan --validate --config rules/my-rule.yaml

# Benchmark rules
semgrep scan --time --config rules/ test-codebase/

Limitations

  • Cross-file analysis: Community Edition tracks taint within a single file only
  • Pattern-based: Can't understand complex business logic
  • Performance: Large codebases with many rules can be slow
  • False positives: Regex patterns may over-match
  • Language gaps: Some languages have limited rule coverage

Rationalizations to Reject

Shortcut                                  Why It's Wrong
"Semgrep found nothing, code is clean"    Semgrep is pattern-based; it can't track complex data flow across functions
"I wrote a rule, so we're covered"        Rules need testing with semgrep --test; false negatives are silent
"Taint mode catches injection"            Only if you defined all sources, sinks, AND sanitizers correctly
"Pro rules are comprehensive"             Pro rules are good but not exhaustive; supplement with custom rules for your codebase
"Too many findings = noisy tool"          High finding count often means real problems; tune rules, don't disable them
