Failure-Oriented Instrumentation

Strategically instrument code to capture high-signal runtime data for debugging failures, focusing only on suspicious regions rather than comprehensive instrumentation.

Workflow

1. Analyze the Failure

Gather and analyze failure information:

Error message and exception type
Stack trace showing call chain
Failure location (file, line, function)
Reproduction steps or test case
Expected vs actual behavior

Identify suspicious code regions:

Functions in the stack trace
Code paths leading to the failure
Variables involved in the error
Conditional branches that may affect the outcome

2. Determine Instrumentation Strategy

Based on the failure type, choose instrumentation targets:

For crashes/exceptions:

Function entry/exit in stack trace
Variable values before the crash
Conditional branches leading to error path
Exception handling blocks

For incorrect results:

Variable values at key computation points
Conditional branch decisions
Loop iterations and state changes
Function return values

For performance issues:

Timing information for slow operations
Loop iteration counts
Resource allocation/deallocation
Function call frequency

For intermittent failures:

State variables that may cause non-determinism
Thread/concurrency information
External dependencies (I/O, network, time)
Retry logic and error recovery paths

3. Select Instrumentation Patterns

Choose appropriate patterns based on language and context. See language-specific references:

Python: references/python.md
JavaScript/TypeScript: references/javascript.md
Java: references/java.md
C/C++: references/c-cpp.md

Common patterns:

Function entry/exit logging
Variable value tracking
Conditional branch tracking
Loop iteration monitoring
Timing measurements
Assertions for invariants

4. Insert Instrumentation

Apply instrumentation to identified code regions:

Minimal approach (start here):

Instrument only the immediate failure location
Add 2-3 key variable logs
Track the critical conditional branch

Expanded approach (if minimal is insufficient):

Instrument entire call chain from stack trace
Add comprehensive variable tracking
Monitor all branches and loops in suspicious functions

Principles:

Start minimal, expand as needed
Focus on high-signal data (variables that affect control flow)
Avoid instrumenting stable, well-tested code
Minimize performance overhead

5. Run and Collect Data

Execute the instrumented code:

Run the failing test case or reproduction steps
Capture all instrumentation output (logs, traces)
Ensure instrumentation doesn't change behavior (except performance)

6. Analyze Results

Review captured data to identify root cause:

Compare variable values against expectations
Identify which branch was taken and why
Look for unexpected state transitions
Check timing for performance issues
Correlate multiple data points

Quick Start Examples

Example 1: NullPointerException in Java

Failure:

NullPointerException at UserService.java:45
  at UserService.processUser(UserService.java:45)
  at UserController.handleRequest(UserController.java:23)

Instrumentation:

public void processUser(String userId) {
    logger.debug("ENTER processUser: userId={}", userId);

    User user = userRepository.findById(userId);
    logger.debug("Retrieved user: {}", user);  // Check if null

    if (user == null) {
        logger.warn("User not found for userId={}", userId);
        return;
    }

    String email = user.getEmail();  // Line 45 - was failing here
    logger.debug("User email: {}", email);

    sendNotification(email);
}

Example 2: Incorrect Calculation in Python

Failure:

AssertionError: Expected 100, got 95
  at test_calculate_total (test_billing.py:12)
  at calculate_total (billing.py:34)

Instrumentation:

def calculate_total(items, discount_rate):
    logger.debug(f"ENTER calculate_total: items={items}, discount_rate={discount_rate}")

    subtotal = sum(item.price for item in items)
    logger.debug(f"Subtotal: {subtotal}")

    if discount_rate > 0:
        logger.debug(f"Applying discount: rate={discount_rate}")
        discount = subtotal * discount_rate
        logger.debug(f"Discount amount: {discount}")
    else:
        logger.debug("No discount applied")
        discount = 0

    total = subtotal - discount
    logger.debug(f"Final total: {total}")

    return total

Example 3: Intermittent Test Failure in JavaScript

Failure:

Test "should process async data" fails randomly
Expected: data processed
Actual: timeout

Instrumentation:

async function processAsyncData(dataId) {
    console.log(`ENTER processAsyncData: dataId=${dataId}, time=${Date.now()}`);

    const data = await fetchData(dataId);
    console.log(`Fetched data: ${JSON.stringify(data)}, time=${Date.now()}`);

    if (!data) {
        console.warn(`No data returned for dataId=${dataId}`);
        return null;
    }

    const processed = await processData(data);
    console.log(`Processed data: ${JSON.stringify(processed)}, time=${Date.now()}`);

    return processed;
}

Instrumentation Guidelines

What to Instrument

High priority:

Functions in the stack trace
Variables mentioned in error messages
Conditional branches near the failure
Loop conditions and iteration variables
Function parameters and return values

Medium priority:

State variables that affect control flow
Resource allocations (memory, files, connections)
External dependencies (API calls, database queries)
Error handling and recovery logic

Low priority:

Stable utility functions
Simple getters/setters
Well-tested library code
Performance-critical hot paths (unless investigating performance)

What NOT to Instrument

Code unrelated to the failure
Third-party libraries (unless suspected)
Trivial operations (assignments, simple math)
Code that would generate excessive output
Security-sensitive operations (passwords, tokens)

Instrumentation Best Practices

Use appropriate log levels: DEBUG for detailed tracing, INFO for key events, ERROR for exceptions
Include context: Variable names, values, types, and relevant state
Mark entry/exit: Clear boundaries for function execution
Timestamp when relevant: For timing and ordering issues
Avoid side effects: Instrumentation should not change program behavior
Clean up after: Remove temporary instrumentation once bug is fixed
Consider permanence: Some instrumentation may be valuable long-term for observability

Language-Specific References

Load these references for detailed instrumentation patterns:

Python: references/python.md - logging, tracing, decorators, context managers
JavaScript/TypeScript: references/javascript.md - console, debug module, proxies, async patterns
Java: references/java.md - SLF4J, AspectJ, wrappers, assertions
C/C++: references/c-cpp.md - printf, macros, RAII, GDB scripting

Tips

Start small: Instrument one function, run, analyze, then expand if needed
Be specific: Log exact variable values, not just "processing data"
Use structured logging: Include variable names and context, not just values
Test instrumentation: Verify logs appear and contain expected information
Iterate quickly: Add instrumentation, run, analyze, repeat
Consider performance: For production code, use conditional compilation or feature flags
Document findings: Comment why instrumentation was added and what it revealed

failure-oriented-instrumentation

Failure-Oriented Instrumentation

Workflow

1. Analyze the Failure

2. Determine Instrumentation Strategy

3. Select Instrumentation Patterns

4. Insert Instrumentation

5. Run and Collect Data

6. Analyze Results

Quick Start Examples

Example 1: NullPointerException in Java

Example 2: Incorrect Calculation in Python

Example 3: Intermittent Test Failure in JavaScript

Instrumentation Guidelines

What to Instrument

What NOT to Instrument

Instrumentation Best Practices

Language-Specific References

Tips