Replay-Oriented Instrumentation

Instrument programs to capture execution information that enables deterministic replay, making it possible to reproduce and debug failures that are difficult to reproduce normally.

Core Concept

Deterministic replay works by:

  1. Recording: Capture all non-deterministic inputs during execution
  2. Replaying: Re-execute the program using recorded inputs to reproduce exact behavior
  3. Debugging: Use replay to analyze failures with time-travel debugging

Workflow

1. Identify Non-Determinism Sources

Analyze the program to find sources of non-determinism. See references/non-determinism.md for comprehensive coverage.

Common sources:

  • I/O operations: File reads, network requests, user input
  • Time: System clock, timestamps, timeouts
  • Randomness: Random number generation, randomized hash seeds (which change the iteration order of hash-based containers)
  • Threading: Thread scheduling, race conditions, lock ordering
  • System state: Process IDs, memory addresses, environment variables
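
As a starting point for this analysis, a static scan can flag call sites that match a watch list. The sketch below uses Python's ast module; the NONDETERMINISTIC_CALLS set and the helper names are assumptions made for illustration, and the scan misses aliased imports such as `from time import time`.

```python
import ast

# Hypothetical watch list: dotted call names treated as non-deterministic.
# Extend it for the libraries your program actually uses.
NONDETERMINISTIC_CALLS = {
    "time.time", "time.monotonic", "datetime.datetime.now",
    "random.random", "random.randint", "os.getpid", "os.urandom",
    "uuid.uuid4", "input", "open",
}

def dotted_name(node):
    """Return 'a.b.c' for Name/Attribute chains, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None

def find_nondeterministic_calls(source):
    """Return sorted (line, call_name) pairs for calls on the watch list."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in NONDETERMINISTIC_CALLS:
                hits.append((node.lineno, name))
    return sorted(hits)

sample = """import time, random
start = time.time()
value = random.randint(1, 6)
print(len("abc"))
"""
for line, name in find_nondeterministic_calls(sample):
    print(f"line {line}: {name}")
# prints:
# line 2: time.time
# line 3: random.randint
```

A scan like this only suggests where to look; threading and system-state sources still need manual review.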

2. Choose Recording Granularity

Select the recording level that fits your needs:

Function-level (recommended starting point):

  • Record function calls and return values
  • Low overhead
  • Good for most debugging scenarios
  • Example: Record all I/O function calls

Event-based (balanced approach):

  • Record specific non-deterministic events
  • Moderate overhead
  • Captures essential non-determinism
  • Example: Record syscalls, thread events, random values

Instruction-level (comprehensive):

  • Record every instruction execution
  • High overhead, large logs
  • Complete determinism
  • Use only when necessary
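
The function-level option can be sketched as a decorator that logs return values in record mode and serves them back in replay mode. CallLog, recorded, and roll_dice are hypothetical names for this illustration, and the sketch assumes a single thread and JSON-friendly return values.

```python
import functools
import random

class CallLog:
    """Holds recorded return values and a cursor for replay."""
    def __init__(self, mode="record", entries=None):
        self.mode = mode
        self.entries = list(entries or [])
        self.cursor = 0

def recorded(log):
    """Decorator: record the wrapped function's return value, or replay it."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if log.mode == "record":
                result = func(*args, **kwargs)
                log.entries.append({"func": func.__name__, "result": result})
                return result
            # Replay: skip the real call and return the next logged value.
            entry = log.entries[log.cursor]
            log.cursor += 1
            return entry["result"]
        return wrapper
    return decorate

log = CallLog(mode="record")

@recorded(log)
def roll_dice():
    return random.randint(1, 6)

first_run = [roll_dice() for _ in range(3)]

# Switch the same log to replay mode: the real function is no longer called.
log.mode = "replay"
log.cursor = 0
second_run = [roll_dice() for _ in range(3)]
print(first_run == second_run)  # prints True
```

Wrapping only the non-deterministic functions keeps the overhead proportional to how often they are called, which is why this level tends to be cheap.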

3. Implement Recording Infrastructure

Choose between custom instrumentation and existing tools:

Custom instrumentation (flexible):

  • Wrap non-deterministic functions
  • Log inputs and outputs
  • Control what gets recorded
  • See language-specific guides below

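Existing tools (easier):

  • RR: system-level record and replay for native programs on Linux
  • Nock: HTTP request recording and replay for Node.js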

4. Record Execution

Run the program in recording mode:

  • Execute the failing scenario
  • Capture all non-deterministic events
  • Save recording log
  • Verify recording completed successfully
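
One way to make "save" and "verify" concrete is to write the log with a small header and check it on load. This is a minimal sketch; the header fields (format_version, event_count) and the function names are assumptions, not a standard format.

```python
import json
import os
import tempfile

LOG_FORMAT_VERSION = 1  # hypothetical version number for this sketch

def save_recording(path, events):
    """Write events plus a small header, then atomically move into place."""
    payload = {
        "format_version": LOG_FORMAT_VERSION,
        "event_count": len(events),
        "events": events,
    }
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # readers never see a half-written log

def load_recording(path):
    """Load a recording and check that it is complete and compatible."""
    with open(path) as f:
        payload = json.load(f)
    if payload["format_version"] != LOG_FORMAT_VERSION:
        raise ValueError(f"unsupported log version {payload['format_version']}")
    if payload["event_count"] != len(payload["events"]):
        raise ValueError("recording is incomplete")
    return payload["events"]

events = [{"func": "time", "result": 1700000000.5}]
log_path = os.path.join(tempfile.mkdtemp(), "replay.log")
save_recording(log_path, events)
print(load_recording(log_path) == events)  # prints True
```

Writing to a temporary file and renaming it means a crash during recording leaves either the previous log or none, never a truncated one.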

5. Replay Execution

Reproduce the execution from the log:

  • Load recorded events
  • Replace non-deterministic operations with logged values
  • Verify replay matches original execution
  • Use debugger during replay for analysis
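
A replayer can check that each request matches the next logged event, so a divergence surfaces at the point it happens instead of as a wrong result later. The sketch below assumes the {'func', 'result'} event shape used elsewhere in this document; Replayer and ReplayDivergence are hypothetical names.

```python
class ReplayDivergence(Exception):
    """Raised when the replayed program asks for something the log lacks."""

class Replayer:
    def __init__(self, events):
        self.events = events
        self.cursor = 0

    def next_result(self, func_name):
        """Return the next logged result, checking it belongs to func_name."""
        if self.cursor >= len(self.events):
            raise ReplayDivergence(
                f"log exhausted at event {self.cursor}: "
                f"program called {func_name!r}")
        entry = self.events[self.cursor]
        if entry["func"] != func_name:
            raise ReplayDivergence(
                f"event {self.cursor}: log has {entry['func']!r}, "
                f"program called {func_name!r}")
        self.cursor += 1
        return entry["result"]

    def finished(self):
        """True once every logged event has been consumed."""
        return self.cursor == len(self.events)

replayer = Replayer([
    {"func": "time", "result": 100.0},
    {"func": "random", "result": 0.25},
])
print(replayer.next_result("time"))    # prints 100.0
print(replayer.next_result("random"))  # prints 0.25
print(replayer.finished())             # prints True
```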

6. Debug with Replay

Leverage replay for debugging:

  • Set breakpoints without affecting timing
  • Use time-travel debugging (reverse execution)
  • Inspect state at any point in execution
  • Reproduce failure consistently

Quick Start by Language

Python

For custom instrumentation, see references/python-replay.md.

Basic example:

import json
import time

class ReplayRecorder:
    def __init__(self, mode='record'):
        self.mode = mode
        self.log = []
        self.index = 0

    def record_call(self, func_name, result=None):
        # Record mode: append the live result and pass it through.
        if self.mode == 'record':
            self.log.append({'func': func_name, 'result': result})
            return result
        # Replay mode: ignore the live result and return the logged one.
        entry = self.log[self.index]
        self.index += 1
        return entry['result']

recorder = ReplayRecorder(mode='record')

def get_time():
    # Only touch the real clock while recording.
    if recorder.mode == 'record':
        return recorder.record_call('time', time.time())
    return recorder.record_call('time')

# Record mode
result = get_time()
with open('replay.log', 'w') as f:
    json.dump(recorder.log, f)

# Replay mode
recorder = ReplayRecorder(mode='replay')
with open('replay.log', 'r') as f:
    recorder.log = json.load(f)
result = get_time()  # Returns the recorded value

Using RR (system-level):

rr record python script.py
rr replay

JavaScript/Node.js

Recording HTTP requests with Nock:

const nock = require('nock');

// Record mode
nock.recorder.rec();
// ... make requests ...
const fixtures = nock.recorder.play();

// Replay mode
nock('http://api.example.com')
  .get('/data')
  .reply(200, { data: 'recorded response' });

Java

Using AspectJ for recording:

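import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

// Event is a user-defined record type (not shown).
@Aspect
public class ReplayAspect {
    // Thread-safe: advice may run on several threads at once.
    private final List<Event> events = new CopyOnWriteArrayList<>();

    // call(...) matches call sites in woven application code; execution(...)
    // would require weaving the JDK's java.io classes themselves.
    @Around("call(* java.io..*(..)) && !within(ReplayAspect)")
    public Object recordIO(ProceedingJoinPoint pjp) throws Throwable {
        Object result = pjp.proceed();
        events.add(new Event(pjp.getSignature(), pjp.getArgs(), result));
        return result;
    }
}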

C/C++

Using RR (recommended):

# Record
rr record ./program arg1 arg2

# Replay with GDB
rr replay -d gdb

# In GDB, use reverse execution
(gdb) reverse-continue
(gdb) reverse-step

Custom instrumentation with macros:

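// Relies on GNU statement expressions (GCC/Clang). `auto` needs C++11 or C23;
// in older GNU C, use __auto_type instead. log_event is a user-supplied logger.
#define RECORD_CALL(func, ...) \
    ({ \
        auto result = func(__VA_ARGS__); \
        log_event(#func, result); \
        result; \
    })

// Usage (requires <fcntl.h> for open and O_RDONLY)
int fd = RECORD_CALL(open, "file.txt", O_RDONLY);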

Common Scenarios

Scenario 1: Race Condition Debugging

Problem: Test fails intermittently due to race condition

Solution:

  1. Record thread scheduling events
  2. Capture lock acquisition order
  3. Replay with same thread interleaving
  4. Use debugger to inspect race condition

Implementation:

import threading

class ThreadRecorder:
    def __init__(self):
        self.events = []

    def record_lock(self, lock_id, acquired):
        self.events.append({
            'type': 'lock',
            'lock_id': lock_id,
            'acquired': acquired,
            'thread': threading.current_thread().ident
        })

recorder = ThreadRecorder()

class RecordingLock:
    def __init__(self, lock_id):
        self.lock = threading.Lock()
        self.lock_id = lock_id

    def acquire(self):
        result = self.lock.acquire()
        recorder.record_lock(self.lock_id, True)
        return result

    def release(self):
        recorder.record_lock(self.lock_id, False)
        self.lock.release()
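
Step 3 of the solution, replaying with the same interleaving, can be sketched by gating each acquisition on a recorded schedule. This illustration makes two assumptions: threads are identified by stable names rather than by ident (idents differ between runs), and the schedule lists acquisitions only. OrderedReplay and ReplayLock are hypothetical names.

```python
import threading

class OrderedReplay:
    """Makes threads acquire locks in a previously recorded order."""
    def __init__(self, schedule, timeout=5.0):
        self.schedule = schedule      # list of (thread_name, lock_id)
        self.position = 0
        self.timeout = timeout
        self.turn = threading.Condition()

    def wait_for_turn(self, lock_id):
        me = (threading.current_thread().name, lock_id)
        def my_turn():
            return (self.position < len(self.schedule)
                    and self.schedule[self.position] == me)
        with self.turn:
            if not self.turn.wait_for(my_turn, timeout=self.timeout):
                raise RuntimeError(f"replay diverged waiting for {me}")

    def advance(self):
        with self.turn:
            self.position += 1
            self.turn.notify_all()

class ReplayLock:
    def __init__(self, lock_id, replay):
        self.lock = threading.Lock()
        self.lock_id = lock_id
        self.replay = replay

    def __enter__(self):
        self.replay.wait_for_turn(self.lock_id)  # block until scheduled
        self.lock.acquire()
        self.replay.advance()                    # let the next entry proceed
        return self

    def __exit__(self, *exc):
        self.lock.release()

# Recorded acquisition order to reproduce: B, then A, then B again.
replay = OrderedReplay([("B", "L"), ("A", "L"), ("B", "L")])
lock = ReplayLock("L", replay)
order = []

def worker(times):
    for _ in range(times):
        with lock:
            order.append(threading.current_thread().name)

threads = [threading.Thread(target=worker, args=(1,), name="A"),
           threading.Thread(target=worker, args=(2,), name="B")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(order)  # prints ['B', 'A', 'B']
```

The timeout turns a schedule the program no longer follows into an error instead of a silent hang.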

Scenario 2: Network Request Failure

Problem: API call fails in production, can't reproduce locally

Solution:

  1. Record network requests and responses
  2. Replay with recorded responses
  3. Debug with exact production data

Implementation (JavaScript):

const nock = require('nock');
const fs = require('fs');

// Record mode (run in production)
// dont_print keeps nock from echoing every recording to the console.
nock.recorder.rec({ output_objects: true, dont_print: true });
// ... application runs ...
const recorded = nock.recorder.play();
fs.writeFileSync('recordings.json', JSON.stringify(recorded));

// Replay mode (run locally, in a separate script)
const recordings = JSON.parse(fs.readFileSync('recordings.json', 'utf8'));
nock.define(recordings);
// ... application runs with recorded responses ...

Scenario 3: Time-Dependent Bug

Problem: Bug only occurs at specific times or after certain duration

Solution:

  1. Record all time-related calls
  2. Replay with recorded timestamps
  3. Debug without waiting for real time

Implementation:

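import time

# Keep a reference to the real clock before patching; otherwise the
# recorder would call its own patched method and recurse forever.
_real_time = time.time

class TimeRecorder:
    def __init__(self, mode='record'):
        self.mode = mode
        self.times = []
        self.index = 0

    def time(self):
        if self.mode == 'record':
            t = _real_time()
            self.times.append(t)
            return t
        else:
            t = self.times[self.index]
            self.index += 1
            return t

recorder = TimeRecorder(mode='record')
time.time = recorder.time  # every later time.time() call goes through the recorder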

Recording Strategies

Minimize Overhead

  • Record only non-deterministic operations
  • Use binary log formats
  • Buffer log writes
  • Compress logs
  • Sample only auxiliary diagnostics; every event that replay depends on must be recorded
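
Binary encoding, buffering, and compression can be combined in a small writer. The sketch below assumes a hypothetical fixed record layout (a 1-byte event type plus an 8-byte float) and uses only the standard struct and gzip modules; real logs usually need variable-length payloads.

```python
import gzip
import io
import struct

# Hypothetical record layout: 1-byte event type, 8-byte float payload.
RECORD = struct.Struct("<Bd")
EVENT_TIME, EVENT_RANDOM = 1, 2

class BinaryLogWriter:
    """Buffers fixed-size records in memory and writes them gzip-compressed."""
    def __init__(self, stream, buffer_records=1024):
        self.gz = gzip.GzipFile(fileobj=stream, mode="wb")
        self.buffer = bytearray()
        self.limit = buffer_records * RECORD.size

    def write(self, event_type, value):
        self.buffer += RECORD.pack(event_type, value)
        if len(self.buffer) >= self.limit:
            self.flush()

    def flush(self):
        self.gz.write(bytes(self.buffer))
        self.buffer.clear()

    def close(self):
        self.flush()
        self.gz.close()  # finishes the gzip stream; leaves `stream` open

def read_log(data):
    """Decode a compressed log back into (event_type, value) tuples."""
    raw = gzip.decompress(data)
    return [RECORD.unpack_from(raw, offset)
            for offset in range(0, len(raw), RECORD.size)]

stream = io.BytesIO()
writer = BinaryLogWriter(stream, buffer_records=2)
for i in range(5):
    writer.write(EVENT_TIME, 1000.0 + i)
writer.close()
print(read_log(stream.getvalue())[:2])  # prints [(1, 1000.0), (1, 1001.0)]
```

Buffered events are lost if the process crashes before a flush, so a smaller buffer may be the better trade-off when recording a crash.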

Ensure Completeness

  • Identify all non-determinism sources
  • Test replay matches recording
  • Verify edge cases
  • Handle errors during recording

Optimize Log Size

  • Use efficient encoding
  • Deduplicate repeated values
  • Compress similar events
  • Prune unnecessary data

Replay Verification

Always verify replay matches recording:

def verify_replay(original_output, replay_output):
    if original_output != replay_output:
        print("REPLAY MISMATCH!")
        print(f"Original: {original_output}")
        print(f"Replay: {replay_output}")
        return False
    return True
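
When outputs are long event traces, a digest gives a cheap equality check and an index-based comparison shows where the runs first differ. This is a sketch under the assumption that events are JSON-serializable dictionaries; trace_digest and first_divergence are hypothetical helper names.

```python
import hashlib
import json

def trace_digest(events):
    """Hash a canonical JSON encoding of the event list."""
    canonical = json.dumps(events, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def first_divergence(original, replayed):
    """Return the index of the first differing event, or None if identical."""
    for index, (a, b) in enumerate(zip(original, replayed)):
        if a != b:
            return index
    if len(original) != len(replayed):
        # One trace is a strict prefix of the other.
        return min(len(original), len(replayed))
    return None

original = [{"func": "time", "result": 1.0}, {"func": "read", "result": "ok"}]
replayed = [{"func": "time", "result": 1.0}, {"func": "read", "result": "err"}]

print(trace_digest(original) == trace_digest(replayed))  # prints False
print(first_divergence(original, replayed))              # prints 1
```

Storing the digest alongside the recording lets an automated check confirm a replay without keeping the full original trace in memory.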

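References

  • references/non-determinism.md: sources of non-determinism
  • references/python-replay.md: custom instrumentation in Python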

Tips

  • Start simple: Begin with function-level recording
  • Test replay early: Verify replay works before extensive recording
  • Use existing tools: Leverage RR, Nock, etc. when possible
  • Record minimally: Only capture what's needed for replay
  • Version logs: Include version info for compatibility
  • Document sources: Know what non-determinism exists in your code
  • Automate verification: Check replay matches recording automatically