Apollo Incident Runbook

Overview

Structured incident response procedures for Apollo.io integration issues, covering diagnosis steps, mitigation actions, and recovery procedures.

Incident Classification

| Severity | Impact | Response Time | Examples |
| --- | --- | --- | --- |
| P1 - Critical | Complete outage | 15 min | API down, auth failed |
| P2 - Major | Degraded service | 1 hour | High error rate, slow responses |
| P3 - Minor | Limited impact | 4 hours | Cache issues, minor errors |
| P4 - Low | No user impact | Next day | Log warnings, cosmetic issues |
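
The matrix above can be encoded so alerting can route incidents automatically. A minimal sketch; the deadline values mirror the table, and `isBreached` is a hypothetical helper, not part of the service:

```typescript
// Response-time deadlines per severity, in minutes (from the matrix above).
type Severity = 'P1' | 'P2' | 'P3' | 'P4';

const RESPONSE_DEADLINE_MIN: Record<Severity, number> = {
  P1: 15,       // Complete outage
  P2: 60,       // Degraded service
  P3: 240,      // Limited impact
  P4: 24 * 60,  // No user impact: next day
};

// True when an incident has been open longer than its response deadline.
function isBreached(severity: Severity, minutesSinceDetection: number): boolean {
  return minutesSinceDetection > RESPONSE_DEADLINE_MIN[severity];
}
```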

Quick Diagnosis Commands

# Check Apollo status
curl -s https://status.apollo.io/api/v2/status.json | jq '.status'

# Verify API key
curl -s "https://api.apollo.io/v1/auth/health?api_key=$APOLLO_API_KEY" | jq

# Check rate limit status (POST with a minimal body; -D - prints response headers)
curl -s -D - -o /dev/null -X POST "https://api.apollo.io/v1/people/search" \
  -H "Content-Type: application/json" \
  -d '{"api_key": "'"$APOLLO_API_KEY"'", "per_page": 1}' \
  | grep -i "ratelimit"

# Check application health
curl -s http://localhost:3000/health/apollo | jq

# Check error logs
kubectl logs -l app=apollo-service --tail=100 | grep -i error

# Check metrics
curl -s http://localhost:3000/metrics | grep apollo_

Incident Response Procedures

P1: Complete API Failure

Symptoms:

  • All Apollo requests returning 5xx errors
  • Health check endpoint failing
  • Alerts firing on error rate

Immediate Actions (0-15 min):

# 1. Confirm Apollo is down (not just us)
curl -s https://status.apollo.io/api/v2/status.json | jq

# 2. Enable circuit breaker / fallback mode
kubectl set env deployment/apollo-service APOLLO_FALLBACK_MODE=true

# 3. Notify stakeholders
# Post to #incidents Slack channel

# 4. Check if it's our API key
curl -s "https://api.apollo.io/v1/auth/health?api_key=$APOLLO_API_KEY"
# If 401: Key is invalid - check if rotated

Fallback Mode Implementation:

// src/lib/apollo/circuit-breaker.ts
class CircuitBreaker {
  private failures = 0;
  private lastFailure: Date | null = null;
  private isOpen = false;

  async execute<T>(fn: () => Promise<T>, fallback: () => T): Promise<T> {
    if (this.isOpen) {
      if (this.shouldAttemptReset()) {
        this.isOpen = false;
      } else {
        console.warn('Circuit breaker open, using fallback');
        return fallback();
      }
    }

    try {
      const result = await fn();
      this.failures = 0;
      return result;
    } catch (error) {
      this.failures++;
      this.lastFailure = new Date();

      if (this.failures >= 5) {
        this.isOpen = true;
        console.error('Circuit breaker opened after 5 failures');
      }

      return fallback();
    }
  }

  private shouldAttemptReset(): boolean {
    if (!this.lastFailure) return true;
    const elapsed = Date.now() - this.lastFailure.getTime();
    return elapsed > 60000; // Try again after 1 minute
  }
}

// Fallback data source
async function getFallbackContacts(criteria: any) {
  // Return cached data
  const cached = await apolloCache.search(criteria);
  if (cached.length > 0) return cached;

  // Return empty with warning
  console.warn('No fallback data available');
  return [];
}

Recovery Steps:

# 1. Monitor Apollo status page for resolution
watch -n 30 'curl -s https://status.apollo.io/api/v2/status.json | jq'

# 2. When Apollo is back, disable fallback mode
kubectl set env deployment/apollo-service APOLLO_FALLBACK_MODE=false

# 3. Verify connectivity
curl -s "https://api.apollo.io/v1/auth/health?api_key=$APOLLO_API_KEY"

# 4. Check for request backlog
kubectl logs -l app=apollo-service | grep -c "queued"

# 5. Gradually restore traffic
kubectl scale deployment/apollo-service --replicas=1
# Wait, verify healthy
kubectl scale deployment/apollo-service --replicas=3

P1: API Key Compromised

Symptoms:

  • Unexpected 401 errors
  • Unusual usage patterns
  • Alert from Apollo about suspicious activity

Immediate Actions:

# 1. Rotate API key immediately in Apollo dashboard
# Settings > Integrations > API > Regenerate Key

# 2. Update secret in production
# Kubernetes
kubectl create secret generic apollo-secrets \
  --from-literal=api-key=NEW_KEY \
  --dry-run=client -o yaml | kubectl apply -f -

# 3. Restart deployments to pick up new key
kubectl rollout restart deployment/apollo-service

# 4. Audit usage logs
kubectl logs -l app=apollo-service --since=24h | grep "apollo_request"

Post-Incident:

  • Review access controls
  • Enable IP allowlisting if available
  • Implement key rotation schedule
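
A rotation schedule can be enforced with a simple age check run by a scheduled job. A sketch, assuming key creation timestamps are tracked in a secrets inventory; the 90-day window and the `keyNeedsRotation` helper are illustrative policy, not an Apollo requirement:

```typescript
// Flag keys older than the rotation window so a scheduled job can alert.
const ROTATION_WINDOW_DAYS = 90; // assumed policy

function keyNeedsRotation(keyCreatedAt: Date, now: Date = new Date()): boolean {
  const ageDays = (now.getTime() - keyCreatedAt.getTime()) / 86_400_000;
  return ageDays > ROTATION_WINDOW_DAYS;
}
```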

P2: High Error Rate

Symptoms:

  • Error rate > 5%
  • Mix of successful and failed requests
  • Alerts on apollo_errors_total

Diagnosis:

# Check error distribution
curl -s http://localhost:3000/metrics | grep apollo_errors_total

# Sample recent errors
kubectl logs -l app=apollo-service --tail=500 | grep -A2 "apollo_error"

# Check if specific endpoint is failing
curl -s http://localhost:3000/metrics | grep apollo_requests_total | sort

Common Causes & Fixes:

| Error Type | Likely Cause | Fix |
| --- | --- | --- |
| validation_error | Bad request format | Check request payload |
| rate_limit | Too many requests | Enable backoff, reduce concurrency |
| auth_error | Key issue | Verify API key |
| timeout | Network/Apollo slow | Increase timeout, add retry |
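
To pick a fix programmatically, errors can be bucketed into these types. A sketch; the status-code mapping is an assumption and should be checked against Apollo's actual error responses:

```typescript
type ApolloErrorType =
  | 'validation_error'
  | 'rate_limit'
  | 'auth_error'
  | 'timeout'
  | 'unknown';

// Map an HTTP status (null = no response received) to an error bucket.
function classifyError(status: number | null): ApolloErrorType {
  if (status === null) return 'timeout';
  if (status === 401 || status === 403) return 'auth_error';
  if (status === 400 || status === 422) return 'validation_error';
  if (status === 429) return 'rate_limit';
  return 'unknown';
}
```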

Mitigation:

# Reduce request rate
kubectl set env deployment/apollo-service APOLLO_RATE_LIMIT=50

# Enable aggressive caching
kubectl set env deployment/apollo-service APOLLO_CACHE_TTL=3600

# Scale down to reduce load
kubectl scale deployment/apollo-service --replicas=1

P2: Rate Limit Exceeded

Symptoms:

  • 429 responses
  • apollo_rate_limit_hits_total increasing
  • Requests queuing

Immediate Actions:

# 1. Check current rate limit status
curl -I "https://api.apollo.io/v1/auth/health?api_key=$APOLLO_API_KEY" \
  | grep -i ratelimit

# 2. Pause non-essential operations
kubectl set env deployment/apollo-service \
  APOLLO_PAUSE_BACKGROUND_JOBS=true

# 3. Reduce concurrency
kubectl set env deployment/apollo-service \
  APOLLO_MAX_CONCURRENT=2

# 4. Wait for rate limit to reset (typically 1 minute)
sleep 60

# 5. Gradually resume
kubectl set env deployment/apollo-service \
  APOLLO_MAX_CONCURRENT=5 \
  APOLLO_PAUSE_BACKGROUND_JOBS=false

Prevention:

// Implement request budgeting
class RequestBudget {
  private used = 0;
  private resetTime: Date;

  constructor(private limit: number = 90) {
    this.resetTime = this.getNextMinute();
  }

  async acquire(): Promise<boolean> {
    if (new Date() > this.resetTime) {
      this.used = 0;
      this.resetTime = this.getNextMinute();
    }

    if (this.used >= this.limit) {
      const waitMs = this.resetTime.getTime() - Date.now();
      console.warn(`Budget exhausted, waiting ${waitMs}ms`);
      await new Promise(r => setTimeout(r, waitMs));
      return this.acquire();
    }

    this.used++;
    return true;
  }

  private getNextMinute(): Date {
    const next = new Date();
    next.setSeconds(0, 0);
    next.setMinutes(next.getMinutes() + 1);
    return next;
  }
}

P3: Slow Responses

Symptoms:

  • P95 latency > 5 seconds
  • Timeouts occurring
  • User complaints about slow search

Diagnosis:

# Check latency metrics
curl -s http://localhost:3000/metrics \
  | grep apollo_request_duration

# Check Apollo's response time
time curl -s "https://api.apollo.io/v1/auth/health?api_key=$APOLLO_API_KEY"

# Check our application latency
kubectl top pods -l app=apollo-service

Mitigation:

# Increase timeout
kubectl set env deployment/apollo-service APOLLO_TIMEOUT=60000

# Enable request hedging (send duplicate requests)
kubectl set env deployment/apollo-service APOLLO_HEDGE_REQUESTS=true

# Reduce payload size (request fewer results)
kubectl set env deployment/apollo-service APOLLO_DEFAULT_PER_PAGE=25
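
Request hedging fires a duplicate request when the first is slow and takes whichever settles first. A minimal sketch of the idea; how APOLLO_HEDGE_REQUESTS actually behaves is an assumption, and only idempotent calls such as searches should be hedged:

```typescript
// Race the primary request against a delayed duplicate; first to settle wins.
async function hedged<T>(fn: () => Promise<T>, hedgeAfterMs = 1000): Promise<T> {
  const primary = fn();
  const backup = new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => fn().then(resolve, reject), hedgeAfterMs);
    // Cancel the hedge once the primary settles, either way.
    primary.finally(() => clearTimeout(timer)).catch(() => {});
  });
  return Promise.race([primary, backup]);
}
```

Note the limitation: if the primary fails fast, the race rejects immediately; a production version would fall through to the hedge instead.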

Post-Incident Template

## Incident Report: [Title]

**Date:** [Date]
**Duration:** [Start] - [End] ([X] minutes)
**Severity:** P[1-4]
**Affected Systems:** Apollo integration

### Summary
[1-2 sentence description]

### Timeline
- HH:MM - Issue detected
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Mitigation applied
- HH:MM - Service restored

### Root Cause
[Description of what caused the incident]

### Impact
- [Number] of failed requests
- [Number] of affected users
- [Duration] of degraded service

### Resolution
[What was done to fix the issue]

### Action Items
- [ ] [Preventive measure 1]
- [ ] [Preventive measure 2]
- [ ] [Monitoring improvement]

### Lessons Learned
[What we learned from this incident]

Output

  • Incident classification matrix
  • Quick diagnosis commands
  • Response procedures by severity
  • Circuit breaker implementation
  • Post-incident template

Error Handling

| Issue | Escalation |
| --- | --- |
| P1 > 30 min | Page on-call lead |
| P2 > 2 hours | Notify management |
| Recurring P3 | Create P2 tracking |
| Apollo outage | Open support ticket |

Next Steps

Proceed to apollo-data-handling for data management.
