Service Communication Audit

Systematic audit of how services talk to each other. Not just finding broken wires — questioning whether each wire should exist, in which direction, at what rate, and what comes back.

The core principle: deep modules with simple interfaces, not webs of shallow wiring. The consumer sees one health surface, one response, one interface. The implementation behind it can be as complex as needed — the consumer doesn't know or care. Every design decision filters through: "does this expose internal wiring to the consumer, or does it stay behind the module boundary?"

The full field guide is at references/communication-audit-guide.md (~40 principles). Do NOT summarize the guide in prompts — workers read it directly.

When To Use

Services communicate but the connection is unstable or brittle
Fire-and-forget calls with no acknowledgment or retry
One service crashes and the other doesn't notice or recover
No backpressure — failures cause hammering instead of backing off
"It works if you start things in the right order" (startup race)
Health checks that lie
Multiple data channels crammed into one tick at one rate

How It Works

Step 1: Scope The Audit

Identify which services and which communication paths to audit. Determine:

The service directories / crate paths
Which boundaries to focus on (all, or specific problem area)
Any auto-generated code to exclude (bindings, protobuf)

Step 2: Delegate The Auditor

Delegate using teams with anthropic/claude-opus-4-6. The worker must:

Read references/communication-audit-guide.md completely first
Map ALL communication boundaries before analyzing any single one
Find specific code locations for every failure mode
Show the current code and the proposed fix for each finding
Produce a rate analysis with justified Hz for every channel

Critical instruction quality rules:

"Trace every failure path to what the end user sees" — not "this could fail"
"Show the exact code that hammers, the exact line that drops damage events" — not "error handling could be improved"
"Calculate the serialization cost at N entities × Hz" — not "this seems frequent"
"Do NOT produce a checklist. Do NOT say '❌ Missing'."

Step 3: Formulate The Dispatch

teams(action: 'delegate', tasks: [{
  text: '<audit request — see template below>',
  assignee: 'comm-auditor',
  model: 'anthropic/claude-opus-4-6'
}])

Template (fill in {placeholders}):

# Service Communication Audit

Read the field guide at {skill_dir}/references/communication-audit-guide.md completely first. It contains 30 principles across 5 areas: mapping, failure modes, Hz analysis, protocol design, and anti-patterns.

## Target
Services: {service names and directories}
Focus area: {specific problem or "full audit"}
Exclude: {auto-generated files, bindings, etc.}

## Rules
- Map ALL communication paths FIRST, then analyze.
- Every finding must include the EXACT current code (file:line) and proposed fix.
- Trace every failure to user impact. "This could fail" is not a finding.
- Calculate actual Hz/throughput numbers, not estimates.
- Do NOT produce a checklist. Produce a narrative with evidence.

## Deliverables

### 1. Communication Map
Table of every boundary: source, destination, transport, direction, rate, criticality tier (ephemeral/important/critical/fatal), acknowledgment (none/_then/response), failure behavior.

### 2. Failure Mode Catalog  
For each failure mode found (§6–§12 in guide): the exact code, the cascade to user impact, the fix. Show BEFORE/AFTER code.

### 3. Rate Analysis
For each rate-driven channel: current Hz, all consumers traced, actual Hz needed, serialization cost at scale, verdict.

### 4. Design Reasoning (MOST IMPORTANT)
For each boundary, answer the §31–§40 questions:
- Why does this data cross this boundary? What breaks without it? (§31)
- Could the response carry data that's currently a subscription? (§33)
- What's the minimum subscription set if responses carry everything they can? (§35)
- What would make this fully self-healing with zero human intervention? (§37)
- Describe the ideal protocol in plain language before proposing code (§40)

### 5. Protocol Design
Proposed channel separation, acknowledgment strategy, connection state machine, health reporting architecture. Must follow from the design reasoning, not from "fix the current bugs."

### 6. Ticket List
Each issue as: title, problem (with code evidence), fix approach, acceptance criteria. Group into epic. Wire dependencies.

## Output
Write findings to {output_ticket_id} or create a new ticket tagged `research,communication-audit`.

Step 4: Present Results

Summarize the top findings in a table:

#	Issue	Severity	Current Code	Impact

Point to the full report for details. If the user wants to act, create tickets from the findings using tk.

Step 5: Create Tickets (if acting)

For each finding, create a ticket. Wire dependencies:

State machine / connection lifecycle → foundation
Error handling unification → before circuit breaker
Connection retry → before backpressure
Hz reduction → independent, can parallelize

Search for existing related tickets first: tk ls --status=open, totalrecall tk recall "<topic>".

What The Guide Covers

Part	Principles	Area
I	§1–§5	Mapping: boundaries, criticality, data flow, subscriptions, startup races
II	§6–§12	Failure modes: permanent blindness, hammering, silent data loss, lying health, theater recovery, hollow shell, startup cascade
III	§13–§16	Hz analysis: trace to consumer, separate rate from criticality, FFI cost, delta vs full
IV	§17–§23	Protocol: circuit breaker, nuclear recovery, evidence-based health, readiness gates, request-response, layered health, ack channels
V	§24–§30	Anti-patterns: "SDK handles it", log without counting, one flush, passive staleness, client sees internals, unquestioned Hz, individual bulk commands
VI	§31–§40	Design reasoning: why does this data cross this boundary, who owns this in 5 years, could the response carry this, minimum subscription set, big module thinking, acknowledgment enables everything, design the protocol you wish you had

Part VI is the most important part. Parts I–V find what's broken and fix it. Part VI questions whether the boundary should exist at all, whether the direction is right, whether subscriptions could be responses. This is the part that changes architecture, not just error handling.

service-communication-audit