microservices-communication
Microservices Communication
Use When
- Inter-service communication patterns — synchronous (HTTP/REST, gRPC) vs asynchronous (events, message queues), service discovery (client-side, server-side, DNS-based), inter-service authentication, data isolation rules, and API contract design...
- The task needs reusable judgment, domain constraints, or a proven workflow rather than ad hoc advice.
Do Not Use When
- The task is unrelated to
microservices-communicationor would be better handled by a more specific companion skill. - The request only needs a trivial answer and none of this skill's constraints or references materially help.
Required Inputs
- Gather relevant project context, constraints, and the concrete problem to solve.
- Confirm the desired deliverable: design, code, review, migration plan, audit, or documentation.
Workflow
- Read this
SKILL.mdfirst, then load only the referenced deep-dive files that are necessary for the task. - Apply the ordered guidance, checklists, and decision rules in this skill instead of cherry-picking isolated snippets.
- Produce the deliverable with assumptions, risks, and follow-up work made explicit when they matter.
Quality Standards
- Keep outputs execution-oriented, concise, and aligned with the repository's baseline engineering standards.
- Preserve compatibility with existing project conventions unless the skill explicitly requires a stronger standard.
- Prefer deterministic, reviewable steps over vague advice or tool-specific magic.
Anti-Patterns
- Treating examples as copy-paste truth without checking fit, constraints, or failure modes.
- Loading every reference file by default instead of using progressive disclosure.
Outputs
- A concrete result that fits the task: implementation guidance, review findings, architecture decisions, templates, or generated artifacts.
- Clear assumptions, tradeoffs, or unresolved gaps when the task cannot be completed from available context alone.
- References used, companion skills, or follow-up actions when they materially improve execution.
Evidence Produced
| Category | Artifact | Format | Example |
|---|---|---|---|
| Correctness | Service-to-service contract test plan | Markdown doc covering REST/gRPC contracts and async event payload contracts | docs/services/comm-contract-tests.md |
| Operability | Async messaging operations note | Markdown doc covering broker choice, partitioning, retention, and DLQ inspection | docs/services/async-ops-note.md |
References
- Use the links and companion skills already referenced in this file when deeper context is needed.
- Companion skill:
event-driven-architecture— for event sourcing, CQRS, saga orchestration, outbox, DLQ, and broker selection details when moving beyond basic async messaging.
The Two Communication Styles
Synchronous (Request/Response)
Caller waits for a response before continuing.
| Protocol | When to Use | Pros | Cons |
|---|---|---|---|
| HTTP/REST | CRUD operations, external APIs, browser clients | Universal, simple, human-readable | Higher latency, coupling |
| gRPC | High-frequency inter-service calls, polyglot services | Binary protocol, ~7× faster than REST, strict contracts via Protobuf | Harder to debug, not browser-native |
Use synchronous when:
- The caller needs the response to continue processing (e.g., "get student details before generating invoice").
- The operation is user-facing and latency matters (real-time UI response).
- It's a query (read) with no side effects.
Avoid synchronous for:
- Long-running operations (> 2s expected time).
- Triggering workflows across multiple services (cascade of synchronous calls = brittle).
- Anything where partial failure must not block the caller.
Asynchronous (Event-Driven)
Caller publishes an event or message and continues immediately. One or more consumers handle it independently.
| Broker | When to Use | Throughput |
|---|---|---|
| Redis Pub/Sub | Simple events, low volume, fire-and-forget | Medium |
| RabbitMQ | Reliable delivery, complex routing, task queues | High |
| Kafka | High-throughput event streams, audit log, replay | Very High |
Use async when:
- The operation takes > 2s (AI calls, report generation, file processing).
- Multiple services need to react to the same event (fan-out).
- You need decoupling — the publisher should not know who consumes.
- The operation must survive service restarts (durable queues).
Example — async AI report generation:
User requests report
→ report-service publishes {job_id, tenant_id, params} to queue
→ HTTP 202 Accepted returned immediately to user
→ ai-worker-service consumes message, calls AI API
→ ai-worker-service stores result, publishes {job_id, status: "complete"}
→ report-service marks job done; user polls /report/{job_id}/status
Service Discovery
The Problem
In a microservices environment, service instances come and go. IP addresses change. You cannot hardcode endpoints.
Three Approaches
1. DNS-Based Discovery (Recommended for NGINX MRA)
A service registry (Consul, etcd, K8s CoreDNS) maps service names to live instance IPs. NGINX queries DNS asynchronously in the background.
resolver 127.0.0.1:8600 valid=1s; # Consul DNS, refresh every 1s
upstream enrollment_service {
server enrollment.service.consul resolve; # resolved dynamically
}
Key: Set valid=Ns not relying on DNS TTL — TTL in microservices contexts can be dangerously stale.
2. Client-Side Discovery
The calling service queries the registry directly to get instance IPs, then load-balances itself.
// Service client queries Consul HTTP API
$instances = Http::get('http://consul:8500/v1/catalog/service/enrollment-service')
->json();
$target = $instances[array_rand($instances)];
$url = "http://{$target['ServiceAddress']}:{$target['ServicePort']}/api/v1/students";
- Pro: calling service controls load-balancing strategy.
- Con: every service needs registry client code (language coupling).
3. Server-Side Discovery (Preferred)
The API gateway or router mesh handles discovery. Services call a fixed gateway address. No registry knowledge needed in service code.
Service A → http://gateway/enrollment/api/v1/students
Gateway → resolves enrollment service from registry → routes to healthy instance
This is what the NGINX Proxy and Router Mesh models implement. Prefer this — it keeps service code simple.
Inter-Service Authentication
Services must authenticate each other. Do not leave internal APIs open.
| Pattern | How | Use When |
|---|---|---|
| JWT Propagation | Gateway validates JWT from client; passes X-User-Id, X-Tenant-Id, X-Role headers downstream |
User-context calls where identity matters |
| Service-to-Service API Key | Each service has a shared secret per downstream dependency | Background jobs, no user context |
| mTLS (mutual TLS) | Both sides present certificates (Fabric Model handles this at NGINX level) | High-security inter-service calls |
JWT header propagation (PHP/Laravel middleware):
// After gateway validates JWT, downstream services trust these headers
$userId = $request->header('X-User-Id');
$tenantId = $request->header('X-Tenant-Id');
$role = $request->header('X-Role');
// Never re-validate JWT downstream — gateway is the trust boundary
Never expose internal service ports to the public internet. All external traffic must go through the API gateway. Internal services communicate within a private network.
Data Isolation Rule
From Stetson's adaptation of Factor 7.
The rule: A service's data belongs to that service alone. No direct database access by another service.
✅ Correct: finance-service calls enrollment-service HTTP API to get student status
❌ Wrong: finance-service runs SELECT on enrollment_db.student_accounts
Cross-Service Query Patterns
Problem: Reporting needs data from 5 services. N synchronous calls = N latency hops.
Solutions:
Option A — API Aggregation (BFF) Create a Backend for Frontend service that fans out to multiple services and stitches results.
report-service → enrollments API ┐
report-service → finance API ├→ aggregate → response
report-service → grades API ┘
Adds latency but keeps service boundaries clean.
Option B — Event-Sourced Read Model Services publish events on data change. A reporting service consumes all events and maintains a denormalized read-only view optimised for queries.
enrollment-service → events bus → reporting-db (denormalized)
finance-service → events bus → reporting-db
grades-service → events bus → reporting-db
Fast queries, eventual consistency.
API Contract Design
Services must not break their callers. Contract discipline is essential.
Versioning
/api/v1/students/{id} ← stable, never broken
/api/v2/students/{id} ← new version with breaking changes
Run both versions simultaneously during migration. Deprecate v1 only after all callers have migrated.
Backward-Compatible Changes (safe — no new version needed)
- Adding optional fields to a JSON response.
- Adding new endpoints.
- Adding optional query parameters.
Breaking Changes (requires new version)
- Removing fields from a response.
- Changing field types or names.
- Changing error response structure.
- Making optional fields required.
Contract Testing
Each service publishes a contract (OpenAPI spec). Consumer services test against the contract, not the live service. This catches breaking changes before deployment.
# Example: Pact contract test (consumer-driven)
pact verify --provider enrollment-service --pact-url http://pact-broker/pacts/finance-service
Message Queue Patterns
Task Queue (Point-to-Point)
One producer, one consumer. Used for job dispatch.
// Producer (PHP/Laravel)
dispatch(new GenerateAIReportJob($tenantId, $userId, $params))
->onQueue('ai-reports');
// Consumer (Laravel worker)
class GenerateAIReportJob implements ShouldQueue {
public function handle(AIMeteredClient $ai) {
$result = $ai->call($this->tenantId, $this->userId, 'report-generation', ...);
Report::create(['tenant_id' => $this->tenantId, 'content' => $result]);
}
}
Event Bus (Fan-Out)
One event, multiple independent consumers.
// Publisher
event(new StudentEnrolled($studentId, $tenantId, $programmeId));
// Multiple listeners react independently
EnrollmentAuditListener::class, // writes audit log
FeeScheduleListener::class, // creates fee schedule
WelcomeNotificationListener::class, // sends welcome message
AIRiskBaselineListener::class, // initialises AI risk model
Communication Decision Guide
Is the caller waiting for a result to continue?
YES → Synchronous
Is it high-frequency inter-service (> 1,000/min) and latency-critical?
YES → gRPC
NO → HTTP/REST
NO → Asynchronous
Does exactly one service handle each message?
YES → Task Queue (RabbitMQ / Redis Queue)
NO → Event Bus (multiple consumers) → Kafka or RabbitMQ fanout exchange
Workflow Automation & Async Orchestration
n8n Self-Hosted
Install on Docker:
services:
n8n:
image: n8nio/n8n:latest
restart: unless-stopped
ports:
- "5678:5678"
environment:
- N8N_HOST=automation.example.com
- N8N_PROTOCOL=https
- WEBHOOK_URL=https://automation.example.com
- DB_TYPE=postgresdb
- DB_POSTGRESDB_HOST=postgres
- DB_POSTGRESDB_DATABASE=n8n
- DB_POSTGRESDB_USER=n8n
- DB_POSTGRESDB_PASSWORD_FILE=/run/secrets/n8n_db
- N8N_ENCRYPTION_KEY_FILE=/run/secrets/n8n_enc
volumes:
- n8n_data:/home/node/.n8n
secrets:
- n8n_db
- n8n_enc
Core concepts:
- Nodes — building blocks (HTTP Request, Function, IF, Set, Schedule Trigger, Webhook)
- Triggers — how a workflow starts (webhook, schedule, database polling, event)
- Credentials — stored encrypted at rest with
N8N_ENCRYPTION_KEY, referenced by node configuration - Executions — each run is logged and retriable from the UI
n8n for SaaS Automations
Example workflow: Stripe checkout.session.completed webhook → n8n enrich → send onboarding email → create Slack channel.
- Webhook Trigger — URL
https://automation.example.com/webhook/stripe - HTTP Request —
GET https://api.stripe.com/v1/customers/{{ $json.customer }}(credentials: Stripe API) - Set — map Stripe fields to internal schema
- HTTP Request —
POST https://api.example.com/internal/usersto provision the account - SendGrid — send welcome email template
- Slack —
POST https://slack.com/api/conversations.createwith namecustomer-{{ $json.company_slug }} - IF — route to SEV2 ticket on any failure (fallback branch)
Temporal Workflow Orchestration
Temporal separates workflow code (durable, deterministic) from activity code (side effects, may fail and retry). The workflow engine persists state so crashes resume where they left off.
Minimal TypeScript workflow + activity:
// activities.ts
export async function chargeCard(customerId: string, amount: number): Promise<string> {
const charge = await stripe.charges.create({ customer: customerId, amount });
return charge.id;
}
export async function sendReceipt(customerId: string, chargeId: string): Promise<void> {
await sendgrid.send({ to: customerId, templateId: "receipt", dynamic: { chargeId } });
}
// workflows.ts
import { proxyActivities, sleep } from "@temporalio/workflow";
import type * as activities from "./activities";
const { chargeCard, sendReceipt } = proxyActivities<typeof activities>({
startToCloseTimeout: "30s",
retry: { initialInterval: "1s", maximumAttempts: 5, backoffCoefficient: 2 },
});
export async function onboardingWorkflow(customerId: string, amount: number): Promise<void> {
const chargeId = await chargeCard(customerId, amount);
await sleep("5s");
await sendReceipt(customerId, chargeId);
}
// worker.ts
import { Worker } from "@temporalio/worker";
import * as activities from "./activities";
await Worker.create({
workflowsPath: require.resolve("./workflows"),
activities,
taskQueue: "onboarding",
}).then((w) => w.run());
Temporal Patterns
Key durability and control-flow features:
- Retries with backoff — declared per activity (initial interval, max interval, coefficient, max attempts)
- Timeouts —
startToCloseTimeout(activity),scheduleToCloseTimeout(total including retries),heartbeatTimeout(long activities report progress) - Signals — external events into a running workflow (
workflow.signal('approved')) - Queries — read workflow state without mutating it (
workflow.query('getStatus')) - Child workflows — decompose long workflows into composable sub-workflows
- Continue-as-new — refresh history when a workflow accumulates too many events
Temporal vs BullMQ
Decision criteria:
| Factor | Temporal | BullMQ |
|---|---|---|
| Durability | Full workflow history persisted; resumes after crash | Job state in Redis; resumable but limited history |
| Programming model | Write workflow code in TS/Go/Java/Python | Submit jobs to queue with handlers |
| Best for | Multi-step business workflows, long-running processes | Short background jobs, fan-out tasks |
| Ops overhead | Temporal cluster (Cassandra/Postgres, frontend, matching, history services) | Single Redis instance |
| Visibility | Web UI with full execution history | Basic Redis-backed UI |
Rule of thumb — if the workflow is longer than 30 seconds or has multiple external calls with their own failure modes, use Temporal; else BullMQ.
Apache Airflow
Apache Airflow schedules and orchestrates ETL-style DAGs. Each DAG is a Python file defining tasks and their dependencies.
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta
default_args = {
"owner": "data-team",
"retries": 3,
"retry_delay": timedelta(minutes=5),
}
with DAG(
dag_id="daily_revenue_sync",
default_args=default_args,
start_date=datetime(2026, 1, 1),
schedule="0 2 * * *",
catchup=False,
) as dag:
extract = PythonOperator(task_id="extract", python_callable=extract_stripe)
transform = PythonOperator(task_id="transform", python_callable=transform_rows)
load = PythonOperator(task_id="load", python_callable=load_warehouse)
extract >> transform >> load
XCom passes small values between tasks; sensors wait for external conditions (file arrival, partition availability, API readiness).
Airflow vs Temporal
- Airflow — batch ETL pipelines with time-based scheduling, data-team ownership, Python-first
- Temporal — business-process workflows (orders, onboarding, refunds), engineering-team ownership, language-agnostic
Don't use Airflow for sub-second latency workflows; don't use Temporal for scheduled nightly ETL.
Workflow Observability
- Temporal Web UI — every execution visible with full event history, click into any activity to see input/output and retry count
- n8n execution logs — per-workflow execution list with node-level inputs/outputs; filter by status
- Airflow task instance logs — per-task run log, DAG graph view, Gantt view for bottleneck analysis
- All three expose Prometheus metrics for scraping into Grafana
See also:
microservices-architecture-models— Where service discovery is handled (Proxy/Router/Fabric)microservices-resilience— Retry, timeout, circuit breaker for synchronous callsmicroservices-ai-integration— Async AI job queue patternapi-error-handling— Error response standards for service APIs
More from peterbamuhigire/skills-web-dev
google-play-store-review
Google Play Store compliance and review readiness for Android apps. Use
76saas-accounting-system
Implement a complete double-entry accounting system inside any SaaS app.
47manual-guide
Generate end-user manuals and reference guides for ERP modules. Use when
38healthcare-ui-design
Design world-class clinical and patient-facing healthcare UIs for web,
38api-error-handling
Comprehensive, standardized error response system for PHP REST APIs with
31android-data-persistence
Android data persistence standards with Room as primary local storage
30