microservices-ai-integration
Microservices AI Integration
Use When
- Integrating AI into a microservices architecture — AI model server as a microservice, AI gateway pattern, async AI job pipeline, AI-enhanced orchestration (Kubeflow, Seldon Core), and wiring the AI metering/billing layer into a distributed system.
- The task needs reusable judgment, domain constraints, or a proven workflow rather than ad hoc advice.
Do Not Use When
- The task is unrelated to microservices-ai-integration or would be better handled by a more specific companion skill.
- The request only needs a trivial answer and none of this skill's constraints or references materially help.
Required Inputs
- Gather relevant project context, constraints, and the concrete problem to solve.
- Confirm the desired deliverable: design, code, review, migration plan, audit, or documentation.
Workflow
- Read this SKILL.md first, then load only the referenced deep-dive files that are necessary for the task.
- Apply the ordered guidance, checklists, and decision rules in this skill instead of cherry-picking isolated snippets.
- Produce the deliverable with assumptions, risks, and follow-up work made explicit when they matter.
Quality Standards
- Keep outputs execution-oriented, concise, and aligned with the repository's baseline engineering standards.
- Preserve compatibility with existing project conventions unless the skill explicitly requires a stronger standard.
- Prefer deterministic, reviewable steps over vague advice or tool-specific magic.
Anti-Patterns
- Treating examples as copy-paste truth without checking fit, constraints, or failure modes.
- Loading every reference file by default instead of using progressive disclosure.
Outputs
- A concrete result that fits the task: implementation guidance, review findings, architecture decisions, templates, or generated artifacts.
- Clear assumptions, tradeoffs, or unresolved gaps when the task cannot be completed from available context alone.
- References used, companion skills, or follow-up actions when they materially improve execution.
Evidence Produced
| Category | Artifact | Format | Example |
|---|---|---|---|
| Correctness | AI service interface contract | Markdown doc covering request/response shape, streaming, and async job lifecycle for the AI gateway | docs/services/ai-gateway-contract.md |
References
- Use the links and companion skills already referenced in this file when deeper context is needed.
The Central Pattern
Treat AI as a dedicated, independently deployable microservice — the AI Service. All AI API calls from across the system route through it. This service enforces the AI Module Gate, Token Budget Guard, and Token Ledger (from ai-architecture-patterns) at a single, auditable point.
[Any Service] → [AI Gateway] → [AI Service] → [External AI Provider API]
                                    ↓             ↓
                               Gate Check     Token Ledger
                               Budget Guard   ai_usage_log
Architecture: AI as a Microservice
The AI Service Contract
The AI Service is the only service that talks to external AI provider APIs (Anthropic, OpenAI, DeepSeek, Gemini). All other services call the AI Service — never the external API directly.
Why:
- Single point for API key management (not distributed across services).
- Token metering happens in one place — no double-counting.
- Model swapping affects one service, not every caller.
- Rate limiting and budget enforcement centralised (a caller-side sketch follows).
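For illustration, a minimal caller-side helper; the class name AIServiceClient, the base URL, and the config keys are assumptions, not part of the contract below:

// Hypothetical caller-side helper (any service other than the AI Service).
// Base URL and config keys are assumptions.
namespace App\Services;

use Illuminate\Http\Client\Response;
use Illuminate\Support\Facades\Http;

class AIServiceClient
{
    public function complete(int $tenantId, int $userId, string $featureSlug, array $payload): Response
    {
        // All AI traffic routes through the internal AI Service, never the provider API.
        return Http::baseUrl(config('services.ai.url', 'http://ai-service'))
            ->withToken(config('services.ai.key')) // service-to-service API key
            ->timeout(30)                          // AI calls can be slow
            ->post('/ai/complete', array_merge($payload, [
                'tenant_id'    => $tenantId,
                'user_id'      => $userId,
                'feature_slug' => $featureSlug,
            ]));
    }
}

Callers then branch on the documented status codes (402 budget exhausted, 403 module inactive) rather than on provider-specific errors.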
API Design
POST /ai/complete
{
"tenant_id": 42,
"user_id": 101,
"feature_slug": "sales-summary",
"model": "claude-haiku-4-5",
"system_prompt": "You are a sales analyst...",
"user_message": "Summarise today's sales: ...",
"max_tokens": 400
}
→ 200 OK
{
"content": "Today's total sales were UGX 2,450,000...",
"input_tokens": 312,
"output_tokens": 87,
"cost_usd": 0.000598,
"request_id": "req_abc123"
}
→ 402 Payment Required (budget exhausted)
{ "error": "ai_budget_exceeded", "message": "Monthly AI budget exhausted for tenant 42" }
→ 403 Forbidden (module not active)
{ "error": "ai_module_inactive", "message": "AI module not activated for tenant 42" }
PHP/Laravel AI Service Implementation
// app/Http/Controllers/AICompletionController.php
class AICompletionController extends Controller
{
public function complete(AICompletionRequest $request, AIMeteredClient $client): JsonResponse
{
try {
$response = $client->call(
tenantId: $request->tenant_id,
userId: $request->user_id,
featureSlug: $request->feature_slug,
request: new AIRequest(
model: $request->model,
systemPrompt: $request->system_prompt,
userMessage: AIInputSanitiser::sanitise($request->user_message),
maxTokens: $request->max_tokens ?? 1024,
)
);
return response()->json([
                'content'       => AIOutputValidator::sanitiseText($response->content),
                'input_tokens'  => $response->inputTokens,
                'output_tokens' => $response->outputTokens,
                'cost_usd'      => $response->costUsd,
                'request_id'    => $response->requestId,
            ]);
} catch (AIModuleNotActiveException $e) {
return response()->json(['error' => 'ai_module_inactive', 'message' => $e->getMessage()], 403);
} catch (AIBudgetExceededException $e) {
return response()->json(['error' => 'ai_budget_exceeded', 'message' => $e->getMessage()], 402);
}
}
}
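The controller above type-hints an AICompletionRequest form request. A plausible sketch of its validation rules, inferred from the API design (the field limits are assumptions):

// app/Http/Requests/AICompletionRequest.php (sketch; limits are illustrative)
namespace App\Http\Requests;

use Illuminate\Foundation\Http\FormRequest;

class AICompletionRequest extends FormRequest
{
    public function rules(): array
    {
        return [
            'tenant_id'     => ['required', 'integer'],
            'user_id'       => ['required', 'integer'],
            'feature_slug'  => ['required', 'string', 'max:64'],
            'model'         => ['required', 'string'],
            'system_prompt' => ['required', 'string'],
            'user_message'  => ['required', 'string', 'max:20000'], // cap raw prompt size
            'max_tokens'    => ['sometimes', 'integer', 'between:1,8192'],
        ];
    }
}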
Async AI Job Pipeline
For AI features where response time exceeds 3s (report generation, batch analysis, document extraction), use an async queue pattern.
User Request → POST /reports/generate
→ 202 Accepted { "job_id": "job_xyz" }
→ Job dispatched to ai-reports queue
Worker Service → dequeues job
→ calls AI Service (POST /ai/complete)
→ stores result in reports table
→ publishes ReportCompleted event
User polls → GET /reports/job_xyz/status
→ { "status": "complete", "download_url": "/reports/job_xyz/download" }
Or push-based with WebSocket/SSE:
Worker publishes ReportCompleted event
→ notification-service listens
→ pushes in-app notification to user ("Your report is ready")
Laravel Queue Implementation
// Dispatching the AI job
class GenerateReportController extends Controller
{
public function generate(Request $request): JsonResponse
{
$job = AIReportJob::create([
'tenant_id' => $request->tenant_id,
'user_id' => $request->user_id,
'params' => $request->params,
'status' => 'queued',
]);
dispatch(new ProcessAIReportJob($job->id))->onQueue('ai-reports');
return response()->json(['job_id' => $job->id, 'status' => 'queued'], 202);
}
}
// The queued job
class ProcessAIReportJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public int $tries = 3;
    public int $timeout = 60; // 60s max per attempt

    // The job carries only the record ID; all state lives in the ai_report_jobs table.
    public function __construct(private int $jobId) {}
public function handle(): void
{
$job = AIReportJob::findOrFail($this->jobId);
$job->update(['status' => 'processing']);
        $response = Http::timeout(30)->post('http://ai-service/ai/complete', [
            'tenant_id'     => $job->tenant_id,
            'user_id'       => $job->user_id,
            'feature_slug'  => 'report-generation',
            'model'         => 'claude-haiku-4-5',
            'system_prompt' => '...',
            'user_message'  => $this->buildPrompt($job->params),
            'max_tokens'    => 2000,
        ])->throw(); // surface 4xx/5xx (e.g. 402 budget exhausted) as a job failure

        $job->update([
            'status'   => 'complete',
            'result'   => $response->json('content'),
            'cost_usd' => $response->json('cost_usd'),
        ]);
event(new AIReportCompleted($job->tenant_id, $job->user_id, $job->id));
}
public function failed(\Throwable $e): void
{
AIReportJob::find($this->jobId)?->update(['status' => 'failed', 'error' => $e->getMessage()]);
}
}
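For the push-based variant, the notification service listens for the completion event. A sketch, assuming the event exposes userId and jobId properties and a ReportReadyNotification class exists:

// Hypothetical listener in notification-service; wiring is illustrative.
class NotifyReportReady
{
    public function handle(AIReportCompleted $event): void
    {
        // Push the in-app "Your report is ready" notification to the requesting user.
        User::findOrFail($event->userId)
            ->notify(new ReportReadyNotification($event->jobId));
    }
}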
AI Gateway Pattern
Layer the AI Service behind the API gateway to enforce:
- Per-tenant rate limiting on AI endpoints (separate limit from regular API)
- AI-specific authentication (service-to-service API key)
- Circuit breaker for the AI Service itself (if AI Service is slow/down)
- Routing to different AI Service instances by feature type
# NGINX — AI Service upstream with circuit breaker
# Note: least_time and active health_check are NGINX Plus directives; on
# open-source NGINX use least_conn plus passive checks (max_fails/fail_timeout).
upstream ai_service {
    least_time last_byte;
    server ai-service-1.internal:8080;
    server ai-service-2.internal:8080;
}

location /ai/ {
    # Rate limit: 100 AI requests per minute per tenant; the rate itself is set
    # on the ai_per_tenant zone, declared with limit_req_zone in the http{} context.
    limit_req zone=ai_per_tenant burst=20 nodelay;

    proxy_pass http://ai_service;
    proxy_read_timeout 60s;  # AI calls can be slow

    # Circuit breaker via active health check
    health_check uri=/health interval=5s fails=1;
}
AI-Enhanced Orchestration
Source: Pandiya & Charankar Ch. 3
AI can enhance the orchestration layer of a microservices system:
Predictive Scaling
AI analyses historical traffic patterns to pre-scale services before load spikes.
- Tool: Kubernetes + Kubeflow Pipelines
- Kubeflow is an ML toolkit for Kubernetes — runs ML workflows alongside services
AI-Assisted Fault Tolerance
AI models predict which service instances are likely to fail before they do, triggering preemptive migration.
- Input: CPU trend, memory growth rate, error rate over last 10 minutes
- Output: probability of failure in the next 15 minutes → if > 0.8, migrate load (sketched below)
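As a sketch of the decision rule only (the predictor endpoint, metric names, and orchestrator hook are all assumptions; the model itself could be served via Seldon Core, covered next):

// Hypothetical preemptive-migration check; every name here is illustrative.
$p = Http::post('http://fault-predictor/predict', [
    'cpu_trend'          => $instance->cpuTrend(),          // slope over last 10 min
    'memory_growth_rate' => $instance->memoryGrowthRate(),
    'error_rate'         => $instance->errorRate(),
])->json('probability');

if ($p > 0.8) {
    // Drain traffic and migrate load before the instance actually fails.
    $orchestrator->cordonAndDrain($instance);
}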
Seldon Core — AI Model Serving as a Microservice
Seldon Core extends Kubernetes to serve ML models as REST/gRPC services with the same lifecycle as any other microservice (canary deployments, A/B testing, traffic splitting).
# Seldon Deployment — serve a scikit-learn model
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: risk-predictor
spec:
  predictors:
    - name: default
      graph:
        name: risk-model
        implementation: SKLEARN_SERVER
        modelUri: gs://my-bucket/risk-model
      replicas: 2
      traffic: 100
When to use Seldon Core: When you have a custom ML model (not a foundation model) that needs to scale, version, and update independently.
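Calling the deployed predictor from a service is then an ordinary HTTP call. A sketch, assuming an ingress host and the default namespace (the /api/v1.0/predictions path and ndarray payload follow Seldon's v1 REST protocol):

// Host and namespace are illustrative; one row of model features in, a score out.
$response = Http::post(
    'http://seldon-ingress/seldon/default/risk-predictor/api/v1.0/predictions',
    ['data' => ['ndarray' => [[0.42, 1250.0, 3]]]]
);

$score = $response->json('data.ndarray.0.0'); // predicted risk for the first row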
Wiring AI Metering in a Microservices Context
In a distributed system, metering must still be centralised (at the AI Service). Services must not try to record their own token usage.
Rule: The AI Service is the sole writer to ai_usage_log. All other services are readers (for their own tenant/user data via the usage API).
finance-service ──┐
enrollment-service ├── POST /ai/complete → AI Service writes to ai_usage_log
report-service ──┘
admin-service → GET /ai/usage?tenant_id=42&period=2026-04 → AI Service reads usage
Usage API in the AI Service:
GET /ai/usage?tenant_id=42&period=2026-04&group_by=user
→ { "period": "2026-04", "users": [ { "user_id": 101, "calls": 82, "tokens": 14500, "cost_usd": 0.0234 } ] }
GET /ai/usage/tenants?period=2026-04 (super-admin only)
→ [ { "tenant_id": 42, "tier": "growth", "budget_usd": 10.00, "spent_usd": 3.21, "pct_used": 32.1 } ]
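A reader-side sketch, e.g. admin-service rendering a tenant usage dashboard (the config key and base URL are assumptions):

// admin-service reads usage through the API; it never touches ai_usage_log directly.
$usage = Http::baseUrl(config('services.ai.url', 'http://ai-service'))
    ->get('/ai/usage', [
        'tenant_id' => 42,
        'period'    => '2026-04',
        'group_by'  => 'user',
    ])
    ->json();

// e.g. pass $usage['users'] to the tenant's AI usage view.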
Resilience for the AI Service
The AI Service is a critical dependency. Apply extra resilience:
- Circuit breaker on the AI Service itself at the gateway level.
- Queue-based fallback: if the AI Service is down, synchronous callers get a 503 immediately; async callers' jobs remain in the queue and are processed when the service recovers.
- Provider failover: if the primary AI provider (Anthropic) is unreachable, automatically fall back to a secondary (OpenAI or DeepSeek) and log the switch to the audit log (see the sketch after this list).
- Budget hard stop: never retry a request that failed with 402 (budget exceeded); an exhausted budget is not a transient error.
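A failover sketch inside the AI Service, assuming per-provider client classes and an AuditLog helper (both illustrative); note the budget error is re-thrown, never failed over:

// Provider failover; client classes and AuditLog are assumptions.
try {
    return $this->anthropic->complete($request);      // primary provider
} catch (AIBudgetExceededException $e) {
    throw $e;                                         // hard stop: budgets are not transient
} catch (ProviderUnreachableException $e) {
    AuditLog::record('ai_provider_failover', ['from' => 'anthropic', 'to' => 'openai']);
    return $this->openai->complete($request);         // secondary provider
}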
Microservices AI Integration Checklist
- One AI Service handles all external AI API calls — no other service calls AI APIs directly.
- AI Module Gate and Budget Guard enforced at AI Service.
- All AI calls recorded to ai_usage_log before the response is returned.
- Long-running AI features use an async queue (not synchronous HTTP).
- AI gateway rate-limits AI endpoints separately from regular API.
- Circuit breaker configured for the AI Service upstream.
- Provider failover configured (primary → secondary model).
- AI Service /health endpoint checks provider reachability (see the sketch below).
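A minimal sketch of that endpoint, assuming the provider's reachability can be probed with a cheap GET (the probe URL is configuration, not a documented provider health API):

// routes/api.php in the AI Service; probe URL and config key are assumptions.
Route::get('/health', function () {
    $providerUp = rescue(
        fn () => Http::timeout(2)->get(config('services.ai_provider.probe_url'))->successful(),
        false // treat timeouts/connection errors as "down"
    );

    return response()->json(
        ['status' => $providerUp ? 'ok' : 'degraded'],
        $providerUp ? 200 : 503
    );
});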
See also:
- ai-architecture-patterns: AI Module Gate, Budget Guard, and Token Ledger detail
- ai-metering-billing: token ledger schema and billing
- microservices-resilience: circuit breaker and health check implementation
- microservices-communication: async queue pattern for AI jobs