AI Inference Service Mesh

Apply Istio or Linkerd mesh controls to secure and optimize east-west AI traffic across inference microservices. The patterns below use Istio; Linkerd offers analogous primitives.

Why Mesh for AI

  • Enforce mTLS between gateway, retriever, reranker, and model services
  • Apply fine-grained traffic policies without app code changes
  • Run progressive delivery for model-serving backends
  • Observe latency hops for retrieval + generation chains
  • Route inference requests by model version, tenant, or priority tier
  • Protect expensive GPU-backed services from cascading failures

Prerequisites

# Install Istio with production profile
istioctl install --set profile=default \
  --set meshConfig.accessLogFile=/dev/stdout \
  --set meshConfig.defaultConfig.holdApplicationUntilProxyStarts=true

# Label inference namespace for sidecar injection
kubectl create namespace ai-inference
kubectl label namespace ai-inference istio-injection=enabled

# Verify installation
istioctl verify-install
istioctl analyze -n ai-inference
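
Once workloads land in the namespace, confirm the sidecar was actually injected; a quick check (pod name is illustrative):

# Each injected pod should report 2/2 containers (app + istio-proxy)
kubectl get pods -n ai-inference
kubectl get pod <model-server-pod> -n ai-inference \
  -o jsonpath='{.spec.containers[*].name}'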

Core Patterns

mTLS Strict Mode Cluster-Wide

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
---
# Workload-level override for gradual rollout. Note: Istio ignores
# portLevelMtls unless the policy carries a workload selector, so this
# override is scoped to the model-server workload.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: model-server-mtls
  namespace: ai-inference
spec:
  selector:
    matchLabels:
      app: model-server
  mtls:
    mode: STRICT
  portLevelMtls:
    # gRPC inference port
    8081:
      mode: STRICT
    # Prometheus metrics port - allow plaintext scraping
    9090:
      mode: PERMISSIVE
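
To check that a workload actually sits inside the strict-mTLS boundary, istioctl can report the effective policy per pod (pod name is illustrative):

# Show effective PeerAuthentication and related config for one pod
istioctl experimental describe pod <model-server-pod> -n ai-inference

# List every PeerAuthentication policy that could apply
kubectl get peerauthentication -A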

AuthorizationPolicy Per Service Account

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: model-server-access
  namespace: ai-inference
spec:
  selector:
    matchLabels:
      app: model-server
  action: ALLOW
  rules:
  - from:
    - source:
        principals:
        - "cluster.local/ns/ai-inference/sa/api-gateway"
        - "cluster.local/ns/ai-inference/sa/orchestrator"
    to:
    - operation:
        methods: ["POST"]
        paths: ["/v1/predict", "/v1/embeddings", "/v2/models/*/infer"]
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-external-to-retriever
  namespace: ai-inference
spec:
  selector:
    matchLabels:
      app: vector-retriever
  action: DENY
  rules:
  - from:
    - source:
        notNamespaces: ["ai-inference"]
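
These ALLOW/DENY rules behave more predictably on top of a deny-by-default baseline. An ALLOW policy with an empty spec matches no requests, so everything is denied unless another ALLOW policy explicitly matches:

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-all-by-default
  namespace: ai-inference
spec: {}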

Egress Policy for Approved Model Endpoints

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: openai-api
  namespace: ai-inference
spec:
  hosts:
  - api.openai.com
  ports:
  - number: 443
    name: https
    protocol: TLS
  resolution: DNS
  location: MESH_EXTERNAL
---
# The ServiceEntry above declares the port as protocol TLS, so the workload
# originates TLS itself and the sidecar passes it through by SNI. Do not add
# tls: mode: SIMPLE here - that would wrap already-encrypted traffic in a
# second TLS layer. Only TCP-level pool limits apply to passthrough traffic.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: openai-api-pool
  namespace: ai-inference
spec:
  host: api.openai.com
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 50
---
# A host-based egress allowlist via AuthorizationPolicy must be attached to a
# gateway the traffic actually transits. Applied to sidecars, an ALLOW policy
# constrains inbound traffic and would lock down the whole namespace. This
# assumes outbound traffic is routed through the standard Istio egress gateway.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: restrict-egress
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: egressgateway
  action: ALLOW
  rules:
  - to:
    - operation:
        hosts:
        - "api.openai.com"
        - "models.anthropic.com"
        - "*.blob.core.windows.net"

Traffic Management

VirtualService for A/B Model Testing

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: model-server
  namespace: ai-inference
spec:
  hosts:
  - model-server
  http:
  # Route by header for explicit model version selection
  - match:
    - headers:
        x-model-version:
          exact: "v2-experimental"
    route:
    - destination:
        host: model-server
        subset: v2-experimental
    timeout: 120s
  # Route by header for A/B test cohort
  - match:
    - headers:
        x-ab-cohort:
          exact: "treatment"
    route:
    - destination:
        host: model-server
        subset: v2-experimental
      weight: 100
    timeout: 120s
  # Default traffic split: 90/10 canary
  - route:
    - destination:
        host: model-server
        subset: v1-stable
      weight: 90
    - destination:
        host: model-server
        subset: v2-experimental
      weight: 10
    timeout: 60s
    retries:
      attempts: 2
      perTryTimeout: 30s
      retryOn: unavailable,resource-exhausted
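
A quick way to exercise the header-based route from inside the mesh; the service port and payload here are illustrative:

# Pin a request to the experimental subset via the routing header
curl -s http://model-server.ai-inference.svc.cluster.local:8080/v1/predict \
  -H 'x-model-version: v2-experimental' \
  -H 'content-type: application/json' \
  -d '{"inputs": ["hello"]}'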

DestinationRule with Subsets

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: model-server
  namespace: ai-inference
spec:
  host: model-server
  trafficPolicy:
    connectionPool:
      http:
        h2UpgradePolicy: UPGRADE
        maxRequestsPerConnection: 100
      tcp:
        maxConnections: 200
        connectTimeout: 5s
    loadBalancer:
      simple: LEAST_REQUEST
  subsets:
  - name: v1-stable
    labels:
      version: v1
    trafficPolicy:
      connectionPool:
        http:
          maxRequestsPerConnection: 50
  - name: v2-experimental
    labels:
      version: v2
    trafficPolicy:
      connectionPool:
        http:
          maxRequestsPerConnection: 20
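
Subsets only resolve when pods carry the matching version labels. A minimal sketch of the corresponding Deployment metadata (name and image are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server-v2
  namespace: ai-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: model-server
      version: v2
  template:
    metadata:
      labels:
        app: model-server
        version: v2   # must match the DestinationRule subset label
    spec:
      containers:
      - name: server
        image: registry.example.com/model-server:v2   # illustrative
        ports:
        - containerPort: 8081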

Circuit Breaking for Inference Backends

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: model-server-circuit-breaker
  namespace: ai-inference
spec:
  host: model-server
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
        connectTimeout: 10s
      http:
        http1MaxPendingRequests: 50
        http2MaxRequests: 200
        maxRequestsPerConnection: 10
        maxRetries: 3
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 15s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
      minHealthPercent: 30
      splitExternalLocalOriginErrors: true
---
# Separate circuit breaker for the vector retriever
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: vector-retriever-circuit-breaker
  namespace: ai-inference
spec:
  host: vector-retriever
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 300
      http:
        http1MaxPendingRequests: 200
        http2MaxRequests: 500
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 15s
      maxEjectionPercent: 30
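
To confirm that outlier detection is actually ejecting hosts, read the Envoy stats on a client sidecar; ejection counters live on the caller, not the server (deployment name is illustrative):

kubectl exec -n ai-inference deploy/api-gateway -c istio-proxy -- \
  pilot-agent request GET stats | grep outlier_detection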

Per-Route Retry Policies for Streaming vs. Idempotent Endpoints

Streaming generation must not be retried mid-stream, while embedding calls are idempotent and cheap to retry. One caveat: route ordering is not guaranteed when rules for the same host are split across multiple VirtualServices, so in production fold these rules into the single model-server VirtualService above.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: streaming-inference
  namespace: ai-inference
spec:
  hosts:
  - model-server
  http:
  # Streaming endpoint: no retries, long timeout
  - match:
    - uri:
        prefix: /v1/stream
    route:
    - destination:
        host: model-server
        subset: v1-stable
    timeout: 300s
    retries:
      attempts: 0
  # Embeddings endpoint: safe to retry, short timeout
  - match:
    - uri:
        prefix: /v1/embeddings
    route:
    - destination:
        host: model-server
        subset: v1-stable
    timeout: 15s
    retries:
      attempts: 3
      perTryTimeout: 5s
      retryOn: 5xx,reset,connect-failure,retriable-status-codes

Resilience

Locality-Aware Routing

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: model-server-locality
  namespace: ai-inference
spec:
  host: model-server
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        # distribute and failover are mutually exclusive in Istio; failover
        # (which requires the outlierDetection below) keeps traffic in-region
        # until unhealthy endpoints are ejected.
        failover:
        - from: us-east-1
          to: us-west-2
        # Alternative - weighted cross-zone spreading instead of failover:
        # distribute:
        # - from: "us-east-1/us-east-1a/*"
        #   to:
        #     "us-east-1/us-east-1a/*": 80
        #     "us-east-1/us-east-1b/*": 20
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 10s
      baseEjectionTime: 30s
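
Locality-aware routing derives region and zone from the standard Kubernetes topology labels on nodes; if those labels are missing, the failover rules silently do nothing:

# Nodes must carry these labels for locality LB to engage
kubectl get nodes -L topology.kubernetes.io/region,topology.kubernetes.io/zone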

Observability

# Telemetry resource for custom metrics on inference services
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: inference-telemetry
  namespace: ai-inference
spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - match:
        metric: REQUEST_DURATION
        mode: CLIENT_AND_SERVER
      tagOverrides:
        model_name:
          operation: UPSERT
          value: "request.headers['x-model-name']"
        tenant_id:
          operation: UPSERT
          value: "request.headers['x-tenant-id']"
  tracing:
  - providers:
    - name: zipkin
    randomSamplingPercentage: 10.0
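
With the tag overrides in place, per-model and per-tenant latency becomes queryable. An illustrative PromQL query against the Istio addon Prometheus (address assumes the default istio-system deployment):

# p95 inference latency broken down by model and tenant
curl -sG http://prometheus.istio-system:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.95, sum by (model_name, tenant_id, le) (rate(istio_request_duration_milliseconds_bucket{destination_service_namespace="ai-inference"}[5m])))'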

Kiali Dashboard and Proxy Checks

# Port-forward Kiali
kubectl port-forward svc/kiali -n istio-system 20001:20001 &

# Verify mesh health via API
curl -s http://localhost:20001/kiali/api/namespaces/ai-inference/health | jq .

# Check proxy sync status
istioctl proxy-status -n ai-inference

# Debug a specific pod sidecar config
istioctl proxy-config routes deploy/model-server -n ai-inference -o json
istioctl proxy-config cluster deploy/model-server -n ai-inference

Pitfalls to Avoid

  • Aggressive timeouts that break streaming responses -- set 300s+ for generation endpoints
  • Blanket retries that amplify expensive generation calls -- disable retries on non-idempotent routes
  • Missing identity boundaries between tenant-facing and internal services
  • Forgetting to exempt health-check and metrics ports from strict mTLS
  • Outlier ejection tuned too aggressively for small pools -- a high maxEjectionPercent can eject most of the backend set at once
  • Skipping holdApplicationUntilProxyStarts, which lets the app send traffic before the sidecar is ready
