ai-inference-service-mesh
Installation
SKILL.md
AI Inference Service Mesh
Apply Istio/Linkerd mesh controls to secure and optimize east-west AI traffic across inference microservices.
Why Mesh for AI
- Enforce mTLS between gateway, retriever, reranker, and model services
- Apply fine-grained traffic policies without app code changes
- Run progressive delivery for model-serving backends
- Observe latency hops for retrieval + generation chains
- Route inference requests by model version, tenant, or priority tier
- Protect expensive GPU-backed services from cascading failures
Prerequisites
# Install Istio with production profile
istioctl install --set profile=default \
--set meshConfig.accessLogFile=/dev/stdout \
--set meshConfig.defaultConfig.holdApplicationUntilProxyStarts=true
# Label inference namespace for sidecar injection
kubectl create namespace ai-inference
kubectl label namespace ai-inference istio-injection=enabled
# Verify installation
istioctl verify-install
istioctl analyze -n ai-inference
Core Patterns
mTLS Strict Mode Cluster-Wide
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: istio-system
spec:
mtls:
mode: STRICT
---
# Namespace-level override if needed for gradual rollout
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: ai-inference-mtls
namespace: ai-inference
spec:
mtls:
mode: STRICT
portLevelMtls:
# gRPC inference port
8081:
mode: STRICT
# Prometheus metrics port - allow plaintext scraping
9090:
mode: PERMISSIVE
AuthorizationPolicy Per Service Account
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: model-server-access
namespace: ai-inference
spec:
selector:
matchLabels:
app: model-server
action: ALLOW
rules:
- from:
- source:
principals:
- "cluster.local/ns/ai-inference/sa/api-gateway"
- "cluster.local/ns/ai-inference/sa/orchestrator"
to:
- operation:
methods: ["POST"]
paths: ["/v1/predict", "/v1/embeddings", "/v2/models/*/infer"]
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: deny-external-to-retriever
namespace: ai-inference
spec:
selector:
matchLabels:
app: vector-retriever
action: DENY
rules:
- from:
- source:
notNamespaces: ["ai-inference"]
Egress Policy for Approved Model Endpoints
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
name: openai-api
namespace: ai-inference
spec:
hosts:
- api.openai.com
ports:
- number: 443
name: https
protocol: TLS
resolution: DNS
location: MESH_EXTERNAL
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
name: openai-api-tls
namespace: ai-inference
spec:
host: api.openai.com
trafficPolicy:
tls:
mode: SIMPLE
connectionPool:
http:
h2UpgradePolicy: UPGRADE
tcp:
maxConnections: 50
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: restrict-egress
namespace: ai-inference
spec:
action: ALLOW
rules:
- to:
- operation:
hosts:
- "api.openai.com"
- "models.anthropic.com"
- "*.blob.core.windows.net"
Traffic Management
VirtualService for A/B Model Testing
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
name: model-server
namespace: ai-inference
spec:
hosts:
- model-server
http:
# Route by header for explicit model version selection
- match:
- headers:
x-model-version:
exact: "v2-experimental"
route:
- destination:
host: model-server
subset: v2-experimental
timeout: 120s
# Route by header for A/B test cohort
- match:
- headers:
x-ab-cohort:
exact: "treatment"
route:
- destination:
host: model-server
subset: v2-experimental
weight: 100
timeout: 120s
# Default traffic split: 90/10 canary
- route:
- destination:
host: model-server
subset: v1-stable
weight: 90
- destination:
host: model-server
subset: v2-experimental
weight: 10
timeout: 60s
retries:
attempts: 2
perTryTimeout: 30s
retryOn: unavailable,resource-exhausted
DestinationRule with Subsets
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
name: model-server
namespace: ai-inference
spec:
host: model-server
trafficPolicy:
connectionPool:
http:
h2UpgradePolicy: UPGRADE
maxRequestsPerConnection: 100
tcp:
maxConnections: 200
connectTimeout: 5s
loadBalancer:
simple: LEAST_REQUEST
subsets:
- name: v1-stable
labels:
version: v1
trafficPolicy:
connectionPool:
http:
maxRequestsPerConnection: 50
- name: v2-experimental
labels:
version: v2
trafficPolicy:
connectionPool:
http:
maxRequestsPerConnection: 20
Circuit Breaking for Inference Backends
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
name: model-server-circuit-breaker
namespace: ai-inference
spec:
host: model-server
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
connectTimeout: 10s
http:
http1MaxPendingRequests: 50
http2MaxRequests: 200
maxRequestsPerConnection: 10
maxRetries: 3
outlierDetection:
consecutive5xxErrors: 3
interval: 15s
baseEjectionTime: 30s
maxEjectionPercent: 50
minHealthPercent: 30
splitExternalLocalOriginErrors: true
---
# Separate circuit breaker for the vector retriever
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
name: vector-retriever-circuit-breaker
namespace: ai-inference
spec:
host: vector-retriever
trafficPolicy:
connectionPool:
tcp:
maxConnections: 300
http:
http1MaxPendingRequests: 200
http2MaxRequests: 500
outlierDetection:
consecutive5xxErrors: 5
interval: 10s
baseEjectionTime: 15s
maxEjectionPercent: 30
Retry Budget for Streaming Requests
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
name: streaming-inference
namespace: ai-inference
spec:
hosts:
- model-server
http:
# Streaming endpoint: no retries, long timeout
- match:
- uri:
prefix: /v1/stream
route:
- destination:
host: model-server
subset: v1-stable
timeout: 300s
retries:
attempts: 0
# Embeddings endpoint: safe to retry, short timeout
- match:
- uri:
prefix: /v1/embeddings
route:
- destination:
host: model-server
subset: v1-stable
timeout: 15s
retries:
attempts: 3
perTryTimeout: 5s
retryOn: 5xx,reset,connect-failure,retriable-status-codes
Resilience
Locality-Aware Routing
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
name: model-server-locality
namespace: ai-inference
spec:
host: model-server
trafficPolicy:
loadBalancer:
localityLbSetting:
enabled: true
distribute:
- from: "us-east-1/us-east-1a/*"
to:
"us-east-1/us-east-1a/*": 80
"us-east-1/us-east-1b/*": 20
failover:
- from: us-east-1
to: us-west-2
outlierDetection:
consecutive5xxErrors: 3
interval: 10s
baseEjectionTime: 30s
Observability
# Telemetry resource for custom metrics on inference services
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
name: inference-telemetry
namespace: ai-inference
spec:
metrics:
- providers:
- name: prometheus
overrides:
- match:
metric: REQUEST_DURATION
mode: CLIENT_AND_SERVER
tagOverrides:
model_name:
operation: UPSERT
value: "request.headers['x-model-name']"
tenant_id:
operation: UPSERT
value: "request.headers['x-tenant-id']"
tracing:
- providers:
- name: zipkin
randomSamplingPercentage: 10.0
Kiali Dashboard Check
# Port-forward Kiali
kubectl port-forward svc/kiali -n istio-system 20001:20001 &
# Verify mesh health via API
curl -s http://localhost:20001/kiali/api/namespaces/ai-inference/health | jq .
# Check proxy sync status
istioctl proxy-status -n ai-inference
# Debug a specific pod sidecar config
istioctl proxy-config routes deploy/model-server -n ai-inference -o json
istioctl proxy-config cluster deploy/model-server -n ai-inference
Pitfalls to Avoid
- Aggressive timeouts that break streaming responses -- set 300s+ for generation endpoints
- Blanket retries that amplify expensive generation calls -- disable retries on non-idempotent routes
- Missing identity boundaries between tenant-facing and internal services
- Forgetting to exempt health check and metrics ports from strict mTLS
- Setting outlier ejection too aggressively on small pools (maxEjectionPercent too high)
- Not using
holdApplicationUntilProxyStartscausing race conditions on startup
Related Skills
- service-mesh - Foundational mesh concepts
- llm-gateway - North-south API gateway controls
- opentelemetry - End-to-end tracing and metrics
Weekly Installs
22
Repository
bagelhole/devop…t-skillsGitHub Stars
18
First Seen
5 days ago
Security Audits