Agent Sandbox — Kubernetes Operator for AI Agent Runtimes

kubernetes-sigs/agent-sandbox is a SIG Apps subproject that provides a Kubernetes-native primitive for isolated, stateful, singleton workloads — the shape needed for AI agent runtimes, code interpreters, computer-use browsers, dev sandboxes, and per-user Jupyter notebooks. It fills the gap between Deployment (stateless, replicated) and StatefulSet (numbered, replicated) by modelling a long-lived pod that can be paused, resumed, scheduled for expiry, and optionally pre-warmed.

The current API version is v1alpha1. The latest release at the time of writing is v0.3.10 (April 2026). The project launched at KubeCon Atlanta in November 2025, so most training data predates it; prefer this skill over guessing.

The Four CRDs

| CRD | API group | Purpose | Who creates it |
|---|---|---|---|
| Sandbox | agents.x-k8s.io/v1alpha1 | The singleton pod + headless service + PVCs. Supports replicas: 0 (paused) or 1 (running). | Users directly, or the SandboxClaim controller |
| SandboxTemplate | extensions.agents.x-k8s.io/v1alpha1 | Reusable pod blueprint + shared NetworkPolicy per template. | Platform / infra team |
| SandboxClaim | extensions.agents.x-k8s.io/v1alpha1 | Ticket-style request that binds to a template and adopts a warm-pool pod or creates a fresh one. Carries shutdownTime. | Application backend |
| SandboxWarmPool | extensions.agents.x-k8s.io/v1alpha1 | Pool of pre-warmed Sandbox CRs (as of v0.3.10 — no longer bare pods). HPA-friendly via scale subresource. | Platform / infra team |

The extensions CRDs are the normal way to use the operator in production. Raw Sandbox is available, but you lose warm pools, claim lifecycle, and network policy management.

Install

No Helm chart exists yet (tracked in upstream issue #483). Use the release manifests:

export VERSION="v0.3.10"
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${VERSION}/manifest.yaml
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${VERSION}/extensions.yaml

The controller lands in the agent-sandbox-system namespace with cluster-scoped RBAC (no namespace-scoped deployment yet; issue #484). For production, overlay the Deployment args with Kustomize; the defaults are very conservative.

Minimum controller args for any serious workload:

args:
  - --extensions # enables Template/Claim/WarmPool
  - --kube-api-qps=100
  - --kube-api-burst=200
  - --sandbox-concurrent-workers=25
  - --sandbox-claim-concurrent-workers=100
  - --sandbox-warm-pool-concurrent-workers=50
  - --sandbox-template-concurrent-workers=10

Defaults are 1 worker per controller and 10 burst QPS. That will throttle out instantly under any multi-tenant load.
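
A minimal Kustomize overlay for those args, as a sketch: it assumes the release manifests have been downloaded locally, that the controller Deployment is named agent-sandbox-controller (matching the log command in the cheat sheet below), and that the controller is the first container in the pod — verify all three against manifest.yaml.

# kustomization.yaml — sketch under the assumptions above
resources:
  - manifest.yaml      # downloaded release manifests from the Install step
  - extensions.yaml
patches:
  - target:
      kind: Deployment
      name: agent-sandbox-controller   # name assumed from the cheat-sheet log command
      namespace: agent-sandbox-system
    patch: |-
      - op: replace
        path: /spec/template/spec/containers/0/args   # assumes the controller is container 0
        value:
          - --extensions
          - --kube-api-qps=100
          - --kube-api-burst=200
          - --sandbox-concurrent-workers=25
          - --sandbox-claim-concurrent-workers=100
          - --sandbox-warm-pool-concurrent-workers=50
          - --sandbox-template-concurrent-workers=10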

Which Shape Do You Need?

digraph choose {
    rankdir=LR; node [shape=box, style=rounded];
    q1 [label="Untrusted\ncode in pod?"];
    q2 [label="Need fast cold start\n(<1s)?"];
    q3 [label="One-off or shared\npool?"];
    q4 [label="Template used\nby >1 pod?"];

    q1 -> "gVisor or Kata\nruntimeClassName" [label=" yes"];
    q1 -> "standard runtime\n(still isolate via NetPol + nonroot)" [label=" no"];

    q2 -> "SandboxWarmPool\n+ SandboxClaim" [label=" yes"];
    q2 -> "SandboxClaim\n(warmpool: none)" [label=" no"];

    q3 -> "SandboxTemplate\n+ SandboxClaim" [label=" shared"];
    q3 -> "Bare Sandbox" [label=" one-off"];

    q4 -> "SandboxTemplate" [label=" yes"];
    q4 -> "inline PodTemplate" [label=" no"];
}

Hello World

The absolute minimum — bare Sandbox, no template, no claim:

apiVersion: agents.x-k8s.io/v1alpha1
kind: Sandbox
metadata:
  name: hello
spec:
  podTemplate:
    spec:
      containers:
        - name: app
          image: ghcr.io/example/agent-runtime:latest
      restartPolicy: Never

Apply it, then run kubectl get sandbox hello -o jsonpath='{.status.serviceFQDN}' to get the in-cluster DNS name. Hit the runtime at {serviceFQDN}:<containerPort>.

Template → Claim → Warm Pool (the real shape)

# 1. One-time: the template (shared across all claims)
apiVersion: extensions.agents.x-k8s.io/v1alpha1
kind: SandboxTemplate
metadata:
  name: code-interp
spec:
  networkPolicyManagement: Managed # or "Unmanaged" if Cilium owns networking
  podTemplate:
    spec:
      runtimeClassName: gvisor # isolation for untrusted code
      automountServiceAccountToken: false # default true in k8s, defaults false here
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
      containers:
        - name: runtime
          image: ghcr.io/example/runtime:v1
          ports:
            - containerPort: 8080
  networkPolicy:
    # default-deny applies; add explicit rules here
    egress:
      - ports:
          - protocol: UDP
            port: 53
          - protocol: TCP
            port: 53
---
# 2. One-time: the warm pool (scaled by HPA)
apiVersion: extensions.agents.x-k8s.io/v1alpha1
kind: SandboxWarmPool
metadata:
  name: code-interp-pool
spec:
  replicas: 3
  sandboxTemplateRef:
    name: code-interp
---
# 3. Per request: the claim (adopts a warm pod, sets expiry)
apiVersion: extensions.agents.x-k8s.io/v1alpha1
kind: SandboxClaim
metadata:
  name: session-abc123
spec:
  sandboxTemplateRef:
    name: code-interp
  warmpool: default # "none" | "default" | "<pool-name>"
  lifecycle:
    shutdownTime: "2026-04-20T18:00:00Z"
    shutdownPolicy: Delete # Delete | DeleteForeground | Retain

SandboxClaim.status.sandbox.name resolves to the assigned Sandbox in under a second when a warm-pool pod is available. SandboxClaim.status.conditions[Ready]=True means the pod answered its readiness probe.

Note the lowercase .name, not .Name. This changed in v0.3.10, and old code may still reference the capitalized form.
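
For orientation, a bound claim's status looks roughly like this — field names are the ones documented above, but the shape and values are illustrative, not copied from the CRD:

# Illustrative fragment only; values are made up
status:
  sandbox:
    name: code-interp-pool-7xk2q      # lowercase .name as of v0.3.10
  conditions:
    - type: Ready
      status: "True"                  # pod answered its readiness probe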

CLI Cheat Sheet

| Task | Command |
|---|---|
| List sandboxes in a namespace | kubectl get sandboxes.agents.x-k8s.io -n <ns> |
| List claims | kubectl get sandboxclaims.extensions.agents.x-k8s.io -n <ns> |
| Inspect a claim | kubectl describe sandboxclaim <name> -n <ns> |
| Find the pod for a claim | kubectl get pods -n <ns> -l agents.x-k8s.io/claim-uid=<uid> |
| Tail controller logs | kubectl logs -n agent-sandbox-system deploy/agent-sandbox-controller -f |
| Pause a sandbox | kubectl patch sandbox <name> -p '{"spec":{"replicas":0}}' --type=merge |
| Resume | kubectl patch sandbox <name> -p '{"spec":{"replicas":1}}' --type=merge |
| Force expiry | kubectl patch sandboxclaim <name> -p '{"spec":{"lifecycle":{"shutdownTime":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}}}' --type=merge |
| Controller metrics | kubectl port-forward -n agent-sandbox-system svc/agent-sandbox-controller-metrics 8080:8080, then curl localhost:8080/metrics |

Short name: swp for SandboxWarmPool. Use kubectl api-resources --api-group=extensions.agents.x-k8s.io (and --api-group=agents.x-k8s.io for the core Sandbox) to list the aliases.

Pod Lifecycle & replicas: 0/1

spec.replicas is constrained to 0 or 1. It is not a scaling primitive.

| replicas | Effect |
|---|---|
| 1 (default) | Pod running, PVCs attached, Service resolves |
| 0 | Pod deleted; PVCs + Service + Sandbox CR preserved. This is pause. |

Setting replicas: 0 then back to 1 re-creates a pod that mounts the same PVCs, giving effective resume semantics. Useful for idle sessions that get reconnected.

shutdownTime (on either Sandbox or SandboxClaim) is a wall-clock deadline; once it passes, shutdownPolicy determines what happens. The two resources accept different enums:

| Policy | On Sandbox | On SandboxClaim |
|---|---|---|
| Delete | Sandbox (and its Pod / Service / PVCs) deleted | Claim deleted, cascading to the backing Sandbox |
| DeleteForeground | Not valid — the Sandbox enum is Delete / Retain only | Foreground-cascade delete — the Claim remains with a deletionTimestamp until its backing Sandbox + Pod are fully terminated, so external systems can watch shutdown progress. New in v0.3.10 |
| Retain | Pod / Service deleted, Sandbox CR kept with Expired status | Backing Sandbox deleted, Claim kept with Expired status |

Defaults are not symmetric. Sandbox.spec.lifecycle.shutdownPolicy is a pointer; when it is nil the controller retains on expiry and only deletes when Delete is explicit. SandboxClaim.spec.lifecycle.shutdownPolicy is a plain string whose empty value is treated as Delete. If you want a Sandbox to self-delete on expiry, set Delete explicitly.
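
If that behavior matters, here is a minimal sketch of an explicit policy on a bare Sandbox, assuming its lifecycle block mirrors the claim's field names shown above:

apiVersion: agents.x-k8s.io/v1alpha1
kind: Sandbox
metadata:
  name: one-off-session
spec:
  lifecycle:
    shutdownTime: "2026-04-20T18:00:00Z"
    shutdownPolicy: Delete            # explicit; a nil policy retains the Sandbox on expiry
  podTemplate:
    spec:
      containers:
        - name: app
          image: ghcr.io/example/agent-runtime:latest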

Network Policy — Secure-by-Default

Since v0.2.1, when networkPolicyManagement: Managed and spec.networkPolicy is omitted, the controller applies a secure-by-default posture:

  • Ingress: only from pods labeled app: sandbox-router (the SDK's router service).
  • Egress: allowed to 0.0.0.0/0 except RFC1918 (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) and link-local (169.254.0.0/16, covers the node metadata server). Cluster-internal services and the Kubernetes API are effectively blocked.
  • DNS: the controller rewrites the pod's dnsPolicy to None and injects dnsConfig.nameservers: [8.8.8.8, 1.1.1.1]. You do not need to add a port 53 egress rule in this default mode — resolution happens against public DNS over the allowed external egress.

You opt out of this posture by setting any custom spec.networkPolicy. Once you do, you own the ruleset end-to-end and the DNS-injection shortcut no longer applies — add UDP/TCP 53 egress explicitly, plus any cluster-internal destinations you need.

Per-template, not per-sandbox: one shared NetworkPolicy covers every Sandbox from the template, with per-sandbox isolation enforced via the agents.x-k8s.io/claim-uid label selector.

Common footgun: Istio / Datadog / Alloy sidecars need explicit ingress rules whenever you provide a custom networkPolicy. Without them, their health-check ports are blocked silently and the pod looks stuck at "not ready."
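
A rough sketch of what a custom policy in the template needs to cover, under two assumptions that are not confirmed by this doc: the ingress field mirrors the egress shape shown in the template example above, and port 15021 stands in for whatever health-check port your sidecar actually uses.

# Sketch only — ingress shape and the 15021 port are assumptions
spec:
  networkPolicyManagement: Managed
  networkPolicy:
    egress:
      - ports:                        # you own DNS once the policy is custom
          - protocol: UDP
            port: 53
          - protocol: TCP
            port: 53
    ingress:
      - ports:                        # hypothetical sidecar health-check port
          - protocol: TCP
            port: 15021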

See references/patterns.md for worked network policy examples including the custom-policy case.

Isolation Runtime Selector

| runtimeClassName | Kernel | Cold start | Untrusted code? | Notes |
|---|---|---|---|---|
| (unset) | Shared | ~100ms | Never | Only for fully trusted workloads |
| gvisor | User-space | ~500ms | Yes | Standard answer. Works on most clouds. Direct kubectl port-forward is incompatible — use the sandbox-router service |
| kata-qemu | Dedicated VM | ~1-2s | Yes | Requires nested virt. GKE needs N2 Intel + Ubuntu nodes (not COS, not N2D) |
| kata-vm-isolation | Dedicated VM | ~1-2s | Yes | AKS-flavored Kata. Similar constraints |

Warm pools effectively zero out cold-start cost by pre-paying it: a SandboxClaim adopts a ready pod with sub-millisecond dispatch latency (per KEP-174).

Client SDKs at a Glance

Python — pip install k8s-agent-sandbox

from k8s_agent_sandbox import SandboxClient
from k8s_agent_sandbox.models import SandboxGatewayConnectionConfig

client = SandboxClient(connection_config=SandboxGatewayConnectionConfig(
    gateway_name="sandbox-gateway",
    gateway_namespace="agent-sandbox-system",
))
sbx = client.create_sandbox(template="code-interp", namespace="default")
result = sbx.commands.run("python -c 'print(2+2)'")
sbx.files.write("out.txt", b"hello")       # plain filename, not full path
sbx.terminate()

Three connection modes, importable from k8s_agent_sandbox.models: SandboxGatewayConnectionConfig (prod, via a cloud LB), SandboxLocalTunnelConnectionConfig (dev, auto-tunnels via kubectl port-forward), and SandboxDirectConnectionConfig (in-cluster, takes api_url). All reach the pod through the sandbox-router service, which is a one-time cluster setup.

Go — go get sigs.k8s.io/agent-sandbox/clients/go/sandbox

s, err := sandbox.New(ctx, sandbox.Options{
    TemplateName: "code-interp",
    Namespace:    "default",
})
if err != nil { return err }
if err := s.Open(ctx); err != nil { return err }
defer s.Close(ctx)

res, _ := s.Run(ctx, "ls /workspace")
data, _ := s.Read(ctx, "/workspace/out.txt")
s.Write(ctx, "in.txt", payload)             // Write takes a plain filename only — not a path

Options doesn't expose SandboxClaim lifecycle or warm-pool policy fields as of v0.3.10. If you need to set shutdownTime or pin to a specific warm pool, patch the SandboxClaim after Open() via the K8sHelper client.

Sentinel errors to branch on: ErrNotReady, ErrOrphanedClaim, ErrPortForwardDied, ErrSandboxDeleted, ErrClaimFailed. All defined in types.go. Full catalog in references/clients.md.

Anti-Patterns

| Anti-pattern | Why it bites | Fix |
|---|---|---|
| Setting spec.replicas: 3 to scale | CRD validation rejects the manifest (replicas is constrained to 0 or 1) | Use multiple SandboxClaims or scale a SandboxWarmPool |
| Writing SandboxClaim.status.sandbox.Name (capital N) | Broke in v0.3.10; lowercase name is current | Code against .name; keep a fallback to .Name only during a staged upgrade |
| Shared PodDisruptionBudget across warm-pool and claim-backed pods | Controller hits a sync race during image rollout, ArgoCD PostSync hooks block, warm pool goes stale | Scope the PDB with matchExpressions: agents.x-k8s.io/claim-uid Exists so it covers claim-backed pods only (sketch after this table) |
| Deploying warm-pool sandboxes on a single node | No pod anti-affinity, so a node or AZ failure takes out the whole pool | Add topologySpreadConstraints on topology.kubernetes.io/zone (sketch after this table) |
| automountServiceAccountToken: true in a SandboxTemplate | The default is false here; setting true hands agents a cluster API token | Leave it false; if the agent needs API access, issue a scoped JWT via an app-layer bridge |
| Forgetting DNS in a custom networkPolicy.egress | Pod can't resolve anything; every external call hangs | Always open UDP+TCP 53 when you override the default policy. The secure-by-default posture handles DNS for you by injecting public nameservers; custom policies do not |
| automountServiceAccountToken + runAsUser: 0 in a hardened template | Defeats the whole isolation story | runAsNonRoot: true, drop all capabilities, readOnlyRootFilesystem: true |
| Patching sandbox images via kubectl set image | The controller owns the pod — mutations get reconciled away | Update the SandboxTemplate, then trigger a warm pool refresh |
| Direct kubectl port-forward on a gVisor sandbox | Incompatible with user-space networking | Go through the sandbox-router service, or use the SDK's tunnel mode |
| Relying on stable warm-pool pod names | Warm pool adopts then re-labels via the agents.x-k8s.io/pod-name annotation | Always discover pods via the agents.x-k8s.io/claim-uid label |
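
For the two PDB / topology rows above, a sketch that uses only label keys named elsewhere in this doc. The PodDisruptionBudget is standard policy/v1 API; the topology-spread selector keys off the agents.x-k8s.io/pool label mentioned under Upgrade Hazards, using Exists so nothing is assumed about its value.

# Sketch: PDB that covers claim-backed pods only
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: code-interp-claimed
spec:
  maxUnavailable: 1
  selector:
    matchExpressions:
      - key: agents.x-k8s.io/claim-uid
        operator: Exists
---
# Sketch: zone spread for warm-pool pods, added to the SandboxTemplate from earlier
apiVersion: extensions.agents.x-k8s.io/v1alpha1
kind: SandboxTemplate
metadata:
  name: code-interp
spec:
  podTemplate:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchExpressions:
              - key: agents.x-k8s.io/pool    # pool label from Upgrade Hazards; Exists avoids guessing its value
                operator: Exists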

Upgrade Hazards

| From → To | What breaks |
|---|---|
| v0.1.x → v0.2.1 | Controller moved StatefulSet → Deployment: kubectl delete statefulset agent-sandbox-controller -n agent-sandbox-system before applying the v0.2.1 manifests. Metrics port changed 80 → 8080. NetworkPolicy became default-deny — pods that used to reach cluster services now can't |
| v0.2.x → v0.3.10 | SandboxWarmPool now creates Sandbox CRs, not bare Pods. Pre-existing pool pods are orphans (labeled agents.x-k8s.io/pool, no owner ref) — clean them up manually. SandboxClaim.status.sandbox.Name became .name (case change). Watch for regression #611: if a managed pod is deleted externally, the controller enters an error loop until the Sandbox CR is recreated |

References

For deep field-level CRD reference (every spec/status field, annotations, labels, validation rules), see references/crds.md.

For production patterns — warm pool HPA setup, PDB scoping, image rollout refresh, Karpenter integration, multi-tenant isolation, observability, load testing — see references/patterns.md.

For the Python and Go client SDKs — all connection modes, timeouts, error catalog, snapshot/resume, router deployment — see references/clients.md.

What This Skill is NOT

  • Not a Kubernetes tutorial. It assumes familiarity with CRDs, controllers, pods, services, NetworkPolicy, and RBAC.
  • Not a replacement for the upstream docs at https://agent-sandbox.sigs.k8s.io/docs/. Use this for decision-making shortcuts; fetch upstream when you need the authoritative API surface.
  • Not specific to Gradial. This is the operator itself. Gradial's packages/server/sandbox and apps/sandbox-worker are consumers built on top.
  • Not for stateless agent execution. If you just need a short-lived function call without persistent workspace or stable identity, a Kubernetes Job or a third-party service (E2B, Modal) is lighter.
  • Not a multi-tenancy isolation story by itself. The operator gives you the primitives; you still need pod-level runtimeClassName, namespace separation, quota, and network policy discipline to actually isolate tenants.