debug-openshell-cluster
Debug OpenShell Gateway Deployment
Diagnose a gateway and its selected compute platform. Do not assume OpenShell provisions Kubernetes or runs a k3s container. OpenShell targets a reachable gateway endpoint backed by Docker, Podman, Kubernetes, or the experimental VM driver.
Use openshell first to identify the active endpoint. Then use the platform tools that match the gateway's compute driver: docker, podman, kubectl/helm, or VM driver logs.
Overview
The target deployment flow is:
- Operator starts or deploys the gateway.
- Operator configures the compute driver.
- Operator provides TLS and SSH relay material for the deployment mode.
- The CLI registers a reachable gateway endpoint with openshell gateway add.
- The gateway creates sandboxes through the selected compute driver.
For local evaluation only, TLS may be disabled and the gateway can be reached through http://127.0.0.1:<port>.
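As a rough illustration of the local-evaluation rule above, the sketch below labels an endpoint URL by scheme and host. The function name `endpoint_mode` and its output labels are hypothetical and not part of the openshell CLI; only the http://127.0.0.1:<port> form from the text is treated as local.

```shell
# Label an endpoint URL so plaintext use outside loopback stands out.
# Hypothetical helper: the name and labels are illustrative only.
endpoint_mode() {
  case "$1" in
    https://*)          echo "tls" ;;
    http://127.0.0.1:*) echo "plaintext-local" ;;
    http://*)           echo "plaintext-remote" ;;
    *)                  echo "unknown" ;;
  esac
}
```

For example, `endpoint_mode http://127.0.0.1:8080` prints `plaintext-local`. A `plaintext-remote` result suggests TLS was disabled outside a local evaluation setup.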
Prerequisites
- The openshell CLI must be available for endpoint checks.
- Know the active gateway name and endpoint, or be able to inspect local gateway metadata.
- Know the compute platform: Docker, Podman, Kubernetes, or VM.
- For Kubernetes: kubectl must target the cluster that hosts OpenShell, and Helm version 3 or later must be available.
- For Docker or Podman: the runtime socket must be reachable from the gateway host.
Workflow
Run diagnostics in order and stop once the root cause is clear.
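One way to apply the "in order, stop early" rule is a small runner that halts at the first failing check. This is a minimal sketch: `run_checks` is a hypothetical helper, and it treats a non-zero exit status as the signal to stop, which may or may not coincide with the root cause.

```shell
# Run each check in order and stop at the first one that fails.
# Each argument is one command line, executed with "sh -c".
run_checks() {
  step=1
  for check in "$@"; do
    if ! sh -c "$check" >/dev/null 2>&1; then
      echo "step $step failed: $check"
      return 1
    fi
    step=$((step + 1))
  done
  echo "all $# checks passed"
}
```

A possible invocation is `run_checks "openshell gateway info" "openshell status"`. Rerun the failing command by hand to see its full output.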
Step 1: Check CLI Reachability
openshell gateway info
openshell status
Common findings:
- No active gateway: register one with openshell gateway add <endpoint>.
- Connection refused: gateway process is not running, service exposure is wrong, or a port-forward/proxy is not active.
- TLS/certificate errors: CLI mTLS bundle does not match the gateway CA, or the gateway is running with unexpected TLS settings.
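The findings above can be sketched as a text classifier over captured CLI output. The matched substrings are assumptions about what the error text might contain, not documented openshell messages, and `classify_cli_error` is a hypothetical name.

```shell
# Classify captured CLI error text into one of the findings listed above.
# The substrings are assumptions; adjust them to the messages you actually see.
classify_cli_error() {
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *"no active gateway"*)           echo "no-active-gateway" ;;
    *"connection refused"*)          echo "connection-refused" ;;
    *certificate*|*tls*|*handshake*) echo "tls-mismatch" ;;
    *)                               echo "unclassified" ;;
  esac
}
```

A typical call would be `classify_cli_error "$(openshell status 2>&1)"`. An `unclassified` result means the raw output needs to be read directly.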
Step 2: Identify the Compute Platform
Use gateway metadata, deployment values, or the user's setup notes to identify the driver.
| Platform | Primary checks |
|---|---|
| Docker | Gateway process logs, Docker daemon health, sandbox containers, image pulls. |
| Podman | Podman socket, rootless networking, sandbox containers, image pulls. |
| Kubernetes | Helm release, StatefulSet, service, secrets, sandbox pods, events. |
| VM | VM driver logs, rootfs availability, host virtualization support. |
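Before running platform checks, it can help to confirm that the matching tools are installed. The sketch below maps each platform to the command-line tools this guide names. The function names and lowercase platform labels are illustrative, and the VM case lists no tool because the guide refers only to driver logs.

```shell
# Map a compute platform to the command-line tools named in this guide.
# Hypothetical helper; the platform labels are illustrative.
platform_tools() {
  case "$1" in
    docker)     echo "docker" ;;
    podman)     echo "podman" ;;
    kubernetes) echo "kubectl helm" ;;
    vm)         : ;;   # VM driver logs only; no CLI is named in this guide
    *)          echo "unknown platform: $1" >&2; return 1 ;;
  esac
}

# Print every required tool that is not found on PATH.
missing_tools() {
  for tool in $(platform_tools "$1"); do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

For example, `missing_tools kubernetes` prints nothing when both kubectl and helm are on PATH.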
Step 3: Check Docker-Backed Gateways
docker info
docker ps --filter name=openshell
docker logs <container> --tail=200
openshell status
Common findings:
- Docker daemon unavailable: start Docker Desktop or Docker Engine.
- Gateway process stopped: inspect exit status and logs.
- Sandbox image missing or pull denied: verify image reference and registry credentials.
- Sandbox never registers: check gateway logs and supervisor callback endpoint.
For source checkout development, restart the local gateway with:
mise run gateway:docker
Step 4: Check Podman-Backed Gateways
podman info
podman ps --filter name=openshell
podman logs <container> --tail=200
openshell status
Common findings:
- Podman socket unavailable: start or expose the user socket.
- Rootless networking unavailable: inspect Podman network configuration.
- Sandbox image missing or pull denied: verify image reference and registry credentials.
- Supervisor cannot call back: check callback endpoint and gateway logs.
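Docker and Podman accept the same `info` and `ps` syntax, so the first two checks of Steps 3 and 4 can share one helper. This is a minimal sketch: `check_runtime` is a hypothetical name, and the `name=openshell` filter is taken from the commands above.

```shell
# Run the shared first checks for a Docker- or Podman-backed gateway.
# Prints one finding line and returns non-zero when a check fails.
check_runtime() {
  rt="$1"   # "docker" or "podman"
  if ! "$rt" info >/dev/null 2>&1; then
    echo "$rt: daemon or socket unavailable"
    return 1
  fi
  names=$("$rt" ps --filter name=openshell --format '{{.Names}}')
  if [ -z "$names" ]; then
    echo "$rt: no running container matches name=openshell"
    return 1
  fi
  # $names is left unquoted on purpose so multiple names join on one line.
  echo "$rt: running:" $names
}
```

If `check_runtime podman` reports running containers, continue with `podman logs <container> --tail=200` on the names it printed.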
Step 5: Check Kubernetes Helm Gateways
helm -n openshell status openshell
helm -n openshell get values openshell
kubectl -n openshell get statefulset,pod,svc,pvc
kubectl -n openshell logs statefulset/openshell --tail=200
kubectl -n openshell rollout status statefulset/openshell
Look for failed installs, unexpected values, missing namespace, wrong image tag, TLS settings that do not match the registered endpoint, and scheduling failures.
Check required Helm deployment secrets:
kubectl -n openshell get secret \
openshell-ssh-handshake \
openshell-server-tls \
openshell-server-client-ca \
openshell-client-tls
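A single `kubectl get secret` with several names prints an error for each missing secret, but a loop makes the gaps easier to list. The sketch below assumes kubectl already targets the cluster that hosts OpenShell. `missing_secrets` is a hypothetical helper, and the secret names are the four listed above.

```shell
# Print each required secret that is absent from the namespace.
# The namespace defaults to "openshell", matching the commands in this guide.
missing_secrets() {
  ns="${1:-openshell}"
  for name in openshell-ssh-handshake openshell-server-tls \
              openshell-server-client-ca openshell-client-tls; do
    kubectl -n "$ns" get secret "$name" >/dev/null 2>&1 || echo "$name"
  done
}
```

Empty output means all four secrets exist. It does not confirm that their contents are valid.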
Check the image references currently used by the gateway deployment:
kubectl -n openshell get statefulset openshell -o jsonpath="{.spec.template.spec.containers[*].image}{\"\n\"}{.spec.template.spec.containers[*].env[?(@.name==\"OPENSHELL_SUPERVISOR_IMAGE\")].value}{\"\n\"}"
helm -n openshell get values openshell | grep -E 'repository|tag|supervisorImage'
The gateway image and server.supervisorImage should use the same build tag in branch and E2E deploys. A stale supervisor image can make sandbox behavior lag behind gateway policy or proto changes.
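The tag comparison can be sketched as two small helpers that operate on image reference strings. The names `image_tag` and `same_build_tag` are hypothetical, and treating an untagged reference as `latest` follows the usual container runtime default.

```shell
# Extract the tag from an image reference such as registry:5000/team/app:1.2.3.
# Only the last path component is inspected, so a registry port is not
# mistaken for a tag.
image_tag() {
  ref="${1%%@*}"      # drop any @sha256:... digest suffix
  last="${ref##*/}"   # keep the last path component
  case "$last" in
    *:*) echo "${last##*:}" ;;
    *)   echo "latest" ;;
  esac
}

# Succeed only when both image references carry the same tag.
same_build_tag() {
  [ "$(image_tag "$1")" = "$(image_tag "$2")" ]
}
```

Pass the two image references printed by the jsonpath command above to `same_build_tag`. A non-zero exit status points to a gateway and supervisor built from different tags.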
For plaintext local evaluation, confirm the chart has:
helm -n openshell get values openshell | grep -E 'disableTls|grpcEndpoint'
Expected shape:
server:
  disableTls: true
  grpcEndpoint: http://openshell.openshell.svc.cluster.local:8080
Check service exposure:
kubectl -n openshell get svc openshell -o wide
kubectl -n openshell get endpoints openshell
For local port-forward testing:
kubectl -n openshell port-forward svc/openshell 8080:8080
openshell gateway add http://127.0.0.1:8080 --local --name local
openshell status
If the gateway is healthy but sandbox creation fails:
kubectl -n openshell get pods
kubectl -n openshell get events --sort-by=.lastTimestamp | tail -n 50
kubectl -n openshell logs statefulset/openshell --tail=200
Check the configured sandbox namespace:
helm -n openshell get values openshell | grep sandboxNamespace
Then inspect sandbox resources in that namespace.
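To reuse the namespace in later commands, the value can be pulled out of the Helm values text. This sketch assumes the key appears on its own line as `sandboxNamespace: <value>`; the exact position of that key in the chart is not confirmed here, and `sandbox_namespace` is a hypothetical helper.

```shell
# Read Helm values YAML on stdin and print the first sandboxNamespace value.
# Assumes a simple "sandboxNamespace: value" line; quotes are stripped.
sandbox_namespace() {
  sed -n 's/^[[:space:]]*sandboxNamespace:[[:space:]]*//p' | head -n 1 | tr -d "\"'"
}
```

A possible use is `ns=$(helm -n openshell get values openshell --all | sandbox_namespace)` followed by `kubectl -n "$ns" get pods`. The `--all` flag includes chart defaults, which matters when the namespace was never overridden.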
Step 6: Check VM-Backed Gateways
Use the VM driver logs and host diagnostics available in the user's environment. Verify:
- The VM driver process is running and reachable by the gateway.
- The runtime rootfs exists and matches the expected architecture.
- Host virtualization support is enabled.
- The sandbox supervisor can establish its callback connection to the gateway.
Then run:
openshell status
openshell logs <sandbox-name>
Common Failure Patterns
| Symptom | Likely cause | Check |
|---|---|---|
| openshell status fails | Gateway endpoint unreachable or auth mismatch | openshell gateway info, gateway logs |
| Gateway starts but sandbox create fails | Compute driver cannot reach runtime | Docker/Podman/Kubernetes/VM driver logs |
| Docker or Podman sandbox never registers | Wrong callback endpoint or supervisor startup failure | Gateway logs and sandbox container logs |
| Kubernetes gateway pod pending | PVC unbound, taint, selector, or insufficient resources | kubectl -n openshell describe pod <pod> |
| Kubernetes gateway pod crash loops | Missing secret, bad DB URL, bad TLS config | kubectl -n openshell logs statefulset/openshell |
| CLI TLS error | Local mTLS bundle does not match server cert/CA | Check ~/.config/openshell/gateways/<name>/mtls/ |
| Image pull failure | Gateway or sandbox image cannot be pulled | Runtime events and image pull credentials |
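For the CLI TLS row, a quick first check is whether the local mTLS directory exists and contains anything. The sketch below uses the path from the table. `mtls_bundle_status` is a hypothetical helper, and it does not inspect file names or certificate contents, since those are not specified here.

```shell
# Report whether the local mTLS directory for a gateway exists and is non-empty.
# $1 is the registered gateway name.
mtls_bundle_status() {
  dir="$HOME/.config/openshell/gateways/$1/mtls"
  if [ ! -d "$dir" ]; then
    echo "missing: $dir"
    return 1
  fi
  if [ -z "$(ls -A "$dir")" ]; then
    echo "empty: $dir"
    return 1
  fi
  echo "present: $dir"
}
```

A `present` result only rules out a missing bundle. A CA mismatch still has to be checked against the server certificate.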
Reporting
When handing results back to the user, include:
- Active gateway endpoint and auth mode.
- Compute platform and driver.
- Gateway process or workload status.
- Recent gateway log summary.
- Missing or malformed TLS or SSH relay material.
- Service exposure status.
- Sandbox workload status.
- The exact command that failed and the shortest fix.