Note: If you only need to run the AgentJet client and do not need to run models locally (for example, on a laptop), installing the basic AgentJet requirements is enough (pip install -e .). See the install-agentjet-client skill.

AgentJet Docker Installation Skill

This skill guides you through installing and running the AgentJet Swarm Server in a Docker container with GPU support.

Prerequisites Checklist

Before proceeding, verify:

GPU Available: System has NVIDIA GPU(s)
Docker Installed: Docker is available
NVIDIA Container Toolkit: nvidia-docker2 or nvidia-container-toolkit is installed
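
The checklist above can be partly automated. The sketch below is a minimal preflight helper; the name `check_prerequisites` is hypothetical and not part of AgentJet. It only checks that the `nvidia-smi` and `docker` executables are on `PATH`, which is necessary but not sufficient (it does not prove the NVIDIA Container Toolkit is configured).

```python
import shutil
from typing import Callable, Dict, Optional

# Executables the checklist depends on (illustrative selection).
REQUIRED_TOOLS = ("nvidia-smi", "docker")


def check_prerequisites(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, bool]:
    """Return {tool: found_on_PATH} for each required tool."""
    return {tool: which(tool) is not None for tool in REQUIRED_TOOLS}


if __name__ == "__main__":
    for tool, found in check_prerequisites().items():
        print(f"{tool}: {'found' if found else 'MISSING'}")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use, call the function with no arguments.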

Step 1: Check GPU

nvidia-smi

If this fails, the system may not have NVIDIA drivers or GPU hardware.
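
For a scripted check, `nvidia-smi` can print one CSV line per GPU with `--query-gpu=name,memory.total --format=csv,noheader`. The sketch below parses that output. `parse_gpu_list` and `list_gpus` are hypothetical helper names, and treating a missing binary or a non-zero exit code as "no usable GPU" is a simplifying assumption.

```python
import subprocess
from typing import List, Tuple

QUERY = ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"]


def parse_gpu_list(csv_text: str) -> List[Tuple[str, str]]:
    """Parse 'name, memory' lines into (name, memory) tuples."""
    gpus = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        # Split on the last comma so a comma inside the name is preserved.
        name, _, memory = line.rpartition(",")
        gpus.append((name.strip(), memory.strip()))
    return gpus


def list_gpus() -> List[Tuple[str, str]]:
    """Run nvidia-smi; return [] if it is missing or exits with an error."""
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_list(result.stdout)
```

An empty list from `list_gpus()` corresponds to the failure case described above.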

Step 2: Install Docker

sudo apt update
sudo apt install -y docker.io curl

Step 3: Install NVIDIA Container Toolkit

# Alternative to Step 2: install Docker with the official convenience script (skip if Docker is already installed)
curl https://get.docker.com | sh \
  && sudo systemctl --now enable docker

# Add NVIDIA repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install nvidia-docker2
sudo apt-get update
sudo apt-get install -y nvidia-docker2

# Restart Docker daemon
sudo systemctl restart docker

Step 4: Configure Docker Mirror (Optional - For Slow Image Pulls)

If pulling Docker images is too slow, configure a mirror registry:

Option A: Configure via daemon.json

# Write the Docker daemon config (this overwrites any existing /etc/docker/daemon.json)
sudo mkdir -p /etc/docker
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "registry-mirrors": [
    "https://docker.1ms.run",
    "https://docker.xuanyuan.me"
  ]
}
EOF

# Restart Docker
sudo systemctl daemon-reload
sudo systemctl restart docker
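
Writing daemon.json with `tee` replaces the whole file, so any settings already present (for example a `runtimes` entry added by another installer) would be lost. The sketch below shows one way to merge the mirrors into an existing config instead. `merge_registry_mirrors` is a hypothetical helper, and it assumes the file is plain JSON.

```python
import json
from typing import Iterable


def merge_registry_mirrors(config_text: str, mirrors: Iterable[str]) -> str:
    """Add mirrors to a daemon.json document, keeping all other keys."""
    config = json.loads(config_text) if config_text.strip() else {}
    merged = list(config.get("registry-mirrors", []))
    for mirror in mirrors:
        if mirror not in merged:  # keep existing order, avoid duplicates
            merged.append(mirror)
    config["registry-mirrors"] = merged
    return json.dumps(config, indent=2) + "\n"
```

Read the current file contents (or pass an empty string if the file does not exist), write the returned text back with root privileges, then restart Docker as shown above.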

Option B: Pull via Mirror URL Directly

For ghcr.io images, use a mirror prefix:

# Original (may be slow)
docker pull ghcr.io/modelscope/agentjet:main

# Using mirror (faster in China)
docker pull ghcr.modelscope.cn/modelscope/agentjet:main

# Or use a Docker Hub mirror
docker pull docker.1ms.run/modelscope/agentjet:main
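
The prefix substitution in Option B can be expressed as a small function. This is an illustrative sketch: `use_mirror` is a hypothetical name, and the rule for deciding whether the first path component is a registry host (it contains a dot or colon, or is `localhost`) is a simplified version of how image references are usually interpreted.

```python
def use_mirror(image: str, mirror_host: str) -> str:
    """Swap the registry host of an image reference for mirror_host."""
    first, sep, rest = image.partition("/")
    # Treat the first component as a registry host only if it looks like one.
    has_registry = bool(sep) and ("." in first or ":" in first or first == "localhost")
    path = rest if has_registry else image
    return f"{mirror_host}/{path}"
```

For example, `use_mirror("ghcr.io/modelscope/agentjet:main", "ghcr.modelscope.cn")` yields the mirrored reference used in the pull command above. Whether a given mirror actually serves the image still has to be checked by pulling it.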

Popular Mirror Registries

| Mirror | Region | Note |
| --- | --- | --- |
| `docker.1ms.run` | China | General Docker Hub mirror |
| `docker.xuanyuan.me` | China | Alternative mirror |
| `ghcr.modelscope.cn` | China | GitHub Container Registry mirror |
| `registry.docker-cn.com` | China | Official Docker China mirror (may no longer be reachable) |

Verify Mirror Configuration

docker info | grep -A 5 "Registry Mirrors"
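
For an automated check, the mirror list can also be read as JSON. The sketch below assumes that `docker info --format '{{json .RegistryConfig.Mirrors}}'` prints the active mirrors as a JSON array on your Docker version; verify that before relying on it. `missing_mirrors` is a hypothetical helper that compares this output with the mirrors you expected to be configured.

```python
import json
from typing import Iterable, List


def missing_mirrors(mirrors_json: str, expected: Iterable[str]) -> List[str]:
    """Return the expected mirrors that are absent from Docker's reported list."""
    active = json.loads(mirrors_json) or []  # tolerate a JSON null (no mirrors)
    # Compare without trailing slashes, since the reported form may include one.
    normalized = {mirror.rstrip("/") for mirror in active}
    return [mirror for mirror in expected if mirror.rstrip("/") not in normalized]
```

An empty result means every expected mirror is active; anything else suggests daemon.json was not picked up or Docker was not restarted.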

Step 5: Verify GPU Support in Docker

docker run --rm --gpus=all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

Step 6: Prepare Model Weights

Download LLM model weights locally (e.g., Qwen2.5-7B-Instruct):

# Example using modelscope
pip install modelscope
modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ./Qwen2.5-7B-Instruct
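
Before mounting the directory into the container, it can help to confirm that the download completed. The sketch below is a rough sanity check, not an AgentJet feature: `model_dir_problems` is a hypothetical name, and it assumes the usual checkpoint layout with a `config.json` plus `*.safetensors` or `*.bin` weight files in the top-level directory.

```python
from pathlib import Path
from typing import List


def model_dir_problems(model_dir: str) -> List[str]:
    """Return a list of human-readable problems; empty means the dir looks usable."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not (root / "config.json").is_file():
        problems.append("config.json is missing")
    weights = list(root.glob("*.safetensors")) + list(root.glob("*.bin"))
    if not weights:
        problems.append("no *.safetensors or *.bin weight files found")
    return problems
```

Calling `model_dir_problems("./Qwen2.5-7B-Instruct")` after the download should return an empty list if the expected files are present.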

Step 7: Run AgentJet Swarm Server

# Create directories for logs and experiments
mkdir -p ./swarmlog ./swarmexp

# Run AgentJet Swarm Server
docker run --rm -it \
  -v /path/to/host/Qwen/Qwen2.5-7B-Instruct:/Qwen/Qwen2.5-7B-Instruct \
  -v "$(pwd)/swarmlog":/workspace/log \
  -v "$(pwd)/swarmexp":/workspace/saved_experiments \
  -p 10086:10086 \
  -e SWANLAB_API_KEY="$SWANLAB_API_KEY" \
  --gpus=all \
  --shm-size=32GB \
  ghcr.io/modelscope/agentjet:main \
  bash -c "(ajet-swarm overwatch) & (NO_COLOR=1 LOGURU_COLORIZE=NO ajet-swarm start &>/workspace/log/swarm_server.log)"

Flag Explanations

| Flag | Purpose |
| --- | --- |
| `--rm` | Auto-remove container on exit |
| `-it` | Interactive TTY for the TUI monitor |
| `-v <host>:<container>` | Bind-mount host directories (model weights, logs, saved experiments) into the container |
| `-p 10086:10086` | Expose API port for Swarm Clients |
| `--gpus=all` | Use all available GPUs |
| `--shm-size=32GB` | Shared memory for large model inference |
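
When the server is launched from a script, it can be convenient to assemble the same command as an argument list. The sketch below mirrors the flags explained above; `build_docker_run` is a hypothetical helper. It resolves host directories to absolute paths (bind mounts are most portable that way) and passes `-e SWANLAB_API_KEY` without a value, which makes Docker forward the variable from the caller's environment.

```python
import os
import shlex
from typing import List

IMAGE = "ghcr.io/modelscope/agentjet:main"
SERVER_CMD = (
    "(ajet-swarm overwatch) & "
    "(NO_COLOR=1 LOGURU_COLORIZE=NO ajet-swarm start "
    "&>/workspace/log/swarm_server.log)"
)


def build_docker_run(model_host_dir: str, model_container_dir: str,
                     log_dir: str = "./swarmlog", exp_dir: str = "./swarmexp",
                     port: int = 10086) -> List[str]:
    """Assemble the Step 7 command as an argv list with absolute mount paths."""
    def mount(host: str, container: str) -> str:
        return f"{os.path.abspath(host)}:{container}"

    return [
        "docker", "run", "--rm", "-it",
        "-v", mount(model_host_dir, model_container_dir),
        "-v", mount(log_dir, "/workspace/log"),
        "-v", mount(exp_dir, "/workspace/saved_experiments"),
        "-p", f"{port}:{port}",
        "-e", "SWANLAB_API_KEY",  # value is taken from the caller's environment
        "--gpus=all", "--shm-size=32GB",
        IMAGE, "bash", "-c", SERVER_CMD,
    ]


if __name__ == "__main__":
    argv = build_docker_run("/path/to/host/Qwen/Qwen2.5-7B-Instruct",
                            "/Qwen/Qwen2.5-7B-Instruct")
    print(shlex.join(argv))  # inspect the command before running it
```

The list can be passed to `subprocess.run(argv)` once the printed command looks right.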

Step 8: Verify Deployment

After launch, you should see the ajet-swarm overwatch TUI showing server state transitions:

OFFLINE -> BOOTING -> ROLLING -> WEIGHT_SYNCING -> ROLLING -> ...

The server enters BOOTING only after a Swarm Client sends a training configuration.
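
If you record the states shown by the monitor, the expected order can be checked mechanically. The sketch below encodes only the transitions visible in the sequence above; AgentJet may define further states or transitions, so treat the `ALLOWED` table and the helper name `first_unexpected_transition` as assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

# Transitions read off the documented sequence; the real server may allow others.
ALLOWED = {
    "OFFLINE": {"BOOTING"},
    "BOOTING": {"ROLLING"},
    "ROLLING": {"WEIGHT_SYNCING"},
    "WEIGHT_SYNCING": {"ROLLING"},
}


def first_unexpected_transition(states: Sequence[str]) -> Optional[Tuple[str, str]]:
    """Return the first (previous, next) pair not listed in ALLOWED, or None."""
    for prev, nxt in zip(states, states[1:]):
        if nxt not in ALLOWED.get(prev, set()):
            return (prev, nxt)
    return None
```

A result of `None` means the observed sequence matches the documented cycle.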

Step 9: Connect Swarm Client (Optional)

From any machine that can reach the server:

from ajet.tuner_lib.experimental.swarm_client import SwarmClient
from ajet.copilot.job import AgentJetJob

swarm_worker = SwarmClient("http://<server-ip>:10086")
swarm_worker.auto_sync_train_config_and_start_engine(
    AgentJetJob(
        algorithm="grpo",
        n_gpu=8,
        model="/Qwen/Qwen2.5-7B-Instruct",  # Container-side path
        batch_size=32,
        num_repeat=4,
    )
)
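
Before creating the client, it can be useful to confirm that the server port is reachable from the client machine. The sketch below uses only the Python standard library; `port_is_open` is a hypothetical helper and checks TCP reachability only, not whether the Swarm Server behind the port is healthy.

```python
import socket


def port_is_open(host: str, port: int = 10086, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False
```

If this returns `False` for the server address, check the `-p 10086:10086` mapping and any firewall between the two machines before debugging the client code.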

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| Server stays OFFLINE | No client connected | Run the Swarm Client script (Step 9) |
| Model not found | Wrong container path | Verify the `-v` mount matches the `model` field |
| Cannot connect to port 10086 | Firewall | Check firewall rules |
| Empty log file | Missing log directory | `mkdir -p ./swarmlog` |
| Image pull timeout | Slow registry access | Configure a Docker mirror (Step 4) |
| Image pull fails | Wrong mirror URL | Try a different mirror or use the original URL |
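
The "Model not found" case usually comes down to the `model` field naming a container path that no `-v` mount provides. The sketch below maps a container-side path back to its host path given the `host:container` mount specs. `host_path_for` is a hypothetical helper and assumes simple two-part specs with POSIX paths.

```python
import posixpath
from typing import Iterable, Optional


def host_path_for(model_path: str, mounts: Iterable[str]) -> Optional[str]:
    """Map a container-side path to the host via 'host:container' mount specs."""
    model_path = posixpath.normpath(model_path)
    for spec in mounts:
        host, container = spec.split(":")[:2]
        container = posixpath.normpath(container)
        if model_path == container or model_path.startswith(container + "/"):
            suffix = model_path[len(container):]
            return host.rstrip("/") + suffix
    return None  # no mount covers this path
```

A `None` result means the `model` value is not covered by any mount; otherwise, check that the returned host path actually contains the weights.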

docker-install-agentjet-swarm-server