sglang-jax SkyPilot Dev Skill

This skill handles running tests and development tasks for the sgl-jax project on remote TPU clusters via SkyPilot.

Note: This project requires TPU for all JAX tests. Never run JAX/TPU tests locally.

1. Prerequisites

The TPU cluster must already be provisioned. Check that .cluster_name_tpu exists and is non-empty in the project root.
If the file does not exist or is empty, provision a cluster first (see Section 3).

Execution Instructions: Before running the launch script, find its absolute path in the scripts/ directory alongside this skill definition. Use file search tools (e.g., glob) to locate launch_tpu.sh before executing it.

2. Project Layout

Component	Path
Main source	`python/sgl_jax/`
SRT module	`python/sgl_jax/srt/`
Tests	`test/`
Test suite runner	`test/srt/run_suite.py`

3. Cluster Management

Provisioning

# Common TPU types for this project: tpu-v6e-4, tpu-v6e-8, tpu-v4-8
bash <absolute_path_to_launch_tpu.sh> <accelerator_type> <experiment_name>

# Example:
bash <absolute_path_to_launch_tpu.sh> tpu-v6e-4 dp-test

The launch script automatically writes the cluster name to .cluster_name_tpu.

Teardown

sky down $(cat .cluster_name_tpu) -y

4. Running Tests

All test execution uses SSH + tmux for reliable process management and log inspection. Never use sky exec for running tests.

General Workflow

SSH into the cluster
Update code on the remote via git
Start test tasks in named tmux sessions
Monitor and retrieve results

Step 1: SSH into the cluster

# Get the cluster IP
CLUSTER_IP=$(sky status --ip $(cat .cluster_name_tpu))

# SSH directly to the remote
ssh -i ~/.ssh/sky-key gcpuser@$CLUSTER_IP

Step 2: Update code on the remote via git

On the remote machine, use git to pull the latest changes:

cd ~/sky_workdir

# Pull latest changes from the remote repository
git fetch origin
git pull origin <branch-name>

# Or if you need to switch branches
git checkout <branch-name>
git pull origin <branch-name>

# Install/update dependencies after pulling
uv sync --extra tpu

Important: Always commit and push your local changes before running tests on the remote. The remote should pull from the git repository, not sync from your local filesystem.

Step 3: Run tests in tmux sessions

Mode A: Test File Testing

Run test files directly under the test/ directory. Use this for unit tests, integration tests, and regression tests that don't require a live server.

Run the full test suite:

# On the remote machine:
tmux new-session -d -s test-suite -c ~/sky_workdir
tmux send-keys -t test-suite \
  "uv run --extra tpu python test/srt/run_suite.py" Enter

# Follow the output
tmux attach -t test-suite

Run a single test file:

# On the remote machine:
tmux new-session -d -s test -c ~/sky_workdir
tmux send-keys -t test \
  "uv run --extra tpu python -m pytest test/srt/<test_file.py> -v" Enter

# Follow the output
tmux attach -t test

Run tests matching a pattern:

# On the remote machine:
tmux new-session -d -s test -c ~/sky_workdir
tmux send-keys -t test \
  "uv run --extra tpu python -m pytest test/srt/ -k <pattern> -v" Enter

# Follow the output
tmux attach -t test

Common pytest flags:

Flag	Purpose
`-v`	Verbose output
`-k <pattern>`	Run tests matching pattern
`-x`	Stop on first failure
`--tb=short`	Short traceback
`-s`	Show stdout (useful for JAX logs)

Mode B: Service Debug Testing

Use this when you need to start a server process, wait for it to be ready, then run a second process (debug script, benchmark, or accuracy test) against the live server.

Step 1: Start the server in a named tmux session

# On the remote machine:
tmux new-session -d -s server -c ~/sky_workdir
tmux send-keys -t server \
  "uv run --extra tpu python <SERVER_COMMAND> [ARGS]" Enter

Step 2: Wait for the server to be ready

# On the remote machine — poll until the service is ready:
until <READINESS_CHECK>; do
  echo "Waiting for server..."; sleep 5
done
echo "Server is ready."

Common readiness checks:

Method	Command
HTTP health endpoint	`curl -sf http://localhost:<PORT>/health`
Log line appeared	`tmux capture-pane -pt server \| grep -q "server started"`
Port is open	`nc -z localhost <PORT>`

Step 3: Run the client in a separate tmux session

# On the remote machine:
tmux new-session -d -s client -c ~/sky_workdir
tmux send-keys -t client \
  "uv run --extra tpu python <CLIENT_COMMAND> [ARGS]" Enter

# Follow client output
tmux attach -t client

Step 4: Inspect and clean up

# List all sessions
tmux ls

# View server logs (without attaching)
tmux capture-pane -pt server -S -1000

# Kill a session when done
tmux kill-session -t server
tmux kill-session -t client

Tmux session naming convention:

Session name	Purpose
`server`	The main inference/serving process
`client`	Debug / benchmark / accuracy test runner
`<feature>-server`	Use feature-prefixed names when running multiple experiments simultaneously

5. Operational Notes

SSH Access: Use direct SSH with ssh -i ~/.ssh/sky-key gcpuser@$CLUSTER_IP to connect to the cluster
Code Updates: Always use git pull on the remote to update code; commit and push local changes first
Tmux Sessions: All test execution happens in named tmux sessions for reliability and log inspection
Workdir: The remote workdir is at ~/sky_workdir/ on the TPU VM
Logs: View logs using tmux capture-pane or by attaching to the session
Interruption: Detach from tmux with Ctrl+B D; sessions persist after disconnection
agent.md: Check the project's agent.md for the active cluster name when working on parallel development tasks

6. Functional-Point-Specific Testing

(To be populated with data-parallelism feature-specific test commands and validation steps.)

sglang-jax-skypilot-dev