Modal Serverless Cloud Platform

Serverless Python execution with GPUs, autoscaling, and pay-per-use compute.

When to Use

  • Deploy and serve ML models (LLMs, image generation)
  • Run GPU-accelerated computation
  • Batch process large datasets in parallel
  • Schedule compute-intensive jobs
  • Build serverless APIs with autoscaling

Quick Start

# Install
pip install modal

# Authenticate
modal token new

# script.py
import modal

app = modal.App("my-app")

@app.function()
def hello():
    return "Hello from Modal!"

# Run with: modal run script.py
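
Calling a decorated function with .remote() executes it in a Modal container; a minimal local entrypoint to drive the function above:

@app.local_entrypoint()
def main():
    print(hello.remote())  # runs hello() remotely and returns its result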

Container Images

# Build image with dependencies
image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install("torch", "transformers", "numpy")
)

app = modal.App("ml-app", image=image)
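
Images can also layer in system packages and environment variables; a sketch (the pinned torch version and env var are illustrative):

image = (
    modal.Image.debian_slim(python_version="3.12")
    .apt_install("git", "ffmpeg")               # system packages
    .pip_install("torch==2.4.0")                # pin versions for reproducible builds
    .env({"TOKENIZERS_PARALLELISM": "false"})   # env vars baked into the image
)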

GPU Functions

@app.function(gpu="H100")
def train_model():
    import torch
    assert torch.cuda.is_available()
    # GPU code here

# Available GPUs: T4, L4, A10, A100, L40S, H100, H200, B200
# Multi-GPU: gpu="H100:8"
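
With a multi-GPU request, a single container sees all of the devices; a minimal sketch:

@app.function(gpu="H100:8")
def train_distributed():
    import torch
    assert torch.cuda.device_count() == 8  # all 8 GPUs attached to this container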

Web Endpoints

@app.function()
@modal.web_endpoint(method="POST")
def predict(data: dict):
    # `model` is a placeholder for a model object loaded at startup (see ML Model Serving below)
    result = model.predict(data["input"])
    return {"prediction": result}

# Deploy: modal deploy script.py
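
A deployed endpoint is plain HTTPS; a sketch of calling it with the requests library (the URL is illustrative — `modal deploy` prints the real one):

import requests

url = "https://your-workspace--my-app-predict.modal.run"  # placeholder URL
resp = requests.post(url, json={"input": "some text"})
print(resp.json())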

Scheduled Jobs

@app.function(schedule=modal.Cron("0 2 * * *"))  # Daily at 2:00 UTC
def daily_backup():
    pass

@app.function(schedule=modal.Period(hours=4))  # Every 4 hours
def refresh_cache():
    pass

Autoscaling

@app.function()
def process_item(item_id: int):
    # `analyze` is a placeholder for your per-item work
    return analyze(item_id)

@app.local_entrypoint()
def main():
    items = range(1000)
    # .map() fans calls out across autoscaled containers in parallel
    results = list(process_item.map(items))
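
If individual items can fail, .map() can hand back exceptions instead of aborting the whole run; a sketch using the return_exceptions option:

@app.local_entrypoint()
def tolerant_main():
    results = list(process_item.map(range(1000), return_exceptions=True))
    failures = [r for r in results if isinstance(r, Exception)]
    print(f"{len(failures)} items failed")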

Persistent Storage

volume = modal.Volume.from_name("my-data", create_if_missing=True)

@app.function(volumes={"/data": volume})
def save_results(data):
    with open("/data/results.txt", "w") as f:
        f.write(data)
    volume.commit()  # Persist changes
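
Other containers see committed data after a reload; a sketch of reading the file back:

@app.function(volumes={"/data": volume})
def load_results():
    volume.reload()  # fetch changes committed by other containers
    with open("/data/results.txt") as f:
        return f.read()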

Secrets Management

@app.function(secrets=[modal.Secret.from_name("huggingface")])
def download_model():
    import os
    token = os.environ["HF_TOKEN"]  # injected from the named secret
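
Secrets are created once, in the dashboard or via the CLI, then referenced by name (the token value below is a placeholder):

modal secret create huggingface HF_TOKEN=<your-token>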

ML Model Serving

@app.cls(gpu="L40S")
class Model:
    @modal.enter()
    def load_model(self):
        from transformers import pipeline
        self.pipe = pipeline("text-classification", device="cuda")

    @modal.method()
    def predict(self, text: str):
        return self.pipe(text)

@app.local_entrypoint()
def main():
    model = Model()
    print(model.predict.remote("Modal is great!"))
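
After `modal deploy`, the class can also be looked up and called from any Python process; a sketch using the app and class names above (Cls.from_name is Modal's lookup API):

import modal

Model = modal.Cls.from_name("ml-app", "Model")  # app name, class name
print(Model().predict.remote("Modal is great!"))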

Resource Configuration

@app.function(
    cpu=8.0,              # 8 CPU cores
    memory=32768,         # 32 GiB RAM
    ephemeral_disk=10240, # 10 GiB disk
    timeout=3600          # 1 hour timeout
)
def memory_intensive_task():
    pass

Best Practices

  1. Pin dependencies for reproducible builds
  2. Use appropriate GPU types - L40S for inference, H100 for training
  3. Leverage caching via Volumes for model weights (see the sketch after this list)
  4. Use .map() for parallel processing
  5. Import packages inside function bodies when they exist only in the container image
  6. Store secrets securely - never hardcode API keys
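
A minimal sketch of practice 3, caching Hugging Face weights in a Volume (the volume name and cache path are illustrative):

cache = modal.Volume.from_name("hf-cache", create_if_missing=True)

@app.function(volumes={"/cache": cache}, secrets=[modal.Secret.from_name("huggingface")])
def warm_cache():
    import os
    os.environ["HF_HOME"] = "/cache"  # point the Hugging Face cache at the Volume
    from transformers import pipeline
    pipeline("text-classification")   # first call downloads weights; later calls reuse them
    cache.commit()                    # persist the downloaded weights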

vs Alternatives

Platform     Best For
Modal        Serverless GPUs, autoscaling, Python-native
RunPod       GPU rental, long-running jobs
AWS Lambda   CPU workloads, AWS ecosystem
Replicate    Model hosting, simple deployments
