Text Generation Inference (TGI)

Expert guidance for Hugging Face's production LLM inference server.

Triggers

Use this skill when:

  • Deploying LLMs in production environments
  • Setting up high-throughput model serving
  • Configuring quantization for inference optimization
  • Working with Hugging Face Text Generation Inference
  • Implementing continuous batching or tensor parallelism
  • Keywords: tgi, text generation inference, huggingface serving, llm deployment, continuous batching, tensor parallelism

Installation

Docker

# Basic GPU deployment (gated models like Llama need an HF token)
docker run --gpus all -p 8080:80 \
  -e HF_TOKEN=$HF_TOKEN \
  -v /path/to/models:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct

# With quantization
docker run --gpus all -p 8080:80 \
  -v /path/to/models:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --quantize bitsandbytes-nf4

# Multi-GPU (NCCL needs extra shared memory, hence --shm-size)
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v /path/to/models:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-70B-Instruct \
  --num-shard 4
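
Model downloads and shard loading can take minutes, so scripts that talk to a fresh container should poll /health before sending traffic. A minimal sketch (`wait_for_ready` is an illustrative helper, not part of any TGI client; the `probe` argument exists so the loop can be exercised without a live server):

```python
import time
import urllib.request
import urllib.error

def wait_for_ready(url="http://localhost:8080/health",
                   timeout=300, interval=5, probe=None):
    """Poll the TGI /health endpoint until it answers 200 or the
    timeout expires. Returns True when the server is ready."""
    def default_probe(u):
        try:
            with urllib.request.urlopen(u, timeout=5) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    probe = probe or default_probe
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(interval)
    return False
```

Usage: `wait_for_ready()` right after `docker run`, then start sending requests.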

Docker Compose

services:
  tgi:
    image: ghcr.io/huggingface/text-generation-inference:latest
    ports:
      - "8080:80"
    volumes:
      - ./models:/data
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    command: >
      --model-id meta-llama/Llama-3.1-8B-Instruct
      --max-input-length 4096
      --max-total-tokens 8192
      --max-batch-prefill-tokens 4096
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

Server Options

text-generation-launcher \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --port 8080 \
  --max-input-length 4096 \
  --max-total-tokens 8192 \
  --max-batch-prefill-tokens 4096 \
  --max-batch-total-tokens 32768 \
  --max-concurrent-requests 128 \
  --waiting-served-ratio 0.3 \
  --dtype float16 \
  --trust-remote-code
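
The token limits above are interrelated, and the launcher rejects inconsistent combinations at startup. A rough sanity check of the main relationships (a sketch of what the launcher enforces, not an exhaustive validation; `check_token_budget` is a hypothetical helper):

```python
def check_token_budget(max_input_length, max_total_tokens,
                       max_batch_prefill_tokens, max_batch_total_tokens):
    """Return a list of human-readable problems with a TGI token budget.
    An empty list means the basic relationships hold."""
    problems = []
    if max_input_length >= max_total_tokens:
        problems.append("max-input-length must be < max-total-tokens; "
                        "the difference is the room left for generation")
    if max_batch_prefill_tokens < max_input_length:
        problems.append("max-batch-prefill-tokens should be >= max-input-length, "
                        "or a single max-length prompt can never be scheduled")
    if max_batch_total_tokens < max_total_tokens:
        problems.append("max-batch-total-tokens should be >= max-total-tokens")
    return problems
```

The launcher flags above (4096 / 8192 / 4096 / 32768) pass this check; a prompt budget equal to the total budget would not.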

Quantization Options

# BitsAndBytes 4-bit
--quantize bitsandbytes-nf4
--quantize bitsandbytes-fp4

# GPTQ
--quantize gptq

# AWQ
--quantize awq

# EETQ (efficient 8-bit)
--quantize eetq

# FP8
--quantize fp8
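
A back-of-envelope way to pick between these: weight memory scales with bytes per parameter (roughly 2 for fp16/bf16, 1 for 8-bit schemes, 0.5 for 4-bit schemes). The sketch below is a crude estimate for weights only; KV cache and activations come on top, and the `overhead` fudge factor is an assumption, not a TGI formula:

```python
# Approximate bytes per parameter for each --quantize option (weights only)
BYTES_PER_PARAM = {
    "float16": 2.0, "bfloat16": 2.0,
    "eetq": 1.0, "fp8": 1.0,
    "gptq": 0.5, "awq": 0.5,
    "bitsandbytes-nf4": 0.5, "bitsandbytes-fp4": 0.5,
}

def weight_memory_gb(n_params_billion, quantize="float16", overhead=1.2):
    """Rough GPU memory needed just to hold the weights, in GB.
    `overhead` is a crude fudge factor for buffers and fragmentation."""
    bytes_total = n_params_billion * 1e9 * BYTES_PER_PARAM[quantize]
    return bytes_total * overhead / 1e9
```

For example, an 8B model is about 16 GB of weights in fp16 but about 4 GB with a 4-bit scheme, which is why 4-bit quantization lets Llama-3.1-8B fit comfortably on a single 24 GB GPU.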

API Usage

Python Client

from huggingface_hub import InferenceClient

client = InferenceClient(model="http://localhost:8080")

# Generate text
response = client.text_generation(
    "What is machine learning?",
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    stop_sequences=["</s>"]
)
print(response)

# Chat
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
]
response = client.chat_completion(messages, max_tokens=500)
print(response.choices[0].message.content)

# Streaming
for token in client.text_generation(
    "Once upon a time",
    max_new_tokens=100,
    stream=True
):
    print(token, end="")

OpenAI-Compatible

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="-"
)

response = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
    max_tokens=100,
    temperature=0.7
)
print(response.choices[0].message.content)
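
The same endpoint streams with `stream=True`, delivering OpenAI-style server-sent events. If you'd rather not depend on the `openai` package, the stream can be consumed with the standard library; a minimal sketch (`parse_sse_chunk` and `stream_chat` are illustrative names, not part of any client library):

```python
import json
import urllib.request

def parse_sse_chunk(line):
    """Parse one SSE line from /v1/chat/completions with stream=True.
    Returns the text delta, or None for keep-alives, role-only chunks,
    and the terminating [DONE] marker."""
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None
    delta = json.loads(payload)["choices"][0]["delta"]
    return delta.get("content")

def stream_chat(prompt, base_url="http://localhost:8080"):
    """POST a streaming chat request and print tokens as they arrive."""
    body = json.dumps({
        "model": "tgi",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 100,
        "stream": True,
    }).encode()
    req = urllib.request.Request(
        base_url + "/v1/chat/completions", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # the response object is iterable line by line
            piece = parse_sse_chunk(raw.decode("utf-8"))
            if piece:
                print(piece, end="", flush=True)
```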

REST API

# Generate
curl http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "What is AI?",
    "parameters": {
      "max_new_tokens": 100,
      "temperature": 0.7,
      "top_p": 0.9
    }
  }'

# Chat
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 100
  }'

# Health check
curl http://localhost:8080/health
curl http://localhost:8080/info
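
/info returns the loaded model and the server's effective limits as JSON, which is handy for asserting that a deployment came up with the configuration you intended. A small sketch; the exact key names vary a little across TGI versions, so `pick_info` (a hypothetical helper) keeps only the keys that are present:

```python
import json
import urllib.request

def pick_info(info):
    """Keep the commonly useful fields from a TGI /info payload.
    Key names are illustrative and may differ between TGI versions."""
    keys = ("model_id", "max_input_tokens", "max_total_tokens", "version")
    return {k: info[k] for k in keys if k in info}

def server_info(base_url="http://localhost:8080"):
    """Fetch /info from a running TGI server and summarize it."""
    with urllib.request.urlopen(base_url + "/info") as resp:
        return pick_info(json.load(resp))
```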

Embedding Support

Embeddings are served by the companion Text Embeddings Inference (TEI) server (ghcr.io/huggingface/text-embeddings-inference), not by TGI itself. Deploy TEI with an embedding model and use the same client API:

from huggingface_hub import InferenceClient

# Point the client at a running TEI instance (port 8081 here is illustrative)
client = InferenceClient(model="http://localhost:8081")

# Single embedding
embedding = client.feature_extraction("Hello world")

# Batch embeddings
embeddings = client.feature_extraction([
    "First sentence",
    "Second sentence"
])

Speculative Decoding

# TGI enables speculative decoding with --speculate (tokens speculated per
# step). A plain model uses n-gram speculation; models that ship Medusa
# heads use those automatically. There is no separate draft-model flag.
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-70B-Instruct \
  --speculate 2

Performance Tuning

# Memory optimization
--max-batch-prefill-tokens 4096
--max-batch-total-tokens 32768
--max-concurrent-requests 64

# Latency optimization
--max-batch-size 1
--max-waiting-tokens 1

# Throughput optimization
--max-batch-size 32
--waiting-served-ratio 0.3
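
These knobs trade latency against throughput, so it pays to measure a configuration before and after changing them. A minimal load-test sketch (`bench` and `percentile` are illustrative helpers; pass any callable that performs one generation request, e.g. a closure over the client from the API section):

```python
import statistics
import time
import concurrent.futures

def percentile(samples, p):
    """Nearest-rank percentile; good enough for quick latency summaries."""
    xs = sorted(samples)
    idx = max(0, min(len(xs) - 1, int(round(p / 100 * len(xs))) - 1))
    return xs[idx]

def bench(send_request, n_requests=32, concurrency=8):
    """Fire n_requests through a thread pool and summarize per-request
    latency. `send_request` is any callable that does one generation call."""
    def timed(_):
        t0 = time.monotonic()
        send_request()
        return time.monotonic() - t0

    with concurrent.futures.ThreadPoolExecutor(concurrency) as pool:
        latencies = list(pool.map(timed, range(n_requests)))
    return {"p50": percentile(latencies, 50),
            "p99": percentile(latencies, 99),
            "mean": statistics.mean(latencies)}
```

Run it once per candidate setting (e.g. `--max-batch-size 1` vs `32`) at the concurrency your traffic actually sees: batching raises throughput but also raises tail latency.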

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi
  template:
    metadata:
      labels:
        app: tgi
    spec:
      containers:
        - name: tgi
          image: ghcr.io/huggingface/text-generation-inference:latest
          ports:
            - containerPort: 80
          args:
            - --model-id
            - meta-llama/Llama-3.1-8B-Instruct
            - --max-input-length
            - "4096"
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 32Gi
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          volumeMounts:
            - name: model-cache
              mountPath: /data
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc

Monitoring

# Prometheus metrics
curl http://localhost:8080/metrics

# Key metrics:
# tgi_request_duration_seconds
# tgi_request_count
# tgi_queue_size
# tgi_batch_size
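
The /metrics payload is plain Prometheus text format, so it is easy to scrape ad hoc without a full Prometheus stack. A minimal parser sketch (skips comments, labelled series, and non-numeric values; `parse_prometheus` is an illustrative helper):

```python
def parse_prometheus(text, prefix="tgi_"):
    """Extract un-labelled numeric samples whose metric name starts with
    `prefix` from a Prometheus text-format payload."""
    out = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # comments, HELP/TYPE lines, blanks
        name, _, value = line.partition(" ")
        if name.startswith(prefix) and "{" not in name:
            try:
                out[name] = float(value)
            except ValueError:
                pass
    return out
```

Pair it with a `urllib.request.urlopen("http://localhost:8080/metrics")` call to watch queue depth and batch size from a shell loop or a dashboard script.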
