# LocalAI

Expert guidance for LocalAI, a self-hosted, OpenAI-compatible AI API server.
## Triggers
Use this skill when:
- Running self-hosted AI models locally
- Deploying OpenAI-compatible APIs without cloud dependencies
- Setting up privacy-focused AI deployments
- Working with LocalAI for LLMs, embeddings, audio, or images
- Building offline AI inference systems
- Keywords: localai, self-hosted, openai compatible, local ai, offline, privacy, llm server
## Installation

### Docker

```bash
# Basic (CPU)
docker run -p 8080:8080 localai/localai:latest

# With GPU (CUDA)
docker run --gpus all -p 8080:8080 localai/localai:latest-gpu-nvidia-cuda-12

# With a models directory
docker run -p 8080:8080 \
  -v /path/to/models:/models \
  localai/localai:latest
```
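Once the container is up, a quick reachability check is to hit the OpenAI-compatible `/v1/models` endpoint; a minimal sketch using only the Python standard library:

```python
import json
import urllib.request

# Lists the models the server currently knows about; an empty list is
# still a healthy response on a fresh install with no models loaded yet.
with urllib.request.urlopen("http://localhost:8080/v1/models") as r:
    print(json.load(r))
```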
### Docker Compose

```yaml
services:
  localai:
    image: localai/localai:latest-gpu-nvidia-cuda-12
    ports:
      - "8080:8080"
    volumes:
      - ./models:/models
    environment:
      - THREADS=8
      - CONTEXT_SIZE=4096
      - DEBUG=true
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
## Model Configuration

### YAML Model Definition

```yaml
# models/llama3.yaml
name: llama3
backend: llama-cpp
parameters:
  model: /models/llama-3-8b-instruct.gguf
  temperature: 0.7
  top_p: 0.9
  top_k: 40
context_size: 4096
threads: 8
f16: true
mmap: true
template:
  chat_message: |
    <|start_header_id|>{{.RoleName}}<|end_header_id|>
    {{.Content}}<|eot_id|>
  chat: |
    {{.Input}}
    <|start_header_id|>assistant<|end_header_id|>
```
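After restarting the server so it picks up `models/llama3.yaml`, the model can be exercised by name. A minimal smoke test, assuming the server is listening on localhost:8080:

```python
import json
import urllib.request

# Minimal chat request against the "llama3" model defined above.
body = json.dumps({
    "model": "llama3",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
}).encode()
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as r:
    print(json.load(r)["choices"][0]["message"]["content"])
```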
### Embedding Model

```yaml
# models/embeddings.yaml
name: text-embedding
backend: bert-embeddings
parameters:
  model: /models/all-MiniLM-L6-v2
embeddings: true
```
### Whisper (Audio)

```yaml
# models/whisper.yaml
name: whisper-1
backend: whisper
parameters:
  model: /models/whisper-base.bin
  language: en
```
### Stable Diffusion

```yaml
# models/stablediffusion.yaml
name: stablediffusion
backend: stablediffusion
parameters:
  model: /models/sd-v1-5
step: 25
```
## API Usage

### OpenAI Python Client

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",  # LocalAI does not require an API key by default
)

# Chat completion
response = client.chat.completions.create(
    model="llama3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"},
    ],
    temperature=0.7,
    max_tokens=500,
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
### Embeddings

```python
response = client.embeddings.create(
    model="text-embedding",
    input=["Hello world", "How are you?"],
)
embeddings = [e.embedding for e in response.data]
```
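A common next step is comparing the returned vectors. A small sketch (plain Python, no numpy) that reuses the `embeddings` list from above:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product over the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings[0], embeddings[1]))
```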
### Image Generation

```python
response = client.images.generate(
    model="stablediffusion",
    prompt="A beautiful sunset over mountains",
    n=1,
    size="512x512",
)
image_url = response.data[0].url
```
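Depending on the deployment, the response may carry a fetchable URL or base64 data; assuming a URL as in the example above, saving the image is one call:

```python
import urllib.request

# Assumes image_url points at a file the LocalAI server exposes over HTTP;
# if your setup returns base64 instead, decode response.data[0].b64_json.
urllib.request.urlretrieve(image_url, "sunset.png")
```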
### Audio Transcription

```python
with open("audio.mp3", "rb") as f:
    response = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
    )
print(response.text)
```
## Gallery Models

```bash
# List available models
curl http://localhost:8080/models/available

# Install from the gallery by model URI
curl http://localhost:8080/models/apply \
  -H "Content-Type: application/json" \
  -d '{
    "id": "huggingface://TheBloke/Llama-2-7B-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf"
  }'

# Or from a config URL
curl http://localhost:8080/models/apply \
  -H "Content-Type: application/json" \
  -d '{
    "url": "github:go-skynet/model-gallery/gpt4all-j.yaml"
  }'
```
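`/models/apply` is asynchronous: it returns a job identifier to poll while the model downloads. A hedged sketch, assuming the `uuid` field and `/models/jobs/<uuid>` status endpoint described in the LocalAI docs (the exact response shape can vary by version):

```python
import json
import time
import urllib.request

base = "http://localhost:8080"
body = json.dumps({"url": "github:go-skynet/model-gallery/gpt4all-j.yaml"}).encode()
req = urllib.request.Request(f"{base}/models/apply", data=body,
                             headers={"Content-Type": "application/json"})
job = json.load(urllib.request.urlopen(req))

# Poll the job until the download/install reports completion.
while True:
    with urllib.request.urlopen(f"{base}/models/jobs/{job['uuid']}") as r:
        status = json.load(r)
    if status.get("processed"):
        break
    time.sleep(2)
```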
## Function Calling

```yaml
# models/llama3-functions.yaml
name: llama3-functions
backend: llama-cpp
parameters:
  model: /models/llama-3-8b-instruct.gguf
function:
  disable_no_action: false
  grammar_prefix: |
    <|start_header_id|>assistant<|end_header_id|>
```

```python
response = client.chat.completions.create(
    model="llama3-functions",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"}
                },
                "required": ["city"]
            }
        }
    }],
    tool_choice="auto",
)
```
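When the model decides to call the tool, the reply carries `tool_calls` rather than text. A sketch of the round trip, with a hypothetical local `get_weather` stub standing in for a real lookup:

```python
import json

def get_weather(city: str) -> str:
    # Hypothetical stand-in; replace with a real weather lookup.
    return f"Sunny, 22°C in {city}"

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    args = json.loads(call.function.arguments)
    result = get_weather(**args)
    # Feed the tool result back so the model can phrase a final answer.
    followup = client.chat.completions.create(
        model="llama3-functions",
        messages=[
            {"role": "user", "content": "What's the weather in Paris?"},
            message,
            {"role": "tool", "tool_call_id": call.id, "content": result},
        ],
    )
    print(followup.choices[0].message.content)
```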
## Performance Tuning

```bash
# Environment variables
THREADS=8              # Number of CPU threads
CONTEXT_SIZE=4096      # Context window size
F16=true               # Use FP16 precision
MMAP=true              # Memory-map model files
GPU_LAYERS=35          # Layers to offload to GPU
TENSOR_SPLIT=0.5,0.5   # Multi-GPU split ratios
```
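A simple way to compare settings is to time a fixed completion and divide by the generated token count, reusing the client from the API Usage section; `usage` is populated when the backend reports token counts:

```python
import time

start = time.time()
response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Explain quicksort briefly."}],
    max_tokens=200,
)
elapsed = time.time() - start
if response.usage:
    print(f"{response.usage.completion_tokens / elapsed:.1f} tokens/sec")
```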
### GPU Offloading

```yaml
# models/llama3-gpu.yaml
name: llama3
backend: llama-cpp
parameters:
  model: /models/llama-3-8b-instruct.gguf
gpu_layers: 35
main_gpu: 0
tensor_split: ""
```
## Kubernetes Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: localai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: localai
  template:
    metadata:
      labels:
        app: localai
    spec:
      containers:
        - name: localai
          image: localai/localai:latest-gpu-nvidia-cuda-12
          ports:
            - containerPort: 8080
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: models
              mountPath: /models
          env:
            - name: THREADS
              value: "8"
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: models-pvc
```