ollama

Installation
SKILL.md

Ollama

Expert guidance for running local LLMs with Ollama.

Triggers

Use this skill when:

  • Running LLMs locally for privacy or cost savings
  • Setting up offline AI inference
  • Managing local model deployments
  • Working with open-source models (Llama, Mistral, etc.)
  • Developing AI applications without cloud API costs
  • Keywords: ollama, local llm, offline, self-hosted, llama, mistral, local model

Installation

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download

Start Server

# Start Ollama service
ollama serve

# Runs on http://localhost:11434

Model Management

# Pull models
ollama pull llama3.1
ollama pull llama3.1:70b
ollama pull mistral
ollama pull codellama
ollama pull phi3
ollama pull gemma2

# List models
ollama list

# Show model info
ollama show llama3.1

# Remove model
ollama rm llama3.1

# Copy model
ollama cp llama3.1 my-llama

# Run model interactively
ollama run llama3.1

API Usage

Python

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "What is Python?",
        "stream": False
    }
)
print(response.json()["response"])

Python with OpenAI SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but unused
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"}
    ]
)
print(response.choices[0].message.content)

Streaming

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "Write a poem",
        "stream": True
    },
    stream=True
)

for line in response.iter_lines():
    if line:
        import json
        data = json.loads(line)
        print(data.get("response", ""), end="", flush=True)

Chat API

import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"}
        ]
    }
)
print(response.json()["message"]["content"])

Embeddings

response = requests.post(
    "http://localhost:11434/api/embeddings",
    json={
        "model": "llama3.1",
        "prompt": "Hello world"
    }
)
embedding = response.json()["embedding"]

Custom Models (Modelfile)

Create Custom Model

# Modelfile
FROM llama3.1

# Set parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096

# Set system prompt
SYSTEM """You are a helpful coding assistant specializing in Python.
Always provide code examples and explain your reasoning."""

# Set template (optional)
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>"""
# Create model
ollama create my-coder -f Modelfile

# Run custom model
ollama run my-coder

Import GGUF Models

# Modelfile
FROM ./mistral-7b-instruct-v0.2.Q4_K_M.gguf

PARAMETER temperature 0.7
PARAMETER num_ctx 8192

TEMPLATE """[INST] {{ .Prompt }} [/INST]
{{ .Response }}"""

Generation Parameters

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "Hello",
        "options": {
            "temperature": 0.7,
            "top_p": 0.9,
            "top_k": 40,
            "num_predict": 256,
            "num_ctx": 4096,
            "repeat_penalty": 1.1,
            "seed": 42
        }
    }
)

Vision Models

import base64

# Encode image
with open("image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llava",
        "prompt": "What's in this image?",
        "images": [image_data]
    }
)

LangChain Integration

from langchain_community.llms import Ollama
from langchain_community.chat_models import ChatOllama

# LLM
llm = Ollama(model="llama3.1")
response = llm.invoke("What is Python?")

# Chat model
chat = ChatOllama(model="llama3.1")
response = chat.invoke([
    ("system", "You are helpful."),
    ("human", "Hello!")
])

LlamaIndex Integration

from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core import Settings

Settings.llm = Ollama(model="llama3.1", request_timeout=120.0)
Settings.embed_model = OllamaEmbedding(model_name="llama3.1")

Docker Deployment

# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:
# Pull model in container
docker exec -it ollama ollama pull llama3.1

Environment Variables

# Model storage location
OLLAMA_MODELS=/path/to/models

# Server host/port
OLLAMA_HOST=0.0.0.0:11434

# GPU settings
OLLAMA_NUM_GPU=1
CUDA_VISIBLE_DEVICES=0

# Memory settings
OLLAMA_MAX_LOADED_MODELS=2

Popular Models

Model Size Use Case
llama3.1 8B General purpose
llama3.1:70b 70B Complex reasoning
mistral 7B Fast, efficient
codellama 7B-34B Code generation
phi3 3.8B Small but capable
gemma2 9B Google's model
llava 7B Vision + language
nomic-embed-text - Embeddings

Resources

Related skills

More from housegarofalo/claude-code-base

Installs
4
GitHub Stars
2
First Seen
Mar 15, 2026