Ollama Skill
Complete guide for Ollama - run LLMs locally.
Quick Reference
Popular Models
| Model | Size | Use Case |
|---|---|---|
| llama3.2 | 1B/3B | General purpose |
| mistral | 7B | Fast, capable |
| codellama | 7B/13B/34B | Code generation |
| phi3 | 3.8B | Compact, fast |
| gemma2 | 2B/9B/27B | Google's model |
| qwen2.5 | 0.5B-72B | Multilingual |
| deepseek-coder | 6.7B/33B | Code specialist |
Commands
ollama run <model> # Interactive chat
ollama pull <model> # Download model
ollama list # List installed
ollama rm <model> # Remove model
ollama serve # Start server
1. Installation
macOS
# Download from ollama.ai or:
brew install ollama
Linux
curl -fsSL https://ollama.ai/install.sh | sh
Windows
# Download installer from ollama.ai
# Or use WSL2 and follow the Linux instructions
Docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# With GPU support
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 ollama/ollama
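Once the container is up, the CLI can be used inside it (assuming the container is named ollama, as in the first command above):
docker exec -it ollama ollama pull llama3.2
docker exec -it ollama ollama run llama3.2 "Hello"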
2. Basic Usage
Run Models
# Run interactively
ollama run llama3.2
# Run with prompt
ollama run llama3.2 "Explain quantum computing"
# Run specific size
ollama run llama3.2:1b
ollama run llama3.2:3b
# Set a system prompt (from inside the interactive session)
ollama run llama3.2
>>> /set system "You are a helpful coding assistant"
Model Management
# Pull model
ollama pull mistral
# List models
ollama list
# Show model info
ollama show llama3.2
# Show model file
ollama show llama3.2 --modelfile
# Copy model
ollama cp llama3.2 my-llama
# Remove model
ollama rm mistral
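On recent Ollama releases, loaded models can also be inspected and unloaded (assumption: your version ships the ps and stop subcommands):
# Show models currently loaded in memory
ollama ps
# Unload a model from memory
ollama stop llama3.2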
3. REST API
Generate Completion
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Why is the sky blue?",
"stream": false
}'
Chat Completion
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
],
"stream": false
}'
Streaming
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [{"role": "user", "content": "Write a poem"}],
"stream": true
}'
Embeddings
curl http://localhost:11434/api/embed -d '{
"model": "llama3.2",
"input": "The quick brown fox"
}'
List Models (API)
curl http://localhost:11434/api/tags
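The same endpoints work from any HTTP client. A minimal sketch with the requests library (assumes the server is running on the default port):
import requests

# Non-streaming generate request against the local server
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])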
4. Python Integration
Official Library
pip install ollama
Basic Usage
import ollama
# Generate
response = ollama.generate(
model='llama3.2',
prompt='What is Python?'
)
print(response['response'])
# Chat
response = ollama.chat(
model='llama3.2',
messages=[
{'role': 'system', 'content': 'You are a helpful assistant.'},
{'role': 'user', 'content': 'Hello!'}
]
)
print(response['message']['content'])
Streaming
# Stream generate
for chunk in ollama.generate(model='llama3.2', prompt='Hello', stream=True):
print(chunk['response'], end='', flush=True)
# Stream chat
for chunk in ollama.chat(
model='llama3.2',
messages=[{'role': 'user', 'content': 'Write a story'}],
stream=True
):
print(chunk['message']['content'], end='', flush=True)
Embeddings
# Single embedding
response = ollama.embed(
model='llama3.2',
input='Hello, world!'
)
embedding = response['embeddings'][0]
# Multiple embeddings
response = ollama.embed(
model='llama3.2',
input=['Hello', 'World']
)
embeddings = response['embeddings']
Async Support
import asyncio
import ollama
async def main():
response = await ollama.AsyncClient().chat(
model='llama3.2',
messages=[{'role': 'user', 'content': 'Hello!'}]
)
print(response['message']['content'])
asyncio.run(main())
5. LangChain Integration
Setup
from langchain_ollama import ChatOllama, OllamaEmbeddings
# Chat model
llm = ChatOllama(
model="llama3.2",
temperature=0.7
)
# Embeddings
embeddings = OllamaEmbeddings(model="llama3.2")
Use in Chains
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
prompt = ChatPromptTemplate.from_template("Explain {topic} simply")
chain = prompt | llm | StrOutputParser()
result = chain.invoke({"topic": "machine learning"})
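LCEL chains also expose a streaming interface; a short sketch streaming the same chain token by token:
# Stream the chain output as it is generated
for chunk in chain.stream({"topic": "machine learning"}):
    print(chunk, end="", flush=True)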
6. Custom Models (Modelfile)
Basic Modelfile
# Modelfile
FROM llama3.2
# Set parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
# Set system prompt
SYSTEM """You are a helpful coding assistant specialized in Python.
Always provide clear, well-commented code examples."""
Create Custom Model
# Create model from Modelfile
ollama create my-coder -f ./Modelfile
# Run custom model
ollama run my-coder
Advanced Modelfile
FROM llama3.2:3b
# Model parameters
PARAMETER temperature 0.8
PARAMETER top_k 40
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 8192
PARAMETER num_predict 2048
PARAMETER stop "<|im_end|>"
# System message
SYSTEM """You are an expert software architect. You provide:
1. Clear architectural recommendations
2. Design pattern suggestions
3. Best practices for scalability
4. Security considerations"""
# Template (for custom formats)
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>"""
Import GGUF Models
# Import from GGUF file
FROM ./model.gguf
PARAMETER temperature 0.7
SYSTEM "You are a helpful assistant."
ollama create custom-model -f Modelfile
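Recent Ollama versions can also quantize at create time, which is handy for imported fp16 GGUF weights (assumption: your version supports the --quantize flag):
ollama create custom-model-q4 -f Modelfile --quantize q4_K_M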
7. Vision Models
Using Vision
import ollama
import base64
# From file
with open('image.jpg', 'rb') as f:
image_data = base64.b64encode(f.read()).decode()
response = ollama.chat(
model='llava',
messages=[{
'role': 'user',
'content': 'What is in this image?',
'images': [image_data]
}]
)
print(response['message']['content'])
Via API
curl http://localhost:11434/api/chat -d '{
"model": "llava",
"messages": [{
"role": "user",
"content": "Describe this image",
"images": ["base64-encoded-image"]
}]
}'
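One way to produce the base64 payload on the command line (GNU coreutils shown; on macOS use base64 -i image.jpg):
base64 -w 0 image.jpg > image.b64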
8. Code Models
CodeLlama
# Pull code model
ollama pull codellama
# Or specialized variants
ollama pull codellama:7b-instruct
ollama pull codellama:13b-python
Code Generation
response = ollama.generate(
model='codellama',
prompt='''Write a Python function that:
1. Takes a list of numbers
2. Returns the median value
3. Handles empty lists'''
)
print(response['response'])
DeepSeek Coder
ollama pull deepseek-coder:6.7b
response = ollama.chat(
model='deepseek-coder:6.7b',
messages=[{
'role': 'user',
'content': 'Write a REST API in FastAPI for user management'
}]
)
9. Performance Tuning
Context Length
# Increase context window
response = ollama.generate(
model='llama3.2',
prompt='Long document here...',
options={
'num_ctx': 8192 # Default is 2048
}
)
GPU Layers
# Control GPU usage
response = ollama.generate(
model='llama3.2',
prompt='Hello',
options={
'num_gpu': 50 # Number of layers on GPU
}
)
Parameters
response = ollama.generate(
model='llama3.2',
prompt='Creative writing prompt',
options={
'temperature': 0.9, # Creativity (0-2)
'top_p': 0.95, # Nucleus sampling
'top_k': 40, # Top-k sampling
'repeat_penalty': 1.1, # Reduce repetition
'num_predict': 500, # Max tokens
'seed': 42 # Reproducibility
}
)
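Reload latency also matters: keep_alive is a top-level request parameter that controls how long the model stays in memory after a request (a sketch, assuming your client version exposes it):
# Keep the model loaded for ten minutes after this request
response = ollama.generate(
    model='llama3.2',
    prompt='Hello',
    keep_alive='10m'
)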
10. Server Configuration
Environment Variables
# Change host/port
OLLAMA_HOST=0.0.0.0:11434 ollama serve
# Custom model directory
OLLAMA_MODELS=/path/to/models ollama serve
# Keep models loaded in memory longer
OLLAMA_KEEP_ALIVE=10m ollama serve
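When the server listens on another machine, the Python client can be pointed at it explicitly (the LAN address below is hypothetical):
from ollama import Client

# Client accepts the base URL of a remote Ollama server
client = Client(host="http://192.168.1.50:11434")
response = client.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response["message"]["content"])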
Docker Compose
services:
ollama:
image: ollama/ollama
container_name: ollama
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
restart: unless-stopped
open-webui:
image: ghcr.io/open-webui/open-webui:main
container_name: open-webui
ports:
- "3000:8080"
volumes:
- open-webui:/app/backend/data
environment:
- OLLAMA_BASE_URL=http://ollama:11434
depends_on:
- ollama
restart: unless-stopped
volumes:
ollama_data:
open-webui:
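Bring the stack up and verify the API responds:
docker compose up -d
curl http://localhost:11434/api/tags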
11. Common Patterns
RAG with Ollama
import ollama
from langchain_ollama import OllamaEmbeddings
from langchain_chroma import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Create embeddings
embeddings = OllamaEmbeddings(model="llama3.2")
# Create vector store
vectorstore = Chroma.from_texts(
texts=["Document 1...", "Document 2..."],
embedding=embeddings
)
# Query
def rag_query(question: str) -> str:
# Retrieve relevant docs
docs = vectorstore.similarity_search(question, k=3)
context = "\n".join(doc.page_content for doc in docs)
# Generate answer
response = ollama.chat(
model='llama3.2',
messages=[
{'role': 'system', 'content': f'Answer using this context:\n{context}'},
{'role': 'user', 'content': question}
]
)
return response['message']['content']
Function Calling
import json
tools = [
{
"name": "get_weather",
"description": "Get weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"}
},
"required": ["location"]
}
}
]
response = ollama.chat(
model='llama3.2',
messages=[
{'role': 'system', 'content': f'You have these tools: {json.dumps(tools)}. Call them by returning JSON with "tool" and "arguments".'},
{'role': 'user', 'content': 'What is the weather in Paris?'}
]
)
# Parse tool call from response
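A minimal sketch of parsing that prompt-based tool call, continuing the snippet above; passing format='json' to chat can help the model return clean JSON, and get_weather below is a hypothetical local function:
call = json.loads(response['message']['content'])  # may fail if the model adds prose
if call.get('tool') == 'get_weather':
    location = call.get('arguments', {}).get('location')
    # Hypothetical: dispatch to your own get_weather(location) implementation here
    print(f"Model requested get_weather for {location}")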
Batch Processing
import ollama
from concurrent.futures import ThreadPoolExecutor
def process_item(item):
response = ollama.generate(
model='llama3.2',
prompt=f"Summarize: {item}"
)
return response['response']
items = ["Document 1", "Document 2", "Document 3"]
with ThreadPoolExecutor(max_workers=3) as executor:
results = list(executor.map(process_item, items))
12. Troubleshooting
Common Issues
Model not found:
# Pull the model first
ollama pull llama3.2
# Check available models
ollama list
Out of memory:
# Use smaller model
ollama run llama3.2:1b # Instead of 3b
# Or reduce the context window (from inside the interactive session)
ollama run llama3.2
>>> /set parameter num_ctx 2048
Slow generation:
# Check GPU usage
nvidia-smi
# Ensure model fits in VRAM
# Or use a quantized tag (check available tags on ollama.com)
ollama pull llama3.2:3b-instruct-q4_K_M
Connection refused:
# Start server first
ollama serve
# Check if running
curl http://localhost:11434/api/tags
Best Practices
- Right-size models - Use the smallest model that works
- Quantization - Use Q4 variants for speed and lower memory
- Custom models - Tune a Modelfile for your use case
- Batch requests - Reduce per-request overhead
- Cache responses - Avoid repeating identical queries (see the sketch after this list)
- Monitor resources - Watch GPU/CPU and memory usage
- Use streaming - Better UX for long generations
- Set timeouts - Handle slow responses gracefully (also covered in the sketch below)
- Test prompts - Iterate on system messages
- Keep updated - New models are released regularly
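A minimal sketch of the caching and timeout items above, assuming the Python Client forwards a timeout keyword to its underlying HTTP client (check your client version):
from functools import lru_cache
from ollama import Client

# 60-second request timeout (assumption: forwarded to the HTTP layer)
client = Client(host="http://localhost:11434", timeout=60)

@lru_cache(maxsize=256)
def cached_summary(text: str) -> str:
    # Identical prompts are served from the in-process cache
    response = client.generate(model="llama3.2", prompt=f"Summarize: {text}")
    return response["response"]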