Ollama
Install and Setup
macOS:
brew install ollama
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download from https://ollama.com/download/windows.
Start the server:
ollama serve
The server listens on http://localhost:11434 by default. Set OLLAMA_HOST to change the bind address.
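For example, once the server is bound to a custom address, any HTTP client simply targets that address. A minimal sketch (the port 11500 is an illustrative value, not a default):
# Assumes the server was started with, e.g.: OLLAMA_HOST=0.0.0.0:11500 ollama serve
import requests

r = requests.get("http://localhost:11500/api/tags")  # lists locally downloaded models
print(r.json())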
Pull and Run Models
ollama pull llama3 # download without running
ollama run llama3 # run (auto-pulls if missing)
ollama list # list downloaded models
Interactive Chat vs One-Shot Generation
Interactive chat (opens a REPL, type /bye to exit):
ollama run llama3
One-shot generation (pipe input, get output, exit):
echo "Explain quicksort in two sentences" | ollama run llama3
cat main.py | ollama run codellama "Review this code for bugs"
REST API
Generate (completion)
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Why is the sky blue?",
"stream": false
}'
Chat (multi-turn)
curl http://localhost:11434/api/chat -d '{
"model": "llama3",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is 2+2?"}
],
"stream": false
}'
Embeddings
curl http://localhost:11434/api/embed -d '{
"model": "nomic-embed-text",
"input": "Ollama is a tool for running local LLMs"
}'
Set "stream": true (the default) to receive newline-delimited JSON chunks.
Model Management
ollama list # list downloaded models
ollama show llama3 # show model details (parameters, template, license)
ollama cp llama3 my-llama3 # copy/alias a model
ollama rm my-llama3 # delete a model
ollama ps # list currently loaded/running models
ollama ps shows each loaded model's memory footprint, whether it is running on CPU or GPU, and how long until it unloads.
Modelfile
A Modelfile defines a custom model:
FROM llama3
SYSTEM "You are a senior software engineer. Be concise. Provide code examples."
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
PARAMETER stop "<|eot_id|>"
Key parameters:
- temperature -- randomness (0.0 = deterministic, 1.0+ = creative).
- num_ctx -- context window in tokens. Higher values use more VRAM.
- top_p -- nucleus sampling threshold.
- top_k -- limits token selection pool.
- repeat_penalty -- penalizes repeated tokens.
- stop -- stop sequence(s).
- num_gpu -- layers to offload to GPU (0 for CPU-only).
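These parameters are not limited to Modelfiles: the REST API and client libraries accept the same names in a per-request options object, which overrides the Modelfile values for that call. A small sketch with the Python client covered later in this document:
import ollama

# Request-time options override the model's Modelfile defaults for this call only.
response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Summarize RAID levels in one paragraph."}],
    options={"temperature": 0.3, "num_ctx": 8192, "top_p": 0.9},
)
print(response["message"]["content"])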
Create Custom Models from Modelfile
ollama create my-coder -f ./Modelfile
ollama run my-coder
To update, edit the Modelfile and run ollama create again with the same name.
GPU vs CPU Detection and Configuration
Ollama auto-detects NVIDIA (CUDA), AMD (ROCm), and Apple Silicon (Metal) GPUs.
ollama ps # PROCESSOR column shows gpu or cpu
Force CPU-only:
CUDA_VISIBLE_DEVICES="" ollama serve # per-session
OLLAMA_NUM_GPU=0 ollama serve # server-wide
Per-model GPU control in a Modelfile:
PARAMETER num_gpu 0 # force CPU
PARAMETER num_gpu 999 # offload all layers to GPU (default)
For multi-GPU, set CUDA_VISIBLE_DEVICES=0,1.
Popular Models and When to Use Which
| Model | Size | Best for |
|---|---|---|
| llama3 (8B) | 4.7 GB | General chat, reasoning, instruction following |
| llama3:70b | 40 GB | Higher quality when you have the VRAM |
| codellama (7B) | 3.8 GB | Code generation, completion, infilling |
| mistral (7B) | 4.1 GB | Fast general-purpose, structured output |
| phi3 (3.8B) | 2.2 GB | Small footprint, good quality for its size |
| gemma2 (9B) | 5.4 GB | Strong reasoning, multilingual |
| deepseek-coder-v2 (16B) | 8.9 GB | Code generation, multi-language |
| nomic-embed-text | 274 MB | Text embeddings for RAG |
| llava (7B) | 4.7 GB | Multi-modal image understanding |
For constrained hardware (8 GB RAM), use phi3 or llama3 with q4 quantization.
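As a quick sanity check on such a machine, a small sketch using the Python client covered later (phi3 only; nothing else is assumed):
import ollama

# Download phi3 (~2.2 GB) if not already present, then run a short prompt.
ollama.pull("phi3")
print(ollama.generate(model="phi3", prompt="Say hello in five words.")["response"])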
Embeddings for RAG
Use nomic-embed-text (768 dimensions) or mxbai-embed-large (1024 dimensions).
curl http://localhost:11434/api/embed -d '{
"model": "nomic-embed-text",
"input": ["document chunk one", "document chunk two"]
}'
The response contains an "embeddings" array of float vectors. Store them in a vector database (ChromaDB, pgvector, Qdrant, FAISS) for similarity search.
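Retrieval itself is just nearest-neighbor search over those vectors; before introducing a database, a minimal in-memory sketch with cosine similarity (documents and query are illustrative, using the Python client from the next section):
import numpy as np
import ollama

docs = ["Ollama runs models locally", "The capital of France is Paris"]
doc_vecs = np.array(ollama.embed(model="nomic-embed-text", input=docs)["embeddings"])
query_vec = np.array(ollama.embed(model="nomic-embed-text", input="local LLM runner")["embeddings"][0])

# Cosine similarity = dot product of L2-normalized vectors.
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
query_vec /= np.linalg.norm(query_vec)
print(docs[int(np.argmax(doc_vecs @ query_vec))])  # most similar chunk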
Integration with Python
Using requests
import requests
response = requests.post("http://localhost:11434/api/generate", json={
"model": "llama3",
"prompt": "Explain monads simply",
"stream": False
})
print(response.json()["response"])
Using ollama-python (pip install ollama)
import ollama
# Chat
response = ollama.chat(model="llama3", messages=[
{"role": "user", "content": "Explain monads simply"}
])
print(response["message"]["content"])
# Embeddings
result = ollama.embed(model="nomic-embed-text", input="some text")
print(len(result["embeddings"][0])) # 768
# Streaming
for chunk in ollama.chat(model="llama3", messages=[
{"role": "user", "content": "Write a haiku"}
], stream=True):
print(chunk["message"]["content"], end="", flush=True)
Integration with JavaScript
Using fetch
const response = await fetch("http://localhost:11434/api/generate", {
method: "POST",
body: JSON.stringify({ model: "llama3", prompt: "Explain closures", stream: false }),
});
const data = await response.json();
console.log(data.response);
Using ollama-js (npm install ollama)
import { Ollama } from "ollama";
const ollama = new Ollama({ host: "http://localhost:11434" });
const response = await ollama.chat({
model: "llama3",
messages: [{ role: "user", content: "Explain closures" }],
});
console.log(response.message.content);
Integration with LangChain
Python (pip install langchain-ollama):
from langchain_ollama import ChatOllama, OllamaEmbeddings
llm = ChatOllama(model="llama3", temperature=0.3)
response = llm.invoke("What is dependency injection?")
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectors = embeddings.embed_documents(["first chunk", "second chunk"])
JavaScript (npm install @langchain/ollama):
import { ChatOllama } from "@langchain/ollama";
const llm = new ChatOllama({ model: "llama3", temperature: 0.3 });
const response = await llm.invoke("What is dependency injection?");
Multi-Modal Models
Use llava or llava-llama3 for image understanding:
curl http://localhost:11434/api/chat -d '{
"model": "llava",
"messages": [{"role": "user", "content": "Describe this image", "images": ["BASE64_DATA"]}],
"stream": false
}'
import ollama, base64
with open("photo.jpg", "rb") as f:
img = base64.b64encode(f.read()).decode()
response = ollama.chat(model="llava", messages=[
{"role": "user", "content": "What do you see?", "images": [img]}
])
Performance Tuning
Context size -- use the smallest that fits your workload:
PARAMETER num_ctx 4096 # default
PARAMETER num_ctx 32768 # long documents, more VRAM
Partial GPU offloading when model exceeds VRAM:
PARAMETER num_gpu 20 # 20 layers on GPU, rest on CPU
Batch size for prompt processing speed:
PARAMETER num_batch 512 # default, increase for faster eval
Keep-alive control (how long model stays loaded):
curl http://localhost:11434/api/generate -d '{
"model": "llama3", "prompt": "hi", "keep_alive": "30m"
}'
Use "keep_alive": 0 to unload immediately, -1 to keep indefinitely.
Server-wide environment variables:
- OLLAMA_MAX_LOADED_MODELS -- concurrent models in memory (default 1).
- OLLAMA_NUM_PARALLEL -- concurrent requests per model.
- OLLAMA_FLASH_ATTENTION=1 -- reduce memory with flash attention.
Running as a Service
macOS (Homebrew):
brew services start ollama
Linux (systemd -- created automatically by the install script):
sudo systemctl enable --now ollama
sudo journalctl -u ollama -f
Customize with sudo systemctl edit ollama:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_NUM_PARALLEL=4"
Then sudo systemctl daemon-reload && sudo systemctl restart ollama.
Docker:
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec -it ollama ollama run llama3
With NVIDIA GPU: add --gpus=all to the run command.