Ollama
Install and Setup
macOS:
brew install ollama
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download from https://ollama.com/download/windows.
Start the server:
ollama serve
The server listens on http://localhost:11434 by default. Set OLLAMA_HOST to change the bind address.
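For example, once the server is bound to a custom address, any HTTP client simply targets that address. A minimal sketch (the port 11500 is an illustrative value, not a default):
# Assumes the server was started with, e.g.: OLLAMA_HOST=0.0.0.0:11500 ollama serve
import requests

r = requests.get("http://localhost:11500/api/tags")  # lists locally downloaded models
print(r.json())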
Pull and Run Models
ollama pull llama3 # download without running
ollama run llama3 # run (auto-pulls if missing)
ollama list # list downloaded models
Interactive Chat vs One-Shot Generation
Interactive chat (opens a REPL, type /bye to exit):
ollama run llama3
One-shot generation (pipe input, get output, exit):
echo "Explain quicksort in two sentences" | ollama run llama3
cat main.py | ollama run codellama "Review this code for bugs"
REST API
Generate (completion)
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Why is the sky blue?",
"stream": false
}'
Chat (multi-turn)
curl http://localhost:11434/api/chat -d '{
"model": "llama3",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is 2+2?"}
],
"stream": false
}'
Embeddings
curl http://localhost:11434/api/embed -d '{
"model": "nomic-embed-text",
"input": "Ollama is a tool for running local LLMs"
}'
Set "stream": true (the default) to receive newline-delimited JSON chunks.
Model Management
ollama list # list downloaded models
ollama show llama3 # show model details (parameters, template, license)
ollama cp llama3 my-llama3 # copy/alias a model
ollama rm my-llama3 # delete a model
ollama ps # list currently loaded/running models
ollama ps shows each loaded model's memory footprint, whether it is running on CPU or GPU, and how long until it unloads.
Modelfile
A Modelfile defines a custom model:
FROM llama3
SYSTEM "You are a senior software engineer. Be concise. Provide code examples."
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
PARAMETER stop "<|eot_id|>"
Key parameters:
- temperature -- randomness (0.0 = deterministic, 1.0+ = creative).
- num_ctx -- context window in tokens. Higher values use more VRAM.
- top_p -- nucleus sampling threshold.
- top_k -- limits token selection pool.
- repeat_penalty -- penalizes repeated tokens.
- stop -- stop sequence(s).
- num_gpu -- layers to offload to GPU (0 for CPU-only).
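These parameters are not limited to Modelfiles: the REST API and client libraries accept the same names in a per-request options object, which overrides the Modelfile values for that call. A small sketch with the Python client covered later in this document:
import ollama

# Request-time options override the model's Modelfile defaults for this call only.
response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Summarize RAID levels in one paragraph."}],
    options={"temperature": 0.3, "num_ctx": 8192, "top_p": 0.9},
)
print(response["message"]["content"])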
Create Custom Models from Modelfile
ollama create my-coder -f ./Modelfile
ollama run my-coder
To update, edit the Modelfile and run ollama create again with the same name.
GPU vs CPU Detection and Configuration
Ollama auto-detects NVIDIA (CUDA), AMD (ROCm), and Apple Silicon (Metal) GPUs.
ollama ps # PROCESSOR column shows gpu or cpu
Force CPU-only:
CUDA_VISIBLE_DEVICES="" ollama serve # per-session
OLLAMA_NUM_GPU=0 ollama serve # server-wide
Per-model GPU control in a Modelfile:
PARAMETER num_gpu 0 # force CPU
PARAMETER num_gpu 999 # offload all layers to GPU (default)
For multi-GPU, set CUDA_VISIBLE_DEVICES=0,1.
Popular Models and When to Use Which
| Model | Size | Best for |
|---|---|---|
| llama3 (8B) | 4.7 GB | General chat, reasoning, instruction following |
| llama3:70b | 40 GB | Higher quality when you have the VRAM |
| codellama (7B) | 3.8 GB | Code generation, completion, infilling |
| mistral (7B) | 4.1 GB | Fast general-purpose, structured output |
| phi3 (3.8B) | 2.2 GB | Small footprint, good quality for its size |
| gemma2 (9B) | 5.4 GB | Strong reasoning, multilingual |
| deepseek-coder-v2 (16B) | 8.9 GB | Code generation, multi-language |
| nomic-embed-text | 274 MB | Text embeddings for RAG |
| llava (7B) | 4.7 GB | Multi-modal image understanding |
For constrained hardware (8 GB RAM), use phi3 or llama3 with q4 quantization.
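As a quick sanity check on such a machine, a small sketch using the Python client covered later (phi3 only; nothing else is assumed):
import ollama

# Download phi3 (~2.2 GB) if not already present, then run a short prompt.
ollama.pull("phi3")
print(ollama.generate(model="phi3", prompt="Say hello in five words.")["response"])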
Embeddings for RAG
Use nomic-embed-text (768 dimensions) or mxbai-embed-large (1024 dimensions).
curl http://localhost:11434/api/embed -d '{
"model": "nomic-embed-text",
"input": ["document chunk one", "document chunk two"]
}'
The response contains an "embeddings" array of float vectors. Store them in a vector database (ChromaDB, pgvector, Qdrant, FAISS) for similarity search.
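Retrieval itself is just nearest-neighbor search over those vectors; before introducing a database, a minimal in-memory sketch with cosine similarity (documents and query are illustrative, using the Python client from the next section):
import numpy as np
import ollama

docs = ["Ollama runs models locally", "The capital of France is Paris"]
doc_vecs = np.array(ollama.embed(model="nomic-embed-text", input=docs)["embeddings"])
query_vec = np.array(ollama.embed(model="nomic-embed-text", input="local LLM runner")["embeddings"][0])

# Cosine similarity = dot product of L2-normalized vectors.
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
query_vec /= np.linalg.norm(query_vec)
print(docs[int(np.argmax(doc_vecs @ query_vec))])  # most similar chunk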
Integration with Python
Using requests
import requests
response = requests.post("http://localhost:11434/api/generate", json={
"model": "llama3",
"prompt": "Explain monads simply",
"stream": False
})
print(response.json()["response"])
Using ollama-python (pip install ollama)
import ollama
# Chat
response = ollama.chat(model="llama3", messages=[
{"role": "user", "content": "Explain monads simply"}
])
print(response["message"]["content"])
# Embeddings
result = ollama.embed(model="nomic-embed-text", input="some text")
print(len(result["embeddings"][0])) # 768
# Streaming
for chunk in ollama.chat(model="llama3", messages=[
{"role": "user", "content": "Write a haiku"}
], stream=True):
print(chunk["message"]["content"], end="", flush=True)
Integration with JavaScript
Using fetch
const response = await fetch("http://localhost:11434/api/generate", {
method: "POST",
body: JSON.stringify({ model: "llama3", prompt: "Explain closures", stream: false }),
});
const data = await response.json();
console.log(data.response);
Using ollama-js (npm install ollama)
import { Ollama } from "ollama";
const ollama = new Ollama({ host: "http://localhost:11434" });
const response = await ollama.chat({
model: "llama3",
messages: [{ role: "user", content: "Explain closures" }],
});
console.log(response.message.content);
Integration with LangChain
Python (pip install langchain-ollama):
from langchain_ollama import ChatOllama, OllamaEmbeddings
llm = ChatOllama(model="llama3", temperature=0.3)
response = llm.invoke("What is dependency injection?")
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectors = embeddings.embed_documents(["first chunk", "second chunk"])
JavaScript (npm install @langchain/ollama):
import { ChatOllama } from "@langchain/ollama";
const llm = new ChatOllama({ model: "llama3", temperature: 0.3 });
const response = await llm.invoke("What is dependency injection?");
Multi-Modal Models
Use llava or llava-llama3 for image understanding:
curl http://localhost:11434/api/chat -d '{
"model": "llava",
"messages": [{"role": "user", "content": "Describe this image", "images": ["BASE64_DATA"]}],
"stream": false
}'
import ollama, base64
with open("photo.jpg", "rb") as f:
img = base64.b64encode(f.read()).decode()
response = ollama.chat(model="llava", messages=[
{"role": "user", "content": "What do you see?", "images": [img]}
])
Performance Tuning
Context size -- use the smallest that fits your workload:
PARAMETER num_ctx 4096 # default
PARAMETER num_ctx 32768 # long documents, more VRAM
Partial GPU offloading when model exceeds VRAM:
PARAMETER num_gpu 20 # 20 layers on GPU, rest on CPU
Batch size for prompt processing speed:
PARAMETER num_batch 512 # default, increase for faster eval
Keep-alive control (how long model stays loaded):
curl http://localhost:11434/api/generate -d '{
"model": "llama3", "prompt": "hi", "keep_alive": "30m"
}'
Use "keep_alive": 0 to unload immediately, -1 to keep indefinitely.
Server-wide environment variables:
- OLLAMA_MAX_LOADED_MODELS -- concurrent models in memory (default 1).
- OLLAMA_NUM_PARALLEL -- concurrent requests per model.
- OLLAMA_FLASH_ATTENTION=1 -- reduce memory with flash attention.
Running as a Service
macOS (Homebrew):
brew services start ollama
Linux (systemd -- created automatically by the install script):
sudo systemctl enable --now ollama
sudo journalctl -u ollama -f
Customize with sudo systemctl edit ollama:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_NUM_PARALLEL=4"
Then sudo systemctl daemon-reload && sudo systemctl restart ollama.
Docker:
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec -it ollama ollama run llama3
With NVIDIA GPU: add --gpus=all to the run command.