# Ollama RAG Guide
Build retrieval-augmented generation (RAG) systems with Ollama: run models locally, or use Ollama's cloud for models too large to host yourself.
## Ollama Cloud Models (Dec 2025)
Sign in with `ollama signin` (requires Ollama v0.12 or later) to use cloud models. They run on Ollama's infrastructure, so no local storage or GPU is needed, and Ollama states that cloud requests are not retained.
| Model | Params | Context | Best For |
|---|---|---|---|
| `deepseek-v3.2:cloud` | 671B | 160K | GPT-5 level, reasoning |
| `deepseek-v3.1:671b-cloud` | 671B | 160K | Thinking + non-thinking hybrid |
| `qwen3-coder:480b-cloud` | 480B | 256K-1M | Agentic coding, repo-scale |
| `minimax-m2:cloud` | 230B (10B active) | 128K | #1 open-source, tools |
| `gpt-oss:120b-cloud` | 120B | 128K | OpenAI open weights |
| `glm-4.6:cloud` | - | - | Code generation |
```bash
# Sign in to access cloud models
ollama signin

# Run cloud models
ollama run deepseek-v3.2:cloud
ollama run qwen3-coder:480b-cloud
ollama run minimax-m2:cloud
```
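The Python client treats cloud models like any other model tag; a minimal sketch, assuming `ollama signin` has already been run:

```python
import ollama

# After `ollama signin`, cloud models are invoked exactly like local ones;
# inference just runs on Ollama's servers instead of your machine.
response = ollama.chat(
    model="deepseek-v3.2:cloud",
    messages=[{"role": "user", "content": "Summarize RAG in one sentence."}],
)
print(response["message"]["content"])
```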
## Local Models (Dec 2025)
### Reasoning Models
| Model | Params | Context | Best For |
|---|---|---|---|
| `nemotron-3-nano` | 30B (3.6B active) | 1M tokens | Agents, long docs, code |
| `deepseek-r1` | 7B-671B | 128K | Reasoning, math, code |
| `qwq` | 32B | 32K | Logic, analysis |
| `llama4` | 109B/400B | 128K | General, multimodal |
### Fast/Efficient Models
| Model | Size | RAM | Speed |
|---|---|---|---|
| `llama3.2:3b` | 2GB | 8GB | Very fast |
| `mistral-small-3.1` | 24B | 16GB | Fast |
| `gemma3` | 4B-27B | 8-32GB | Balanced |
### Embedding Models
| Model | Dims | Context | MTEB Score |
|---|---|---|---|
| `snowflake-arctic-embed2` | 1024 | 8K | 67.5 |
| `mxbai-embed-large` | 1024 | 512 | 64.68 |
| `nomic-embed-text` | 768 | 8K | 53.01 |
**Recommendation:** `snowflake-arctic-embed2` for accuracy, `nomic-embed-text` for speed.
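Before committing to an embedding model, it is worth confirming that its vector dimensions match what your vector store expects; a quick sketch with the `ollama` Python client:

```python
import ollama

# Compare output dimensions of two embedding models from the table above
for model in ["snowflake-arctic-embed2", "nomic-embed-text"]:
    vec = ollama.embed(model=model, input="hello world")["embeddings"][0]
    print(f"{model}: {len(vec)} dims")  # expect 1024 and 768
```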
## Quick Start
### Cloud (No Local Resources)
```bash
ollama signin
ollama run deepseek-v3.2:cloud     # GPT-5 level
ollama run qwen3-coder:480b-cloud  # 1M context for huge repos
```
### Local
```bash
ollama pull nemotron-3-nano          # 1M context, 24GB VRAM
ollama pull snowflake-arctic-embed2

# Or for lower RAM (8GB)
ollama pull llama3.2:3b
ollama pull nomic-embed-text
```
## Stack Options
### Option A: LangChain + ChromaDB (Most Common)
```python
from langchain.chains import RetrievalQA
from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_ollama import OllamaLLM, OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load and split
loader = PyPDFLoader("document.pdf")
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = splitter.split_documents(loader.load())

# Embed and store
embeddings = OllamaEmbeddings(model="snowflake-arctic-embed2")
vectorstore = Chroma.from_documents(docs, embeddings, persist_directory="./db")

# Query with a LOCAL model...
llm = OllamaLLM(model="nemotron-3-nano")
# ...or a CLOUD model (GPT-5 level, no local resources)
# llm = OllamaLLM(model="deepseek-v3.2:cloud")

retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
answer = qa.invoke("What is the main topic?")
```
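Since the store above was created with `persist_directory`, later runs can reopen it instead of re-embedding everything; a short sketch, assuming the same `./db` path and embedding model:

```python
# Reopen the persisted store on a later run (no re-embedding needed)
vectorstore = Chroma(persist_directory="./db", embedding_function=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
```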
### Option B: LlamaIndex (Better Accuracy)
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama

# Configure
Settings.llm = Ollama(model="nemotron-3-nano", request_timeout=300.0)
Settings.embed_model = OllamaEmbedding(model_name="snowflake-arctic-embed2")

# Load and index
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the key findings")
```
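Re-indexing `./docs` on every run is wasteful; LlamaIndex can persist the index to disk and reload it later. A sketch, assuming a `./storage` directory:

```python
from llama_index.core import StorageContext, load_index_from_storage

# Save the index once after building it
index.storage_context.persist(persist_dir="./storage")

# On later runs, reload instead of re-indexing
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
```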
### Option C: Direct Ollama API (Minimal Dependencies)
```python
import ollama
import chromadb

# Embed
def embed(text):
    return ollama.embed(model="nomic-embed-text", input=text)["embeddings"][0]

# Store in ChromaDB
client = chromadb.PersistentClient(path="./db")
collection = client.get_or_create_collection("docs")
collection.add(ids=["1"], documents=["text"], embeddings=[embed("text")])

# Retrieve and generate
results = collection.query(query_embeddings=[embed("query")], n_results=3)
context = "\n".join(results["documents"][0])
response = ollama.chat(
    model="nemotron-3-nano",
    messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: ..."}],
)
```
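A real corpus needs more than one chunk. A minimal batch-ingestion sketch for the same collection; the fixed-size splitter and the `document.txt` path are illustrative placeholders, not part of any API:

```python
# Naive fixed-size chunking; swap in a real splitter for production use
def chunk(text, size=1000, overlap=200):
    return [text[i : i + size] for i in range(0, len(text), size - overlap)]

chunks = chunk(open("document.txt").read())
collection.add(
    ids=[f"doc-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=[embed(c) for c in chunks],
)
```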
## Vector Database Options
| Database | Install | Best For |
|---|---|---|
| ChromaDB | `pip install chromadb` | Simple, embedded |
| FAISS | `pip install faiss-cpu` | Fast similarity |
| Qdrant | `pip install qdrant-client` | Production scale |
| Weaviate | Docker | Full-featured |
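Swapping stores in Option A is usually a one-line change; for example, FAISS through LangChain (a sketch, assuming the `docs` and `embeddings` objects from Option A and `pip install faiss-cpu`):

```python
from langchain_community.vectorstores import FAISS

# Build an in-memory FAISS index from the same docs/embeddings as Option A
vectorstore = FAISS.from_documents(docs, embeddings)
vectorstore.save_local("./faiss_index")  # optional on-disk persistence
```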
## Nemotron 3 Nano Deep Dive
**Why Nemotron for RAG:**

- 1M token context = entire codebases, long documents
- Hybrid Mamba-Transformer = 4x faster inference
- MoE (3.6B active params) = runs on 24GB VRAM
- Apache 2.0 license = commercial use OK
```python
from langchain_ollama import OllamaLLM

# For very long documents
llm = OllamaLLM(
    model="nemotron-3-nano",
    num_ctx=131072,   # 128K context, increase as needed
    temperature=0.1,  # lower for factual RAG
)
```
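The same context setting works in the raw API via the `options` mapping; a sketch for stuffing a long document straight into the prompt (the file path is a placeholder):

```python
import ollama

# With a large num_ctx, long documents can go straight into the prompt
long_doc = open("big_file.txt").read()
response = ollama.chat(
    model="nemotron-3-nano",
    messages=[{"role": "user", "content": f"{long_doc}\n\nQuestion: ..."}],
    options={"num_ctx": 131072},  # raise toward 1M as hardware allows
)
```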
## Hardware Requirements
| Model | RAM | GPU VRAM |
|---|---|---|
| 3B models | 8GB | 4GB |
| 7-8B models | 16GB | 8GB |
| 30B models | 32GB | 24GB |
| 70B+ models | 64GB+ | 48GB+ |
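To check what a model actually consumes once loaded, `ollama ps` lists running models with their memory footprint and CPU/GPU split:

```bash
ollama ps
```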