# Ollama Optimizer
Optimize Ollama configuration based on system hardware analysis.
## When to Use
Use this skill when the user asks to optimize Ollama, configure Ollama, speed up Ollama, fix Ollama running slow, set up a local LLM, tune inference speed, reduce memory usage, or select models that fit their GPU/RAM. The skill analyzes hardware (GPU, VRAM, RAM, CPU) and produces tailored recommendations.
Do not use for LM Studio, llama.cpp, vLLM, or hosted-API LLM providers (OpenAI, Anthropic) — those use different runtimes and tuning surfaces.
## Repo Sync Before Edits (mandatory)
Before creating/updating/deleting files in an existing repository, sync the current branch with remote:
branch="$(git rev-parse --abbrev-ref HEAD)"
git fetch origin
git pull --rebase origin "$branch"
If the working tree is not clean, stash first, sync, then restore:
```bash
git stash push -u -m "pre-sync"
branch="$(git rev-parse --abbrev-ref HEAD)"
git fetch origin && git pull --rebase origin "$branch"
git stash pop
```
If origin is missing, pull is unavailable, or rebase/stash conflicts occur, stop and ask the user before continuing.
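A minimal guarded version of the sequence above, sketched under the assumption of a standard `origin` remote; it exits at each failure point where the rule says to stop and ask, instead of continuing blindly:

```bash
# Sketch only: combines the sync steps with the guard conditions above.
set -e
git remote get-url origin >/dev/null 2>&1 || { echo "no 'origin' remote; ask the user" >&2; exit 1; }

dirty=""
if [ -n "$(git status --porcelain)" ]; then   # tree not clean (incl. untracked)
  dirty=1
  git stash push -u -m "pre-sync"
fi

branch="$(git rev-parse --abbrev-ref HEAD)"
git fetch origin
git pull --rebase origin "$branch" || { echo "rebase conflict; ask the user" >&2; exit 1; }

if [ -n "$dirty" ]; then git stash pop; fi    # pop conflict also aborts via set -e
```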
## Workflow
### Phase 1: System Detection
Run the detection script to gather hardware information:
```bash
python3 scripts/detect_system.py
```
Parse the JSON output to identify the following (a parsing sketch follows the list):
- OS and version
- CPU model and core count
- Total RAM / unified memory
- GPU type, VRAM, and driver version
- Current Ollama installation and environment variables
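A minimal sketch of the parse step, assuming `jq` is available; the JSON keys used here (`.os`, `.gpu.vram_gb`, `.memory.total_gb`) are hypothetical placeholders, not the script's confirmed schema:

```bash
# Sketch only: substitute the field names detect_system.py actually emits.
sys="$(python3 scripts/detect_system.py)"
os="$(echo "$sys" | jq -r '.os')"                  # e.g. "macOS 14"
vram_gb="$(echo "$sys" | jq -r '.gpu.vram_gb')"    # hypothetical key
ram_gb="$(echo "$sys" | jq -r '.memory.total_gb')" # hypothetical key
echo "OS=$os VRAM=${vram_gb}GB RAM=${ram_gb}GB"
```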
### Phase 2: Analyze and Recommend
Based on detected hardware, determine the optimization profile:
**Hardware Tier Classification:**
| Tier | Criteria | Max Model | Key Optimizations |
|---|---|---|---|
| CPU-only | No GPU detected | 3B | num_thread tuning, Q4_K_M quant |
| Low VRAM | <6GB VRAM | 3B | Flash attention, KV cache q4_0 |
| Entry | 6-8GB VRAM | 8B | Flash attention, KV cache q8_0 |
| Prosumer | 10-12GB VRAM | 14B | Flash attention, full offload |
| Workstation | 16-24GB VRAM | 32B | Standard config, Q5_K_M option |
| High-end | 48GB+ VRAM | 70B+ | Multiple models, Q5/Q6 quants |
**Apple Silicon Special Case** (tier logic sketched after this list):
- Unified memory = shared CPU/GPU RAM
- 8GB Mac → treat as 6GB VRAM tier
- 16GB Mac → treat as 12GB VRAM tier
- 32GB+ Mac → treat as workstation tier
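A rough sketch tying the table and the Apple Silicon mapping together. The thresholds mirror the table above (the unspecified 24-48GB range is folded into Workstation here), and the 3/4 factor is an assumption chosen to reproduce the 8GB→6GB and 16GB→12GB mappings:

```bash
# Sketch: classify a machine from its (effective) VRAM in GB.
classify_tier() {
  local vram_gb="$1"
  if   [ "$vram_gb" -eq 0 ];  then echo "CPU-only (max ~3B)"
  elif [ "$vram_gb" -lt 6 ];  then echo "Low VRAM (max ~3B)"
  elif [ "$vram_gb" -le 8 ];  then echo "Entry (max ~8B)"
  elif [ "$vram_gb" -le 12 ]; then echo "Prosumer (max ~14B)"
  elif [ "$vram_gb" -lt 48 ]; then echo "Workstation (max ~32B)"
  else                             echo "High-end (70B+)"
  fi
}

# Apple Silicon: unified memory is shared, so count ~3/4 of it as GPU budget.
effective_vram() { echo $(( $1 * 3 / 4 )); }

classify_tier "$(effective_vram 16)"   # -> Prosumer (max ~14B)
```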
### Phase 3: Generate Optimization Plan
Create a structured optimization guide with these sections:
#### 1. System Overview
Present the detected hardware specs and highlight constraints (e.g., "8GB unified memory limits you to ~7B models").
#### 2. Dependency Assessment
List what's needed based on the platform:
- macOS: Ollama only (Metal automatic)
- Linux NVIDIA: Ollama + NVIDIA driver 450+
- Linux AMD: Ollama + ROCm 5.0+
- Windows: Ollama + NVIDIA driver 452+
#### 3. Configuration Recommendations
Essential environment variables:
```bash
# Always recommended
export OLLAMA_FLASH_ATTENTION=1

# Memory-constrained systems (<12GB)
export OLLAMA_KV_CACHE_TYPE=q8_0   # or q4_0 for severe constraints
```
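To satisfy the acceptance criteria below (variables persisted to a shell init file, with a backup and a one-command rollback), something like the following works; `~/.zshrc` is a placeholder for whichever file the user's shell actually sources:

```bash
rc="$HOME/.zshrc"              # adjust to the user's real init file
cp "$rc" "$rc.ollama-backup"   # backup enables one-command rollback
{
  echo 'export OLLAMA_FLASH_ATTENTION=1'
  echo 'export OLLAMA_KV_CACHE_TYPE=q8_0'
} >> "$rc"

# Rollback (single command):
#   cp "$HOME/.zshrc.ollama-backup" "$HOME/.zshrc"
```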
Model selection guidance:
- Recommend specific models based on `ollama list` output
- Suggest appropriate quantization (Q4_K_M default, Q5_K_M if headroom exists)
- Warn if current models exceed hardware capacity (a rough sizing check follows)
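For the capacity check, a Q4_K_M GGUF weighs roughly 0.6GB per billion parameters (≈4.85 bits/weight), plus headroom for KV cache and buffers; the 1.5GB overhead below is an assumed figure, not a measured constant:

```bash
# Rule of thumb only; real usage varies with context length and KV cache type.
fits_in_vram() {
  local params_b="$1" vram_gb="$2"
  awk -v p="$params_b" -v v="$vram_gb" 'BEGIN {
    need = p * 0.6 + 1.5   # weights + assumed ~1.5GB KV/buffer overhead
    printf "~%.1fGB needed: %s\n", need, (need <= v ? "fits" : "too big")
  }'
}

fits_in_vram 8 8    # ~6.3GB needed: fits
fits_in_vram 14 8   # ~9.9GB needed: too big
```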
Modelfile tuning (when needed):
```
PARAMETER num_gpu <layers>     # Partial offload for limited VRAM
PARAMETER num_thread <cores>   # CPU threads (physical cores, not hyperthreads)
PARAMETER num_ctx <size>       # Reduce context for memory savings
```
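For example, a hypothetical reduced-footprint variant built from an existing model (the base tag and parameter values here are illustrative, not recommendations):

```bash
cat > Modelfile <<'EOF'
FROM llama3.1:8b-instruct-q4_K_M
PARAMETER num_gpu 24        # offload 24 layers to GPU, rest on CPU
PARAMETER num_thread 6      # physical core count
PARAMETER num_ctx 4096      # smaller context window to save memory
EOF
ollama create llama3.1-lean -f Modelfile
```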
#### 4. Execution Checklist
Provide copy-paste commands in order:
- Set environment variables
- Restart the Ollama service (platform commands below)
- Pull recommended models
- Test with `ollama run <model> --verbose`
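Restart commands differ by platform; these assume the default installs (Linux systemd service, macOS menu-bar app). Note that on the Linux systemd install, environment variables must be set in the service unit (via `systemctl edit ollama`), not just in your shell profile:

```bash
# Linux (systemd install)
sudo systemctl restart ollama

# macOS (menu-bar app): quit and relaunch
osascript -e 'tell application "Ollama" to quit'
open -a Ollama
```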
#### 5. Verification Commands
```bash
# Benchmark current performance
python3 scripts/benchmark_ollama.py --model <model>
# Expected output: tokens/s and generation latency.
# Compare against the tier baseline from Phase 2.

# Check GPU memory usage (NVIDIA)
nvidia-smi

# Verify config is applied
ollama run <model> "test" --verbose 2>&1 | head -20
```
## Acceptance Criteria
A run passes when all of the following are true:
- Hardware tier (CPU-only / Low-VRAM / Entry / Prosumer / Workstation / High-end) is identified explicitly in the report.
- Recommended model size fits within the detected VRAM/unified-memory budget (e.g., do not recommend a 14B model on an 8GB Mac).
- Required Ollama environment variables (e.g., `OLLAMA_FLASH_ATTENTION`, KV-cache quantization) are written to a shell init file the user actually uses, with a backup of the prior file.
- Apple Silicon special case is applied when detected — unified memory is not double-counted as separate VRAM + RAM.
- Verification step runs `ollama run <model>` with `--verbose` and captures the actual offload/cache numbers.
- Rollback instructions are included so the user can revert all env changes with one command.
## Step Completion Reports
After completing each major step, output a status report in this format:
```
◆ [Step Name] ([step N of M] — [context])
··································································
[Check 1]: √ pass
[Check 2]: √ pass (note if relevant)
[Check 3]: × fail — [reason]
[Check 4]: √ pass
[Criteria]: √ N/M met
____________________________
Result: PASS | FAIL | PARTIAL
```
Adapt the check names to match what the step actually validates. Use √ for pass, × for fail, and — to add brief context. The "Criteria" line summarizes how many acceptance criteria were met. The "Result" line gives the overall verdict.
### Detection (step 1 of 4)
```
◆ Detection (step 1 of 4 — hardware profiling)
··································································
Hardware detected: √ pass — macOS 14, Apple M2
GPU identified: √ pass — Apple Metal (unified memory)
RAM measured: √ pass — 16GB unified memory
[Criteria]: √ 3/3 met
____________________________
Result: PASS
```
### Analysis (step 2 of 4)
```
◆ Analysis (step 2 of 4 — profile selection)
··································································
Tier classified: √ pass — Prosumer (16GB unified)
Profile selected: √ pass — Flash attention, full offload
Bottlenecks identified: √ pass — memory bandwidth primary constraint
[Criteria]: √ 3/3 met
____________________________
Result: PASS
```
### Plan (step 3 of 4)
```
◆ Plan (step 3 of 4 — optimization guide)
··································································
Guide generated: √ pass — ollama-optimization-guide.md written
Parameters tuned: √ pass — OLLAMA_FLASH_ATTENTION=1, KV_CACHE_TYPE=q8_0
Model recommendations ready: √ pass — qwen2.5:14b-instruct-q4_K_M suggested
[Criteria]: √ 3/3 met
____________________________
Result: PASS
```
### Verification (step 4 of 4)
```
◆ Verification (step 4 of 4 — config validation)
··································································
Benchmark commands listed: √ pass — python3 scripts/benchmark_ollama.py
Config verified: √ pass — ollama run --verbose output checked
[Criteria]: √ 2/2 met
____________________________
Result: PASS
```
## Reference Files
- VRAM Requirements - Model sizing and quantization guide
- Environment Variables - Complete env var reference
- Platform-Specific Setup - OS-specific installation and configuration
## Expected Output
Generate an ollama-optimization-guide.md file. Ask the user where to save it (suggest ~/.config/ollama/optimization-guide.md or current directory). Contents:
```markdown
# Ollama Optimization Guide
**Generated:** <timestamp>
**System:** <OS> | <CPU> | <RAM>GB RAM | <GPU>

## System Overview
<hardware summary and constraints>

## Current Configuration
<existing Ollama setup and env vars>

## Recommendations

### Environment Variables
<shell commands to set vars>

### Model Selection
<recommended models with rationale>

### Performance Tuning
<Modelfile adjustments if needed>

## Execution Checklist
- [ ] <step 1>
- [ ] <step 2>
...

## Verification
<benchmark commands and expected results>

## Rollback
<commands to revert changes if needed>
```
## Quick Optimization Commands
For users who want immediate results without full analysis:
**macOS (Apple Silicon):**
```bash
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama pull llama3.2:3b   # Safe for 8GB, fast
```
**Linux/Windows with 8GB NVIDIA GPU:**
```bash
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama pull llama3.1:8b-instruct-q4_K_M
```
**CPU-only systems:**
```bash
export CUDA_VISIBLE_DEVICES=-1
ollama pull llama3.2:3b
# Create a Modelfile with: PARAMETER num_thread 4
```