# Ollama Optimizer

Optimize Ollama configuration based on system hardware analysis.
## Workflow

### Phase 1: System Detection

Run the detection script to gather hardware information:

```bash
python3 scripts/detect_system.py
```

Parse the JSON output to identify (a parsing sketch follows this list):
- OS and version
- CPU model and core count
- Total RAM / unified memory
- GPU type, VRAM, and driver version
- Current Ollama installation and environment variables
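A minimal sketch of pulling those fields out of the report with `jq`. The field names (`os`, `cpu`, `cpu_cores`, `ram_gb`, `gpu`, `vram_gb`) are assumptions; the actual schema of `scripts/detect_system.py` may differ:

```bash
# Capture the detection report, then extract the fields of interest.
# Field names here are assumptions; adjust to the script's real schema.
python3 scripts/detect_system.py > system.json
jq -r '"OS:  \(.os)",
       "CPU: \(.cpu) (\(.cpu_cores) cores)",
       "RAM: \(.ram_gb) GB",
       "GPU: \(.gpu) (\(.vram_gb) GB VRAM)"' system.json
```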
### Phase 2: Analyze and Recommend

Based on the detected hardware, determine the optimization profile.

**Hardware Tier Classification:**
| Tier | Criteria | Max Model | Key Optimizations |
|---|---|---|---|
| CPU-only | No GPU detected | 3B | num_thread tuning, Q4_K_M quant |
| Low VRAM | <6GB VRAM | 3B | Flash attention, KV cache q4_0 |
| Entry | 6-8GB VRAM | 8B | Flash attention, KV cache q8_0 |
| Prosumer | 10-12GB VRAM | 14B | Flash attention, full offload |
| Workstation | 16-24GB VRAM | 32B | Standard config, Q5_K_M option |
| High-end | 48GB+ VRAM | 70B+ | Multiple models, Q5/Q6 quants |
**Apple Silicon Special Case** (classification logic is sketched after this list):
- Unified memory = shared CPU/GPU RAM
- 8GB Mac → treat as 6GB VRAM tier
- 16GB Mac → treat as 12GB VRAM tier
- 32GB+ Mac → treat as workstation tier
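One way to encode the table is a small shell helper. This is a sketch: the function name, the integer-GB assumption, and the rounding of gaps between table rows (e.g. 9GB, 32GB) are choices of the example, not part of the table:

```bash
# classify_tier <vram_gb> -- echoes the tier from the table above.
classify_tier() {
  local vram=$1   # integer GB; for Apple Silicon pass the discounted figure:
                  # 8GB unified -> 6, 16GB -> 12, 32GB+ -> workstation-class
  if   (( vram == 0 ));  then echo "CPU-only"
  elif (( vram < 6 ));   then echo "Low VRAM"
  elif (( vram <= 8 ));  then echo "Entry"
  elif (( vram <= 12 )); then echo "Prosumer"
  elif (( vram < 48 ));  then echo "Workstation"
  else                        echo "High-end"
  fi
}

classify_tier 12   # -> Prosumer
```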
### Phase 3: Generate Optimization Plan

Create a structured optimization guide with these sections:

#### 1. System Overview

Present detected hardware specs and highlight constraints (e.g., "8GB unified memory limits you to ~7B models").
#### 2. Dependency Assessment

List what's needed based on the platform (a quick driver probe is sketched after this list):

- macOS: Ollama only (Metal acceleration is automatic)
- Linux NVIDIA: Ollama + NVIDIA driver 450+
- Linux AMD: Ollama + ROCm 5.0+
- Windows: Ollama + NVIDIA driver 452+
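A quick probe for the Linux/Windows NVIDIA path; `nvidia-smi` ships with the driver, so its absence is a reasonable (if imperfect) signal that the CPU-only or AMD path applies:

```bash
# Is an NVIDIA driver present, and which version?
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=driver_version --format=csv,noheader
else
  echo "No nvidia-smi found: CPU-only, Apple Silicon, or AMD/ROCm path"
fi
```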
#### 3. Configuration Recommendations

Essential environment variables:

```bash
# Always recommended
export OLLAMA_FLASH_ATTENTION=1

# Memory-constrained systems (<12GB); note that KV cache quantization
# only takes effect when flash attention is enabled
export OLLAMA_KV_CACHE_TYPE=q8_0   # or q4_0 for severe constraints
```
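These exports only affect the current shell. On Linux, where Ollama typically runs as a systemd service, the variables belong in a service override instead (a sketch, assuming the standard Ollama install):

```bash
# Persist the variables for the systemd-managed Ollama server (Linux).
sudo systemctl edit ollama.service
# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=1"
#   Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
sudo systemctl daemon-reload && sudo systemctl restart ollama
```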
Model selection guidance (a rough sizing heuristic follows this list):

- Recommend specific models from `ollama list` output
- Suggest appropriate quantization (Q4_K_M by default, Q5_K_M if headroom exists)
- Warn if current models exceed hardware capacity
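As a back-of-envelope check (a heuristic, not an exact formula): a Q4_K_M file weighs roughly 0.6GB per billion parameters, plus a GB or two of overhead for context and buffers. The function and constants below are illustrative:

```bash
# estimate_gb <params_in_billions> <quant> -- rough VRAM need in GB.
estimate_gb() {
  local params_b=$1 quant=$2
  case $quant in
    q4*) echo $(( params_b * 6 / 10 + 2 ));;   # ~0.6 GB per B params
    q5*) echo $(( params_b * 7 / 10 + 2 ));;   # ~0.7 GB per B params
    q8*) echo $(( params_b + 2 ));;            # ~1.0 GB per B params
  esac
}

estimate_gb 8 q4_K_M   # -> 6 (an 8B Q4_K_M model fits an 8GB GPU, barely)
```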
Modelfile tuning (when needed; a worked example follows):

```
PARAMETER num_gpu <layers>     # Partial offload for limited VRAM
PARAMETER num_thread <cores>   # CPU threads (physical cores, not hyperthreads)
PARAMETER num_ctx <size>       # Reduce context window for memory savings
```
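A hypothetical end-to-end example: the base model, parameter values, and the `llama3.1-tuned` name are placeholders to adapt to the detected hardware:

```bash
# Build a tuned variant for, say, an 8GB card that can't hold every layer.
cat > Modelfile <<'EOF'
FROM llama3.1:8b-instruct-q4_K_M
PARAMETER num_gpu 28
PARAMETER num_thread 6
PARAMETER num_ctx 4096
EOF
ollama create llama3.1-tuned -f Modelfile
```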
#### 4. Execution Checklist

Provide copy-paste commands in order (an example sequence follows this list):

- Set environment variables
- Restart the Ollama service
- Pull recommended models
- Test with `ollama run <model> --verbose`
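For instance, on an Entry-tier Linux/NVIDIA box the sequence might look like this (model name and values are examples; persist the env vars via the systemd override shown earlier):

```bash
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
sudo systemctl restart ollama              # macOS: quit and reopen the Ollama app
ollama pull llama3.1:8b-instruct-q4_K_M
ollama run llama3.1:8b-instruct-q4_K_M "test" --verbose
```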
#### 5. Verification Commands

```bash
# Benchmark current performance
python3 scripts/benchmark_ollama.py --model <model>

# Check GPU memory usage (NVIDIA)
nvidia-smi

# Verify config is applied
ollama run <model> "test" --verbose 2>&1 | head -20
```
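The `--verbose` stats include throughput lines worth isolating; the line labels below match current Ollama builds but may vary between versions:

```bash
# Pull out load time and token throughput from the verbose stats.
ollama run <model> "test" --verbose 2>&1 | grep -E "load duration|eval rate"
```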
## Reference Files
- VRAM Requirements - Model sizing and quantization guide
- Environment Variables - Complete env var reference
- Platform-Specific Setup - OS-specific installation and configuration
## Output Format

Generate an `ollama-optimization-guide.md` file in the current directory with:
```markdown
# Ollama Optimization Guide

**Generated:** <timestamp>
**System:** <OS> | <CPU> | <RAM>GB RAM | <GPU>

## System Overview
<hardware summary and constraints>

## Current Configuration
<existing Ollama setup and env vars>

## Recommendations

### Environment Variables
<shell commands to set vars>

### Model Selection
<recommended models with rationale>

### Performance Tuning
<Modelfile adjustments if needed>

## Execution Checklist
- [ ] <step 1>
- [ ] <step 2>
...

## Verification
<benchmark commands and expected results>

## Rollback
<commands to revert changes if needed>
```
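A small sketch for stamping the header fields with live values; the substitutions are illustrative only:

```bash
# Illustrative only: write the guide header with live values.
{
  echo "# Ollama Optimization Guide"
  echo
  echo "**Generated:** $(date -u '+%Y-%m-%d %H:%M UTC')"
  echo "**System:** $(uname -s) | $(uname -m) | ..."
} > ollama-optimization-guide.md
```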
## Quick Optimization Commands

For users who want immediate results without the full analysis:

**macOS (Apple Silicon):**

```bash
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama pull llama3.2:3b   # Safe for 8GB, fast
```
**Linux/Windows with an 8GB NVIDIA GPU:**

```bash
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama pull llama3.1:8b-instruct-q4_K_M
```
**CPU-only systems:**

```bash
export CUDA_VISIBLE_DEVICES=-1   # Force CPU inference even if a GPU is present
ollama pull llama3.2:3b
# Create a Modelfile with: PARAMETER num_thread 4 (match your physical core count)
```
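The generated guide also promises a Rollback section; a sketch of what those commands might look like (the tuned-model name is hypothetical):

```bash
# Revert the quick-optimization changes.
unset OLLAMA_FLASH_ATTENTION OLLAMA_KV_CACHE_TYPE CUDA_VISIBLE_DEVICES
# Linux (systemd): drop any service override and restart
sudo systemctl revert ollama.service && sudo systemctl restart ollama
# Remove any tuned variant you created
ollama rm llama3.1-tuned
```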