Mac mini LLM Lab

Turn a Mac mini into a low-noise, always-on local AI appliance.

When to Use This Skill

Use this skill when:

  • Setting up a dedicated local LLM inference server
  • Building a private AI development environment
  • Needing always-on model serving without cloud costs
  • Running models that benefit from Apple Silicon unified memory (current Mac mini configurations offer 16-64GB)
  • Creating a home lab AI server for a small team

Prerequisites

  • Mac mini with Apple Silicon (M2/M3/M4, 16GB+ unified memory recommended)
  • macOS Sonoma 14+ or Sequoia 15+
  • Ethernet connection (recommended over Wi-Fi)
  • UPS for power protection (optional but recommended)

Initial System Setup

# Update macOS
sudo softwareupdate --install --all

# Install Xcode command-line tools
xcode-select --install

# Install Homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Core packages
brew install tmux htop btop wget jq git neovim

# Python environment (for MLX and custom scripts)
brew install python@3.12 uv

# Monitoring
brew install prometheus node_exporter
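The monitoring packages are just binaries until something supervises them. A minimal sketch using Homebrew services, assuming the default ports (9100 for node_exporter, 9090 for Prometheus):

# Optional: run the exporters as background services now
brew services start node_exporter
brew services start prometheus

# node_exporter serves host metrics on its default port
curl -s http://localhost:9100/metrics | head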

Ollama Setup

# Install Ollama
brew install ollama

# Pull models based on your RAM
# 16GB Mac mini:
ollama pull llama3.1:8b
ollama pull nomic-embed-text
ollama pull codellama:7b

# 32GB Mac mini:
ollama pull llama3.1:8b
ollama pull qwen2.5:14b
ollama pull deepseek-coder-v2:16b
ollama pull nomic-embed-text

# 64GB+ Mac mini:
ollama pull llama3.1:70b
ollama pull qwen2.5:32b
ollama pull codellama:34b

# Verify Metal (GPU) acceleration
ollama run llama3.1:8b --verbose
# --verbose prints prompt/eval throughput; with a model loaded,
# `ollama ps` should report the model running on GPU, and the server log mentions Metal
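Ollama also listens on a local HTTP API (port 11434), which makes a scripted smoke test easy once a model is pulled. A minimal sketch with an illustrative prompt:

# Non-streaming generation request against the local Ollama API
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Say hello in one sentence.", "stream": false}' \
  | jq -r '.response'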

MLX Framework (Apple Silicon Native)

MLX runs models natively on Apple Silicon with excellent performance:

# Install MLX
# (uv pip installs into the active virtual environment; create one first: uv venv && source .venv/bin/activate)
uv pip install mlx mlx-lm

# Run a model
python3 -c "
from mlx_lm import load, generate
model, tokenizer = load('mlx-community/Llama-3.1-8B-Instruct-4bit')
response = generate(model, tokenizer, prompt='Explain Docker in 3 sentences', max_tokens=200)
print(response)
"

# MLX server (OpenAI-compatible API)
uv pip install "mlx-lm[server]"   # quotes keep zsh from globbing the brackets
mlx_lm.server --model mlx-community/Llama-3.1-8B-Instruct-4bit --port 8080
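With the server up, any OpenAI-style client can point at it. A quick curl sketch against the chat completions endpoint, assuming the model and port used above:

# Chat completion request against the local MLX server (OpenAI-compatible)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mlx-community/Llama-3.1-8B-Instruct-4bit",
        "messages": [{"role": "user", "content": "Explain unified memory in two sentences."}],
        "max_tokens": 200
      }' | jq -r '.choices[0].message.content'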

Auto-Start with launchd

Save the following as ~/Library/LaunchAgents/com.ollama.serve.plist:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.ollama.serve</string>
    <key>ProgramArguments</key>
    <array>
        <string>/opt/homebrew/bin/ollama</string>
        <string>serve</string>
    </array>
    <key>EnvironmentVariables</key>
    <dict>
        <key>OLLAMA_HOST</key>
        <string>0.0.0.0</string>
        <key>OLLAMA_NUM_PARALLEL</key>
        <string>4</string>
        <key>OLLAMA_MAX_LOADED_MODELS</key>
        <string>2</string>
        <key>OLLAMA_FLASH_ATTENTION</key>
        <string>1</string>
    </dict>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardOutPath</key>
    <string>/tmp/ollama.log</string>
    <key>StandardErrorPath</key>
    <string>/tmp/ollama.err</string>
</dict>
</plist>

# Load the service
launchctl load ~/Library/LaunchAgents/com.ollama.serve.plist

# Check status
launchctl list | grep ollama

# Unload if needed
launchctl unload ~/Library/LaunchAgents/com.ollama.serve.plist
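To confirm the agent actually brought Ollama up, hit the API and watch the log paths defined in the plist:

# Confirm the service answers and list the models it knows about
curl -s http://localhost:11434/api/version
curl -s http://localhost:11434/api/tags | jq '.models[].name'

# Follow the logs defined in the plist
tail -f /tmp/ollama.log /tmp/ollama.err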

Power & Reliability

# Prevent sleep so the server stays available around the clock
sudo pmset -a disablesleep 1
sudo pmset -a sleep 0

# Auto-restart after power failure
sudo pmset -a autorestart 1

# Schedule a weekly reboot (shutdown Sunday 03:55, power back on at 04:00)
sudo pmset repeat shutdown U 03:55:00 poweron U 04:00:00

# Check power settings
pmset -g
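pmset can also print the schedule, which is a quick way to confirm the repeating events were accepted:

# Show scheduled power events (the repeating reboot should appear here)
pmset -g sched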

Remote Access

Tailscale (Recommended)

# Install Tailscale for easy secure remote access
brew install --cask tailscale

# Enable from the menu bar and authenticate
# Access the Mac mini from any device on your tailnet, e.g. http://<tailscale-hostname>:11434
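The macOS app ships a CLI inside the app bundle (path per Tailscale's documentation). A short sketch for checking the tailnet connection and finding this machine's address:

# Tailscale CLI bundled with the macOS app
alias tailscale="/Applications/Tailscale.app/Contents/MacOS/Tailscale"

tailscale status      # list devices on your tailnet
tailscale ip -4       # this machine's Tailscale IPv4 address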

SSH Hardening

# Enable remote login
sudo systemsetup -setremotelogin on

# Edit SSH config
sudo nano /etc/ssh/sshd_config
# Add:
# PasswordAuthentication no
# PubkeyAuthentication yes
# PermitRootLogin no
# AllowUsers yourusername

# Restart SSH
sudo launchctl unload /System/Library/LaunchDaemons/ssh.plist
sudo launchctl load /System/Library/LaunchDaemons/ssh.plist
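Because password logins are disabled above, put a public key in place before restarting sshd. From the client machine (hypothetical user and host names):

# On the client: create a key pair if you do not have one
ssh-keygen -t ed25519 -C "laptop -> mac-mini"

# Copy the public key to the Mac mini (replace user/host with your own)
ssh-copy-id yourusername@mac-mini.local

# Test key-based login
ssh yourusername@mac-mini.local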

Reverse Proxy with Caddy

brew install caddy

# Caddyfile
cat > /opt/homebrew/etc/Caddyfile << 'EOF'
llm.local:443 {
    tls internal
    reverse_proxy localhost:11434

    @api path /v1/*
    handle @api {
        reverse_proxy localhost:11434
    }
}

webui.local:443 {
    tls internal
    reverse_proxy localhost:3000
}
EOF

brew services start caddy
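The .local names above still need to resolve to the Mac mini; one simple option on a small LAN is a hosts-file entry on each client, then a test through the proxy (example IP, adjust to your network; -k accepts Caddy's internal CA):

# On each client: map the names to the Mac mini's address (example IP)
echo "192.168.1.50 llm.local webui.local" | sudo tee -a /etc/hosts

# Test through the proxy
curl -k https://llm.local/api/tags | jq '.models[].name'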

Open WebUI Setup

# Install Docker Desktop first if it is not already available
brew install --cask docker

# Run Open WebUI via Docker
docker run -d \
  --name open-webui \
  -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -e WEBUI_AUTH=true \
  -v open-webui:/app/backend/data \
  --restart unless-stopped \
  ghcr.io/open-webui/open-webui:main
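Once the container is up, the UI should answer on port 3000. A couple of quick checks:

# Confirm the container is running and the UI responds
docker ps --filter name=open-webui
curl -sI http://localhost:3000 | head -1

# Follow the container logs if something looks off
docker logs -f open-webui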

Monitoring

# Health check script
mkdir -p ~/scripts ~/logs
cat > ~/scripts/llm-health.sh << 'SCRIPT'
#!/bin/bash
# Check Ollama
if curl -sf http://localhost:11434/api/tags > /dev/null; then
    echo "$(date): Ollama OK"
    curl -s http://localhost:11434/api/ps | python3 -m json.tool
else
    echo "$(date): Ollama DOWN"
    # Restart
    launchctl kickstart -k gui/$(id -u)/com.ollama.serve
fi

# System stats
echo "CPU: $(top -l 1 -n 0 | grep 'CPU usage')"
echo "Memory: $(vm_stat | head -5)"
echo "Disk: $(df -h / | tail -1)"
echo "Thermal: $(sudo powermetrics --samplers smc -n 1 2>/dev/null | grep 'CPU die' || echo 'N/A')"
SCRIPT
chmod +x ~/scripts/llm-health.sh

# Schedule health check every 5 minutes
# Add to crontab: crontab -e
# */5 * * * * ~/scripts/llm-health.sh >> ~/logs/llm-health.log 2>&1
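To install that schedule without opening an editor, append the entry to the current crontab (minimal sketch; assumes no conflicting entry exists):

# Append the health check to the current user's crontab
( crontab -l 2>/dev/null; echo '*/5 * * * * ~/scripts/llm-health.sh >> ~/logs/llm-health.log 2>&1' ) | crontab -

# Confirm it took
crontab -l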

Memory Usage by Model

| Model | RAM required | Tokens/sec (M2) | Tokens/sec (M4) |
|---|---|---|---|
| llama3.1:8b (Q4) | ~5 GB | ~25 t/s | ~45 t/s |
| qwen2.5:14b (Q4) | ~9 GB | ~15 t/s | ~30 t/s |
| llama3.1:70b (Q4) | ~40 GB | ~5 t/s | ~10 t/s |
| nomic-embed-text | ~300 MB | N/A | N/A |
| codellama:13b | ~8 GB | ~18 t/s | ~35 t/s |

Security Checklist

# Enable FileVault disk encryption
sudo fdesetup enable

# Enable firewall
sudo /usr/libexec/ApplicationFirewall/socketfilterfw --setglobalstate on
sudo /usr/libexec/ApplicationFirewall/socketfilterfw --setstealthmode on

# Disable unnecessary sharing services
sudo launchctl disable system/com.apple.screensharing
sudo launchctl disable system/com.apple.AirPlayXPCHelper

# Set strong admin password
# System Settings > Users & Groups

# Restrict Ollama to local network only (if not using Tailscale)
# Set OLLAMA_HOST=127.0.0.1 in launchd plist
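To flip the bind address without hand-editing the XML, PlistBuddy can rewrite the value in place (key path matches the plist above); reload the agent afterwards:

# Bind Ollama to loopback only, then restart the launchd agent
/usr/libexec/PlistBuddy -c "Set :EnvironmentVariables:OLLAMA_HOST 127.0.0.1" \
  ~/Library/LaunchAgents/com.ollama.serve.plist
launchctl unload ~/Library/LaunchAgents/com.ollama.serve.plist
launchctl load ~/Library/LaunchAgents/com.ollama.serve.plist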

Performance Tuning

# Increase file descriptor limits for concurrent requests
sudo launchctl limit maxfiles 65536 200000

# Check unified memory pressure
memory_pressure

# Monitor GPU usage (Metal)
sudo powermetrics --samplers gpu_power -n 1

# Optimize for inference: keep Spotlight from indexing model files by adding ~/.ollama
# in System Settings > Spotlight privacy settings
# (mdutil -i off only applies to whole volumes, e.g. a dedicated external drive holding models)
# mdutil -i off /Volumes/<model-volume>

Troubleshooting

| Issue | Solution |
|---|---|
| Model loading slow | First load caches to memory; subsequent loads are fast |
| Out of memory | Use a smaller quantization (Q4_K_M); reduce OLLAMA_MAX_LOADED_MODELS |
| Mac sleeping | Run `sudo pmset -a disablesleep 1` |
| Ollama not starting | Check `launchctl list \| grep ollama` and the logs in /tmp/ollama.err |
| Slow over Wi-Fi | Use Ethernet; Wi-Fi adds latency to streaming responses |
| Thermal throttling | Ensure adequate ventilation; check powermetrics |
