# Local LLM Provider
Connect to local LLM endpoints (Ollama, llama.cpp, vLLM) with automatic fallback to cloud providers. This skill enables the agent to leverage local GPU/CPU inference while maintaining reliability through intelligent fallback.
## When to Use
- Running LLM inference locally for privacy (data never leaves your machine)
- Using models not available via cloud APIs (e.g., fine-tuned models, Llama variants)
- Reducing API costs for high-volume tasks
- Working offline or with intermittent connectivity
- Needing low-latency responses for interactive tasks
## Setup
No additional setup required if Ollama is already running. Otherwise:
### Ollama Setup

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3.2

# Start the server (default: http://localhost:11434)
ollama serve
```
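Before pointing the skill at Ollama, it can help to confirm the server is reachable. A minimal check against Ollama's `/api/tags` endpoint, which lists the pulled models (requires Node 18+ for the global `fetch`):

```js
// Reachability check: GET /api/tags lists the locally pulled models.
const base = process.env.OLLAMA_BASE_URL || 'http://localhost:11434';

fetch(`${base}/api/tags`)
  .then((res) => res.json())
  .then((data) => {
    const names = (data.models || []).map((m) => m.name);
    console.log(`Ollama is up with ${names.length} model(s):`, names.join(', '));
  })
  .catch((err) => console.error('Ollama is not reachable:', err.message));
```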
### llama.cpp Server Setup

```bash
# Build llama-server
make llama-server

# Start the server
llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 133000 --host 127.0.0.1 --port 8080
```
### vLLM Server Setup

```bash
# Install vLLM
pip install vllm

# Start the server
vllm serve meta-llama/Llama-3.1-8B-Instruct
```
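The llama.cpp server and vLLM both expose OpenAI-compatible endpoints, so one reachability check works for either; adjust the base URL to whichever server you started (again assuming Node 18+ for the global `fetch`):

```js
// List models from an OpenAI-compatible endpoint (llama.cpp or vLLM).
const base = process.env.VLLM_BASE_URL || 'http://localhost:8000/v1';

fetch(`${base}/models`)
  .then((res) => res.json())
  .then((data) => {
    const ids = (data.data || []).map((m) => m.id);
    console.log('Server is up; models:', ids.join(', '));
  })
  .catch((err) => console.error('Server is not reachable:', err.message));
```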
## Usage
### Query a local model

```bash
node /job/.pi/skills/local-llm-provider/query.js "What is 2+2?" --model llama3.2
```

### Query with custom parameters

```bash
node /job/.pi/skills/local-llm-provider/query.js "Explain quantum computing" --model mixtral --temp 0.8 --max-tokens 500
```

### List available models

```bash
node /job/.pi/skills/local-llm-provider/list-models.js
```

### Check server health

```bash
node /job/.pi/skills/local-llm-provider/health.js
```

### Stream responses

```bash
node /job/.pi/skills/local-llm-provider/query.js "Tell me a story" --stream
```
## Configuration
Create a `config.json` in the skill directory for persistent settings:

```json
{
  "providers": [
    {
      "name": "ollama",
      "url": "http://localhost:11434",
      "enabled": true,
      "fallback_order": 1
    },
    {
      "name": "llamacpp",
      "url": "http://localhost:8080/v1",
      "enabled": false,
      "fallback_order": 2
    },
    {
      "name": "vllm",
      "url": "http://localhost:8000/v1",
      "enabled": false,
      "fallback_order": 3
    }
  ],
  "default_model": "llama3.2",
  "fallback_to_cloud": true,
  "cloud_provider": "anthropic",
  "timeout_ms": 120000
}
```
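As a rough sketch of how `enabled` and `fallback_order` are interpreted (this mirrors the documented semantics, not the skill's actual loader):

```js
const fs = require('fs');
const path = require('path');

// Load config.json from the skill directory, tolerating a missing file.
function loadProviders(dir) {
  const file = path.join(dir, 'config.json');
  const config = fs.existsSync(file)
    ? JSON.parse(fs.readFileSync(file, 'utf8'))
    : { providers: [] };

  // Only enabled providers participate, tried in ascending fallback_order.
  return (config.providers || [])
    .filter((p) => p.enabled)
    .sort((a, b) => a.fallback_order - b.fallback_order);
}

console.log(loadProviders(__dirname).map((p) => p.name));
// With the config above: [ 'ollama' ]
```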
## Provider Fallback
The skill implements intelligent fallback (a minimal sketch follows this list):

1. **Primary**: Try local Ollama first
2. **Secondary**: Try the llama.cpp server
3. **Tertiary**: Try the vLLM server
4. **Fallback**: Use the cloud provider (if enabled)

Each provider failure triggers an automatic retry with the next available provider.
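The loop below illustrates that behavior. It is not the skill's actual source; `tryProvider` is a hypothetical caller-supplied function that makes one request against a single backend.

```js
// Illustrative fallback loop: try each enabled provider in order and
// return the first success. `tryProvider(provider, prompt)` is a
// hypothetical helper, not part of the skill's real API.
async function completeWithFallback(providers, prompt, tryProvider) {
  const tried = [];
  let lastError = null;

  for (const provider of providers) {
    tried.push(provider.name);
    try {
      // One attempt against this backend; a throw moves on to the next.
      return await tryProvider(provider, prompt);
    } catch (err) {
      lastError = err;
    }
  }

  // Shape matches the "All providers failed" output documented below.
  return {
    success: false,
    error: 'All providers failed',
    providers_tried: tried,
    last_error: lastError ? lastError.message : null
  };
}
```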
## Supported Models
### Ollama
- llama3.2, llama3.1, llama3
- mistral, mixtral
- qwen2.5, qwen2
- phi3, phi4
- gemma2, gemma
- codellama
- and many more
### llama.cpp
- Any GGUF format model
- Mistral variants
- Llama variants
- Qwen variants
### vLLM

- Llama 3.1, 3.0
- Mistral
- Qwen
- Most Hugging Face models with vLLM-supported architectures
## API Integration
### As a library

```js
const { LocalLLMProvider } = require('./provider.js');

const provider = new LocalLLMProvider({
  providers: [
    { name: 'ollama', url: 'http://localhost:11434', enabled: true },
    { name: 'anthropic', api_key: process.env.ANTHROPIC_API_KEY, enabled: false }
  ],
  default_model: 'llama3.2',
  fallback_to_cloud: true
});

// Top-level `await` is unavailable in CommonJS, so wrap the call.
(async () => {
  const response = await provider.complete('Hello, how are you?');
  console.log(response);
})();
```
## Output Format
The query returns JSON:
```json
{
  "success": true,
  "provider": "ollama",
  "model": "llama3.2",
  "response": "I'm doing well, thank you for asking!",
  "tokens": 42,
  "duration_ms": 1500,
  "done": true
}
```
When streaming, each chunk arrives as a separate JSON object with `done: false` until the final chunk:
```json
{
  "success": true,
  "provider": "ollama",
  "model": "llama3.2",
  "response": "I",
  "tokens": 1,
  "done": false
}
```
On fallback failure:
```json
{
  "success": false,
  "error": "All providers failed",
  "providers_tried": ["ollama", "llamacpp"],
  "last_error": "Connection refused"
}
```
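Because the non-streaming CLI prints a single JSON object, it is easy to drive from another script. A hedged example using Node's `child_process` (the path matches the Usage section above):

```js
const { execFile } = require('child_process');

// Run the query CLI and parse its documented JSON output (non-streaming mode).
execFile(
  'node',
  ['/job/.pi/skills/local-llm-provider/query.js', 'What is 2+2?', '--model', 'llama3.2'],
  (err, stdout) => {
    if (err) throw err;
    const result = JSON.parse(stdout);
    if (result.success) {
      console.log(`[${result.provider}/${result.model}]`, result.response);
    } else {
      console.error('Failed:', result.error, 'tried:', result.providers_tried);
    }
  }
);
```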
## Environment Variables
| Variable | Description | Default |
|---|---|---|
| `OLLAMA_BASE_URL` | Ollama server URL | `http://localhost:11434` |
| `LLAMACPP_BASE_URL` | llama.cpp server URL | `http://localhost:8080/v1` |
| `VLLM_BASE_URL` | vLLM server URL | `http://localhost:8000/v1` |
| `LOCAL_LLM_DEFAULT_MODEL` | Default model to use | `llama3.2` |
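A minimal sketch of the assumed precedence, with each environment variable simply overriding the corresponding built-in default (how these interact with `config.json` is an assumption here):

```js
// Assumed precedence: environment variable first, then the built-in default.
const OLLAMA_BASE_URL = process.env.OLLAMA_BASE_URL || 'http://localhost:11434';
const LLAMACPP_BASE_URL = process.env.LLAMACPP_BASE_URL || 'http://localhost:8080/v1';
const VLLM_BASE_URL = process.env.VLLM_BASE_URL || 'http://localhost:8000/v1';
const DEFAULT_MODEL = process.env.LOCAL_LLM_DEFAULT_MODEL || 'llama3.2';
```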
## Limitations
- Requires local server to be running
- Model quality depends on local hardware
- Not all models support all features (e.g., function calling)
- Provider APIs differ: Ollama uses its own native API, while llama.cpp and vLLM expose OpenAI-compatible endpoints
## Tips
- For best performance: Use Ollama with GPU acceleration
- For variety: Pull multiple models (`ollama pull mixtral`)
- For privacy: Always use local providers first
- For reliability: Keep cloud fallback enabled for critical tasks
- For speed: Use smaller models (7B) for simple tasks, larger ones (70B) for complex reasoning