# Local LLM Provider
Connect to local LLM endpoints (Ollama, llama.cpp, vLLM) with automatic fallback to cloud providers. This skill enables the agent to leverage local GPU/CPU inference while maintaining reliability through intelligent fallback.
## When to Use
- Running LLM inference locally for privacy (data never leaves your machine)
- Using models not available via cloud APIs (e.g., fine-tuned models, Llama variants)
- Reducing API costs for high-volume tasks
- Working offline or with intermittent connectivity
- Needing low-latency responses for interactive tasks
## Setup
No additional setup required if Ollama is already running. Otherwise:
### Ollama Setup

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3.2

# Start the server (default: http://localhost:11434)
ollama serve
```
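Before pointing the skill at Ollama, it can help to confirm the server is reachable. A minimal check against Ollama's `/api/tags` endpoint, which lists the pulled models (requires Node 18+ for the global `fetch`):

```js
// Reachability check: GET /api/tags lists the locally pulled models.
const base = process.env.OLLAMA_BASE_URL || 'http://localhost:11434';

fetch(`${base}/api/tags`)
  .then((res) => res.json())
  .then((data) => {
    const names = (data.models || []).map((m) => m.name);
    console.log(`Ollama is up with ${names.length} model(s):`, names.join(', '));
  })
  .catch((err) => console.error('Ollama is not reachable:', err.message));
```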
### llama.cpp Server Setup

```bash
# Build llama-server
make llama-server

# Start the server
llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 133000 --host 127.0.0.1 --port 8080
```
### vLLM Server Setup

```bash
# Install vLLM
pip install vllm

# Start the server
vllm serve meta-llama/Llama-3.1-8B-Instruct
```
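The llama.cpp server and vLLM both expose OpenAI-compatible endpoints, so one reachability check works for either; adjust the base URL to whichever server you started (again assuming Node 18+ for the global `fetch`):

```js
// List models from an OpenAI-compatible endpoint (llama.cpp or vLLM).
const base = process.env.VLLM_BASE_URL || 'http://localhost:8000/v1';

fetch(`${base}/models`)
  .then((res) => res.json())
  .then((data) => {
    const ids = (data.data || []).map((m) => m.id);
    console.log('Server is up; models:', ids.join(', '));
  })
  .catch((err) => console.error('Server is not reachable:', err.message));
```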
## Usage
### Query a local model

```bash
node /job/.pi/skills/local-llm-provider/query.js "What is 2+2?" --model llama3.2
```

### Query with custom parameters

```bash
node /job/.pi/skills/local-llm-provider/query.js "Explain quantum computing" --model mixtral --temp 0.8 --max-tokens 500
```

### List available models

```bash
node /job/.pi/skills/local-llm-provider/list-models.js
```

### Check server health

```bash
node /job/.pi/skills/local-llm-provider/health.js
```

### Stream responses

```bash
node /job/.pi/skills/local-llm-provider/query.js "Tell me a story" --stream
```
## Configuration
Create a `config.json` in the skill directory for persistent settings:

```json
{
  "providers": [
    {
      "name": "ollama",
      "url": "http://localhost:11434",
      "enabled": true,
      "fallback_order": 1
    },
    {
      "name": "llamacpp",
      "url": "http://localhost:8080/v1",
      "enabled": false,
      "fallback_order": 2
    },
    {
      "name": "vllm",
      "url": "http://localhost:8000/v1",
      "enabled": false,
      "fallback_order": 3
    }
  ],
  "default_model": "llama3.2",
  "fallback_to_cloud": true,
  "cloud_provider": "anthropic",
  "timeout_ms": 120000
}
```
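As a rough sketch of how `enabled` and `fallback_order` are interpreted (this mirrors the documented semantics, not the skill's actual loader):

```js
const fs = require('fs');
const path = require('path');

// Load config.json from the skill directory, tolerating a missing file.
function loadProviders(dir) {
  const file = path.join(dir, 'config.json');
  const config = fs.existsSync(file)
    ? JSON.parse(fs.readFileSync(file, 'utf8'))
    : { providers: [] };

  // Only enabled providers participate, tried in ascending fallback_order.
  return (config.providers || [])
    .filter((p) => p.enabled)
    .sort((a, b) => a.fallback_order - b.fallback_order);
}

console.log(loadProviders(__dirname).map((p) => p.name));
// With the config above: [ 'ollama' ]
```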
## Provider Fallback
The skill implements intelligent fallback (a minimal sketch follows this list):

1. **Primary**: Try local Ollama first
2. **Secondary**: Try the llama.cpp server
3. **Tertiary**: Try the vLLM server
4. **Fallback**: Use the cloud provider (if enabled)

Each provider failure triggers an automatic retry with the next available provider.
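The loop below illustrates that behavior. It is not the skill's actual source; `tryProvider` is a hypothetical caller-supplied function that makes one request against a single backend.

```js
// Illustrative fallback loop: try each enabled provider in order and
// return the first success. `tryProvider(provider, prompt)` is a
// hypothetical helper, not part of the skill's real API.
async function completeWithFallback(providers, prompt, tryProvider) {
  const tried = [];
  let lastError = null;

  for (const provider of providers) {
    tried.push(provider.name);
    try {
      // One attempt against this backend; a throw moves on to the next.
      return await tryProvider(provider, prompt);
    } catch (err) {
      lastError = err;
    }
  }

  // Shape matches the "All providers failed" output documented below.
  return {
    success: false,
    error: 'All providers failed',
    providers_tried: tried,
    last_error: lastError ? lastError.message : null
  };
}
```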
## Supported Models
### Ollama
- llama3.2, llama3.1, llama3
- mistral, mixtral
- qwen2.5, qwen2
- phi3, phi4
- gemma2, gemma
- codellama
- and many more
### llama.cpp
- Any GGUF format model
- Mistral variants
- Llama variants
- Qwen variants
### vLLM

- Llama 3.1, 3.0
- Mistral
- Qwen
- Most Hugging Face models with vLLM-supported architectures
## API Integration
### As a library

```js
const { LocalLLMProvider } = require('./provider.js');

const provider = new LocalLLMProvider({
  providers: [
    { name: 'ollama', url: 'http://localhost:11434', enabled: true },
    { name: 'anthropic', api_key: process.env.ANTHROPIC_API_KEY, enabled: false }
  ],
  default_model: 'llama3.2',
  fallback_to_cloud: true
});

// Top-level `await` is unavailable in CommonJS, so wrap the call.
(async () => {
  const response = await provider.complete('Hello, how are you?');
  console.log(response);
})();
```
## Output Format
The query returns JSON:
```json
{
  "success": true,
  "provider": "ollama",
  "model": "llama3.2",
  "response": "I'm doing well, thank you for asking!",
  "tokens": 42,
  "duration_ms": 1500,
  "done": true
}
```
When streaming, each chunk arrives as a separate JSON object with `done: false` until the final chunk:
```json
{
  "success": true,
  "provider": "ollama",
  "model": "llama3.2",
  "response": "I",
  "tokens": 1,
  "done": false
}
```
On fallback failure:
```json
{
  "success": false,
  "error": "All providers failed",
  "providers_tried": ["ollama", "llamacpp"],
  "last_error": "Connection refused"
}
```
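Because the non-streaming CLI prints a single JSON object, it is easy to drive from another script. A hedged example using Node's `child_process` (the path matches the Usage section above):

```js
const { execFile } = require('child_process');

// Run the query CLI and parse its documented JSON output (non-streaming mode).
execFile(
  'node',
  ['/job/.pi/skills/local-llm-provider/query.js', 'What is 2+2?', '--model', 'llama3.2'],
  (err, stdout) => {
    if (err) throw err;
    const result = JSON.parse(stdout);
    if (result.success) {
      console.log(`[${result.provider}/${result.model}]`, result.response);
    } else {
      console.error('Failed:', result.error, 'tried:', result.providers_tried);
    }
  }
);
```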
## Environment Variables
| Variable | Description | Default |
|---|---|---|
| `OLLAMA_BASE_URL` | Ollama server URL | `http://localhost:11434` |
| `LLAMACPP_BASE_URL` | llama.cpp server URL | `http://localhost:8080/v1` |
| `VLLM_BASE_URL` | vLLM server URL | `http://localhost:8000/v1` |
| `LOCAL_LLM_DEFAULT_MODEL` | Default model to use | `llama3.2` |
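A minimal sketch of the assumed precedence, with each environment variable simply overriding the corresponding built-in default (how these interact with `config.json` is an assumption here):

```js
// Assumed precedence: environment variable first, then the built-in default.
const OLLAMA_BASE_URL = process.env.OLLAMA_BASE_URL || 'http://localhost:11434';
const LLAMACPP_BASE_URL = process.env.LLAMACPP_BASE_URL || 'http://localhost:8080/v1';
const VLLM_BASE_URL = process.env.VLLM_BASE_URL || 'http://localhost:8000/v1';
const DEFAULT_MODEL = process.env.LOCAL_LLM_DEFAULT_MODEL || 'llama3.2';
```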
## Limitations
- Requires local server to be running
- Model quality depends on local hardware
- Not all models support all features (e.g., function calling)
- Provider APIs differ: Ollama uses its own native API, while llama.cpp and vLLM expose OpenAI-compatible endpoints
## Tips
- For best performance: Use Ollama with GPU acceleration
- For variety: Pull multiple models (`ollama pull mixtral`)
- For privacy: Always use local providers first
- For reliability: Keep cloud fallback enabled for critical tasks
- For speed: Use smaller models (7B) for simple tasks, larger ones (70B) for complex reasoning