LLM Integration

Patterns for integrating LLMs into production applications: tool use, streaming, local inference, and fine-tuning. Each category has its own rule files in rules/, loaded on demand.

Quick Reference

Category	Rules	Impact	When to Use
Function Calling	3	CRITICAL	Tool definitions, parallel execution, input validation
Streaming	3	HIGH	SSE endpoints, structured streaming, backpressure handling
Local Inference	3	HIGH	Ollama setup, model selection, GPU optimization
Fine-Tuning	3	HIGH	LoRA/QLoRA training, dataset preparation, evaluation
Context Optimization	2	HIGH	Window management, compression, caching, budget scaling
Evaluation	2	HIGH	LLM-as-judge, RAGAS metrics, quality gates, benchmarks
Prompt Engineering	4	HIGH	CoT, few-shot, versioning, DSPy optimization, ReAct, cost optimization

Total: 20 rules across 7 categories

Quick Start

# Function calling: strict mode tool definition
tools = [{
    "type": "function",
    "function": {
        "name": "search_documents",
        "description": "Search knowledge base",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "limit": {"type": "integer", "description": "Max results"}
            },
            "required": ["query", "limit"],
            "additionalProperties": False
        }
    }
}]

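# Streaming: SSE endpoint with FastAPI
from fastapi import FastAPI
from sse_starlette.sse import EventSourceResponse

app = FastAPI()

@app.get("/chat/stream")
async def stream_chat(prompt: str):
    async def generate():
        # async_stream is the application's own async token generator (not defined here)
        async for token in async_stream(prompt):
            yield {"event": "token", "data": token}
        yield {"event": "done", "data": ""}  # signal completion to the client
    return EventSourceResponse(generate())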

# Local inference: Ollama with LangChain
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="deepseek-r1:70b",
    base_url="http://localhost:11434",
    temperature=0.0,
    num_ctx=32768,
)

# Fine-tuning: QLoRA with Unsloth
from unsloth import FastLanguageModel

# Load the base model in 4-bit, then attach LoRA adapters
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",
    max_seq_length=2048, load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(model, r=16, lora_alpha=32)

Function Calling

Enable LLMs to use external tools and return structured data. Use strict mode schemas (2026 best practice) for reliability. Limit to 5-15 tools per request, validate all inputs with Pydantic/Zod, and return errors as tool results.

calling-tool-definition.md -- Strict mode schemas, OpenAI/Anthropic formats, LangChain binding
calling-parallel.md -- Parallel tool execution, asyncio.gather, strict mode constraints
calling-validation.md -- Input validation, error handling, tool execution loops
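
A minimal sketch of the parallel execution and error-handling guidance above, using only the standard library. The tool registry, the two example tools, and the result shape (`tool_call_id`, `content`, `is_error`) are assumptions for illustration, not a specific provider's format.

```python
import asyncio
import json
from typing import Any, Awaitable, Callable

# Hypothetical tools; a real application would call a search index or an API.
async def search_documents(query: str, limit: int) -> list:
    await asyncio.sleep(0)  # stand-in for real I/O
    return [f"result {i} for {query!r}" for i in range(limit)]

async def get_weather(city: str) -> str:
    raise RuntimeError(f"no weather data for {city}")

TOOLS: dict[str, Callable[..., Awaitable[Any]]] = {
    "search_documents": search_documents,
    "get_weather": get_weather,
}

async def run_tool(call: dict) -> dict:
    """Run one tool call; any failure becomes a tool result, not an exception."""
    try:
        tool = TOOLS[call["name"]]
        args = json.loads(call["arguments"])
        output = await tool(**args)
        return {"tool_call_id": call["id"], "content": json.dumps(output)}
    except Exception as exc:
        # Returning the error lets the model see it and retry or recover.
        return {
            "tool_call_id": call["id"],
            "content": json.dumps({"error": str(exc)}),
            "is_error": True,
        }

async def run_tools_parallel(calls: list) -> list:
    """Execute independent tool calls concurrently, preserving call order."""
    return list(await asyncio.gather(*(run_tool(c) for c in calls)))
```

Because `run_tool` never raises, one failing tool does not cancel the others, and every call id gets a result the model can read on the next turn.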

Streaming

Deliver LLM responses in real time for better UX. Use SSE for one-way delivery to web clients and WebSocket for bidirectional communication. Handle backpressure with bounded queues.

streaming-sse.md -- FastAPI SSE endpoints, frontend consumers, async iterators
streaming-structured.md -- Streaming with tool calls, partial JSON parsing, chunk accumulation
streaming-backpressure.md -- Backpressure handling, bounded buffers, cancellation
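
The bounded-queue approach to backpressure can be sketched as follows. This is an illustrative standard-library example; `fake_token_stream` stands in for a real model stream, and the function names are hypothetical.

```python
import asyncio
from typing import AsyncIterator

async def fake_token_stream(n: int) -> AsyncIterator[str]:
    # Stand-in for a real LLM token stream.
    for i in range(n):
        yield f"tok{i}"

async def produce(queue: asyncio.Queue, n: int) -> None:
    async for token in fake_token_stream(n):
        # put() suspends while the queue is full, so a slow consumer
        # throttles the producer instead of memory growing without bound.
        await queue.put(token)
    await queue.put(None)  # sentinel: end of stream

async def consume(queue: asyncio.Queue) -> list:
    received = []
    while (token := await queue.get()) is not None:
        await asyncio.sleep(0)  # simulate a slow client write
        received.append(token)
    return received

async def stream_with_backpressure(n_tokens: int, max_buffer: int = 50) -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_buffer)
    producer = asyncio.create_task(produce(queue, n_tokens))
    try:
        return await consume(queue)
    finally:
        # If the consumer stops early (client disconnect), stop generating.
        producer.cancel()
        await asyncio.gather(producer, return_exceptions=True)
```

The `finally` block covers cancellation: when the consumer exits for any reason, the producer task is cancelled so no further tokens are generated for a client that is gone.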

Local Inference

Run LLMs locally with Ollama for cost savings (93% vs cloud), privacy, and offline development. Pre-warm models and use a provider factory to switch between cloud and local backends.

local-ollama-setup.md -- Installation, model pulling, environment configuration
local-model-selection.md -- Model comparison by task, hardware profiles, quantization
local-gpu-optimization.md -- Apple Silicon tuning, keep-alive, CI integration
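
One possible shape for the provider factory mentioned above, as a hedged sketch. The environment variable names (`LLM_PROVIDER`, `LLM_MODEL`, `LLM_BASE_URL`) and the `ProviderConfig` type are hypothetical; the local defaults mirror the Quick Start values.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

@dataclass(frozen=True)
class ProviderConfig:
    name: str                # "ollama" or "cloud"
    model: str
    base_url: Optional[str]  # None means "use the provider's default endpoint"

def get_provider_config(env: Optional[Mapping[str, str]] = None) -> ProviderConfig:
    """Choose a local or cloud backend from configuration (names are assumptions)."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "local")
    if provider == "local":
        return ProviderConfig(
            name="ollama",
            model=env.get("LLM_MODEL", "deepseek-r1:70b"),
            base_url=env.get("LLM_BASE_URL", "http://localhost:11434"),
        )
    model = env.get("LLM_MODEL")
    if not model:
        raise ValueError("LLM_MODEL must be set when LLM_PROVIDER is not 'local'")
    return ProviderConfig(name="cloud", model=model, base_url=env.get("LLM_BASE_URL"))
```

Application code would then build its client from the returned config, so switching between local and cloud inference becomes a configuration change rather than a code change.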

Fine-Tuning

Customize LLMs with parameter-efficient techniques. Fine-tune ONLY after exhausting prompt engineering and RAG. Requires 1000+ quality examples.

tuning-lora.md -- LoRA/QLoRA configuration, Unsloth training, adapter merging
tuning-dataset-prep.md -- Synthetic data generation, quality validation, deduplication
tuning-evaluation.md -- DPO alignment, evaluation metrics, anti-patterns
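
As an illustration of the quality validation and deduplication steps, here is a small standard-library sketch. The `prompt`/`completion` record shape and the `min_chars` cutoff are assumptions; the 1000-example minimum comes from the guidance above.

```python
import hashlib

MIN_EXAMPLES = 1000  # minimum dataset size recommended in this section

def _normalize(text: str) -> str:
    # Case-fold and collapse whitespace so trivial variants hash identically.
    return " ".join(text.lower().split())

def prepare_dataset(examples: list, min_chars: int = 10) -> list:
    """Drop low-quality rows and near-exact duplicates from training examples."""
    seen = set()
    kept = []
    for ex in examples:
        prompt = str(ex.get("prompt") or "").strip()
        completion = str(ex.get("completion") or "").strip()
        if len(prompt) < min_chars or not completion:
            continue  # too short or missing a target
        key = hashlib.sha256(
            (_normalize(prompt) + "\x00" + _normalize(completion)).encode("utf-8")
        ).hexdigest()
        if key in seen:
            continue  # duplicate after normalization
        seen.add(key)
        kept.append({"prompt": prompt, "completion": completion})
    return kept

def has_enough_examples(examples: list) -> bool:
    return len(examples) >= MIN_EXAMPLES
```

Hash-based matching only catches duplicates that are identical after normalization; paraphrased duplicates would need a similarity-based pass, which this sketch does not attempt.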

Context Optimization

Manage context windows, compression, and attention-aware positioning. Optimize for tokens-per-task.

context-window-management.md -- Five-layer architecture, anchored summarization, compression triggers
context-caching.md -- Just-in-time loading, budget scaling, probe evaluation, CC 2.1.32+
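
The compression trigger from the Key Decisions table (compress at 70% utilization, aim for 50%) reduces to simple arithmetic. The function names below are hypothetical, and the sketch assumes token counts are already known.

```python
def needs_compression(used_tokens: int, window_tokens: int, trigger: float = 0.70) -> bool:
    """True once context utilization reaches the trigger ratio."""
    return used_tokens / window_tokens >= trigger

def tokens_to_free(
    used_tokens: int,
    window_tokens: int,
    trigger: float = 0.70,
    target: float = 0.50,
) -> int:
    """How many tokens compression should remove to reach the target utilization."""
    if not needs_compression(used_tokens, window_tokens, trigger):
        return 0
    return used_tokens - int(window_tokens * target)
```

For example, with a 32,768-token window and 24,000 tokens in use (about 73% utilization), `tokens_to_free(24000, 32768)` returns 7616, which brings usage down to 16,384 tokens. Separating the trigger from the target avoids compressing again on every subsequent turn.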

Evaluation

Evaluate LLM outputs with multi-dimension scoring, quality gates, and benchmarks.

evaluation-metrics.md -- LLM-as-judge, RAGAS metrics, hallucination detection
evaluation-benchmarks.md -- Quality gates, batch evaluation, pairwise comparison
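
A quality gate over multi-dimension scores might look like the sketch below. The thresholds (0.7 production, 0.6 drafts) are taken from the Key Decisions table; the dimension names and the choice of an unweighted mean are assumptions.

```python
from statistics import mean

THRESHOLDS = {"production": 0.7, "draft": 0.6}

def passes_quality_gate(scores: dict, tier: str = "production") -> tuple:
    """Return (passed, overall) where overall is the unweighted mean score."""
    if not scores:
        raise ValueError("at least one dimension score is required")
    overall = mean(scores.values())
    return overall >= THRESHOLDS[tier], overall

# Hypothetical judge output with scores in [0, 1] per dimension.
example_scores = {"relevance": 0.8, "faithfulness": 0.7, "completeness": 0.5}
```

With `example_scores` the mean is roughly 0.67, so the output clears the draft gate but not the production gate. A weighted mean or a per-dimension minimum may suit cases where one dimension, such as faithfulness, should be able to block a release on its own.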

Prompt Engineering

Design, version, and optimize prompts for production LLM applications.

prompt-design.md -- Chain-of-Thought, few-shot learning, pattern selection guide
prompt-testing.md -- Langfuse versioning, DSPy optimization, A/B testing, self-consistency
prompt-react-pattern.md -- ReAct loop for tool-using agents, thought-action-observation format
prompt-optimization.md -- Token reduction, cost optimization, model tiering, prompt spec format
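
Self-consistency, listed under prompt-testing.md, samples several answers and keeps the most common one. The sketch below assumes a caller-supplied `sample` function (hypothetical) that wraps whichever model client is in use and returns a final answer string.

```python
from collections import Counter
from typing import Callable

def self_consistent_answer(
    sample: Callable[[str], str],
    prompt: str,
    n: int = 5,
) -> tuple:
    """Sample n answers and return (majority answer, agreement ratio)."""
    answers = [sample(prompt).strip().lower() for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n
```

The agreement ratio doubles as a rough confidence signal: a low ratio suggests the prompt is ambiguous or the task is near the model's limits. The technique only helps when sampling is non-deterministic (temperature above zero) and costs n model calls per question.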

Key Decisions

Decision	Recommendation
Tool schema mode	`strict: true` (2026 best practice)
Tool count	5-15 max per request
Streaming protocol	SSE for web, WebSocket for bidirectional
Buffer size	50-200 tokens
Local model (reasoning)	`deepseek-r1:70b`
Local model (coding)	`qwen2.5-coder:32b`
Fine-tuning approach	LoRA/QLoRA (try prompting first)
LoRA rank	16-64 typical
Training epochs	1-3 (more risks overfitting)
Context compression	Anchored iterative (60-80%)
Compress trigger	70% utilization, target 50%
Judge model	`claude-haiku-4-5-20251001` (cost tier) or `gpt-5.2`
Quality threshold	0.7 production, 0.6 drafts
Few-shot examples	3-5 diverse, representative
Prompt versioning	Langfuse with labels
Auto-optimization	DSPy MIPROv2

Related Skills

ork:rag-retrieval -- Embedding patterns, when RAG is better than fine-tuning
agent-loops -- Multi-step tool use with reasoning
llm-evaluation -- Evaluate fine-tuned and local models
langfuse-observability -- Track training experiments

Capability Details

function-calling

Keywords: tool, function, define tool, tool schema, function schema, strict mode, parallel tools

Solves:

Define tools with clear descriptions and strict schemas
Execute tool calls in parallel with asyncio.gather
Validate inputs and handle errors in tool execution loops

streaming

Keywords: streaming, SSE, Server-Sent Events, real-time, backpressure, token stream

Solves:

Stream LLM tokens via SSE endpoints
Handle tool calls within streams
Manage backpressure with bounded queues

local-inference

Keywords: Ollama, local, self-hosted, model selection, GPU, Apple Silicon

Solves:

Set up Ollama for local LLM inference
Select models based on task and hardware
Optimize GPU usage and CI integration

fine-tuning

Keywords: LoRA, QLoRA, fine-tune, DPO, synthetic data, PEFT, alignment

Solves:

Configure LoRA/QLoRA for parameter-efficient training
Generate and validate synthetic training data
Align models with DPO and evaluate results

llm-integration