ai-ml
AI/ML: Building Production AI Applications
Build, review, and architect applications that use AI models - from single-API calls to multi-agent systems with RAG pipelines. The goal is production-grade AI apps that are reliable, cost-effective, and don't hallucinate their way into an incident.
Target versions: May 2026 snapshot. Read references/target-versions.md before
pinning SDKs, runtimes, vector stores, or evaluation tools.
When to use
- Integrating LLM APIs (Anthropic, OpenAI, etc.) into applications
- Building RAG pipelines (chunking, embedding, retrieval, generation)
- Designing agent systems (tool use, loops, state, multi-agent)
- Choosing between fine-tuning, RAG, and prompt engineering
- Setting up vector stores for semantic search
- Implementing structured output and tool use / function calling
- Building evaluation and testing harnesses for AI features
- Optimizing token costs, latency, and model routing
- Setting up local inference with Ollama or vLLM
- Adding safety guardrails (content filtering, PII handling, output validation)
When NOT to use
- Building MCP servers or tools (use mcp - it handles the protocol layer)
- Writing or refining individual prompts (use prompt-generator)
- General database configuration, schema design, or migrations (use databases)
- Security auditing AI application code (use security-audit)
- Reviewing code quality unrelated to AI/ML patterns (use code-review)
AI Self-Check
AI tools consistently produce the same mistakes when generating AI application code. Before returning any generated AI/ML code, verify against this list:
- API keys loaded from environment variables, never hardcoded
- Streaming responses handled with proper error boundaries and cleanup
- Token limits respected - input truncation or chunking for long contexts
- Structured output uses the provider's native schema enforcement (Anthropic tool_use, OpenAI response_format), not post-hoc parsing with regex
- Tool use / function calling validates tool results before passing back to the model
- Retry logic uses exponential backoff with jitter, not fixed delays (see the sketch after this checklist)
- Rate limit errors (429) handled distinctly from server errors (5xx)
- Vector store queries include a relevance threshold - don't blindly pass low-similarity results to the model
- Embedding model matches between indexing and querying (mixing models = garbage results)
- Prompt templates use parameterized injection, not string concatenation
- Model responses validated before use (check for refusals, empty content, malformed JSON)
- Cost estimation done before batch operations (token count * price * volume)
- No synchronous LLM calls in request handlers - always async with timeouts
- PII stripped or masked before sending to external model APIs
- Temperature set intentionally (0 for deterministic tasks, higher for creative)
- Current source checked: dated versions, CLI flags, API names, and support windows are verified against primary docs before repeating them
- Hidden state identified: local config, credentials, caches, contexts, branches, cluster targets, or previous runs are made explicit before acting
- Verification is real: final checks exercise the actual runtime, parser, service, or integration point instead of only linting prose or happy paths
- Provider drift checked: Responses/Agents/SDK examples use current provider surfaces, not deprecated Assistants-era or chat-only patterns
- RAG evidence bounded: retrieval thresholds, citations, and empty-result behavior are defined before generation
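A minimal sketch of the backoff and 429-handling items above, using the Anthropic Python SDK; the attempt count, delay curve, and model name are illustrative, not prescribed values.

```python
import random
import time

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def call_with_retry(messages: list[dict], max_attempts: int = 5):
    """Retry 429s and 5xx with exponential backoff + jitter; fail fast on other client errors."""
    for attempt in range(max_attempts):
        try:
            return client.messages.create(
                model="claude-sonnet-4-6-20250514",
                max_tokens=1024,
                messages=messages,
            )
        except anthropic.RateLimitError:
            pass  # 429: always retryable
        except anthropic.APIStatusError as err:
            if err.status_code < 500:
                raise  # 4xx other than 429: retrying will not help
        # Exponential backoff with full jitter: 0-1s, 0-2s, 0-4s, ...
        time.sleep(random.uniform(0, 2 ** attempt))
    raise RuntimeError(f"LLM call failed after {max_attempts} attempts")
```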
Performance
- Batch embeddings and eval runs; avoid one request per row when the provider offers batch or bulk APIs.
- Cache deterministic retrieval, tool metadata, and prompt templates, but never cache tenant-specific model outputs without a data-retention decision.
- Track token, latency, and retry budgets separately for interactive, background, and eval traffic.
Best Practices
- Prefer raw provider SDKs until orchestration complexity justifies LangGraph, LlamaIndex, or LangChain.
- Keep model, tool, retrieval, and safety decisions configurable per environment; avoid hardcoding preview model names in application logic.
- Treat model output as untrusted input: validate structure, refusal states, tool arguments, and downstream side effects.
Workflow
Step 1: Determine the architecture pattern
| Need | Pattern | Start with |
|---|---|---|
| Single model call | Direct API integration | Provider SDK |
| Knowledge-grounded answers | RAG pipeline | Vector store + retrieval |
| Multi-step reasoning | Agent with tools | LangGraph, OpenAI Agents SDK, or custom loop |
| Multiple specialized models | Model routing / chain | Custom router or Vercel AI SDK |
| Offline / air-gapped | Local inference | Ollama or vLLM |
| Existing data enrichment | Batch processing | Provider batch APIs |
Step 2: Choose the right abstraction level
Pick the lightest tool that solves the problem:
- Raw SDK - direct Anthropic/OpenAI SDK calls. Best for simple integrations, maximum control, minimum dependencies. Start here unless you have a specific reason not to.
- Vercel AI SDK - unified provider interface with streaming primitives. Good for TypeScript apps that need provider-agnostic code or React/Next.js streaming UI.
- LangChain / LlamaIndex - orchestration frameworks. Use when you need complex chains, built-in document loaders, or 300+ pre-built integrations. Don't use for simple API calls - the abstraction overhead isn't worth it.
- LangGraph / OpenAI Agents SDK - stateful agent frameworks. Use when you need cycles, persistence, human-in-the-loop, or multi-agent coordination.
The anti-pattern: importing LangChain to make a single API call. That's like importing Django to serve a static HTML file.
Step 3: Implement
Follow the domain-specific sections below. Read the appropriate reference file for detailed patterns and code examples.
Step 4: Evaluate and validate
Every AI feature needs evaluation. Not "run it once and eyeball the output" - structured evals with datasets, metrics, and regression detection.
Minimum viable eval: create a promptfooconfig.yaml with 20+ test cases, use contains,
llm-rubric, and cost assertions, run npx promptfoo eval in CI on every PR that touches
prompts. Track pass rate over time - any regression blocks the merge.
Read references/evaluation.md for promptfoo setup, assertion types, CI integration (GitHub
Actions example), RAG-specific evals, agent evals, and red teaming patterns.
LLM Integration Patterns
Streaming
Always stream for user-facing responses. Buffer for background processing.
```python
# Anthropic streaming (Python)
import anthropic

client = anthropic.Anthropic()

def stream_reply(prompt: str):
    with client.messages.stream(
        model="claude-sonnet-4-6-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for text in stream.text_stream:
            yield text
```
Structured output
Use native provider mechanisms, not regex parsing of free-text responses.
- Anthropic: `tool_use` with a JSON schema, or `response_format` with `json_schema` (sketched after this list)
- OpenAI: `response_format: { type: "json_schema", json_schema: {...} }`
- Vercel AI SDK: `generateObject()` with a Zod schema
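A minimal sketch of the Anthropic `tool_use` route; the tool name, schema, and model string are illustrative placeholders.

```python
import anthropic

client = anthropic.Anthropic()

# Force the model to "call" a single extraction tool, so output is schema-shaped JSON.
extract_tool = {
    "name": "record_ticket",
    "description": "Record a structured support ticket",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "severity": {"type": "string", "enum": ["low", "medium", "high"]},
        },
        "required": ["title", "severity"],
    },
}

response = client.messages.create(
    model="claude-sonnet-4-6-20250514",
    max_tokens=1024,
    tools=[extract_tool],
    tool_choice={"type": "tool", "name": "record_ticket"},  # always emit the tool call
    messages=[{"role": "user", "content": "The checkout page 500s for EU users."}],
)

# The structured payload arrives as the tool_use block's input - already parsed JSON.
ticket = next(b.input for b in response.content if b.type == "tool_use")
```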
Tool use / function calling
Define tools with tight schemas. Validate tool results before feeding them back.
```python
tools = [{
    "name": "search_docs",
    "description": "Search internal documentation",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "maxLength": 200},
            "limit": {"type": "integer", "minimum": 1, "maximum": 50},
        },
        "required": ["query"],
    },
}]
```
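One way to enforce the "validate before feeding back" rule above, sketched with the `jsonschema` package (an assumption - any schema validator works): check the model's arguments against the tool's schema, bound the result size, and turn failures into explicit error payloads instead of exceptions.

```python
import json

from jsonschema import ValidationError, validate  # assumption: jsonschema is available

MAX_RESULT_CHARS = 4_000  # keep tool output from blowing up the context window

def run_tool(tool_def: dict, arguments: dict, impl) -> dict:
    """Validate model-supplied arguments, execute, and return a bounded, model-safe result."""
    try:
        validate(instance=arguments, schema=tool_def["input_schema"])
        result = impl(**arguments)
        text = json.dumps(result, default=str)[:MAX_RESULT_CHARS]
        return {"ok": True, "result": text}
    except ValidationError as err:
        return {"ok": False, "error": f"invalid arguments: {err.message}"}
    except Exception as err:  # tool failures are normal - report, don't crash the loop
        return {"ok": False, "error": str(err)}
```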
Read references/llm-patterns.md for multi-turn tool use, parallel tool calls, error
recovery, and provider-specific gotchas.
RAG Architecture
The quality of a RAG system depends more on retrieval quality than model quality. A mediocre model with great retrieval beats a frontier model with bad retrieval.
Chunking strategy
| Strategy | When to use | Chunk size |
|---|---|---|
| Fixed-size with overlap | Default starting point (sketched after this table) | 512-1024 tokens, 10-20% overlap |
| Semantic (sentence/paragraph) | Well-structured documents | Varies by content |
| Recursive character | Mixed content types | 1000 chars, 200 overlap |
| Document-aware (markdown headers, code blocks) | Structured docs, code | Section-based |
| Parent-child | Need both precision and context | Small retrieval, large context |
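A sketch of the default row above - fixed-size chunks with overlap, counted in tokens. tiktoken is an assumed tokenizer here; use whichever tokenizer matches your embedding model.

```python
import tiktoken  # assumption: roughly match the tokenizer to your embedding model

def chunk_text(text: str, chunk_tokens: int = 512, overlap_tokens: int = 64) -> list[str]:
    """Split text into fixed-size token windows with overlap (here ~12%)."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks, step = [], chunk_tokens - overlap_tokens
    for start in range(0, len(tokens), step):
        window = tokens[start : start + chunk_tokens]
        chunks.append(enc.decode(window))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks
```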
Embedding model selection
Use the same model for indexing and querying. Mixing models produces meaningless similarity scores.
| Model | Dimensions | Best for |
|---|---|---|
| `text-embedding-3-large` (OpenAI) | 3072 (or lower via `dimensions`) | General-purpose, scalable |
| `voyage-3-large` (Voyage AI) | 1024 | Code and technical content |
| `embed-v4.0` (Cohere) | 1024 | Multilingual, compression |
| Open-source (e5-mistral, gte-Qwen2) | Varies | Air-gapped / self-hosted |
Retrieval patterns
- Vector search alone - fast, good for semantic similarity, bad for exact keyword matches
- Hybrid search (vector + BM25/keyword) - best default. Qdrant, Weaviate, and Pinecone support this natively; pgvector + `tsvector` for PostgreSQL.
- Reranking - retrieve more candidates (top-50), rerank with a cross-encoder or Cohere Rerank, return top-5. Adds latency but significantly improves relevance (see the sketch after this list).
- Query expansion - rephrase the user query using an LLM before retrieval. Helps when user queries are vague or use different terminology than the source docs.
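A reranking sketch using a local cross-encoder from sentence-transformers (an assumption - Cohere Rerank is the hosted equivalent); the model name is illustrative.

```python
from sentence_transformers import CrossEncoder  # assumed local reranker; swap for a hosted API if preferred

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative model choice

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Score each (query, candidate) pair with the cross-encoder and keep the best top_k."""
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# Usage: over-retrieve, then rerank down.
# top5 = rerank(query, vector_search(query, limit=50), top_k=5)
```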
Vector store selection
| Store | Type | Best for |
|---|---|---|
| pgvector | PostgreSQL extension | Already using Postgres, <10M vectors |
| Qdrant | Self-hosted or cloud | Production self-hosted, hybrid search |
| Pinecone | Managed only | Zero-ops, serverless scaling |
| ChromaDB | Embedded / local | Prototyping, small datasets |
Minimal RAG example (Python + pgvector)
```python
import os

from anthropic import Anthropic
import psycopg

client = Anthropic()
DB_URL = os.environ["DATABASE_URL"]  # connection string; adjust the variable name to your config

def search(query: str, limit: int = 5) -> list[dict]:
    embedding = get_embedding(query)  # same model used at index time
    with psycopg.connect(DB_URL) as conn:
        rows = conn.execute(
            "SELECT content, 1 - (embedding <=> %s::vector) AS score "
            "FROM documents WHERE 1 - (embedding <=> %s::vector) > 0.7 "
            "ORDER BY embedding <=> %s::vector LIMIT %s",
            [embedding, embedding, embedding, limit],
        ).fetchall()
    return [{"content": r[0], "score": r[1]} for r in rows]

def ask(question: str) -> str:
    context = search(question)
    if not context:
        return "No relevant documents found."
    response = client.messages.create(
        model="claude-sonnet-4-6-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": (
            "Answer based on these documents:\n\n"
            + "\n---\n".join(d["content"] for d in context)
            + f"\n\nQuestion: {question}"
        )}],
    )
    return response.content[0].text
```
Key patterns: relevance threshold (0.7), same embedding model for index/query, context passed as user message prefix.
Read references/rag-patterns.md for indexing pipelines, metadata filtering, multi-index
strategies, and production RAG architecture.
Agent Systems
The agent loop
Every agent system is fundamentally: observe -> think -> act -> repeat. The differences are in how you manage state, handle failures, and know when to stop.
```python
while not done:
    observation = get_context(state)
    action = model.decide(observation, tools)
    if action.type == "final_answer":
        done = True
    else:
        result = execute_tool(action)
        state.add(result)
```
Framework selection
| Framework | Best for | Key feature |
|---|---|---|
| Custom loop | Simple agents, maximum control | No dependencies |
| LangGraph | Complex state machines, cycles, persistence | Graph-based, checkpointing |
| OpenAI Agents SDK | OpenAI-native, multi-agent handoffs | Sessions, tracing |
| Claude Agent SDK | Claude-native, code/file operations | Claude Code capabilities |
| Vercel AI SDK | TypeScript agents with UI streaming | ToolLoopAgent, React hooks |
Common pitfalls
- Infinite loops - always set a max iteration count. Agents will happily loop forever.
- Tool explosion - more than 10-15 tools degrades model performance. Group related operations into fewer, more capable tools.
- Missing error handling - tool failures are normal. The agent needs to recover, not crash.
- No cost ceiling - a runaway agent can burn through API budget. Set per-request token and cost limits.
- Stale context - long-running agents accumulate context. Summarize or prune periodically.
Minimal safe agent loop
Every agent loop needs an iteration cap, a cost gate, and a tool-error policy. Retry transient errors with backoff, abort on permanent errors, and pass failed tool results back with an error marker so the model can choose the next step instead of silently losing state.
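A sketch of that policy with the Anthropic SDK. The `execute_tool` dispatcher, iteration cap, and token ceiling are illustrative assumptions; adapt them to your tools and budget.

```python
import anthropic

client = anthropic.Anthropic()

MAX_ITERATIONS = 15        # hard stop - agents will loop forever otherwise
MAX_TOTAL_TOKENS = 50_000  # per-request cost ceiling (illustrative)

def run_agent(user_message: str, tools: list[dict]) -> str:
    messages = [{"role": "user", "content": user_message}]
    total_tokens = 0

    for _ in range(MAX_ITERATIONS):
        response = client.messages.create(
            model="claude-sonnet-4-6-20250514",
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )
        total_tokens += response.usage.input_tokens + response.usage.output_tokens
        if total_tokens > MAX_TOTAL_TOKENS:
            return "Stopped: token budget exceeded."

        if response.stop_reason != "tool_use":
            # Final answer (or refusal) - surface the text and stop.
            return "".join(b.text for b in response.content if b.type == "text")

        # Execute every requested tool; report failures as error markers, never crash.
        messages.append({"role": "assistant", "content": response.content})
        results = []
        for block in response.content:
            if block.type != "tool_use":
                continue
            try:
                output = execute_tool(block.name, block.input)  # hypothetical dispatcher
                results.append({"type": "tool_result", "tool_use_id": block.id,
                                "content": str(output)})
            except Exception as err:
                results.append({"type": "tool_result", "tool_use_id": block.id,
                                "content": f"Tool failed: {err}", "is_error": True})
        messages.append({"role": "user", "content": results})

    return "Stopped: iteration limit reached."
```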
Read references/agent-patterns.md for multi-agent architectures, human-in-the-loop patterns,
memory management, and production agent deployment.
Fine-Tuning vs RAG vs Prompt Engineering
Pick the cheapest approach that meets your quality bar:
| Approach | Cost | Lead time | Best for |
|---|---|---|---|
| Prompt engineering | Lowest | Hours | Formatting, tone, simple tasks |
| Few-shot examples | Low | Hours | Pattern matching, classification |
| RAG | Medium | Days | Knowledge-grounded, dynamic data |
| Fine-tuning | High | Days-weeks | Style/behavior, latency-critical, domain specialization |
Fine-tune when: prompt engineering can't capture the behavior, you need consistent style/format across thousands of outputs, or you need lower latency than RAG provides.
Don't fine-tune when: your data changes frequently (use RAG), you have fewer than 100 high-quality examples, or prompt engineering already works (you're just cargo-culting).
Read references/fine-tuning.md for data preparation, PEFT/LoRA patterns, evaluation during
training, and when to use full fine-tuning vs parameter-efficient methods.
Local Inference
Local serving choices
| Tool | Best for | GPU required |
|---|---|---|
| Ollama | Dev, prototyping, Mac (MLX) | No (CPU/MLX), optional GPU |
| vLLM | Production serving, high throughput | Yes |
| llama.cpp / llama-cpp-python | Minimal deps, quantized models, CPU-only | No (CPU), optional GPU |
| TGI (HF Text Generation Inference) | HF model hub integration | Yes |
CPU-only inference with llama.cpp
CPU inference is viable - sometimes preferable - for: dense models that fit in RAM (7-13B at Q4 hits 5-10 t/s on modern x86), MoE models with low active params (Qwen3-30B-A3B at Q4 reaches 13+ t/s even on a 2013-era Xeon - active params dominate decode), and air-gapped or compliance-bound environments. Key gotchas:
- ISA cliff: pre-Haswell CPUs lack AVX2/FMA/BMI2. PyTorch >= 2.1, TF >= 2.8, JAX, and Ollama prebuilts SIGILL. llama.cpp from source with `-DGGML_AVX2=OFF -DGGML_FMA=OFF -DGGML_BMI2=OFF` works.
- GGUF quants: `Q4_K_M` is the default sweet spot. `Q5_K_M` for +25% memory and quality. `IQ4_XS` for tighter budgets. Avoid Q2/Q3 - the quality cliff is real.
- Reproducible models: pin both filename and HF commit SHA. Bare repo+filename pulls "whatever the author serves now" - silent runtime changes on rebase.
- `--mlock` page-faults the GGUF into RAM at start. Sum GGUF sizes for capacity planning.
- Threading: `-t = physical_cores - 4` (decode, memory-bandwidth-bound), `-tb = logical` (prefill, compute-bound).
- API keys: `--api-key-file <path>`, never `--api-key <value>` on the command line - it leaks into `/proc/<pid>/cmdline` via systemd env expansion.
Benchmarking
Use a fixed prompt suite (chat-short, chat-long, code-simple, code-complex, reasoning), run a warmup pass, then record latency and decode t/s at fixed `max_tokens` and temperature. Re-run after model swaps, llama.cpp version bumps, or build-flag changes. Compare decode t/s, not raw latency.
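A sketch of that measurement against a llama.cpp server's OpenAI-compatible endpoint, using the openai client pointed at localhost. The port, model alias, and chunk-counting approximation are assumptions - the server's own timing output is more precise.

```python
import time

from openai import OpenAI  # assumption: talking to llama.cpp's OpenAI-compatible server

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")  # illustrative endpoint

def decode_tps(prompt: str, max_tokens: int = 256) -> float:
    """Approximate decode tokens/sec by timing from the first streamed chunk to the last."""
    stream = client.chat.completions.create(
        model="local",  # many local servers ignore or alias this; adjust for yours
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=0,
        stream=True,
    )
    first, chunks = None, 0
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            chunks += 1
            first = first or time.perf_counter()
    return (chunks - 1) / (time.perf_counter() - first) if chunks > 1 else 0.0
```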
Read references/local-inference.md for the full llama.cpp build walkthrough (per-CPU-generation
flags), HF SHA-pinned model download, systemd-per-model deployment, NUMA tuning, mlock memory
budgeting, benchmark methodology, and production serving configuration.
Cost Optimization
Token budgeting
Know your costs before you scale:
```python
cost_per_request = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
monthly_cost = cost_per_request * requests_per_day * 30
```
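A worked version of that arithmetic. The prices below are placeholders, not quoted rates - substitute your provider's current per-million-token pricing.

```python
# Placeholder prices in USD per million tokens - substitute current provider pricing.
INPUT_PRICE, OUTPUT_PRICE = 3.00, 15.00

input_tokens, output_tokens = 2_000, 500
requests_per_day = 10_000

cost_per_request = (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000
monthly_cost = cost_per_request * requests_per_day * 30
print(f"${cost_per_request:.4f}/request, ~${monthly_cost:,.0f}/month")
# -> $0.0135/request, ~$4,050/month with these placeholder numbers
```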
Strategies (ordered by impact)
- Model routing - use cheaper models for easy tasks, frontier models for hard ones. Route by task complexity, not by default.
- Caching - cache identical or semantically similar requests. Anthropic prompt caching reduces repeated prefix costs by 90% (see the sketch after this list).
- Prompt optimization - shorter prompts cost less. Cut examples, compress instructions.
- Batch APIs - Anthropic and OpenAI offer 50% discounts for async batch processing.
- Output length limits - set `max_tokens` to what you actually need, not 4096 "just in case."
- Context pruning - for multi-turn conversations, summarize history instead of sending the full transcript.
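For the caching strategy, a sketch of Anthropic prompt caching: mark the long shared prefix as cacheable so repeated requests reuse it. The system-prompt constant and model name are illustrative.

```python
import anthropic

client = anthropic.Anthropic()

# Mark the long, shared prefix (system instructions, reference docs) as cacheable;
# only the short per-request suffix is billed at the full input price on cache hits.
response = client.messages.create(
    model="claude-sonnet-4-6-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT_AND_REFERENCE_DOCS,  # hypothetical constant
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_question}],
)
```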
Safety and Guardrails
Input validation (prompt injection), output validation (schema + content policy), PII handling (strip before external API calls), rate limiting (per-user + per-IP), content filtering, and audit logging (redact PII). These are non-negotiable for production AI apps.
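A sketch of the PII-handling step: a regex pass that masks obvious identifiers before the prompt leaves your infrastructure. The patterns are deliberately simplistic; a dedicated detector (e.g. Presidio) covers far more entity types.

```python
import re

# Deliberately simple patterns - a starting point, not a complete PII detector.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{8,}\d"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def mask_pii(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}_REDACTED]", text)
    return text

prompt = mask_pii(raw_user_input)  # mask before the external API call
```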
Read references/safety.md for prompt injection defense patterns, output validation schemas,
PII detection setup, and content policy implementation.
Production Checklist
- API keys in environment variables or secret manager (never in code)
- Retry logic with exponential backoff and jitter on all LLM calls
- Timeouts set on all LLM calls (model inference can hang)
- Rate limiting on AI-powered endpoints
- Cost monitoring and alerting (daily spend, per-request cost tracking)
- Structured logging of prompts, responses, latency, token usage
- Evaluation suite running in CI (regression detection)
- Model fallback chain configured (primary -> secondary -> error response) - see the sketch after this checklist
- Input validation and prompt injection defense
- Output validation before returning to users
- PII scrubbed from external API calls
- Max token limits set per request type
- Health checks on model endpoints (especially self-hosted)
- A/B testing infrastructure for prompt and model changes
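A sketch of the timeout and fallback-chain items above. The environment variable names are illustrative, and real deployments often fall back across providers rather than within one.

```python
import os

import anthropic

client = anthropic.Anthropic(timeout=30.0)  # never let an inference call hang a request handler

# Keep model choices configurable per environment (see Best Practices above).
FALLBACK_CHAIN = [os.environ["PRIMARY_MODEL"], os.environ["FALLBACK_MODEL"]]  # hypothetical env vars

def complete_with_fallback(messages: list[dict]) -> str:
    last_error: Exception | None = None
    for model in FALLBACK_CHAIN:
        try:
            response = client.messages.create(model=model, max_tokens=1024, messages=messages)
            return response.content[0].text
        except (anthropic.APIConnectionError, anthropic.APITimeoutError, anthropic.APIStatusError) as err:
            last_error = err  # try the next model in the chain
    raise RuntimeError("All models in the fallback chain failed") from last_error
```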
Reference Files
- `references/llm-patterns.md` - multi-turn tool use, parallel tool calls, error recovery, provider gotchas
- `references/rag-patterns.md` - indexing pipelines, metadata filtering, multi-index, production architecture
- `references/agent-patterns.md` - multi-agent, human-in-the-loop, memory management, production deployment
- `references/evaluation.md` - promptfoo setup, assertion types, CI integration, RAG/agent evals, red teaming
- `references/fine-tuning.md` - data prep, PEFT/LoRA, training evaluation, full vs parameter-efficient methods
- `references/local-inference.md` - quantization, model selection, GPU memory, production serving config
- `references/safety.md` - prompt injection defense, output validation, PII handling, content filtering, audit logging
- `references/target-versions.md` - May 2026 version snapshot for AI SDKs, runtimes, vector stores, and eval tools
Output Contract
See skills/_shared/output-contract.md for the full contract.
- Skill name: AI-ML
- Deliverable bucket: `audits`
- Mode: conditional. When invoked to analyze, review, audit, or improve existing repo content, emit the full contract -- boxed inline header, body summary inline plus per-finding detail in the deliverable file, boxed conclusion, conclusion table -- and write the deliverable to `docs/local/audits/ai-ml/<YYYY-MM-DD>-<slug>.md`. When invoked to answer a question, teach a concept, build a new artifact, or generate content, respond freely without the contract.
- Severity scale: `P0 | P1 | P2 | P3 | info` (see shared contract; only used in audit/review mode).
Related Skills
- mcp - handles MCP server development (the protocol/tooling layer). This skill handles the application layer - how to build apps that call models, retrieve context, and orchestrate agents. If building an MCP server, use mcp. If building an app that uses AI, use this skill.
- prompt-generator - for crafting and refining individual prompts. This skill covers prompt template management and patterns within applications; prompt-generator handles one-off prompt creation and iteration.
- databases - for general database operations. This skill covers vector store integration for RAG; databases handles engine configuration, schema design, and traditional DB operations.
- security-audit - for security review of AI application code. This skill provides guardrail patterns; security-audit provides the audit methodology.
- code-review - for reviewing AI application code quality beyond AI-specific patterns.
Rules
- Start with the simplest approach. Direct SDK calls before frameworks. Prompt engineering before fine-tuning. Single agent before multi-agent. Complexity is a cost.
- Never hardcode API keys. Environment variables or secret managers. No exceptions.
- Always stream user-facing responses. Buffered LLM responses feel broken. Stream.
- Set token limits explicitly. `max_tokens` on every call. Unbounded generation wastes money and risks timeouts.
- Match embedding models. Same model for indexing and querying. Mixing models produces meaningless similarity scores that silently degrade retrieval quality.
- Validate model output. Check for refusals, empty content, malformed structured output. Models fail in creative ways - handle all of them.
- Budget before you batch. Calculate cost before running batch operations. A 100k-row embedding job at the wrong model can cost thousands.
- Evaluate with data, not vibes. Structured evals with datasets and metrics. "It looks good" is not a quality gate.
- Cap agent iterations. Set a max loop count. Runaway agents burn budget and produce garbage. 10-20 iterations is a reasonable default.
- Run the AI self-check. All generated AI/ML code is verified against the checklist above before returning.