
Cloudflare Workers AI - Complete Reference

Production-ready knowledge domain for building AI-powered applications with Cloudflare Workers AI.

Status: Production Ready ✅
Last Updated: 2025-10-21
Dependencies: cloudflare-worker-base (for Worker setup)
Latest Versions: wrangler@4.43.0, @cloudflare/workers-types@4.20251014.0


Table of Contents

  1. Quick Start (5 minutes)
  2. Workers AI API Reference
  3. Model Selection Guide
  4. Common Patterns
  5. AI Gateway Integration
  6. Rate Limits & Pricing
  7. Production Checklist

Quick Start (5 minutes)

1. Add AI Binding

wrangler.jsonc:

{
  "ai": {
    "binding": "AI"
  }
}

2. Run Your First Model

export interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      prompt: 'What is Cloudflare?',
    });

    return Response.json(response);
  },
};

3. Add Streaming (Recommended)

const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [{ role: 'user', content: 'Tell me a story' }],
  stream: true, // Always use streaming for text generation!
});

return new Response(stream, {
  headers: { 'content-type': 'text/event-stream' },
});

Why streaming?

  • Prevents buffering large responses in memory
  • Faster time-to-first-token
  • Better user experience for long-form content
  • Avoids Worker timeout issues
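On the consuming side, each chunk is a server-sent event. A minimal parsing sketch, assuming the stream's data lines carry JSON objects with a response field and terminate with data: [DONE] (verify against your model's actual output):

// Sketch: collect streamed tokens into a single string.
// Assumes SSE lines of the form `data: {"response":"..."}`
// terminated by `data: [DONE]`.
async function collectStream(stream: ReadableStream<Uint8Array>): Promise<string> {
  const decoder = new TextDecoder();
  let buffer = '';
  let text = '';

  for await (const chunk of stream) {
    buffer += decoder.decode(chunk, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // keep any partial line for the next chunk

    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const data = line.slice('data: '.length);
      if (data === '[DONE]') return text;
      text += JSON.parse(data).response ?? '';
    }
  }

  return text;
}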

Workers AI API Reference

env.AI.run()

Run an AI model inference.

Signature:

async env.AI.run(
  model: string,
  inputs: ModelInputs,
  options?: { gateway?: { id: string; skipCache?: boolean } }
): Promise<ModelOutput | ReadableStream>

Parameters:

  • model (string, required) - Model ID (e.g., @cf/meta/llama-3.1-8b-instruct)
  • inputs (object, required) - Model-specific inputs
  • options (object, optional) - Additional options
    • gateway (object) - AI Gateway configuration
      • id (string) - Gateway ID
      • skipCache (boolean) - Skip AI Gateway cache

Returns:

  • Non-streaming: Promise<ModelOutput> - JSON response
  • Streaming: ReadableStream - Server-sent events stream
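Because the return type is a union, callers that may or may not stream can narrow on ReadableStream. A minimal sketch (maybeStream is a hypothetical runtime flag, not part of the API):

const result = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  prompt: 'Hello',
  stream: maybeStream, // hypothetical boolean decided at runtime
});

// Streaming responses come back as a ReadableStream; everything else is JSON.
if (result instanceof ReadableStream) {
  return new Response(result, {
    headers: { 'content-type': 'text/event-stream' },
  });
}

return Response.json(result);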

Text Generation Models

Input Format:

{
  messages?: Array<{ role: 'system' | 'user' | 'assistant'; content: string }>;
  prompt?: string; // Deprecated, use messages
  stream?: boolean; // Default: false
  max_tokens?: number; // Max tokens to generate
  temperature?: number; // 0.0-1.0, default varies by model
  top_p?: number; // 0.0-1.0
  top_k?: number;
}

Output Format (Non-Streaming):

{
  response: string; // Generated text
}

Example:

const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'What is TypeScript?' },
  ],
  stream: false,
});

console.log(response.response);

Text Embeddings Models

Input Format:

{
  text: string | string[]; // Single text or array of texts
}

Output Format:

{
  shape: number[]; // [batch_size, embedding_dimensions]
  data: number[][]; // Array of embedding vectors
}

Example:

const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: ['Hello world', 'Cloudflare Workers'],
});

console.log(embeddings.shape); // [2, 768]
console.log(embeddings.data[0]); // [0.123, -0.456, ...]
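Embedding vectors are compared with a similarity measure, most commonly cosine similarity. A self-contained helper (not part of the Workers AI API):

// Hypothetical helper: cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Similarity between the two texts embedded above (1.0 = identical direction).
const score = cosineSimilarity(embeddings.data[0], embeddings.data[1]);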

Image Generation Models

Input Format:

{
  prompt: string; // Text description
  num_steps?: number; // Default: 20
  guidance?: number; // CFG scale, default: 7.5
  strength?: number; // For img2img, default: 1.0
  image?: number[]; // For img2img (array of 8-bit integer values)
}

Output Format:

  • Binary image data (PNG/JPEG)

Example:

const imageStream = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
  prompt: 'A beautiful sunset over mountains',
});

return new Response(imageStream, {
  headers: { 'content-type': 'image/png' },
});

Vision Models

Input Format:

{
  messages: Array<{
    role: 'user' | 'assistant';
    content: Array<{ type: 'text' | 'image_url'; text?: string; image_url?: { url: string } }>;
  }>;
}

Example:

const response = await env.AI.run('@cf/meta/llama-3.2-11b-vision-instruct', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'What is in this image?' },
        { type: 'image_url', image_url: { url: 'data:image/png;base64,iVBOR...' } },
      ],
    },
  ],
});
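The image_url accepts a data URL. A sketch of building one from uploaded bytes, assuming the client sends the raw image as the request body with a correct Content-Type header:

// Sketch: convert an uploaded image (raw request body) to a data URL.
const bytes = new Uint8Array(await request.arrayBuffer());

let binary = '';
for (const byte of bytes) binary += String.fromCharCode(byte);

const contentType = request.headers.get('content-type') ?? 'image/png';
const dataUrl = `data:${contentType};base64,${btoa(binary)}`;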

Model Selection Guide

Text Generation (LLMs)

| Model | Best For | Rate Limit | Size |
| --- | --- | --- | --- |
| @cf/meta/llama-3.1-8b-instruct | General purpose, fast | 300/min | 8B |
| @cf/meta/llama-3.2-1b-instruct | Ultra-fast, simple tasks | 300/min | 1B |
| @cf/qwen/qwen1.5-14b-chat-awq | High quality, complex reasoning | 150/min | 14B |
| @cf/deepseek-ai/deepseek-r1-distill-qwen-32b | Coding, technical content | 300/min | 32B |
| @hf/thebloke/mistral-7b-instruct-v0.1-awq | Fast, efficient | 400/min | 7B |

Text Embeddings

| Model | Dimensions | Best For | Rate Limit |
| --- | --- | --- | --- |
| @cf/baai/bge-base-en-v1.5 | 768 | General purpose RAG | 3000/min |
| @cf/baai/bge-large-en-v1.5 | 1024 | High accuracy search | 1500/min |
| @cf/baai/bge-small-en-v1.5 | 384 | Fast, low storage | 3000/min |

Image Generation

| Model | Best For | Rate Limit | Speed |
| --- | --- | --- | --- |
| @cf/black-forest-labs/flux-1-schnell | High quality, photorealistic | 720/min | Fast |
| @cf/stabilityai/stable-diffusion-xl-base-1.0 | General purpose | 720/min | Medium |
| @cf/lykon/dreamshaper-8-lcm | Artistic, stylized | 720/min | Fast |

Vision Models

| Model | Best For | Rate Limit |
| --- | --- | --- |
| @cf/meta/llama-3.2-11b-vision-instruct | Image understanding | 720/min |
| @cf/unum/uform-gen2-qwen-500m | Fast image captioning | 720/min |

Common Patterns

Pattern 1: Chat Completion with History

The patterns in this section use Hono; the setup below is shared by the other patterns that reference app and c.

import { Hono } from 'hono';

const app = new Hono<{ Bindings: Env }>();

app.post('/chat', async (c) => {
  const { messages } = await c.req.json<{
    messages: Array<{ role: string; content: string }>;
  }>();

  const response = await c.env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
    messages,
    stream: true,
  });

  return new Response(response, {
    headers: { 'content-type': 'text/event-stream' },
  });
});

Pattern 2: RAG (Retrieval Augmented Generation)

// Step 1: Generate embeddings
const embeddings = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: [userQuery],
});

const vector = embeddings.data[0];

// Step 2: Search Vectorize (request metadata so the stored text comes back)
const matches = await env.VECTORIZE.query(vector, { topK: 3, returnMetadata: 'all' });

// Step 3: Build context from matches
const context = matches.matches.map((m) => m.metadata.text).join('\n\n');

// Step 4: Generate response with context
const response = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
  messages: [
    {
      role: 'system',
      content: `Answer using this context:\n${context}`,
    },
    { role: 'user', content: userQuery },
  ],
  stream: true,
});

return new Response(response, {
  headers: { 'content-type': 'text/event-stream' },
});

Pattern 3: Structured Output with Zod

import { z } from 'zod';

const RecipeSchema = z.object({
  name: z.string(),
  ingredients: z.array(z.string()),
  instructions: z.array(z.string()),
  prepTime: z.number(),
});

app.post('/recipe', async (c) => {
  const { dish } = await c.req.json<{ dish: string }>();

  const response = await c.env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
    messages: [
      {
        role: 'user',
        content: `Generate a recipe for ${dish}. Return ONLY valid JSON with this shape: {"name": string, "ingredients": string[], "instructions": string[], "prepTime": number}`,
      },
    ],
  });

  // Parse and validate
  const recipe = RecipeSchema.parse(JSON.parse(response.response));

  return c.json(recipe);
});
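Models occasionally return malformed JSON, so production code often retries the parse step. A sketch using safeParse (messages stands for the same array passed in the handler above, hoisted into a variable):

// Sketch: retry once if the model output fails parsing or validation.
let recipe: z.infer<typeof RecipeSchema> | undefined;

for (let attempt = 0; attempt < 2 && !recipe; attempt++) {
  const result = await c.env.AI.run('@cf/meta/llama-3.1-8b-instruct', { messages });
  try {
    const parsed = RecipeSchema.safeParse(JSON.parse(result.response));
    if (parsed.success) recipe = parsed.data;
  } catch {
    // Invalid JSON; loop once more
  }
}

if (!recipe) throw new Error('Model did not return valid JSON');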

Pattern 4: Image Generation + R2 Storage

app.post('/generate-image', async (c) => {
  const { prompt } = await c.req.json<{ prompt: string }>();

  // Generate image
  const imageStream = await c.env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
    prompt,
  });

  const imageBytes = await new Response(imageStream).bytes();

  // Store in R2
  const key = `images/${Date.now()}.png`;
  await c.env.BUCKET.put(key, imageBytes, {
    httpMetadata: { contentType: 'image/png' },
  });

  return c.json({
    success: true,
    url: `https://your-domain.com/${key}`,
  });
});
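To serve stored images back, pair this with a read route on the same bucket. A minimal sketch (the /images/:key route is an assumption, matching the key scheme above):

// Sketch: serve a previously generated image from R2.
app.get('/images/:key', async (c) => {
  const object = await c.env.BUCKET.get(`images/${c.req.param('key')}`);
  if (!object) return c.notFound();

  return new Response(object.body, {
    headers: {
      'content-type': object.httpMetadata?.contentType ?? 'image/png',
    },
  });
});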

AI Gateway Integration

AI Gateway provides caching, logging, and analytics for AI requests.

Setup:

const response = await env.AI.run(
  '@cf/meta/llama-3.1-8b-instruct',
  { prompt: 'Hello' },
  {
    gateway: {
      id: 'my-gateway', // Your gateway ID
      skipCache: false, // Use cache
    },
  }
);

Benefits:

  • Cost Tracking - Monitor neurons usage per request
  • Caching - Reduce duplicate inference costs
  • Logging - Debug and analyze AI requests
  • Rate Limiting - Additional layer of protection
  • Analytics - Request patterns and performance

Access Gateway Logs:

const gateway = env.AI.gateway('my-gateway');
const logId = env.AI.aiGatewayLogId; // Populated after the most recent gateway-routed run

// Send feedback (numeric: 1 = positive, -1 = negative)
await gateway.patchLog(logId, {
  feedback: 1,
});

Rate Limits & Pricing

Rate Limits (per minute)

| Task Type | Default Limit | Notes |
| --- | --- | --- |
| Text Generation | 300/min | Some fast models: 400-1500/min |
| Text Embeddings | 3000/min | BGE-large: 1500/min |
| Image Generation | 720/min | All image models |
| Vision Models | 720/min | Image understanding |
| Translation | 720/min | M2M100, Opus MT |
| Classification | 2000/min | Text classification |
| Speech Recognition | 720/min | Whisper models |

Pricing (Neurons-Based)

Free Tier:

  • 10,000 neurons per day
  • Resets daily at 00:00 UTC

Paid Tier:

  • $0.011 per 1,000 neurons
  • 10,000 neurons/day still included at no charge
  • No hard cap: usage beyond the daily included neurons is billed at the per-neuron rate

Example Costs:

| Model | Input (1M tokens) | Output (1M tokens) |
| --- | --- | --- |
| Llama 3.2 1B | $0.027 | $0.201 |
| Llama 3.1 8B | $0.088 | $0.606 |
| BGE-base embeddings | $0.005 | N/A |
| Flux image generation | ~$0.011/image | N/A |
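To relate these dollar figures back to neurons: at $0.011 per 1,000 neurons, 1M output tokens from Llama 3.1 8B ($0.606) corresponds to roughly $0.606 / $0.011 × 1,000 ≈ 55,000 neurons, about five and a half days of the 10,000-neuron daily free allocation.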

Production Checklist

Before Deploying

  • Enable AI Gateway for cost tracking and logging
  • Implement streaming for all text generation endpoints
  • Add rate limit retry with exponential backoff
  • Validate input length to prevent token limit errors (see the sketch after this list)
  • Set appropriate timeouts (Workers: 30s CPU default, 5m max)
  • Monitor neurons usage in Cloudflare dashboard
  • Test error handling for model unavailable, rate limits
  • Add input sanitization to prevent prompt injection
  • Configure CORS if using from browser
  • Plan for scale - upgrade to Paid plan if needed
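A rough pre-flight length check can reject oversized prompts before they reach the model. A minimal sketch, assuming the common heuristic of roughly 4 characters per token (the exact tokenizer varies by model, and the limit below is an assumption):

// Sketch: coarse input-length guard using a ~4 chars/token heuristic.
// MAX_INPUT_TOKENS is a placeholder; check your model's actual context window.
const MAX_INPUT_TOKENS = 8000;

function validatePromptLength(messages: Array<{ content: string }>): void {
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0);
  const approxTokens = Math.ceil(chars / 4);
  if (approxTokens > MAX_INPUT_TOKENS) {
    throw new Error(`Input too long: ~${approxTokens} tokens (limit ${MAX_INPUT_TOKENS})`);
  }
}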

Error Handling

async function runAIWithRetry(
  env: Env,
  model: string,
  inputs: any,
  maxRetries = 3
): Promise<any> {
  let lastError: Error | undefined;

  for (let i = 0; i < maxRetries; i++) {
    try {
      return await env.AI.run(model, inputs);
    } catch (error) {
      lastError = error as Error;
      const message = lastError.message.toLowerCase();

      // Rate limit - retry with backoff
      if (message.includes('429') || message.includes('rate limit')) {
        const delay = Math.pow(2, i) * 1000; // Exponential backoff
        await new Promise((resolve) => setTimeout(resolve, delay));
        continue;
      }

      // Other errors - throw immediately
      throw error;
    }
  }

  throw lastError ?? new Error('AI request failed after retries');
}
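Usage mirrors env.AI.run:

// Same call shape as env.AI.run, with retries on rate-limit errors.
const result = await runAIWithRetry(env, '@cf/meta/llama-3.1-8b-instruct', {
  messages: [{ role: 'user', content: 'Hello' }],
});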

Monitoring

app.use('*', async (c, next) => {
  const start = Date.now();

  await next();

  // Log AI usage
  console.log({
    path: c.req.path,
    duration: Date.now() - start,
    logId: c.env.AI.aiGatewayLogId,
  });
});

OpenAI Compatibility

Workers AI supports OpenAI-compatible endpoints.

Using OpenAI SDK:

import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: env.CLOUDFLARE_API_KEY,
  baseURL: `https://api.cloudflare.com/client/v4/accounts/${env.CLOUDFLARE_ACCOUNT_ID}/ai/v1`,
});

// Chat completions
const completion = await openai.chat.completions.create({
  model: '@cf/meta/llama-3.1-8b-instruct',
  messages: [{ role: 'user', content: 'Hello!' }],
});

// Embeddings
const embeddings = await openai.embeddings.create({
  model: '@cf/baai/bge-base-en-v1.5',
  input: 'Hello world',
});

Endpoints:

  • /v1/chat/completions - Text generation
  • /v1/embeddings - Text embeddings
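Streaming works through the SDK as well; a sketch using the standard stream: true option (assuming the compatibility endpoint streams like the upstream OpenAI API):

// Sketch: streaming chat completion via the OpenAI SDK.
const stream = await openai.chat.completions.create({
  model: '@cf/meta/llama-3.1-8b-instruct',
  messages: [{ role: 'user', content: 'Hello!' }],
  stream: true,
});

// Accumulate delta tokens as they arrive.
let text = '';
for await (const chunk of stream) {
  text += chunk.choices[0]?.delta?.content ?? '';
}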

Vercel AI SDK Integration

Install:

npm install workers-ai-provider ai

Usage:

import { createWorkersAI } from 'workers-ai-provider';
import { generateText, streamText } from 'ai';

const workersai = createWorkersAI({ binding: env.AI });

// Generate text
const result = await generateText({
  model: workersai('@cf/meta/llama-3.1-8b-instruct'),
  prompt: 'Write a poem',
});

// Stream text
const stream = streamText({
  model: workersai('@cf/meta/llama-3.1-8b-instruct'),
  prompt: 'Tell me a story',
});
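In a fetch handler, recent versions of the ai package can convert the streamed result straight into a Response:

// Convert the stream into an HTTP response (helper from the ai package).
return stream.toTextStreamResponse();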

Limits Summary

| Feature | Limit |
| --- | --- |
| Concurrent requests | No hard limit (rate limits apply) |
| Max input tokens | Varies by model (typically 2K-128K) |
| Max output tokens | Varies by model (typically 512-2048) |
| Streaming chunk size | ~1 KB |
| Image size (output) | ~5 MB |
| Request timeout | Workers timeout applies (30s default, 5m max CPU) |
| Daily free neurons | 10,000 |
| Rate limits | See "Rate Limits & Pricing" section |
