Cloudflare Workers AI Enhancement - Evolution of Edge AI Inference

2025.12.01

What is Cloudflare Workers AI?

Cloudflare Workers AI is a service for running AI inference at the edge. Because models execute in Cloudflare's data centers worldwide, inference happens close to users, keeping latency low.

Reference: Cloudflare Workers AI

New Model Additions

Available Models (as of late 2024)

Category    Model                 Use Case
LLM         Llama 3.2             Text generation
LLM         Mistral 7B            Fast inference
LLM         Gemma 2               Multilingual
Image       Stable Diffusion XL   Image generation
Image       FLUX.1                High-quality images
Audio       Whisper               Speech recognition
Embedding   BGE                   Vectorization

Usage Example

// Inference with Workers AI
export default {
  async fetch(request, env) {
    const response = await env.AI.run('@cf/meta/llama-3.2-3b-instruct', {
      messages: [
        { role: 'user', content: 'Tell me about Cloudflare features' }
      ],
      max_tokens: 512
    });

    return new Response(JSON.stringify(response));
  }
};
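The `env.AI` binding used above (and the `env.VECTORIZE` binding used in the next section) must be declared in the Worker's configuration. A minimal sketch of the relevant `wrangler.toml` entries — the binding names and index name are illustrative and should match your own project:

```toml
# wrangler.toml — bindings assumed by the examples in this article
[ai]
binding = "AI"

[[vectorize]]
binding = "VECTORIZE"
index_name = "my-index"
```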

Reference: Workers AI Models

Vectorize GA (General Availability)

Vector Database

Store embedding vectors and perform similarity searches.
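The `cosine` metric used when creating the index below scores how closely two vectors point in the same direction, independent of their length. As a minimal illustration of the computation behind each match score:

```javascript
// Cosine similarity between two equal-length vectors: dot(a, b) / (|a| * |b|)
// Returns 1 for identical directions, 0 for orthogonal vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Vectorize performs this comparison server-side across the whole index, so you never compute it yourself in practice.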

// Create Vectorize (wrangler CLI)
// wrangler vectorize create my-index --dimensions=768 --metric=cosine

// Insert vectors
export default {
  async fetch(request, env) {
    // Convert text to embedding
    const embedding = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
      text: 'Cloudflare edge computing'
    });

    // Save to Vectorize
    await env.VECTORIZE.insert([{
      id: 'doc-1',
      values: embedding.data[0],
      metadata: { title: 'Cloudflare Edge' }
    }]);

    return new Response('Inserted');
  }
};

// Search for similar documents (also runs inside a fetch handler with the same bindings)
const queryEmbedding = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: 'AI inference at the edge'
});

const results = await env.VECTORIZE.query(queryEmbedding.data[0], {
  topK: 5,
  returnMetadata: true
});

// results.matches: [{ id: 'doc-1', score: 0.95, metadata: {...} }, ...]
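Because each match carries a similarity score, a common post-processing step is dropping low-confidence matches before feeding them to an LLM. A small sketch — the `filterMatches` helper and the threshold value are illustrative, not part of the Vectorize API:

```javascript
// Keep only matches at or above a similarity threshold.
// `matches` is assumed to have the shape returned by VECTORIZE.query().
function filterMatches(matches, minScore) {
  return matches.filter(m => m.score >= minScore);
}
```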

Reference: Cloudflare Vectorize

AI Gateway

API Management and Monitoring

Route requests to multiple AI providers through a single gateway endpoint, with unified monitoring and controls.

// Request via AI Gateway
const response = await fetch(
  'https://gateway.ai.cloudflare.com/v1/account-id/gateway-name/openai/chat/completions',
  {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${OPENAI_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: 'gpt-4',
      messages: [{ role: 'user', content: 'Hello' }]
    })
  }
);

Key Features

Feature         Description
Caching         Cache results of identical requests
Rate Limiting   Limit API requests
Retry           Auto-retry on failure
Fallback        Switch to alternative provider
Logging         Record all requests

// Fallback configuration
{
  "providers": [
    { "provider": "openai", "model": "gpt-4" },
    { "provider": "anthropic", "model": "claude-3-sonnet" }
  ],
  "fallback": true
}
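The fallback behavior can also be sketched in application code: try providers in order and return the first successful response. A minimal illustration — the `withFallback` helper is hypothetical, not an AI Gateway API; the gateway performs this server-side for you:

```javascript
// Try each async call in order; return the first result that succeeds.
// If every call fails, rethrow the last error.
async function withFallback(callFns) {
  let lastError;
  for (const fn of callFns) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```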

Reference: AI Gateway

AutoRAG (Preview)

Automatic RAG Pipeline

Build a RAG system just by uploading documents.

// AutoRAG configuration
export default {
  async fetch(request, env) {
    // Index documents
    await env.AUTORAG.index({
      content: 'Cloudflare is the world\'s largest edge network...',
      metadata: { source: 'docs', title: 'About Cloudflare' }
    });

    // Answer questions
    const answer = await env.AUTORAG.query({
      question: 'What is Cloudflare?',
      max_tokens: 256
    });

    return new Response(JSON.stringify(answer));
  }
};
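Behind the scenes, RAG pipelines typically split documents into overlapping chunks before embedding them, so that retrieval returns focused passages rather than whole documents. A minimal sketch of that preprocessing step — the `chunkText` helper is illustrative; AutoRAG handles chunking for you:

```javascript
// Split text into fixed-size chunks with overlap between neighbors,
// so sentences cut at a chunk boundary still appear intact in one chunk.
function chunkText(text, size = 200, overlap = 50) {
  const chunks = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break;
  }
  return chunks;
}
```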

Pricing

Workers AI

Plan            Neurons      Price
Free            10,000/day   $0
Pay-as-you-go   Unlimited    $0.011/1,000 neurons
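Under these rates, estimating a daily bill is simple arithmetic: subtract the free allocation, then charge $0.011 per 1,000 remaining neurons. A sketch, assuming the 10,000 free neurons/day also apply on the paid plan:

```javascript
// Estimated daily Workers AI cost in USD:
// (neurons beyond the free allocation) / 1,000 * $0.011
function estimateDailyCost(neuronsPerDay, freeNeurons = 10000, pricePer1k = 0.011) {
  const billable = Math.max(0, neuronsPerDay - freeNeurons);
  return (billable / 1000) * pricePer1k;
}
```

For example, 110,000 neurons in a day leaves 100,000 billable neurons, or about $1.10.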

Vectorize

Item            Free Tier   Paid
Vectors         200,000     Unlimited
Queries/month   30M         $0.01/1M
Storage         1GB         $0.05/GB

Reference: Cloudflare Pricing

Performance

Latency Comparison

Region     Central Server   Cloudflare Edge
Tokyo      200ms            20ms
New York   50ms             15ms
London     100ms            18ms

Throughput

Llama 3.2 3B: ~50 tokens/sec
Mistral 7B: ~30 tokens/sec
Whisper: 2x realtime speed

Implementation Example: RAG Chatbot

export default {
  async fetch(request, env) {
    const { question } = await request.json();

    // 1. Convert question to embedding
    const questionEmbedding = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
      text: question
    });

    // 2. Search related documents
    const docs = await env.VECTORIZE.query(questionEmbedding.data[0], {
      topK: 3,
      returnMetadata: true
    });

    // 3. Generate answer with context
    // (assumes each vector was inserted with its document text in metadata.content)
    const context = docs.matches.map(d => d.metadata.content).join('\n');
    const answer = await env.AI.run('@cf/meta/llama-3.2-3b-instruct', {
      messages: [
        { role: 'system', content: `Answer based on the following information:\n${context}` },
        { role: 'user', content: question }
      ]
    });

    return Response.json({ answer: answer.response });
  }
};

Summary

Cloudflare Workers AI continues to evolve as a strong option for edge AI inference.

  • Diverse Models: LLM, image, audio, embedding
  • Vectorize GA: Official version of vector DB
  • AI Gateway: Multi-provider management
  • Low Latency: Under 20ms worldwide

Worth considering when building serverless, scalable AI applications.
