## What is Cloudflare Workers AI?
Cloudflare Workers AI is a service that runs AI inference at the edge. Because models execute in Cloudflare's data centers worldwide, inference happens close to users, with low latency.
Reference: Cloudflare Workers AI
## New Model Additions
### Available Models (as of late 2024)
| Category | Model | Use Case |
|---|---|---|
| LLM | Llama 3.2 | Text generation |
| LLM | Mistral 7B | Fast inference |
| LLM | Gemma 2 | Multilingual |
| Image | Stable Diffusion XL | Image generation |
| Image | FLUX.1 | High-quality images |
| Audio | Whisper | Speech recognition |
| Embedding | BGE | Vectorization |
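Every model in the table is invoked through the same `env.AI.run(modelId, input)` interface; only the model ID and input shape change. A small dispatch sketch (the catalog IDs below follow Cloudflare's `@cf/vendor/model` naming convention, but they are assumptions here and should be checked against the live model catalog):

```javascript
// Hypothetical lookup from task category to a Workers AI model ID.
// IDs follow the '@cf/<vendor>/<model>' convention; verify against
// the current catalog before using them.
const MODELS = {
  text: '@cf/meta/llama-3.2-3b-instruct',
  embedding: '@cf/baai/bge-base-en-v1.5',
  speech: '@cf/openai/whisper',
  image: '@cf/stabilityai/stable-diffusion-xl-base-1.0'
};

function modelFor(task) {
  const id = MODELS[task];
  if (!id) throw new Error(`No model registered for task: ${task}`);
  return id;
}
```

In a Worker, the result would be passed straight to `env.AI.run(modelFor('text'), input)`.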
### Usage Example
```javascript
// Inference with Workers AI
export default {
  async fetch(request, env) {
    const response = await env.AI.run('@cf/meta/llama-3.2-3b-instruct', {
      messages: [
        { role: 'user', content: 'Tell me about Cloudflare features' }
      ],
      max_tokens: 512
    });
    return new Response(JSON.stringify(response));
  }
};
```
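The `env.AI` binding used above has to be declared in the Worker's configuration. A minimal `wrangler.toml` sketch (the binding name `AI` is assumed to match the code above):

```toml
# wrangler.toml — declare the Workers AI binding exposed as env.AI
[ai]
binding = "AI"
```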
Reference: Workers AI Models
## Vectorize GA (General Availability)
### Vector Database
Vectorize is a vector database: it stores embedding vectors and performs similarity searches over them.
```javascript
// Create the index first (wrangler CLI):
// wrangler vectorize create my-index --dimensions=768 --metric=cosine

// Insert vectors
export default {
  async fetch(request, env) {
    // Convert text to an embedding
    const embedding = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
      text: 'Cloudflare edge computing'
    });
    // Save to Vectorize
    await env.VECTORIZE.insert([{
      id: 'doc-1',
      values: embedding.data[0],
      metadata: { title: 'Cloudflare Edge' }
    }]);
    return new Response('Inserted');
  }
};
```
### Similarity Search
```javascript
// Search for similar documents
const queryEmbedding = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
  text: 'AI inference at the edge'
});
const results = await env.VECTORIZE.query(queryEmbedding.data[0], {
  topK: 5,
  returnMetadata: true
});
// results.matches: [{ id: 'doc-1', score: 0.95, metadata: {...} }, ...]
```
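The `--metric=cosine` flag chosen at index creation determines how `query` scores matches: the score is cosine similarity between the query vector and each stored vector. The metric itself can be sketched in a few lines of standalone JavaScript (an illustration of the formula, not the Vectorize implementation):

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|).
// 1 means identical direction, 0 means orthogonal.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

cosineSimilarity([1, 0], [1, 0]); // → 1 (identical)
cosineSimilarity([1, 0], [0, 1]); // → 0 (orthogonal)
```

Because the score only depends on direction, a vector and any positive scaling of it match perfectly, which is why embeddings are usually compared this way.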
Reference: Cloudflare Vectorize
## AI Gateway
### API Management and Monitoring
AI Gateway provides a single managed endpoint for requests to multiple AI providers.
```javascript
// Request via AI Gateway
const response = await fetch(
  'https://gateway.ai.cloudflare.com/v1/account-id/gateway-name/openai/chat/completions',
  {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${OPENAI_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: 'gpt-4',
      messages: [{ role: 'user', content: 'Hello' }]
    })
  }
);
```
### Key Features
| Feature | Description |
|---|---|
| Caching | Cache results of identical requests |
| Rate Limiting | Limit API requests |
| Retry | Auto-retry on failure |
| Fallback | Switch to alternative provider |
| Logging | Record all requests |
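The fallback feature can be pictured as trying providers in order until one succeeds, which the gateway does on its side. A standalone sketch of that logic (a hypothetical helper, not Gateway code):

```javascript
// Try each provider function in order; return the first success.
// Conceptually mirrors AI Gateway's provider fallback.
async function withFallback(providers) {
  let lastError;
  for (const call of providers) {
    try {
      return await call();
    } catch (err) {
      lastError = err; // this provider failed, try the next one
    }
  }
  throw lastError; // all providers failed
}
```

With the gateway, the same ordering is expressed declaratively in configuration rather than in application code.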
An example fallback configuration:

```json
{
  "providers": [
    { "provider": "openai", "model": "gpt-4" },
    { "provider": "anthropic", "model": "claude-3-sonnet" }
  ],
  "fallback": true
}
```
Reference: AI Gateway
## AutoRAG (Preview)
### Automatic RAG Pipeline
Build a RAG system just by uploading documents.
```javascript
// AutoRAG configuration
export default {
  async fetch(request, env) {
    // Index documents
    await env.AUTORAG.index({
      content: 'Cloudflare is the world\'s largest edge network...',
      metadata: { source: 'docs', title: 'About Cloudflare' }
    });
    // Answer questions
    const answer = await env.AUTORAG.query({
      question: 'What is Cloudflare?',
      max_tokens: 256
    });
    return new Response(JSON.stringify(answer));
  }
};
```
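Behind the scenes, a RAG pipeline splits uploaded documents into chunks before embedding them, since embedding models have input limits and retrieval works better on focused passages. A minimal fixed-size chunker with overlap, illustrating the general technique (AutoRAG's actual chunking strategy is not documented here):

```javascript
// Split text into overlapping fixed-size chunks, a common
// preprocessing step before embedding documents for RAG.
// Overlap keeps sentences that straddle a boundary retrievable.
function chunkText(text, size = 200, overlap = 50) {
  const chunks = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```

Each chunk would then be embedded and inserted into the vector index with its source document recorded in metadata.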
## Pricing
### Workers AI
| Plan | Neurons | Price |
|---|---|---|
| Free | 10,000/day | $0 |
| Pay-as-you-go | Unlimited | $0.011/1,000 neurons |
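Under the pay-as-you-go plan, cost scales linearly with neuron usage beyond the free daily allowance. A quick estimator using the figures from the table:

```javascript
// Estimate daily Workers AI cost from neuron usage, using the
// table's figures: 10,000 free neurons/day, then $0.011 per 1,000.
function dailyCostUSD(neuronsUsed, freePerDay = 10_000, pricePer1k = 0.011) {
  const billable = Math.max(0, neuronsUsed - freePerDay);
  return (billable / 1000) * pricePer1k;
}

dailyCostUSD(10_000);  // → 0 (within the free tier)
dailyCostUSD(110_000); // ~$1.10 (100k billable neurons)
```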
### Vectorize
| Item | Free Tier | Paid |
|---|---|---|
| Vectors | 200,000 | Unlimited |
| Queries/month | 30M | $0.01/1M |
| Storage | 1GB | $0.05/GB |
Reference: Cloudflare Pricing
## Performance
### Latency Comparison
| Region | Central Server | Cloudflare Edge |
|---|---|---|
| Tokyo | 200ms | 20ms |
| New York | 50ms | 15ms |
| London | 100ms | 18ms |
### Throughput
- Llama 3.2 3B: ~50 tokens/sec
- Mistral 7B: ~30 tokens/sec
- Whisper: ~2× realtime
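These throughput figures translate directly into response-time estimates, which matters when sizing `max_tokens` for an interactive app:

```javascript
// Rough completion-time estimate from the throughput figures above.
function estimateSeconds(tokens, tokensPerSec) {
  return tokens / tokensPerSec;
}

estimateSeconds(512, 50); // → 10.24 s for a full 512-token Llama 3.2 3B reply
estimateSeconds(512, 30); // ≈ 17 s for Mistral 7B
```

In practice streaming the response hides most of this wait, since the first tokens arrive almost immediately.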
## Implementation Example: RAG Chatbot
```javascript
export default {
  async fetch(request, env) {
    const { question } = await request.json();

    // 1. Convert the question to an embedding
    const questionEmbedding = await env.AI.run('@cf/baai/bge-base-en-v1.5', {
      text: question
    });

    // 2. Search for related documents
    const docs = await env.VECTORIZE.query(questionEmbedding.data[0], {
      topK: 3,
      returnMetadata: true
    });

    // 3. Generate an answer with the retrieved context
    //    (assumes each document's text was stored in metadata.content at insert time)
    const context = docs.matches.map(d => d.metadata.content).join('\n');
    const answer = await env.AI.run('@cf/meta/llama-3.2-3b-instruct', {
      messages: [
        { role: 'system', content: `Answer based on the following information:\n${context}` },
        { role: 'user', content: question }
      ]
    });
    return Response.json({ answer: answer.response });
  }
};
```
## Summary
Cloudflare Workers AI continues to evolve as a strong option for edge AI inference.
- Diverse Models: LLM, image, audio, embedding
- Vectorize GA: Official version of vector DB
- AI Gateway: Multi-provider management
- Low Latency: Under 20ms worldwide
Worth considering when building serverless, scalable AI applications.