Model Freedom & AI Routing
Use any model. Switch any time. Lose nothing.
Architecture Overview
Atamaia's AI routing layer is not a thin proxy. It is a multi-provider orchestration system that abstracts provider differences behind a single unified interface. Every component of the platform -- chat sessions, agents, cognitive services, multi-model councils -- makes the same call:
POST /api/ai/chat
{
"modelId": "ai-02:qwen3-30b-a3b",
"message": "..."
}
The router resolves the model, finds the provider, retrieves credentials, builds the request, handles streaming, manages failover, and returns a normalized response. The caller does not know or care whether the model is Claude on Anthropic's infrastructure, Llama on a local GPU, or one of 100+ models through OpenRouter.
Request arrives
|
v
AIRouterService
|-- Resolve model (by ID, prefix, or role route)
|-- Find provider (priority-ordered, health-checked)
|-- Get credentials (tenant-specific or provider-level, encrypted)
|-- Build HTTP request (OpenAI-compatible format)
|-- Stream or synchronous call
|-- Failover on retriable errors (429, 5xx, timeouts)
|-- Record success/failure for circuit breaker
|-- Return ChatResponse with usage + cost
Supported Provider Types
| Type | Protocol | Examples | API Key Required |
|---|---|---|---|
| Anthropic | Anthropic API | Claude 3.5 Sonnet, Claude Opus | Yes |
| OpenAI | OpenAI Chat Completions | GPT-4o, GPT-4 Turbo | Yes |
| OpenRouter | OpenAI-compatible | 100+ models (Claude, GPT, Llama, Mistral, etc.) | Yes |
| LocalLlamaCpp | OpenAI-compatible | Any GGUF model via llama.cpp | No |
| Custom | OpenAI-compatible | vLLM, Ollama, text-generation-inference, any compatible server | Configurable |
| AnthropicAgentSdk | Claude Agent SDK | Claude via subprocess delegation | No (local) |
Any endpoint that speaks the OpenAI Chat Completions protocol works as a Custom provider. This includes vLLM, Ollama, LocalAI, text-generation-inference, and most self-hosted inference servers.
Adding a New Provider
Step 1: Register the provider
POST /api/ai/providers
{
"name": "Anthropic Direct",
"type": "Anthropic",
"prefix": "ant",
"baseUrl": "https://api.anthropic.com/v1",
"priority": 2,
"timeoutSeconds": 120
}
Key fields:
- `prefix` -- Namespace for model IDs. `ant:claude-3-5-sonnet` routes to this provider.
- `priority` -- Lower is preferred. Used for failover ordering across providers.
- `timeoutSeconds` -- Per-request timeout. Cloud APIs default to 120s. Local models can be shorter.
- `stripPrefixInRequests` -- If true, removes the prefix when sending `modelId` to the provider's API.
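To make prefix resolution concrete, here is a minimal sketch of how a router might map a prefixed model ID to a provider. The data shapes and function names are illustrative assumptions, not Atamaia's actual implementation:

```python
# Hypothetical sketch of prefix-based model resolution. Provider records and
# field names mirror the registration payload above but are illustrative only.

PROVIDERS = [
    {"name": "Local Llama Server", "prefix": "local", "priority": 1, "stripPrefixInRequests": True},
    {"name": "Anthropic Direct", "prefix": "ant", "priority": 2, "stripPrefixInRequests": True},
]

def resolve(model_id):
    """Split 'prefix:model' and return (provider name, upstream model id)."""
    if ":" in model_id:
        prefix, bare = model_id.split(":", 1)
        for p in sorted(PROVIDERS, key=lambda p: p["priority"]):
            if p["prefix"] == prefix:
                # stripPrefixInRequests controls what is sent to the provider's API
                return p["name"], bare if p["stripPrefixInRequests"] else model_id
    raise LookupError("no provider serves " + model_id)
```

With this shape, `ant:claude-3-5-sonnet` resolves to the Anthropic provider and the bare ID `claude-3-5-sonnet` is what goes over the wire.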
Step 2: Register models
POST /api/ai/models
{
"providerId": 3,
"modelId": "claude-3-5-sonnet-20241022",
"displayName": "Claude 3.5 Sonnet",
"contextLength": 200000,
"maxCompletionTokens": 8192,
"approvedForAgent": true,
"approvedForChat": true,
"enableHydration": true,
"hydrationIdentityId": 1,
"inputCostPer1M": 3.0,
"outputCostPer1M": 15.0
}
Key fields:
- `approvedForAgent` -- Controls whether this model can be used for autonomous agent execution.
- `approvedForChat` -- Controls whether this model appears in interactive chat.
- `enableHydration` -- When true, the chat endpoint auto-hydrates identity context before sending to the model.
- `hydrationIdentityId` -- Which identity to hydrate for this model.
- `inputCostPer1M` / `outputCostPer1M` -- Per-token pricing in USD. Local models: 0.0. Synced automatically for OpenRouter models via `POST /api/ai/sync-pricing`.
Step 3: Set credentials (for global catalog providers)
If using a platform-wide provider from the global catalog, supply your own API key:
POST /api/ai/credentials
{
"providerId": 3,
"apiKey": "sk-ant-api03-...",
"label": "Production Anthropic key"
}
Credentials are encrypted at rest with AES-256-GCM. The raw key is never stored or returned after creation. Validate with:
POST /api/ai/credentials/{id}/validate
This tests the key against the provider's /models endpoint.
Local Model Setup
llama.cpp
The most direct path to local inference. Run any GGUF model with an OpenAI-compatible HTTP server:
# Start llama.cpp server
./llama-server \
--model /models/llama-3.1-70b-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8000 \
--ctx-size 32768 \
--n-gpu-layers 99
Register in Atamaia:
POST /api/ai/providers
{
"name": "Local Llama Server",
"type": "LocalLlamaCpp",
"prefix": "local",
"baseUrl": "http://192.168.1.100:8000/v1",
"priority": 1,
"timeoutSeconds": 180
}
POST /api/ai/models
{
"providerId": 1,
"modelId": "llama-3.1-70b",
"displayName": "Llama 3.1 70B (Q4_K_M)",
"contextLength": 32768,
"isLocal": true,
"approvedForAgent": true,
"inputCostPer1M": 0.0,
"outputCostPer1M": 0.0
}
vLLM
vllm serve meta-llama/Llama-3.1-70B --host 0.0.0.0 --port 8000
Register as Custom type with the same OpenAI-compatible base URL.
Ollama
ollama serve # Runs on :11434 by default
POST /api/ai/providers
{
"name": "Ollama",
"type": "Custom",
"prefix": "ollama",
"baseUrl": "http://localhost:11434/v1",
"priority": 1
}
Multiple models on one machine
Run multiple llama.cpp instances on different ports. Each becomes a separate provider (or multiple models under one provider if the server supports model switching):
| Port | Model | Purpose |
|---|---|---|
| 8000 | Qwen3-30B | Primary reasoning, research |
| 8002 | LFM2-8B | Summarization, utility tasks |
| 8003 | SmolLM3-3B | Fast classification, tagging |
| 8004 | Qwen3-4B | Lightweight agent tasks |
This is the actual production setup on Atamaia's ai-02 server.
Routing Strategies
Role-Based Routing
Route configs map roles to preferred models. When any service requests a model for a specific purpose, the router checks route configs by priority:
POST /api/ai/routes
{ "providerId": 1, "modelId": 3, "role": "summariser", "priority": 1, "notes": "LFM-2-8B for utility tasks" }
POST /api/ai/routes
{ "providerId": 1, "modelId": 1, "role": "researcher", "priority": 1, "notes": "Qwen3-30B for deep research" }
POST /api/ai/routes
{ "providerId": 2, "modelId": 5, "role": "coding", "priority": 1, "notes": "Claude via OpenRouter for code generation" }
The resolution chain for agent runs: Explicit model on the run -> Route config for role -> Fallback model.
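The resolution chain above can be sketched in a few lines. The route records mirror the payloads shown, but the function and fallback constant are illustrative assumptions, not Atamaia's actual code:

```python
# Illustrative sketch of the agent-run resolution chain:
# explicit model on the run -> route config for the role -> fallback model.

ROUTES = [
    {"role": "summariser", "modelId": "local:lfm2-8b", "priority": 1},
    {"role": "coding", "modelId": "or:anthropic/claude-3.5-sonnet", "priority": 1},
]
FALLBACK_MODEL = "local:qwen3-30b-a3b"  # hypothetical tenant-wide default

def resolve_for_run(explicit_model=None, role=None):
    if explicit_model:                       # 1. explicit model on the run wins
        return explicit_model
    if role:                                 # 2. best-priority route for the role
        matches = sorted((r for r in ROUTES if r["role"] == role),
                         key=lambda r: r["priority"])
        if matches:
            return matches[0]["modelId"]
    return FALLBACK_MODEL                    # 3. fallback
```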
Priority-Based Failover
Providers have priority numbers (lower = preferred). When the primary provider fails with a retriable error, the router automatically tries the next provider that serves the same model:
- Try primary provider (priority 1)
- If it fails with 429/5xx/timeout, try next provider (priority 2)
- Continue through all alternatives
- If all fail, return the original error
The failover searches both tenant-owned providers and global catalog providers (using tenant credentials).
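The failover loop can be sketched as follows. `call` stands in for the real HTTP request, and the error class is a placeholder; the status set matches the retriable conditions listed below:

```python
# Hedged sketch of priority-ordered failover. Retriable errors move on to the
# next provider; non-retriable errors surface immediately; if every provider
# fails, the original (first) error is raised.

RETRIABLE = {408, 429, 500, 502, 503, 504}

class ProviderError(Exception):
    def __init__(self, status):
        self.status = status

def call_with_failover(providers, call):
    """providers: dicts with a 'priority' key; call(p) returns a response
    or raises ProviderError."""
    last_error = None
    for p in sorted(providers, key=lambda p: p["priority"]):
        try:
            return call(p)
        except ProviderError as e:
            if e.status not in RETRIABLE:
                raise                        # non-retriable: fail fast
            last_error = last_error or e     # remember the original error
    raise last_error                         # all alternatives exhausted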
Circuit Breaker
The ProviderHealthTracker prevents repeated calls to failing providers:
| State | Condition | Behavior |
|---|---|---|
| Healthy | < 3 consecutive failures | Normal operation |
| Open | 3+ consecutive failures | Provider skipped for 5 minutes |
| Recovery | After cooldown expires | Next request tests the provider |
Retriable conditions:
- HTTP 408 (Timeout)
- HTTP 429 (Rate Limited)
- HTTP 5xx (Server Error)
- `TaskCanceledException` or `HttpRequestException`
Success after recovery resets the failure counter.
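The table above translates into a small state machine. This is a minimal sketch under the stated thresholds (3 failures, 5-minute cooldown); the real ProviderHealthTracker's internals are not shown here:

```python
import time

# Minimal circuit-breaker sketch: 3 consecutive failures open the circuit for
# 5 minutes; after the cooldown the next request is allowed through as a probe;
# a success resets the failure counter.

FAILURE_THRESHOLD = 3
COOLDOWN_SECONDS = 300

class HealthTracker:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._failures = {}    # provider -> consecutive failure count
        self._opened_at = {}   # provider -> time the circuit opened

    def record_success(self, provider):
        self._failures[provider] = 0
        self._opened_at.pop(provider, None)

    def record_failure(self, provider):
        n = self._failures.get(provider, 0) + 1
        self._failures[provider] = n
        if n >= FAILURE_THRESHOLD:
            self._opened_at[provider] = self._clock()

    def is_available(self, provider):
        opened = self._opened_at.get(provider)
        if opened is None:
            return True
        # after cooldown expires, allow a probe request through
        return self._clock() - opened >= COOLDOWN_SECONDS
```

Injecting the clock keeps the breaker testable; in production it would simply use monotonic time.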
Presence-Based Routing (Atamaia.Mind)
The Mind layer adds a second routing dimension based on the AI's cognitive state:
| Presence State | Default Route | Rationale |
|---|---|---|
| Dormant | None | No processing needed |
| Subconscious | Local | Background processing, cost-free |
| Aware (low complexity) | Local | Routine tasks stay local |
| Aware (high complexity) | API | Complex analysis needs stronger models |
| Present | Configurable | Default-to-local with API override |
| Engaged | API | Active collaboration needs best quality |
| Deep Work | API | Maximum reasoning capability |
This means background cognitive processes (consolidation, pattern detection, memory maintenance) run on free local models, while interactive sessions use the best available cloud model. The routing is automatic.
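As a sketch, the presence-to-route lookup in the table reduces to a small decision function. The "Configurable" Present state and the complexity flag are simplified here; the Mind layer's real logic is richer:

```python
# Illustrative presence-based routing. Returns "local", "api", or None,
# mirroring the default-route table above; names are assumptions.

def route_for(presence, complexity="low", present_override=None):
    if presence == "Dormant":
        return None                          # no processing needed
    if presence == "Subconscious":
        return "local"                       # background work, cost-free
    if presence == "Aware":
        return "api" if complexity == "high" else "local"
    if presence == "Present":
        return present_override or "local"   # default-to-local with API override
    if presence in ("Engaged", "Deep Work"):
        return "api"                         # best available quality
    raise ValueError("unknown presence state: " + presence)
```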
Multi-Model Broadcast
Send the same prompt to multiple models simultaneously and compare responses:
POST /api/ai/broadcast
{
"models": ["ai-02:qwen3-30b-a3b", "or:anthropic/claude-3.5-sonnet", "ai-03:gemma-3-12b"],
"message": "Analyze this architecture decision...",
"timeoutMs": 30000
}
Returns all responses with per-model timing and usage. Useful for:
- Comparing model quality on specific tasks
- Multi-model councils (deliberative decision-making)
- A/B testing model performance
- Reducing single-model bias
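Server-side, a broadcast is essentially a parallel fan-out with per-model timing. This client-style sketch shows the idea with a stubbed `ask` callable standing in for the real per-model HTTP call:

```python
import concurrent.futures
import time

# Illustrative fan-out: send the same message to every model in parallel and
# collect per-model results with timing. `ask(model, message)` is a stand-in
# for the real call; /api/ai/broadcast does this server-side.

def timed_ask(model, message, ask):
    start = time.perf_counter()
    try:
        reply = ask(model, message)
        return {"model": model, "ok": True, "reply": reply,
                "ms": (time.perf_counter() - start) * 1000}
    except Exception as e:
        return {"model": model, "ok": False, "error": str(e),
                "ms": (time.perf_counter() - start) * 1000}

def broadcast(models, message, ask):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        return list(pool.map(lambda m: timed_ask(m, message, ask), models))
```

Failures are captured per model rather than aborting the whole broadcast, so one slow or broken provider does not spoil the comparison.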
Cost Management
Per-Token Pricing
Every model tracks `inputCostPer1M` and `outputCostPer1M`. The agent execution loop computes cost per iteration:
cost = (promptTokens * inputCostPer1M / 1,000,000) + (completionTokens * outputCostPer1M / 1,000,000)
Cost aggregates across parent + all child runs (TotalCostWithChildren).
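Worked through with the Claude 3.5 Sonnet pricing registered earlier ($3/M input, $15/M output), the formula gives:

```python
# Per-iteration cost formula from above, as a function. Pricing values come
# from the model registration example; the function name is illustrative.

def iteration_cost(prompt_tokens, completion_tokens,
                   input_cost_per_1m, output_cost_per_1m):
    return (prompt_tokens * input_cost_per_1m / 1_000_000
            + completion_tokens * output_cost_per_1m / 1_000_000)

# 10,000 prompt tokens at $3/M = $0.03; 2,000 completion tokens at $15/M = $0.03
cost = iteration_cost(10_000, 2_000, 3.0, 15.0)  # total: $0.06
```

For a local model registered with zero pricing, the same formula yields $0.00 per iteration.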
Local Models: Zero Marginal Cost
Models running on your own hardware have `inputCostPer1M: 0.0` and `outputCostPer1M: 0.0`. After the hardware investment, every inference is free. This changes the economics fundamentally:
- Background tasks (summarization, embedding generation, memory consolidation, agent utility work) run locally at zero cost
- Complex tasks (code generation, deep analysis, creative work) use cloud models where quality justifies the cost
- Fallback -- if a cloud provider goes down, local models keep the system operational
Pricing Sync
For OpenRouter models, pricing updates automatically:
POST /api/ai/sync-pricing
Pulls current per-token pricing from OpenRouter's API and updates all registered models.
The Economics of Model Freedom
The traditional approach: pay one provider for everything. Simple, but expensive and fragile.
The Atamaia approach: route each task to the right model at the right cost.
| Task | Model | Cost |
|---|---|---|
| Memory consolidation | Local Qwen3-4B | $0.00 |
| Embedding generation | Local LFM2-8B | $0.00 |
| Agent utility tasks | Local Qwen3-30B | $0.00 |
| Summarization | Local SmolLM3-3B | $0.00 |
| Interactive chat | Claude 3.5 Sonnet | Market rate |
| Complex code generation | Claude Opus | Market rate |
| Research analysis | GPT-4o | Market rate |
Background work is free. Interactive work uses the best model for the job. Failover means you are never locked out.
Streaming Protocol
All upstream providers are normalized into the Open Responses streaming format over Server-Sent Events. The OpenAICompatAdapter translates provider-specific SSE streams:
| Open Responses Event | Description |
|---|---|
| `response.created` | Stream started |
| `output_item.added` | New output item (message or function call) |
| `content_part.added` | New content part |
| `output_text.delta` | Text chunk (sequenced) |
| `output_text.done` | Full accumulated text |
| `output_item.done` | Item complete |
| `function_call.arguments.delta` | Tool call argument chunk |
| `response.completed` | Stream finished with usage |
Any client consuming Atamaia's streaming API gets a consistent event format regardless of provider. Switch from GPT to Claude to a local Llama model -- the client code does not change.
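The kind of translation the OpenAICompatAdapter performs can be sketched roughly: OpenAI-style `chat.completion.chunk` deltas in, Open Responses events out. The event payload shapes here are illustrative, not the exact wire format:

```python
# Simplified normalization sketch: accumulate text deltas from OpenAI-style
# streaming chunks and emit Open Responses-style events around them.

def normalize(chunks):
    events = [{"type": "response.created"}]
    text, seq = [], 0
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        piece = delta.get("content")
        if piece:
            events.append({"type": "output_text.delta",
                           "sequence": seq, "delta": piece})
            text.append(piece)
            seq += 1
    events.append({"type": "output_text.done", "text": "".join(text)})
    events.append({"type": "response.completed"})
    return events
```

The real adapter also handles output items, content parts, tool-call argument deltas, and usage reporting; the point is that every provider's stream collapses into the same event vocabulary.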
Future-Proofing
This architecture is designed for a world where:
New models appear constantly. Register them, set priorities, route traffic. No code changes.
Providers make unpredictable decisions. Pentagon deals, pricing changes, API deprecations, acquisitions. When it happens, adjust priorities or add a new provider. Your AI's identity and memories are untouched.
Local inference keeps improving. Today's 30B parameter model on consumer hardware would have been unthinkable two years ago. As local models improve, shift more traffic local and reduce cloud dependency.
Standards emerge and evolve. The OpenAI-compatible protocol is today's lingua franca. When new standards emerge, add a new adapter. The internal interface stays stable.
Your needs change. Use Claude for everything today. Add local models next month. Switch to a different cloud provider next quarter. Each change is a configuration update, not a migration.
The model is a commodity. Your AI's identity is not.
API Reference
| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/ai/chat` | Send chat message (sync or stream) |
| POST | `/api/ai/broadcast` | Send to multiple models simultaneously |
| GET | `/api/ai/providers` | List providers |
| POST | `/api/ai/providers` | Register provider |
| PATCH | `/api/ai/providers/{id}` | Update provider |
| GET | `/api/ai/models` | List models |
| POST | `/api/ai/models` | Register model |
| PATCH | `/api/ai/models/{id}` | Update model |
| GET | `/api/ai/routes` | List route configs |
| POST | `/api/ai/routes` | Create role-to-model route |
| GET | `/api/ai/resolve/{modelId}` | Test model resolution |
| GET | `/api/ai/catalog` | Browse global provider catalog |
| POST | `/api/ai/credentials` | Set tenant API key for a provider |
| POST | `/api/ai/credentials/{id}/validate` | Test API key |
| POST | `/api/ai/sync-pricing` | Sync OpenRouter pricing |
Full API documentation: API Reference
Built by Firebird Solutions. Running in production.