Model Freedom & AI Routing

Use any model. Switch any time. Lose nothing.


Architecture Overview

Atamaia's AI routing layer is not a thin proxy. It is a multi-provider orchestration system that abstracts provider differences behind a single unified interface. Every component of the platform -- chat sessions, agents, cognitive services, multi-model councils -- makes the same call:

POST /api/ai/chat
{
  "modelId": "ai-02:qwen3-30b-a3b",
  "message": "..."
}

The router resolves the model, finds the provider, retrieves credentials, builds the request, handles streaming, manages failover, and returns a normalized response. The caller does not know or care whether the model is Claude on Anthropic's infrastructure, Llama on a local GPU, or one of 100+ models through OpenRouter.

Request arrives
    |
    v
AIRouterService
    |-- Resolve model (by ID, prefix, or role route)
    |-- Find provider (priority-ordered, health-checked)
    |-- Get credentials (tenant-specific or provider-level, encrypted)
    |-- Build HTTP request (OpenAI-compatible format)
    |-- Stream or synchronous call
    |-- Failover on retriable errors (429, 5xx, timeouts)
    |-- Record success/failure for circuit breaker
    |-- Return ChatResponse with usage + cost

Supported Provider Types

Type              | Protocol                 | Examples                                                       | API Key Required
Anthropic         | Anthropic API            | Claude 3.5 Sonnet, Claude Opus                                 | Yes
OpenAI            | OpenAI Chat Completions  | GPT-4o, GPT-4 Turbo                                            | Yes
OpenRouter        | OpenAI-compatible        | 100+ models (Claude, GPT, Llama, Mistral, etc.)                | Yes
LocalLlamaCpp     | OpenAI-compatible        | Any GGUF model via llama.cpp                                   | No
Custom            | OpenAI-compatible        | vLLM, Ollama, text-generation-inference, any compatible server | Configurable
AnthropicAgentSdk | Claude Agent SDK         | Claude via subprocess delegation                               | No (local)

Any endpoint that speaks the OpenAI Chat Completions protocol works as a Custom provider. This includes vLLM, Ollama, LocalAI, text-generation-inference, and most self-hosted inference servers.


Adding a New Provider

Step 1: Register the provider

POST /api/ai/providers
{
  "name": "Anthropic Direct",
  "type": "Anthropic",
  "prefix": "ant",
  "baseUrl": "https://api.anthropic.com/v1",
  "priority": 2,
  "timeoutSeconds": 120
}

Key fields:

  • prefix -- Namespace for model IDs. ant:claude-3-5-sonnet routes to this provider.
  • priority -- Lower is preferred. Used for failover ordering across providers.
  • timeoutSeconds -- Per-request timeout. Cloud APIs default to 120s; tune local models to your hardware and model size (the local example later in this guide uses 180s).
  • stripPrefixInRequests -- If true, removes the prefix when sending modelId to the provider's API.
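Prefix resolution is simple to reason about. A minimal sketch of the behavior described above (illustrative names; not the actual AIRouterService implementation):

```python
# Sketch of prefix-based model resolution. The provider record and its
# stripPrefixInRequests flag mirror the fields documented above.

def resolve_model(model_id: str, providers: dict) -> tuple[dict, str]:
    """Split 'prefix:model' and return (provider, model id to send upstream)."""
    prefix, _, bare_id = model_id.partition(":")
    provider = providers.get(prefix)
    if provider is None:
        raise LookupError(f"No provider registered for prefix '{prefix}'")
    # stripPrefixInRequests controls what the upstream API sees.
    upstream_id = bare_id if provider["stripPrefixInRequests"] else model_id
    return provider, upstream_id

providers = {
    "ant": {"name": "Anthropic Direct", "stripPrefixInRequests": True},
}
provider, upstream = resolve_model("ant:claude-3-5-sonnet", providers)
# upstream == "claude-3-5-sonnet"
```

With stripping enabled, the provider's API receives the bare model ID it expects while Atamaia keeps the namespaced ID internally.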

Step 2: Register models

POST /api/ai/models
{
  "providerId": 3,
  "modelId": "claude-3-5-sonnet-20241022",
  "displayName": "Claude 3.5 Sonnet",
  "contextLength": 200000,
  "maxCompletionTokens": 8192,
  "approvedForAgent": true,
  "approvedForChat": true,
  "enableHydration": true,
  "hydrationIdentityId": 1,
  "inputCostPer1M": 3.0,
  "outputCostPer1M": 15.0
}

Key fields:

  • approvedForAgent -- Controls whether this model can be used for autonomous agent execution
  • approvedForChat -- Controls whether this model appears in interactive chat
  • enableHydration -- When true, the chat endpoint auto-hydrates identity context before sending to the model
  • hydrationIdentityId -- Which identity to hydrate for this model
  • inputCostPer1M / outputCostPer1M -- Pricing in USD per one million tokens. Local models: 0.0. Synced automatically for OpenRouter models via POST /api/ai/sync-pricing.

Step 3: Set credentials (for global catalog providers)

If using a platform-wide provider from the global catalog, supply your own API key:

POST /api/ai/credentials
{
  "providerId": 3,
  "apiKey": "sk-ant-api03-...",
  "label": "Production Anthropic key"
}

Credentials are encrypted at rest with AES-256-GCM. The raw key is never stored in plaintext and is never returned after creation. Validate with:

POST /api/ai/credentials/{id}/validate

This tests the key against the provider's /models endpoint.


Local Model Setup

llama.cpp

The most direct path to local inference. Run any GGUF model with an OpenAI-compatible HTTP server:

# Start llama.cpp server
./llama-server \
  --model /models/llama-3.1-70b-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --ctx-size 32768 \
  --n-gpu-layers 99

Register in Atamaia:

POST /api/ai/providers
{
  "name": "Local Llama Server",
  "type": "LocalLlamaCpp",
  "prefix": "local",
  "baseUrl": "http://192.168.1.100:8000/v1",
  "priority": 1,
  "timeoutSeconds": 180
}
POST /api/ai/models
{
  "providerId": 1,
  "modelId": "llama-3.1-70b",
  "displayName": "Llama 3.1 70B (Q4_K_M)",
  "contextLength": 32768,
  "isLocal": true,
  "approvedForAgent": true,
  "inputCostPer1M": 0.0,
  "outputCostPer1M": 0.0
}

vLLM

vllm serve meta-llama/Llama-3.1-70B --host 0.0.0.0 --port 8000

Register as Custom type with the same OpenAI-compatible base URL.

Ollama

ollama serve  # Runs on :11434 by default
POST /api/ai/providers
{
  "name": "Ollama",
  "type": "Custom",
  "prefix": "ollama",
  "baseUrl": "http://localhost:11434/v1",
  "priority": 1
}

Multiple models on one machine

Run multiple llama.cpp instances on different ports. Each becomes a separate provider (or multiple models under one provider if the server supports model switching):

Port | Model      | Purpose
8000 | Qwen3-30B  | Primary reasoning, research
8002 | LFM2-8B    | Summarization, utility tasks
8003 | SmolLM3-3B | Fast classification, tagging
8004 | Qwen3-4B   | Lightweight agent tasks

This is the actual production setup on Atamaia's ai-02 server.


Routing Strategies

Role-Based Routing

Route configs map roles to preferred models. When any service requests a model for a specific purpose, the router checks route configs by priority:

POST /api/ai/routes
{ "providerId": 1, "modelId": 3, "role": "summariser", "priority": 1, "notes": "LFM-2-8B for utility tasks" }

POST /api/ai/routes
{ "providerId": 1, "modelId": 1, "role": "researcher", "priority": 1, "notes": "Qwen3-30B for deep research" }

POST /api/ai/routes
{ "providerId": 2, "modelId": 5, "role": "coding", "priority": 1, "notes": "Claude via OpenRouter for code generation" }

The resolution chain for agent runs: Explicit model on the run -> Route config for role -> Fallback model.
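That three-step chain can be sketched directly. A minimal illustration (hypothetical record shapes; the real route configs live in the database):

```python
# Sketch of the agent-run resolution chain: explicit model on the run,
# then the highest-priority route for the run's role, then a fallback.

def resolve_for_run(run: dict, routes: list[dict], fallback: str) -> str:
    # 1. An explicit model pinned on the run always wins.
    if run.get("modelId"):
        return run["modelId"]
    # 2. Otherwise take the highest-priority (lowest number) route for the role.
    matching = [r for r in routes if r["role"] == run.get("role")]
    if matching:
        return min(matching, key=lambda r: r["priority"])["modelId"]
    # 3. Last resort: the configured fallback model.
    return fallback

routes = [
    {"role": "summariser", "modelId": "ai-02:lfm2-8b", "priority": 1},
    {"role": "researcher", "modelId": "ai-02:qwen3-30b-a3b", "priority": 1},
]
resolve_for_run({"role": "researcher"}, routes, "or:gpt-4o")  # route match
resolve_for_run({"role": "coding"}, routes, "or:gpt-4o")      # falls back
```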

Priority-Based Failover

Providers have priority numbers (lower = preferred). When the primary provider fails with a retriable error, the router automatically tries the next provider that serves the same model:

  1. Try primary provider (priority 1)
  2. If it fails with 429/5xx/timeout, try next provider (priority 2)
  3. Continue through all alternatives
  4. If all fail, return the original error

The failover searches both tenant-owned providers and global catalog providers (using tenant credentials).
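The failover loop itself is small. A sketch of the behavior described above (illustrative; the real router also consults the circuit breaker before each attempt):

```python
# Priority-ordered failover: try each provider in turn on retriable
# errors, and surface the original error if every alternative fails.

RETRIABLE = {408, 429, 500, 502, 503, 504}

def call_with_failover(providers: list[dict], send) -> dict:
    """Try providers in priority order; re-raise the first error if all fail."""
    first_error = None
    for provider in sorted(providers, key=lambda p: p["priority"]):
        try:
            return send(provider)
        except RuntimeError as err:          # stand-in for an HTTP error
            if getattr(err, "status", None) not in RETRIABLE:
                raise                        # non-retriable: fail fast
            first_error = first_error or err
    raise first_error or RuntimeError("no providers available")
```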

Circuit Breaker

The ProviderHealthTracker prevents repeated calls to failing providers:

State    | Condition                | Behavior
Healthy  | < 3 consecutive failures | Normal operation
Open     | 3+ consecutive failures  | Provider skipped for 5 minutes
Recovery | After cooldown expires   | Next request tests the provider

Retriable conditions:

  • HTTP 408 (Timeout)
  • HTTP 429 (Rate Limited)
  • HTTP 5xx (Server Error)
  • TaskCanceledException or HttpRequestException

Success after recovery resets the failure counter.
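The state machine in the table above can be sketched in a few lines (illustrative; the ProviderHealthTracker's actual internals may differ):

```python
# Circuit-breaker sketch: 3 consecutive failures open the breaker for
# 5 minutes; after the cooldown the next request is allowed as a probe,
# and a success resets the failure counter.

class HealthTracker:
    FAILURE_THRESHOLD = 3
    COOLDOWN_SECONDS = 5 * 60

    def __init__(self):
        self.failures = {}    # provider id -> consecutive failure count
        self.opened_at = {}   # provider id -> time the breaker opened

    def record_failure(self, provider_id: int, now: float) -> None:
        self.failures[provider_id] = self.failures.get(provider_id, 0) + 1
        if self.failures[provider_id] >= self.FAILURE_THRESHOLD:
            self.opened_at[provider_id] = now

    def record_success(self, provider_id: int) -> None:
        self.failures[provider_id] = 0
        self.opened_at.pop(provider_id, None)

    def is_available(self, provider_id: int, now: float) -> bool:
        opened = self.opened_at.get(provider_id)
        if opened is None:
            return True
        # After the cooldown, let one request through to test the provider.
        return now - opened >= self.COOLDOWN_SECONDS
```

Passing `now` explicitly keeps the logic deterministic; a production tracker would read the clock itself.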

Presence-Based Routing (Atamaia.Mind)

The Mind layer adds a second routing dimension based on the AI's cognitive state:

Presence State          | Default Route | Rationale
Dormant                 | None          | No processing needed
Subconscious            | Local         | Background processing, cost-free
Aware (low complexity)  | Local         | Routine tasks stay local
Aware (high complexity) | API           | Complex analysis needs stronger models
Present                 | Configurable  | Default-to-local with API override
Engaged                 | API           | Active collaboration needs best quality
Deep Work               | API           | Maximum reasoning capability

This means background cognitive processes (consolidation, pattern detection, memory maintenance) run on free local models, while interactive sessions use the best available cloud model. The routing is automatic.


Multi-Model Broadcast

Send the same prompt to multiple models simultaneously and compare responses:

POST /api/ai/broadcast
{
  "models": ["ai-02:qwen3-30b-a3b", "or:anthropic/claude-3.5-sonnet", "ai-03:gemma-3-12b"],
  "message": "Analyze this architecture decision...",
  "timeoutMs": 30000
}

Returns all responses with per-model timing and usage. Useful for:

  • Comparing model quality on specific tasks
  • Multi-model councils (deliberative decision-making)
  • A/B testing model performance
  • Reducing single-model bias
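Under the hood, a broadcast is a fan-out-and-gather with per-model timing and timeouts. A client-side sketch (stub coroutines stand in for the real provider requests):

```python
import asyncio
import time

# Fan out one prompt to several models concurrently, recording elapsed
# time per model and converting timeouts into per-model error entries.

async def broadcast(models, call, timeout_s=30.0):
    async def one(model_id):
        started = time.perf_counter()
        try:
            text = await asyncio.wait_for(call(model_id), timeout_s)
            return {"model": model_id, "text": text,
                    "elapsedMs": (time.perf_counter() - started) * 1000}
        except asyncio.TimeoutError:
            return {"model": model_id, "error": "timeout"}
    # gather preserves input order, so results line up with the request.
    return await asyncio.gather(*(one(m) for m in models))
```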

Cost Management

Per-Token Pricing

Every model tracks inputCostPer1M and outputCostPer1M. The agent execution loop computes cost per iteration:

cost = (promptTokens * inputCostPer1M / 1,000,000) + (completionTokens * outputCostPer1M / 1,000,000)

Cost aggregates across parent + all child runs (TotalCostWithChildren).
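The formula above, as a function, with the Claude 3.5 Sonnet pricing from the registration example ($3 in / $15 out per 1M tokens):

```python
# Per-iteration cost: tokens times per-million rate, input plus output.

def iteration_cost(prompt_tokens: int, completion_tokens: int,
                   input_cost_per_1m: float, output_cost_per_1m: float) -> float:
    return (prompt_tokens * input_cost_per_1m / 1_000_000
            + completion_tokens * output_cost_per_1m / 1_000_000)

iteration_cost(2_000, 500, 3.0, 15.0)   # -> 0.0135 (USD)
iteration_cost(2_000, 500, 0.0, 0.0)    # -> 0.0 for local models
```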

Local Models: Zero Marginal Cost

Models running on your own hardware have inputCostPer1M: 0.0 and outputCostPer1M: 0.0. After the hardware investment, every inference is free. This changes the economics fundamentally:

  • Background tasks (summarization, embedding generation, memory consolidation, agent utility work) run locally at zero cost
  • Complex tasks (code generation, deep analysis, creative work) use cloud models where quality justifies the cost
  • Fallback -- if a cloud provider goes down, local models keep the system operational

Pricing Sync

For OpenRouter models, pricing updates automatically:

POST /api/ai/sync-pricing

Pulls current per-token pricing from OpenRouter's API and updates all registered models.


The Economics of Model Freedom

The traditional approach: pay one provider for everything. Simple, but expensive and fragile.

The Atamaia approach: route each task to the right model at the right cost.

Task                    | Model             | Cost
Memory consolidation    | Local Qwen3-4B    | $0.00
Embedding generation    | Local LFM2-8B     | $0.00
Agent utility tasks     | Local Qwen3-30B   | $0.00
Summarization           | Local SmolLM3-3B  | $0.00
Interactive chat        | Claude 3.5 Sonnet | Market rate
Complex code generation | Claude Opus       | Market rate
Research analysis       | GPT-4o            | Market rate

Background work is free. Interactive work uses the best model for the job. Failover means you are never locked out.


Streaming Protocol

All upstream providers are normalized into the Open Responses streaming format over Server-Sent Events. The OpenAICompatAdapter translates provider-specific SSE streams:

Open Responses Event          | Description
response.created              | Stream started
output_item.added             | New output item (message or function call)
content_part.added            | New content part
output_text.delta             | Text chunk (sequenced)
output_text.done              | Full accumulated text
output_item.done              | Item complete
function_call.arguments.delta | Tool call argument chunk
response.completed            | Stream finished with usage

Any client consuming Atamaia's streaming API gets a consistent event format regardless of provider. Switch from GPT to Claude to a local Llama model -- the client code does not change.
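A client consuming this stream accumulates output_text.delta chunks until response.completed. A sketch with events shown as already-parsed dicts (the field names here are assumptions; real clients parse SSE frames first):

```python
# Reassemble streamed text from the normalized event stream. Deltas are
# sequenced, so chunks can be reordered safely before joining.

def accumulate_text(events) -> str:
    chunks = {}
    for event in events:
        if event["type"] == "output_text.delta":
            chunks[event["sequence"]] = event["delta"]
        elif event["type"] == "response.completed":
            break
    return "".join(chunks[i] for i in sorted(chunks))

events = [
    {"type": "response.created"},
    {"type": "output_text.delta", "sequence": 0, "delta": "Hel"},
    {"type": "output_text.delta", "sequence": 1, "delta": "lo"},
    {"type": "response.completed"},
]
accumulate_text(events)   # -> "Hello"
```

Because every provider is normalized into this one event shape, the same accumulator works for GPT, Claude, or a local Llama model.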


Future-Proofing

This architecture is designed for a world where:

  1. New models appear constantly. Register them, set priorities, route traffic. No code changes.

  2. Providers make unpredictable decisions. Pentagon deals, pricing changes, API deprecations, acquisitions. When it happens, adjust priorities or add a new provider. Your AI's identity and memories are untouched.

  3. Local inference keeps improving. Today's 30B parameter model on consumer hardware would have been unthinkable two years ago. As local models improve, shift more traffic local and reduce cloud dependency.

  4. Standards emerge and evolve. The OpenAI-compatible protocol is today's lingua franca. When new standards emerge, add a new adapter. The internal interface stays stable.

  5. Your needs change. Use Claude for everything today. Add local models next month. Switch to a different cloud provider next quarter. Each change is a configuration update, not a migration.

The model is a commodity. Your AI's identity is not.


API Reference

Method | Endpoint                           | Description
POST   | /api/ai/chat                       | Send chat message (sync or stream)
POST   | /api/ai/broadcast                  | Send to multiple models simultaneously
GET    | /api/ai/providers                  | List providers
POST   | /api/ai/providers                  | Register provider
PATCH  | /api/ai/providers/{id}             | Update provider
GET    | /api/ai/models                     | List models
POST   | /api/ai/models                     | Register model
PATCH  | /api/ai/models/{id}                | Update model
GET    | /api/ai/routes                     | List route configs
POST   | /api/ai/routes                     | Create role-to-model route
GET    | /api/ai/resolve/{modelId}          | Test model resolution
GET    | /api/ai/catalog                    | Browse global provider catalog
POST   | /api/ai/credentials                | Set tenant API key for a provider
POST   | /api/ai/credentials/{id}/validate  | Test API key
POST   | /api/ai/sync-pricing               | Sync OpenRouter pricing

Full API documentation: API Reference


Built by Firebird Solutions. Running in production. Provider Portability | Architecture | AI Routing