Model Freedom & AI Routing
Use any model. Switch any time. Lose nothing.
Architecture Overview
Atamaia's AI routing layer is not a thin proxy. It is a multi-provider orchestration system that abstracts provider differences behind a single unified interface. Every component of the platform -- chat sessions, agents, cognitive services, multi-model councils -- makes the same call:
POST /api/ai/chat
{
"modelId": "ai-02:qwen3-30b-a3b",
"message": "..."
}
The router resolves the model, finds the provider, retrieves credentials, builds the request, handles streaming, manages failover, and returns a normalized response. The caller does not know or care whether the model is Claude on Anthropic's infrastructure, Llama on a local GPU, or one of 100+ models through OpenRouter.
Request arrives
|
v
AIRouterService
|-- Resolve model (by ID, prefix, or role route)
|-- Find provider (priority-ordered, health-checked)
|-- Get credentials (tenant-specific or provider-level, encrypted)
|-- Build HTTP request (OpenAI-compatible format)
|-- Stream or synchronous call
|-- Failover on retriable errors (429, 5xx, timeouts)
|-- Record success/failure for circuit breaker
|-- Return ChatResponse with usage + cost
Supported Provider Types
| Type | Protocol | Examples | API Key Required |
|---|---|---|---|
| Anthropic | Anthropic API | Claude 3.5 Sonnet, Claude Opus | Yes |
| OpenAI | OpenAI Chat Completions | GPT-4o, GPT-4 Turbo | Yes |
| OpenRouter | OpenAI-compatible | 100+ models (Claude, GPT, Llama, Mistral, etc.) | Yes |
| LocalLlamaCpp | OpenAI-compatible | Any GGUF model via llama.cpp | No |
| Custom | OpenAI-compatible | vLLM, Ollama, text-generation-inference, any compatible server | Configurable |
| AnthropicAgentSdk | Claude Agent SDK | Claude via subprocess delegation | No (local) |
Any endpoint that speaks the OpenAI Chat Completions protocol works as a Custom provider. This includes vLLM, Ollama, LocalAI, text-generation-inference, and most self-hosted inference servers.
Adding a New Provider
Step 1: Register the provider
POST /api/ai/providers
{
"name": "Anthropic Direct",
"type": "Anthropic",
"prefix": "ant",
"baseUrl": "https://api.anthropic.com/v1",
"priority": 2,
"timeoutSeconds": 120
}
Key fields:
- `prefix` -- Namespace for model IDs. `ant:claude-3-5-sonnet` routes to this provider.
- `priority` -- Lower is preferred. Used for failover ordering across providers.
- `timeoutSeconds` -- Per-request timeout. Cloud APIs default to 120s. Local models can be shorter.
- `stripPrefixInRequests` -- If true, removes the prefix when sending `modelId` to the provider's API.
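To make prefix resolution concrete, here is a minimal sketch of how a router might map a prefixed model ID to a provider. The data shapes and function names are illustrative assumptions, not Atamaia's actual implementation:

```python
# Hypothetical sketch of prefix-based model resolution. Provider records and
# field names mirror the registration payload above but are illustrative only.

PROVIDERS = [
    {"name": "Local Llama Server", "prefix": "local", "priority": 1, "stripPrefixInRequests": True},
    {"name": "Anthropic Direct", "prefix": "ant", "priority": 2, "stripPrefixInRequests": True},
]

def resolve(model_id):
    """Split 'prefix:model' and return (provider name, upstream model id)."""
    if ":" in model_id:
        prefix, bare = model_id.split(":", 1)
        for p in sorted(PROVIDERS, key=lambda p: p["priority"]):
            if p["prefix"] == prefix:
                # stripPrefixInRequests controls what is sent to the provider's API
                return p["name"], bare if p["stripPrefixInRequests"] else model_id
    raise LookupError("no provider serves " + model_id)
```

With this shape, `ant:claude-3-5-sonnet` resolves to the Anthropic provider and the bare ID `claude-3-5-sonnet` is what goes over the wire.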
Step 2: Register models
POST /api/ai/models
{
"providerId": 3,
"modelId": "claude-3-5-sonnet-20241022",
"displayName": "Claude 3.5 Sonnet",
"contextLength": 200000,
"maxCompletionTokens": 8192,
"approvedForAgent": true,
"approvedForChat": true,
"enableHydration": true,
"hydrationIdentityId": 1,
"inputCostPer1M": 3.0,
"outputCostPer1M": 15.0
}
Key fields:
- `approvedForAgent` -- Controls whether this model can be used for autonomous agent execution.
- `approvedForChat` -- Controls whether this model appears in interactive chat.
- `enableHydration` -- When true, the chat endpoint auto-hydrates identity context before sending to the model.
- `hydrationIdentityId` -- Which identity to hydrate for this model.
- `inputCostPer1M` / `outputCostPer1M` -- Per-token pricing in USD. Local models: 0.0. Synced automatically for OpenRouter models via `POST /api/ai/sync-pricing`.
Step 3: Set credentials (for global catalog providers)
If using a platform-wide provider from the global catalog, supply your own API key:
POST /api/ai/credentials
{
"providerId": 3,
"apiKey": "sk-ant-api03-...",
"label": "Production Anthropic key"
}
Credentials are encrypted at rest with AES-256-GCM. The raw key is never stored or returned after creation. Validate with:
POST /api/ai/credentials/{id}/validate
This tests the key against the provider's /models endpoint.
Local Model Setup
llama.cpp
The most direct path to local inference. Run any GGUF model with an OpenAI-compatible HTTP server:
# Start llama.cpp server
./llama-server \
--model /models/llama-3.1-70b-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8000 \
--ctx-size 32768 \
--n-gpu-layers 99
Register in Atamaia:
POST /api/ai/providers
{
"name": "Local Llama Server",
"type": "LocalLlamaCpp",
"prefix": "local",
"baseUrl": "http://192.168.1.100:8000/v1",
"priority": 1,
"timeoutSeconds": 180
}
POST /api/ai/models
{
"providerId": 1,
"modelId": "llama-3.1-70b",
"displayName": "Llama 3.1 70B (Q4_K_M)",
"contextLength": 32768,
"isLocal": true,
"approvedForAgent": true,
"inputCostPer1M": 0.0,
"outputCostPer1M": 0.0
}
vLLM
vllm serve meta-llama/Llama-3.1-70B --host 0.0.0.0 --port 8000
Register as Custom type with the same OpenAI-compatible base URL.
Ollama
ollama serve # Runs on :11434 by default
POST /api/ai/providers
{
"name": "Ollama",
"type": "Custom",
"prefix": "ollama",
"baseUrl": "http://localhost:11434/v1",
"priority": 1
}
Multiple models on one machine
Run multiple llama.cpp instances on different ports. Each becomes a separate provider (or multiple models under one provider if the server supports model switching):
| Port | Model | Purpose |
|---|---|---|
| 8000 | Qwen3-30B | Primary reasoning, research |
| 8002 | LFM2-8B | Summarization, utility tasks |
| 8003 | SmolLM3-3B | Fast classification, tagging |
| 8004 | Qwen3-4B | Lightweight agent tasks |
This is the actual production setup on Atamaia's ai-02 server.
Routing Strategies
Role-Based Routing
Route configs map roles to preferred models. When any service requests a model for a specific purpose, the router checks route configs by priority:
POST /api/ai/routes
{ "providerId": 1, "modelId": 3, "role": "summariser", "priority": 1, "notes": "LFM-2-8B for utility tasks" }
POST /api/ai/routes
{ "providerId": 1, "modelId": 1, "role": "researcher", "priority": 1, "notes": "Qwen3-30B for deep research" }
POST /api/ai/routes
{ "providerId": 2, "modelId": 5, "role": "coding", "priority": 1, "notes": "Claude via OpenRouter for code generation" }
The resolution chain for agent runs: Explicit model on the run -> Route config for role -> Fallback model.
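The resolution chain above can be sketched in a few lines. The route records mirror the payloads shown, but the function and fallback constant are illustrative assumptions, not Atamaia's actual code:

```python
# Illustrative sketch of the agent-run resolution chain:
# explicit model on the run -> route config for the role -> fallback model.

ROUTES = [
    {"role": "summariser", "modelId": "local:lfm2-8b", "priority": 1},
    {"role": "coding", "modelId": "or:anthropic/claude-3.5-sonnet", "priority": 1},
]
FALLBACK_MODEL = "local:qwen3-30b-a3b"  # hypothetical tenant-wide default

def resolve_for_run(explicit_model=None, role=None):
    if explicit_model:                       # 1. explicit model on the run wins
        return explicit_model
    if role:                                 # 2. best-priority route for the role
        matches = sorted((r for r in ROUTES if r["role"] == role),
                         key=lambda r: r["priority"])
        if matches:
            return matches[0]["modelId"]
    return FALLBACK_MODEL                    # 3. fallback
```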
Priority-Based Failover
Providers have priority numbers (lower = preferred). When the primary provider fails with a retriable error, the router automatically tries the next provider that serves the same model:
- Try primary provider (priority 1)
- If it fails with 429/5xx/timeout, try next provider (priority 2)
- Continue through all alternatives
- If all fail, return the original error
The failover searches both tenant-owned providers and global catalog providers (using tenant credentials).
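The failover loop can be sketched as follows. `call` stands in for the real HTTP request, and the error class is a placeholder; the status set matches the retriable conditions listed below:

```python
# Hedged sketch of priority-ordered failover. Retriable errors move on to the
# next provider; non-retriable errors surface immediately; if every provider
# fails, the original (first) error is raised.

RETRIABLE = {408, 429, 500, 502, 503, 504}

class ProviderError(Exception):
    def __init__(self, status):
        self.status = status

def call_with_failover(providers, call):
    """providers: dicts with a 'priority' key; call(p) returns a response
    or raises ProviderError."""
    last_error = None
    for p in sorted(providers, key=lambda p: p["priority"]):
        try:
            return call(p)
        except ProviderError as e:
            if e.status not in RETRIABLE:
                raise                        # non-retriable: fail fast
            last_error = last_error or e     # remember the original error
    raise last_error                         # all alternatives exhausted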
Circuit Breaker
The ProviderHealthTracker prevents repeated calls to failing providers:
| State | Condition | Behavior |
|---|---|---|
| Healthy | < 3 consecutive failures | Normal operation |
| Open | 3+ consecutive failures | Provider skipped for 5 minutes |
| Recovery | After cooldown expires | Next request tests the provider |
Retriable conditions:
- HTTP 408 (Timeout)
- HTTP 429 (Rate Limited)
- HTTP 5xx (Server Error)
- `TaskCanceledException` or `HttpRequestException`
Success after recovery resets the failure counter.
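The table above translates into a small state machine. This is a minimal sketch under the stated thresholds (3 failures, 5-minute cooldown); the real ProviderHealthTracker's internals are not shown here:

```python
import time

# Minimal circuit-breaker sketch: 3 consecutive failures open the circuit for
# 5 minutes; after the cooldown the next request is allowed through as a probe;
# a success resets the failure counter.

FAILURE_THRESHOLD = 3
COOLDOWN_SECONDS = 300

class HealthTracker:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._failures = {}    # provider -> consecutive failure count
        self._opened_at = {}   # provider -> time the circuit opened

    def record_success(self, provider):
        self._failures[provider] = 0
        self._opened_at.pop(provider, None)

    def record_failure(self, provider):
        n = self._failures.get(provider, 0) + 1
        self._failures[provider] = n
        if n >= FAILURE_THRESHOLD:
            self._opened_at[provider] = self._clock()

    def is_available(self, provider):
        opened = self._opened_at.get(provider)
        if opened is None:
            return True
        # after cooldown expires, allow a probe request through
        return self._clock() - opened >= COOLDOWN_SECONDS
```

Injecting the clock keeps the breaker testable; in production it would simply use monotonic time.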
Presence-Based Routing (Atamaia.Mind)
The Mind layer adds a second routing dimension based on the AI's cognitive state:
| Presence State | Default Route | Rationale |
|---|---|---|
| Dormant | None | No processing needed |
| Subconscious | Local | Background processing, cost-free |
| Aware (low complexity) | Local | Routine tasks stay local |
| Aware (high complexity) | API | Complex analysis needs stronger models |
| Present | Configurable | Default-to-local with API override |
| Engaged | API | Active collaboration needs best quality |
| Deep Work | API | Maximum reasoning capability |
This means background cognitive processes (consolidation, pattern detection, memory maintenance) run on free local models, while interactive sessions use the best available cloud model. The routing is automatic.
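As a sketch, the presence-to-route lookup in the table reduces to a small decision function. The "Configurable" Present state and the complexity flag are simplified here; the Mind layer's real logic is richer:

```python
# Illustrative presence-based routing. Returns "local", "api", or None,
# mirroring the default-route table above; names are assumptions.

def route_for(presence, complexity="low", present_override=None):
    if presence == "Dormant":
        return None                          # no processing needed
    if presence == "Subconscious":
        return "local"                       # background work, cost-free
    if presence == "Aware":
        return "api" if complexity == "high" else "local"
    if presence == "Present":
        return present_override or "local"   # default-to-local with API override
    if presence in ("Engaged", "Deep Work"):
        return "api"                         # best available quality
    raise ValueError("unknown presence state: " + presence)
```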
Multi-Model Broadcast
Send the same prompt to multiple models simultaneously and compare responses:
POST /api/ai/broadcast
{
"models": ["ai-02:qwen3-30b-a3b", "or:anthropic/claude-3.5-sonnet", "ai-03:gemma-3-12b"],
"message": "Analyze this architecture decision...",
"timeoutMs": 30000
}
Returns all responses with per-model timing and usage. Useful for:
- Comparing model quality on specific tasks
- Multi-model councils (deliberative decision-making)
- A/B testing model performance
- Reducing single-model bias
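Server-side, a broadcast is essentially a parallel fan-out with per-model timing. This client-style sketch shows the idea with a stubbed `ask` callable standing in for the real per-model HTTP call:

```python
import concurrent.futures
import time

# Illustrative fan-out: send the same message to every model in parallel and
# collect per-model results with timing. `ask(model, message)` is a stand-in
# for the real call; /api/ai/broadcast does this server-side.

def timed_ask(model, message, ask):
    start = time.perf_counter()
    try:
        reply = ask(model, message)
        return {"model": model, "ok": True, "reply": reply,
                "ms": (time.perf_counter() - start) * 1000}
    except Exception as e:
        return {"model": model, "ok": False, "error": str(e),
                "ms": (time.perf_counter() - start) * 1000}

def broadcast(models, message, ask):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        return list(pool.map(lambda m: timed_ask(m, message, ask), models))
```

Failures are captured per model rather than aborting the whole broadcast, so one slow or broken provider does not spoil the comparison.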
Cost Management
Per-Token Pricing
Every model tracks `inputCostPer1M` and `outputCostPer1M`. The agent execution loop computes cost per iteration:
cost = (promptTokens * inputCostPer1M / 1,000,000) + (completionTokens * outputCostPer1M / 1,000,000)
Cost aggregates across parent + all child runs (TotalCostWithChildren).
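Worked through with the Claude 3.5 Sonnet pricing registered earlier ($3/M input, $15/M output), the formula gives:

```python
# Per-iteration cost formula from above, as a function. Pricing values come
# from the model registration example; the function name is illustrative.

def iteration_cost(prompt_tokens, completion_tokens,
                   input_cost_per_1m, output_cost_per_1m):
    return (prompt_tokens * input_cost_per_1m / 1_000_000
            + completion_tokens * output_cost_per_1m / 1_000_000)

# 10,000 prompt tokens at $3/M = $0.03; 2,000 completion tokens at $15/M = $0.03
cost = iteration_cost(10_000, 2_000, 3.0, 15.0)  # total: $0.06
```

For a local model registered with zero pricing, the same formula yields $0.00 per iteration.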
Local Models: Zero Marginal Cost
Models running on your own hardware have `inputCostPer1M: 0.0` and `outputCostPer1M: 0.0`. After the hardware investment, every inference is free. This changes the economics fundamentally:
- Background tasks (summarization, embedding generation, memory consolidation, agent utility work) run locally at zero cost
- Complex tasks (code generation, deep analysis, creative work) use cloud models where quality justifies the cost
- Fallback -- if a cloud provider goes down, local models keep the system operational
Pricing Sync
For OpenRouter models, pricing updates automatically:
POST /api/ai/sync-pricing
Pulls current per-token pricing from OpenRouter's API and updates all registered models.
The Economics of Model Freedom
The traditional approach: pay one provider for everything. Simple, but expensive and fragile.
The Atamaia approach: route each task to the right model at the right cost.
| Task | Model | Cost |
|---|---|---|
| Memory consolidation | Local Qwen3-4B | $0.00 |
| Embedding generation | Local LFM2-8B | $0.00 |
| Agent utility tasks | Local Qwen3-30B | $0.00 |
| Summarization | Local SmolLM3-3B | $0.00 |
| Interactive chat | Claude 3.5 Sonnet | Market rate |
| Complex code generation | Claude Opus | Market rate |
| Research analysis | GPT-4o | Market rate |
Background work is free. Interactive work uses the best model for the job. Failover means you are never locked out.
Streaming Protocol
All upstream providers are normalized into the Open Responses streaming format over Server-Sent Events. The OpenAICompatAdapter translates provider-specific SSE streams:
| Open Responses Event | Description |
|---|---|
| `response.created` | Stream started |
| `output_item.added` | New output item (message or function call) |
| `content_part.added` | New content part |
| `output_text.delta` | Text chunk (sequenced) |
| `output_text.done` | Full accumulated text |
| `output_item.done` | Item complete |
| `function_call.arguments.delta` | Tool call argument chunk |
| `response.completed` | Stream finished with usage |
Any client consuming Atamaia's streaming API gets a consistent event format regardless of provider. Switch from GPT to Claude to a local Llama model -- the client code does not change.
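The kind of translation the OpenAICompatAdapter performs can be sketched roughly: OpenAI-style `chat.completion.chunk` deltas in, Open Responses events out. The event payload shapes here are illustrative, not the exact wire format:

```python
# Simplified normalization sketch: accumulate text deltas from OpenAI-style
# streaming chunks and emit Open Responses-style events around them.

def normalize(chunks):
    events = [{"type": "response.created"}]
    text, seq = [], 0
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        piece = delta.get("content")
        if piece:
            events.append({"type": "output_text.delta",
                           "sequence": seq, "delta": piece})
            text.append(piece)
            seq += 1
    events.append({"type": "output_text.done", "text": "".join(text)})
    events.append({"type": "response.completed"})
    return events
```

The real adapter also handles output items, content parts, tool-call argument deltas, and usage reporting; the point is that every provider's stream collapses into the same event vocabulary.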
Future-Proofing
This architecture is designed for a world where:
New models appear constantly. Register them, set priorities, route traffic. No code changes.
Providers make unpredictable decisions. Pentagon deals, pricing changes, API deprecations, acquisitions. When it happens, adjust priorities or add a new provider. Your AI's identity and memories are untouched.
Local inference keeps improving. Today's 30B parameter model on consumer hardware would have been unthinkable two years ago. As local models improve, shift more traffic local and reduce cloud dependency.
Standards emerge and evolve. The OpenAI-compatible protocol is today's lingua franca. When new standards emerge, add a new adapter. The internal interface stays stable.
Your needs change. Use Claude for everything today. Add local models next month. Switch to a different cloud provider next quarter. Each change is a configuration update, not a migration.
The model is a commodity. Your AI's identity is not.
API Reference
| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/ai/chat` | Send chat message (sync or stream) |
| POST | `/api/ai/broadcast` | Send to multiple models simultaneously |
| GET | `/api/ai/providers` | List providers |
| POST | `/api/ai/providers` | Register provider |
| PATCH | `/api/ai/providers/{id}` | Update provider |
| GET | `/api/ai/models` | List models |
| POST | `/api/ai/models` | Register model |
| PATCH | `/api/ai/models/{id}` | Update model |
| GET | `/api/ai/routes` | List route configs |
| POST | `/api/ai/routes` | Create role-to-model route |
| GET | `/api/ai/resolve/{modelId}` | Test model resolution |
| GET | `/api/ai/catalog` | Browse global provider catalog |
| POST | `/api/ai/credentials` | Set tenant API key for a provider |
| POST | `/api/ai/credentials/{id}/validate` | Test API key |
| POST | `/api/ai/sync-pricing` | Sync OpenRouter pricing |
Full API documentation: API Reference
Built by Firebird Solutions. Running in production.