Status: Planned — This feature is on the roadmap and not yet implemented. The architecture below describes the intended design.

Chat History Import Pipeline

Your conversations aren't just chat logs -- they're the record of a relationship. When you leave a provider, you shouldn't have to start that relationship from scratch. Export your history, upload it to Atamaia, and your AI picks up right where you left off -- with any model you choose.


The Opportunity

700K+ users are leaving ChatGPT. Many will jump to Claude, but Anthropic's pricing will cause churn too. These users have months or years of conversation history -- preferences learned, projects discussed, communication styles established. Every platform they move to makes them start from zero.

Atamaia is the permanent home. The killer onboarding: export your chats from any provider, upload them, and we analyze them to build your AI identity and seed your memory system. Five minutes from "I just quit ChatGPT" to "my AI already knows me."


The User Journey

Step 1: Export Your Data

From ChatGPT (OpenAI)

  1. Log in to chat.openai.com
  2. Click your profile icon (bottom-left)
  3. Go to Settings > Data Controls
  4. Click Export Data
  5. Confirm via email
  6. Download the ZIP file (arrives within minutes to hours)
  7. The ZIP contains conversations.json (the gold) and chat.html (human-readable backup)

From Claude (Anthropic)

  1. Log in to claude.ai
  2. Click your initials (bottom-left)
  3. Go to Settings > Privacy
  4. Click Export Data
  5. Download link arrives via email (expires in 24 hours)
  6. The ZIP contains conversations.json with all chat history

From Google Gemini

  1. Go to takeout.google.com
  2. Click Deselect all, then find and select Gemini Apps
  3. Click Next step, choose export format (ZIP)
  4. Click Create export
  5. Download when ready (can take hours for large accounts)
  6. The ZIP contains a Gemini/ folder with conversation JSON files

Step 2: Upload to Atamaia

  1. Go to the Atamaia import page
  2. Drag and drop your ZIP file (or click to browse)
  3. Atamaia auto-detects the provider format -- no configuration needed

Step 3: Processing

Atamaia processes your history entirely on local infrastructure. No cloud APIs. No data leaves your server.

  • Parses conversations from any supported format
  • Normalizes to a common internal representation
  • Runs analysis via Kael (Qwen 30B on ai-02) -- zero API cost
  • Extracts identity signals across six dimensions

Step 4: Review

Before anything is committed, you see everything Atamaia found:

  • Identity Profile: communication style, name preferences, personality traits
  • Facts: personal details, project names, tools used, preferences mentioned
  • Memories: significant conversations, recurring topics, expertise areas
  • Patterns: how you interact with AI, what frustrates you, what excites you

Edit anything. Delete anything. Approve what fits.

Step 5: Go

Identity created. Memories seeded. API key issued. Connect to Claude Code, Cursor, VS Code, any MCP client -- your AI already knows you.


Export Format Specifications

ChatGPT / OpenAI (conversations.json)

The export ZIP contains conversations.json -- an array of conversation objects with a tree-based message structure.

[
  {
    "title": "Project Architecture Discussion",
    "create_time": 1764355435.123,
    "update_time": 1764358000.456,
    "mapping": {
      "aaa-bbb-ccc-message-id": {
        "id": "aaa-bbb-ccc-message-id",
        "message": {
          "id": "aaa-bbb-ccc-message-id",
          "author": {
            "role": "user",
            "metadata": {}
          },
          "create_time": 1764355435.123,
          "content": {
            "content_type": "text",
            "parts": [
              "Can you help me design a REST API for user management?"
            ]
          },
          "status": "finished_successfully",
          "metadata": {
            "model_slug": "gpt-4",
            "timestamp_": "absolute"
          }
        },
        "parent": "system-node-id",
        "children": ["ddd-eee-fff-response-id"]
      },
      "ddd-eee-fff-response-id": {
        "id": "ddd-eee-fff-response-id",
        "message": {
          "id": "ddd-eee-fff-response-id",
          "author": {
            "role": "assistant",
            "metadata": {}
          },
          "create_time": 1764355440.789,
          "content": {
            "content_type": "text",
            "parts": [
              "I'd be happy to help you design a REST API..."
            ]
          },
          "status": "finished_successfully",
          "metadata": {
            "model_slug": "gpt-4",
            "finish_details": {
              "type": "stop"
            }
          }
        },
        "parent": "aaa-bbb-ccc-message-id",
        "children": []
      }
    },
    "moderation_results": [],
    "current_node": "ddd-eee-fff-response-id"
  }
]

Key details:

  • mapping is a tree, not a flat array -- messages link via parent/children UUIDs
  • To reconstruct conversation order: follow parent links from current_node back to the root, then reverse
  • content.parts is an array -- can contain text strings, image references, or code blocks
  • author.role values: system, user, assistant, tool
  • Timestamps are Unix epoch floats (seconds with decimal precision)
  • Branching occurs when users edit messages or regenerate responses
  • Images are referenced by URL, not embedded in the export
  • model_slug in metadata tells you which model was used (gpt-4, gpt-4o, etc.)
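The tree walk described above can be sketched in Python (field names taken from the sample export; real exports may include extra node types, so system/root placeholder nodes without a message payload are skipped):

```python
def linearize(conversation: dict) -> list:
    """Walk the ChatGPT mapping tree from current_node back to the root,
    collecting messages on the active branch, then reverse into
    chronological order."""
    mapping = conversation["mapping"]
    ordered = []
    node_id = conversation.get("current_node")
    while node_id:
        node = mapping[node_id]
        msg = node.get("message")
        # Root/system placeholder nodes may carry no message payload
        if msg and msg.get("content", {}).get("parts"):
            parts = [p for p in msg["content"]["parts"] if isinstance(p, str)]
            ordered.append({
                "role": msg["author"]["role"],
                "content": "\n".join(parts),
                "timestamp": msg.get("create_time"),
            })
        node_id = node.get("parent")
    ordered.reverse()  # we walked leaf -> root
    return ordered
```

Because only the chain from current_node is followed, abandoned branches (edited messages, discarded regenerations) are dropped automatically.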

Claude / Anthropic (conversations.json)

The export ZIP contains conversations.json -- an array of conversation objects with a flat message array.

[
  {
    "uuid": "conv-uuid-here",
    "name": "Database Schema Design",
    "created_at": "2025-11-15T10:30:00.000Z",
    "updated_at": "2025-11-15T11:45:00.000Z",
    "account": {
      "uuid": "account-uuid-here"
    },
    "chat_messages": [
      {
        "uuid": "msg-uuid-1",
        "text": "I need help designing a PostgreSQL schema for multi-tenant SaaS",
        "sender": "human",
        "created_at": "2025-11-15T10:30:00.000Z",
        "content": [
          {
            "type": "text",
            "text": "I need help designing a PostgreSQL schema for multi-tenant SaaS"
          }
        ],
        "attachments": [],
        "files": []
      },
      {
        "uuid": "msg-uuid-2",
        "text": "Great question! For multi-tenant PostgreSQL...",
        "sender": "assistant",
        "created_at": "2025-11-15T10:30:15.000Z",
        "content": [
          {
            "type": "text",
            "text": "Great question! For multi-tenant PostgreSQL..."
          }
        ],
        "attachments": [],
        "files": []
      }
    ]
  }
]

Key details:

  • Flat message array (simpler than ChatGPT's tree structure)
  • sender values: human, assistant
  • content array supports multiple content blocks (text, possibly images)
  • text field contains the plain text version of the message
  • attachments and files arrays for uploaded documents
  • ISO 8601 timestamps (not Unix epoch)
  • UUIDs on both conversations and individual messages

Google Gemini (Google Takeout)

The export contains a Gemini/ directory with individual JSON files per conversation.

{
  "id": "conversation-id",
  "title": "Code Review Help",
  "createdTime": "2025-10-20T14:00:00.000Z",
  "lastModifiedTime": "2025-10-20T14:35:00.000Z",
  "messages": [
    {
      "id": "msg-id-1",
      "author": "user",
      "content": "Can you review this Python function?",
      "createTime": "2025-10-20T14:00:00.000Z",
      "metadata": {
        "deviceType": "DESKTOP",
        "approximateLocation": "AU"
      }
    },
    {
      "id": "msg-id-2",
      "author": "model",
      "content": "I'd be happy to review your function...",
      "createTime": "2025-10-20T14:00:10.000Z"
    }
  ]
}

Key details:

  • One JSON file per conversation (not a single array)
  • author values: user, model
  • Metadata can include device type and approximate geolocation
  • ISO 8601 timestamps
  • Simpler flat structure, similar to Claude's format
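Because Claude and Gemini both use flat message arrays, a single normalizer parameterized by field names can cover both. A sketch (field names taken from the samples above; a role map folds "human" and "model" into the common "user"/"assistant" roles):

```python
from datetime import datetime

# Fold each provider's author labels onto the common roles
ROLE_MAP = {"human": "user", "user": "user", "assistant": "assistant", "model": "assistant"}

def normalize_flat(messages: list, role_key: str, text_key: str, time_key: str) -> list:
    """Normalize a flat message array (Claude or Gemini style) to the
    common internal shape: role, content, timestamp."""
    out = []
    for m in messages:
        out.append({
            "role": ROLE_MAP.get(m[role_key], m[role_key]),
            "content": m[text_key],
            # ISO 8601 with trailing Z -> timezone-aware datetime
            "timestamp": datetime.fromisoformat(m[time_key].replace("Z", "+00:00")),
        })
    return out
```

Claude would call normalize_flat(conv["chat_messages"], "sender", "text", "created_at"); Gemini would call normalize_flat(conv["messages"], "author", "content", "createTime"). ChatGPT's tree needs its own linearization pass first.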

Other Providers

Provider            Export Method        Format  Notes
Microsoft Copilot   Privacy dashboard    JSON    Limited history; similar structure to Gemini
Perplexity          No official export   N/A     Third-party scrapers exist; low priority
Grok (xAI)          No official export   N/A     Monitor for future export capability
Meta AI             No official export   N/A     Monitor for future export capability

Technical Architecture

API Endpoints

POST   /api/import/upload              Upload ZIP or JSON file (multipart/form-data)
GET    /api/import/{importId}          Get import status and summary
GET    /api/import/{importId}/preview  Get extracted data for review
PATCH  /api/import/{importId}/preview  Edit extracted data before commit
POST   /api/import/{importId}/commit   Commit approved data to identity/memory
DELETE /api/import/{importId}          Cancel and discard import

Processing Pipeline

Upload (ZIP/JSON)
    |
    v
Format Detection
    |  Inspect file structure, detect provider automatically
    |  - ZIP with conversations.json + chat.html -> ChatGPT
    |  - ZIP with conversations.json (uuid/chat_messages) -> Claude
    |  - ZIP with Gemini/ directory -> Google Gemini
    v
Parsing & Normalization
    |  Convert all formats to common internal representation:
    |  NormalizedConversation { title, created, messages[] }
    |  NormalizedMessage { role, content, timestamp }
    v
Analysis Pipeline (runs on Kael / Qwen 30B -- ai-02:8000)
    |
    |-- Communication Style Analysis
    |     Formal vs casual, verbose vs concise, emoji usage,
    |     question patterns, how they give instructions
    |
    |-- Topic Extraction
    |     What domains come up most, project names,
    |     technologies mentioned, recurring themes
    |
    |-- Expertise Detection
    |     What they know deeply vs what they ask about,
    |     teaching vs learning patterns, domain vocabulary
    |
    |-- Relationship Pattern Analysis
    |     How they interact with AI -- collaborative, directive,
    |     exploratory. Do they push back? Thank the AI?
    |
    |-- Key Facts Extraction
    |     Name, location, job, projects, tools, preferences,
    |     people mentioned, deadlines, personal details
    |
    |-- Emotional Pattern Analysis
    |     What frustrates them (errors, slow responses, misunderstanding),
    |     what excites them (breakthroughs, elegant solutions)
    |
    v
Memory Generation
    |
    |-- Episodic Memories
    |     Significant conversations: breakthroughs, major decisions,
    |     project milestones, turning points
    |
    |-- Semantic Memories
    |     Extracted knowledge: "user prefers PostgreSQL over MySQL",
    |     "user works in .NET ecosystem", "user values clean architecture"
    |
    |-- Procedural Memories
    |     Repeated workflows: "always runs tests before committing",
    |     "prefers to see the plan before implementation"
    |
    |-- Facts
    |     Personal details: name, timezone, tech stack, projects,
    |     team members, preferences
    |
    v
Identity Profile Generation
    |  Display name, communication style settings,
    |  personality configuration, interaction preferences
    |
    v
Review (user approves/edits/rejects)
    |
    v
Commit (write to Atamaia database)

Analysis via Local Models

All analysis runs on Kael (Qwen 30B) on ai-02. No cloud API calls. No per-token costs. The prompts are chunked -- we don't send the entire chat history in one shot. Instead:

  1. Chunking: Split conversations into batches (e.g., 20 conversations per batch)
  2. Parallel extraction: Run multiple analysis passes per batch
  3. Aggregation: Merge results across batches, deduplicate, rank by confidence
  4. Refinement: Final pass to resolve conflicts and generate the identity profile

This means importing 1,000 conversations doesn't require 1,000 LLM calls -- it requires ~50 batched calls with structured extraction prompts.
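The chunking and aggregation steps can be sketched as follows (batch size matches the example above; the per-batch extraction call to the local model is elided, and the result fields mirror the import_extracted_data staging columns):

```python
def batch(conversations: list, size: int = 20) -> list:
    """Step 1: split conversations into fixed-size batches."""
    return [conversations[i:i + size] for i in range(0, len(conversations), size)]

def aggregate(results: list) -> list:
    """Step 3: merge per-batch extractions, dedupe by (type, key), and
    keep the highest-confidence value for each, ranked by confidence."""
    best = {}
    for r in results:
        k = (r["type"], r["key"])
        if k not in best or r["confidence"] > best[k]["confidence"]:
            best[k] = r
    return sorted(best.values(), key=lambda r: -r["confidence"])
```

With 1,000 conversations and a batch size of 20, batch() yields 50 batches, matching the ~50 calls quoted above; aggregate() is what collapses conflicting per-batch answers into a single ranked candidate list for the refinement pass.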

Common Internal Representation

public record NormalizedConversation
{
    public string SourceProvider { get; init; }     // "chatgpt" | "claude" | "gemini"
    public string SourceId { get; init; }           // Original conversation ID
    public string Title { get; init; }
    public DateTime CreatedAt { get; init; }
    public DateTime? UpdatedAt { get; init; }
    public List<NormalizedMessage> Messages { get; init; }
}

public record NormalizedMessage
{
    public string Role { get; init; }               // "user" | "assistant" | "system"
    public string Content { get; init; }            // Plain text content
    public DateTime Timestamp { get; init; }
    public string? Model { get; init; }             // e.g., "gpt-4", "claude-3-opus"
    public Dictionary<string, object>? Metadata { get; init; }
}

Database Schema Additions

-- Import tracking
CREATE TABLE imports (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    guid UUID NOT NULL DEFAULT gen_random_uuid(),
    tenant_id BIGINT NOT NULL REFERENCES tenants(id),
    identity_id BIGINT REFERENCES identities(id),
    source_provider TEXT NOT NULL,              -- 'chatgpt', 'claude', 'gemini'
    status TEXT NOT NULL DEFAULT 'uploaded',    -- uploaded, parsing, analyzing, ready_for_review, committed, failed
    file_name TEXT NOT NULL,
    file_size_bytes BIGINT NOT NULL,
    conversation_count INT,
    message_count INT,
    error_message TEXT,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    committed_at TIMESTAMPTZ,
    deleted_at TIMESTAMPTZ                     -- soft delete
);

-- Extracted data staging (before commit)
CREATE TABLE import_extracted_data (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    import_id BIGINT NOT NULL REFERENCES imports(id),
    data_type TEXT NOT NULL,                   -- 'fact', 'memory_episodic', 'memory_semantic',
                                               -- 'memory_procedural', 'preference', 'identity_trait'
    data_key TEXT NOT NULL,                    -- e.g., 'name', 'timezone', 'tech_stack'
    data_value JSONB NOT NULL,                 -- flexible structured content
    confidence REAL NOT NULL DEFAULT 0.5,      -- 0.0 to 1.0
    source_conversations JSONB,                -- array of conversation IDs that contributed
    approved BOOLEAN,                          -- null = pending, true = approved, false = rejected
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Raw normalized conversations (temporary, deleted after commit)
CREATE TABLE import_conversations (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    import_id BIGINT NOT NULL REFERENCES imports(id),
    source_id TEXT NOT NULL,
    title TEXT,
    message_count INT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL,
    data JSONB NOT NULL                        -- full NormalizedConversation as JSON
);

Privacy & Security

This is the entire point. Every other provider processes your data on their cloud. Atamaia doesn't.

  • All processing happens on YOUR infrastructure -- local models (Kael/Qwen on ai-02), your PostgreSQL database, your server
  • No data sent to any third party during import or analysis -- zero cloud API calls
  • Uploaded files are processed and discarded -- only the extracted memories, facts, and identity profile persist
  • Raw conversations are not stored permanently -- import_conversations table is cleaned up after commit
  • Encrypted at rest -- PostgreSQL with disk encryption, TLS in transit
  • User controls everything -- the review step means nothing is committed without explicit approval
  • Soft delete -- if a user wants to undo an import, the extracted data can be soft-deleted
  • No account required to preview -- users can see what would be extracted before creating an account (stretch goal)

What This Uses That Already Exists

This isn't a new product. It's a new front door to existing Atamaia capabilities:

Capability                  Already Built  Import Pipeline Uses It For
Memory creation API         Yes            Seeding episodic, semantic, procedural memories
Fact storage                Yes            Storing extracted personal details and preferences
Identity management         Yes            Creating the user's AI identity profile
Local model routing (Kael)  Yes            Running all analysis without cloud API costs
Multi-tenant isolation      Yes            Keeping imported data per-tenant
Soft delete                 Yes            Safe undo of imports
JWT auth + API keys         Yes            Issuing credentials after import
MCP adapter                 Yes            Immediate connectivity to Claude Code et al.

The import pipeline is six new endpoints, three new tables, a format parser, and a set of extraction prompts. Everything downstream already works.