Status: Planned — This feature is on the roadmap and not yet implemented. The architecture below describes the intended design.

Chat Import -- Implementation Plan


Phase 1: ChatGPT Import (Biggest Market Right Now)

Target: 2 weeks

This is where the users are. The QuitGPT movement is real, the export format is well-documented, and ChatGPT users have the deepest conversation histories.

Week 1: Core Pipeline

API Endpoints & Models

// DTOs
public record ImportUploadResponse(Guid ImportId, string Status, int ConversationCount, int MessageCount);
public record ImportStatusResponse(Guid ImportId, string Status, string SourceProvider, int ConversationCount, int MessageCount, DateTime CreatedAt, DateTime? CommittedAt, string? Error);
public record ImportPreviewResponse(Guid ImportId, ImportedIdentityProfile Identity, List<ExtractedDataItem> Facts, List<ExtractedDataItem> Memories, List<ExtractedDataItem> Preferences);
public record ExtractedDataItem(long Id, string DataType, string Key, JsonElement Value, float Confidence, bool? Approved);
public record ImportCommitRequest(List<long> ApprovedItemIds);
public record ImportCommitResponse(Guid ImportId, Guid IdentityId, string ApiKey, int MemoriesCreated, int FactsStored);

public record ImportedIdentityProfile
{
    public string? DisplayName { get; init; }
    public string? CommunicationStyle { get; init; }  // "casual", "formal", "technical"
    public string? Verbosity { get; init; }            // "concise", "moderate", "detailed"
    public List<string> ExpertiseDomains { get; init; } = new();
    public List<string> FrequentTopics { get; init; } = new();
    public string? InteractionStyle { get; init; }     // "collaborative", "directive", "exploratory"
}

Endpoints

POST   /api/import/upload
  - Accept: multipart/form-data
  - Body: file (ZIP or JSON), optional: displayName
  - Auth: optional (can import before account creation)
  - Returns: ImportUploadResponse
  - Kicks off background processing

GET    /api/import/{importId}
  - Returns: ImportStatusResponse
  - Poll this for processing status

GET    /api/import/{importId}/preview
  - Returns: ImportPreviewResponse
  - Only available when status = "ready_for_review"

PATCH  /api/import/{importId}/preview
  - Body: list of { id, approved: bool } decisions
  - Updates approval status on extracted data items

POST   /api/import/{importId}/commit
  - Body: ImportCommitRequest (list of approved item IDs)
  - Creates identity, seeds memories, generates API key
  - Returns: ImportCommitResponse

DELETE /api/import/{importId}
  - Soft-deletes import and all associated data
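
From the client's side, the flow is upload, poll, review, commit. A sketch of the polling loop (Python; `fetch_status` is a hypothetical stand-in for an HTTP GET against /api/import/{importId}, not a real client):

```python
import time

def wait_for_review(fetch_status, import_id, poll_seconds=2.0, timeout=1800):
    """Poll the import status endpoint until the import is ready for
    review or has failed. fetch_status stands in for GET /api/import/{id}."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(import_id)["status"]
        if status in ("ready_for_review", "failed"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"import {import_id} still processing after {timeout}s")
```

The timeout and poll interval are illustrative defaults; a real client would back off as imports grow.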

ChatGPT Parser

public class ChatGptParser : IImportParser
{
    public string Provider => "chatgpt";

    public bool CanParse(ZipArchive archive)
    {
        // Check for conversations.json + chat.html (ChatGPT signature)
        var hasConversations = archive.GetEntry("conversations.json") != null;
        var hasChatHtml = archive.GetEntry("chat.html") != null;
        return hasConversations && hasChatHtml;
    }

    public async Task<List<NormalizedConversation>> ParseAsync(ZipArchive archive)
    {
        // 1. Read conversations.json
        // 2. For each conversation, walk the mapping tree from root to current_node
        // 3. Extract messages in order, resolving branches (follow the current_node path)
        // 4. Convert Unix timestamps to DateTime
        // 5. Map author.role: "user" -> "user", "assistant" -> "assistant", "system" -> "system"
        // 6. Extract content from the parts[] array (join text parts, note non-text parts)
        // 7. Return the NormalizedConversation list
        throw new NotImplementedException(); // design sketch -- not yet implemented
    }
}

Tree walking algorithm (ChatGPT's mapping is a tree, not a list):

  1. Find the root node (node with no parent or parent is null)
  2. Follow children arrays, always picking the child that leads to current_node
  3. This gives the "as experienced" conversation path, ignoring edited/regenerated branches
  4. Optionally preserve branches as metadata (edited messages are interesting signal)
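
A minimal sketch of the walk (Python for brevity; assumes each mapping node carries `parent`, `children`, and an optional `message`, as in conversations.json). Following parent links up from `current_node` and reversing yields the same path as the root-down walk described above, without searching children:

```python
def walk_conversation(mapping: dict, current_node: str) -> list:
    """Collect messages along the path from root to current_node,
    ignoring edited/regenerated sibling branches."""
    path = []
    node_id = current_node
    while node_id is not None:
        node = mapping[node_id]
        message = node.get("message")
        if message is not None:          # root/system stubs may carry no message
            path.append(message)
        node_id = node.get("parent")
    path.reverse()                       # root-first order
    return path

# Tiny synthetic mapping: root -> a -> b (current), with an ignored branch a -> c
mapping = {
    "root": {"parent": None, "children": ["a"], "message": None},
    "a": {"parent": "root", "children": ["b", "c"], "message": {"id": "a"}},
    "b": {"parent": "a", "children": [], "message": {"id": "b"}},
    "c": {"parent": "a", "children": [], "message": {"id": "c"}},
}
print([m["id"] for m in walk_conversation(mapping, "b")])  # ['a', 'b']
```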

Database Migrations

-- imports, import_extracted_data, import_conversations tables
-- As specified in chat-import.md
-- Add indexes:
CREATE INDEX idx_imports_tenant_id ON imports(tenant_id);
CREATE INDEX idx_imports_status ON imports(status);
CREATE INDEX idx_import_extracted_data_import_id ON import_extracted_data(import_id);
CREATE INDEX idx_import_conversations_import_id ON import_conversations(import_id);

Week 2: Analysis Pipeline & Review UI

Analysis Prompts (Kael / Qwen 30B)

Each prompt processes a batch of ~20 normalized conversations. All prompts return structured JSON.
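
The chunking step is straightforward; a sketch (Python; the per-batch count matches the ~20 above, while the character cap is an added assumption to guard the model's context window, not part of the plan):

```python
def make_batches(conversations, max_per_batch=20, max_chars=60_000):
    """Group conversations into batches of at most max_per_batch items,
    also capping total text size so a batch fits the model context."""
    batches, current, current_chars = [], [], 0
    for conv in conversations:
        size = sum(len(m["text"]) for m in conv["messages"])
        if current and (len(current) >= max_per_batch
                        or current_chars + size > max_chars):
            batches.append(current)
            current, current_chars = [], 0
        current.append(conv)
        current_chars += size
    if current:
        batches.append(current)
    return batches
```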

Prompt 1: Facts & Preferences Extraction

You are analyzing a batch of conversations to extract factual information about the user.
Return a JSON array of facts. Each fact has: key, value, confidence (0-1), evidence (quote from conversation).

Categories to extract:
- Personal: name, location, timezone, job_title, company, team_members
- Technical: programming_languages, frameworks, databases, tools, OS, editor
- Projects: project_names, project_descriptions, deadlines
- Preferences: code_style, communication_preferences, workflow_habits

Conversations:
{batch_json}

Return ONLY valid JSON. No explanation.

Prompt 2: Communication Style Analysis

Analyze the USER messages (not the assistant) in these conversations.
Return JSON with:
{
  "formality": "casual|moderate|formal",
  "verbosity": "concise|moderate|detailed",
  "emoji_usage": "none|rare|moderate|frequent",
  "question_style": "direct|exploratory|socratic",
  "instruction_style": "collaborative|directive|deferential",
  "technical_depth": "beginner|intermediate|advanced|expert",
  "evidence": ["quote1", "quote2"]
}

Conversations:
{batch_json}

Prompt 3: Topic & Expertise Extraction

Analyze these conversations and identify:
1. Recurring topics (what the user talks about most)
2. Expertise areas (what they clearly know well vs what they're learning)
3. Significant conversations (breakthroughs, major decisions, turning points)

Return JSON:
{
  "topics": [{"name": "...", "frequency": N, "examples": ["..."]}],
  "expertise": [{"domain": "...", "level": "learning|competent|expert", "evidence": "..."}],
  "significant_conversations": [{"title": "...", "why": "...", "source_id": "..."}]
}

Conversations:
{batch_json}

Prompt 4: Relationship & Emotional Patterns

Analyze how the user interacts with AI in these conversations.
Return JSON:
{
  "interaction_pattern": "collaborative|directive|exploratory|teaching",
  "pushback_frequency": "never|rare|sometimes|often",
  "gratitude_frequency": "never|rare|sometimes|often",
  "frustration_triggers": ["...", "..."],
  "excitement_triggers": ["...", "..."],
  "correction_style": "gentle|direct|frustrated",
  "evidence": ["quote1", "quote2"]
}

Conversations:
{batch_json}

Aggregation Logic

After all batches are processed:

  1. Fact deduplication: Same key from multiple batches? Keep highest confidence, merge evidence
  2. Style averaging: Communication style scores averaged across batches (weighted by conversation length)
  3. Topic ranking: Sort by total frequency across all batches
  4. Memory selection: Pick top N significant conversations (configurable, default 50)
  5. Conflict resolution: If batch A says "expert in Python" and batch B says "learning Python", the one with more evidence wins
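
Steps 1 and 5 reduce to a per-key merge that keeps the best value but accumulates all evidence; a sketch (Python; field names follow Prompt 1's output, with evidence normalized to a list of quotes):

```python
def merge_facts(batch_results: list) -> dict:
    """Deduplicate facts across batches: for each key keep the
    highest-confidence value, but accumulate evidence from every batch
    so conflicts can later be resolved by weight of evidence."""
    merged = {}
    for batch in batch_results:
        for fact in batch:
            key = fact["key"]
            existing = merged.get(key)
            if existing is None:
                merged[key] = {**fact, "evidence": list(fact.get("evidence", []))}
            else:
                existing["evidence"].extend(fact.get("evidence", []))
                if fact["confidence"] > existing["confidence"]:
                    existing.update(value=fact["value"],
                                    confidence=fact["confidence"])
    return merged
```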

Background Processing Service

public class ImportProcessingService : BackgroundService
{
    // Polls for imports with status = "uploaded"
    // Pipeline:
    // 1. Set status = "parsing"
    // 2. Detect format, parse to normalized conversations
    // 3. Store in import_conversations
    // 4. Set status = "analyzing"
    // 5. Chunk conversations into batches
    // 6. Run each analysis prompt against each batch via Kael (ai-02:8000)
    // 7. Aggregate results
    // 8. Store in import_extracted_data
    // 9. Set status = "ready_for_review"
    // On error: set status = "failed", store error_message
}
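
The import's status field is effectively a small state machine; a sketch of the legal transitions (Python; the status names are the ones used throughout this plan):

```python
TRANSITIONS = {
    "uploaded": {"parsing", "failed"},
    "parsing": {"analyzing", "failed"},
    "analyzing": {"ready_for_review", "failed"},
    "ready_for_review": {"committed", "failed"},
}

def advance(current: str, target: str) -> str:
    """Guard status updates so a crashed or duplicate worker
    can't move an import backwards or skip review."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target

print(advance("uploaded", "parsing"))  # parsing
```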

Frontend: Review UI

Upload Page (/import)

  • Drag-and-drop zone for ZIP/JSON files
  • Provider auto-detection indicator
  • Upload progress bar
  • Transitions to processing view on upload

Processing View (/import/{id})

  • Status indicator with steps: Uploading > Parsing > Analyzing > Ready
  • Stats: conversation count, message count, estimated time remaining
  • Auto-transitions to review when ready

Review Page (/import/{id}/review)

  • Three-column layout:
    • Identity Profile (left): display name, communication style, expertise domains -- editable
    • Facts & Preferences (center): card per fact, each with approve/reject toggle, confidence badge, source evidence expandable
    • Memories (right): episodic/semantic/procedural tabs, each memory card with approve/reject
  • Bulk actions: approve all, reject all, approve above confidence threshold
  • "Commit" button (bottom) with summary: "Creating identity with 47 facts, 23 memories, 12 preferences"

Done Page (/import/{id}/done)

  • API key (shown once, copy button)
  • Quick-connect instructions for Claude Code, Cursor, VS Code
  • "Your AI is ready" confirmation with identity summary

Phase 1 Estimated Effort

Component                                       Effort
Database migrations + EF entities               2h
Import controller + DTOs                        3h
ChatGPT parser (tree walking)                   4h
Format auto-detection                           1h
Background processing service                   3h
Kael integration (prompt execution)             2h
Analysis prompts (4 prompts, tuning)            4h
Aggregation logic                               3h
Commit logic (write to memory/facts/identity)   3h
Frontend: upload page                           3h
Frontend: processing view                       2h
Frontend: review page                           6h
Frontend: done page                             1h
Integration tests (real PostgreSQL)             4h
Total                                           ~41h (~1 week focused, 2 weeks realistic)

Phase 2: Claude Import

Target: 3-4 days after Phase 1

Claude's format is simpler than ChatGPT's -- flat message arrays instead of trees. Most of the pipeline already exists from Phase 1.

New Work

public class ClaudeParser : IImportParser
{
    public string Provider => "claude";

    public bool CanParse(ZipArchive archive)
    {
        // Claude ZIP has conversations.json but NO chat.html
        // And conversations have uuid + chat_messages fields
        var entry = archive.GetEntry("conversations.json");
        if (entry == null) return false;
        // Peek at first object: look for "chat_messages" field (Claude)
        // vs "mapping" field (ChatGPT)
        return PeekForField(entry, "chat_messages");
    }

    public async Task<List<NormalizedConversation>> ParseAsync(ZipArchive archive)
    {
        // 1. Read conversations.json
        // 2. For each conversation, iterate the chat_messages array (already in order)
        // 3. Map sender: "human" -> "user", "assistant" -> "assistant"
        // 4. Parse ISO 8601 timestamps
        // 5. Handle the content[] array (join text blocks)
        // 6. Handle attachments/files metadata
        throw new NotImplementedException(); // design sketch -- not yet implemented
    }
}

Effort: ~7h focused (3-4 days realistic)

Component                                                          Effort
Claude parser                                                      2h
Format detection update                                            30min
Test data + integration tests                                      2h
Prompt tuning (Claude conversations may have different patterns)   2h
Total                                                              ~7h

Phase 3: Gemini + Others

Target: 3-4 days after Phase 2

Gemini Parser

public class GeminiParser : IImportParser
{
    public string Provider => "gemini";

    public bool CanParse(ZipArchive archive)
    {
        // Look for Gemini/ directory structure in ZIP
        return archive.Entries.Any(e => e.FullName.StartsWith("Gemini/"));
    }

    public async Task<List<NormalizedConversation>> ParseAsync(ZipArchive archive)
    {
        // 1. Find all JSON files under the Gemini/ directory
        // 2. Each file is one conversation
        // 3. Map author: "user" -> "user", "model" -> "assistant"
        // 4. Parse ISO 8601 timestamps
        // 5. Strip location metadata (privacy)
        throw new NotImplementedException(); // design sketch -- not yet implemented
    }
}

Future Providers

Each new provider is just a new IImportParser implementation. The analysis pipeline, review UI, and commit logic are completely reusable. Adding a new provider is a half-day of work once Phase 1 is done.
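
The dispatch is a first-match scan over registered parsers; a sketch (Python; archives modeled as filename lists rather than ZipArchive, and the Claude check simplified to the absence of chat.html):

```python
class ChatGptDetector:
    provider = "chatgpt"
    def can_parse(self, names):
        return "conversations.json" in names and "chat.html" in names

class ClaudeDetector:
    provider = "claude"
    def can_parse(self, names):
        # Simplified: the real check also peeks for a chat_messages field
        return "conversations.json" in names and "chat.html" not in names

class GeminiDetector:
    provider = "gemini"
    def can_parse(self, names):
        return any(n.startswith("Gemini/") for n in names)

PARSERS = [ChatGptDetector(), ClaudeDetector(), GeminiDetector()]

def detect_provider(names):
    """Return the provider of the first parser that claims the archive."""
    for parser in PARSERS:
        if parser.can_parse(names):
            return parser.provider
    return None  # unknown format -> reject with a clear error
```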

Effort: ~6h focused (2-3 days realistic)

Component                    Effort
Gemini parser                2h
Multi-file ZIP handling      1h
Format detection update      30min
Tests                        2h
Total                        ~6h

Architecture Decisions

Why Background Processing (Not Synchronous)

A heavy ChatGPT user might have 2,000+ conversations. Parsing is fast, but analysis is not: 2,000 conversations at ~20 per batch is ~100 batches, and with four analysis prompts per batch at 30s-2min each, a large import can take an hour or more through Kael. Background processing with polling keeps the UX responsive.

Why Batch, Not Stream

We need cross-conversation aggregation: a user mentioning "Python" in 50 conversations is a different signal than mentioning it once. Batching lets us aggregate before presenting results.

Why Local Models Only

  • Zero marginal cost per import (Kael is already running)
  • No data leaves the infrastructure
  • No rate limits or API quotas
  • Privacy is the selling point -- can't undermine it by sending data to OpenAI/Anthropic for analysis

Why Review Before Commit

  • Trust building: users see exactly what was extracted
  • Error correction: LLM extraction isn't perfect, users can fix mistakes
  • Privacy control: users might not want certain facts stored
  • Legal: explicit consent before data processing

Why Temporary Raw Storage

import_conversations stores the full normalized conversations during processing, then gets cleaned up after commit. We don't keep raw chat history -- only the extracted intelligence. This is a privacy feature, not a limitation.


Risk Mitigation

  • ChatGPT changes export format: the parser is isolated -- update one class. The community tracks format changes.
  • Large exports (10k+ conversations): chunking + streaming parse; set upload size limits; show progress indicators.
  • Kael overload during bulk imports: queue imports and process one at a time per tenant; rate-limit the endpoint.
  • Low extraction quality: confidence scores + user review; iterate on prompts; track approval rates to improve.
  • Users upload non-chat ZIP files: validate format before processing; return clear error messages.
  • Duplicate imports: check for an existing import from the same provider with a similar conversation count; warn the user.
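
The duplicate-import check ("similar conversation count") can start as a simple tolerance comparison (Python sketch; the ±5% threshold is an assumption, not specified above):

```python
def looks_duplicate(existing_counts, new_count, tolerance=0.05):
    """Warn if a prior import from the same provider has a conversation
    count within the tolerance band of the new upload."""
    return any(abs(new_count - c) <= tolerance * max(c, 1)
               for c in existing_counts)

print(looks_duplicate([2000], 1990))  # True
```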

Success Metrics

  • Import completion rate: % of uploads that reach "committed" status
  • Approval rate: % of extracted items approved (target: >80%)
  • Time to first connection: minutes from upload to MCP client connection
  • Retention: do imported users stay active vs organic signups?
  • Conversation coverage: what % of conversations yield useful extraction?