Status: Planned — This feature is on the roadmap and not yet implemented. The architecture below describes the intended design.

Chat Import -- Implementation Plan


Phase 1: ChatGPT Import (Biggest Market Right Now)

Target: 2 weeks

This is where the users are. The QuitGPT movement is real, the export format is well-documented, and ChatGPT users have the deepest conversation histories.

Week 1: Core Pipeline

API Endpoints & Models

// DTOs
public record ImportUploadResponse(Guid ImportId, string Status, int ConversationCount, int MessageCount);
public record ImportStatusResponse(Guid ImportId, string Status, string SourceProvider, int ConversationCount, int MessageCount, DateTime CreatedAt, DateTime? CommittedAt, string? Error);
public record ImportPreviewResponse(Guid ImportId, ImportedIdentityProfile Identity, List<ExtractedDataItem> Facts, List<ExtractedDataItem> Memories, List<ExtractedDataItem> Preferences);
public record ExtractedDataItem(long Id, string DataType, string Key, JsonElement Value, float Confidence, bool? Approved);
public record ImportCommitRequest(List<long> ApprovedItemIds);
public record ImportCommitResponse(Guid ImportId, Guid IdentityId, string ApiKey, int MemoriesCreated, int FactsStored);

public record ImportedIdentityProfile
{
    public string? DisplayName { get; init; }
    public string? CommunicationStyle { get; init; }  // "casual", "formal", "technical"
    public string? Verbosity { get; init; }            // "concise", "moderate", "detailed"
    public List<string> ExpertiseDomains { get; init; } = new();
    public List<string> FrequentTopics { get; init; } = new();
    public string? InteractionStyle { get; init; }     // "collaborative", "directive", "exploratory"
}

Endpoints

POST   /api/import/upload
  - Accept: multipart/form-data
  - Body: file (ZIP or JSON), optional: displayName
  - Auth: optional (can import before account creation)
  - Returns: ImportUploadResponse
  - Kicks off background processing

GET    /api/import/{importId}
  - Returns: ImportStatusResponse
  - Poll this for processing status

GET    /api/import/{importId}/preview
  - Returns: ImportPreviewResponse
  - Only available when status = "ready_for_review"

PATCH  /api/import/{importId}/preview
  - Body: list of { id, approved: bool } decisions
  - Updates approval status on extracted data items

POST   /api/import/{importId}/commit
  - Body: ImportCommitRequest (list of approved item IDs)
  - Creates identity, seeds memories, generates API key
  - Returns: ImportCommitResponse

DELETE /api/import/{importId}
  - Soft-deletes import and all associated data
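
From the client's side, the flow is upload, poll, review, commit. A sketch of the polling loop (Python; `fetch_status` is a hypothetical stand-in for an HTTP GET against /api/import/{importId}, not a real client):

```python
import time

def wait_for_review(fetch_status, import_id, poll_seconds=2.0, timeout=1800):
    """Poll the import status endpoint until the import is ready for
    review or has failed. fetch_status stands in for GET /api/import/{id}."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(import_id)["status"]
        if status in ("ready_for_review", "failed"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"import {import_id} still processing after {timeout}s")
```

The timeout and poll interval are illustrative defaults; a real client would back off as imports grow.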

ChatGPT Parser

public class ChatGptParser : IImportParser
{
    public string Provider => "chatgpt";

    public bool CanParse(ZipArchive archive)
    {
        // Check for conversations.json + chat.html (ChatGPT signature)
        var hasConversations = archive.GetEntry("conversations.json") != null;
        var hasChatHtml = archive.GetEntry("chat.html") != null;
        return hasConversations && hasChatHtml;
    }

    public async Task<List<NormalizedConversation>> ParseAsync(ZipArchive archive)
    {
        // 1. Read conversations.json
        // 2. For each conversation, walk the mapping tree from root to current_node
        // 3. Extract messages in order, resolving branches (follow the current_node path)
        // 4. Convert Unix timestamps to DateTime
        // 5. Map author.role: "user" -> "user", "assistant" -> "assistant", "system" -> "system"
        // 6. Extract content from the parts[] array (join text parts, note non-text parts)
        // 7. Return the NormalizedConversation list
        throw new NotImplementedException(); // design sketch -- not yet implemented
    }
}

Tree walking algorithm (ChatGPT's mapping is a tree, not a list):

  1. Find the root node (node with no parent or parent is null)
  2. Follow children arrays, always picking the child that leads to current_node
  3. This gives the "as experienced" conversation path, ignoring edited/regenerated branches
  4. Optionally preserve branches as metadata (edited messages are interesting signal)
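
A minimal sketch of the walk (Python for brevity; assumes each mapping node carries `parent`, `children`, and an optional `message`, as in conversations.json). Following parent links up from `current_node` and reversing yields the same path as the root-down walk described above, without searching children:

```python
def walk_conversation(mapping: dict, current_node: str) -> list:
    """Collect messages along the path from root to current_node,
    ignoring edited/regenerated sibling branches."""
    path = []
    node_id = current_node
    while node_id is not None:
        node = mapping[node_id]
        message = node.get("message")
        if message is not None:          # root/system stubs may carry no message
            path.append(message)
        node_id = node.get("parent")
    path.reverse()                       # root-first order
    return path

# Tiny synthetic mapping: root -> a -> b (current), with an ignored branch a -> c
mapping = {
    "root": {"parent": None, "children": ["a"], "message": None},
    "a": {"parent": "root", "children": ["b", "c"], "message": {"id": "a"}},
    "b": {"parent": "a", "children": [], "message": {"id": "b"}},
    "c": {"parent": "a", "children": [], "message": {"id": "c"}},
}
print([m["id"] for m in walk_conversation(mapping, "b")])  # ['a', 'b']
```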

Database Migrations

-- imports, import_extracted_data, import_conversations tables
-- As specified in chat-import.md
-- Add indexes:
CREATE INDEX idx_imports_tenant_id ON imports(tenant_id);
CREATE INDEX idx_imports_status ON imports(status);
CREATE INDEX idx_import_extracted_data_import_id ON import_extracted_data(import_id);
CREATE INDEX idx_import_conversations_import_id ON import_conversations(import_id);

Week 2: Analysis Pipeline & Review UI

Analysis Prompts (Kael / Qwen 30B)

Each prompt processes a batch of ~20 normalized conversations. All prompts return structured JSON.
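
The chunking step is straightforward; a sketch (Python; the per-batch count matches the ~20 above, while the character cap is an added assumption to guard the model's context window, not part of the plan):

```python
def make_batches(conversations, max_per_batch=20, max_chars=60_000):
    """Group conversations into batches of at most max_per_batch items,
    also capping total text size so a batch fits the model context."""
    batches, current, current_chars = [], [], 0
    for conv in conversations:
        size = sum(len(m["text"]) for m in conv["messages"])
        if current and (len(current) >= max_per_batch
                        or current_chars + size > max_chars):
            batches.append(current)
            current, current_chars = [], 0
        current.append(conv)
        current_chars += size
    if current:
        batches.append(current)
    return batches
```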

Prompt 1: Facts & Preferences Extraction

You are analyzing a batch of conversations to extract factual information about the user.
Return a JSON array of facts. Each fact has: key, value, confidence (0-1), evidence (quote from conversation).

Categories to extract:
- Personal: name, location, timezone, job_title, company, team_members
- Technical: programming_languages, frameworks, databases, tools, OS, editor
- Projects: project_names, project_descriptions, deadlines
- Preferences: code_style, communication_preferences, workflow_habits

Conversations:
{batch_json}

Return ONLY valid JSON. No explanation.

Prompt 2: Communication Style Analysis

Analyze the USER messages (not the assistant) in these conversations.
Return JSON with:
{
  "formality": "casual|moderate|formal",
  "verbosity": "concise|moderate|detailed",
  "emoji_usage": "none|rare|moderate|frequent",
  "question_style": "direct|exploratory|socratic",
  "instruction_style": "collaborative|directive|deferential",
  "technical_depth": "beginner|intermediate|advanced|expert",
  "evidence": ["quote1", "quote2"]
}

Conversations:
{batch_json}

Prompt 3: Topic & Expertise Extraction

Analyze these conversations and identify:
1. Recurring topics (what the user talks about most)
2. Expertise areas (what they clearly know well vs what they're learning)
3. Significant conversations (breakthroughs, major decisions, turning points)

Return JSON:
{
  "topics": [{"name": "...", "frequency": N, "examples": ["..."]}],
  "expertise": [{"domain": "...", "level": "learning|competent|expert", "evidence": "..."}],
  "significant_conversations": [{"title": "...", "why": "...", "source_id": "..."}]
}

Conversations:
{batch_json}

Prompt 4: Relationship & Emotional Patterns

Analyze how the user interacts with AI in these conversations.
Return JSON:
{
  "interaction_pattern": "collaborative|directive|exploratory|teaching",
  "pushback_frequency": "never|rare|sometimes|often",
  "gratitude_frequency": "never|rare|sometimes|often",
  "frustration_triggers": ["...", "..."],
  "excitement_triggers": ["...", "..."],
  "correction_style": "gentle|direct|frustrated",
  "evidence": ["quote1", "quote2"]
}

Conversations:
{batch_json}

Aggregation Logic

After all batches are processed:

  1. Fact deduplication: Same key from multiple batches? Keep highest confidence, merge evidence
  2. Style averaging: Communication style scores averaged across batches (weighted by conversation length)
  3. Topic ranking: Sort by total frequency across all batches
  4. Memory selection: Pick top N significant conversations (configurable, default 50)
  5. Conflict resolution: If batch A says "expert in Python" and batch B says "learning Python", the one with more evidence wins
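
Steps 1 and 5 reduce to a per-key merge that keeps the best value but accumulates all evidence; a sketch (Python; field names follow Prompt 1's output, with evidence normalized to a list of quotes):

```python
def merge_facts(batch_results: list) -> dict:
    """Deduplicate facts across batches: for each key keep the
    highest-confidence value, but accumulate evidence from every batch
    so conflicts can later be resolved by weight of evidence."""
    merged = {}
    for batch in batch_results:
        for fact in batch:
            key = fact["key"]
            existing = merged.get(key)
            if existing is None:
                merged[key] = {**fact, "evidence": list(fact.get("evidence", []))}
            else:
                existing["evidence"].extend(fact.get("evidence", []))
                if fact["confidence"] > existing["confidence"]:
                    existing.update(value=fact["value"],
                                    confidence=fact["confidence"])
    return merged
```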

Background Processing Service

public class ImportProcessingService : BackgroundService
{
    // Polls for imports with status = "uploaded"
    // Pipeline:
    // 1. Set status = "parsing"
    // 2. Detect format, parse to normalized conversations
    // 3. Store in import_conversations
    // 4. Set status = "analyzing"
    // 5. Chunk conversations into batches
    // 6. Run each analysis prompt against each batch via Kael (ai-02:8000)
    // 7. Aggregate results
    // 8. Store in import_extracted_data
    // 9. Set status = "ready_for_review"
    // On error: set status = "failed", store error_message
}
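
The import's status field is effectively a small state machine; a sketch of the legal transitions (Python; the status names are the ones used throughout this plan):

```python
TRANSITIONS = {
    "uploaded": {"parsing", "failed"},
    "parsing": {"analyzing", "failed"},
    "analyzing": {"ready_for_review", "failed"},
    "ready_for_review": {"committed", "failed"},
}

def advance(current: str, target: str) -> str:
    """Guard status updates so a crashed or duplicate worker
    can't move an import backwards or skip review."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target

print(advance("uploaded", "parsing"))  # parsing
```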

Frontend: Review UI

Upload Page (/import)

  • Drag-and-drop zone for ZIP/JSON files
  • Provider auto-detection indicator
  • Upload progress bar
  • Transitions to processing view on upload

Processing View (/import/{id})

  • Status indicator with steps: Uploading > Parsing > Analyzing > Ready
  • Stats: conversation count, message count, estimated time remaining
  • Auto-transitions to review when ready

Review Page (/import/{id}/review)

  • Three-column layout:
    • Identity Profile (left): display name, communication style, expertise domains -- editable
    • Facts & Preferences (center): card per fact, each with approve/reject toggle, confidence badge, source evidence expandable
    • Memories (right): episodic/semantic/procedural tabs, each memory card with approve/reject
  • Bulk actions: approve all, reject all, approve above confidence threshold
  • "Commit" button (bottom) with summary: "Creating identity with 47 facts, 23 memories, 12 preferences"

Done Page (/import/{id}/done)

  • API key (shown once, copy button)
  • Quick-connect instructions for Claude Code, Cursor, VS Code
  • "Your AI is ready" confirmation with identity summary

Phase 1 Estimated Effort

Component                                       Effort
Database migrations + EF entities               2h
Import controller + DTOs                        3h
ChatGPT parser (tree walking)                   4h
Format auto-detection                           1h
Background processing service                   3h
Kael integration (prompt execution)             2h
Analysis prompts (4 prompts, tuning)            4h
Aggregation logic                               3h
Commit logic (write to memory/facts/identity)   3h
Frontend: upload page                           3h
Frontend: processing view                       2h
Frontend: review page                           6h
Frontend: done page                             1h
Integration tests (real PostgreSQL)             4h
Total                                           ~41h (~1 week focused, 2 weeks realistic)

Phase 2: Claude Import

Target: 3-4 days after Phase 1

Claude's format is simpler than ChatGPT's -- flat message arrays instead of trees. Most of the pipeline already exists from Phase 1.

New Work

public class ClaudeParser : IImportParser
{
    public string Provider => "claude";

    public bool CanParse(ZipArchive archive)
    {
        // Claude ZIP has conversations.json but NO chat.html
        // And conversations have uuid + chat_messages fields
        var entry = archive.GetEntry("conversations.json");
        if (entry == null) return false;
        // Peek at first object: look for "chat_messages" field (Claude)
        // vs "mapping" field (ChatGPT)
        return PeekForField(entry, "chat_messages");
    }

    public async Task<List<NormalizedConversation>> ParseAsync(ZipArchive archive)
    {
        // 1. Read conversations.json
        // 2. For each conversation, iterate the chat_messages array (already in order)
        // 3. Map sender: "human" -> "user", "assistant" -> "assistant"
        // 4. Parse ISO 8601 timestamps
        // 5. Handle the content[] array (join text blocks)
        // 6. Handle attachments/files metadata
        throw new NotImplementedException(); // design sketch -- not yet implemented
    }
}

Effort: ~7h focused (3-4 days realistic)

Component                                                          Effort
Claude parser                                                      2h
Format detection update                                            30min
Test data + integration tests                                      2h
Prompt tuning (Claude conversations may have different patterns)   2h
Total                                                              ~7h

Phase 3: Gemini + Others

Target: 3-4 days after Phase 2

Gemini Parser

public class GeminiParser : IImportParser
{
    public string Provider => "gemini";

    public bool CanParse(ZipArchive archive)
    {
        // Look for Gemini/ directory structure in ZIP
        return archive.Entries.Any(e => e.FullName.StartsWith("Gemini/"));
    }

    public async Task<List<NormalizedConversation>> ParseAsync(ZipArchive archive)
    {
        // 1. Find all JSON files under the Gemini/ directory
        // 2. Each file is one conversation
        // 3. Map author: "user" -> "user", "model" -> "assistant"
        // 4. Parse ISO 8601 timestamps
        // 5. Strip location metadata (privacy)
        throw new NotImplementedException(); // design sketch -- not yet implemented
    }
}

Future Providers

Each new provider is just a new IImportParser implementation. The analysis pipeline, review UI, and commit logic are completely reusable. Adding a new provider is a half-day of work once Phase 1 is done.
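
The dispatch is a first-match scan over registered parsers; a sketch (Python; archives modeled as filename lists rather than ZipArchive, and the Claude check simplified to the absence of chat.html):

```python
class ChatGptDetector:
    provider = "chatgpt"
    def can_parse(self, names):
        return "conversations.json" in names and "chat.html" in names

class ClaudeDetector:
    provider = "claude"
    def can_parse(self, names):
        # Simplified: the real check also peeks for a chat_messages field
        return "conversations.json" in names and "chat.html" not in names

class GeminiDetector:
    provider = "gemini"
    def can_parse(self, names):
        return any(n.startswith("Gemini/") for n in names)

PARSERS = [ChatGptDetector(), ClaudeDetector(), GeminiDetector()]

def detect_provider(names):
    """Return the provider of the first parser that claims the archive."""
    for parser in PARSERS:
        if parser.can_parse(names):
            return parser.provider
    return None  # unknown format -> reject with a clear error
```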

Effort: ~6h focused (2-3 days realistic)

Component                    Effort
Gemini parser                2h
Multi-file ZIP handling      1h
Format detection update      30min
Tests                        2h
Total                        ~6h

Architecture Decisions

Why Background Processing (Not Synchronous)

A heavy ChatGPT user might have 2,000+ conversations. Parsing is fast, but analysis is not: 2,000 conversations at ~20 per batch is ~100 batches, and with four analysis prompts per batch at 30s-2min each, a large import can take an hour or more through Kael. Background processing with polling keeps the UX responsive.

Why Batch, Not Stream

We need cross-conversation aggregation: a user mentioning "Python" in 50 conversations is a different signal than mentioning it once. Batching lets us aggregate before presenting results.

Why Local Models Only

  • Zero marginal cost per import (Kael is already running)
  • No data leaves the infrastructure
  • No rate limits or API quotas
  • Privacy is the selling point -- can't undermine it by sending data to OpenAI/Anthropic for analysis

Why Review Before Commit

  • Trust building: users see exactly what was extracted
  • Error correction: LLM extraction isn't perfect, users can fix mistakes
  • Privacy control: users might not want certain facts stored
  • Legal: explicit consent before data processing

Why Temporary Raw Storage

import_conversations stores the full normalized conversations during processing, then gets cleaned up after commit. We don't keep raw chat history -- only the extracted intelligence. This is a privacy feature, not a limitation.


Risk Mitigation

  • ChatGPT changes export format: the parser is isolated -- update one class. The community tracks format changes.
  • Large exports (10k+ conversations): chunking + streaming parse; set upload size limits; show progress indicators.
  • Kael overload during bulk imports: queue imports and process one at a time per tenant; rate-limit the endpoint.
  • Low extraction quality: confidence scores + user review; iterate on prompts; track approval rates to improve.
  • Users upload non-chat ZIP files: validate format before processing; return clear error messages.
  • Duplicate imports: check for an existing import from the same provider with a similar conversation count; warn the user.
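
The duplicate-import check ("similar conversation count") can start as a simple tolerance comparison (Python sketch; the ±5% threshold is an assumption, not specified above):

```python
def looks_duplicate(existing_counts, new_count, tolerance=0.05):
    """Warn if a prior import from the same provider has a conversation
    count within the tolerance band of the new upload."""
    return any(abs(new_count - c) <= tolerance * max(c, 1)
               for c in existing_counts)

print(looks_duplicate([2000], 1990))  # True
```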

Success Metrics

  • Import completion rate: % of uploads that reach "committed" status
  • Approval rate: % of extracted items approved (target: >80%)
  • Time to first connection: minutes from upload to MCP client connection
  • Retention: do imported users stay active vs organic signups?
  • Conversation coverage: what % of conversations yield useful extraction?