Status: Planned — This feature is on the roadmap and not yet implemented. The architecture below describes the intended design.
Chat Import -- Implementation Plan
Phase 1: ChatGPT Import (Biggest Market Right Now)
Target: 2 weeks
This is where the users are. The QuitGPT movement is real, the export format is well-documented, and ChatGPT users have the deepest conversation histories.
Week 1: Core Pipeline
API Endpoints & Models
// DTOs
public record ImportUploadResponse(Guid ImportId, string Status, int ConversationCount, int MessageCount);
public record ImportStatusResponse(Guid ImportId, string Status, string SourceProvider, int ConversationCount, int MessageCount, DateTime CreatedAt, DateTime? CommittedAt, string? Error);
public record ImportPreviewResponse(Guid ImportId, ImportedIdentityProfile Identity, List<ExtractedDataItem> Facts, List<ExtractedDataItem> Memories, List<ExtractedDataItem> Preferences);
public record ExtractedDataItem(long Id, string DataType, string Key, JsonElement Value, float Confidence, bool? Approved);
public record ImportCommitRequest(List<long> ApprovedItemIds);
public record ImportCommitResponse(Guid ImportId, Guid IdentityId, string ApiKey, int MemoriesCreated, int FactsStored);
public record ImportedIdentityProfile
{
    public string? DisplayName { get; init; }
    public string? CommunicationStyle { get; init; } // "casual", "formal", "technical"
    public string? Verbosity { get; init; } // "concise", "moderate", "detailed"
    public List<string> ExpertiseDomains { get; init; } = new();
    public List<string> FrequentTopics { get; init; } = new();
    public string? InteractionStyle { get; init; } // "collaborative", "directive", "exploratory"
}
Endpoints
POST /api/import/upload
- Accept: multipart/form-data
- Body: file (ZIP or JSON), optional: displayName
- Auth: optional (can import before account creation)
- Returns: ImportUploadResponse
- Kicks off background processing
GET /api/import/{importId}
- Returns: ImportStatusResponse
- Poll this for processing status
GET /api/import/{importId}/preview
- Returns: ImportPreviewResponse
- Only available when status = "ready_for_review"
PATCH /api/import/{importId}/preview
- Body: list of { id, approved: bool } decisions
- Updates approval status on extracted data items
POST /api/import/{importId}/commit
- Body: ImportCommitRequest (list of approved item IDs)
- Creates identity, seeds memories, generates API key
- Returns: ImportCommitResponse
DELETE /api/import/{importId}
- Soft-deletes import and all associated data
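The PATCH preview body is the one payload above without a named DTO; a minimal sketch of how a client might build it, assuming System.Text.Json with camelCase naming (the naming policy is an assumption, not something the endpoint list specifies):

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Illustrative record for one { id, approved } decision.
public record PreviewDecision(long Id, bool Approved);

public static class PreviewPatch
{
    // Serializes the decision list as the PATCH body. CamelCase naming
    // is assumed here; adjust to match the actual API conventions.
    public static string BuildBody(IEnumerable<PreviewDecision> decisions) =>
        JsonSerializer.Serialize(decisions,
            new JsonSerializerOptions { PropertyNamingPolicy = JsonNamingPolicy.CamelCase });
}
```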
ChatGPT Parser
public class ChatGptParser : IImportParser
{
    public string Provider => "chatgpt";

    public bool CanParse(ZipArchive archive)
    {
        // Check for conversations.json + chat.html (ChatGPT signature)
        var hasConversations = archive.GetEntry("conversations.json") != null;
        var hasChatHtml = archive.GetEntry("chat.html") != null;
        return hasConversations && hasChatHtml;
    }

    public async Task<List<NormalizedConversation>> ParseAsync(ZipArchive archive)
    {
        // 1. Read conversations.json
        // 2. For each conversation, walk the mapping tree from root to current_node
        // 3. Extract messages in order, resolve branches (follow current_node path)
        // 4. Convert Unix timestamps to DateTime
        // 5. Map author.role: "user" -> "user", "assistant" -> "assistant", "system" -> "system"
        // 6. Extract content from parts[] array (join text parts, note non-text parts)
        // 7. Return NormalizedConversation list
    }
}
Tree walking algorithm (ChatGPT's mapping is a tree, not a list):
- Find the root node (node with no parent or parent is null)
- Follow children arrays, always picking the child that leads to current_node
- This gives the "as experienced" conversation path, ignoring edited/regenerated branches
- Optionally preserve branches as metadata (edited messages are interesting signal)
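The walk above can be sketched by following parent links up from current_node and reversing, which yields the same path as descending through children arrays. The node shape below mirrors the export's mapping entries, but the type and property names are illustrative:

```csharp
using System;
using System.Collections.Generic;

// Illustrative node shape for one entry in ChatGPT's mapping tree.
public record MappingNode(string Id, string? Parent, List<string> Children, string? MessageText);

public static class TreeWalker
{
    // Walks parent links from current_node to the root, then reverses.
    // Edited/regenerated siblings never sit on the parent chain, so
    // abandoned branches are skipped automatically.
    public static List<string> WalkActivePath(
        Dictionary<string, MappingNode> mapping, string currentNodeId)
    {
        var texts = new List<string>();
        string? id = currentNodeId;
        while (id != null && mapping.TryGetValue(id, out var node))
        {
            if (node.MessageText != null) texts.Add(node.MessageText);
            id = node.Parent;
        }
        texts.Reverse();
        return texts;
    }
}
```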
Database Migrations
-- imports, import_extracted_data, import_conversations tables
-- As specified in chat-import.md
-- Add indexes:
CREATE INDEX idx_imports_tenant_id ON imports(tenant_id);
CREATE INDEX idx_imports_status ON imports(status);
CREATE INDEX idx_import_extracted_data_import_id ON import_extracted_data(import_id);
CREATE INDEX idx_import_conversations_import_id ON import_conversations(import_id);
Week 2: Analysis Pipeline & Review UI
Analysis Prompts (Kael / Qwen 30B)
Each prompt processes a batch of ~20 normalized conversations. All prompts return structured JSON.
Prompt 1: Facts & Preferences Extraction
You are analyzing a batch of conversations to extract factual information about the user.
Return a JSON array of facts. Each fact has: key, value, confidence (0-1), evidence (quote from conversation).
Categories to extract:
- Personal: name, location, timezone, job_title, company, team_members
- Technical: programming_languages, frameworks, databases, tools, OS, editor
- Projects: project_names, project_descriptions, deadlines
- Preferences: code_style, communication_preferences, workflow_habits
Conversations:
{batch_json}
Return ONLY valid JSON. No explanation.
Prompt 2: Communication Style Analysis
Analyze the USER messages (not the assistant) in these conversations.
Return JSON with:
{
  "formality": "casual|moderate|formal",
  "verbosity": "concise|moderate|detailed",
  "emoji_usage": "none|rare|moderate|frequent",
  "question_style": "direct|exploratory|socratic",
  "instruction_style": "collaborative|directive|deferential",
  "technical_depth": "beginner|intermediate|advanced|expert",
  "evidence": ["quote1", "quote2"]
}
Conversations:
{batch_json}
Prompt 3: Topic & Expertise Extraction
Analyze these conversations and identify:
1. Recurring topics (what the user talks about most)
2. Expertise areas (what they clearly know well vs what they're learning)
3. Significant conversations (breakthroughs, major decisions, turning points)
Return JSON:
{
  "topics": [{"name": "...", "frequency": N, "examples": ["..."]}],
  "expertise": [{"domain": "...", "level": "learning|competent|expert", "evidence": "..."}],
  "significant_conversations": [{"title": "...", "why": "...", "source_id": "..."}]
}
Conversations:
{batch_json}
Prompt 4: Relationship & Emotional Patterns
Analyze how the user interacts with AI in these conversations.
Return JSON:
{
  "interaction_pattern": "collaborative|directive|exploratory|teaching",
  "pushback_frequency": "never|rare|sometimes|often",
  "gratitude_frequency": "never|rare|sometimes|often",
  "frustration_triggers": ["...", "..."],
  "excitement_triggers": ["...", "..."],
  "correction_style": "gentle|direct|frustrated",
  "evidence": ["quote1", "quote2"]
}
Conversations:
{batch_json}
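"Return ONLY valid JSON. No explanation." is an instruction, not a guarantee: local models still wrap payloads in code fences or add a line of preamble. A defensive parsing sketch (not part of the prompts themselves; it assumes the payload is the only braced region in the output):

```csharp
using System;
using System.Text.Json;

public static class ModelJson
{
    // Trims model output down to its outermost JSON value before parsing,
    // tolerating ```json fences and short preambles.
    public static JsonDocument ParseLoose(string raw)
    {
        var start = raw.IndexOfAny(new[] { '{', '[' });
        var end = raw.LastIndexOfAny(new[] { '}', ']' });
        if (start < 0 || end <= start)
            throw new FormatException("No JSON value found in model output.");
        return JsonDocument.Parse(raw.Substring(start, end - start + 1));
    }
}
```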
Aggregation Logic
After all batches are processed:
- Fact deduplication: Same key from multiple batches? Keep highest confidence, merge evidence
- Style averaging: Communication style scores averaged across batches (weighted by conversation length)
- Topic ranking: Sort by total frequency across all batches
- Memory selection: Pick top N significant conversations (configurable, default 50)
- Conflict resolution: If batch A says "expert in Python" and batch B says "learning Python", the one with more evidence wins
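The fact-deduplication rule above (same key: keep highest confidence, merge evidence) reduces to a single grouping pass. The record and field names here are illustrative, not the final entity shape:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative shape for one extracted fact from a batch.
public record ExtractedFact(string Key, string Value, float Confidence, List<string> Evidence);

public static class FactAggregator
{
    // Same key seen in multiple batches: keep the highest-confidence
    // value, merge all distinct evidence quotes onto it.
    public static List<ExtractedFact> Deduplicate(IEnumerable<ExtractedFact> facts) =>
        facts.GroupBy(f => f.Key)
             .Select(g =>
             {
                 var best = g.OrderByDescending(f => f.Confidence).First();
                 var evidence = g.SelectMany(f => f.Evidence).Distinct().ToList();
                 return best with { Evidence = evidence };
             })
             .ToList();
}
```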
Background Processing Service
public class ImportProcessingService : BackgroundService
{
    // Polls for imports with status = "uploaded"
    // Pipeline:
    // 1. Set status = "parsing"
    // 2. Detect format, parse to normalized conversations
    // 3. Store in import_conversations
    // 4. Set status = "analyzing"
    // 5. Chunk conversations into batches
    // 6. Run each analysis prompt against each batch via Kael (ai-02:8000)
    // 7. Aggregate results
    // 8. Store in import_extracted_data
    // 9. Set status = "ready_for_review"
    // On error: set status = "failed", store error_message
}
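Step 5 is the one pure piece of the pipeline worth pinning down early. A sketch using the ~20-conversation batch size from the analysis section as the default (on .NET 6+, Enumerable.Chunk does the same thing):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BatchPlanner
{
    // Splits conversations into analysis batches; 20 matches the size
    // assumed by the prompts, but it should stay configurable.
    public static List<List<T>> Chunk<T>(IReadOnlyList<T> items, int batchSize = 20)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        return items
            .Select((item, i) => (item, i))
            .GroupBy(x => x.i / batchSize)
            .Select(g => g.Select(x => x.item).ToList())
            .ToList();
    }
}
```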
Frontend: Review UI
Upload Page (/import)
- Drag-and-drop zone for ZIP/JSON files
- Provider auto-detection indicator
- Upload progress bar
- Transitions to processing view on upload
Processing View (/import/{id})
- Status indicator with steps: Uploading > Parsing > Analyzing > Ready
- Stats: conversation count, message count, estimated time remaining
- Auto-transitions to review when ready
Review Page (/import/{id}/review)
- Three-column layout:
  - Identity Profile (left): display name, communication style, expertise domains -- editable
  - Facts & Preferences (center): card per fact, each with approve/reject toggle, confidence badge, expandable source evidence
  - Memories (right): episodic/semantic/procedural tabs, each memory card with approve/reject
- Bulk actions: approve all, reject all, approve above confidence threshold
- "Commit" button (bottom) with summary: "Creating identity with 47 facts, 23 memories, 12 preferences"
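One subtlety in the threshold bulk action worth fixing now: it should not overwrite decisions the user already made by hand. A server-side sketch of the rule (type and field names illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative shape for one reviewable extracted item.
public record ReviewItem(long Id, float Confidence, bool? Approved);

public static class BulkActions
{
    // "Approve above confidence threshold": only undecided items
    // (Approved == null) flip to approved; explicit decisions stay.
    public static List<ReviewItem> ApproveAboveThreshold(
        IEnumerable<ReviewItem> items, float threshold) =>
        items.Select(i => i.Approved == null && i.Confidence >= threshold
                        ? i with { Approved = true }
                        : i)
             .ToList();
}
```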
Done Page (/import/{id}/done)
- API key (shown once, copy button)
- Quick-connect instructions for Claude Code, Cursor, VS Code
- "Your AI is ready" confirmation with identity summary
Phase 1 Estimated Effort
| Component | Effort |
|---|---|
| Database migrations + EF entities | 2h |
| Import controller + DTOs | 3h |
| ChatGPT parser (tree walking) | 4h |
| Format auto-detection | 1h |
| Background processing service | 3h |
| Kael integration (prompt execution) | 2h |
| Analysis prompts (4 prompts, tuning) | 4h |
| Aggregation logic | 3h |
| Commit logic (write to memory/facts/identity) | 3h |
| Frontend: upload page | 3h |
| Frontend: processing view | 2h |
| Frontend: review page | 6h |
| Frontend: done page | 1h |
| Integration tests (real PostgreSQL) | 4h |
| Total | ~41h / ~1 week focused, 2 weeks realistic |
Phase 2: Claude Import
Target: 3-4 days after Phase 1
Claude's format is simpler than ChatGPT's -- flat message arrays instead of trees. Most of the pipeline already exists from Phase 1.
New Work
public class ClaudeParser : IImportParser
{
    public string Provider => "claude";

    public bool CanParse(ZipArchive archive)
    {
        // Claude ZIP has conversations.json but NO chat.html
        // And conversations have uuid + chat_messages fields
        var entry = archive.GetEntry("conversations.json");
        if (entry == null) return false;
        // Peek at first object: look for "chat_messages" field (Claude)
        // vs "mapping" field (ChatGPT)
        return PeekForField(entry, "chat_messages");
    }

    public async Task<List<NormalizedConversation>> ParseAsync(ZipArchive archive)
    {
        // 1. Read conversations.json
        // 2. For each conversation, iterate chat_messages array (already in order)
        // 3. Map sender: "human" -> "user", "assistant" -> "assistant"
        // 4. Parse ISO 8601 timestamps
        // 5. Handle content[] array (join text blocks)
        // 6. Handle attachments/files metadata
    }
}
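PeekForField is referenced but not defined. One plausible implementation reads only the head of the entry, so a very large conversations.json never gets fully decompressed just for format detection (the 64 KB peek size is an assumption):

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class FormatPeek
{
    // Reads at most peekBytes from the entry and checks for a quoted
    // field name: "chat_messages" marks Claude, "mapping" marks ChatGPT.
    public static bool PeekForField(ZipArchiveEntry entry, string field, int peekBytes = 64 * 1024)
    {
        using var stream = entry.Open();
        var buffer = new byte[peekBytes];
        int read = 0, n;
        while (read < peekBytes && (n = stream.Read(buffer, read, peekBytes - read)) > 0)
            read += n;
        var head = Encoding.UTF8.GetString(buffer, 0, read);
        return head.Contains("\"" + field + "\"");
    }
}
```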
Effort: ~3-4 days
| Component | Effort |
|---|---|
| Claude parser | 2h |
| Format detection update | 30min |
| Test data + integration tests | 2h |
| Prompt tuning (Claude conversations may have different patterns) | 2h |
| Total | ~7h |
Phase 3: Gemini + Others
Target: 3-4 days after Phase 2
Gemini Parser
public class GeminiParser : IImportParser
{
    public string Provider => "gemini";

    public bool CanParse(ZipArchive archive)
    {
        // Takeout ZIPs nest product folders under a top-level Takeout/
        // directory, so match Gemini/ anywhere in the entry path
        return archive.Entries.Any(e => e.FullName.Contains("Gemini/"));
    }

    public async Task<List<NormalizedConversation>> ParseAsync(ZipArchive archive)
    {
        // 1. Find all JSON files under the Gemini/ directory
        // 2. Each file is one conversation
        // 3. Map author: "user" -> "user", "model" -> "assistant"
        // 4. Parse ISO 8601 timestamps
        // 5. Strip location metadata (privacy)
    }
}
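Step 5 (strip location metadata) can be a small recursive pass over the parsed JSON before anything is stored. The property names checked here are assumptions about the Takeout schema, not confirmed fields:

```csharp
using System;
using System.Text.Json.Nodes;

public static class GeminiPrivacy
{
    // Removes assumed location-bearing keys anywhere in the tree.
    // The key list is a guess at the schema; verify against real exports.
    private static readonly string[] SensitiveKeys = { "location", "place", "coordinates" };

    public static void StripLocation(JsonNode? node)
    {
        if (node is JsonObject obj)
        {
            foreach (var key in SensitiveKeys)
                obj.Remove(key);
            foreach (var kv in obj)
                StripLocation(kv.Value);
        }
        else if (node is JsonArray arr)
        {
            foreach (var child in arr)
                StripLocation(child);
        }
    }
}
```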
Future Providers
Each new provider is just a new IImportParser implementation. The analysis pipeline, review UI, and commit logic are completely reusable. Adding a new provider is a half-day of work once Phase 1 is done.
Effort: ~2-3 days
| Component | Effort |
|---|---|
| Gemini parser | 2h |
| Multi-file ZIP handling | 1h |
| Format detection update | 30min |
| Tests | 2h |
| Total | ~6h |
Architecture Decisions
Why Background Processing (Not Synchronous)
A heavy ChatGPT user might have 2,000+ conversations. Parsing is fast, but running 100+ LLM inference calls through Kael takes time (30s-2min per batch). Background processing with polling keeps the UX responsive.
Why Batch, Not Stream
We need cross-conversation aggregation. A user mentioning "Python" in 50 conversations is different signal than mentioning it once. Batching lets us aggregate before presenting results.
Why Local Models Only
- Zero marginal cost per import (Kael is already running)
- No data leaves the infrastructure
- No rate limits or API quotas
- Privacy is the selling point -- can't undermine it by sending data to OpenAI/Anthropic for analysis
Why Review Before Commit
- Trust building: users see exactly what was extracted
- Error correction: LLM extraction isn't perfect, users can fix mistakes
- Privacy control: users might not want certain facts stored
- Legal: explicit consent before data processing
Why Temporary Raw Storage
import_conversations stores the full normalized conversations during processing, then gets cleaned up after commit. We don't keep raw chat history -- only the extracted intelligence. This is a privacy feature, not a limitation.
Risk Mitigation
| Risk | Mitigation |
|---|---|
| ChatGPT changes export format | Parser is isolated; update one class. Community tracks format changes. |
| Large exports (10k+ conversations) | Chunking + streaming parse. Set upload size limits. Progress indicators. |
| Kael overload during bulk imports | Queue imports, process one at a time per tenant. Rate limit the endpoint. |
| Low extraction quality | Confidence scores + user review. Iterate on prompts. Track approval rates to improve. |
| Users upload non-chat ZIP files | Validate format before processing. Clear error messages. |
| Duplicate imports | Check for existing import from same provider with similar conversation count. Warn user. |
Success Metrics
- Import completion rate: % of uploads that reach "committed" status
- Approval rate: % of extracted items approved (target: >80%)
- Time to first connection: minutes from upload to MCP client connection
- Retention: do imported users stay active vs organic signups?
- Conversation coverage: what % of conversations yield useful extraction?