data-ingest
# Data Ingest — Universal Text Source Handler
You are ingesting arbitrary text data into an Obsidian wiki. The source could be anything — conversation exports, log files, transcripts, data dumps. Your job is to figure out the format, extract knowledge, and distill it into wiki pages.
## Before You Start

- Read `.env` to get `OBSIDIAN_VAULT_PATH`
- Read `.manifest.json` at the vault root — check if this source has been ingested before
- Read `index.md` at the vault root to know what already exists
If the source path is already in `.manifest.json` and the file hasn't been modified since `ingested_at`, tell the user it's already been ingested. Ask if they want to re-ingest anyway.
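The freshness check above can be sketched as a small helper. This is a minimal illustration, not a fixed API — it assumes the manifest is a JSON object keyed by source path, with the `modified_at` and `size_bytes` fields shown in Step 5:

```python
import json
import os

def needs_ingest(source_path: str, manifest_path: str) -> bool:
    """Return True if the source file is new or changed since last ingest."""
    try:
        with open(manifest_path) as f:
            manifest = json.load(f)
    except FileNotFoundError:
        return True  # no manifest yet: everything is new
    entry = manifest.get(source_path)
    if entry is None:
        return True  # never ingested
    stat = os.stat(source_path)
    # Re-ingest only if the file actually changed since it was recorded.
    return stat.st_mtime > entry["modified_at"] or stat.st_size != entry["size_bytes"]
```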
## Step 1: Identify the Source Format
Read the file(s) the user points you at. Common formats you'll encounter:
| Format | How to identify | How to read |
|---|---|---|
| JSON / JSONL | `.json` / `.jsonl` extension, starts with `{` or `[` | Parse with Read tool, look for message/content fields |
| Markdown | `.md` extension | Read directly |
| Plain text | `.txt` extension or no extension | Read directly |
| CSV / TSV | `.csv` / `.tsv`, comma or tab separated | Parse rows, identify columns |
| HTML | `.html`, starts with `<` | Extract text content, ignore markup |
| Chat export | Varies — look for turn-taking patterns (user/assistant, human/ai, timestamps) | Extract the dialogue turns |
| Images | `.png` / `.jpg` / `.jpeg` / `.webp` / `.gif` | Requires a vision-capable model. Use the Read tool — it renders images into your context. Screenshots, whiteboards, diagrams all qualify. Models without vision support should skip and report which files were skipped. |
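The identification heuristics in the table can be sketched as a sniffer that checks the extension first and falls back to peeking at the content. Function name and return labels are illustrative, not a required interface:

```python
import os

def sniff_format(path: str) -> str:
    """Guess a source format from extension, falling back to first bytes."""
    ext = os.path.splitext(path)[1].lower()
    by_ext = {
        ".json": "json", ".jsonl": "json", ".md": "markdown",
        ".txt": "text", ".csv": "csv", ".tsv": "csv", ".html": "html",
        ".png": "image", ".jpg": "image", ".jpeg": "image",
        ".webp": "image", ".gif": "image",
    }
    if ext in by_ext:
        return by_ext[ext]
    # No (or unknown) extension: peek at the first non-whitespace bytes.
    with open(path, "rb") as f:
        head = f.read(512).lstrip()
    if head[:1] in (b"{", b"["):
        return "json"
    if head[:1] == b"<":
        return "html"
    return "text"
```

As the next paragraph says, treat this as a first guess — read the actual data before committing to a parser.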
### Common Chat Export Formats

**ChatGPT export** (`conversations.json`):

```json
[{"title": "...", "mapping": {"node-id": {"message": {"role": "user", "content": {"parts": ["text"]}}}}}]
```

**Slack export** (directory of JSON files per channel):

```json
[{"user": "U123", "text": "message", "ts": "1234567890.123456"}]
```

**Generic chat log** (timestamped text):

```
[2024-03-15 10:30] User: message here
[2024-03-15 10:31] Bot: response here
```
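For the ChatGPT shape specifically, a flattening pass might look like the sketch below. Only the `mapping` / `message` / `role` / `content.parts` field names come from the sample above; everything else is illustrative. Real exports order turns via parent/child node pointers, so mapping order is not guaranteed chronological — this sketch ignores that:

```python
def extract_turns(conversation: dict) -> list[tuple[str, str]]:
    """Flatten one ChatGPT-export conversation into (role, text) turns."""
    turns = []
    for node in conversation.get("mapping", {}).values():
        msg = node.get("message")
        if not msg:
            continue  # root/system nodes often carry no message
        role = msg.get("role") or msg.get("author", {}).get("role", "")
        parts = msg.get("content", {}).get("parts", [])
        text = "\n".join(p for p in parts if isinstance(p, str)).strip()
        if text:
            turns.append((role, text))
    return turns
```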
Don't try to handle every format upfront — read the actual data, figure out the structure, and adapt.
### Images and visual sources
When the user dumps a folder of screenshots, whiteboard photos, or diagram exports, treat each image as a source:
- Use the Read tool on the image path — it will render the image into context.
- Transcribe any visible text verbatim (this is the only extracted content from an image).
- Describe structure: for diagrams, list nodes/edges; for screenshots, name the app and what's on screen.
- Extract the concepts the image conveys — what's it about? Most of this is ^[inferred].
- Flag anything you can't read, can't identify, or are guessing at with ^[ambiguous].
Image-derived pages will skew heavily inferred — that's expected and the provenance markers will reflect it. Set `source_type: "image"` in the manifest entry. Skip files with EXIF-only changes (re-saved with no visual diff) — compare via the standard delta logic.
For folders of mixed images (e.g. a screenshot timeline of a debugging session), cluster by visible topic rather than per-file. Twenty screenshots of the same UI bug should produce one wiki page, not twenty.
## Step 2: Extract Knowledge
Regardless of format, extract the same things:
- Topics discussed — what subjects come up?
- Decisions made — what was concluded or decided?
- Facts learned — what concrete information is stated?
- Procedures described — how-to knowledge, workflows, steps
- Entities mentioned — people, tools, projects, organizations
- Connections — how do topics relate to each other and to existing wiki content?
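One way to hold these six kinds of extraction before clustering is a small record type. This shape is entirely hypothetical — a working convention, not something the wiki format requires:

```python
from dataclasses import dataclass, field

@dataclass
class Extraction:
    """One unit of knowledge pulled from a source, pre-clustering.

    'kind' mirrors the checklist above: topic / decision / fact /
    procedure / entity / connection.
    """
    kind: str
    claim: str              # the distilled statement itself
    source: str             # file or conversation it came from
    inferred: bool = False  # True -> gets an ^[inferred] marker later
    links: list[str] = field(default_factory=list)  # related wiki pages
```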
For conversation data specifically:
Focus on the substance, not the dialogue. A 50-message debugging session might yield one skills page about the fix. A long brainstorming chat might yield three concept pages.
Skip:
- Greetings, pleasantries, meta-conversation ("can you help me with...")
- Repetitive back-and-forth that doesn't add new information
- Raw code dumps (unless they illustrate a reusable pattern)
## Step 3: Cluster and Deduplicate
Before creating pages:
- Group extracted knowledge by topic (not by source file or conversation)
- Check existing wiki pages — does this knowledge belong on an existing page?
- Merge overlapping information from multiple sources
- Note contradictions between sources
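The grouping-by-topic step can be sketched like this — items are assumed to carry `topic` and `claim` keys (an illustrative shape, not a fixed schema), and exact-duplicate claims within a topic are dropped:

```python
from collections import defaultdict

def cluster_by_topic(items: list[dict]) -> dict[str, list[dict]]:
    """Group extracted items by normalized topic, merging exact duplicates."""
    clusters: dict[str, list[dict]] = defaultdict(list)
    for item in items:
        key = item["topic"].strip().lower()  # normalize so "Auth" == "auth "
        if all(item["claim"] != seen["claim"] for seen in clusters[key]):
            clusters[key].append(item)
    return dict(clusters)
```

Contradictions (same topic, conflicting claims) deliberately survive this pass — they should be noted on the page, not silently merged away.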
## Step 4: Distill into Wiki Pages
Follow the wiki-ingest skill's process for creating/updating pages:
- Use correct category directories (`concepts/`, `entities/`, `skills/`, etc.)
- Add YAML frontmatter with title, category, tags, sources
- Use `[[wikilinks]]` to connect to existing pages
- Attribute claims to their source
- Write a `summary:` frontmatter field on every new page (1–2 sentences, ≤200 characters) answering "what is this page about?" — this is what downstream skills read to avoid opening the page body.
- Apply provenance markers per the convention in `llm-wiki`. Conversation, log, and chat data tend to be high-inference — you're often reading between the turns to extract a coherent claim. Be liberal with `^[inferred]` for synthesized patterns and with `^[ambiguous]` when speakers contradict each other or you're unsure who's right. Write a `provenance:` frontmatter block on each new/updated page.
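Putting the checklist together, a new page's frontmatter might look like the fragment below. All values are illustrative, and the `provenance:` shape is an assumption — the authoritative form lives in the `llm-wiki` convention, not here:

```yaml
---
title: Token Refresh Flow
category: concepts
tags: [auth, jwt]
sources: ["exports/conversations.json"]
summary: "How token refresh works: short-lived JWTs plus a rotating refresh key."
provenance:          # shape assumed from the llm-wiki convention
  inferred: 3
  ambiguous: 1
---
```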
## Step 5: Update Manifest and Special Files

`.manifest.json` — add an entry for each source file processed:

```json
{
  "ingested_at": "TIMESTAMP",
  "size_bytes": FILE_SIZE,
  "modified_at": FILE_MTIME,
  "source_type": "data",  // or "image" for png/jpg/webp/gif sources
  "project": "project-name-or-null",
  "pages_created": ["list/of/pages.md"],
  "pages_updated": ["list/of/pages.md"]
}
```

`index.md` and `log.md` — append a log line:

```
- [TIMESTAMP] DATA_INGEST source="path/to/data" format=FORMAT pages_updated=X pages_created=Y
```
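The manifest upsert in Step 5 can be sketched as follows — a minimal illustration using the entry fields shown above, with a temp-file-then-rename write so a crash mid-write can't corrupt the manifest. The function name and signature are assumptions, not a defined interface:

```python
import json
import os
import time

def record_ingest(manifest_path: str, source_path: str,
                  created: list[str], updated: list[str],
                  source_type: str = "data") -> None:
    """Upsert one manifest entry after processing a source file."""
    try:
        with open(manifest_path) as f:
            manifest = json.load(f)
    except FileNotFoundError:
        manifest = {}
    stat = os.stat(source_path)
    manifest[source_path] = {
        "ingested_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "size_bytes": stat.st_size,
        "modified_at": stat.st_mtime,
        "source_type": source_type,
        "project": None,
        "pages_created": created,
        "pages_updated": updated,
    }
    # Write to a temp file, then atomically replace the real manifest.
    tmp = manifest_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(manifest, f, indent=2)
    os.replace(tmp, manifest_path)
```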
## Tips
- When in doubt about format, just read it. The Read tool will show you what you're dealing with.
- Large files: read in chunks using the Read tool's `offset`/`limit` parameters. Don't try to load a 10 MB JSON file in one go.
- Multiple files: Process them in order, building up wiki pages incrementally.
- Binary files: Skip them, except images — those are first-class sources via the Read tool's vision support.
- Encoding issues: If you see garbled text, mention it to the user and move on.