persona-dataset

Persistent, incremental, searchable persona knowledge base — the data layer between raw sources and persona training.

Architecture: MemPalace (storage + search) + Knowledge Graph (relationships + timeline) + Karpathy LLM Wiki (knowledge accumulation)

Dependency chain: data sources → persona-dataset → anyone-skill / persona-model-trainer


When to use this skill

Trigger phrases:

  • "create a dataset for this persona"
  • "add data to the dataset"
  • "import my Obsidian vault"
  • "import my Twitter archive"
  • "build a knowledge base for X"
  • "export training data"
  • "search the persona dataset"

Not suitable when:

  • User wants a quick one-shot distillation without persistent storage (use anyone-skill alone)
  • User only has < 50 messages of data (too little to warrant a dataset)

Phase 1: Init

Create a new persona dataset:

python scripts/init_dataset.py --slug {slug} --name "Display Name"

This creates ~/.openpersona/datasets/{slug}/ with:

~/.openpersona/datasets/{slug}/
  dataset.json                 # metadata: slug, name, created_at, stats
  .mempalace/                  # MemPalace local data (per-dataset isolation via palace_path)
    palace/                    # MemPalace internal store (ChromaDB + KG)
  sources/                     # immutable source file backups
    .source-index.json         # per-file metadata: hash, import time, line count, PII flags
  wiki/                        # Karpathy wiki (LLM-maintained derived artifact)
    _schema.md                 # wiki maintenance rules
    identity.md
    voice.md
    values.md
    thinking.md
    relationships.md           # generated from Knowledge Graph
    timeline.md                # generated from Knowledge Graph
    _contradictions.md
    _changelog.md
    _evidence.md
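
A minimal sketch of what initialization lays down on disk, assuming the layout above (the real init_dataset.py also registers the MemPalace wing and Knowledge Graph; the stats field names are guesses):

import json
from datetime import datetime, timezone
from pathlib import Path

def init_dataset(slug: str, name: str) -> Path:
    root = Path.home() / ".openpersona" / "datasets" / slug
    for sub in ("sources", "wiki", ".mempalace/palace"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    meta = {
        "slug": slug,
        "name": name,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "stats": {"sources": 0, "messages": 0},   # field names are assumptions
    }
    (root / "dataset.json").write_text(json.dumps(meta, indent=2))
    return root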

MemPalace palace structure:

  • One Wing per persona (named by slug)
  • Halls mapped to 5 persona dimensions:
    • hall_facts — Identity (background, career, education)
    • hall_events — Memory (key events, turning points)
    • hall_preferences — Personality (values, preferences, boundaries)
    • hall_discoveries — Procedure (mental models, decision heuristics)
    • hall_voice — Interaction (vocabulary, rhythm, humor, emotional temperature)
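
In code form, the dimension-to-hall routing above is just a small lookup (an illustrative sketch, not MemPalace API):

# Persona dimension → hall, per the list above (keys are informal labels)
HALLS = {
    "identity": "hall_facts",
    "memory": "hall_events",
    "personality": "hall_preferences",
    "procedure": "hall_discoveries",
    "interaction": "hall_voice",
}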

Gate: Confirm slug and display name with the user before proceeding.


Phase 2: Ingest

Import data sources into the dataset. Can be called multiple times for incremental ingestion.

python scripts/ingest.py --slug {slug} --source <path> [--adapter <name>] [--since <date>]

Adapter auto-detection (by file type / directory structure):

| Source | Adapter | Detection |
|---|---|---|
| Obsidian vault | obsidian | Directory containing .obsidian/ or *.md files |
| WhatsApp .txt export | chat_export | Matches WhatsApp timestamp pattern |
| Telegram result.json | chat_export | JSON with chats key |
| Signal export | chat_export | JSON with Signal message format |
| iMessage .db | chat_export | SQLite with message + handle tables |
| Slack / Discord JSON | delegates to mempalace mine --mode convos | JSON workspace export |
| X (Twitter) archive | social | Directory containing data/tweets.js |
| Instagram archive | social | Directory containing content/posts_1.json |
| .txt / .csv / .pdf | plaintext | File extension |
| .jsonl / .json | jsonl | File extension + {role, content} format |
| GBrain MCP | gbrain | User specifies --adapter gbrain --entity "Name" |
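
A best-effort version of that dispatch, following the detection rules in the table (a hypothetical helper; the shipped ingest.py may differ, and the Signal and Slack/Discord checks are omitted for brevity):

import json
import re
from pathlib import Path

# e.g. "12/31/24, 9:41 PM - Alice: ..." (pattern is an approximation)
WHATSAPP_TS = re.compile(r"^\[?\d{1,2}[./]\d{1,2}[./]\d{2,4},? \d{1,2}:\d{2}")

def detect_adapter(source: Path) -> str:
    if source.is_dir():
        if (source / ".obsidian").exists() or any(source.glob("*.md")):
            return "obsidian"
        if (source / "data" / "tweets.js").exists():          # X (Twitter) archive
            return "social"
        if (source / "content" / "posts_1.json").exists():    # Instagram archive
            return "social"
    suffix = source.suffix.lower()
    if suffix == ".txt":
        head = source.read_text(errors="ignore").splitlines()[:1]
        if head and WHATSAPP_TS.match(head[0]):
            return "chat_export"                              # WhatsApp export
    if suffix in {".txt", ".csv", ".pdf"}:
        return "plaintext"
    if suffix == ".db":
        return "chat_export"                                  # iMessage SQLite
    if suffix in {".json", ".jsonl"}:
        if suffix == ".json" and "chats" in json.loads(source.read_text()):
            return "chat_export"                              # Telegram result.json
        return "jsonl"
    raise ValueError(f"no adapter matched {source}")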

Ingest pipeline (per source):

  1. Parse — adapter converts source to unified [{role, content, timestamp, source_file, source_type}]
  2. PII scan — flag SSN, credit card, email, password patterns
  3. Hash dedup — SHA-256 content hash, skip already-ingested entries (sketched after this list)
  4. Write sources/ — save parsed data as JSONL backup (immutable, one file per source)
  5. Store in MemPalace — verbatim text into ChromaDB via palace wing/hall structure
  6. Extract KG triples — detect entities and relationships, write to Knowledge Graph with temporal validity
  7. Report — print source name, message count, assistant turns, PII flags, new KG entities
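
A minimal version of the dedup step (step 3), assuming .source-index.json also carries the set of previously seen hashes (that field is a guess):

import hashlib
import json
from pathlib import Path

def content_hash(entry: dict) -> str:
    # entry follows the unified format from step 1: {role, content, timestamp, ...}
    return hashlib.sha256(entry["content"].strip().encode("utf-8")).hexdigest()

def drop_seen(entries: list[dict], index_path: Path) -> list[dict]:
    index = json.loads(index_path.read_text()) if index_path.exists() else {}
    seen = set(index.get("hashes", []))           # "hashes" key is an assumption
    return [e for e in entries if content_hash(e) not in seen]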

After each source is ingested, report:

✅ whatsapp-2024.txt → 1,247 messages (892 assistant turns)
   PII: none detected
   KG: +3 entities, +7 relationships
   → sources/whatsapp-2024.jsonl

Phase 3: Wiki Build (agent task — not a script)

After ingesting new data, the agent reads MemPalace content and Knowledge Graph relationships, then builds or updates the wiki pages following the Karpathy LLM Wiki pattern.

This phase is driven by agent intelligence (SKILL.md instructions), not by automated scripts. The LLM decides which pages to update, how to phrase entries, and how to tag evidence.

Ingest operation (after each Phase 2 run)

  1. Read new data from MemPalace (search the wing for recently added entries)
  2. For each relevant wiki page, check if the new data adds, contradicts, or refines existing content
  3. Update 5-15 wiki pages with new information, using evidence tags (example after this list):
    • [L1:source] — direct quote, traceable
    • [L2] — reported/paraphrased, verifiable
    • [L3:inferred] — reasonably inferred from multiple signals
    • [L4:inspired] — impression-based
  4. Add backlinks between related pages using [[page]] wikilink syntax
  5. Record contradictions in _contradictions.md with both sides cited
  6. Append entry to _changelog.md
  7. Update counts in _evidence.md
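
For concreteness, a hypothetical voice.md fragment with the tags in use (the content itself is invented for illustration):

Prefers short declarative sentences and almost never uses exclamation marks [L1:whatsapp-2024.txt]. Humor is dry and self-deprecating, and surfaces mostly in close-friend chats [L3:inferred]. See [[thinking]] for how this carries into longer-form writing.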

Query operation

When the user asks a question about the persona:

  1. Search MemPalace semantically for relevant memories
  2. Navigate wiki pages for structured knowledge
  3. Synthesize an answer
  4. If the query reveals new insights, write them back to the appropriate wiki page
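
Steps 1 and 2 in code, using the searcher API shown in Phase 5 (the slug "ada" and the question are placeholders):

from pathlib import Path
from mempalace.searcher import search_memories

dataset = Path.home() / ".openpersona" / "datasets" / "ada"
memories = search_memories("how does she handle conflict",
                           palace_path=str(dataset / ".mempalace" / "palace"))
values = (dataset / "wiki" / "values.md").read_text()
# Steps 3-4 (synthesis and write-back) are agent work, not scripted.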

Lint operation

Run periodically or before export:

python scripts/lint_wiki.py --slug {slug}

Checks:

  • Broken [[links]] (referenced page doesn't exist; see the sketch after this list)
  • Empty pages (created but never populated)
  • Contradictions without resolution notes
  • Evidence coverage (pages with < 2 evidence tags)
  • Source coverage (MemPalace entries not reflected in any wiki page)
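
The broken-link check, for example, reduces to a regex pass over the wiki directory (a sketch, not lint_wiki.py itself):

import re
from pathlib import Path

WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def broken_links(wiki_dir: Path) -> list[tuple[str, str]]:
    pages = {p.stem for p in wiki_dir.glob("*.md")}
    missing = []
    for page in wiki_dir.glob("*.md"):
        for target in WIKILINK.findall(page.read_text()):
            if target.strip() not in pages:
                missing.append((page.name, target.strip()))
    return missing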

Wiki page structure (see references/wiki-schema.md for full spec)

Each page follows this template:

# {Page Title}

> One-sentence summary of this page's scope.

## Content

{Structured content with [L?:source] evidence tags and [[backlinks]]}

## Sources

- {source_file}: {what was extracted} [L?]

## See also

- [[related_page]]

Knowledge Graph–driven pages

relationships.md and timeline.md are generated from the Knowledge Graph, not written freehand:

from mempalace.knowledge_graph import KnowledgeGraph

# palace_path points at the dataset's palace, e.g.
# ~/.openpersona/datasets/{slug}/.mempalace/palace
kg = KnowledgeGraph(palace_path)
kg.timeline(slug)           # → chronological event list for timeline.md
kg.query_entity(slug)       # → current relationships for relationships.md

After generating, the agent may annotate with evidence tags and additional context.


Phase 4: Export

Generate a training/ directory compatible with persona-model-trainer:

python scripts/export_training.py --slug {slug} --output training/

Output:

training/
  raw/                      # copied from sources/ (authentic voice, unmodified)
  conversations.jsonl       # generated from wiki pages (structured Q-A pairs)
  profile.md                # summarized from wiki identity/voice/values
  metadata.json             # slug, source count, turn count, export timestamp

How each file is built:

  • training/raw/ — direct copy of sources/*.jsonl and sources/*.txt files
  • training/conversations.jsonl — the agent reads wiki pages and generates distilled user/assistant turn pairs representing the persona's voice, knowledge, and values
  • training/profile.md — 300-500 word character sheet derived from identity.md, voice.md, values.md
  • training/metadata.json — aggregated stats from dataset.json + source index
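
A sketch of the metadata step above, assuming .source-index.json is keyed by source file (field names beyond those listed are guesses):

import json
import time
from pathlib import Path

def write_metadata(dataset_dir: Path, out_dir: Path) -> None:
    meta = json.loads((dataset_dir / "dataset.json").read_text())
    index = json.loads((dataset_dir / "sources" / ".source-index.json").read_text())
    (out_dir / "metadata.json").write_text(json.dumps({
        "slug": meta["slug"],
        "source_count": len(index),
        "turn_count": meta.get("stats", {}).get("messages", 0),  # assumed stat field
        "exported_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }, indent=2))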

This output is directly consumable by persona-model-trainer's prepare_data.py — no changes needed downstream.


Phase 5: Search

Query the dataset using MemPalace's semantic search and Knowledge Graph:

# Semantic search across all stored memories
mempalace search "how does this person handle conflict" --wing {slug}

# Knowledge Graph entity query
python -c "
from mempalace.knowledge_graph import KnowledgeGraph
kg = KnowledgeGraph('~/.openpersona/datasets/{slug}/.mempalace/palace')
print(kg.query_entity('{slug}'))
"

# Wake-up summary (~170 tokens)
mempalace wake-up --wing {slug}

The agent can also search programmatically during wiki build or distillation:

from mempalace.searcher import search_memories
results = search_memories("vocabulary patterns", palace_path="~/.openpersona/datasets/{slug}/.mempalace/palace")

Phase 6: Maintain

Ongoing dataset management:

  • Add new source: run Phase 2 (Ingest) again with new files → triggers wiki update
  • Remove source: delete from sources/ + re-index → run wiki lint to flag orphaned content
  • Wiki lint: python scripts/lint_wiki.py --slug {slug} — health check
  • Dataset stats: python scripts/init_dataset.py --slug {slug} --stats — show current stats
  • List datasets: ls ~/.openpersona/datasets/ — all available datasets

Tools

| Tool | Purpose |
|---|---|
| Bash | Run init, ingest, export, lint scripts; MemPalace CLI commands |
| Read | Load source files, wiki pages, dataset.json |
| Write | Update wiki pages, write training exports |
| WebSearch | Fetch public figure data for ingestion |

Scripts

| Script | Purpose |
|---|---|
| scripts/init_dataset.py | Initialize dataset directory + MemPalace wing + KG |
| scripts/ingest.py | Unified ingestion: adapter dispatch + PII scan + dedup + MemPalace + KG |
| scripts/export_training.py | Export sources/ + wiki → training/ directory |
| scripts/lint_wiki.py | Wiki health check: broken links, contradictions, coverage gaps |

Adapters

| Adapter | Sources | Format |
|---|---|---|
| obsidian | Obsidian vault (.md + YAML frontmatter) | Markdown notes |
| chat_export | WhatsApp / Telegram / Signal / iMessage | .txt / JSON / SQLite |
| social | X (Twitter) / Instagram archive | JSON archive |
| plaintext | .txt / .csv / .pdf | Generic files |
| gbrain | GBrain MCP (optional) | MCP tool calls |
| jsonl | Generic JSONL / JSON | {role, content} format |

References

  • references/wiki-schema.md — Karpathy wiki structure specification and maintenance rules
  • references/source-formats.md — supported data source formats and adapter details