ai-training-data-generation
AI Training Data Generation
Overview
A comprehensive skill for automatically generating high-quality training datasets from documents, text corpora, and structured content. Optimized for low-resource languages, dictionary content, and domain-specific knowledge extraction.
Key Sources: Dictionary databases, Bible EPUBs (NWT), JW brochures, parallel text corpora
Capabilities
- Multi-strategy Generation: Dictionary pairs, contextual definitions, completion tasks, classification examples
- Quality Filtering: Confidence scoring, duplicate removal, and content validation
- Format Flexibility: Support for multiple AI training formats (JSONL, HuggingFace, Ollama, OpenAI)
- Language Awareness: Multi-language support with special handling for accented characters
- Scalable Processing: Generate thousands of examples from large documents
- Balance Management: Ensure dataset diversity and prevent category imbalance
- EPUB Processing: Extract parallel verses from Bible EPUBs for translation training
- Sentence Alignment: Align parallel sentences from bilingual documents
Core Strategies
1. Dictionary Pair Extraction
Extract word-definition pairs from structured and semi-structured text.
Detection Patterns:
- Separator-based: word – definition, term: meaning (see the sketch after this list)
- Linguistic indicators: means, is defined as, refers to
- Structural cues: Indentation, formatting, list structures
- Context analysis: Surrounding text for validation
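A minimal sketch of separator-based and indicator-based matching, assuming plain-text dictionary lines; the regexes, accented-character class, and confidence values are illustrative rather than the skill's exact implementation:

import re

# Illustrative patterns for "word – definition" / "term: meaning" lines and
# for linguistic indicators such as "means" or "refers to" (assumed, not exact).
SEPARATOR_PATTERN = re.compile(r"^(?P<word>[\w'’áéíóúÁÉÍÓÚ-]+)\s*[–:\-]\s*(?P<definition>.+)$")
INDICATOR_PATTERN = re.compile(
    r"^(?P<word>[\w'’áéíóúÁÉÍÓÚ-]+)\s+(?:means|is defined as|refers to)\s+(?P<definition>.+)$",
    re.IGNORECASE,
)

def extract_dictionary_pairs(lines):
    """Yield candidate word-definition pairs with a rough confidence score."""
    for line in lines:
        line = line.strip()
        match = SEPARATOR_PATTERN.match(line) or INDICATOR_PATTERN.match(line)
        if match:
            yield {
                'word': match.group('word'),
                'definition': match.group('definition').strip(),
                'type': 'dictionary_pair',
                'confidence': 0.9 if SEPARATOR_PATTERN.match(line) else 0.8,
            }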
2. Bible EPUB Parallel Corpus Extraction
Extract aligned verse pairs from Chuukese and English NWT Bible EPUBs.
from ebooklib import epub
from src.utils.nwt_epub_parser import NWTEpubParser

class BibleTrainingDataExtractor:
    """
    Extract parallel training data from NWT Bible EPUBs.
    Aligns verses between Chuukese (nwt_TE.epub) and English (nwt_E.epub).
    """

    def __init__(self, chuukese_epub_path: str, english_epub_path: str):
        self.chk_parser = NWTEpubParser(chuukese_epub_path)
        self.en_parser = NWTEpubParser(english_epub_path)

    def extract_parallel_verses(
        self,
        books: list = None,
        min_verse_length: int = 10
    ) -> list:
        """
        Extract aligned verse pairs for training.

        Args:
            books: List of book names to extract (None = all)
            min_verse_length: Minimum characters per verse

        Returns:
            List of {'chuukese': str, 'english': str, 'reference': str}
        """
        parallel_pairs = []

        # Get list of books
        if books is None:
            books = self.chk_parser.get_book_list()

        for book in books:
            # Get chapters
            chapters = self.chk_parser.get_chapters(book)

            for chapter in chapters:
                # Get verses from both languages
                chk_verses = self.chk_parser.get_verses(book, chapter)
                en_verses = self.en_parser.get_verses(book, chapter)

                # Align by verse number
                for verse_num, chk_text in chk_verses.items():
                    if verse_num in en_verses:
                        en_text = en_verses[verse_num]

                        # Filter by length
                        if len(chk_text) >= min_verse_length and len(en_text) >= min_verse_length:
                            parallel_pairs.append({
                                'chuukese': chk_text.strip(),
                                'english': en_text.strip(),
                                'reference': f"{book} {chapter}:{verse_num}",
                                'source': 'bible',
                                'confidence': 0.95  # Bible translations are high-quality
                            })

        return parallel_pairs
    def export_for_helsinki_training(
        self,
        output_path: str,
        direction: str = "chk_to_en"
    ) -> dict:
        """
        Export parallel data in format suitable for Helsinki-NLP fine-tuning.

        Args:
            output_path: Path for output files
            direction: 'chk_to_en' or 'en_to_chk'

        Returns:
            Statistics about exported data
        """
        pairs = self.extract_parallel_verses()

        # Split into train/val/test (80/10/10)
        from sklearn.model_selection import train_test_split
        train_pairs, test_pairs = train_test_split(pairs, test_size=0.2, random_state=42)
        val_pairs, test_pairs = train_test_split(test_pairs, test_size=0.5, random_state=42)

        # Export in TSV format for transformers
        for split_name, split_data in [
            ('train', train_pairs),
            ('val', val_pairs),
            ('test', test_pairs)
        ]:
            with open(f"{output_path}/{split_name}.tsv", 'w', encoding='utf-8') as f:
                for pair in split_data:
                    if direction == "chk_to_en":
                        f.write(f"{pair['chuukese']}\t{pair['english']}\n")
                    else:
                        f.write(f"{pair['english']}\t{pair['chuukese']}\n")

        return {
            'total_pairs': len(pairs),
            'train_size': len(train_pairs),
            'val_size': len(val_pairs),
            'test_size': len(test_pairs)
        }
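A hypothetical usage example; the EPUB paths and output directory are placeholders, and the output directory is assumed to exist:

extractor = BibleTrainingDataExtractor("data/nwt_TE.epub", "data/nwt_E.epub")
stats = extractor.export_for_helsinki_training("training_output/bible", direction="chk_to_en")
print(stats)  # {'total_pairs': ..., 'train_size': ..., 'val_size': ..., 'test_size': ...}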
3. Brochure Sentence Extraction
Extract parallel sentences from JW brochures.
import json

class BrochureSentenceExtractor:
    """Extract training pairs from brochure sentences."""

    def __init__(self, sentences_json_path: str):
        with open(sentences_json_path, 'r', encoding='utf-8') as f:
            self.sentences = json.load(f)

    def extract_training_pairs(self) -> list:
        """Extract and format brochure sentences for training."""
        pairs = []
        for item in self.sentences:
            if 'chuukese' in item and 'english' in item:
                pairs.append({
                    'chuukese': item['chuukese'],
                    'english': item['english'],
                    'source': 'brochure',
                    'confidence': item.get('confidence', 0.8)
                })
        return pairs
4. Implementation Pattern
from .ai_training_generator import AITrainingDataGenerator
# Initialize generator
generator = AITrainingDataGenerator(min_confidence=0.7)
# Generate comprehensive training data
training_data = generator.generate_comprehensive_training_data(
parsed_document,
target_count=10000
)
# Export in multiple formats
files = generator.export_training_data(
training_data,
output_dir="training_output",
format_type="ollama"
)
Output Format Examples
JSONL Format (Standard)
{"input": "What does 'ááfengen' mean?", "output": "very good, excellent", "type": "dictionary_pair", "confidence": 0.95}
Ollama Format
{"prompt": "Translate this Chuukese word: ngang", "response": "fish", "system": "You are a Chuukese-English translator."}
HuggingFace Format
{"text": "### Instruction:\nWhat does 'chomong' mean in Chuukese?\n\n### Response:\nto help, assist"}
OpenAI Fine-tuning Format
{"messages": [{"role": "user", "content": "Define: kúún"}, {"role": "assistant", "content": "to go, to leave"}]}
Quality Assurance
- Content validity: Does the example make linguistic sense?
- Pattern matching: Does it follow expected language patterns?
- Context appropriateness: Is the context relevant and helpful?
- Uniqueness: Avoid repetitive or duplicate content
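Uniqueness and confidence checks can be sketched with content hashing (hinted at by the hashlib dependency below). The thresholds are assumptions, and the sketch assumes JSONL-style examples with 'input', 'output', and 'confidence' fields:

import hashlib

def filter_examples(examples, min_confidence=0.7):
    """Drop low-confidence and duplicate examples (illustrative thresholds)."""
    seen = set()
    kept = []
    for ex in examples:
        if ex.get('confidence', 0.0) < min_confidence:
            continue
        # Hash normalized input/output text to detect verbatim duplicates
        digest = hashlib.sha256(
            (ex['input'].strip().lower() + '|' + ex['output'].strip().lower()).encode('utf-8')
        ).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(ex)
    return kept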
Best Practices
- Multiple validation passes: Automated and manual quality checks
- Confidence thresholds: Adjust based on use case requirements
- Human review sampling: Periodic manual validation of generated examples
- Balance management: Ensure even distribution across categories
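Balance management can be approximated by capping per-category counts with collections.Counter; the cap value and the 'type' field are assumptions based on the JSONL example above:

from collections import Counter

def balance_by_type(examples, max_per_type=2000):
    """Cap the number of examples per 'type' to avoid category imbalance."""
    counts = Counter()
    balanced = []
    for ex in examples:
        category = ex.get('type', 'unknown')
        if counts[category] < max_per_type:
            counts[category] += 1
            balanced.append(ex)
    return balanced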
Dependencies
- re: Regular expression pattern matching
- json: Data serialization and export
- hashlib: Duplicate detection and content hashing
- collections: Data structure utilities and counting