# Helsinki-NLP Model Training
## Overview

Train and fine-tune Helsinki-NLP OPUS-MT models for Chuukese-English translation. These models are purpose-built for translation tasks and typically outperform general-purpose LLMs on low-resource languages like Chuukese.
Base Models:

- `Helsinki-NLP/opus-mt-mul-en` (Multilingual → English)
- `Helsinki-NLP/opus-mt-en-mul` (English → Multilingual)
## Capabilities
- Fine-tuning: Adapt base models to Chuukese-specific data
- Bidirectional Training: Support for both Chuukese→English and English→Chuukese
- Training Data Preparation: Format dictionary and parallel corpus data
- Model Evaluation: BLEU, chrF scoring on test sets
- Local Deployment: Run models locally without API calls
- GPU/CPU Support: Automatic device detection and optimization
## Model Architecture

Helsinki-NLP OPUS-MT Architecture:

```
┌───────────────────────────────────────────────┐
│ MarianMT (Marian Neural Machine Translation)  │
├───────────────────────────────────────────────┤
│ Encoder: 6 Transformer layers                 │
│ Decoder: 6 Transformer layers                 │
│ Attention heads: 8                            │
│ Hidden size: 512                              │
│ Vocabulary: SentencePiece tokenizer           │
└───────────────────────────────────────────────┘
```
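These dimensions can be confirmed from the published model config; a minimal sketch, assuming only the stock `transformers` `AutoConfig` API:

```python
from transformers import AutoConfig

# Inspect the published MarianMT config for the multilingual → English model
config = AutoConfig.from_pretrained("Helsinki-NLP/opus-mt-mul-en")
print(config.encoder_layers, config.decoder_layers)    # transformer depth
print(config.encoder_attention_heads, config.d_model)  # attention heads, hidden size
```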
## Directory Structure

```
models/
├── helsinki-chuukese_chuukese_to_english/
│   ├── finetuned/
│   │   ├── config.json
│   │   ├── pytorch_model.bin
│   │   ├── tokenizer_config.json
│   │   ├── source.spm
│   │   └── target.spm
│   └── training_logs/
├── helsinki-chuukese_english_to_chuukese/
│   ├── finetuned/
│   └── training_logs/
└── test-helsinki_chuukese_to_english/
```
## Training Data Preparation

### 1. Data Format

```python
# Training data structure
training_pairs = [
    {"source": "chomong", "target": "to help, assist"},
    {"source": "Kopwe pwan chomong", "target": "We will help"},
    {"source": "ngang", "target": "fish"},
]

# Export to TSV format
def export_training_data(pairs, output_path, direction="chk_to_en"):
    """Export training pairs to TSV format for transformers."""
    with open(output_path, 'w', encoding='utf-8') as f:
        for pair in pairs:
            if direction == "chk_to_en":
                f.write(f"{pair['source']}\t{pair['target']}\n")
            else:
                f.write(f"{pair['target']}\t{pair['source']}\n")
```
### 2. Data Sources

```python
# Combine multiple data sources
from src.database.dictionary_db import DictionaryDB

def prepare_training_data():
    """Prepare comprehensive training dataset."""
    db = DictionaryDB()
    training_pairs = []

    # 1. Dictionary entries
    entries = db.dictionary.find({'is_base_word': True})
    for entry in entries:
        training_pairs.append({
            'source': entry['chuukese_word'],
            'target': entry['english_definition'],
            'confidence': 0.9
        })

    # 2. Phrases
    phrases = db.phrases.find({'confidence_score': {'$gte': 0.7}})
    for phrase in phrases:
        training_pairs.append({
            'source': phrase['chuukese_phrase'],
            'target': phrase['english_translation'],
            'confidence': phrase.get('confidence_score', 0.8)
        })

    # 3. Bible parallel verses (highest quality)
    from src.utils.nwt_epub_parser import NWTEpubParser
    # ... extract parallel verses

    return training_pairs
```
### 3. Data Splitting

```python
from sklearn.model_selection import train_test_split

def split_training_data(pairs, test_size=0.1, val_size=0.1):
    """Split data into train/val/test sets."""
    train, test = train_test_split(pairs, test_size=test_size, random_state=42)
    train, val = train_test_split(train, test_size=val_size / (1 - test_size), random_state=42)
    return {
        'train': train,
        'val': val,
        'test': test
    }
```
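With the defaults this works out to roughly an 80/10/10 split:

```python
splits = split_training_data(training_pairs)
for name, subset in splits.items():
    print(f"{name}: {len(subset)} pairs")  # ~80/10/10 with default ratios
```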
## Training Implementation

### HelsinkiChuukeseTranslator Class

```python
import os

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Trainer,
    TrainingArguments,
    DataCollatorForSeq2Seq
)
from datasets import Dataset

class HelsinkiChuukeseTranslator:
    """Helsinki-NLP based Chuukese translator with fine-tuning support."""

    def __init__(self,
                 base_model: str = "Helsinki-NLP/opus-mt-mul-en",
                 reverse_model: str = "Helsinki-NLP/opus-mt-en-mul"):
        self.base_model = base_model
        self.reverse_model = reverse_model
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

        # Local fine-tuned model paths
        self.local_chk_to_en = "models/helsinki-chuukese_chuukese_to_english/finetuned"
        self.local_en_to_chk = "models/helsinki-chuukese_english_to_chuukese/finetuned"

        self.tokenizer = None
        self.model = None

    def setup_models(self, direction: str = "chk_to_en", use_finetuned: bool = True) -> bool:
        """Load models, preferring fine-tuned versions when available."""
        try:
            if direction == "chk_to_en":
                local_path, base = self.local_chk_to_en, self.base_model
            else:
                local_path, base = self.local_en_to_chk, self.reverse_model
            model_path = local_path if use_finetuned and os.path.exists(local_path) else base
            self.tokenizer = AutoTokenizer.from_pretrained(model_path)
            self.model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to(self.device)
            return True
        except Exception as e:
            print(f"Error loading model: {e}")
            return False

    def translate(self, text: str, direction: str = "chk_to_en") -> str:
        """Translate text using the loaded model."""
        if self.model is None:
            self.setup_models(direction)
        inputs = self.tokenizer(text, return_tensors="pt", padding=True).to(self.device)
        outputs = self.model.generate(
            **inputs,
            max_length=256,
            num_beams=5,
            early_stopping=True
        )
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
```
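Minimal usage sketch (the expected output is the dictionary gloss from the training data, not a guaranteed model output):

```python
translator = HelsinkiChuukeseTranslator()
if translator.setup_models(direction="chk_to_en"):
    # With the fine-tuned model this should produce an English gloss,
    # e.g. something like "to help"
    print(translator.translate("chomong"))
```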
### Fine-Tuning Script

```python
def fine_tune_model(
    training_data: list,
    direction: str = "chk_to_en",
    epochs: int = 10,
    batch_size: int = 16,
    learning_rate: float = 2e-5,
    output_dir: str = None
):
    """
    Fine-tune Helsinki-NLP model on Chuukese data.

    Args:
        training_data: List of {'source': str, 'target': str} pairs
        direction: 'chk_to_en' or 'en_to_chk'
        epochs: Number of training epochs
        batch_size: Training batch size
        learning_rate: Learning rate
        output_dir: Directory to save fine-tuned model
    """
    # Select base model
    if direction == "chk_to_en":
        base_model = "Helsinki-NLP/opus-mt-mul-en"
        output_dir = output_dir or "models/helsinki-chuukese_chuukese_to_english/finetuned"
    else:
        base_model = "Helsinki-NLP/opus-mt-en-mul"
        output_dir = output_dir or "models/helsinki-chuukese_english_to_chuukese/finetuned"

    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSeq2SeqLM.from_pretrained(base_model)

    # Prepare dataset. text_target tokenizes labels with the target-side
    # tokenizer (replaces the deprecated as_target_tokenizer context manager).
    # Padding is left to the data collator so label pad tokens become -100
    # and are ignored by the loss.
    def preprocess_function(examples):
        return tokenizer(
            examples['source'],
            text_target=examples['target'],
            max_length=128,
            truncation=True,
        )

    # Create Hugging Face Dataset
    dataset = Dataset.from_dict({
        'source': [p['source'] for p in training_data],
        'target': [p['target'] for p in training_data]
    })
    tokenized_dataset = dataset.map(preprocess_function, batched=True)

    # Split into train/val
    split = tokenized_dataset.train_test_split(test_size=0.1)

    # Training arguments
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        learning_rate=learning_rate,
        weight_decay=0.01,
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        logging_dir=f"{output_dir}/logs",
        logging_steps=100,
        fp16=torch.cuda.is_available(),  # Use mixed precision on GPU
        gradient_accumulation_steps=4,
        warmup_ratio=0.1,
    )

    # Data collator: pads dynamically and replaces label padding with -100
    data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

    # Create trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=split['train'],
        eval_dataset=split['test'],
        data_collator=data_collator,
    )

    # Train
    trainer.train()

    # Save model and tokenizer
    trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"✅ Model saved to {output_dir}")

    return trainer
```
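One caveat for the `en_to_chk` direction: Helsinki-NLP models with multilingual targets (such as `opus-mt-en-mul`) expect a target-language token prefixed to each source sentence. Whether Chuukese has a dedicated token in this model's vocabulary should be checked against the model card; the `>>chk<<` tag below is an assumption:

```python
# Hypothetical: prefix English sources with a target-language tag so the
# multilingual decoder knows which language to emit. Verify that ">>chk<<"
# actually exists in the opus-mt-en-mul vocabulary before relying on it.
def add_language_tag(pairs, tag=">>chk<<"):
    return [{'source': f"{tag} {p['source']}", 'target': p['target']} for p in pairs]
```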
## Model Evaluation

```python
import sacrebleu

def evaluate_model(translator, test_pairs: list, direction: str = "chk_to_en"):
    """Evaluate translation model on test set."""
    predictions = []
    references = []
    for pair in test_pairs:
        predictions.append(translator.translate(pair['source'], direction))
        references.append(pair['target'])

    # sacrebleu expects a list of reference streams: one inner list per
    # reference set, aligned one-to-one with the predictions
    refs = [references]

    # Calculate BLEU
    bleu = sacrebleu.corpus_bleu(predictions, refs)

    # Calculate chrF (preferred for low-resource languages)
    chrf = sacrebleu.corpus_chrf(predictions, refs)

    return {
        'bleu': bleu.score,
        'chrf': chrf.score,
        'num_samples': len(test_pairs),
        'sample_predictions': list(zip(
            [p['source'] for p in test_pairs[:5]],
            [p['target'] for p in test_pairs[:5]],
            predictions[:5]
        ))
    }
```
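Typical invocation against the held-out split:

```python
results = evaluate_model(translator, splits['test'])
print(f"BLEU: {results['bleu']:.2f}  chrF: {results['chrf']:.2f}  "
      f"({results['num_samples']} samples)")
for source, reference, prediction in results['sample_predictions']:
    print(f"{source!r} -> {prediction!r} (ref: {reference!r})")
```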
## GPU vs CPU Training

```python
def get_device_config():
    """Get optimal device configuration for training."""
    if torch.cuda.is_available():
        device = "cuda"
        gpu_name = torch.cuda.get_device_name(0)
        gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9

        # Adjust batch size based on GPU memory (GB)
        if gpu_memory >= 16:
            batch_size = 32
        elif gpu_memory >= 8:
            batch_size = 16
        else:
            batch_size = 8

        return {
            'device': device,
            'gpu_name': gpu_name,
            'gpu_memory_gb': gpu_memory,
            'batch_size': batch_size,
            'fp16': True,
            'gradient_accumulation': 2
        }
    else:
        return {
            'device': 'cpu',
            'batch_size': 4,
            'fp16': False,
            'gradient_accumulation': 8
        }
```
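The returned dictionary can feed directly into the fine-tuning call:

```python
device_config = get_device_config()
fine_tune_model(
    training_data=splits['train'],
    batch_size=device_config['batch_size'],  # scaled to available GPU memory
)
```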
## Usage Examples

### Basic Training

```python
# Prepare data
training_data = prepare_training_data()
splits = split_training_data(training_data)

# Train Chuukese → English model
fine_tune_model(
    training_data=splits['train'],
    direction="chk_to_en",
    epochs=10
)

# Train English → Chuukese model
fine_tune_model(
    training_data=splits['train'],
    direction="en_to_chk",
    epochs=10
)
```
### Model Comparison

```python
# Compare base vs fine-tuned
translator = HelsinkiChuukeseTranslator()

# Test with base model
translator.setup_models(use_finetuned=False)
base_results = evaluate_model(translator, splits['test'])

# Test with fine-tuned model
translator.setup_models(use_finetuned=True)
finetuned_results = evaluate_model(translator, splits['test'])

print(f"Base BLEU: {base_results['bleu']:.2f}")
print(f"Fine-tuned BLEU: {finetuned_results['bleu']:.2f}")
print(f"Improvement: {finetuned_results['bleu'] - base_results['bleu']:.2f}")
```
## Best Practices

### Training

- Use high-quality data: Bible verses and validated dictionary entries
- Balance dataset: Mix words, phrases, and sentences
- Early stopping: Monitor validation loss to prevent overfitting (see the callback sketch after this list)
- Mixed precision: Use FP16 on GPU for faster training
- Gradient accumulation: Use for larger effective batch sizes on limited memory
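Early stopping is not wired into `fine_tune_model` above; a minimal sketch using the stock `transformers` callback (the patience value is an assumption):

```python
from transformers import EarlyStoppingCallback

# Stop when eval loss fails to improve for 3 consecutive epochs. Requires
# eval_strategy="epoch" and load_best_model_at_end=True, both already set
# in fine_tune_model's TrainingArguments (the best-model metric defaults
# to eval loss).
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split['train'],
    eval_dataset=split['test'],
    data_collator=data_collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
```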
### Evaluation

- Use chrF for low-resource languages: More stable than BLEU
- Human evaluation: Periodic review by native speakers
- Back-translation: Verify round-trip consistency (see the sketch after this list)
- Cultural validation: Ensure cultural concepts are preserved
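A round-trip consistency check can be sketched with the translator class above (output is for manual review; exact round trips are rare):

```python
def round_trip_check(translator, chuukese_text: str):
    """Translate chk -> en -> chk and return both hops for manual review."""
    translator.setup_models(direction="chk_to_en")
    english = translator.translate(chuukese_text, direction="chk_to_en")
    # The class holds one model at a time, so reload for the reverse hop
    translator.setup_models(direction="en_to_chk")
    back = translator.translate(english, direction="en_to_chk")
    return english, back
```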
### Deployment

- Quantization: Use 8-bit quantization for faster inference (see the sketch after this list)
- Caching: Cache tokenizer and model loading
- Batch processing: Process multiple translations together
- Fallback: Keep base model as fallback if fine-tuned fails
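For CPU inference, PyTorch dynamic quantization is one low-effort option (a sketch, not the project's deployment path; it converts linear-layer weights to int8 and runs on CPU only):

```python
import torch

# Quantize the linear layers of a loaded MarianMT model to int8 weights;
# activations stay in float, so the accuracy loss is usually small
quantized_model = torch.quantization.quantize_dynamic(
    translator.model.to("cpu"),
    {torch.nn.Linear},
    dtype=torch.qint8,
)
```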
## Dependencies

- `torch>=2.0.0`: PyTorch
- `transformers>=4.57.0`: Hugging Face Transformers
- `datasets>=2.0.0`: Hugging Face Datasets
- `sentencepiece>=0.1.99`: Tokenization
- `sacrebleu>=2.0.0`: Evaluation metrics
- `accelerate>=0.20.0`: Training acceleration