train-fasttext
Train FastText
Overview
This skill provides guidance for training FastText text classification models, particularly when facing dual constraints like model size limits and accuracy requirements. It covers systematic experimentation strategies, hyperparameter tuning approaches, and common pitfalls to avoid.
Constraint Prioritization Strategy
When facing competing constraints (e.g., model size < X MB AND accuracy >= Y%), establish a clear strategy:
- Identify which constraint is harder to satisfy - Accuracy is typically harder to recover after compression
- First achieve the accuracy target with an unconstrained model
- Then apply size reduction techniques (quantization, dimension reduction, pruning)
- Track the accuracy-size trade-off at each compression step
Systematic Experimentation Approach
Phase 1: Quick Exploratory Runs
Before committing to long training times, run quick experiments to understand parameter sensitivity:
# Quick baseline (1-2 minutes)
model = fasttext.train_supervised(
input=train_file,
dim=50,
epoch=5,
lr=0.5
)
Record results systematically:
- Accuracy on validation set
- Model file size
- Training time
Phase 2: Parameter Sensitivity Analysis
Test one parameter at a time while holding others constant:
| Parameter | Low | Medium | High | Impact |
|---|---|---|---|---|
| dim | 50 | 100 | 200 | Size, accuracy |
| epoch | 5 | 15 | 25 | Training time, accuracy |
| lr | 0.1 | 0.5 | 1.0 | Convergence speed |
| wordNgrams | 1 | 2 | 3 | Accuracy, size |
Phase 3: Targeted Optimization
Based on Phase 2 findings, combine the best parameters and fine-tune.
Key FastText Parameters
Accuracy-Focused Parameters
- dim: Word vector dimensions (higher = more expressive, larger model)
- epoch: Training iterations (more epochs can improve accuracy, diminishing returns)
- wordNgrams: N-gram features (2 or 3 often improves accuracy significantly)
- lr: Learning rate (higher can speed convergence but may overshoot)
- loss: Loss function (
softmaxfor few classes,ovafor many classes,nsfor very large label spaces)
Size-Focused Parameters
- dim: Lower dimensions = smaller model
- bucket: Hash bucket size for n-grams (lower = smaller model, may hurt accuracy)
- minCount: Minimum word frequency (higher = smaller vocabulary)
- minn/maxn: Character n-gram range (0,0 disables, reduces size)
Model Compression Techniques
Quantization
FastText quantization can dramatically reduce model size (often 4-10x reduction):
model.quantize(input=train_file, retrain=True)
model.save_model("model.ftz")
Important trade-off: Quantization typically reduces accuracy by 1-5%. Plan for this when targeting accuracy thresholds.
When to Apply Quantization
- If non-quantized model is close to size limit (e.g., 155MB vs 150MB limit), try parameter tuning first
- If non-quantized model is far above limit, quantization is necessary
- Always measure accuracy before and after quantization
Built-in Optimization Features
Autotune (Recommended)
FastText's autotune automatically searches for optimal hyperparameters:
model = fasttext.train_supervised(
input=train_file,
autotuneValidationFile=valid_file,
autotuneDuration=600, # seconds
autotuneModelSize="150M" # target size constraint
)
This is often more effective than manual parameter tuning.
Verification Strategies
1. Create a Validation Set
Reserve 10-20% of training data for validation. Do not rely solely on test set evaluation:
# Split data
shuf train.txt > shuffled.txt
head -n 80000 shuffled.txt > train_split.txt
tail -n 20000 shuffled.txt > valid_split.txt
2. Verify Model File Integrity
Before evaluation, verify the model file is valid:
import os
import fasttext
# Check file exists and has reasonable size
model_path = "/app/model.bin"
if os.path.exists(model_path):
size_mb = os.path.getsize(model_path) / (1024 * 1024)
print(f"Model size: {size_mb:.2f} MB")
# Try loading to verify integrity
model = fasttext.load_model(model_path)
print(f"Labels: {len(model.labels)}")
3. Monitor Training Progress
For long-running training, implement progress monitoring:
import time
start_time = time.time()
model = fasttext.train_supervised(input=train_file, epoch=25, verbose=2)
elapsed = time.time() - start_time
print(f"Training completed in {elapsed:.1f} seconds")
Common Pitfalls to Avoid
1. Random Parameter Changes
Problem: Changing multiple parameters simultaneously without tracking impact.
Solution: Change one parameter at a time and record results in a structured log.
2. Premature Quantization
Problem: Always applying quantization regardless of whether it's needed.
Solution: Check if non-quantized model meets size constraint first. Minor parameter adjustments may achieve size goals with less accuracy loss than quantization.
3. Inadequate Time Estimation
Problem: Setting training timeouts too short for the chosen parameters.
Solution: Estimate training time based on:
- Dataset size (lines × epoch count)
- Previous run times with similar parameters
- Add 50% buffer for safety
4. No Checkpoint Strategy
Problem: Losing good intermediate results when training is interrupted.
Solution: Save intermediate models and track their performance:
for epoch in [5, 10, 15, 20, 25]:
model = fasttext.train_supervised(input=train_file, epoch=epoch)
acc = evaluate(model, valid_file)
model.save_model(f"model_epoch{epoch}.bin")
print(f"Epoch {epoch}: accuracy={acc}")
5. Overwriting Best Models
Problem: New training runs overwrite previous better models.
Solution: Use timestamped or versioned model names:
import datetime
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
model.save_model(f"model_{timestamp}.bin")
6. Ignoring Text Preprocessing
Problem: Training on raw text without preprocessing.
Solution: Consider preprocessing steps:
- Lowercasing
- Removing special characters
- Normalizing whitespace
- Optional: removing stop words
Decision Flowchart
START
│
▼
Run quick baseline (dim=50, epoch=5)
│
▼
Does baseline meet accuracy target?
│
├─ YES → Check size constraint
│ ├─ Meets size → DONE
│ └─ Exceeds size → Apply quantization or reduce dim
│
└─ NO → Increase model capacity
│
▼
Try: higher dim, more epochs, wordNgrams=2
│
▼
Does improved model meet accuracy?
├─ YES → Check size, apply compression if needed
└─ NO → Try autotune with validation file
Environment Setup Best Practice
Avoid repeating environment setup in every command. Set up once at the start:
# Set up environment variables in shell profile or script
export PATH="$HOME/.local/bin:$PATH"
cd /app
# Or create a wrapper script
Summary Checklist
Before starting training:
- Create validation split from training data
- Plan systematic parameter exploration
- Estimate training time for parameters
- Set up model versioning/checkpointing
During training:
- Track all experiments (parameters, accuracy, size, time)
- Change one parameter at a time
- Save promising intermediate models
After training:
- Verify model file integrity
- Test on validation set
- Apply compression only if needed
- Verify final model meets all constraints