Train FastText

Overview

This skill provides guidance for training FastText text classification models, particularly when facing dual constraints like model size limits and accuracy requirements. It covers systematic experimentation strategies, hyperparameter tuning approaches, and common pitfalls to avoid.

Constraint Prioritization Strategy

When facing competing constraints (e.g., model size < X MB AND accuracy >= Y%), establish a clear strategy:

Identify which constraint is harder to satisfy - Accuracy is typically harder to recover after compression
First achieve the accuracy target with an unconstrained model
Then apply size reduction techniques (quantization, dimension reduction, pruning)
Track the accuracy-size trade-off at each compression step

Systematic Experimentation Approach

Phase 1: Quick Exploratory Runs

Before committing to long training times, run quick experiments to understand parameter sensitivity:

# Quick baseline (1-2 minutes)
model = fasttext.train_supervised(
    input=train_file,
    dim=50,
    epoch=5,
    lr=0.5
)

Record results systematically:

Accuracy on validation set
Model file size
Training time

Phase 2: Parameter Sensitivity Analysis

Test one parameter at a time while holding others constant:

Parameter	Low	Medium	High	Impact
dim	50	100	200	Size, accuracy
epoch	5	15	25	Training time, accuracy
lr	0.1	0.5	1.0	Convergence speed
wordNgrams	1	2	3	Accuracy, size

Phase 3: Targeted Optimization

Based on Phase 2 findings, combine the best parameters and fine-tune.

Key FastText Parameters

Accuracy-Focused Parameters

dim: Word vector dimensions (higher = more expressive, larger model)
epoch: Training iterations (more epochs can improve accuracy, diminishing returns)
wordNgrams: N-gram features (2 or 3 often improves accuracy significantly)
lr: Learning rate (higher can speed convergence but may overshoot)
loss: Loss function (softmax for few classes, ova for many classes, ns for very large label spaces)

Size-Focused Parameters

dim: Lower dimensions = smaller model
bucket: Hash bucket size for n-grams (lower = smaller model, may hurt accuracy)
minCount: Minimum word frequency (higher = smaller vocabulary)
minn/maxn: Character n-gram range (0,0 disables, reduces size)

Model Compression Techniques

Quantization

FastText quantization can dramatically reduce model size (often 4-10x reduction):

model.quantize(input=train_file, retrain=True)
model.save_model("model.ftz")

Important trade-off: Quantization typically reduces accuracy by 1-5%. Plan for this when targeting accuracy thresholds.

When to Apply Quantization

If non-quantized model is close to size limit (e.g., 155MB vs 150MB limit), try parameter tuning first
If non-quantized model is far above limit, quantization is necessary
Always measure accuracy before and after quantization

Built-in Optimization Features

Autotune (Recommended)

FastText's autotune automatically searches for optimal hyperparameters:

model = fasttext.train_supervised(
    input=train_file,
    autotuneValidationFile=valid_file,
    autotuneDuration=600,  # seconds
    autotuneModelSize="150M"  # target size constraint
)

This is often more effective than manual parameter tuning.

Verification Strategies

1. Create a Validation Set

Reserve 10-20% of training data for validation. Do not rely solely on test set evaluation:

# Split data
shuf train.txt > shuffled.txt
head -n 80000 shuffled.txt > train_split.txt
tail -n 20000 shuffled.txt > valid_split.txt

2. Verify Model File Integrity

Before evaluation, verify the model file is valid:

import os
import fasttext

# Check file exists and has reasonable size
model_path = "/app/model.bin"
if os.path.exists(model_path):
    size_mb = os.path.getsize(model_path) / (1024 * 1024)
    print(f"Model size: {size_mb:.2f} MB")

    # Try loading to verify integrity
    model = fasttext.load_model(model_path)
    print(f"Labels: {len(model.labels)}")

3. Monitor Training Progress

For long-running training, implement progress monitoring:

import time

start_time = time.time()
model = fasttext.train_supervised(input=train_file, epoch=25, verbose=2)
elapsed = time.time() - start_time
print(f"Training completed in {elapsed:.1f} seconds")

Common Pitfalls to Avoid

1. Random Parameter Changes

Problem: Changing multiple parameters simultaneously without tracking impact.

Solution: Change one parameter at a time and record results in a structured log.

2. Premature Quantization

Problem: Always applying quantization regardless of whether it's needed.

Solution: Check if non-quantized model meets size constraint first. Minor parameter adjustments may achieve size goals with less accuracy loss than quantization.

3. Inadequate Time Estimation

Problem: Setting training timeouts too short for the chosen parameters.

Solution: Estimate training time based on:

Dataset size (lines × epoch count)
Previous run times with similar parameters
Add 50% buffer for safety

4. No Checkpoint Strategy

Problem: Losing good intermediate results when training is interrupted.

Solution: Save intermediate models and track their performance:

for epoch in [5, 10, 15, 20, 25]:
    model = fasttext.train_supervised(input=train_file, epoch=epoch)
    acc = evaluate(model, valid_file)
    model.save_model(f"model_epoch{epoch}.bin")
    print(f"Epoch {epoch}: accuracy={acc}")

5. Overwriting Best Models

Problem: New training runs overwrite previous better models.

Solution: Use timestamped or versioned model names:

import datetime
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
model.save_model(f"model_{timestamp}.bin")

6. Ignoring Text Preprocessing

Problem: Training on raw text without preprocessing.

Solution: Consider preprocessing steps:

Lowercasing
Removing special characters
Normalizing whitespace
Optional: removing stop words

Decision Flowchart

START
  │
  ▼
Run quick baseline (dim=50, epoch=5)
  │
  ▼
Does baseline meet accuracy target?
  │
  ├─ YES → Check size constraint
  │         ├─ Meets size → DONE
  │         └─ Exceeds size → Apply quantization or reduce dim
  │
  └─ NO → Increase model capacity
           │
           ▼
         Try: higher dim, more epochs, wordNgrams=2
           │
           ▼
         Does improved model meet accuracy?
           ├─ YES → Check size, apply compression if needed
           └─ NO → Try autotune with validation file

Environment Setup Best Practice

Avoid repeating environment setup in every command. Set up once at the start:

# Set up environment variables in shell profile or script
export PATH="$HOME/.local/bin:$PATH"
cd /app

# Or create a wrapper script

Summary Checklist

Before starting training:

Create validation split from training data
Plan systematic parameter exploration
Estimate training time for parameters
Set up model versioning/checkpointing

During training:

Track all experiments (parameters, accuracy, size, time)
Change one parameter at a time
Save promising intermediate models

After training:

Verify model file integrity
Test on validation set
Apply compression only if needed
Verify final model meets all constraints

train-fasttext