Train FastText

Overview

This skill provides guidance for training FastText text classification models, particularly when facing dual constraints like model size limits and accuracy requirements. It covers systematic experimentation strategies, hyperparameter tuning approaches, and common pitfalls to avoid.

Constraint Prioritization Strategy

When facing competing constraints (e.g., model size < X MB AND accuracy >= Y%), establish a clear strategy:

  1. Identify which constraint is harder to satisfy - accuracy is typically harder to recover after compression
  2. First achieve the accuracy target with an unconstrained model
  3. Then apply size reduction techniques (quantization, dimension reduction, pruning)
  4. Track the accuracy-size trade-off at each compression step (a tracking sketch follows below)
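
As a concrete example of step 4, here is a minimal tracking sketch; the file names and parameter values are assumptions, and each attempt retrains from scratch because quantize() modifies the model in place:

import os
import fasttext

train_file = "train_split.txt"   # assumed paths; see the validation split example below
valid_file = "valid_split.txt"

results = []
for cutoff in [0, 200_000, 100_000, 50_000]:   # 0 keeps all features
    model = fasttext.train_supervised(input=train_file, dim=100, epoch=15)
    model.quantize(input=train_file, retrain=True, cutoff=cutoff)
    model.save_model("candidate.ftz")
    size_mb = os.path.getsize("candidate.ftz") / (1024 * 1024)
    _, precision, _ = model.test(valid_file)   # precision@1 on the validation set
    results.append((cutoff, size_mb, precision))
    print(f"cutoff={cutoff}: {size_mb:.1f} MB, precision={precision:.4f}")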

Systematic Experimentation Approach

Phase 1: Quick Exploratory Runs

Before committing to long training times, run quick experiments to understand parameter sensitivity:

import fasttext

train_file = "train_split.txt"  # example path; point this at your training data

# Quick baseline (1-2 minutes)
model = fasttext.train_supervised(
    input=train_file,
    dim=50,
    epoch=5,
    lr=0.5
)

Record results systematically (one lightweight approach is sketched after the list):

  • Accuracy on validation set
  • Model file size
  • Training time
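
One lightweight approach is appending every run to a CSV log; log_experiment and experiments.csv are illustrative names, not part of the FastText API:

import csv
import os

def log_experiment(path, params, accuracy, size_mb, seconds):
    """Append one experiment row to a CSV log, writing the header on first use."""
    write_header = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["params", "accuracy", "size_mb", "seconds"])
        writer.writerow([params, accuracy, size_mb, seconds])

# Illustrative values, not real results
log_experiment("experiments.csv", "dim=50 epoch=5 lr=0.5", 0.87, 12.4, 95)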

Phase 2: Parameter Sensitivity Analysis

Test one parameter at a time while holding others constant; a sweep sketch follows the table:

Parameter    Low   Medium  High  Impact
dim          50    100     200   Size, accuracy
epoch        5     15      25    Training time, accuracy
lr           0.1   0.5     1.0   Convergence speed
wordNgrams   1     2       3     Accuracy, size
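
A one-at-a-time sweep over the table above might look like this (the paths and baseline values are assumptions):

import fasttext

train_file, valid_file = "train_split.txt", "valid_split.txt"  # assumed paths
baseline = dict(dim=50, epoch=5, lr=0.5, wordNgrams=1)
sweeps = {
    "dim": [50, 100, 200],
    "epoch": [5, 15, 25],
    "lr": [0.1, 0.5, 1.0],
    "wordNgrams": [1, 2, 3],
}

for name, values in sweeps.items():
    for value in values:
        params = {**baseline, name: value}  # vary one knob, hold the rest
        model = fasttext.train_supervised(input=train_file, **params)
        _, precision, _ = model.test(valid_file)
        print(f"{name}={value}: precision={precision:.4f}")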

Phase 3: Targeted Optimization

Based on Phase 2 findings, combine the best parameters and fine-tune.

Key FastText Parameters

Accuracy-Focused Parameters

  • dim: Word vector dimensions (higher = more expressive, larger model)
  • epoch: Training iterations (more epochs can improve accuracy, diminishing returns)
  • wordNgrams: N-gram features (2 or 3 often improves accuracy significantly)
  • lr: Learning rate (higher can speed convergence but may overshoot)
  • loss: Loss function (softmax is the default; hs speeds up training when there are many labels; ova suits multi-label problems)

Size-Focused Parameters

  • dim: Lower dimensions = smaller model
  • bucket: Hash bucket size for n-grams (lower = smaller model, may hurt accuracy)
  • minCount: Minimum word frequency (higher = smaller vocabulary)
  • minn/maxn: Character n-gram range (setting both to 0 disables subword features and shrinks the model)
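
As a rough illustration, an accuracy-leaning and a size-leaning configuration side by side; the specific values are starting points, not recommendations:

import fasttext

train_file = "train_split.txt"  # assumed path

# Accuracy-leaning: larger vectors, bigrams, more epochs
accurate = fasttext.train_supervised(
    input=train_file, dim=200, epoch=25, wordNgrams=2, loss="softmax"
)

# Size-leaning: small vectors, fewer hash buckets, pruned vocabulary, no subwords
compact = fasttext.train_supervised(
    input=train_file, dim=50, bucket=200_000, minCount=5, minn=0, maxn=0
)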

Model Compression Techniques

Quantization

FastText quantization can dramatically reduce model size (often 4-10x reduction):

model.quantize(input=train_file, retrain=True)
model.save_model("model.ftz")  # .ftz is the conventional extension for quantized models

Important trade-off: Quantization typically reduces accuracy by 1-5%. Plan for this when targeting accuracy thresholds.

When to Apply Quantization

  • If the non-quantized model is only slightly over the size limit (e.g., 155 MB against a 150 MB limit), try parameter tuning first
  • If the non-quantized model is far above the limit, quantization is necessary
  • Always measure accuracy before and after quantization (a measurement sketch follows the list)
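
A minimal before/after measurement, assuming the train/validation split shown in the verification section below:

import os
import fasttext

train_file, valid_file = "train_split.txt", "valid_split.txt"  # assumed paths
model = fasttext.train_supervised(input=train_file, dim=100, epoch=15)

model.save_model("model.bin")
_, p_before, _ = model.test(valid_file)
mb_before = os.path.getsize("model.bin") / (1024 * 1024)

model.quantize(input=train_file, retrain=True)
model.save_model("model.ftz")
_, p_after, _ = model.test(valid_file)
mb_after = os.path.getsize("model.ftz") / (1024 * 1024)

print(f"precision: {p_before:.4f} before, {p_after:.4f} after")
print(f"size: {mb_before:.1f} MB before, {mb_after:.1f} MB after")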

Built-in Optimization Features

Autotune (Recommended)

FastText's autotune automatically searches for optimal hyperparameters:

model = fasttext.train_supervised(
    input=train_file,
    autotuneValidationFile=valid_file,
    autotuneDuration=600,  # seconds
    autotuneModelSize="150M"  # target size constraint
)

This is often more effective than manual parameter tuning.
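
Even so, confirm both constraints on the model autotune returns; a short check reusing valid_file from the example above:

import os

_, precision, _ = model.test(valid_file)
model.save_model("model_autotuned.ftz")  # with autotuneModelSize set, the result is quantized
size_mb = os.path.getsize("model_autotuned.ftz") / (1024 * 1024)
print(f"precision={precision:.4f}, size={size_mb:.1f} MB")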

Verification Strategies

1. Create a Validation Set

Reserve 10-20% of training data for validation. Do not rely solely on test set evaluation:

# Shuffle, then split 80/20 (counts below assume 100,000 total lines; adjust to your data)
shuf train.txt > shuffled.txt
head -n 80000 shuffled.txt > train_split.txt
tail -n 20000 shuffled.txt > valid_split.txt

2. Verify Model File Integrity

Before evaluation, verify the model file is valid:

import os
import fasttext

# Check the file exists and has a plausible size
model_path = "/app/model.bin"
if os.path.exists(model_path):
    size_mb = os.path.getsize(model_path) / (1024 * 1024)
    print(f"Model size: {size_mb:.2f} MB")

    # Loading raises an error if the file is truncated or corrupt
    model = fasttext.load_model(model_path)
    print(f"Labels: {len(model.labels)}")
else:
    print(f"Missing model file: {model_path}")

3. Monitor Training Progress

For long-running training, implement progress monitoring:

import time

start_time = time.time()
# verbose=2 prints progress, loss, and estimated time remaining during training
model = fasttext.train_supervised(input=train_file, epoch=25, verbose=2)
elapsed = time.time() - start_time
print(f"Training completed in {elapsed:.1f} seconds")

Common Pitfalls to Avoid

1. Random Parameter Changes

Problem: Changing multiple parameters simultaneously without tracking impact.

Solution: Change one parameter at a time and record results in a structured log.

2. Premature Quantization

Problem: Always applying quantization regardless of whether it's needed.

Solution: Check if non-quantized model meets size constraint first. Minor parameter adjustments may achieve size goals with less accuracy loss than quantization.

3. Inadequate Time Estimation

Problem: Setting training timeouts too short for the chosen parameters.

Solution: Estimate training time based on:

  • Dataset size (lines × epoch count)
  • Previous run times with similar parameters
  • A 50% safety buffer on top of the estimate (see the arithmetic sketch below)
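
A back-of-the-envelope sketch with hypothetical numbers:

# Scale a measured run linearly in (lines x epochs), then add a 50% buffer.
prev_seconds = 120.0          # hypothetical: 100,000 lines, 5 epochs took 2 minutes
prev_work = 100_000 * 5
planned_work = 100_000 * 25   # planned: same data, 25 epochs
estimate = prev_seconds * planned_work / prev_work  # 600 s
timeout = estimate * 1.5                            # 900 s
print(f"estimate {estimate:.0f}s, timeout {timeout:.0f}s")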

4. No Checkpoint Strategy

Problem: Losing good intermediate results when training is interrupted.

Solution: Save intermediate models and track their performance:

import fasttext

for epoch in [5, 10, 15, 20, 25]:
    # The Python API has no mid-run checkpoints, so each point is a fresh run
    model = fasttext.train_supervised(input=train_file, epoch=epoch)
    _, precision, _ = model.test(valid_file)  # precision@1 on the validation set
    model.save_model(f"model_epoch{epoch}.bin")
    print(f"Epoch {epoch}: precision={precision:.4f}")

5. Overwriting Best Models

Problem: New training runs overwrite previous better models.

Solution: Use timestamped or versioned model names:

import datetime
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
model.save_model(f"model_{timestamp}.bin")

6. Ignoring Text Preprocessing

Problem: Training on raw text without preprocessing.

Solution: Consider preprocessing steps (a sketch follows the list):

  • Lowercasing
  • Removing special characters
  • Normalizing whitespace
  • Optional: removing stop words
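
A minimal preprocessing sketch for FastText's __label__ format; it assumes single-label lines, so adjust for multi-label data:

import re

def preprocess(line: str) -> str:
    """Lowercase, strip special characters, and normalize whitespace."""
    label, _, text = line.partition(" ") if line.startswith("__label__") else ("", "", line)
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove special characters
    text = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
    return f"{label} {text}".strip()

print(preprocess("__label__spam FREE $$$ Click NOW!!!"))  # __label__spam free click now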

Decision Flowchart

START
Run quick baseline (dim=50, epoch=5)
Does baseline meet accuracy target?
  ├─ YES → Check size constraint
  │         ├─ Meets size → DONE
  │         └─ Exceeds size → Apply quantization or reduce dim
  └─ NO → Increase model capacity
         Try: higher dim, more epochs, wordNgrams=2
         Does improved model meet accuracy?
           ├─ YES → Check size, apply compression if needed
           └─ NO → Try autotune with validation file

Environment Setup Best Practice

Avoid repeating environment setup in every command. Set up once at the start:

# Set up environment variables in shell profile or script
export PATH="$HOME/.local/bin:$PATH"
cd /app

# Or create a wrapper script (run.sh is a hypothetical name)
cat > run.sh <<'EOF'
export PATH="$HOME/.local/bin:$PATH"
cd /app && exec "$@"
EOF
chmod +x run.sh

Summary Checklist

Before starting training:

  • Create validation split from training data
  • Plan systematic parameter exploration
  • Estimate training time for parameters
  • Set up model versioning/checkpointing

During training:

  • Track all experiments (parameters, accuracy, size, time)
  • Change one parameter at a time
  • Save promising intermediate models

After training:

  • Verify model file integrity
  • Test on validation set
  • Apply compression only if needed
  • Verify final model meets all constraints