mlx-fine-tuning
MLX Fine-Tuning
Comprehensive skill for fine-tuning Large Language Models using MLX framework on Apple Silicon (M1/M2/M3/M4).
Purpose
Enable efficient LLM fine-tuning on Apple Silicon using MLX's unified memory architecture and Metal GPU acceleration. Focus on LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning without requiring expensive GPU hardware.
When to Use This Skill
Invoke this skill when:
- Setting up MLX fine-tuning on Apple Silicon
- Converting models from HuggingFace to MLX format
- Configuring LoRA adapters for fine-tuning
- Optimizing hyperparameters for specific datasets
- Troubleshooting memory or performance issues
- Benchmarking fine-tuned models
- Managing and exporting adapters
Platform Requirements
Critical: MLX only works on Apple Silicon (M1/M2/M3/M4) running macOS natively.
- Architecture must be
arm64(verify withuname -m) - Cannot run in Docker or virtual machines
- Requires macOS 11.0 or later
Workflow
1. Environment Validation
First, validate the environment using the provided script:
python scripts/validate_environment.py
This checks:
- Apple Silicon architecture
- MLX installation
- Metal GPU availability
- Memory capacity
2. Model Preparation
Convert HuggingFace models to MLX format or use pre-converted models:
# Option A: Use pre-converted model from MLX Community
--model mlx-community/Qwen2.5-3B-Instruct-4bit
# Option B: Convert from HuggingFace
uv run mlx_lm.convert \
--hf-path Qwen/Qwen2.5-3B-Instruct \
--mlx-path models/Qwen2.5-3B-Instruct-mlx \
--quantize # Optional: for 4-bit quantization
3. Data Preparation
Prepare training data in MLX chat format (JSONL):
{"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is machine learning?"},
{"role": "assistant", "content": "Machine learning is..."}
]}
Save as train.jsonl and valid.jsonl in your data directory.
4. Hyperparameter Selection
Load references/hyperparameter_guidelines.md for detailed guidance based on dataset size.
Quick reference:
- Small datasets (<1k samples): LR 2e-5, 200-500 iterations, 4-8 layers
- Medium datasets (1k-10k): LR 1e-5, 500-1000 iterations, 8-16 layers
- Large datasets (10k+): LR 5e-6, 1000-2000 iterations, 16-32 layers
5. Training Execution
Run fine-tuning with mlx_lm:
uv run python -m mlx_lm lora \
--model models/Qwen2.5-3B-Instruct-mlx \
--train \
--data ./data \
--batch-size 1 \
--iters 500 \
--learning-rate 1e-5 \
--num-layers 8 \
--adapter-path adapters/my-lora \
--save-every 100 \
--grad-checkpoint # Enable for memory efficiency
6. Memory Optimization
If encountering memory issues, load references/memory_optimization.md for techniques:
- Gradient checkpointing (
--grad-checkpoint) - Batch size reduction
- Model quantization (4-bit)
- Layer count reduction
7. Evaluation and Testing
Test the fine-tuned model:
# Interactive generation
uv run python -m mlx_lm lora \
--model models/Qwen2.5-3B-Instruct-mlx \
--adapter-path adapters/my-lora \
--prompt "Your test prompt"
# Benchmark comparison
uv run python scripts/hyperparameter_optimizer.py \
--model models/Qwen2.5-3B-Instruct-mlx \
--adapter adapters/my-lora \
--test-samples 20
8. Adapter Management
Export and merge adapters:
# Export to safetensors format
uv run mlx_lm.fuse \
--model models/base-model \
--adapter-path adapters/my-lora \
--save-path models/fused-model
# Convert back to HuggingFace format
uv run mlx_lm.convert \
--mlx-path models/fused-model \
--hf-path models/hf-export
Troubleshooting
For common issues, load references/common_issues.md. Quick solutions:
- Memory errors: Reduce batch size to 1, enable gradient checkpointing
- Slow training: Verify Metal GPU usage with
mx.default_device() - Training instability: Lower learning rate, increase warmup ratio
- Format errors: Validate data format matches MLX chat structure
Advanced Techniques
Multi-Phase Training with LR Decay
# Progressive learning rate schedule
--lr-schedule "400:2e-5,400:1e-5,200:2e-6"
A/B Testing Configurations
Use scripts/hyperparameter_optimizer.py to test multiple configurations:
python scripts/hyperparameter_optimizer.py \
--config-file configs/experiment.yaml \
--parallel-runs 4
Monitoring Training
Track key metrics:
- Training/validation loss curves
- Memory usage (
mx.metal.get_active_memory()) - Token throughput
- Gradient norms
Best Practices
- Start Small: Test with few iterations before full training
- Checkpoint Frequently: Use
--save-everyto avoid losing progress - Monitor Memory: Track unified memory usage throughout training
- Validate Often: Check validation loss to detect overfitting early
- Compare Performance: Always benchmark against base model
- Version Control: Use descriptive names for adapter directories
- Document Experiments: Log all hyperparameters and results
Resource References
- Load
references/hyperparameter_guidelines.mdfor detailed parameter selection - Load
references/memory_optimization.mdfor memory management techniques - Load
references/common_issues.mdfor troubleshooting guide - Load
references/mlx_api_reference.mdfor MLX-specific functions
Output Artifacts
Training produces:
adapters/*/adapters.safetensors- LoRA weightsadapters/*/adapter_config.json- Configuration- Training logs with loss curves
- Checkpoint saves at specified intervals
Integration with Pipelines
For programmatic training, integrate with pipeline systems:
from training.mlx import TrainConfig, TrainPipeline
config = TrainConfig(
model="mlx-community/Qwen2.5-3B-Instruct-4bit",
train_data="data/train.jsonl",
iters=500,
learning_rate=1e-5,
num_layers=8,
grad_checkpoint=True
)
pipeline = TrainPipeline(config)
result = pipeline.execute()