dflash-mlx-speculative-decoding
dflash-mlx Speculative Decoding
Skill by ara.so — Daily 2026 Skills collection.
DFlash implements lossless speculative decoding for MLX on Apple Silicon. A small draft model (~1B parameters) uses block diffusion to generate 16 tokens in parallel, and the target model verifies all 16 in a single forward pass. Tokens are emitted only after the target model has verified them, so the output is lossless: every emitted token is the target model's greedy (argmax) prediction.
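The verification rule can be illustrated with a toy, pure-Python sketch. It does not use MLX or the dflash-mlx API: `verify_block`, `greedy_decode`, `speculative_decode`, `toy_target` and `toy_draft` are hypothetical names, tokens are plain integers, and the single batched target forward pass is simulated by calling the toy target once per block position. The draft here is a deliberately imperfect autoregressive guesser rather than a block-diffusion model; the verification step only compares each drafted token with the target's greedy prediction, so it does not depend on how the draft was produced.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
TargetFn = Callable[[List[Token]], Token]            # greedy next token for a context
DraftFn = Callable[[List[Token], int], List[Token]]  # proposes a block of k tokens


def verify_block(draft: Sequence[Token], target_argmax: Sequence[Token]) -> List[Token]:
    """Greedy verification of one drafted block (hypothetical helper).

    target_argmax[i] is the target's greedy token given the context plus
    draft[:i]; it has one more entry than draft (the prediction after the
    full block), as a single forward pass over context + draft would give.
    """
    if len(target_argmax) != len(draft) + 1:
        raise ValueError("need one target prediction per draft position, plus one")
    emitted: List[Token] = []
    for drafted, expected in zip(draft, target_argmax):
        if drafted != expected:
            emitted.append(expected)  # first mismatch: keep the target's token, drop the rest
            return emitted
        emitted.append(drafted)       # draft agrees with the target's argmax
    emitted.append(target_argmax[-1])  # whole block accepted: extra token from the same pass
    return emitted


def greedy_decode(target: TargetFn, prompt: List[Token], n: int) -> List[Token]:
    """Reference decoding: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative_decode(target: TargetFn, draft: DraftFn, prompt: List[Token],
                       n: int, block_size: int = 16) -> Tuple[List[Token], int]:
    """Return (generated tokens, number of simulated target passes)."""
    seq = list(prompt)
    passes = 0
    while len(seq) - len(prompt) < n:
        block = draft(seq, block_size)
        # Stand-in for one batched target forward pass over seq + block.
        target_argmax = [target(seq + block[:i]) for i in range(len(block) + 1)]
        seq.extend(verify_block(block, target_argmax))
        passes += 1
    return seq[len(prompt):len(prompt) + n], passes


def toy_target(seq: List[Token]) -> Token:
    """Deterministic toy 'model': next token is a function of the context."""
    return (sum(seq) * 31 + 7) % 100


def toy_draft(seq: List[Token], k: int) -> List[Token]:
    """Copies the toy target, but every fifth guess is deliberately wrong."""
    ctx, out = list(seq), []
    for i in range(k):
        tok = toy_target(ctx)
        if i % 5 == 4:
            tok = (tok + 1) % 100
        out.append(tok)
        ctx.append(tok)
    return out


prompt = [1, 2, 3]
spec, passes = speculative_decode(toy_target, toy_draft, prompt, 50)
assert spec == greedy_decode(toy_target, prompt, 50)  # lossless: identical to greedy
print(f"50 tokens in {passes} simulated target passes instead of 50")
```

In this sketch a drafted token survives only while it matches the target's argmax, and the first mismatch is replaced by the target's own token, so the result always equals plain greedy decoding with the target. The draft quality only changes how many tokens each target pass yields (five per pass with this toy draft).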
Typical speedups: 1.7x–4.1x over baseline mlx_lm depending on model size and context length. Acceptance rates hover around 87–90% for Qwen3.5 models.
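To relate an acceptance rate to throughput, the standard idealized speculative-decoding estimate can be computed directly. It assumes each drafted token is accepted independently with probability `alpha` (this may not match how dflash-mlx defines its reported acceptance rate) and counts tokens per target forward pass for a 16-token block, ignoring the cost of running the draft model. The function name `expected_tokens_per_pass` is hypothetical.

```python
def expected_tokens_per_pass(alpha: float, block_size: int = 16) -> float:
    """Expected tokens emitted per target forward pass (idealized estimate).

    Assumes each drafted token is accepted independently with probability
    alpha, and that every pass also yields one token from the target itself
    (the correction at the first mismatch, or the extra token after a fully
    accepted block). The sum 1 + alpha + ... + alpha**block_size equals
    (1 - alpha**(block_size + 1)) / (1 - alpha).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(block_size + 1)
    return (1.0 - alpha ** (block_size + 1)) / (1.0 - alpha)


for alpha in (0.87, 0.90):
    print(f"alpha={alpha:.2f}: {expected_tokens_per_pass(alpha):.2f} tokens per target pass")
```

This is a token count per target pass, not a wall-clock speedup: draft-model time and the extra work of verifying a whole block are not modeled, so measured speedups can be well below this number.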
Installation
```bash
pip install dflash-mlx
# or isolated install
pipx install dflash-mlx
```
Requires Python 3.10+, MLX 0.31.1+, and an Apple Silicon Mac.
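A quick preflight check for these requirements might look like the following sketch. It uses only the standard library; the helper names (`parse_version`, `check_requirements`, `current_problems`) are hypothetical, and it assumes MLX is installed under the PyPI distribution name `mlx` and that a native Apple Silicon Python reports `Darwin`/`arm64` from the `platform` module (an x86_64 Python running under Rosetta reports `x86_64`, so this check flags it).

```python
import platform
import sys
from importlib import metadata
from typing import List, Optional, Sequence, Tuple

MIN_PYTHON = (3, 10)
MIN_MLX = (0, 31, 1)


def parse_version(text: str) -> Tuple[int, ...]:
    """Leading numeric components of a version string: '0.31.1rc1' -> (0, 31, 1)."""
    parts: List[int] = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if len(digits) != len(piece):
            break  # stop at a pre-release or local suffix such as 'rc1'
    return tuple(parts)


def check_requirements(python: Sequence[int], system: str, machine: str,
                       mlx_version: Optional[str]) -> List[str]:
    """Return a list of unmet requirements (empty if all checks pass)."""
    problems: List[str] = []
    if tuple(python[:2]) < MIN_PYTHON:
        problems.append("Python 3.10+ required, found " + ".".join(map(str, python)))
    if system != "Darwin" or machine != "arm64":
        problems.append(f"Apple Silicon Mac required, found {system}/{machine}")
    if mlx_version is None:
        problems.append("mlx is not installed")
    elif parse_version(mlx_version) < MIN_MLX:
        problems.append(f"MLX 0.31.1+ required, found {mlx_version}")
    return problems


def current_problems() -> List[str]:
    """Run check_requirements against the running interpreter."""
    try:
        mlx_version: Optional[str] = metadata.version("mlx")
    except metadata.PackageNotFoundError:
        mlx_version = None
    return check_requirements(sys.version_info[:3], platform.system(),
                              platform.machine(), mlx_version)


if __name__ == "__main__":
    issues = current_problems()
    print("\n".join(issues) if issues else "Python, platform and MLX version checks passed")
```

Keeping `check_requirements` free of environment lookups makes it easy to test with fixed inputs, while `current_problems` reads the live interpreter.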