# ONNX WebGPU Model Converter
Convert any HuggingFace model to ONNX and run it in the browser with Transformers.js + WebGPU.
## Workflow Overview
1. Check if an ONNX version already exists on HuggingFace
2. Set up a Python environment with `optimum`
3. Export the model to ONNX with `optimum-cli`
4. Quantize for the target deployment (WebGPU vs. WASM)
5. Upload to the HuggingFace Hub (optional)
6. Use in Transformers.js with WebGPU
## Step 1: Check for Existing ONNX Models
Before converting, check if the model already has an ONNX version:
- Search `onnx-community/<model-name>` on the HuggingFace Hub
- Check the model repo for an `onnx/` folder
- Browse https://huggingface.co/models?library=transformers.js (1200+ pre-converted models)
If found, skip to Step 6.
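To script this check, here is a minimal sketch using the official `huggingface_hub` client (the model name below is a placeholder):

```python
# Sketch: look for a pre-converted mirror under the onnx-community namespace.
# "bert-base-uncased" is a placeholder model ID.
from huggingface_hub import HfApi
from huggingface_hub.utils import RepositoryNotFoundError

api = HfApi()
model_name = "bert-base-uncased".split("/")[-1]  # strip any namespace

try:
    info = api.model_info(f"onnx-community/{model_name}")
    print(f"Found pre-converted repo: {info.id}")
except RepositoryNotFoundError:
    print("No onnx-community mirror found; proceed with conversion.")
```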
## Step 2: Environment Setup
```bash
# Create venv (recommended)
python -m venv onnx-env && source onnx-env/bin/activate

# Install optimum with ONNX support
pip install "optimum[onnx]" onnxruntime

# For GPU-accelerated export (optional)
pip install onnxruntime-gpu
```
Verify installation:
```bash
optimum-cli export onnx --help
```
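As an optional extra sanity check, you can confirm which execution providers onnxruntime sees:

```python
# Lists the execution providers this onnxruntime build supports.
# "CUDAExecutionProvider" appears only if onnxruntime-gpu is installed.
import onnxruntime as ort

print(ort.__version__)
print(ort.get_available_providers())  # e.g. ['CPUExecutionProvider']
```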
## Step 3: Export to ONNX
### Basic Export (auto-detect task)

```bash
optimum-cli export onnx --model <model_id_or_path> ./output_dir/
```
### With Explicit Task

```bash
optimum-cli export onnx \
  --model <model_id> \
  --task <task> \
  ./output_dir/
```
Common tasks: `text-generation`, `text-classification`, `feature-extraction`, `image-classification`, `automatic-speech-recognition`, `object-detection`, `image-segmentation`, `question-answering`, `token-classification`, `zero-shot-classification`
For decoder models, append `-with-past` for KV-cache reuse (this is the default behavior):
`text-generation-with-past`, `text2text-generation-with-past`, `automatic-speech-recognition-with-past`
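If you prefer to stay in Python, optimum exposes the same export through its ORT model classes; a minimal sketch (the model ID `gpt2` is a placeholder):

```python
# Programmatic equivalent of `optimum-cli export onnx` for a decoder model.
# export=True converts the PyTorch weights to ONNX on load.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "gpt2"  # placeholder
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Writes model.onnx plus config/tokenizer files next to it
model.save_pretrained("./output_dir/")
tokenizer.save_pretrained("./output_dir/")
```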
### Full CLI Reference
| Flag | Description |
|---|---|
| `-m MODEL, --model MODEL` | HuggingFace model ID or local path (required) |
| `--task TASK` | Export task (auto-detected if on Hub) |
| `--opset OPSET` | ONNX opset version (default: auto) |
| `--device DEVICE` | Export device, `cpu` (default) or `cuda` |
| `--optimize {O1,O2,O3,O4}` | ONNX Runtime optimization level |
| `--monolith` | Force single ONNX file (vs. split encoder/decoder) |
| `--no-post-process` | Skip post-processing (e.g., decoder merging) |
| `--trust-remote-code` | Allow custom model code from the Hub |
| `--pad_token_id ID` | Override pad token (needed for some models) |
| `--cache_dir DIR` | Cache directory for downloaded models |
| `--batch_size N` | Batch size for dummy inputs |
| `--sequence_length N` | Sequence length for dummy inputs |
| `--framework {pt}` | Source framework |
| `--atol ATOL` | Absolute tolerance for validation |
### Optimization Levels
| Level | Description |
|---|---|
| O1 | Basic general optimizations |
| O2 | Basic + extended + transformer fusions |
| O3 | O2 + GELU approximation |
| O4 | O3 + mixed precision fp16 (GPU only, requires `--device cuda`) |
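The same optimization levels are available post-export through optimum's Python API; a sketch assuming a sequence-classification export in `./output_dir/`:

```python
# Apply graph optimizations (the O2 level here) to an already-exported model.
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer
from optimum.onnxruntime.configuration import AutoOptimizationConfig

model = ORTModelForSequenceClassification.from_pretrained("./output_dir/")
optimizer = ORTOptimizer.from_pretrained(model)

optimizer.optimize(
    save_dir="./optimized_dir/",
    optimization_config=AutoOptimizationConfig.O2(),
)
```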
## Step 4: Quantize for Web Deployment
### Quantization Types for Transformers.js

| dtype | Precision | Best For | Size Reduction |
|---|---|---|---|
| `fp32` | Full 32-bit | Maximum accuracy | None (baseline) |
| `fp16` | Half 16-bit | WebGPU default quality | ~50% |
| `q8` / `int8` | 8-bit | WASM default, good balance | ~75% |
| `q4` / `bnb4` | 4-bit | Maximum compression | ~87% |
| `q4f16` | 4-bit weights, fp16 compute | WebGPU + small size | ~87% |
### Using optimum-cli quantization

```bash
# Dynamic quantization (post-export)
optimum-cli onnxruntime quantize \
  --onnx_model ./output_dir/ \
  --avx512 \
  -o ./quantized_dir/
```
### Using the Python API for finer control

```python
from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Load the exported ONNX model and attach a quantizer to it
model = ORTModelForSequenceClassification.from_pretrained("./output_dir/")
quantizer = ORTQuantizer.from_pretrained(model)

# Dynamic (is_static=False) int8 quantization targeting AVX512-VNNI CPUs
config = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="./quantized_dir/", quantization_config=config)
```
### Producing Multiple dtype Variants for Transformers.js

To provide `fp32`, `fp16`, `q8`, and `q4` variants (like onnx-community models), organize the output as:

```
model_onnx/
├── onnx/
│   ├── model.onnx             # fp32
│   ├── model_fp16.onnx        # fp16
│   ├── model_quantized.onnx   # q8
│   └── model_q4.onnx          # q4
├── config.json
├── tokenizer.json
└── tokenizer_config.json
```
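The `fp16` variant can be produced in several ways; one option is sketched below with the `onnxconverter-common` helper (assumes `pip install onnx onnxconverter-common`; the q8 and q4 files would come from the quantizers above):

```python
# Sketch: derive model_fp16.onnx from the fp32 export.
import onnx
from onnxconverter_common import float16

model = onnx.load("model_onnx/onnx/model.onnx")
# keep_io_types=True leaves inputs/outputs as fp32 for drop-in compatibility
model_fp16 = float16.convert_float_to_float16(model, keep_io_types=True)
onnx.save(model_fp16, "model_onnx/onnx/model_fp16.onnx")
```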
## Step 5: Upload to HuggingFace Hub (Optional)
```bash
# Login
huggingface-cli login

# Upload
huggingface-cli upload <your-username>/<model-name>-onnx ./output_dir/

# Add the transformers.js tag to the model card for discoverability
```
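The equivalent upload from Python, sketched with `huggingface_hub` (the repo ID is a placeholder):

```python
# Sketch: create the repo (no-op if it exists) and push the exported folder.
from huggingface_hub import HfApi

api = HfApi()  # assumes you are logged in via `huggingface-cli login`
repo_id = "your-username/model-name-onnx"  # placeholder

api.create_repo(repo_id, exist_ok=True)
api.upload_folder(folder_path="./output_dir/", repo_id=repo_id)
```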
## Step 6: Use in Transformers.js with WebGPU
### Install

```bash
npm install @huggingface/transformers
```
### Basic Pipeline with WebGPU

```js
import { pipeline } from "@huggingface/transformers";

const pipe = await pipeline("task-name", "model-id-or-path", {
  device: "webgpu", // GPU acceleration
  dtype: "q4",      // Quantization level
});

const result = await pipe("input text");
```
### Per-Module dtypes (encoder-decoder models)
Some models (Whisper, Florence-2) need different quantization per component:
```js
import { Florence2ForConditionalGeneration } from "@huggingface/transformers";

const model = await Florence2ForConditionalGeneration.from_pretrained(
  "onnx-community/Florence-2-base-ft",
  {
    dtype: {
      embed_tokens: "fp16",
      vision_encoder: "fp16",
      encoder_model: "q4",
      decoder_model_merged: "q4",
    },
    device: "webgpu",
  },
);
```
For detailed Transformers.js WebGPU usage patterns, see `references/webgpu-usage.md`.
## Troubleshooting

For conversion errors and common issues, see `references/conversion-guide.md`.
### Quick Fixes
- "Task not found": Use
--taskflag explicitly. For decoder models trytext-generation-with-past - "trust_remote_code": Add
--trust-remote-codeflag for custom model architectures - Out of memory: Use
--device cpuand smaller--batch_size - Validation fails: Try
--no-post-processor increase--atol - Model not supported: Check supported architectures — 120+ architectures supported
- WebGPU fallback to WASM: Ensure browser supports WebGPU (Chrome 113+, Edge 113+)
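For the validation-failure case, a manual parity check can tell you whether the mismatch matters; a sketch, assuming a text-classification export in `./output_dir/` (the model ID is a placeholder):

```python
# Sketch: compare ONNX logits against the original PyTorch model.
import numpy as np
import torch
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello world", return_tensors="pt")

with torch.no_grad():
    ref = AutoModelForSequenceClassification.from_pretrained(model_id)(**inputs).logits

onnx_model = ORTModelForSequenceClassification.from_pretrained("./output_dir/")
out = onnx_model(**inputs).logits

# A small max difference (e.g. < 1e-3) usually means the export is fine
print("max abs diff:", np.max(np.abs(ref.numpy() - out.numpy())))
```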
## Supported Task → Pipeline Mapping
| Task | Transformers.js Pipeline | Example Model |
|---|---|---|
| text-classification | `sentiment-analysis` | `distilbert-base-uncased-finetuned-sst-2` |
| text-generation | `text-generation` | `Qwen2.5-0.5B-Instruct` |
| feature-extraction | `feature-extraction` | `mxbai-embed-xsmall-v1` |
| automatic-speech-recognition | `automatic-speech-recognition` | `whisper-tiny.en` |
| image-classification | `image-classification` | `mobilenetv4_conv_small` |
| object-detection | `object-detection` | `detr-resnet-50` |
| image-segmentation | `image-segmentation` | `segformer-b0` |
| zero-shot-image-classification | `zero-shot-image-classification` | `clip-vit-base-patch32` |
| depth-estimation | `depth-estimation` | `depth-anything-small` |
| translation | `translation` | `nllb-200-distilled-600M` |
| summarization | `summarization` | `bart-large-cnn` |