# 🎵 Voice Note to MIDI
Transform your voice memos, humming, and melodic recordings into clean, quantized MIDI files ready for your DAW.
## What It Does
This skill provides a complete audio-to-MIDI conversion pipeline:

- **Stem Separation**: Uses HPSS (Harmonic-Percussive Source Separation) to isolate melodic content from drums, noise, and background sounds
- **ML-Powered Pitch Detection**: Leverages Spotify's Basic Pitch model for accurate fundamental frequency extraction
- **Key Detection**: Automatically detects the musical key of your recording using Krumhansl-Kessler key profiles
- **Intelligent Quantization**: Snaps notes to a configurable timing grid, with optional key-aware pitch correction
- **Post-Processing**: Applies octave pruning, overlap-based harmonic removal, and legato note merging for clean output
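The pitch-class distribution that feeds the analysis and key-detection steps can be sketched in a few lines of pure Python. The function name `pitch_class_histogram` is illustrative, not the pipeline's actual API:

```python
from collections import Counter

# Names for the 12 pitch classes, C through B.
PITCH_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class_histogram(midi_notes):
    """Count how often each of the 12 pitch classes occurs.

    MIDI note number mod 12 gives the pitch class
    (60 = C4, 64 = E4, 67 = G4, ...).
    """
    counts = Counter(note % 12 for note in midi_notes)
    return [counts.get(pc, 0) for pc in range(12)]

# A C-major arpeggio: C4, E4, G4, C5 populates bins C, E, and G.
hist = pitch_class_histogram([60, 64, 67, 72])
```

The histogram deliberately discards octave information: key detection cares only about which of the twelve pitch classes dominate, not the register they were sung in.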
## Pipeline Architecture
```text
Audio Input (WAV/M4A/MP3)
                   ↓
┌──────────────────────────────────────┐
│ Step 1: Stem Separation (HPSS)       │
│  - Isolate harmonic content          │
│  - Remove drums/percussion           │
│  - Noise gating                      │
└──────────────────────────────────────┘
                   ↓
┌──────────────────────────────────────┐
│ Step 2: Pitch Detection              │
│  - Basic Pitch ML model (Spotify)    │
│  - Polyphonic note detection         │
│  - Onset/offset estimation           │
└──────────────────────────────────────┘
                   ↓
┌──────────────────────────────────────┐
│ Step 3: Analysis                     │
│  - Pitch class distribution          │
│  - Key detection                     │
│  - Dominant note identification      │
└──────────────────────────────────────┘
                   ↓
┌──────────────────────────────────────┐
│ Step 4: Quantization & Cleanup       │
│  - Timing grid snap                  │
│  - Key-aware pitch correction        │
│  - Octave pruning (harmonic removal) │
│  - Overlap-based pruning             │
│  - Note merging (legato)             │
│  - Velocity normalization            │
└──────────────────────────────────────┘
                   ↓
MIDI Output (Standard MIDI File)
```
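The real pipeline uses librosa's HPSS for Step 1, but the noise-gating part of that step can be illustrated in pure Python. Everything here (the frame size, the RMS threshold, the function name `noise_gate`) is an illustrative assumption, not the script's actual code:

```python
import math

def noise_gate(samples, frame_size=512, threshold=0.02):
    """Zero out frames whose RMS energy falls below a threshold.

    samples: a list of floats in [-1.0, 1.0].
    Frames quieter than `threshold` RMS are treated as background
    noise and silenced; louder frames pass through unchanged.
    """
    gated = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms < threshold:
            gated.extend(0.0 for _ in frame)   # silence quiet frames
        else:
            gated.extend(frame)                # keep loud frames
    return gated
```

Gating after harmonic separation means breath noise and room rumble between sung phrases never reach the pitch detector, which reduces spurious low-confidence notes.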
## Setup

### Prerequisites
- Python 3.11+ (Python 3.14+ recommended)
- FFmpeg (for audio format support)
- pip
### Installation

**Quick Install (Recommended):**

```bash
cd /path/to/voice-note-to-midi
./setup.sh
```
This automated script will:

- Check that Python 3.11+ is installed
- Create the `~/melody-pipeline` directory
- Set up the virtual environment
- Install all dependencies (basic-pitch, librosa, music21, etc.)
- Download and configure the hum2midi script
- Add melody-pipeline to your PATH
**Manual Install:**

If you prefer manual setup:

```bash
mkdir -p ~/melody-pipeline
cd ~/melody-pipeline
python3 -m venv venv-bp
source venv-bp/bin/activate
pip install basic-pitch librosa soundfile mido music21
chmod +x ~/melody-pipeline/hum2midi
```

Add to your PATH (optional):

```bash
echo 'export PATH="$HOME/melody-pipeline:$PATH"' >> ~/.bashrc
source ~/.bashrc
```
### Verify Installation

```bash
cd ~/melody-pipeline
./hum2midi --help
```
## Usage

### Basic Usage

Convert a voice memo to MIDI:

```bash
./hum2midi my_humming.wav
```

This creates `my_humming.mid` with 16th-note quantization.

### Specify Output File

```bash
./hum2midi input.wav output.mid
```
### Command-Line Options

| Option | Description | Default |
|---|---|---|
| `--grid <value>` | Quantization grid: 1/4, 1/8, 1/16, 1/32 | `1/16` |
| `--min-note <ms>` | Minimum note duration in milliseconds | `50` |
| `--no-quantize` | Skip quantization (output raw Basic Pitch MIDI) | disabled |
| `--key-aware` | Enable key-aware pitch correction | disabled |
| `--no-analysis` | Skip pitch analysis and key detection | disabled |
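The `--grid` values map to MIDI tick lengths before notes are snapped. A minimal sketch of the arithmetic, assuming a 960-PPQ resolution (which would make a 1/16 grid 240 ticks, consistent with the sample output, but is a guess about the script's internals):

```python
def grid_ticks(ppq, grid):
    """Convert a grid spec like '1/16' to a tick length.

    One quarter note is `ppq` ticks, and a whole note is 4 quarters,
    so a 1/16 grid is ppq * 4 / 16 ticks.
    """
    denominator = int(grid.split("/")[1])
    return ppq * 4 // denominator

def snap(tick, grid):
    """Snap a tick position to the nearest grid line."""
    return round(tick / grid) * grid

# With 960 PPQ, a 1/16 grid is 240 ticks; an onset at tick 250
# snaps back to 240, and one at tick 370 snaps forward to 480.
sixteenth = grid_ticks(960, "1/16")
```

Coarser grids (`1/8`, `1/4`) snap more aggressively, which smooths sloppy timing but can collapse genuinely fast passages onto the same grid line.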
### Usage Examples

```bash
# Quantize to eighth notes
./hum2midi melody.wav --grid 1/8

# Key-aware quantization (recommended for tonal music)
./hum2midi song.wav --key-aware

# Require longer minimum notes
./hum2midi humming.wav --min-note 100

# Skip analysis for faster processing
./hum2midi quick.wav --no-analysis

# Combine options
./hum2midi recording.wav output.mid --grid 1/8 --key-aware --min-note 80
```
### Processing MIDI Input

You can also process existing MIDI files through the quantization pipeline:

```bash
./hum2midi input.mid output.mid --grid 1/16 --key-aware
```
This skips the audio processing steps and goes directly to analysis and quantization.
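The legato note-merging part of the cleanup can be sketched on simple `(pitch, start_tick, end_tick)` tuples, assuming notes are sorted by start time. The 60-tick gap mirrors the figure in the sample output below but is an illustrative default:

```python
def merge_legato(notes, max_gap=60):
    """Merge consecutive same-pitch notes separated by a small gap.

    notes: list of (pitch, start_tick, end_tick), sorted by start.
    Staccato fragments of the same pitch whose inter-note gap is
    <= max_gap ticks are fused into one sustained (legato) note.
    """
    merged = []
    for pitch, start, end in notes:
        if merged:
            last_pitch, last_start, last_end = merged[-1]
            if pitch == last_pitch and start - last_end <= max_gap:
                merged[-1] = (pitch, last_start, max(last_end, end))
                continue
        merged.append((pitch, start, end))
    return merged

# Three staccato C4 fragments with 40-tick gaps become one note;
# the following D4 is left alone.
notes = [(60, 0, 100), (60, 140, 240), (60, 280, 380), (62, 500, 600)]
```

This matters for hummed input, where a single intended note often arrives from the pitch detector as several short fragments separated by breaths or wobbles.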
## Sample Output

```text
═══════════════════════════════════════════════════════════════
  hum2midi - Melody-to-MIDI Pipeline (Basic Pitch Edition)
  [Key-Aware Mode Enabled]
═══════════════════════════════════════════════════════════════
  Input:  my_humming.wav
  Output: my_humming.mid

▶ Step 1: Stem Separation (HPSS)
  Isolating melodic content...
  Loaded: 5.23s @ 44100Hz
  ✓ Melody stem extracted → 5.23s

▶ Step 2: Audio-to-MIDI Conversion (Basic Pitch)
  Running Spotify's Basic Pitch ML model on melody stem...
  ✓ Raw MIDI generated (Basic Pitch)

▶ Step 3: Pitch Analysis & Key Detection
  Notes detected: 42 total, 7 unique
  Note range: C3 - G4
  Pitch classes: C3, E3, G3, A3, C4, D4, G4
  Dominant note: G3 (23.8% of notes)
  Detected key: G major

▶ Step 4: Quantization & Cleanup
  Octave pruning: removed 3 harmonic notes above 67 (median+12)
  Overlap pruning: removed 2 harmonic notes at overlapping positions
  Note merging: merged 5 staccato chunks into legato notes (gap<=60 ticks)
  Grid: 240 ticks (1/16)
  Notes: 38 notes
  Key: G major
  Key-aware: 2 notes corrected to scale
  Tempo: 120 BPM
  ✓ Quantized MIDI saved

═══════════════════════════════════════════════════════════════
  ✓ Done! Output: my_humming.mid
═══════════════════════════════════════════════════════════════

📊 ANALYSIS SUMMARY
═══════════════════════════════════════════════════════════════
  Detected Notes: C3, E3, G3, A3, C4, D4, G4
  Detected Key:   G major
  Quantization:   Key-aware mode (notes snapped to scale)
  MIDI Info:      38 notes, 7 unique pitches, 120 BPM
  Pitches:        C3, E3, G3, A3, C4, D4, G4
```
## Notes & Limitations

### Audio Quality Matters
- Clear, loud melody produces the best results
- Background noise can cause false note detection
- Reverb and effects may confuse pitch detection
- Close-mic'd vocals work significantly better than room recordings
### Musical Considerations
- Monophonic sources work best (single melody line)
- Polyphonic audio (chords, multiple instruments) will produce messy results
- Vibrato and pitch bends may be quantized to stepped pitches
- Rapid note passages may be missed or merged
### Technical Limitations
- Tempo is fixed at 120 BPM in output (time positions are preserved, but tempo may need adjustment in your DAW)
- Note velocities are normalized but may need manual adjustment
- Very short notes (<50ms) may be filtered out by default
- Extreme pitch ranges may cause octave detection issues
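The velocity normalization mentioned above can be sketched as scaling every velocity so the loudest note lands on a target level; the target of 100 and the clamping behavior are illustrative assumptions, not the script's documented defaults:

```python
def normalize_velocities(velocities, target_peak=100):
    """Scale MIDI velocities so the loudest note hits target_peak.

    Relative dynamics are preserved; results are clamped to the
    valid MIDI note-on velocity range 1-127.
    """
    peak = max(velocities)
    scale = target_peak / peak
    return [max(1, min(127, round(v * scale))) for v in velocities]

# Velocities 40/60/80 scale to 50/75/100: the ratios survive,
# only the overall level changes.
```

Because only the ratios survive, you can still reshape dynamics afterwards in your DAW without fighting a flattened velocity lane.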
### Post-Processing Recommendations
After generating MIDI, you may want to:
- Import into your DAW and adjust tempo to match your original recording
- Quantize further if stricter timing is needed
- Adjust note velocities for dynamics
- Apply swing/groove templates if the rigid grid sounds too mechanical
- Edit individual notes that were misdetected (common with fast runs)
## Supported Audio Formats
Input formats supported via FFmpeg:
- WAV, AIFF, FLAC (uncompressed, best quality)
- MP3, M4A, AAC (compressed, acceptable)
- OGG, OPUS (open source formats)
- Most other formats FFmpeg supports
## Troubleshooting

### No notes detected
- Check that the input file isn't silent or corrupted
- Try increasing the `--min-note` threshold
- Verify the audio has clear melodic content (not just noise)
### Too many notes / messy output

- Keep octave pruning and overlap pruning enabled (they are on by default)
- Use `--key-aware` to constrain notes to the detected scale
- Check for background noise in the source audio
### Wrong key detected
- Key detection works best with at least 8-10 measures of music
- Chromatic passages may confuse the detector
- Manually review and adjust in your DAW if needed
### Notes in wrong octave
- Basic Pitch sometimes detects harmonics instead of fundamentals
- The pipeline includes pruning, but some may slip through
- Use your DAW's transpose function for simple octave shifts
## References
- Basic Pitch - Spotify's polyphonic pitch detection model
- librosa HPSS - Harmonic-Percussive Source Separation
- Krumhansl-Kessler Key Profiles - Key detection algorithm
## License
This skill integrates Basic Pitch by Spotify, which is licensed under Apache 2.0. The pipeline script and documentation are provided under MIT license.