# ACE-Step 1.5 Music Generation

Open-source music generation via `tools/music_gen.py`.
Cloud providers:

- **acemusic** (default) — Official ACE-Step cloud API with the XL Turbo (4B) model and 5Hz LM thinking mode. Free API key from acemusic.ai/api-key. No GPU required.
- **modal** — Self-hosted ACE-Step 2B Turbo on Modal. Requires `MODAL_MUSIC_GEN_ENDPOINT_URL`.
- **runpod** — Self-hosted ACE-Step 2B Turbo on RunPod. Requires `RUNPOD_ACESTEP_ENDPOINT_ID`.
## Setup

```bash
# acemusic (recommended — free, best quality, no GPU)
echo "ACEMUSIC_API_KEY=your_key" >> .env
# Get key at https://acemusic.ai/api-key

# Self-hosted (optional fallback)
python tools/music_gen.py --setup           # RunPod
modal deploy docker/modal-music-gen/app.py  # Modal
```
## Quick Reference

```bash
# Basic generation (uses acemusic XL Turbo by default)
python tools/music_gen.py --prompt "Upbeat tech corporate" --duration 60 --output bg.mp3

# Generate 4 variations, pick the best
python tools/music_gen.py --prompt "Calm ambient piano" --duration 30 --variations 4 --output ambient.mp3

# Fast mode (disable thinking)
python tools/music_gen.py --no-thinking --prompt "Quick draft" --duration 30 --output draft.mp3

# With musical control
python tools/music_gen.py --prompt "Calm ambient piano" --duration 30 --bpm 72 --key "D Major" --output ambient.mp3

# Scene presets (video production)
python tools/music_gen.py --preset corporate-bg --duration 60 --output bg.mp3
python tools/music_gen.py --preset tension --duration 20 --output problem.mp3
python tools/music_gen.py --preset cta --brand digital-samba --duration 15 --output cta.mp3

# Vocals with lyrics ($'...' so bash expands \n into real newlines)
python tools/music_gen.py --prompt "Indie pop jingle" --lyrics $'[verse]\nBuild it better\nShip it faster' --duration 30 --output jingle.mp3

# Cover / style transfer
python tools/music_gen.py --cover --reference theme.mp3 --prompt "Jazz piano version" --duration 60 --output jazz_cover.mp3

# Repaint a weak section
python tools/music_gen.py --repaint --input track.mp3 --repaint-start 15 --repaint-end 25 --prompt "Guitar solo" --output fixed.mp3

# Continue from existing audio
python tools/music_gen.py --continuation --input track.mp3 --prompt "Continue with jazz piano" --output extended.mp3

# Stem extraction
python tools/music_gen.py --extract vocals --input mixed.mp3 --output vocals.mp3

# Fall back to self-hosted
python tools/music_gen.py --cloud modal --prompt "Background music" --duration 60 --output bg.mp3
```
## Fixing "Samey" Output

If generated music sounds repetitive or lacks variety, try these in order:

- Use acemusic cloud (default) — the XL Turbo 4B model is significantly more capable than the 2B model on Modal/RunPod
- Keep thinking mode on (default for acemusic) — the 5Hz LM enriches sparse prompts into detailed musical descriptions
- Generate variations — `--variations 4` generates 4 takes; pick the best
- Use stochastic inference — `--infer-method sde` adds randomness (the same seed gives different results)
- Vary BPM and key across scenes — don't use the same preset for every scene
- Write sparser prompts — "Upbeat indie rock" gives the model more creative freedom than a hyper-detailed description
- Vary seeds — omit `--seed` to let each generation be unique
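The seed and `sde` tips above can be combined in a small driver script that emits several distinct takes. A sketch, assuming you want plain CLI strings to run or log (the `variation_commands` helper and `take_N.mp3` naming are illustrative, not part of the tool):

```python
import random
import shlex

def variation_commands(prompt: str, duration: int, n: int = 4) -> list[str]:
    """Build n music_gen.py invocations with distinct seeds and
    stochastic (sde) inference, so each take differs."""
    seeds = random.sample(range(1, 1_000_000), n)  # guaranteed distinct
    cmds = []
    for i, seed in enumerate(seeds, start=1):
        cmds.append(
            "python tools/music_gen.py "
            f"--prompt {shlex.quote(prompt)} "
            f"--duration {duration} --infer-method sde "
            f"--seed {seed} --output take_{i}.mp3"
        )
    return cmds

for cmd in variation_commands("Upbeat indie rock", 30):
    print(cmd)
```

`shlex.quote` keeps multi-word prompts safe if you pipe these commands into a shell.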
## Creating a Song (Step by Step)

### 1. Instrumental background track (simplest)

```bash
python tools/music_gen.py --prompt "Upbeat indie rock, driving drums, jangly guitar" --duration 60 --bpm 120 --key "G Major" --output track.mp3
```
### 2. Song with vocals and lyrics

Write lyrics in a temp file or pass them inline. Use structure tags to control song sections.

```bash
# Write lyrics to a file first (recommended for longer songs)
cat > /tmp/lyrics.txt << 'LYRICS'
[Verse 1]
Walking through the morning light
Coffee in my hand feels right
Another day to build and dream
Nothing's ever what it seems

[Chorus - anthemic]
WE KEEP MOVING FORWARD
Through the noise and doubt
We keep moving forward
That's what it's about

[Verse 2]
Screens are glowing late at night
Shipping code until it's right
The deadline's close but so are we
Almost there, just wait and see

[Chorus - bigger]
WE KEEP MOVING FORWARD
Through the noise and doubt
We keep moving forward
That's what it's about

[Outro - fade]
(Moving forward...)
LYRICS

# Generate the song
python tools/music_gen.py \
  --prompt "Upbeat indie rock anthem, male vocal, driving drums, electric guitar, studio polish" \
  --lyrics "$(cat /tmp/lyrics.txt)" \
  --duration 60 \
  --bpm 128 \
  --key "G Major" \
  --output my_song.mp3
```
### 3. Repaint a weak section

If the chorus sounds weak, regenerate just that section:

```bash
python tools/music_gen.py --repaint --input my_song.mp3 --repaint-start 20 --repaint-end 35 --prompt "Powerful anthemic chorus, big drums" --output fixed.mp3
```
### 4. Continue/extend a track

```bash
python tools/music_gen.py --continuation --input my_song.mp3 --prompt "Continue with gentle acoustic outro" --output extended.mp3
```
### Key tips for good results

- Caption = overall style (genre, instruments, mood, production quality)
- Lyrics = temporal structure (verse/chorus flow, vocal delivery)
- UPPERCASE in lyrics = high vocal intensity
- Parentheses = background vocals: "We rise (together)"
- Keep 6-10 syllables per line for natural rhythm
- Don't describe the melody in the caption — describe the sound and feeling
- Use `--seed` to lock randomness when iterating on prompt/lyrics
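The 6-10 syllable guideline is easy to check mechanically before generating. A rough sketch (vowel-group counting is only approximate, and `estimate_syllables`/`flag_lines` are hypothetical helpers, not part of the tool):

```python
import re

def estimate_syllables(line: str) -> int:
    """Rough syllable count: one syllable per vowel group, with a
    crude silent-'e' adjustment. Good enough to flag lines far
    outside the 6-10 syllable sweet spot."""
    count = 0
    for word in re.findall(r"[a-z']+", line.lower()):
        groups = len(re.findall(r"[aeiouy]+", word))
        # drop a trailing silent 'e' ("dream"), but keep "-le"/"-ee" endings
        if word.endswith("e") and groups > 1 and not word.endswith(("le", "ee")):
            groups -= 1
        count += max(groups, 1)
    return count

def flag_lines(lyrics: str) -> list[tuple[str, int]]:
    """Return (line, syllables) for lyric lines outside 6-10,
    skipping [section] tags and blank lines."""
    flagged = []
    for line in lyrics.splitlines():
        line = line.strip()
        if not line or line.startswith("["):
            continue
        n = estimate_syllables(line)
        if not 6 <= n <= 10:
            flagged.append((line, n))
    return flagged
```

Run it over a lyrics file before generating and rewrite only the flagged lines.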
### Controlling vocal gender

The model doesn't reliably follow "female vocal" or "male vocal" on its own. Use both of these together:

- In the prompt: be explicit — "solo female singer, alto voice" or "female vocalist only, breathy intimate voice". Adding an artist reference helps (e.g., "Brandi Carlile style").
- In the lyrics: add `[female vocal]` tags before each section:

```
[female vocal]
[Verse 1]
Walking through the morning light...

[female vocal]
[Chorus - anthemic]
WE KEEP MOVING FORWARD...
```

Just saying "female vocal" in the prompt alone is often ignored. The combination of prompt + lyrics tags is what works.
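Tagging every section by hand gets tedious for long lyrics. A sketch of a pre-processing helper (`tag_vocal` is hypothetical, not part of the tool; it assumes section headers are the only lines starting with `[`):

```python
def tag_vocal(lyrics: str, tag: str = "[female vocal]") -> str:
    """Insert a vocal tag before every [Section] header so the
    gender hint is repeated per section. Headers that already
    mention 'vocal' are left alone."""
    out = []
    for line in lyrics.splitlines():
        if line.strip().startswith("[") and "vocal" not in line.lower():
            out.append(tag)
        out.append(line)
    return "\n".join(out)
```

Feed the result to `--lyrics` as usual.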
### Duets and vocal trading

For duets with male/female vocals trading verses, use both the prompt and per-section lyrics tags:

- Prompt: "duet, male and female vocals trading verses, warm harmonies on chorus"
- Lyrics: tag each section with who sings it:

```
[Verse 1 - male vocal, storytelling]
First verse lyrics here...

[Chorus - male and female duet, harmonies]
Chorus lyrics here...

[Verse 2 - female vocal, wry]
Second verse lyrics here...

[Bridge - male vocal, spoken]
Spoken bridge...

[Bridge - female vocal, sung]
Sung response...
```

This reliably produces vocal trading between sections and harmonies on shared parts.
## Scene Presets

| Preset | BPM | Key | Use Case |
|---|---|---|---|
| `corporate-bg` | 110 | C Major | Professional background, presentations |
| `upbeat-tech` | 128 | G Major | Product launches, tech demos |
| `ambient` | 72 | D Major | Overview slides, reflective content |
| `dramatic` | 90 | D Minor | Reveals, announcements |
| `tension` | 85 | A Minor | Problem statements, challenges |
| `hopeful` | 120 | C Major | Solution reveals, resolutions |
| `cta` | 135 | E Major | Call to action, closing energy |
| `lofi` | 85 | F Major | Screen recordings, coding demos |
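When scripting a whole video's soundtrack, the preset table can drive one CLI call per scene. A sketch (the `PRESETS` dict merely mirrors the table above for validation; the authoritative values live inside `tools/music_gen.py`, and `scene_plan_commands` is a hypothetical helper):

```python
# Illustrative copy of the preset table: preset -> (bpm, key)
PRESETS = {
    "corporate-bg": (110, "C Major"),
    "upbeat-tech": (128, "G Major"),
    "ambient": (72, "D Major"),
    "dramatic": (90, "D Minor"),
    "tension": (85, "A Minor"),
    "hopeful": (120, "C Major"),
    "cta": (135, "E Major"),
    "lofi": (85, "F Major"),
}

def scene_plan_commands(plan: list[tuple[str, str, int]]) -> list[str]:
    """Turn a (scene_name, preset, seconds) plan into one CLI call
    per scene, rejecting typo'd preset names early."""
    cmds = []
    for scene, preset, seconds in plan:
        if preset not in PRESETS:
            raise ValueError(f"unknown preset: {preset}")
        cmds.append(
            f"python tools/music_gen.py --preset {preset} "
            f"--duration {seconds} --output {scene}.mp3"
        )
    return cmds
```

Example: `scene_plan_commands([("intro", "dramatic", 5), ("demo", "lofi", 60)])` yields two ready-to-run commands.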
## Task Types

### text2music (default)

Generate music from a text prompt plus optional lyrics.

### cover

Style transfer from reference audio. Control the blend with `--cover-strength` (0.0-1.0):

- 0.2 — Loose style inspiration (more creative freedom)
- 0.5 — Balanced style transfer
- 0.7 — Close to original structure (default)
- 1.0 — Maximum fidelity to source

### extract

Stem separation — isolate individual tracks from mixed audio.

Tracks: vocals, drums, bass, guitar, piano, keyboard, strings, brass, woodwinds, other

### repainting (acemusic only)

Regenerate a specific time segment within existing audio while preserving the rest.

```bash
python tools/music_gen.py --repaint --input track.mp3 --repaint-start 15 --repaint-end 25 --prompt "Guitar solo" --output fixed.mp3
```

### continuation (acemusic only)

Extend existing audio by continuing from where it ends.

```bash
python tools/music_gen.py --continuation --input track.mp3 --prompt "Continue with jazz piano" --output extended.mp3
```
## Prompt Engineering

### Caption Writing — Layer Dimensions

Write captions by layering multiple descriptive dimensions rather than using single-word descriptions.

Dimensions to include:

- Genre/Style: pop, rock, jazz, electronic, lo-fi, synthwave, orchestral
- Emotion/Mood: melancholic, euphoric, dreamy, nostalgic, intimate, tense
- Instruments: acoustic guitar, synth pads, 808 drums, strings, brass, piano
- Timbre: warm, crisp, airy, punchy, lush, polished, raw
- Era: "80s synth-pop", "modern indie", "classical romantic"
- Production: lo-fi, studio-polished, live recording, cinematic
- Vocal: breathy, powerful, falsetto, raspy, spoken word (or "instrumental")

Good: "Slow melancholic piano ballad with intimate female vocal, warm strings building to powerful chorus, studio-polished production"

Bad: "Sad song"
### Key Principles

- Specificity over vagueness — describe instruments, mood, production style
- Avoid contradictions — don't request "classical strings" and "hardcore metal" simultaneously
- Repetition reinforces priority — repeat important elements for emphasis
- Sparse captions = more creative freedom — detailed captions constrain the model
- Use metadata params for BPM/key — don't write "120 BPM" in the caption, use `--bpm 120`
### Lyrics Formatting

Structure tags (use in lyrics, not the caption):

```
[Intro]
[Verse]
[Chorus]
[Bridge]
[Outro]
[Instrumental]
[Guitar Solo]
[Build]
[Drop]
[Breakdown]
```

Vocal control (prefix lines or sections):

```
[raspy vocal]
[whispered]
[falsetto]
[powerful belting]
[harmonies]
[ad-lib]
```

Energy indicators:

- UPPERCASE = high intensity ("WE RISE ABOVE")
- Parentheses = background vocals ("We rise (together)")
- Keep 6-10 syllables per line within sections for natural rhythm
## Video Production Integration

### Music for Scene Types

| Scene | Preset | Duration | Notes |
|---|---|---|---|
| Title | `dramatic` or `ambient` | 3-5s | Short, mood-setting |
| Problem | `tension` | 10-15s | Dark, unsettling |
| Solution | `hopeful` | 10-15s | Relief, optimism |
| Demo | `lofi` or `corporate-bg` | 30-120s | Non-distracting, matches demo length |
| Stats | `upbeat-tech` | 8-12s | Building credibility |
| CTA | `cta` | 5-10s | Maximum energy, punchy |
| Credits | `ambient` | 5-10s | Gentle fade-out |
### Timing Workflow

1. Plan scene durations first (from the voiceover script)
2. Generate music to match: `--duration <scene_seconds>`
3. Music duration is precise (within 0.1s of requested)
4. For background music spanning multiple scenes, generate one long track
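Once scene durations are planned, converting them into timeline offsets is mechanical. A sketch, assuming a 30fps timeline (`scene_offsets` is a hypothetical planning helper, not part of the tool):

```python
def scene_offsets(durations_s: list[float], fps: int = 30) -> list[tuple[int, int]]:
    """Convert per-scene durations (seconds) into cumulative
    (start_frame, n_frames) pairs for sequencing clips on a
    frame-based timeline such as Remotion's."""
    offsets = []
    start = 0
    for d in durations_s:
        frames = round(d * fps)
        offsets.append((start, frames))
        start += frames
    return offsets
```

For a 5s title, 12s problem, and 12s solution at 30fps, `scene_offsets([5, 12, 12])` gives `[(0, 150), (150, 360), (510, 360)]`.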
### Combining with Voiceover

Background music should be mixed at 10-20% volume in Remotion:

```tsx
<Audio src={staticFile('voiceover.mp3')} volume={1} />
<Audio src={staticFile('bg-music.mp3')} volume={0.15} />
```

For music under narration, use instrumental presets (corporate-bg, ambient, lofi). For music-forward scenes (title, CTA), you can use a higher volume or vocal tracks.
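If you're assembling the final audio outside Remotion, the same 15% mix can be done with FFmpeg's `volume` and `amix` filters. A sketch (`ffmpeg_mix_command` is a hypothetical helper that only builds the argv; `duration=first` trims the music to the voiceover's length, and `normalize=0` keeps the voiceover at full level):

```python
def ffmpeg_mix_command(voiceover: str, music: str, out: str,
                       music_volume: float = 0.15) -> list[str]:
    """Build an ffmpeg argv that ducks background music under
    narration: attenuate input 1, then mix it with input 0."""
    filter_graph = (
        f"[1:a]volume={music_volume}[bg];"
        f"[0:a][bg]amix=inputs=2:duration=first:normalize=0[out]"
    )
    return [
        "ffmpeg", "-y", "-i", voiceover, "-i", music,
        "-filter_complex", filter_graph,
        "-map", "[out]", out,
    ]
```

Pass the result to `subprocess.run` once both files exist.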
### Brand Consistency

- Use `--brand <name>` to load hints from `brands/<name>/brand.json`.
- Use `--cover --reference brand_theme.mp3` to create variations of a brand's sonic identity.
- For a consistent sound across a project, fix the seed (`--seed 42`) and vary only duration/prompt.
## Advanced Parameters

| Flag | Default | Description |
|---|---|---|
| `--thinking` | on (acemusic) | 5Hz LM enriches prompts and generates audio codes |
| `--no-thinking` | - | Faster generation, skip LM reasoning |
| `--variations N` | 1 | Generate N variations (1-8, acemusic only) |
| `--guidance-scale` | 7.0 | Prompt adherence (1.0-15.0) |
| `--infer-method` | ode | `ode` (deterministic) or `sde` (stochastic, more variety) |
| `--seed` | random | Lock randomness for reproducibility |
## Technical Details

- acemusic cloud: XL Turbo 4B DiT + 4B LM, best quality, ~5-15s per generation
- Modal/RunPod: Standard Turbo 2B DiT, no LM, ~2-3s per generation
- Output: 48kHz MP3/WAV/FLAC
- Duration range: 10-600 seconds
- BPM range: 30-300
## When NOT to use ACE-Step

- Voice cloning — use Qwen3-TTS or ElevenLabs instead
- Sound effects — use ElevenLabs SFX (`tools/sfx.py`)
- Speech/narration — use voiceover tools, not music gen
- Stem extraction from video — extract audio first with FFmpeg, then use `--extract`