# ACE-Step 1.5 Music Generation

Open-source music generation via `tools/music_gen.py`.
Cloud providers:

- **acemusic** (default) — Official ACE-Step cloud API with the XL Turbo (4B) model and 5Hz LM thinking mode. Free API key from acemusic.ai/api-key. No GPU required.
- **modal** — Self-hosted ACE-Step 2B Turbo on Modal. Requires `MODAL_MUSIC_GEN_ENDPOINT_URL`.
- **runpod** — Self-hosted ACE-Step 2B Turbo on RunPod. Requires `RUNPOD_ACESTEP_ENDPOINT_ID`.
## Setup

```bash
# acemusic (recommended — free, best quality, no GPU)
echo "ACEMUSIC_API_KEY=your_key" >> .env
# Get key at https://acemusic.ai/api-key

# Self-hosted (optional fallback)
python tools/music_gen.py --setup           # RunPod
modal deploy docker/modal-music-gen/app.py  # Modal
```
## Quick Reference

```bash
# Basic generation (uses acemusic XL Turbo by default)
python tools/music_gen.py --prompt "Upbeat tech corporate" --duration 60 --output bg.mp3

# Generate 4 variations, pick the best
python tools/music_gen.py --prompt "Calm ambient piano" --duration 30 --variations 4 --output ambient.mp3

# Fast mode (disable thinking)
python tools/music_gen.py --no-thinking --prompt "Quick draft" --duration 30 --output draft.mp3

# With musical control
python tools/music_gen.py --prompt "Calm ambient piano" --duration 30 --bpm 72 --key "D Major" --output ambient.mp3

# Scene presets (video production)
python tools/music_gen.py --preset corporate-bg --duration 60 --output bg.mp3
python tools/music_gen.py --preset tension --duration 20 --output problem.mp3
python tools/music_gen.py --preset cta --brand digital-samba --duration 15 --output cta.mp3

# Vocals with lyrics ($'...' so bash expands \n into real newlines)
python tools/music_gen.py --prompt "Indie pop jingle" --lyrics $'[verse]\nBuild it better\nShip it faster' --duration 30 --output jingle.mp3

# Cover / style transfer
python tools/music_gen.py --cover --reference theme.mp3 --prompt "Jazz piano version" --duration 60 --output jazz_cover.mp3

# Repaint a weak section
python tools/music_gen.py --repaint --input track.mp3 --repaint-start 15 --repaint-end 25 --prompt "Guitar solo" --output fixed.mp3

# Continue from existing audio
python tools/music_gen.py --continuation --input track.mp3 --prompt "Continue with jazz piano" --output extended.mp3

# Stem extraction
python tools/music_gen.py --extract vocals --input mixed.mp3 --output vocals.mp3

# Fall back to self-hosted
python tools/music_gen.py --cloud modal --prompt "Background music" --duration 60 --output bg.mp3
```
## Fixing "Samey" Output

If generated music sounds repetitive or lacks variety, try these in order:

- Use acemusic cloud (default) — the XL Turbo 4B model is significantly more capable than the 2B model on Modal/RunPod
- Keep thinking mode on (default for acemusic) — the 5Hz LM enriches sparse prompts into detailed musical descriptions
- Generate variations — `--variations 4` generates 4 takes; pick the best
- Use stochastic inference — `--infer-method sde` adds randomness (the same seed gives different results)
- Vary BPM and key across scenes — don't use the same preset for every scene
- Write sparser prompts — "Upbeat indie rock" gives the model more creative freedom than a hyper-detailed description
- Vary seeds — omit `--seed` to let each generation be unique
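The seed and `sde` tips above can be combined in a small driver script that emits several distinct takes. A sketch, assuming you want plain CLI strings to run or log (the `variation_commands` helper and `take_N.mp3` naming are illustrative, not part of the tool):

```python
import random
import shlex

def variation_commands(prompt: str, duration: int, n: int = 4) -> list[str]:
    """Build n music_gen.py invocations with distinct seeds and
    stochastic (sde) inference, so each take differs."""
    seeds = random.sample(range(1, 1_000_000), n)  # guaranteed distinct
    cmds = []
    for i, seed in enumerate(seeds, start=1):
        cmds.append(
            "python tools/music_gen.py "
            f"--prompt {shlex.quote(prompt)} "
            f"--duration {duration} --infer-method sde "
            f"--seed {seed} --output take_{i}.mp3"
        )
    return cmds

for cmd in variation_commands("Upbeat indie rock", 30):
    print(cmd)
```

`shlex.quote` keeps multi-word prompts safe if you pipe these commands into a shell.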
## Creating a Song (Step by Step)

### 1. Instrumental background track (simplest)

```bash
python tools/music_gen.py --prompt "Upbeat indie rock, driving drums, jangly guitar" --duration 60 --bpm 120 --key "G Major" --output track.mp3
```
### 2. Song with vocals and lyrics

Write lyrics in a temp file or pass them inline. Use structure tags to control song sections.

```bash
# Write lyrics to a file first (recommended for longer songs)
cat > /tmp/lyrics.txt << 'LYRICS'
[Verse 1]
Walking through the morning light
Coffee in my hand feels right
Another day to build and dream
Nothing's ever what it seems

[Chorus - anthemic]
WE KEEP MOVING FORWARD
Through the noise and doubt
We keep moving forward
That's what it's about

[Verse 2]
Screens are glowing late at night
Shipping code until it's right
The deadline's close but so are we
Almost there, just wait and see

[Chorus - bigger]
WE KEEP MOVING FORWARD
Through the noise and doubt
We keep moving forward
That's what it's about

[Outro - fade]
(Moving forward...)
LYRICS

# Generate the song
python tools/music_gen.py \
  --prompt "Upbeat indie rock anthem, male vocal, driving drums, electric guitar, studio polish" \
  --lyrics "$(cat /tmp/lyrics.txt)" \
  --duration 60 \
  --bpm 128 \
  --key "G Major" \
  --output my_song.mp3
```
### 3. Repaint a weak section

If the chorus sounds weak, regenerate just that section:

```bash
python tools/music_gen.py --repaint --input my_song.mp3 --repaint-start 20 --repaint-end 35 --prompt "Powerful anthemic chorus, big drums" --output fixed.mp3
```
### 4. Continue/extend a track

```bash
python tools/music_gen.py --continuation --input my_song.mp3 --prompt "Continue with gentle acoustic outro" --output extended.mp3
```
### Key tips for good results

- Caption = overall style (genre, instruments, mood, production quality)
- Lyrics = temporal structure (verse/chorus flow, vocal delivery)
- UPPERCASE in lyrics = high vocal intensity
- Parentheses = background vocals: "We rise (together)"
- Keep 6-10 syllables per line for natural rhythm
- Don't describe the melody in the caption — describe the sound and feeling
- Use `--seed` to lock randomness when iterating on prompt/lyrics
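The 6-10 syllable guideline is easy to check mechanically before generating. A rough sketch (vowel-group counting is only approximate, and `estimate_syllables`/`flag_lines` are hypothetical helpers, not part of the tool):

```python
import re

def estimate_syllables(line: str) -> int:
    """Rough syllable count: one syllable per vowel group, with a
    crude silent-'e' adjustment. Good enough to flag lines far
    outside the 6-10 syllable sweet spot."""
    count = 0
    for word in re.findall(r"[a-z']+", line.lower()):
        groups = len(re.findall(r"[aeiouy]+", word))
        # drop a trailing silent 'e' ("dream"), but keep "-le"/"-ee" endings
        if word.endswith("e") and groups > 1 and not word.endswith(("le", "ee")):
            groups -= 1
        count += max(groups, 1)
    return count

def flag_lines(lyrics: str) -> list[tuple[str, int]]:
    """Return (line, syllables) for lyric lines outside 6-10,
    skipping [section] tags and blank lines."""
    flagged = []
    for line in lyrics.splitlines():
        line = line.strip()
        if not line or line.startswith("["):
            continue
        n = estimate_syllables(line)
        if not 6 <= n <= 10:
            flagged.append((line, n))
    return flagged
```

Run it over a lyrics file before generating and rewrite only the flagged lines.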
### Controlling vocal gender

The model doesn't reliably follow "female vocal" or "male vocal" on its own. Use both of these together:

- In the prompt: be explicit — "solo female singer, alto voice" or "female vocalist only, breathy intimate voice". Adding an artist reference helps (e.g., "Brandi Carlile style").
- In the lyrics: add `[female vocal]` tags before each section:

```
[female vocal]
[Verse 1]
Walking through the morning light...

[female vocal]
[Chorus - anthemic]
WE KEEP MOVING FORWARD...
```

Just saying "female vocal" in the prompt alone is often ignored. The combination of prompt + lyrics tags is what works.
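Tagging every section by hand gets tedious for long lyrics. A sketch of a pre-processing helper (`tag_vocal` is hypothetical, not part of the tool; it assumes section headers are the only lines starting with `[`):

```python
def tag_vocal(lyrics: str, tag: str = "[female vocal]") -> str:
    """Insert a vocal tag before every [Section] header so the
    gender hint is repeated per section. Headers that already
    mention 'vocal' are left alone."""
    out = []
    for line in lyrics.splitlines():
        if line.strip().startswith("[") and "vocal" not in line.lower():
            out.append(tag)
        out.append(line)
    return "\n".join(out)
```

Feed the result to `--lyrics` as usual.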
### Duets and vocal trading

For duets with male/female vocals trading verses, use both the prompt and per-section lyrics tags:

- Prompt: "duet, male and female vocals trading verses, warm harmonies on chorus"
- Lyrics: tag each section with who sings it:

```
[Verse 1 - male vocal, storytelling]
First verse lyrics here...

[Chorus - male and female duet, harmonies]
Chorus lyrics here...

[Verse 2 - female vocal, wry]
Second verse lyrics here...

[Bridge - male vocal, spoken]
Spoken bridge...

[Bridge - female vocal, sung]
Sung response...
```

This reliably produces vocal trading between sections and harmonies on shared parts.
## Scene Presets

| Preset | BPM | Key | Use Case |
|---|---|---|---|
| `corporate-bg` | 110 | C Major | Professional background, presentations |
| `upbeat-tech` | 128 | G Major | Product launches, tech demos |
| `ambient` | 72 | D Major | Overview slides, reflective content |
| `dramatic` | 90 | D Minor | Reveals, announcements |
| `tension` | 85 | A Minor | Problem statements, challenges |
| `hopeful` | 120 | C Major | Solution reveals, resolutions |
| `cta` | 135 | E Major | Call to action, closing energy |
| `lofi` | 85 | F Major | Screen recordings, coding demos |
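When scripting a whole video's soundtrack, the preset table can drive one CLI call per scene. A sketch (the `PRESETS` dict merely mirrors the table above for validation; the authoritative values live inside `tools/music_gen.py`, and `scene_plan_commands` is a hypothetical helper):

```python
# Illustrative copy of the preset table: preset -> (bpm, key)
PRESETS = {
    "corporate-bg": (110, "C Major"),
    "upbeat-tech": (128, "G Major"),
    "ambient": (72, "D Major"),
    "dramatic": (90, "D Minor"),
    "tension": (85, "A Minor"),
    "hopeful": (120, "C Major"),
    "cta": (135, "E Major"),
    "lofi": (85, "F Major"),
}

def scene_plan_commands(plan: list[tuple[str, str, int]]) -> list[str]:
    """Turn a (scene_name, preset, seconds) plan into one CLI call
    per scene, rejecting typo'd preset names early."""
    cmds = []
    for scene, preset, seconds in plan:
        if preset not in PRESETS:
            raise ValueError(f"unknown preset: {preset}")
        cmds.append(
            f"python tools/music_gen.py --preset {preset} "
            f"--duration {seconds} --output {scene}.mp3"
        )
    return cmds
```

Example: `scene_plan_commands([("intro", "dramatic", 5), ("demo", "lofi", 60)])` yields two ready-to-run commands.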
## Task Types

### text2music (default)

Generate music from a text prompt plus optional lyrics.

### cover

Style transfer from reference audio. Control the blend with `--cover-strength` (0.0-1.0):

- 0.2 — Loose style inspiration (more creative freedom)
- 0.5 — Balanced style transfer
- 0.7 — Close to original structure (default)
- 1.0 — Maximum fidelity to source

### extract

Stem separation — isolate individual tracks from mixed audio.

Tracks: vocals, drums, bass, guitar, piano, keyboard, strings, brass, woodwinds, other

### repainting (acemusic only)

Regenerate a specific time segment within existing audio while preserving the rest.

```bash
python tools/music_gen.py --repaint --input track.mp3 --repaint-start 15 --repaint-end 25 --prompt "Guitar solo" --output fixed.mp3
```

### continuation (acemusic only)

Extend existing audio by continuing from where it ends.

```bash
python tools/music_gen.py --continuation --input track.mp3 --prompt "Continue with jazz piano" --output extended.mp3
```
## Prompt Engineering

### Caption Writing — Layer Dimensions

Write captions by layering multiple descriptive dimensions rather than using single-word descriptions.

Dimensions to include:

- Genre/Style: pop, rock, jazz, electronic, lo-fi, synthwave, orchestral
- Emotion/Mood: melancholic, euphoric, dreamy, nostalgic, intimate, tense
- Instruments: acoustic guitar, synth pads, 808 drums, strings, brass, piano
- Timbre: warm, crisp, airy, punchy, lush, polished, raw
- Era: "80s synth-pop", "modern indie", "classical romantic"
- Production: lo-fi, studio-polished, live recording, cinematic
- Vocal: breathy, powerful, falsetto, raspy, spoken word (or "instrumental")

Good: "Slow melancholic piano ballad with intimate female vocal, warm strings building to powerful chorus, studio-polished production"

Bad: "Sad song"
### Key Principles

- Specificity over vagueness — describe instruments, mood, production style
- Avoid contradictions — don't request "classical strings" and "hardcore metal" simultaneously
- Repetition reinforces priority — repeat important elements for emphasis
- Sparse captions = more creative freedom — detailed captions constrain the model
- Use metadata params for BPM/key — don't write "120 BPM" in the caption, use `--bpm 120`
### Lyrics Formatting

Structure tags (use in lyrics, not the caption):

```
[Intro]
[Verse]
[Chorus]
[Bridge]
[Outro]
[Instrumental]
[Guitar Solo]
[Build]
[Drop]
[Breakdown]
```

Vocal control (prefix lines or sections):

```
[raspy vocal]
[whispered]
[falsetto]
[powerful belting]
[harmonies]
[ad-lib]
```

Energy indicators:

- UPPERCASE = high intensity ("WE RISE ABOVE")
- Parentheses = background vocals ("We rise (together)")
- Keep 6-10 syllables per line within sections for natural rhythm
## Video Production Integration

### Music for Scene Types

| Scene | Preset | Duration | Notes |
|---|---|---|---|
| Title | `dramatic` or `ambient` | 3-5s | Short, mood-setting |
| Problem | `tension` | 10-15s | Dark, unsettling |
| Solution | `hopeful` | 10-15s | Relief, optimism |
| Demo | `lofi` or `corporate-bg` | 30-120s | Non-distracting, matches demo length |
| Stats | `upbeat-tech` | 8-12s | Building credibility |
| CTA | `cta` | 5-10s | Maximum energy, punchy |
| Credits | `ambient` | 5-10s | Gentle fade-out |
### Timing Workflow

1. Plan scene durations first (from the voiceover script)
2. Generate music to match: `--duration <scene_seconds>`
3. Music duration is precise (within 0.1s of requested)
4. For background music spanning multiple scenes, generate one long track
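Once scene durations are planned, converting them into timeline offsets is mechanical. A sketch, assuming a 30fps timeline (`scene_offsets` is a hypothetical planning helper, not part of the tool):

```python
def scene_offsets(durations_s: list[float], fps: int = 30) -> list[tuple[int, int]]:
    """Convert per-scene durations (seconds) into cumulative
    (start_frame, n_frames) pairs for sequencing clips on a
    frame-based timeline such as Remotion's."""
    offsets = []
    start = 0
    for d in durations_s:
        frames = round(d * fps)
        offsets.append((start, frames))
        start += frames
    return offsets
```

For a 5s title, 12s problem, and 12s solution at 30fps, `scene_offsets([5, 12, 12])` gives `[(0, 150), (150, 360), (510, 360)]`.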
### Combining with Voiceover

Background music should be mixed at 10-20% volume in Remotion:

```tsx
<Audio src={staticFile('voiceover.mp3')} volume={1} />
<Audio src={staticFile('bg-music.mp3')} volume={0.15} />
```

For music under narration, use instrumental presets (corporate-bg, ambient, lofi). For music-forward scenes (title, CTA), you can use a higher volume or vocal tracks.
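If you're assembling the final audio outside Remotion, the same 15% mix can be done with FFmpeg's `volume` and `amix` filters. A sketch (`ffmpeg_mix_command` is a hypothetical helper that only builds the argv; `duration=first` trims the music to the voiceover's length, and `normalize=0` keeps the voiceover at full level):

```python
def ffmpeg_mix_command(voiceover: str, music: str, out: str,
                       music_volume: float = 0.15) -> list[str]:
    """Build an ffmpeg argv that ducks background music under
    narration: attenuate input 1, then mix it with input 0."""
    filter_graph = (
        f"[1:a]volume={music_volume}[bg];"
        f"[0:a][bg]amix=inputs=2:duration=first:normalize=0[out]"
    )
    return [
        "ffmpeg", "-y", "-i", voiceover, "-i", music,
        "-filter_complex", filter_graph,
        "-map", "[out]", out,
    ]
```

Pass the result to `subprocess.run` once both files exist.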
### Brand Consistency

- Use `--brand <name>` to load hints from `brands/<name>/brand.json`.
- Use `--cover --reference brand_theme.mp3` to create variations of a brand's sonic identity.
- For a consistent sound across a project, fix the seed (`--seed 42`) and vary only duration/prompt.
## Advanced Parameters

| Flag | Default | Description |
|---|---|---|
| `--thinking` | on (acemusic) | 5Hz LM enriches prompts and generates audio codes |
| `--no-thinking` | - | Faster generation, skip LM reasoning |
| `--variations N` | 1 | Generate N variations (1-8, acemusic only) |
| `--guidance-scale` | 7.0 | Prompt adherence (1.0-15.0) |
| `--infer-method` | ode | `ode` (deterministic) or `sde` (stochastic, more variety) |
| `--seed` | random | Lock randomness for reproducibility |
## Technical Details

- acemusic cloud: XL Turbo 4B DiT + 4B LM, best quality, ~5-15s per generation
- Modal/RunPod: Standard Turbo 2B DiT, no LM, ~2-3s per generation
- Output: 48kHz MP3/WAV/FLAC
- Duration range: 10-600 seconds
- BPM range: 30-300
## When NOT to use ACE-Step

- Voice cloning — use Qwen3-TTS or ElevenLabs instead
- Sound effects — use ElevenLabs SFX (`tools/sfx.py`)
- Speech/narration — use voiceover tools, not music gen
- Stem extraction from video — extract audio first with FFmpeg, then use `--extract`