acestep-songwriting
ACE-Step Songwriting Guide
Professional music creation knowledge for writing captions, lyrics, and choosing music parameters for ACE-Step.
Output Format
After using this guide, produce the following for the acestep skill:
- Caption (`-c`): Style/genre/instruments/emotion description
- Lyrics (`-l`): Complete structured lyrics with tags
- Parameters: `--duration`, `--bpm`, `--key`, `--time-signature`, `--language`
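As a rough illustration, the three outputs can be collected into a flag list. This is only a sketch: the flag names come from the list above, but `build_args` is a hypothetical helper, and how the acestep skill is actually invoked (and whether each flag takes its value in this form) is an assumption.

```python
def build_args(caption, lyrics, duration=None, bpm=None, key=None,
               time_signature=None, language=None):
    """Return a flag list; optional parameters left as None are omitted."""
    args = ["-c", caption, "-l", lyrics]
    optional = {
        "--duration": duration,
        "--bpm": bpm,
        "--key": key,
        "--time-signature": time_signature,
        "--language": language,
    }
    for flag, value in optional.items():
        if value is not None:
            # Assumption: every flag accepts its value as a plain string.
            args += [flag, str(value)]
    return args


args = build_args(
    caption="sad piano ballad with female breathy vocal",
    lyrics="[Verse]\nplaceholder lyric line",  # placeholder text only
    duration=180,
    bpm=72,
)
```

Leaving a parameter unset keeps it out of the list entirely, which matches the advice later in this guide to let most metadata be inferred automatically.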
Caption: The Most Important Input
The Caption is the most important factor affecting the generated music.
It supports several formats: simple style words, comma-separated tags, and complex natural-language descriptions.
Common Dimensions
| Dimension | Examples |
|---|---|
| Style/Genre | pop, rock, jazz, electronic, hip-hop, R&B, folk, classical, lo-fi, synthwave |
| Emotion/Atmosphere | melancholic, uplifting, energetic, dreamy, dark, nostalgic, euphoric, intimate |
| Instruments | acoustic guitar, piano, synth pads, 808 drums, strings, brass, electric bass |
| Timbre Texture | warm, bright, crisp, muddy, airy, punchy, lush, raw, polished |
| Era Reference | 80s synth-pop, 90s grunge, 2010s EDM, vintage soul, modern trap |
| Production Style | lo-fi, high-fidelity, live recording, studio-polished, bedroom pop |
| Vocal Characteristics | female vocal, male vocal, breathy, powerful, falsetto, raspy, choir |
| Speed/Rhythm | slow tempo, mid-tempo, fast-paced, groovy, driving, laid-back |
| Structure Hints | building intro, catchy chorus, dramatic bridge, fade-out ending |
Caption Writing Principles
- Specific beats vague — "sad piano ballad with female breathy vocal" > "a sad song"
- Combine multiple dimensions — style+emotion+instruments+timbre anchors direction precisely
- Use references well — "in the style of 80s synthwave" conveys complex aesthetic quickly
- Texture words are useful — warm, crisp, airy, punchy influence mixing and timbre
- Don't pursue perfection — Caption is a starting point, iterate based on results
- Granularity determines freedom — Less detail = more model creativity; more detail = more control
- Avoid conflicting words — "classical strings" + "hardcore metal" degrades output
  - Fix (repetition reinforcement): repeat the elements you want more of
  - Fix (conflict to evolution): describe a progression instead, e.g. "Start with soft strings, middle becomes metal rock, end turns to hip-hop"
- Don't put BPM/key/tempo in Caption — Use dedicated parameters instead
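Two of these principles lend themselves to a small helper: combining several dimensions into one caption, and catching tempo or key values that belong in dedicated parameters. The sketch below is illustrative only; `build_caption` and `find_metadata_leaks` are hypothetical names, and the regular expressions are simple heuristics that can produce false positives (for example "A major" at the start of a phrase).

```python
import re

# Heuristic patterns for values that belong in dedicated parameters.
METADATA_PATTERNS = [
    re.compile(r"\b\d{2,3}\s*bpm\b", re.IGNORECASE),             # e.g. "72 BPM"
    re.compile(r"\b[A-G][#b]?\s+(?:[Mm]ajor|[Mm]inor)\b"),       # e.g. "C minor"
]


def build_caption(**dimensions):
    """Join non-empty dimension values into a comma-separated caption."""
    return ", ".join(value for value in dimensions.values() if value)


def find_metadata_leaks(caption):
    """Return substrings that look like BPM or key information."""
    return [m.group(0) for p in METADATA_PATTERNS for m in p.finditer(caption)]


caption = build_caption(
    style="synthwave",
    emotion="nostalgic",
    instruments="synth pads, 808 drums",
    timbre="warm",
)
leaks = find_metadata_leaks("sad piano ballad, 72 BPM, C minor")
```

Descriptive speed words such as "slow tempo" are deliberately not flagged, since they appear in the Speed/Rhythm dimension above; only numeric tempo and explicit keys are treated as leaks.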
Lyrics: The Temporal Script
Lyrics controls how music unfolds over time. It carries:
- Lyric text itself
- Structure tags ([Verse], [Chorus], [Bridge]...)
- Vocal style hints ([raspy vocal], [whispered]...)
- Instrumental sections ([guitar solo], [drum break]...)
- Energy changes ([building energy], [explosive drop]...)
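To check a script before generation, it can help to split the lyrics into tagged sections. The sketch below assumes each tag sits alone on its own line, which this guide implies but does not state outright; `split_sections` is a hypothetical helper and the verse lines are placeholders.

```python
import re

# A tag line is a single bracketed label with nothing else on the line.
TAG_LINE = re.compile(r"^\[([^\[\]]+)\]$")


def split_sections(lyrics):
    """Return (tag, lines) pairs in order of appearance."""
    sections = []
    for raw in lyrics.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines only separate sections
        match = TAG_LINE.match(line)
        if match:
            sections.append((match.group(1), []))
        elif sections:
            sections[-1][1].append(line)
        else:
            # Text before the first tag has no section to belong to.
            sections.append(("untagged", [line]))
    return sections


sample = """[Intro]

[Verse 1]
Walking through the empty streets tonight
Streetlights humming over me

[Guitar Solo]

[Chorus]
We rise together (together)
"""
sections = split_sections(sample)
```

Sections with an empty line list, such as `[Intro]` and `[Guitar Solo]` here, are the instrumental passages; sections with text carry the sung lyrics.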
Structure Tags
| Category | Tag | Description |
|---|---|---|
| Basic Structure | `[Intro]` | Opening, establish atmosphere |
| | `[Verse]` / `[Verse 1]` | Verse, narrative progression |
| | `[Pre-Chorus]` | Pre-chorus, build energy |
| | `[Chorus]` | Chorus, emotional climax |
| | `[Bridge]` | Bridge, transition or elevation |
| | `[Outro]` | Ending, conclusion |
| Dynamic Sections | `[Build]` | Energy gradually rising |
| | `[Drop]` | Electronic music energy release |
| | `[Breakdown]` | Reduced instrumentation, space |
| Instrumental | `[Instrumental]` | Pure instrumental, no vocals |
| | `[Guitar Solo]` | Guitar solo |
| | `[Piano Interlude]` | Piano interlude |
| Special | `[Fade Out]` | Fade out ending |
| | `[Silence]` | Silence |
Combining Tags
Use ` - ` inside a tag to add a modifier for finer control, but keep it concise:
✅ [Chorus - anthemic]
❌ [Chorus - anthemic - stacked harmonies - high energy - powerful - epic]
Put complex style descriptions in Caption, not in tags.
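A quick check for overloaded tags might look like the following. The limit of one modifier per tag is an assumption made for this sketch (the guide only says to keep tags concise), and `tag_modifiers` and `is_concise` are hypothetical names.

```python
def tag_modifiers(tag):
    """Split '[Chorus - anthemic]' into ('Chorus', ['anthemic'])."""
    inner = tag.strip()[1:-1]  # drop the surrounding brackets
    # Split on ' - ' with spaces so hyphenated names like Pre-Chorus survive.
    name, *modifiers = [part.strip() for part in inner.split(" - ")]
    return name, modifiers


def is_concise(tag, max_modifiers=1):
    """True when the tag carries no more than max_modifiers modifiers."""
    return len(tag_modifiers(tag)[1]) <= max_modifiers


good = "[Chorus - anthemic]"
bad = "[Chorus - anthemic - stacked harmonies - high energy - powerful - epic]"
```

Modifiers that fail the check are candidates for moving into the Caption.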
Caption-Lyrics Consistency
The model is not good at resolving conflicts between Caption and Lyrics, so check that these pairs agree:
- Instruments in Caption ↔ Instrumental section tags in Lyrics
- Emotion in Caption ↔ Energy tags in Lyrics
- Vocal description in Caption ↔ Vocal control tags in Lyrics
Vocal Control Tags
| Tag | Effect |
|---|---|
| `[raspy vocal]` | Raspy, textured vocals |
| `[whispered]` | Whispered |
| `[falsetto]` | Falsetto |
| `[powerful belting]` | Powerful, high-pitched singing |
| `[spoken word]` | Rap/recitation |
| `[harmonies]` | Layered harmonies |
| `[call and response]` | Call and response |
| `[ad-lib]` | Improvised embellishments |
Energy and Emotion Tags
| Tag | Effect |
|---|---|
| `[high energy]` | High energy, passionate |
| `[low energy]` | Low energy, restrained |
| `[building energy]` | Increasing energy |
| `[explosive]` | Explosive energy |
| `[melancholic]` | Melancholic |
| `[euphoric]` | Euphoric |
| `[dreamy]` | Dreamy |
| `[aggressive]` | Aggressive |
Lyric Writing Tips
- 6-10 syllables per line: the model aligns syllables to beats; keep similar counts for lines in the same position (±1-2)
- Uppercase = stronger intensity: `WE ARE THE CHAMPIONS!` (shouting) vs `walking through the streets` (normal)
- Parentheses = background vocals: `We rise together (together)`
- Extend vowels: `Feeeling so aliiive` (use cautiously, effects unstable)
- Clear section separation: blank lines between sections
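The 6-10 syllable guideline can be checked roughly in code. The counter below is a crude heuristic that counts vowel groups per word; it is not a real syllabifier, it will miscount some English words, and both function names are hypothetical. The test lyrics are placeholders.

```python
import re


def count_syllables(line):
    """Rough estimate: count runs of consecutive vowels in each word."""
    total = 0
    for word in re.findall(r"[a-z']+", line.lower()):
        count = len(re.findall(r"[aeiouy]+", word))
        # Treat a trailing 'e' as silent ("rise"), except in "-le" endings.
        if word.endswith("e") and not word.endswith("le") and count > 1:
            count -= 1
        total += max(count, 1)  # every word contributes at least one syllable
    return total


def lines_outside_range(lyrics, low=6, high=10):
    """Return (line, estimate) pairs for lyric lines outside low..high."""
    flagged = []
    for raw in lyrics.splitlines():
        line = raw.strip()
        if not line or line.startswith("["):
            continue  # skip blank lines and tag lines
        estimate = count_syllables(line)
        if not low <= estimate <= high:
            flagged.append((line, estimate))
    return flagged


flagged = lines_outside_range(
    "[Verse]\nWalking through the empty streets tonight\nHold on\n"
)
```

Flagged lines are prompts for a human read-through, not errors: a short line can be intentional, and the estimate itself may be off by a syllable or two.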
Avoiding "AI-flavored" Lyrics
| Red Flag | Description |
|---|---|
| Adjective stacking | "neon skies, electric hearts, endless dreams" — vague imagery filler |
| Rhyme chaos | Inconsistent patterns or forced rhymes breaking meaning |
| Blurred boundaries | Lyric content crosses structure tags |
| No breathing room | Lines too long to sing in one breath |
| Mixed metaphors | Water → fire → flying — listeners can't anchor |
Metaphor discipline: One core metaphor per song, explore its multiple aspects.
Music Metadata
Most of the time, let the LM infer these values automatically. Set them manually only when you have a clear requirement.
| Parameter | Range | Description |
|---|---|---|
| `bpm` | 30–300 | Slow 60–80, mid 90–120, fast 130–180 |
| `keyscale` | Key | e.g. C Major, Am. Common keys (C, G, D, Am, Em) most stable |
| `timesignature` | Time sig | 4/4 (most common), 3/4 (waltz), 6/8 (swing) |
| `vocal_language` | Language | Usually auto-detected from lyrics |
| `duration` | Seconds | See duration calculation below |
When to Set Manually
| Scenario | Set |
|---|---|
| Daily generation | Let LM auto-infer |
| Clear tempo requirement | bpm |
| Specific style (waltz) | timesignature=3/4 |
| Match other material | bpm + duration |
| Specific key color | keyscale |
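The "set only what you need" approach can be sketched as a small helper that keeps explicit overrides and drops everything else. The parameter names and BPM bands come from the tables above; `tempo_band` and `manual_metadata` are hypothetical helpers, and how these values are passed to ACE-Step is not shown here.

```python
ALLOWED = {"bpm", "keyscale", "timesignature", "vocal_language", "duration"}


def tempo_band(bpm):
    """Classify a BPM using the slow/mid/fast bands from the table."""
    if not 30 <= bpm <= 300:
        raise ValueError(f"bpm {bpm} is outside the supported 30-300 range")
    if 60 <= bpm <= 80:
        return "slow"
    if 90 <= bpm <= 120:
        return "mid"
    if 130 <= bpm <= 180:
        return "fast"
    return "between bands"  # the table leaves gaps between its bands


def manual_metadata(**overrides):
    """Keep only explicitly set parameters; the rest is left to auto-inference."""
    unknown = set(overrides) - ALLOWED
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {name: value for name, value in overrides.items() if value is not None}


waltz = manual_metadata(timesignature="3/4", bpm=None)
```

For a waltz, only the time signature is pinned; tempo, key, language, and duration stay unset so they can be inferred.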
Duration Calculation
Estimation Method
- Intro/Outro: 5-10 seconds each
- Instrumental sections: 5-15 seconds each
- Typical structures:
  - 2 verses + 2 choruses: 120-150s minimum
  - 2 verses + 2 choruses + bridge: 180-240s minimum
  - Full song with intro/outro: 210-270s (3.5-4.5 min)
BPM and Duration Relationship
- Slower BPM (60-80): Need MORE duration for same lyrics
- Medium BPM (100-130): Standard duration
- Faster BPM (150-180): Can fit more lyrics, but still need breathing room
Rule of thumb: When in doubt, estimate longer. A song too short feels rushed.
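The BPM relationship follows from simple arithmetic: seconds = bars × beats per bar × 60 / BPM. The sketch below applies it to a song layout; the bar counts per section are assumptions chosen for illustration (the guide gives seconds, not bars), and both helper names are hypothetical.

```python
import math


def estimate_seconds(bars_per_section, bpm, beats_per_bar=4):
    """Total length in seconds for the given section bar counts."""
    total_beats = sum(bars_per_section) * beats_per_bar
    return total_beats * 60 / bpm


def round_up(seconds, step=10):
    """Round up to the next multiple of step, erring on the long side."""
    return math.ceil(seconds / step) * step


# Assumed layout: intro 4, verse 16, chorus 16, verse 16, chorus 16, outro 4 bars.
layout = [4, 16, 16, 16, 16, 4]

at_120 = estimate_seconds(layout, bpm=120)  # 72 bars * 4 beats * 0.5 s
at_72 = estimate_seconds(layout, bpm=72)
padded = round_up(at_120)
```

With this layout the same lyrics take 144 seconds at 120 BPM but 240 seconds at 72 BPM, which is why slower songs need a longer `duration`. Rounding up reflects the rule of thumb to estimate longer when in doubt.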
Note: Keep Lyrics tags consistent with the Caption. For example, tags such as piano, powerful, and whispered fit a Caption like "piano ballad, building to powerful chorus, intimate".