Ace Step: Text-to-Music with Vocals, Lyrics & Style Tag Control on Models and API | RunComfy

acestep-ai/ace-step/text-to-audio

Generate songs up to 4 minutes from style tags and optional lyrics, with original vocals and high acoustic fidelity, accessible on RunComfy models and HTTP API.


Introduction To Ace Step

ACE Studio's Ace Step transforms text style tags and optional lyrics into complete songs up to 4 minutes long at $0.0002 per second, delivering coherent vocals, instrumentation, and high acoustic fidelity. By replacing manual scoring sessions, vocalist bookings, and multi-track production with tag-driven, prompt-controlled generation, Ace Step accelerates music ideation for media teams, game studios, content creators, and advertising producers. For developers, Ace Step on RunComfy is available both in the browser and via an HTTP API, so you don't need to host or scale the model yourself.
Ideal for: Music Demo Prototyping | Cinematic and Game Scoring | Short-Form Ad Music

ACE Studio / Ace Step#


Ace Step is a text-to-music generation model that turns comma-separated style tags and optional lyrics into full songs with vocals, instrumentation, and synchronized lyrics. The model is built for fast iteration, supporting durations from a few seconds up to 4 minutes (240 seconds).


Output format: Audio only / duration 5–240 seconds / stereo / provider-defined sample rate.


Highlights#


  • Tag-driven text-to-music: Compose tracks by listing genres, moods, and instruments instead of writing detailed paragraphs.
  • Vocal and lyric generation: Ace Step can write its own lyrics or sing user-provided lyrics with synchronized vocal lines.
  • High acoustic fidelity: Maintains dynamic balance, spatial quality, and instrument clarity across the full 4-minute range.
  • Flexible duration: Adjustable from short loops up to 240-second tracks for full-length cues.
  • Reproducibility: Set a fixed seed to recreate exact results when iterating on tags or lyrics.
  • API and browser parity: The same parameters work in the RunComfy model UI and via the HTTP API.

Parameters#


| Parameter | Required | Type | Default | Range / Options | Description |
|---|---|---|---|---|---|
| tags | Yes | string | — | Free text | Comma-separated list of genre, mood, and instrument tags. |
| lyrics | No | string | — | Free text or [inst] / [instrumental] | Vocal content; leave blank for AI-generated lyrics, use [inst] for instrumental. |
| duration | No | integer | 60 | 5 – 240 | Audio length in seconds. |
| seed | No | integer | -1 | -1 – 2147483647 | Random seed for reproducibility; -1 randomizes. |
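The constraints in the table above can be enforced client-side before sending a request. The helper below is an illustrative sketch, not part of any official RunComfy SDK; the function name and validation checks are our own, but the field names and ranges mirror the table.

```python
def build_acestep_payload(tags: str, lyrics: str = "", duration: int = 60, seed: int = -1) -> dict:
    """Assemble and validate a request body matching the parameter table.

    Illustrative client-side helper only -- check the RunComfy API Docs
    for the authoritative request schema.
    """
    if not tags or not tags.strip():
        raise ValueError("tags is required (comma-separated style tags)")
    if not 5 <= duration <= 240:
        raise ValueError("duration must be between 5 and 240 seconds")
    if not -1 <= seed <= 2147483647:
        raise ValueError("seed must be between -1 (random) and 2147483647")
    return {"tags": tags, "lyrics": lyrics, "duration": duration, "seed": seed}
```

Failing fast on an out-of-range duration or empty tag list avoids paying for a request the endpoint would reject anyway.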

Pricing#


Ace Step on RunComfy uses time-based billing for generated audio.


| Billing unit | Rate |
|---|---|
| Per second of generated audio | $0.0002 |

Estimated cost examples


| Duration | Approx. cost |
|---|---|
| 30 s | ~$0.006 |
| 60 s (default) | ~$0.012 |
| 120 s | ~$0.024 |
| 240 s (4 min) | ~$0.048 |
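Since billing is a flat per-second rate, the cost of a run is simply duration times rate. A minimal estimator, using the $0.0002/s figure from the table above:

```python
RATE_PER_SECOND = 0.0002  # USD per second of generated audio (from the pricing table)

def estimate_cost(duration_seconds: int) -> float:
    """Estimated charge in USD for a single Ace Step generation."""
    return round(duration_seconds * RATE_PER_SECOND, 6)
```

This reproduces the example table: a 60-second default render costs about $0.012 and a full 240-second track about $0.048.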

How to Use#


1) Open the Ace Step model in RunComfy and reveal the generation panel.

2) Enter style tags such as "lofi, hiphop, chill, mellow piano" to define genre, mood, and instrumentation.

3) Optionally add lyrics; keep verse and chorus sections clearly separated, or use [inst] for an instrumental.

4) Set duration in seconds (5–240); start short to test direction before committing to a full 4-minute render.

5) Lock the seed when you want to compare the impact of tag or lyric changes, or leave it at -1 for variety.

6) Run the generation, preview the result, and download the audio file from your job history.

7) For API use, send the same fields to the Ace Step endpoint on RunComfy; no self-hosting is required.

8) Save promising seeds and tag combinations as presets to keep your sonic direction consistent across a project.
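For step 7, an API call sends the same fields shown in the UI as a JSON body. The sketch below only builds the request; the URL is a deliberate placeholder and the Bearer-token header is an assumption — consult the RunComfy API Docs for the real endpoint, auth scheme, and response format before sending anything.

```python
import json

API_KEY = "YOUR_API_KEY"  # issued on the RunComfy API Keys page
# Placeholder URL -- the real base path is documented in the RunComfy API Docs.
URL = "https://example.invalid/acestep-ai/ace-step/text-to-audio"

payload = {
    "tags": "lofi, hiphop, chill, mellow piano",
    "lyrics": "[inst]",  # instrumental; replace with lyrics for vocals
    "duration": 30,      # start short to test direction (step 4)
    "seed": 42,          # fixed seed for reproducible comparisons (step 5)
}
headers = {
    "Authorization": f"Bearer {API_KEY}",  # assumed auth scheme
    "Content-Type": "application/json",
}
body = json.dumps(payload)

# To actually send (requires the third-party `requests` package):
# import requests
# resp = requests.post(URL, data=body, headers=headers, timeout=120)
```

Because the browser UI and the API share the same parameters, a configuration tuned in the panel can be copied into `payload` unchanged.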


Prompt & Reference Tips#


  • Use multiple complementary tags (e.g., "trap, dark, 808, brass") to lock genre, instrumentation, and energy in one pass.
  • Combine contrasting tags carefully (e.g., "chill, trap") to find unique blends without confusing the arrangement.
  • Provide structured lyrics with clear [Verse] and [Chorus] markers and consistent syllable counts for cleaner vocal phrasing.
  • Start with 30–45s drafts to validate the direction before requesting longer 120–240s renders.
  • Fix the seed when iterating; only change tags or lyrics so you can attribute differences to your edits.
  • Avoid contradictory cues such as "lo-fi yet ultra-hi-fi" — pick one dominant aesthetic per track in Ace Step.
  • For instrumentals, use [inst] or [instrumental] in the lyrics field instead of leaving it blank.
  • If durations sound rushed or padded, adjust the duration field rather than over-constraining the lyrics.
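Since tags are a flat comma-separated string, small inconsistencies (duplicate tags, stray whitespace, mixed case) can creep in when combining presets. A tiny normalization helper — our own convenience, not an Ace Step requirement — keeps tag lists tidy before submission:

```python
def normalize_tags(raw: str) -> str:
    """Lowercase, trim, and de-duplicate a comma-separated tag list,
    preserving the original order of first occurrence."""
    seen: list[str] = []
    for tag in raw.split(","):
        t = tag.strip().lower()
        if t and t not in seen:
            seen.append(t)
    return ", ".join(seen)
```

For example, `normalize_tags("Trap, dark, 808, trap , Brass")` yields `"trap, dark, 808, brass"`, so saved presets stay comparable across a project.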

How Ace Step compares to other models#


  • Based on publicly available information, Ace Step focuses on tag-driven control and 4-minute song lengths, while some Suno-family models lean on free-form prompts with shorter default tracks.
  • Compared to instrumental-only systems like MusicGen, Ace Step natively generates vocals and lyrics alongside instrumentation in a single pass.
  • Ideal Use Case: Choose Ace Step when you need full songs with vocals, tag-level genre control, and reproducible seeds for iteration.


Frequently Asked Questions

What is Ace Step and what does it do in a text-to-sound workflow?

Ace Step is a text-to-music model from acestep-ai that turns style tags and prompts into full audio tracks with melody, rhythm, and vocals. In a text-to-sound workflow on RunComfy, you describe the genre, mood, and structure, and Ace Step generates a coherent musical piece with synchronized lyrics. It is designed for creators who want fast, prompt-driven music generation without manual composition.

What kinds of generation tasks is Ace Step best suited for?

Ace Step is best suited for text-to-sound tasks such as generating background music, short song demos, ambient loops, ad jingles, and reference tracks for video or game scenes. It handles style tag control well, so you can steer genre, tempo, and energy with a few descriptors. Vocal and lyric generation also makes it useful for songwriting drafts and creative prototyping.

How does Ace Step compare to other text-to-sound music models for quality and control?

Compared to many general audio models, Ace Step focuses on fine-grained acoustic fidelity, with attention to dynamic balance, spatial quality, and instrument clarity. The style-tag interface gives technical artists and designers more direct control over genre and energy than free-form-only prompts. Reproducibility through a seed parameter also helps developers iterate consistently on a chosen direction.

Which teams and use cases benefit most from Ace Step in production?

Designers, technical artists, video creators, and product teams can use Ace Step text-to-sound generation for trailers, social content, prototype game audio, e-commerce videos, and ad creatives. Developers can wrap it into pipelines that need on-demand soundtracks tied to scene metadata or campaign briefs. Because the model supports both vocals and instrumentals, it covers a wide range of audio needs from a single interface.

What input and output limits should I know before using Ace Step?

Ace Step supports flexible duration, adjustable from a few seconds up to about 4 minutes (240 seconds) per generation. Other constraints such as prompt length, supported audio formats, and tag combinations depend on the current provider configuration, so check the RunComfy parameter panel for exact limits before building around them. Limits may vary by mode or provider settings, and the panel always reflects the live values for the text-to-sound endpoint.

How do I move from testing Ace Step in the Playground to using it in production via the RunComfy API?

You can prototype Ace Step in the RunComfy AI Playground Web UI by adjusting style tags, prompts, duration, and seed until the text-to-sound output matches your target. Once the configuration is stable, call the same Ace Step model through the RunComfy API with identical parameters to automate generation from your backend or content pipeline. This keeps creative iteration in the browser and production runs in code, without changing the underlying model behavior.

How is pricing handled when generating audio with Ace Step on RunComfy?

Ace Step generations consume credits (billed in USD) from your RunComfy balance, and based on available provider information the model is billed per second at $0.0002. New users typically receive a free trial balance to experiment with, after which usage follows the Generation rules shown on the model page. For the most current rates and any mode-specific differences, refer to the Generation section of the Ace Step page on RunComfy.

Can I use Ace Step text-to-sound outputs commercially?

RunComfy provides access to the Ace Step model and the workflow to generate audio, but commercial usage rights for the generated music depend on the license from the original model author and provider (acestep-ai). Before releasing tracks in commercial products, ads, films, or games, review the official Ace Step license and any provider terms to confirm allowed use cases. If anything is unclear, you can reach out to hi@runcomfy.com for guidance on platform-side questions.
