---
title: ACE-Step 1.5 XL Music Generation (CPU)
emoji: 🎵
colorFrom: indigo
colorTo: yellow
sdk: docker
pinned: false
license: mit
tags:
- music-generation
- ace-step
- gguf
- lora
- training
- cpu
- mcp-server
short_description: ACE-Step 1.5 XL - CPU music generation + LoRA training
models:
- ACE-Step/Ace-Step1.5
startup_duration_timeout: 2h
---
# ACE-Step 1.5 XL Music Generation (CPU)
**GGUF inference + LoRA training** on free CPU Spaces. Powered by [acestep.cpp](https://github.com/ServeurpersoCom/acestep.cpp).
## Features
- **Music Generation** -- text/lyrics to stereo 48kHz MP3 via GGUF quantized models
- **LoRA Training** -- fine-tune on your own audio (~11s/epoch CPU, ~1.4s/epoch GPU)
- **Auto-Captioning** -- librosa BPM/key/signature + LM understand mode (caption + lyrics extraction)
- **Multiple LM Sizes** -- 0.6B / 1.7B / 4B language models (on-demand download)
- **Cancel + Download** -- cancel training mid-epoch, download trained LoRA adapter
## Music Generation
1. Enter a music description
2. Enter lyrics or check **Instrumental**
3. Adjust BPM, duration, steps, seed
4. Select LoRA adapter if trained
5. Click **Generate Music**
**Timing:** ~270s for 10s audio with 1.7B LM, 8 steps on CPU.
## LoRA Training
1. Upload audio files (any length; audio is auto-tiled into 30s chunks by the VAE)
2. Set LoRA name, epochs, learning rate, rank
3. Click **Train** -- ace-server stops during training, restarts after
4. Use **Cancel** to stop early (saves checkpoint)
5. **Download** the trained adapter file
6. Trained adapter appears in the LoRA dropdown
**Timing:** ~170s preprocessing + ~11s/epoch on CPU. GPU: ~1.4s/epoch.
**Limits:** 30 min total audio across all files. Files exceeding the cap are truncated with a warning. 50 files max. 8h training timeout.
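The 30s auto-tiling from step 1 can be sketched as follows. This is a minimal illustration only; the real tiler inside the pipeline may pad or overlap chunk boundaries differently:

```python
import numpy as np

def tile_audio(samples: np.ndarray, sr: int, chunk_s: int = 30) -> list:
    """Split a mono waveform into fixed-length chunks for VAE encoding.
    Sketch only -- the actual tiler may pad or overlap boundaries."""
    n = chunk_s * sr
    return [samples[i:i + n] for i in range(0, len(samples), n)]

# A 70 s file at a toy sample rate becomes two full 30 s chunks plus a 10 s tail.
audio = np.zeros(70 * 100)  # 70 s at sr=100
chunks = tile_audio(audio, sr=100)
print([len(c) / 100 for c in chunks])  # -> [30.0, 30.0, 10.0]
```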
**Settings (per Side-Step author recommendations):**
- LR: 3e-4
- Rank: 32, Alpha: 64
- Epochs: 200-500 for 3-10 files
- Optimizer: Adafactor (minimal memory)
- Variant: standard turbo (not XL -- XL swaps on 18 GB RAM)
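The recommendations above can be collected into one place. The key names below are illustrative, not the Space's actual config schema; note that alpha = 2x rank follows the common LoRA scaling convention:

```python
# Recommended training defaults from the list above, gathered into a dict.
# Key names are hypothetical, not the Space's real configuration schema.
TRAIN_DEFAULTS = {
    "lr": 3e-4,
    "rank": 32,
    "alpha": 64,           # 2x rank, the usual LoRA scaling convention
    "epochs": 300,         # pick within 200-500 depending on dataset size
    "optimizer": "adafactor",  # minimal optimizer memory
    "variant": "turbo",    # standard turbo, not XL
}
assert TRAIN_DEFAULTS["alpha"] == 2 * TRAIN_DEFAULTS["rank"]
```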
## Captioning Pipeline
Training audio is auto-captioned before preprocessing:
| Method | What it extracts | Speed |
|--------|-----------------|-------|
| **librosa** | BPM, key, time signature | ~3s/file |
| **LM understand** (GPU) | Rich caption + lyrics + metadata | ~52s/file |
| **ace-server /understand** (Space) | Same as LM, via GGUF | ~30s/file |
| **.txt/.json sidecar** | User-provided caption (if present) | instant |
On the Space, captioning uses ace-server `/understand` before training; locally, it uses the PyTorch LM understand mode.
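To illustrate the tempo half of the librosa pass, here is a minimal autocorrelation BPM estimator run on a synthetic click track. The Space itself calls librosa's beat tracker; this pure-NumPy version only sketches the underlying idea:

```python
import numpy as np

def estimate_bpm(onset, sr, bpm_min=60, bpm_max=180):
    """Autocorrelate an onset-strength signal and pick the strongest
    lag in a plausible BPM range. Sketch of the idea only -- the Space
    uses librosa's actual beat tracker."""
    ac = np.correlate(onset, onset, mode="full")[len(onset) - 1:]
    lo = int(60 * sr / bpm_max)          # shortest beat period to consider
    hi = int(60 * sr / bpm_min)          # longest beat period to consider
    lag = lo + int(np.argmax(ac[lo:hi + 1]))
    return 60.0 * sr / lag

# Synthetic click track: one impulse every 0.5 s -> 120 BPM.
sr = 1000
clicks = np.zeros(10 * sr)
clicks[::sr // 2] = 1.0
print(round(estimate_bpm(clicks, sr)))  # -> 120
```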
## Models
| Component | GGUF | Size | Purpose |
|-----------|------|------|---------|
| DiT XL turbo | acestep-v15-xl-turbo-Q4_K_M | 2.8 GB | Music generation (no LoRA) |
| DiT standard turbo | acestep-v15-turbo-Q4_K_M | 1.1 GB | Music generation (with LoRA) |
| LM 1.7B | acestep-5Hz-lm-1.7B-Q8_0 | 1.7 GB | Caption understanding |
| Text Encoder | Qwen3-Embedding-0.6B-Q8_0 | 0.75 GB | Text encoding |
| VAE | vae-BF16 | 0.32 GB | Audio encode/decode |
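A quick footprint check from the table (sizes copied from above): the LoRA-capable stack built on the standard turbo DiT stays under 4 GB, while swapping in the XL DiT adds ~1.7 GB:

```python
# Disk footprint of the GGUF components listed in the table above.
sizes_gb = {
    "dit_turbo": 1.1,   # standard turbo DiT (LoRA-capable)
    "dit_xl": 2.8,      # XL turbo DiT (no LoRA)
    "lm_1p7b": 1.7,     # caption-understanding LM
    "text_encoder": 0.75,
    "vae": 0.32,
}
shared = sizes_gb["lm_1p7b"] + sizes_gb["text_encoder"] + sizes_gb["vae"]
print(f"turbo stack: {shared + sizes_gb['dit_turbo']:.2f} GB")  # -> 3.87 GB
print(f"XL stack:    {shared + sizes_gb['dit_xl']:.2f} GB")     # -> 5.57 GB
```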
## API
### Generate Music
```python
from gradio_client import Client
client = Client("WeReCooking/ACE-Step-CPU")
result = client.predict(
    caption="upbeat electronic dance music",
    lyrics="[Instrumental]",
    instrumental=True, bpm=120, duration=10, seed=-1, steps=8,
    lora_select="None (no LoRA)",
    lm_model_select="acestep-5Hz-lm-1.7B-Q8_0.gguf",
    api_name="/generate"
)
```
### Train LoRA
```python
from gradio_client import Client, handle_file
client = Client("WeReCooking/ACE-Step-CPU")
result = client.predict(
    audio_files=[handle_file("song.mp3")],
    lora_name="my-style", epochs=200, lr=0.0003, rank=32,
    api_name="/train_lora"
)
```
### MCP (Model Context Protocol)
```json
{
"mcpServers": {
"ace-step": {"url": "https://werecooking-ace-step-cpu.hf.space/gradio_api/mcp/"}
}
}
```
## CLI
```bash
python app.py "upbeat electronic dance music" --duration 10 --steps 8
python app.py "jazz piano" --adapter my-style --seed 42
```
## Architecture
- **Inference:** GGUF via [acestep.cpp](https://github.com/ServeurpersoCom/acestep.cpp)
- **Training:** PyTorch, ported from [Side-Step](https://github.com/koda-dernet/Side-Step) (commit ecd13bd)
- **Captioning:** librosa + LM understand (PyTorch or ace-server /understand)
- Training stops ace-server to free RAM, then restarts it with any new adapters
- Inference is blocked during training with a clear status message
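The last two points amount to a simple train/serve lifecycle, which can be sketched like this. All function names here are placeholders, not the Space's actual code:

```python
import threading

# A flag blocks inference while ace-server is down for training.
training = threading.Event()

def generate():
    if training.is_set():
        return "Training in progress -- generation is disabled until it finishes."
    return "audio.mp3"  # placeholder for the real generation path

def train_lora(stop_server, run_training, start_server):
    """Stop the server to free RAM, train, then restart with new adapters.
    The finally block guarantees inference comes back even if training fails."""
    training.set()
    stop_server()
    try:
        run_training()
    finally:
        start_server()   # picks up any newly saved adapters
        training.clear()

calls = []
train_lora(lambda: calls.append("stop"),
           lambda: calls.append("train"),
           lambda: calls.append("start"))
print(calls, generate())  # -> ['stop', 'train', 'start'] audio.mp3
```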
## Credits
- [ACE-Step 1.5](https://github.com/ace-step/ACE-Step-1.5)
- [acestep.cpp](https://github.com/ServeurpersoCom/acestep.cpp)
- [Side-Step](https://github.com/koda-dernet/Side-Step)
- [Serveurperso/ACE-Step-1.5-GGUF](https://huggingface.co/Serveurperso/ACE-Step-1.5-GGUF)