---
title: ACE-Step 1.5 XL Music Generation (CPU)
emoji: 🎵
colorFrom: indigo
colorTo: yellow
sdk: docker
pinned: false
license: mit
tags:
- music-generation
- ace-step
- gguf
- lora
- training
- cpu
- mcp-server
short_description: ACE-Step 1.5 XL - CPU music generation + LoRA training
models:
- ACE-Step/Ace-Step1.5
startup_duration_timeout: 2h
---
# ACE-Step 1.5 XL Music Generation (CPU)
**GGUF inference + LoRA training** on free CPU Spaces. Powered by [acestep.cpp](https://github.com/ServeurpersoCom/acestep.cpp).
## Features
- **Music Generation** -- text/lyrics to stereo 48kHz MP3 via GGUF quantized models
- **LoRA Training** -- fine-tune on your own audio (~11s/epoch CPU, ~1.4s/epoch GPU)
- **Auto-Captioning** -- librosa BPM/key/signature + LM understand mode (caption + lyrics extraction)
- **Multiple LM Sizes** -- 0.6B / 1.7B / 4B language models (on-demand download)
- **Cancel + Download** -- cancel training mid-epoch, download trained LoRA adapter
## Music Generation
1. Enter a music description
2. Enter lyrics or check **Instrumental**
3. Adjust BPM, duration, steps, seed
4. Select LoRA adapter if trained
5. Click **Generate Music**
**Timing:** ~270s for 10s audio with 1.7B LM, 8 steps on CPU.
## LoRA Training
1. Upload audio files (any length; audio is auto-tiled into 30s chunks by the VAE)
2. Set LoRA name, epochs, learning rate, rank
3. Click **Train** -- ace-server stops during training, restarts after
4. Use **Cancel** to stop early (saves checkpoint)
5. **Download** the trained adapter file
6. Trained adapter appears in the LoRA dropdown
**Timing:** ~170s preprocessing + ~11s/epoch on CPU. GPU: ~1.4s/epoch.
**Limits:** 30 min total audio across all files. Files exceeding the cap are truncated with a warning. 50 files max. 8h training timeout.
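The 30s auto-tiling from step 1 can be sketched as follows. This is a minimal illustration only; the real tiler inside the pipeline may pad or overlap chunk boundaries differently:

```python
import numpy as np

def tile_audio(samples: np.ndarray, sr: int, chunk_s: int = 30) -> list:
    """Split a mono waveform into fixed-length chunks for VAE encoding.
    Sketch only -- the actual tiler may pad or overlap boundaries."""
    n = chunk_s * sr
    return [samples[i:i + n] for i in range(0, len(samples), n)]

# A 70 s file at a toy sample rate becomes two full 30 s chunks plus a 10 s tail.
audio = np.zeros(70 * 100)  # 70 s at sr=100
chunks = tile_audio(audio, sr=100)
print([len(c) / 100 for c in chunks])  # -> [30.0, 30.0, 10.0]
```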
**Settings (per Side-Step author recommendations):**
- LR: 3e-4
- Rank: 32, Alpha: 64
- Epochs: 200-500 for 3-10 files
- Optimizer: Adafactor (minimal memory)
- Variant: standard turbo (not XL -- XL swaps on 18 GB RAM)
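The recommendations above can be collected into one place. The key names below are illustrative, not the Space's actual config schema; note that alpha = 2x rank follows the common LoRA scaling convention:

```python
# Recommended training defaults from the list above, gathered into a dict.
# Key names are hypothetical, not the Space's real configuration schema.
TRAIN_DEFAULTS = {
    "lr": 3e-4,
    "rank": 32,
    "alpha": 64,           # 2x rank, the usual LoRA scaling convention
    "epochs": 300,         # pick within 200-500 depending on dataset size
    "optimizer": "adafactor",  # minimal optimizer memory
    "variant": "turbo",    # standard turbo, not XL
}
assert TRAIN_DEFAULTS["alpha"] == 2 * TRAIN_DEFAULTS["rank"]
```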
## Captioning Pipeline
Training audio is auto-captioned before preprocessing:
| Method | What it extracts | Speed |
|--------|-----------------|-------|
| **librosa** | BPM, key, time signature | ~3s/file |
| **LM understand** (GPU) | Rich caption + lyrics + metadata | ~52s/file |
| **ace-server /understand** (Space) | Same as LM, via GGUF | ~30s/file |
| **.txt/.json sidecar** | User-provided caption (if present) | instant |
On the Space, captioning uses ace-server `/understand` before training; locally, it uses the PyTorch LM understand mode.
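To illustrate the tempo half of the librosa pass, here is a minimal autocorrelation BPM estimator run on a synthetic click track. The Space itself calls librosa's beat tracker; this pure-NumPy version only sketches the underlying idea:

```python
import numpy as np

def estimate_bpm(onset, sr, bpm_min=60, bpm_max=180):
    """Autocorrelate an onset-strength signal and pick the strongest
    lag in a plausible BPM range. Sketch of the idea only -- the Space
    uses librosa's actual beat tracker."""
    ac = np.correlate(onset, onset, mode="full")[len(onset) - 1:]
    lo = int(60 * sr / bpm_max)          # shortest beat period to consider
    hi = int(60 * sr / bpm_min)          # longest beat period to consider
    lag = lo + int(np.argmax(ac[lo:hi + 1]))
    return 60.0 * sr / lag

# Synthetic click track: one impulse every 0.5 s -> 120 BPM.
sr = 1000
clicks = np.zeros(10 * sr)
clicks[::sr // 2] = 1.0
print(round(estimate_bpm(clicks, sr)))  # -> 120
```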
## Models
| Component | GGUF | Size | Purpose |
|-----------|------|------|---------|
| DiT XL turbo | acestep-v15-xl-turbo-Q4_K_M | 2.8 GB | Music generation (no LoRA) |
| DiT standard turbo | acestep-v15-turbo-Q4_K_M | 1.1 GB | Music generation (with LoRA) |
| LM 1.7B | acestep-5Hz-lm-1.7B-Q8_0 | 1.7 GB | Caption understanding |
| Text Encoder | Qwen3-Embedding-0.6B-Q8_0 | 0.75 GB | Text encoding |
| VAE | vae-BF16 | 0.32 GB | Audio encode/decode |
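A quick footprint check from the table (sizes copied from above): the LoRA-capable stack built on the standard turbo DiT stays under 4 GB, while swapping in the XL DiT adds ~1.7 GB:

```python
# Disk footprint of the GGUF components listed in the table above.
sizes_gb = {
    "dit_turbo": 1.1,   # standard turbo DiT (LoRA-capable)
    "dit_xl": 2.8,      # XL turbo DiT (no LoRA)
    "lm_1p7b": 1.7,     # caption-understanding LM
    "text_encoder": 0.75,
    "vae": 0.32,
}
shared = sizes_gb["lm_1p7b"] + sizes_gb["text_encoder"] + sizes_gb["vae"]
print(f"turbo stack: {shared + sizes_gb['dit_turbo']:.2f} GB")  # -> 3.87 GB
print(f"XL stack:    {shared + sizes_gb['dit_xl']:.2f} GB")     # -> 5.57 GB
```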
## API
### Generate Music
```python
from gradio_client import Client
client = Client("WeReCooking/ACE-Step-CPU")
result = client.predict(
    caption="upbeat electronic dance music",
    lyrics="[Instrumental]",
    instrumental=True, bpm=120, duration=10, seed=-1, steps=8,
    lora_select="None (no LoRA)",
    lm_model_select="acestep-5Hz-lm-1.7B-Q8_0.gguf",
    api_name="/generate"
)
```
### Train LoRA
```python
from gradio_client import Client, handle_file
client = Client("WeReCooking/ACE-Step-CPU")
result = client.predict(
    audio_files=[handle_file("song.mp3")],
    lora_name="my-style", epochs=200, lr=0.0003, rank=32,
    api_name="/train_lora"
)
```
### MCP (Model Context Protocol)
```json
{
"mcpServers": {
"ace-step": {"url": "https://werecooking-ace-step-cpu.hf.space/gradio_api/mcp/"}
}
}
```
## CLI
```bash
python app.py "upbeat electronic dance music" --duration 10 --steps 8
python app.py "jazz piano" --adapter my-style --seed 42
```
## Architecture
- **Inference:** GGUF via [acestep.cpp](https://github.com/ServeurpersoCom/acestep.cpp)
- **Training:** PyTorch, ported from [Side-Step](https://github.com/koda-dernet/Side-Step) (commit ecd13bd)
- **Captioning:** librosa + LM understand (PyTorch or ace-server /understand)
- Training stops ace-server to free RAM, then restarts it with any new adapters
- Inference is blocked during training with a clear status message
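The last two points amount to a simple train/serve lifecycle, which can be sketched like this. All function names here are placeholders, not the Space's actual code:

```python
import threading

# A flag blocks inference while ace-server is down for training.
training = threading.Event()

def generate():
    if training.is_set():
        return "Training in progress -- generation is disabled until it finishes."
    return "audio.mp3"  # placeholder for the real generation path

def train_lora(stop_server, run_training, start_server):
    """Stop the server to free RAM, train, then restart with new adapters.
    The finally block guarantees inference comes back even if training fails."""
    training.set()
    stop_server()
    try:
        run_training()
    finally:
        start_server()   # picks up any newly saved adapters
        training.clear()

calls = []
train_lora(lambda: calls.append("stop"),
           lambda: calls.append("train"),
           lambda: calls.append("start"))
print(calls, generate())  # -> ['stop', 'train', 'start'] audio.mp3
```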
## Credits
- [ACE-Step 1.5](https://github.com/ace-step/ACE-Step-1.5)
- [acestep.cpp](https://github.com/ServeurpersoCom/acestep.cpp)
- [Side-Step](https://github.com/koda-dernet/Side-Step)
- [Serveurperso/ACE-Step-1.5-GGUF](https://huggingface.co/Serveurperso/ACE-Step-1.5-GGUF)