WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training
Abstract
Current spoken dialogue models fall short in intelligence and expressiveness even with end-to-end approaches; a modality-aware adaptive post-training method that combines constrained preference updates on the semantic channel with explicit acoustic anchoring improves both semantic quality and speech expressiveness.
End-to-end spoken dialogue models have garnered significant attention because they offer a higher potential ceiling in expressiveness and perceptual ability than cascaded systems. However, the intelligence and expressiveness of current open-source spoken dialogue models often remain below expectations. Motivated by the success of online reinforcement learning (RL) in other domains, one might attempt to apply preference optimization directly to spoken dialogue models, yet this transfer is non-trivial. We analyze these obstacles from the perspectives of reward modeling and rollout sampling, focusing on how sparse preference supervision interacts with dense speech generation under shared-parameter updates. Based on this analysis, we propose a modality-aware adaptive post-training recipe that makes RL practical for spoken dialogue: it constrains preference updates to the semantic channel and improves acoustic behavior via explicit anchoring, while dynamically regulating their mixture from rollout statistics to avoid unreliable preference gradients. We evaluate the method across multiple spoken dialogue benchmarks and representative architectures, and observe consistent improvements in semantic quality and speech expressiveness.
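The abstract does not spell out the exact objective, but the described recipe can be sketched in code. The following is a minimal, hypothetical sketch assuming a DPO-style preference loss restricted to semantic (text) tokens, a KL anchor on acoustic (speech) tokens toward a frozen reference model, and a gating weight computed from rollout reward statistics; all function names, the gating rule, and the thresholds are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a modality-aware adaptive hybrid post-training objective.
# Names, gating rule, and constants are illustrative assumptions.
import torch
import torch.nn.functional as F


def semantic_preference_loss(logp_chosen, logp_rejected, beta=0.1):
    """DPO-style preference loss applied to semantic tokens only.
    logp_* are per-pair policy-minus-reference log-prob sums."""
    return -F.logsigmoid(beta * (logp_chosen - logp_rejected)).mean()


def acoustic_anchor_loss(policy_logits, ref_logits):
    """Explicit anchoring of the acoustic channel: KL(reference || policy)
    over speech-token logits keeps acoustic behavior near the anchor."""
    policy_logprobs = F.log_softmax(policy_logits, dim=-1)
    ref_logprobs = F.log_softmax(ref_logits, dim=-1)
    return F.kl_div(policy_logprobs, ref_logprobs,
                    reduction="batchmean", log_target=True)


def preference_gate(rewards, min_gap=0.05):
    """Dynamic mixture weight from rollout statistics (hypothetical rule):
    shrink the preference term when the chosen/rejected reward gap is
    small, i.e. when preference gradients are likely unreliable.
    rewards: tensor of shape [batch, 2] = (chosen, rejected) rewards."""
    gap = (rewards[:, 0] - rewards[:, 1]).abs().mean()
    return torch.clamp(gap / min_gap, max=1.0)


def hybrid_loss(logp_c, logp_r, policy_logits, ref_logits, rewards, lam=1.0):
    """Combine the gated semantic preference loss with the acoustic anchor."""
    w = preference_gate(rewards)
    return (w * semantic_preference_loss(logp_c, logp_r)
            + lam * acoustic_anchor_loss(policy_logits, ref_logits))
```

The design intent this sketch tries to capture is separation of channels: preference gradients only ever touch the semantic stream, the acoustic stream is regularized toward a reference, and the rollout-derived gate suppresses the preference term exactly when the reward signal cannot distinguish the pair.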
Community
An effective post-training framework for SDMs
Librarian Bot: the following similar papers were recommended by the Semantic Scholar API.
- Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation (2026)
- Enhancing Conversational TTS with Cascaded Prompting and ICL-Based Online Reinforcement Learning (2026)
- TiCo: Time-Controllable Training for Spoken Dialogue Models (2026)
- X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs (2026)
- Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness (2026)
- Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding (2026)
- Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning (2026)