	---
	license: cc-by-nc-sa-4.0
	pipeline_tag: feature-extraction
	tags:
	- automatic-speech-recognition
	- audio-classification
	- audio
	- speech
	- music
	library_name: transformers
	datasets:
	- openslr/librispeech_asr
	- facebook/multilingual_librispeech
	- mozilla-foundation/common_voice_17_0
	- speechcolab/gigaspeech
	- facebook/voxpopuli
	- agkphysics/AudioSet
	language:
	- en
	---
	# USAD: Universal Speech and Audio Representation via Distillation

	Universal Speech and Audio Distillation (USAD) is a unified speech, sound, and music encoder distilled from domain-specific teachers.
	Trained on 126k hours of mixed data, USAD delivers competitive performance across diverse benchmarks (SUPERB, HEAR, and AudioSet) with a single model.

	[👀 Read Full Paper](https://arxiv.org/abs/2506.18843)

	---

	## 🗂️ Models

	All USAD models are transformer encoders operating at a 50 Hz frame rate. The teacher models are WavLM Base+ and ATST Frame.

	| Model      | Parameters | Dim  | Layers | Checkpoint                                        |
	| ---------- | ---------- | ---- | ------ | ------------------------------------------------- |
	| USAD Small | 24M        | 384  | 12     | [link](https://huggingface.co/MIT-SLS/USAD-Small) |
	| USAD Base  | 94M        | 768  | 12     | [link](https://huggingface.co/MIT-SLS/USAD-Base)  |
	| USAD Large | 330M       | 1024 | 24     | [link](https://huggingface.co/MIT-SLS/USAD-Large) |

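The 50 Hz frame rate translates directly into sequence lengths. The sketch below assumes 16 kHz input (the rate used in the usage snippet) and ignores any edge padding the model may apply, so treat the counts as approximate. The helper names `approx_frame_counts` and `frame_to_seconds` are hypothetical and not part of the USAD code; the factor of 2 for mel frames is taken from the `seq_len * 2` shape comment in the usage snippet.

```python
SAMPLE_RATE_HZ = 16_000            # input sample rate assumed from the usage snippet
FRAME_RATE_HZ = 50                 # encoder frame rate stated in this section
MEL_FRAMES_PER_ENCODER_FRAME = 2   # assumption taken from the "seq_len * 2" shape comment


def approx_frame_counts(num_samples: int) -> tuple:
    """Return (encoder_frames, mel_frames) for a waveform, ignoring edge padding."""
    samples_per_frame = SAMPLE_RATE_HZ // FRAME_RATE_HZ  # 320 samples = 20 ms
    encoder_frames = num_samples // samples_per_frame
    return encoder_frames, encoder_frames * MEL_FRAMES_PER_ENCODER_FRAME


def frame_to_seconds(frame_index: int) -> float:
    """Map an encoder frame index to its start time in seconds."""
    return frame_index / FRAME_RATE_HZ


if __name__ == "__main__":
    # A 10-second clip at 16 kHz has 160,000 samples.
    print(approx_frame_counts(160_000))
    print(frame_to_seconds(125))
```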
	---


	## 🚀 How To Use

	Installation
	```bash
	pip install -U transformers
	```

	Load Model and Extract Features
	```python
	import torch
	from transformers import AutoModel

	# Load pre-trained model
	model = AutoModel.from_pretrained("MIT-SLS/USAD-Base", trust_remote_code=True).cuda().eval()

	# Load audio and resample to 16kHz
	wav = model.load_audio("path/to/audio").unsqueeze(0) # (batch_size, wav_len)
	# wav is a float tensor on the same device as the model
	# You can also load waveforms directly with torchaudio.load

	# Extract features
	with torch.no_grad():
	    result = model(wav)

	# result["x"]: model final output (batch_size, seq_len)
	# result["mel"]: mel fbank (batch_size, seq_len * 2, mel_dim)
	# result["hidden_states"]: list of (batch_size, seq_len, encoder_dim)
	# result["ffn"]: list of (batch_size, seq_len, encoder_dim)
	```
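
Downstream tasks often need one vector per utterance rather than per-frame features. The following is a minimal sketch of one common recipe, not something this model card prescribes: a softmax-weighted sum over the entries of `hidden_states`, followed by mean pooling over valid frames. The functions `weighted_layer_sum` and `mean_pool` and the `num_frames` tensor are hypothetical, and random tensors stand in for real model output so the example runs without downloading a checkpoint.

```python
import torch


def weighted_layer_sum(hidden_states, layer_logits: torch.Tensor) -> torch.Tensor:
    """Combine a list of (batch, seq_len, dim) tensors with softmax weights."""
    stacked = torch.stack(list(hidden_states), dim=0)   # (layers, batch, seq_len, dim)
    weights = torch.softmax(layer_logits, dim=0)        # (layers,), sums to 1
    return (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)


def mean_pool(frames: torch.Tensor, num_frames: torch.Tensor) -> torch.Tensor:
    """Average (batch, seq_len, dim) features over the valid frames of each item."""
    seq_len = frames.size(1)
    # mask[b, t] is True while frame t lies inside utterance b
    mask = torch.arange(seq_len, device=frames.device)[None, :] < num_frames[:, None]
    mask = mask.unsqueeze(-1).to(frames.dtype)          # (batch, seq_len, 1)
    return (frames * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)


if __name__ == "__main__":
    # Stand-in for result["hidden_states"]; the layer count here is arbitrary.
    hidden_states = [torch.randn(2, 50, 768) for _ in range(4)]
    layer_logits = torch.zeros(len(hidden_states))      # uniform weights; learnable in practice
    num_frames = torch.tensor([50, 30])                 # hypothetical valid lengths per item

    frames = weighted_layer_sum(hidden_states, layer_logits)
    utterance = mean_pool(frames, num_frames)
    print(frames.shape, utterance.shape)
```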

	See [usad_model.py](https://huggingface.co/MIT-SLS/USAD-Base/blob/main/usad_model.py) for more details about the model.

	---

	## 📖 Citation

	```bibtex
	@article{chang2025usad,
	  title={{USAD}: Universal Speech and Audio Representation via Distillation},
	  author={Chang, Heng-Jui and Bhati, Saurabhchand and Glass, James and Liu, Alexander H.},
	  journal={arXiv preprint arXiv:2506.18843},
	  year={2025}
	}
	```

	---

	## 🙏 Acknowledgement

	Our implementation is based on the awesome [facebookresearch/fairseq](https://github.com/facebookresearch/fairseq), [cwx-worst-one/EAT](https://github.com/cwx-worst-one/EAT), and [sooftware/conformer](https://github.com/sooftware/conformer) repositories.