| license: cc-by-nc-sa-4.0 | |
| pipeline_tag: feature-extraction | |
| tags: | |
| - automatic-speech-recognition | |
| - audio-classification | |
| - audio | |
| - speech | |
| - music | |
| library_name: transformers | |
| datasets: | |
| - openslr/librispeech_asr | |
| - facebook/multilingual_librispeech | |
| - mozilla-foundation/common_voice_17_0 | |
| - speechcolab/gigaspeech | |
| - facebook/voxpopuli | |
| - agkphysics/AudioSet | |
| language: | |
| - en | |
| # USAD: Universal Speech and Audio Representation via Distillation | |
| **Universal Speech and Audio Distillation (USAD)** is a unified **speech**, **sound**, and **music** encoder distilled from domain-specific teachers. | |
| Trained on 126k hours of mixed data, USAD delivers competitive performance across diverse benchmarks (SUPERB, HEAR, and AudioSet) with a single model. | |
| [**Read Full Paper**](https://arxiv.org/abs/2506.18843) | |
| --- | |
| ## Models | |
| All USAD models are transformer encoders that operate at a **50 Hz frame rate**. The teacher models are **WavLM Base+** and **ATST Frame**. | |
| | Model | Parameters | Dim | Layers | Checkpoint | |
| | ---------- | ---------- | ---- | ----- | ------------------------------------------------- | | |
| | USAD Small | 24M | 384 | 12 | [link](https://huggingface.co/MIT-SLS/USAD-Small) | | |
| | USAD Base | 94M | 768 | 12 | [link](https://huggingface.co/MIT-SLS/USAD-Base) | | |
| | USAD Large | 330M | 1024 | 24 | [link](https://huggingface.co/MIT-SLS/USAD-Large) | | |
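
To make the 50 Hz frame rate concrete, here is a minimal sketch of the arithmetic relating waveform length to output length. It assumes the 16 kHz input rate used in the usage example; the helper names are hypothetical, and the floor division is only an approximation, since the real encoder may differ by a frame or two depending on how its front end pads the signal.

```python
SAMPLE_RATE_HZ = 16_000  # input sampling rate used in the usage example
FRAME_RATE_HZ = 50       # encoder output frame rate stated in this section

# Samples covered by one output frame: 16000 / 50 = 320 (20 ms per frame).
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ // FRAME_RATE_HZ


def approx_num_frames(num_samples: int) -> int:
    """Approximate encoder output length for a waveform of `num_samples` samples.

    Assumption: plain floor division; the actual model may be off by a
    frame or two because of padding in its feature extractor.
    """
    return num_samples // SAMPLES_PER_FRAME


def frame_to_seconds(frame_index: int) -> float:
    """Start time in seconds of the given output frame."""
    return frame_index / FRAME_RATE_HZ


if __name__ == "__main__":
    ten_seconds = 10 * SAMPLE_RATE_HZ
    print(SAMPLES_PER_FRAME)               # 320
    print(approx_num_frames(ten_seconds))  # 500
    print(frame_to_seconds(125))           # 2.5
```

A 10-second clip therefore yields roughly 500 feature vectors, each of size `Dim` for the chosen model.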
| --- | |
| ## How To Use | |
| **Installation** | |
| ``` | |
| pip install -U transformers | |
| ``` | |
| **Load Model and Extract Features** | |
| ```python | |
| import torch | |
| from transformers import AutoModel | |
| # Load pre-trained model | |
| model = AutoModel.from_pretrained("MIT-SLS/USAD-Base", trust_remote_code=True).cuda().eval() | |
| # Load audio and resample to 16kHz | |
| wav = model.load_audio("path/to/audio").unsqueeze(0) # (batch_size, wav_len) | |
| # wav is a float tensor on the same device as the model | |
| # You can also load waveforms directly with torchaudio.load | |
| # Extract features | |
| with torch.no_grad(): | |
|     results = model(wav) | |
| # results["x"]: model final output (batch_size, seq_len) | |
| # results["mel"]: mel fbank (batch_size, seq_len * 2, mel_dim) | |
| # results["hidden_states"]: list of (batch_size, seq_len, encoder_dim) | |
| # results["ffn"]: list of (batch_size, seq_len, encoder_dim) | |
| ``` | |
| See [usad_model.py](https://huggingface.co/MIT-SLS/USAD-Base/blob/main/usad_model.py) for more details about the model. | |
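
The output dictionary exposes frame-level features per layer. One common way to turn these into a single utterance-level vector is a softmax-weighted sum across layers followed by mean pooling over time. The sketch below is an illustration only and is not taken from `usad_model.py`: it uses random tensors with the documented `hidden_states` shape in place of real model outputs, the helper names are hypothetical, and the number of list entries (12 here, matching the USAD Base layer count) is an assumption.

```python
import torch

# Stand-in for results["hidden_states"]: a list of
# (batch_size, seq_len, encoder_dim) tensors. Random values, not real features.
batch_size, seq_len, encoder_dim, num_layers = 2, 50, 768, 12
hidden_states = [
    torch.randn(batch_size, seq_len, encoder_dim) for _ in range(num_layers)
]


def weighted_layer_sum(layers, layer_logits):
    """Combine per-layer features using softmax-normalized layer weights."""
    stacked = torch.stack(layers, dim=0)          # (num_layers, B, T, D)
    weights = torch.softmax(layer_logits, dim=0)  # (num_layers,)
    return (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # (B, T, D)


def mean_pool(features):
    """Average frame-level features over the time axis."""
    return features.mean(dim=1)  # (B, D)


# Zero logits give uniform weights; a downstream probe would learn these.
layer_logits = torch.zeros(len(hidden_states))
frame_features = weighted_layer_sum(hidden_states, layer_logits)
utterance_embedding = mean_pool(frame_features)

print(frame_features.shape)       # torch.Size([2, 50, 768])
print(utterance_embedding.shape)  # torch.Size([2, 768])
```

With real audio, batches of different-length clips are usually padded, so the plain mean above would need a mask that excludes padded frames.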
| --- | |
| ## Citation | |
| ```bibtex | |
| @article{chang2025usad, | |
| title={{USAD}: Universal Speech and Audio Representation via Distillation}, | |
| author={Chang, Heng-Jui and Bhati, Saurabhchand and Glass, James and Liu, Alexander H.}, | |
| journal={arXiv preprint arXiv:2506.18843}, | |
| year={2025} | |
| } | |
| ``` | |
| --- | |
| ## Acknowledgement | |
| Our implementation is based on the awesome [facebookresearch/fairseq](https://github.com/facebookresearch/fairseq), [cwx-worst-one/EAT](https://github.com/cwx-worst-one/EAT), and [sooftware/conformer](https://github.com/sooftware/conformer) repositories. | |