| ## [VHASR: A Multimodal Speech Recognition System With Vision Hotwords](https://arxiv.org/abs/2410.00822) |
|
|
| This repository provides VHASR models trained on Flickr8k, ADE20k, COCO, and OpenImages. |
|
|
| Our paper is available at https://arxiv.org/abs/2410.00822. |
|
|
| Our code is available at https://github.com/193746/VHASR/tree/main. Please refer to that repository for specific details about training and testing. |
|
|
| ### Inference |
|
|
| If you are interested in our work, you can train your own model on large-scale data and run inference with the following command. Note that the CLIP config file must be placed in '{model_file}/clip_config', as in the four pretrained models we provide. |
|
|
| ```sh |
| cd VHASR |
| CUDA_VISIBLE_DEVICES=1 python src/infer.py \ |
| --model_name "{path_to_model_folder}" \ |
| --speech_path "{path_to_speech}" \ |
| --image_path "{path_to_image}" \ |
| --merge_method 3 |
| ``` |
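Because the command depends on several paths being in place, a small pre-flight check can catch a missing CLIP config before the model is loaded. The sketch below is only an illustration: `check_vhasr_inputs` is a hypothetical helper that is not part of the VHASR repository, and it assumes the speech and image inputs are single files. The README does not say whether `clip_config` is a file or a folder, so the check only tests that the path exists.

```sh
#!/bin/sh
# Hypothetical pre-flight helper (not part of VHASR).
# Returns 0 when the model folder, its clip_config entry,
# the speech file, and the image file all exist; 1 otherwise.
check_vhasr_inputs() {
    vh_model_dir=$1
    vh_speech=$2
    vh_image=$3

    if [ ! -d "$vh_model_dir" ]; then
        echo "missing model folder: $vh_model_dir" >&2
        return 1
    fi
    # The CLIP config is expected under {model_file}/clip_config.
    # -e accepts either a file or a directory (assumption: the type is unspecified).
    if [ ! -e "$vh_model_dir/clip_config" ]; then
        echo "missing CLIP config: $vh_model_dir/clip_config" >&2
        return 1
    fi
    if [ ! -f "$vh_speech" ]; then
        echo "missing speech file: $vh_speech" >&2
        return 1
    fi
    if [ ! -f "$vh_image" ]; then
        echo "missing image file: $vh_image" >&2
        return 1
    fi
    return 0
}

# Example use (paths are placeholders to fill in):
#   check_vhasr_inputs "$MODEL_DIR" "$SPEECH" "$IMAGE" && \
#   CUDA_VISIBLE_DEVICES=1 python src/infer.py \
#       --model_name "$MODEL_DIR" --speech_path "$SPEECH" \
#       --image_path "$IMAGE" --merge_method 3
```

Chaining the check with `&&` means the inference command runs only when every path is present, so a misplaced config fails fast with a readable message.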
|
|
| ### Citation |
| ```bibtex |
| @misc{hu2024vhasrmultimodalspeechrecognition, |
| title={VHASR: A Multimodal Speech Recognition System With Vision Hotwords}, |
| author={Jiliang Hu and Zuchao Li and Ping Wang and Haojun Ai and Lefei Zhang and Hai Zhao}, |
| year={2024}, |
| eprint={2410.00822}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.SD}, |
| url={https://arxiv.org/abs/2410.00822}, |
| } |
| ``` |
|
|
| ### License: cc-by-nc-4.0 |