---
language: en
thumbnail: https://huggingface.co/front/thumbnails/microsoft.png
tags:
- text-classification
license: mit
---

# AutoDisProxyT-RTE for Distilling Massive Neural Networks

AutoDisProxyT is a distilled, task-agnostic transformer model that uses task transfer to learn a small universal model applicable to arbitrary tasks and languages, as outlined in the paper [Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models](https://proceedings.neurips.cc/paper_files/paper/2022/file/b7c12689a89e98a61bcaa65285a41b7c-Paper-Conference.pdf).

This AutoDisProxyT checkpoint has 7 layers, a hidden size of 160, and 10 attention heads, which corresponds to 6.88 million parameters and 0.27 GFLOPs.

The following table shows results on the GLUE dev set.

| Models | #Params (M) | #FLOPs (G) | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | Avg |
|--------|-------------|------------|------|------|-----|-----|-------|------|------|-----|
| BERT | 109 | 11.2 | 84.5 | 91.7 | 91.3 | 68.6 | 93.2 | 87.3 | 53.5 | 82.2 |
| BERT<sub>SMALL</sub> | 66 | 5.66 | 81.8 | 89.8 | 90.6 | 67.9 | 91.2 | 84.9 | 53.5 | 80.0 |
| TruncatedBERT | 66 | 5.66 | 81.2 | 87.9 | 90.4 | 65.5 | 90.8 | 82.7 | 41.4 | 77.1 |
| DistilBERT | 66 | 5.66 | 82.2 | 89.2 | 88.5 | 59.9 | 91.3 | 87.5 | 51.3 | 78.6 |
| TinyBERT | 66 | 5.66 | 83.5 | 90.5 | 90.6 | 72.2 | 91.6 | 88.4 | 42.8 | 79.9 |
| MiniLM | 66 | 5.66 | 84.0 | 91.0 | 91.0 | 71.5 | 92.0 | 88.4 | 49.2 | 81.0 |
| AutoTinyBERT-KD-S1 | 30.0 | 1.69 | 82.3 | 89.7 | 89.9 | 71.1 | 91.4 | 88.5 | 47.3 | 80.0 |
| DynaBERT | 37.7 | 1.81 | 82.3 | 88.5 | 90.4 | 63.2 | 92.0 | 81.4 | 76.4 | 43.7 |
| NAS-BERT<sub>10</sub> | 10.0 | 2.30 | 76.4 | 86.3 | 88.5 | 66.6 | 88.6 | 79.1 | 34.0 | 74.2 |
| AutoTinyBERT-KD-S4 | 66 | 5.66 | 76.0 | 85.5 | 86.9 | 64.9 | 86.8 | 81.4 | 20.4 | 71.7 |
| NAS-BERT<sub>5</sub> | 66 | 5.66 | 74.4 | 84.9 | 85.8 | 66.6 | 87.3 | 79.6 | 19.8 | 71.2 |
| AutoDisProxyT | 6.88 | 0.27 | 79.0 | 86.4 | 89.1 | 64.3 | 85.9 | 78.5 | 24.8 | 72.6 |
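
As a quick sanity check on the table, the sketch below recomputes the `Avg` entry for the AutoDisProxyT row and its size reduction relative to BERT. It assumes `Avg` is the unweighted mean of the seven task scores, which this card does not state explicitly; the helper names `mean_score` and `reduction_factor` are illustrative and not part of any released code.

```python
# Figures copied from the GLUE dev-set table above.
TASKS = ["MNLI", "QNLI", "QQP", "RTE", "SST-2", "MRPC", "CoLA"]

bert = {"params_m": 109.0, "flops_g": 11.2}
autodisproxyt = {
    "params_m": 6.88,
    "flops_g": 0.27,
    # Scores in the same order as TASKS.
    "scores": [79.0, 86.4, 89.1, 64.3, 85.9, 78.5, 24.8],
}


def mean_score(scores):
    """Unweighted mean of the task scores (assumed definition of Avg)."""
    return sum(scores) / len(scores)


def reduction_factor(baseline, compact):
    """How many times smaller `compact` is than `baseline`."""
    return baseline / compact


avg = mean_score(autodisproxyt["scores"])
param_reduction = reduction_factor(bert["params_m"], autodisproxyt["params_m"])
flop_reduction = reduction_factor(bert["flops_g"], autodisproxyt["flops_g"])

print(f"Avg over {len(TASKS)} tasks: {avg:.1f}")
print(f"Parameter reduction vs BERT: {param_reduction:.1f}x")
print(f"FLOPs reduction vs BERT: {flop_reduction:.1f}x")
```

Under that assumption the mean rounds to the 72.6 reported in the table, and the checkpoint comes out roughly 15.8x smaller than BERT in parameters and about 41.5x cheaper in FLOPs.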

Tested with `torch 1.6.0`.

If you use this checkpoint in your work, please cite:

```bibtex
@article{xu2022autodistil,
  title={AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models},
  author={Xu, Dongkuan and Mukherjee, Subhabrata and Liu, Xiaodong and Dey, Debadeepta and Wang, Wenhui and Zhang, Xiang and Awadallah, Ahmed Hassan and Gao, Jianfeng},
  journal={arXiv preprint arXiv:2201.12507},
  year={2022}
}
```