| --- |
| language: eo |
| license: mit |
| --- |
| |
| # EsperBERTo: A RoBERTa-like model for Esperanto |
|
|
| This is a RoBERTa-like model trained from scratch on the Esperanto language. |
|
|
| ## Model description |
|
|
| The model has 6 layers, a hidden size of 768, 12 attention heads, and about 84 million parameters in total. It follows the RoBERTa architecture. The tokenizer is a byte-level Byte-Pair Encoding (BPE) tokenizer, also trained from scratch on the same Esperanto corpus. |
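
To illustrate why a byte-level vocabulary suits Esperanto, the sketch below shows that each of the six accented letters (ĉ, ĝ, ĥ, ĵ, ŝ, ŭ) occupies two bytes in UTF-8. A byte-level BPE tokenizer starts from the 256 possible byte values, so any such string can be represented without an unknown token. The sample phrase is an illustration chosen here, not text from the training corpus, and the snippet uses only the Python standard library rather than the trained tokenizer.

```python
# Illustration only: plain UTF-8 encoding, not the trained EsperBERTo tokenizer.
ACCENTED = "ĉĝĥĵŝŭ"  # the six accented lowercase letters of Esperanto

# Each accented letter is encoded as two bytes in UTF-8.
bytes_per_letter = {ch: len(ch.encode("utf-8")) for ch in ACCENTED}

# Hypothetical sample phrase that contains all six accented letters.
sample = "eĥoŝanĝo ĉiuĵaŭde"
raw = sample.encode("utf-8")

# Byte-level BPE starts from single bytes (values 0-255) and learns merges on
# top of them, so every input decomposes into known base symbols.
base_symbols = sorted(set(raw))

print(bytes_per_letter)
print(len(sample), "characters ->", len(raw), "bytes")
print(len(base_symbols), "distinct byte values in the sample")
```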
|
|
| - **Model:** RoBERTa-like |
| - **Layers:** 6 |
| - **Hidden size:** 768 |
| - **Heads:** 12 |
| - **Parameters:** 84M |
| - **Tokenizer:** Byte-level BPE |
| - **Vocabulary size:** 52,000 |
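
The parameter count can be sanity-checked from the dimensions listed above. The following back-of-the-envelope sketch assumes RoBERTa's usual defaults for the values this card does not state: a feed-forward size of 3,072, 514 position embeddings, a single token type, and a masked-LM head whose output weights are tied to the input embeddings. Under those assumptions the total is about 83.5M, and an optional 768×768 pooler layer would bring it to about 84.1M, in line with the 84M quoted here.

```python
# Dimensions from the list above.
VOCAB, HIDDEN, LAYERS = 52_000, 768, 6
# Assumed RoBERTa defaults (not stated in this card).
FFN, MAX_POS, TYPE_VOCAB = 3_072, 514, 1


def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


LAYER_NORM = 2 * HIDDEN  # scale and shift vectors

embeddings = (VOCAB + MAX_POS + TYPE_VOCAB) * HIDDEN + LAYER_NORM
attention = 4 * linear(HIDDEN, HIDDEN) + LAYER_NORM  # Q, K, V, output projection
feed_forward = linear(HIDDEN, FFN) + linear(FFN, HIDDEN) + LAYER_NORM
encoder = LAYERS * (attention + feed_forward)
# LM head: dense + layer norm + per-token bias; decoder weights assumed tied.
lm_head = linear(HIDDEN, HIDDEN) + LAYER_NORM + VOCAB
pooler = linear(HIDDEN, HIDDEN)  # optional, depends on the model class

total = embeddings + encoder + lm_head
print(f"without pooler: {total / 1e6:.1f}M")
print(f"with pooler:    {(total + pooler) / 1e6:.1f}M")
```

The number of attention heads does not enter the count: 12 heads only split the 768 hidden dimensions into 12 slices of 64.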
|
|
| ## Training data |
|
|
| The model was trained on the Esperanto portion of the OSCAR corpus (`oscar.eo.txt`), which is approximately 3 GB in size. |
|
|
| ## Training procedure |
|
|
| The model was trained for one epoch on the OSCAR corpus using the `Trainer` API from the `transformers` library, on a single GPU. |
|
|
| ### Hyperparameters |
| - `output_dir`: `"./EsperBERTo"` |
| - `overwrite_output_dir`: `True` |
| - `num_train_epochs`: `1` |
| - `per_gpu_train_batch_size`: `64` |
| - `save_steps`: `10_000` |
| - `save_total_limit`: `2` |
| - `prediction_loss_only`: `True` |
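
As a rough sketch, the values above could be passed to `TrainingArguments` as follows. This assumes a `transformers` 4.x release; `per_gpu_train_batch_size` is deprecated there, so the sketch substitutes `per_device_train_batch_size`, which is equivalent on a single GPU. The helper name `build_training_args` is hypothetical, and the import is deferred so the dictionary can be inspected without `transformers` installed.

```python
# Hyperparameters as listed above, with the deprecated
# `per_gpu_train_batch_size` renamed to `per_device_train_batch_size`.
HYPERPARAMS = {
    "output_dir": "./EsperBERTo",
    "overwrite_output_dir": True,
    "num_train_epochs": 1,
    "per_device_train_batch_size": 64,
    "save_steps": 10_000,
    "save_total_limit": 2,
    "prediction_loss_only": True,
}


def build_training_args(hyperparams=None):
    """Hypothetical helper: build TrainingArguments from the dict above."""
    # Imported lazily because transformers is a heavy dependency.
    from transformers import TrainingArguments

    return TrainingArguments(**(hyperparams or HYPERPARAMS))


# Quick look at the settings without touching transformers.
for name, value in HYPERPARAMS.items():
    print(f"{name} = {value!r}")
```

With `save_steps` at 10,000 and `save_total_limit` at 2, at most the two most recent checkpoints are kept in the output directory.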
| |
| The final training loss was `6.1178`. |
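
For a rough sense of scale, a cross-entropy loss can be converted to a perplexity. The sketch below assumes the reported loss is the mean masked-token cross-entropy in nats (the PyTorch default), which this card does not state explicitly. Under that assumption the loss corresponds to a perplexity of roughly 454, compared with 52,000 for a uniform guess over the vocabulary.

```python
import math

FINAL_LOSS = 6.1178  # final training loss reported above
VOCAB_SIZE = 52_000  # vocabulary size of the tokenizer

# Assumption: the loss is a mean cross-entropy measured in nats.
perplexity = math.exp(FINAL_LOSS)

# Loss of a model that assigns equal probability to every token.
uniform_loss = math.log(VOCAB_SIZE)

print(f"perplexity on masked tokens: {perplexity:.0f}")
print(f"uniform-guess loss: {uniform_loss:.2f} nats")
```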
| |
| ## Evaluation results |
| |
| The notebook did not evaluate the model on a downstream task. Its predictions can, however, be spot-checked with the `fill-mask` pipeline, as in the two examples below. |
| |
| Example 1: |
| ```python |
| from transformers import pipeline |
| |
| # Load the model and tokenizer saved in the training output directory. |
| fill_mask = pipeline( |
|     "fill-mask", |
|     model="./EsperBERTo", |
|     tokenizer="./EsperBERTo", |
| ) |
| |
| fill_mask("La suno <mask>.") |
| ``` |
| Output: |
| ``` |
| [{'score': 0.013023526407778263, 'token': 316, 'token_str': ' estas', 'sequence': 'La suno estas.'}, |
| {'score': 0.008523152209818363, 'token': 607, 'token_str': ' min', 'sequence': 'La suno min.'}, |
| {'score': 0.007405377924442291, 'token': 2575, 'token_str': ' okuloj', 'sequence': 'La suno okuloj.'}, |
| {'score': 0.007219308987259865, 'token': 1635, 'token_str': ' tago', 'sequence': 'La suno tago.'}, |
| {'score': 0.006888304837048054, 'token': 394, 'token_str': ' estis', 'sequence': 'La suno estis.'}] |
| ``` |
| |
| Example 2: |
| ```python |
| # Reuses the `fill_mask` pipeline created in Example 1. |
| fill_mask("Jen la komenco de bela <mask>.") |
| ``` |
| Output: |
| ``` |
| [{'score': 0.016247423365712166, 'token': 1635, 'token_str': ' tago', 'sequence': 'Jen la komenco de bela tago.'}, |
| {'score': 0.009718689136207104, 'token': 1021, 'token_str': ' tempo', 'sequence': 'Jen la komenco de bela tempo.'}, |
| {'score': 0.007543196901679039, 'token': 2257, 'token_str': ' kongreso', 'sequence': 'Jen la komenco de bela kongreso.'}, |
| {'score': 0.0071307034231722355, 'token': 1161, 'token_str': ' vivo', 'sequence': 'Jen la komenco de bela vivo.'}, |
| {'score': 0.006644904613494873, 'token': 758, 'token_str': ' jaroj', 'sequence': 'Jen la komenco de bela jaroj.'}] |
| ``` |
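
The pipeline returns a list of dictionaries, one per candidate token. A minimal sketch of post-processing that output is shown below, using the first three predictions from Example 2 as literal data so that it runs without the model. The helper name `summarize` and the `min_score` threshold are hypothetical, introduced only for illustration.

```python
# First three predictions copied from the Example 2 output above.
predictions = [
    {"score": 0.016247423365712166, "token": 1635, "token_str": " tago",
     "sequence": "Jen la komenco de bela tago."},
    {"score": 0.009718689136207104, "token": 1021, "token_str": " tempo",
     "sequence": "Jen la komenco de bela tempo."},
    {"score": 0.007543196901679039, "token": 2257, "token_str": " kongreso",
     "sequence": "Jen la komenco de bela kongreso."},
]


def summarize(preds, min_score=0.0):
    """Return (token, score) pairs, best first, keeping scores >= min_score."""
    ranked = sorted(preds, key=lambda p: p["score"], reverse=True)
    # Byte-level BPE tokens carry their leading space, so strip it for display.
    return [(p["token_str"].strip(), p["score"])
            for p in ranked if p["score"] >= min_score]


for token, score in summarize(predictions):
    print(f"{token:<10} {score:.4f}")
```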
| |
| ## Intended uses & limitations |
| |
| This model is intended to be a general-purpose language model for Esperanto. It can be used for masked language modeling and can be fine-tuned for various downstream tasks such as: |
| - Text Classification |
| - Token Classification (Part-of-Speech Tagging, Named Entity Recognition) |
| - Question Answering |
| |
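| The model was trained for a single epoch on a relatively small corpus, and the top `fill-mask` predictions shown above all score below 0.02, so its out-of-the-box predictions are weak. Fine-tuning on a task-specific dataset is recommended before using it for any of the downstream tasks listed above. |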