Model Details
LEMMIS (Lithuanian: lietuviškas ekspertų mišinio modelis, "Lithuanian mixture-of-experts model") is a first-of-a-kind, natively trained Mixture-of-Experts (MoE) transformer language model designed for and pretrained exclusively on Lithuanian text. Unlike existing Lithuanian LLMs (Lt-Llama, LT-MLKM), which adapt English-trained base models, LEMMIS was built from the ground up: the architecture, tokenizer, and training data are all Lithuanian-first.
- Developed by: Aividas Šilingas
- Language(s): Lithuanian
- License: BSD 3-Clause License
Intended Use Case
LEMMIS is intended for research on Lithuanian language modeling. Both the base and instruct models are proof-of-concept prototypes demonstrating that a Lithuanian-native LLM architecture can achieve coherent language generation. They are not intended for production use or deployment in critical applications.
Model Architecture
| Parameter | Value |
|---|---|
| Total parameters | 1.24B |
| Active parameters | ~411M |
| Hidden dim | 1024 |
| Layers | 16 |
| Attn heads | 16 |
| Experts (total / active) | 8 / 2 |
| Expert FFN dim | 2816 |
| Sequence length | 2048 |
| Vocab size | 65,546 |
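The total and active parameter counts can be roughly reproduced from the other rows of the table. The sketch below is a back-of-the-envelope estimate, not the author's accounting: it assumes a gated expert FFN with three weight matrices, tied input/output embeddings, bias-free linear layers, and a single linear router per layer, none of which the table states. Under those assumptions the numbers land close to the reported 1.24B and ~411M.

```python
# Back-of-the-envelope parameter count from the architecture table.
# Assumptions (not confirmed by the model card): gated 3-matrix expert FFN,
# tied input/output embeddings, bias-free linear layers, norm weights ignored.
HIDDEN = 1024
LAYERS = 16
EXPERTS_TOTAL = 8
EXPERTS_ACTIVE = 2
EXPERT_FFN = 2816
VOCAB = 65_546

attention_per_layer = 4 * HIDDEN * HIDDEN   # q, k, v, o projections
expert_params = 3 * HIDDEN * EXPERT_FFN     # gate, up, and down matrices
router_per_layer = HIDDEN * EXPERTS_TOTAL   # one routing logit per expert
embedding = VOCAB * HIDDEN                  # counted once because of tying


def count_params(experts_per_layer: int) -> int:
    """Parameters used when `experts_per_layer` experts are counted per layer."""
    per_layer = (
        attention_per_layer
        + router_per_layer
        + experts_per_layer * expert_params
    )
    return embedding + LAYERS * per_layer


total_params = count_params(EXPERTS_TOTAL)
active_params = count_params(EXPERTS_ACTIVE)

print(f"total:  {total_params / 1e9:.2f}B")
print(f"active: {active_params / 1e6:.0f}M")
```

Only the expert term changes between the two calls, which is why the active count is about a third of the total even though just a quarter of the experts fire per token.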
Tokenizer
LEMMIS uses a custom BPE tokenizer trained with SentencePiece on a subset of the corpus. Fertility (average tokens per word, lower is better) was measured on 50,000 Lithuanian sentences:
| Tokenizer | Fertility (tokens/word) |
|---|---|
| LEMMIS | 1.49 |
| LT-MLKM-modernBERT | 2.31 |
| GPT-4 (cl100k_base) | 3.55 |
LEMMIS achieves near-English-level tokenization efficiency (1.3–1.5 tokens per word is typical for English) on a morphologically complex language. In practice, roughly 2.4x more Lithuanian text (3.55 / 1.49) fits in the same context window compared to GPT-4's tokenizer.
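For readers who want to run a similar comparison, here is a minimal sketch of a fertility measurement. It assumes that words are whitespace-separated, which may differ from how the numbers above were obtained, and `toy_tokenize` is a hypothetical stand-in so the example runs without any tokenizer files. In a real run you would pass the encode function of the tokenizer under test.

```python
from typing import Callable, Iterable, List


def fertility(sentences: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    n_tokens = 0
    n_words = 0
    for sentence in sentences:
        n_tokens += len(tokenize(sentence))
        n_words += len(sentence.split())
    if n_words == 0:
        raise ValueError("no words to measure")
    return n_tokens / n_words


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical tokenizer: cuts each word into chunks of at most 4 characters."""
    return [word[i:i + 4] for word in text.split() for i in range(0, len(word), 4)]


sample = ["Kartą gyveno senelis ir senelė"]
toy_fertility = fertility(sample, toy_tokenize)

# Ratio of the reported fertilities: how much more text fits in a fixed token budget.
context_gain_vs_gpt4 = 3.55 / 1.49

print(f"toy fertility: {toy_fertility:.2f}")
print(f"context gain vs cl100k_base: {context_gain_vs_gpt4:.2f}x")
```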
Training
LEMMIS was trained on a large, curated corpus of Lithuanian text. Pretraining ran for a single epoch over the full dataset, using mixed precision and the AdamW optimizer.
| Source | Description |
|---|---|
| CulturaX-LT + GlotCC-LT | Web-crawled Lithuanian text |
| TAR legal texts | Lithuanian laws, decrees, and government decisions since 1992 |
| Delfi news | ~719k articles, 10-year scrape |
| Lithuanian books | ~2,500 digitized archived public domain books |
| Wikipedia (LT) | Full Lithuanian Wikipedia |
| Supermama | Lithuanian forum posts |
| Other | Reddit (r/lithuania, r/lietuva), 15min, LRT, Lrytas |
Instruct tuning
The instruct variant uses LoRA (rank 32, alpha 64) on the attention projections (q/k/v/o), trained on nearly 50,000 examples from Alpaca-LT, a translated subset of Dolly 2.0, and synthetic Lithuanian examples.
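To give a sense of how small the adapter is relative to the base model, the sketch below estimates the number of trainable LoRA parameters. It assumes each of the q/k/v/o projections is a square 1024 x 1024 matrix (no grouped-query attention) in all 16 layers, which the model card does not state explicitly.

```python
# Trainable-parameter estimate for the LoRA setup described above.
# Assumption: q, k, v, and o are each square HIDDEN x HIDDEN projections.
HIDDEN = 1024
LAYERS = 16
RANK = 32
ALPHA = 64
TARGETS = ("q", "k", "v", "o")

# Each adapted projection gains two matrices: A (RANK x in) and B (out x RANK).
params_per_projection = RANK * HIDDEN + HIDDEN * RANK
lora_trainable = params_per_projection * len(TARGETS) * LAYERS

# Standard LoRA scales the low-rank update by alpha / rank.
lora_scale = ALPHA / RANK

print(f"trainable LoRA parameters: {lora_trainable:,}")
print(f"update scale (alpha / rank): {lora_scale}")
```

Under these assumptions the adapter trains about 4.2M parameters, a small fraction of the 1.24B in the base model.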
Usage
LEMMIS is a custom PyTorch model and is not compatible with Hugging Face Transformers out of the box. To run inference:
import torch
import sentencepiece as spm
from model.transformer import Lemmis, ModelConfig
# Load tokenizer
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer/lemmis_tokenizer.model")
# Load model
checkpoint = torch.load("lemmis_1.24B_bf16.pt", map_location="cpu", weights_only=False)
config = checkpoint["config"]
model = Lemmis(config)
model.load_state_dict(checkpoint["model_state_dict"], strict=True)
model = model.to(torch.bfloat16).to("cuda")
model.eval()
# Generate
input_text = "Kartą gyveno senelis ir senelė"
input_ids = torch.tensor([sp.Encode(input_text)], dtype=torch.long, device="cuda")
with torch.no_grad():  # inference only, so skip gradient tracking
    output_ids = model.generate(input_ids, max_new_tokens=100, temperature=0.9, top_p=0.95, repetition_penalty=1.35)
print(sp.Decode(output_ids[0].tolist()))
For the instruct model, wrap prompts with this format:
### Instruction:
{your question}
### Response:
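A small helper can keep the prompt format consistent. This is a minimal sketch: `build_prompt` is a hypothetical name, and the exact whitespace (single newlines, a trailing newline after the response header) is an assumption, since only the three lines of the format are given.

```python
def build_prompt(question: str) -> str:
    """Wrap a question in the instruct format shown above.

    Assumption: lines are separated by single newlines and the prompt ends
    with a newline after the response header.
    """
    return f"### Instruction:\n{question.strip()}\n### Response:\n"


prompt = build_prompt("Kas yra Lietuvos sostinė?")
print(prompt)
```

The resulting string can be encoded with the tokenizer and passed to the model in place of the raw input text.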
Evaluation
The model was evaluated on a translated MMLU subset (abstract algebra, anatomy, astronomy, business ethics). The results were not meaningful: the model lacks the scale for multi-step reasoning and structured multiple-choice answering, which is consistent with how sub-7B models generally perform on MMLU. The instruct model appears to generate grammatically correct Lithuanian, with proper case inflections and verb conjugations, and the generated words are usually real Lithuanian words. It follows the instruction format used for LoRA tuning and generates up to ~60 tokens before quality starts to degrade.
Acknowledgments
This project was built by Aividas, a student at KTMC (Klaipėdos technologijų mokymo centras). Training compute was provided by Vast.ai. The project was funded out of my own pocket.