Model Details

LEMMIS (Lithuanian: lietuviškas ekspertų mišinio modelis, "Lithuanian mixture-of-experts model") is a first-of-its-kind Mixture-of-Experts (MoE) transformer language model designed for, and pretrained exclusively on, Lithuanian text. Unlike existing Lithuanian LLMs (Lt-Llama, LT-MLKM), which adapt English-trained base models, LEMMIS was built from the ground up: the architecture, tokenizer, and training data are all Lithuanian-first.

  • Developed by: Aividas Šilingas
  • Language(s): Lithuanian
  • License: BSD 3-Clause License

Intended Use Case

LEMMIS is intended for research on Lithuanian language modeling. Both the base and instruct models are proof-of-concept prototypes demonstrating that a Lithuanian-native LLM architecture can achieve coherent language generation. They are not intended for production use or deployment in critical applications.

Model Architecture

Parameter                   Value
Total parameters            1.24B
Active parameters           ~411M
Hidden dim                  1024
Layers                      16
Attn heads                  16
Experts (total / active)    8 / 2
Expert FFN dim              2816
Sequence length             2048
Vocab size                  65,546
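As a sanity check, the total and active parameter counts can be roughly reproduced from the other rows of the table. The sketch below is an estimate, not the author's accounting: it assumes a gated expert FFN with three weight matrices per expert, bias-free linear layers, full-width q/k/v/o projections, input and output embeddings that share weights, and it ignores normalization weights. None of these details are stated in this card.

```python
# Back-of-the-envelope parameter count from the architecture table.
# Assumptions (not stated in the model card): gated three-matrix expert FFN,
# bias-free linear layers, tied embeddings, normalization weights ignored.

HIDDEN_DIM = 1024
NUM_LAYERS = 16
NUM_EXPERTS = 8
ACTIVE_EXPERTS = 2
EXPERT_FFN_DIM = 2816
VOCAB_SIZE = 65_546


def count_params(experts_used: int) -> int:
    """Count weights when `experts_used` experts per layer are included."""
    expert = 3 * HIDDEN_DIM * EXPERT_FFN_DIM   # gate, up, and down projections
    attention = 4 * HIDDEN_DIM * HIDDEN_DIM    # q, k, v, o projections
    router = HIDDEN_DIM * NUM_EXPERTS          # one routing score per expert
    per_layer = experts_used * expert + attention + router
    embedding = VOCAB_SIZE * HIDDEN_DIM        # assumed shared with the output head
    return NUM_LAYERS * per_layer + embedding


total = count_params(NUM_EXPERTS)
active = count_params(ACTIVE_EXPERTS)
print(f"total:  {total / 1e9:.2f}B")
print(f"active: {active / 1e6:.0f}M")
```

Under these assumptions the estimate comes out at about 1.24B total and 411M active parameters, in line with the table. The router is counted in both figures because it runs for every token regardless of which experts are selected.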

Tokenizer

LEMMIS uses a custom BPE tokenizer trained with SentencePiece on a subset of the corpus. Fertility (average tokens per word) was benchmarked on 50,000 Lithuanian sentences:

Tokenizer              Fertility (tokens/word)
LEMMIS                 1.49
LT-MLKM-modernBERT     2.31
GPT-4 (cl100k_base)    3.55

LEMMIS achieves near-English-level tokenization efficiency (1.3–1.5 tokens per word is typical for English) on a morphologically complex language. In practice, roughly 2.4x more Lithuanian text (3.55 / 1.49) fits in the same context window than with GPT-4's tokenizer.
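Fertility is simply total tokens divided by total words. The sketch below shows one way to compute it, assuming words are whitespace-separated; the card does not say how words were counted in the benchmark. The `toy_encode` function is a hypothetical stand-in so the example runs without a tokenizer model file.

```python
from typing import Callable, Iterable, List


def fertility(sentences: Iterable[str], encode: Callable[[str], List]) -> float:
    """Average number of tokens per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for sentence in sentences:
        total_tokens += len(encode(sentence))
        total_words += len(sentence.split())
    if total_words == 0:
        raise ValueError("no words found in the input sentences")
    return total_tokens / total_words


def toy_encode(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: cuts each word into chunks of up to 4 characters."""
    return [word[i:i + 4] for word in text.split() for i in range(0, len(word), 4)]


sample = ["Kartą gyveno senelis ir senelė"]
toy_fertility = fertility(sample, toy_encode)

# Ratio of the reported fertilities: how much more text fits per context window.
context_gain = 3.55 / 1.49
print(f"toy fertility: {toy_fertility:.2f}, context gain vs. GPT-4: {context_gain:.2f}x")
```

With the real tokenizer loaded as in the Usage section, `fertility(sentences, sp.Encode)` would take the place of the toy encoder.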

Training

LEMMIS was trained on a large, curated corpus of Lithuanian text. Pretraining ran for a single epoch over the full dataset, using mixed precision and the AdamW optimizer.

Source                     Description
CulturaX-LT + GlotCC-LT    Web-crawled Lithuanian text
TAR legal texts            Lithuanian laws, decrees, and government decisions since 1992
Delfi news                 ~719k articles, 10-year scrape
Lithuanian books           ~2,500 digitized archived public domain books
Wikipedia (LT)             Full Lithuanian Wikipedia
Supermama                  Lithuanian forum posts
Other                      Reddit (r/lithuania, r/lietuva), 15min, LRT, Lrytas

Instruct tuning

The instruct variant applies LoRA (rank 32, alpha 64) to the attention projections (q/k/v/o) and was trained on nearly 50,000 examples drawn from Alpaca-LT, a translated subset of Dolly 2.0, and synthetic Lithuanian examples.
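To give a sense of how small the adapter is, the sketch below estimates the number of trainable LoRA weights. It assumes each of the four attention projections is a square 1024 x 1024 matrix in all 16 layers, which this card does not confirm, and it uses the standard LoRA scaling factor of alpha / rank.

```python
# Rough size of the LoRA adapter described above.
# Assumption (not stated in the model card): q, k, v, and o are each
# HIDDEN_DIM x HIDDEN_DIM in every layer.

HIDDEN_DIM = 1024
NUM_LAYERS = 16
RANK = 32
ALPHA = 64
TARGETS = ("q", "k", "v", "o")


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA pair adds matrices of shape (rank, d_in) and (d_out, rank)."""
    return rank * (d_in + d_out)


scaling = ALPHA / RANK  # multiplier applied to the low-rank update
per_projection = lora_params(HIDDEN_DIM, HIDDEN_DIM, RANK)
trainable = per_projection * len(TARGETS) * NUM_LAYERS
print(f"scaling: {scaling}, trainable LoRA weights: {trainable:,}")
```

Under these assumptions the adapter holds about 4.2M trainable weights, a small fraction of the 1.24B-parameter base model.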

Usage

LEMMIS is a custom PyTorch model and is not compatible with the Hugging Face Transformers library out of the box. To run inference:

import torch
import sentencepiece as spm
from model.transformer import Lemmis, ModelConfig

# Load tokenizer
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer/lemmis_tokenizer.model")

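# Load model. Note: weights_only=False lets torch.load unpickle arbitrary
# Python objects, so only load checkpoints from a source you trust.
checkpoint = torch.load("lemmis_1.24B_bf16.pt", map_location="cpu", weights_only=False)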
config = checkpoint["config"]
model = Lemmis(config)
model.load_state_dict(checkpoint["model_state_dict"], strict=True)
model = model.to(torch.bfloat16).to("cuda")
model.eval()

# Generate
input_text = "Kartą gyveno senelis ir senelė"
input_ids = torch.tensor([sp.Encode(input_text)], dtype=torch.long, device="cuda")
output_ids = model.generate(input_ids, max_new_tokens=100, temperature=0.9, top_p=0.95, repetition_penalty=1.35)
print(sp.Decode(output_ids[0].tolist()))

For the instruct model, wrap prompts with this format:

### Instruction:
{your question}

### Response:
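The template above can be applied with a small helper like the one sketched here. The function name is hypothetical, and the exact whitespace (one blank line before the response header, a trailing newline after it) is an assumption read off the template rather than something the card confirms.

```python
def build_instruct_prompt(instruction: str) -> str:
    """Wrap a user instruction in the instruct template shown above."""
    return f"### Instruction:\n{instruction.strip()}\n\n### Response:\n"


prompt = build_instruct_prompt("Kas yra Lietuvos sostinė?")
print(prompt)
```

The resulting string can be passed to `sp.Encode` in place of `input_text` in the inference example.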

Evaluation

The model was evaluated on a translated MMLU subset (abstract algebra, anatomy, astronomy, business ethics). The results were not meaningful: the model lacks the scale for multi-step reasoning and structured multiple-choice answering. This is broadly consistent with how small (sub-7B) models tend to perform on MMLU.

Qualitatively, the instruct model appears to generate grammatically correct Lithuanian, with proper case inflections and verb conjugations. Generated words are usually real Lithuanian words. The model follows the instruction format used during LoRA tuning and generates up to ~60 tokens before quality starts to degrade.

Acknowledgments

This project was built by Aividas, a student at KTMC (Klaipėdos technologijų mokymo centras). Training compute was provided by Vast.ai and funded out of my own pocket.
