Overview
This repository contains a Grapheme-Aware Tokenizer (GAT) specifically trained for Kannada, designed to handle the unique orthographic and phonological structure of the language. Unlike traditional subword tokenizers such as BPE, SentencePiece, or WordPiece, this tokenizer operates at the grapheme level, improving representation fidelity and reducing subword fragmentation.
Available Vocabulary Sizes
This repository includes three tokenizer variants:
| Vocabulary | File |
|---|---|
| 8k | GAT_Kannada_8k.json |
| 16k | GAT_Kannada_16k.json |
| 32k | GAT_Kannada_32k.json (recommended) |
Why Grapheme-Aware Preprocessing?
Kannada is written in an abugida script, in which a single grapheme (akshara) may be composed of:
- multiple consonants
- a halant (virama)
- vowel diacritics (matra)
For example:
ಕ್ರಿ
is a single grapheme, yet it consists of four Unicode codepoints:
ಕ + ್ + ರ + ಿ (KA + virama + RA + vowel sign I), which byte- or character-level tokenizers typically split into 3–4 fragments.
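The decomposition can be checked with Python's standard `unicodedata` module:

```python
import unicodedata

# Inspect the codepoints that make up the single grapheme ಕ್ರಿ.
for ch in "ಕ್ರಿ":
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# U+0C95  KANNADA LETTER KA
# U+0CCD  KANNADA SIGN VIRAMA
# U+0CB0  KANNADA LETTER RA
# U+0CBF  KANNADA VOWEL SIGN I
```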
Problem with BPE / SentencePiece / WordPiece
These tokenizers operate at the byte or character level, so they routinely split an akshara across token boundaries.
This results in:
- fragmented, unstable semantic units
- poorer compression
- higher fertility (more tokens per word)
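The scale of the problem is easy to see: every Kannada codepoint occupies three bytes in UTF-8, so a byte-level tokenizer starts from twelve raw symbols for the single grapheme ಕ್ರಿ:

```python
# The single akshara ಕ್ರಿ spans 4 codepoints and 12 UTF-8 bytes,
# so a byte-level tokenizer sees 12 symbols for one grapheme.
text = "ಕ್ರಿ"
print(len(text))                  # 4 codepoints
print(len(text.encode("utf-8")))  # 12 bytes
```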
GAT Solution
GAT applies a custom grapheme parser that merges these components into a single atomic unit, yielding stable semantic units, better compression, and more efficient tokenization.
The parser is a rule-based finite-state machine that correctly handles:
- consonants
- vowels
- halants
- vowel signs
- anusvara & visarga
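A minimal sketch of such a segmenter, written as a regular expression over Kannada Unicode ranges (the actual GAT parser is a finite-state machine and may cover more cases, e.g. nukta and ZWJ/ZWNJ; the ranges and rules below are illustrative assumptions, not the shipped implementation):

```python
import re

# Illustrative sketch: an akshara is one or more consonants joined by
# viramas, an optional dependent vowel sign, and an optional
# anusvara/visarga; independent vowels stand on their own.
CONSONANT = "[\u0C95-\u0CB9]"    # ಕ..ಹ
VIRAMA = "\u0CCD"                # halant
VOWEL_SIGN = "[\u0CBE-\u0CCC]"   # dependent vowel signs (matras)
INDEP_VOWEL = "[\u0C85-\u0C94]"  # independent vowels
MODIFIER = "[\u0C82\u0C83]"      # anusvara, visarga

AKSHARA = re.compile(
    f"(?:{CONSONANT}(?:{VIRAMA}{CONSONANT})*{VIRAMA}?{VOWEL_SIGN}?"
    f"|{INDEP_VOWEL}){MODIFIER}?|."
)

def segment(text: str) -> list[str]:
    """Split text into akshara-level graphemes (fallback: single char)."""
    return AKSHARA.findall(text)

print(segment("ಕ್ರಿ"))   # ['ಕ್ರಿ'] — one atomic unit, not 4 fragments
print(segment("ನಿಮ್ಮ"))  # ['ನಿ', 'ಮ್ಮ']
```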
After grapheme segmentation, Byte Pair Encoding (BPE) is applied to learn higher-level merges.
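Because merges operate on whole graphemes rather than bytes, a BPE merge can never split an akshara. A toy illustration of a single merge step over grapheme-segmented sequences (simplified; real training repeats this until the target vocabulary size is reached):

```python
from collections import Counter

def best_pair(sequences):
    """Most frequent adjacent symbol pair across all sequences."""
    counts = Counter(
        (a, b) for seq in sequences for a, b in zip(seq, seq[1:])
    )
    return counts.most_common(1)[0][0] if counts else None

def apply_merge(seq, pair):
    """Replace every occurrence of `pair` in `seq` with the merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

# Grapheme-segmented words (hypothetical segmenter output)
corpus = [["ಕ", "ನ್ನ", "ಡ"], ["ಕ", "ನ್ನ", "ಡಿ", "ಗ"]]
pair = best_pair(corpus)                    # ('ಕ', 'ನ್ನ') occurs twice
corpus = [apply_merge(s, pair) for s in corpus]
print(corpus)  # [['ಕನ್ನ', 'ಡ'], ['ಕನ್ನ', 'ಡಿ', 'ಗ']]
```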
Training Data
Tokenizer training uses a composite 4.5M-sentence Kannada corpus:
- Samanantar Dataset (AI4Bharat)
- Kannada-Instruct Dataset (Cognitive Lab)
This provides broad coverage of conversational, literary, and instruction-following Kannada.
Tokenizer Metrics
These metrics evaluate tokenizer quality independent of any downstream NLP model.
Compression Ratio (CR)
Higher = better (more bytes of raw text represented per token)
Fertility Score (FS)
Lower = better (number of tokens produced per grapheme/character)
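A sketch of how such metrics can be computed; the exact formulas behind the reported numbers are not specified in this card, so the definitions below are assumptions based on the descriptions above:

```python
def compression_ratio(text: str, tokens: list[str]) -> float:
    """UTF-8 bytes of raw text per token produced (higher = better)."""
    return len(text.encode("utf-8")) / len(tokens)

def fertility(tokens: list[str], num_graphemes: int) -> float:
    """Tokens produced per grapheme (lower = better)."""
    return len(tokens) / num_graphemes

# Toy example: a 12-byte grapheme tokenized as a single token
text = "ಕ್ರಿ"
tokens = ["ಕ್ರಿ"]
print(compression_ratio(text, tokens))      # 12.0
print(fertility(tokens, num_graphemes=1))   # 1.0
```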
Results for CR and FS
GAT consistently achieved a better compression ratio and fertility score as vocabulary size grows:

| Vocabulary | CR (higher = better) | FS (lower = better) |
|---|---|---|
| 8k | 3.5 | 2.1 |
| 16k | 3.9 | 1.8 |
| 32k | 4.8 | 1.6 |
💻 Usage Example
Load the 32k tokenizer:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "varuni/GAT-K",
    tokenizer_file="GAT_Kannada_32k.json",
)

text = "ನಿಮ್ಮ ಹೆಸರು ಏನು?"
print(tokenizer.encode(text))
```
Related work
- M. Velayuthan and K. Sarveswaran, "Egalitarian Language Representation in Language Models: It All Begins with Tokenizers," COLING 2025. arXiv:2409.11501 [cs.CL]. DOI: https://doi.org/10.48550/arXiv.2409.11501
- "Unicode Normalization and Grapheme Parsing of Indic Languages," 2023. [Online]. Available: https://arxiv.org/abs/2306.01743
- M. K. H. and A. Giri, "Orthographic Syllable Pair Encoding for Language Modelling Tasks in Indic Languages," in 2023 IEEE MIT Undergraduate Research Technology Conference (URTC), Cambridge, MA, USA, pp. 1–6, 2023. DOI: https://doi.org/10.1109/URTC60662.2023.10534970
- R. Sennrich, B. Haddow, and A. Birch, "Neural Machine Translation of Rare Words with Subword Units," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 2016. [Online]. Available: https://arxiv.org/abs/1508.07909