Short Communication: Lightweight Tokenization for Indic Languages

Ivy Nguyen; Tomas Kučera

Received: Oct 10, 2025

Published: Oct 30, 2025

Abstract

We present a rule-augmented unigram tokenizer that yields 7–10% faster training and improved BLEU on Marathi and Telugu corpora compared with SentencePiece baselines.

Cite this article

Nguyen, I. & Kučera, T. (2025). Short Communication: Lightweight Tokenization for Indic Languages. Research Explorations in Global Knowledge & Technology (REGKT), 3(7). Retrieved from https://regkt.com/article.php?id=143&slug=short-communication-lightweight-tokenization-for-indic-languages
