Short Communication: Lightweight Tokenization for Indic Languages

Received: Oct 10, 2025
Published: Oct 30, 2025
Authors: Ivy Nguyen, Tomas Kučera

Abstract

We present a rule-augmented unigram tokenizer that yields 7–10% faster training and improved BLEU on Marathi and Telugu corpora compared to SentencePiece baselines.

Cite this article

Nguyen, I. & Kučera, T. (2025). Short Communication: Lightweight Tokenization for Indic Languages. Research Explorations in Global Knowledge & Technology (REGKT), 3(7). Retrieved from https://regkt.com/article.php?id=143&slug=short-communication-lightweight-tokenization-for-indic-languages
