Short Communication: Lightweight Tokenization for Indic Languages
Abstract
We present a rule-augmented unigram tokenizer yielding 7–10% faster training and improved BLEU on Marathi and Telugu corpora compared to SentencePiece baselines.
Cite this article
Nguyen, I. & Kučera, T. (2025). Short Communication: Lightweight Tokenization for Indic Languages. Research Explorations in Global Knowledge & Technology (REGKT), 3(7). Retrieved from https://regkt.com/article.php?id=143&slug=short-communication-lightweight-tokenization-for-indic-languages