A Transformer-Based Framework for Domain-Sensitive Amharic to English Machine Translation with Character-Aware Subword Encoding

Authors

  • Ragini Rai, Noida International University

DOI:

https://doi.org/10.70454/JRIST.2025.10202

Keywords:

Neural Machine Translation, Transformer, Amharic, Religious Texts, Subword Encoding, Character-Level Embedding, BLEU Score

Abstract

This paper proposes a domain-adapted neural machine translation (NMT) system for Amharic-to-English translation, addressing the challenges of low-resource translation in a morphologically rich, highly inflected language. We target the religious domain using the Tanzil corpus, a structured collection of Quranic verses translated into Amharic and English with strong coherence and semantic correspondence. To address the shortcomings of traditional word-level tokenization for Amharic, we implement character-level subword tokenization with the SentencePiece model, which handles rare and compound words more gracefully. Our system uses a Transformer-based encoder-decoder architecture with multi-head attention and feed-forward layers, trained on the parallel corpus of Amharic and English verses. The model achieved a BLEU score of 59.03 on the test set, substantially exceeding classical RNN-with-attention baselines, which are known to perform poorly in low-resource settings. This strong score shows that a well-tuned baseline Transformer, combined with a domain-specific corpus and effective subword methods, can perform well on translation tasks for under-resourced languages. The research provides a foundational, reproducible, and scalable framework for linguistically informed Amharic-English translation, one that can later be extended to other Semitic and morphologically rich languages. Our results highlight the value of domain adaptation and subword-aware architectures in advancing NMT for low-resource language communities.
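The paper does not include its preprocessing code, but the tokenization step described above maps directly onto the SentencePiece Python API. The sketch below is a minimal illustration only: the corpus file names, vocabulary size, and model type are assumptions, not the authors' settings. Setting character_coverage=1.0 retains every Ge'ez character, which matters for Amharic.

```python
import sentencepiece as spm

# Train a subword model on the parallel text.
# File names, vocab_size, and model_type are illustrative assumptions.
spm.SentencePieceTrainer.train(
    input="tanzil.am,tanzil.en",   # hypothetical corpus files
    model_prefix="amen_sp",
    vocab_size=8000,               # assumed; tune to corpus size
    model_type="unigram",          # SentencePiece default; BPE is an alternative
    character_coverage=1.0,        # keep all Ge'ez and Latin characters
)

sp = spm.SentencePieceProcessor(model_file="amen_sp.model")
print(sp.encode("ምስጋና ለአላህ ይገባው", out_type=str))    # Amharic subword pieces
print(sp.encode("Praise be to Allah", out_type=str))  # English subword pieces
```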
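The abstract names the architecture (a Transformer encoder-decoder with multi-head attention and feed-forward layers) but not its hyperparameters. A minimal PyTorch sketch follows, using the standard base-Transformer defaults as assumed values; the authors' actual configuration may differ, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class NMTTransformer(nn.Module):
    """Encoder-decoder Transformer over a shared subword vocabulary.

    Hyperparameters follow the base Transformer as an assumption;
    the paper does not report its exact settings. Positional
    encodings are omitted to keep the sketch short.
    """

    def __init__(self, vocab_size: int = 8000, d_model: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=8,                 # multi-head attention
            num_encoder_layers=6,
            num_decoder_layers=6,
            dim_feedforward=2048,    # position-wise feed-forward width
            dropout=0.1,
            batch_first=True,        # (batch, seq, feature) tensors
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        # Causal mask so each target position attends only to earlier ones.
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.transformer(self.embed(src), self.embed(tgt), tgt_mask=tgt_mask)
        return self.out(hidden)      # logits over the subword vocabulary

model = NMTTransformer()
src = torch.randint(0, 8000, (2, 17))  # dummy Amharic subword ids
tgt = torch.randint(0, 8000, (2, 15))  # dummy English subword ids
print(model(src, tgt).shape)           # torch.Size([2, 15, 8000])
```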
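The reported 59.03 is a corpus-level BLEU score on the held-out test set. For reference, this is how such a score is conventionally computed with the sacreBLEU library; the hypothesis and reference strings below are placeholders, not the paper's outputs.

```python
import sacrebleu

# Placeholder system outputs and references; not the paper's test data.
hypotheses = [
    "praise be to allah the lord of the worlds",
    "guide us to the straight path",
]
references = [[  # one reference stream, aligned with the hypotheses
    "praise be to allah , lord of the worlds",
    "guide us on the straight path",
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")  # the paper reports 59.03 on its test set
```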

Published

2025-12-30

Issue

Vol. 1 No. 2 (2025)

Section

Articles

How to Cite

A Transformer-Based Framework for Domain-Sensitive Amharic to English Machine Translation with Character-Aware Subword Encoding. (2025). Journal of Recent Innovation in Science and Technology, 1(2), 13-24. https://doi.org/10.70454/JRIST.2025.10202