Dr Surangika Ranathunga and Dr Nisansa de Silva of the Department of Computer Science and Engineering, University of Moratuwa, recently received Google’s Inclusion Research award for 2022. Dr Nisansa de Silva is also affiliated with LIRNEasia as a Research Fellow.
Google’s Inclusion Research Program supports academic research in computing and technology that addresses the needs of historically marginalized groups globally. Launched in 2020, it funds research proposals on topics such as accessibility, algorithmic fairness, and digital safety. Proposals are evaluated against a rubric that takes into account the qualifications and prior domain experience of the faculty, the broader impact and research merit of the proposed work, and the overall quality of the proposal.
While machine translation systems such as Google Translate support Sinhala and Tamil, their performance is not at an acceptable level across text from different domains such as news, government, religious, and scientific writing. Dr Surangika Ranathunga and Dr Nisansa de Silva propose to build a multi-domain Neural Machine Translation (NMT) system for Sinhala, Tamil, and English, in the hope that it will allow the Sri Lankan population to refer to information written in all three languages. In particular, the proposed system is expected to help the marginalized Tamil-speaking minority access Sinhala content. The first phase of the project is to estimate the quality of the existing parallel corpora for these language pairs and to denoise a few of those that have reasonable accuracy. The second phase will focus on building new parallel corpora for additional domains. The project expects to produce parallel corpora for at least 7 different domains, with each corpus containing at least 25,000 parallel sentences. In the final phase, knowledge distillation methods will be employed to build, on top of large language models (LLMs), a multi-domain NMT model that is robust to domain differences. The newly created datasets and the trained NMT models will be publicly released.