Researchers at the University of Surrey have developed the world’s first dataset designed to improve the quality of machine translation from English to Malayalam. With little data previously available to assess the accuracy of translation tools, progress on machine translation for the language has been slow, despite it being spoken by more than 38 million people in India.
To drive progress in machine translation for this low-resource language, the Surrey-led study – published in the ACL Anthology – tackles the problem head-on. Backed by funding from the European Association for Machine Translation (EAMT), the research focuses on two areas where low-resource languages suffer most: Quality Estimation, which predicts translation quality without a reference text, and Automatic Post-Editing (APE), which automatically corrects errors in machine-translated output.
To do this, the team curated 8,000 English-to-Malayalam translation segments drawn from finance, legal and news content – domains where precision is non-negotiable. Each segment was assessed by professional annotators at industry partner TechLiebe, who provided three independent quality scores alongside a corrected, post-edited version of the machine translation.
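For illustration only, a single record in a dataset of this kind could be represented roughly as follows. The field names and example values below are hypothetical and do not reflect the released schema.

```python
from dataclasses import dataclass

@dataclass
class TranslationSegment:
    """One English-to-Malayalam segment with human quality annotations.

    Field names are illustrative only; the actual released schema may differ.
    """
    source: str                                  # English source sentence
    mt_output: str                               # machine translation into Malayalam
    quality_scores: tuple[float, float, float]   # three independent annotator scores
    post_edited: str                             # annotator-corrected translation
    domain: str                                  # e.g. "finance", "legal", or "news"

# Hypothetical example (Malayalam text omitted for brevity):
segment = TranslationSegment(
    source="The court adjourned the hearing until Monday.",
    mt_output="...",            # MT system output in Malayalam
    quality_scores=(78.0, 82.0, 75.0),
    post_edited="...",          # human-corrected Malayalam translation
    domain="legal",
)
```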
Dr Diptesh Kanojia, Senior Lecturer at the Surrey Institute for People-Centred AI and project co-lead, said:
“Low-resource languages like Malayalam are often left behind simply because we don’t have the datasets needed to improve machine translation. Our work provides a strong foundation for both assessing and correcting translations – supporting Malayalam speakers while also opening the door to similar resources for many other underserved languages.”
The research also introduces an innovation with potentially wide-ranging implications: Weak Error Remarks. This additional layer of annotation allows human reviewers to quickly flag and describe specific translation errors – such as missing words, mistranslations or added phrases – without the burden of lengthy reports.
Early results show that when these brief human notes are paired with large language models, systems become significantly better at identifying why a translation failed, not just that it failed – already outperforming existing evaluation methods.
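As a rough sketch of how such remarks might be supplied to a large language model, the snippet below composes an evaluation prompt that includes the annotator’s note as extra context. The prompt wording, function name and rating scale are assumptions for illustration, not the study’s actual pipeline.

```python
def build_evaluation_prompt(source: str, mt_output: str, weak_error_remark: str) -> str:
    """Compose an LLM prompt that uses a short human error remark as added context.

    Illustrative only: the study's actual prompting setup and models are not
    described here.
    """
    return (
        "You are assessing an English-to-Malayalam machine translation.\n"
        f"Source (English): {source}\n"
        f"Machine translation (Malayalam): {mt_output}\n"
        f"Annotator remark: {weak_error_remark}\n"
        "Using the remark as a hint, explain what is wrong with the translation "
        "and rate its quality from 0 to 100."
    )

# Hypothetical usage:
prompt = build_evaluation_prompt(
    source="The invoice must be settled within 30 days.",
    mt_output="...",  # Malayalam MT output
    weak_error_remark="mistranslation of 'settled'; the 30-day deadline is missing",
)
```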
“Malayalam is one of India’s classical languages, spoken by millions, yet it remains severely under-resourced for reference-free machine translation evaluation. By introducing Weak Error Remarks, we offer a lightweight and interpretable form of human-annotated supervision that captures translation errors beyond scalar scores. This added context enables learning signals that help large language models reason more effectively about translation quality, demonstrating how simple, human-centric annotations can significantly strengthen MT evaluation in low-resource settings.”
The team has now completed most of the annotations, with a public release of the dataset scheduled for April 2026. Beyond Malayalam, the methodology offers a template for improving machine translation across dozens of underserved languages – including many spoken across India, Africa and Creole-speaking regions – where high-quality data remains scarce but demand is growing fast.

