Misinformation and Language Resources — LIRNEasia

In 2018, when a racially motivated mob ran riot in Digana, Sri Lanka, social network sites were named as one of the enablers of virulent hatred and misinformation. The same happened in Myanmar, a country where LIRNEasia has spent years conducting research.

Building on prior work, we began to investigate automated bad actors on social media following these incidents, examining hundreds of thousands of tweets to understand their impact. Simultaneously, funded by IDRC, we began to investigate the policy impact and technological feasibility of hate speech moderation.

As the failure of social media to moderate content became apparent, we delved deeper into the technical reasons why languages in the Global South seem so poorly handled by machine learning – be it hate speech detection or translation. Our research – since cited by Wired, Foreign Policy, MIT Technology Review, the Palgrave Handbook of the Public Servant, and others – examined the state of computational linguistics, from the ruminations of Descartes to modern-day tools, frameworks, and research initiatives. Contrasting English, Sinhala and Tamil, we pointed out that the majority of languages in the world lack the fundamental computational resources needed to be properly analyzed, especially in their colloquial forms. We proposed that social network sites, as repositories of colloquial text, open up their data for the creation of corpora, tokenizers, machine learning models, and the like.

In 2020, with access to Facebook data, we published a corpus of 28 million words from a decade of Sri Lankan Facebook, including separate Sinhala-specific corpora and stopwords – the first of its kind. This corpus is now being used as the basis for sentiment detection modelling in Sinhala. 

Our ongoing research examines the use of AI for misinformation, including the state of the art and the design and testing of 400+ machine learning models to examine algorithmic efficacy, data requirements, and hardware and liveware costs. Funded by the Asia Foundation, the project also builds new misinformation datasets and models for Sinhala and Bengali. As part of a new, ongoing strand of research, we’re also taking a deeper, more qualitative look at both the challenges faced by regional fact checkers and journalists and the practicalities of technology adoption among them. LIRNEasia is also undertaking a scoping study funded by IDRC to understand the nature of information disorder, the measures taken to counteract it, and the gaps in action and research in Asia Pacific. The study output will consist of a map of actors and frameworks; an evaluation of the current approaches and tools used by stakeholder groups to counter information disorder; and an overview of the research landscape.


  • A Corpus and Machine Learning Models for Fake News Classification in Sinhala

    We present a dataset consisting of 3576 documents in Sinhala, drawn from Sri Lankan news websites and factchecking operations, annotated as CREDIBLE, FALSE, PARTIAL or UNCERTAIN. The dataset has markers for the content of the document, the classification, the web domain from which each document was retrieved, and the date on which the document was published. We also present the results of misinformation classification models built for the Sinhala language, as well as comparisons to English benchmarks, and suggest that for smaller media ecosystems it may make more practical sense to model uncertainty instead of truth vs falsehood binaries.

  • The Control of Hate Speech on Social Media: Lessons from Sri Lanka

    As hate speech on social media becomes an ever-increasing problem, policymakers may look to more authoritarian measures for policing content. Several countries have already, at some stage, banned networks such as Facebook and Twitter (Liebelson, 2017).

  • Sinhala Language Corpora and Stopwords from a Decade of Sri Lankan Facebook

    This paper presents two colloquial Sinhala language corpora from the language efforts of the Data, Analysis and Policy team of LIRNEasia, as well as a list of algorithmically derived stopwords. The larger of the two corpora spans 2010 to 2020 and contains 28,825,820 to 29,549,672 words of multilingual text posted by 533 Sri Lankan Facebook pages, including politics, media, celebrities, and other categories; the smaller corpus amounts to 5,402,76 words of only Sinhala text extracted from the larger.

  • Artificial Intelligence for Factchecking: Observations on the State and Practicality of the Art

    We summarize the state of progress in artificial intelligence as used for classifying misinformation, or ‘fake news’. Making a case for AI in an assistive capacity for factchecking, we briefly examine the history of the field, divide current work into ‘classical machine learning’ and ‘deep learning’, and for both, examine the work that has led to certain algorithms becoming the de facto standards for this type of text classification task.

  • How Much Bullshit Do We Need? Benchmarking Classical Machine Learning for Fake News Classification

    In a practical experiment, we benchmark five common text classification algorithms – Naive Bayes, Logistic Regression, Support Vector Machines, Random Forests, and eXtreme Gradient Boosting – on multiple misinformation datasets, accounting for both data-rich and data-poor environments.

  • Natural Language Processing for Government: Problems and Potential

    A whitepaper distilling LIRNEasia’s current thoughts on the possibilities of, and issues with, the computational extraction of syntactic and semantic information from digital text.



Blogs and Updates