In 2018, when mob violence broke out in Digana, Sri Lanka, social network sites were named among the enablers of virulent hatred and misinformation. The same was true in Myanmar, a country in which LIRNEasia has spent years conducting research.
Building on prior work, we began investigating automated bad actors on social media in the wake of these incidents, examining hundreds of thousands of tweets to understand their impact. Simultaneously, with funding from IDRC, we began to investigate the policy implications and technological feasibility of hate speech moderation.
As the failure of social media platforms to moderate content became apparent, we delved deeper into the underlying technical reasons why languages of the Global South are so poorly handled by machine learning, whether in hate speech detection or translation. Our research – since cited by Wired, Foreign Policy, MIT Technology Review, the Palgrave Handbook of the Public Servant, and others – examined the state of computational linguistics, from the ruminations of Descartes to modern-day tools, frameworks, and research initiatives. Contrasting English, Sinhala, and Tamil, we pointed out that the majority of the world's languages lack the fundamental computational resources to be properly analyzed, especially in their colloquial forms. We proposed that social network sites, as repositories of colloquial text, open up their data for the creation of corpora, tokenizers, machine learning models, and the like.
In 2020, with access to Facebook data, we published a corpus of 28 million words drawn from a decade of Sri Lankan Facebook, including separate Sinhala-specific corpora and stopwords – the first of their kind. This corpus is now being used as the basis for sentiment detection modelling in Sinhala.
Our ongoing research examines the use of AI in classifying misinformation, including a review of the state of the art and the design and testing of 400+ machine learning models to examine algorithmic efficacy, data requirements, and hardware and liveware costs. Funded by the Asia Foundation, the project also builds new misinformation datasets and models for Sinhala and Bengali. As part of a new, ongoing strand of research, we are also taking a deeper, more qualitative look at the challenges faced by regional fact checkers and journalists, as well as the practicalities of technology adoption in their work. LIRNEasia is also undertaking a scoping study, funded by IDRC, to understand the nature of information disorder in the Asia Pacific, the measures taken to counteract it, and the gaps in action and research. The study's output will consist of a map of actors and frameworks; an evaluation of the current approaches and tools used by stakeholder groups to counter information disorder; and an overview of the research landscape.
An Expert Round Table discussion on “Tackling online misinformation while protecting freedom of expression” was held on 11 October 2021, the second in a series of discussions under the theme “Frontiers of Digital Economy”.
LIRNEasia joined a webinar on Information Disorder organized by the University of Cape Town on 6 May 2022. The event was based on the collaborative Global South report on Information Disorder, for which LIRNEasia authored the chapter on the Asian region.
A white paper exploring the use of AI in classifying misinformation.
Over the past decade, both internet penetration and the digital media user base have increased substantially.
We present a dataset consisting of 3468 documents in Bengali, drawn from Bangladeshi news websites and factchecking operations, annotated as CREDIBLE, FALSE, PARTIAL or UNCERTAIN. The dataset has markers for the content of the document, the classification, the web domain from which each document was retrieved, and the date on which the document was published. We also present the results of misinformation classification models built for the Bengali language, as well as comparisons to prior work in English and Sinhala.
We present a dataset consisting of 3576 documents in Sinhala, drawn from Sri Lankan news websites and factchecking operations, annotated as CREDIBLE, FALSE, PARTIAL or UNCERTAIN. The dataset has markers for the content of the document, the classification, the web domain from which each document was retrieved, and the date on which the document was published. We also present the results of misinformation classification models built for the Sinhala language, as well as comparisons to English benchmarks, and suggest that for smaller media ecosystems it may make more practical sense to model uncertainty instead of truth vs falsehood binaries.
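To make the classification setup described in these two abstracts concrete, the sketch below trains a simple four-class model over documents labelled CREDIBLE, FALSE, PARTIAL, or UNCERTAIN. It is an illustrative baseline only, not the models reported in the papers; the file name and column names ("sinhala_misinformation.csv", "content", "classification") are assumptions, and the character n-gram plus logistic regression pipeline is one common choice rather than the approach actually used.

```python
# Minimal sketch of a four-class misinformation classifier.
# File and column names are hypothetical; the published datasets carry
# the document content, its classification, the source web domain, and
# the publication date.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

LABELS = ["CREDIBLE", "FALSE", "PARTIAL", "UNCERTAIN"]

# Hypothetical CSV: one document per row with its text and label.
df = pd.read_csv("sinhala_misinformation.csv")  # columns: "content", "classification"
df = df[df["classification"].isin(LABELS)]

X_train, X_test, y_train, y_test = train_test_split(
    df["content"], df["classification"],
    test_size=0.2, stratify=df["classification"], random_state=42,
)

# Character n-grams sidestep the lack of robust off-the-shelf tokenizers
# for colloquial Sinhala or Bengali text.
model = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5), max_features=100_000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

# Per-class precision and recall matter more than overall accuracy when
# the FALSE, PARTIAL, and UNCERTAIN classes are much rarer than CREDIBLE.
print(classification_report(y_test, model.predict(X_test)))
```

Treating UNCERTAIN as a first-class label, rather than forcing a true/false binary, mirrors the suggestion above that smaller media ecosystems may be better served by modelling uncertainty directly.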
As hate speech on social media becomes an ever-increasing problem, policymakers may look to more authoritarian measures for policing content. Several countries have already, at some stage, banned networks such as Facebook and Twitter (Liebelson, 2017).
This paper presents two colloquial Sinhala language corpora from the language efforts of the Data, Analysis and Policy team of LIRNEasia, as well as a list of algorithmically derived stopwords. The larger of the two corpora spans 2010 to 2020 and contains 28,825,820 to 29,549,672 words of multilingual text posted by 533 Sri Lankan Facebook pages, including politics, media, celebrities, and other categories; the smaller corpus amounts to 5,402,76 words of only Sinhala text extracted from the larger.
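The abstract above mentions algorithmically derived stopwords without spelling out the derivation. A common frequency-based approach is sketched below for illustration; the file name "sinhala_corpus.txt" and the cut-off of 100 tokens are assumptions, not the procedure behind the published list.

```python
# Minimal sketch of frequency-based stopword derivation from a plain-text
# corpus: the highest-frequency tokens are taken as candidate stopwords.
from collections import Counter

TOP_N = 100  # hypothetical cut-off for the number of stopwords to keep

counts = Counter()
with open("sinhala_corpus.txt", encoding="utf-8") as corpus:
    for line in corpus:
        # Whitespace tokenization is a workable first pass for a raw
        # frequency count over space-delimited Sinhala text.
        counts.update(line.split())

stopwords = [token for token, _ in counts.most_common(TOP_N)]
print(stopwords)
```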