Sinhala Language Corpora and Stopwords from a Decade of Sri Lankan Facebook

Posted on July 13, 2021

This paper presents two colloquial Sinhala language corpora produced by the language efforts of the Data, Analysis and Policy team of LIRNEasia, along with a list of algorithmically derived stopwords. The larger of the two corpora spans 2010 to 2020 and contains 28,825,820 to 29,549,672 words of multilingual text posted by 533 Sri Lankan Facebook pages covering politics, media, celebrities, and other categories; the smaller corpus amounts to 5,402,76 words of Sinhala-only text extracted from the larger.


