Sinhala Language Corpora and Stopwords from a Decade of Sri Lankan Facebook


Posted on July 13, 2020

Wijeratne, Y., de Silva, N. (2020). Sinhala Language Corpora and Stopwords from a Decade of Sri Lankan Facebook. LIRNEasia. Last updated: July 13, 2020.


This paper presents two colloquial Sinhala language corpora extracted from Facebook, as well as a list of algorithmically derived stopwords.

Corpus-Alpha

The larger of the two corpora spans trilingual text posted by 533 Sri Lankan Facebook pages, covering politics, media, celebrities, and other categories, from 2010 to 2020. It contains 28,825,820 to 29,549,672 words of text, mostly in Sinhala, English and Tamil (the three main languages used in Sri Lanka). Because it retains URLs, punctuation and other noise, it is better suited to discourse analysis and the study of code-mixing in colloquial Sinhala.
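
As a rough illustration of how code-mixing in a post might be gauged, the sketch below labels each whitespace-separated token by script using Unicode block ranges (Sinhala U+0D80 to U+0DFF, Tamil U+0B80 to U+0BFF, ASCII letters as Latin). The function names `script_of` and `script_counts` are hypothetical and are not part of the released data or the paper's method. Romanised Sinhala written in Latin letters would be counted as Latin here.

```python
from collections import Counter


def script_of(token: str) -> str:
    """Label a token by the first Sinhala, Tamil, or Latin letter found in it."""
    for ch in token:
        code = ord(ch)
        if 0x0D80 <= code <= 0x0DFF:   # Sinhala Unicode block
            return "sinhala"
        if 0x0B80 <= code <= 0x0BFF:   # Tamil Unicode block
            return "tamil"
        if ch.isascii() and ch.isalpha():
            return "latin"
    # Digits, punctuation, emoji, and anything else
    return "other"


def script_counts(text: str) -> Counter:
    """Count whitespace-separated tokens per script label."""
    return Counter(script_of(tok) for tok in text.split())


if __name__ == "__main__":
    print(script_counts("මම today ගෙදර நான் 123"))
```

A post whose counts are spread across more than one script label is a candidate for closer code-mixing analysis.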

Corpus-Sinhala-Redux

The smaller corpus amounts to 5,402,76 words of only Sinhala text extracted from Corpus-Alpha. It has been cleaned of URLs, punctuation and noise.
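
The kind of cleaning described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' actual pipeline: it treats anything beginning with `http://`, `https://` or `www.` as a URL, and keeps only runs of characters from the Sinhala Unicode block (excluding the Sinhala punctuation mark at U+0DF4), allowing the zero-width joiner inside a run. The name `keep_sinhala` is hypothetical.

```python
import re

# Assumption: a URL is any run of non-space characters starting with http(s):// or www.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")

# Sinhala letters and vowel signs (U+0D80 to U+0DF3); the zero-width joiner
# (U+200D) is allowed inside a token because Sinhala conjuncts can use it.
SINHALA_TOKEN_RE = re.compile(r"[\u0D80-\u0DF3][\u0D80-\u0DF3\u200D]*")


def keep_sinhala(text: str) -> str:
    """Drop URLs, then keep only Sinhala-script tokens, joined by single spaces."""
    without_urls = URL_RE.sub(" ", text)
    tokens = SINHALA_TOKEN_RE.findall(without_urls)
    return " ".join(tokens)


if __name__ == "__main__":
    print(keep_sinhala("මම https://example.com ගෙදර යනවා! hello"))
```

Extracting by script range removes punctuation, digits and non-Sinhala words in one pass, at the cost of also discarding romanised Sinhala.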

Both corpora have markers for their date of creation, page of origin, and content type.

License

These datasets are released under the principles of Open Access. As such, this work is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license: you may distribute, remix, adapt, and build upon this work, even commercially, as long as you credit the authors for the original creation. You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits. Please see the full license at https://creativecommons.org/licenses/by/4.0/legalcode

Download data
