News reports suggest that TRAI has already received nearly 1 million submissions to its recent “Consultation Paper on Regulatory Framework for Over-the-top (OTT) services”, which has sparked a heated debate on net neutrality. In addition to drafting a response ourselves, we also turned our attention to the problem of analyzing such a large volume of responses. A significant amount of time and effort would be required to read and interpret them, or even to formulate a basic general outline of what the public and other stakeholders are trying to say. To put it mildly, TRAI will have its work cut out if it is to do each response justice.
Current and former researchers from our big data team, Kaushalya Madhawa, Danaja Maldeniya, and Nisansa de Silva, brainstormed a technology-augmented approach to the problem of analyzing the responses. Their preliminary thoughts are below:
The solution to TRAI’s dilemma lies in the use of Natural Language Processing (NLP), a field that incorporates computer science, artificial intelligence and computational linguistics. NLP is concerned with giving computers the ability to understand and communicate in human languages. Although it has been around for decades, it is only recently that NLP has advanced enough to make a noticeable difference in user experience with computing devices and the Internet. While not yet in a position to completely replace good old human skill in understanding and interpreting text, NLP, when combined with the vastly superior processing capabilities of modern computers, can at the very least significantly simplify the process of making sense of the responses TRAI has received.
Word/Phrase Clouds
One of the most basic NLP analyses involves generating a word or tag cloud (these have become somewhat commonplace on the web, particularly on blogs). A word cloud is a visualization in which the words in a body of text are displayed with a font size that corresponds to their frequency of occurrence. This provides a simple visual method to identify the most significant words and concepts employed or addressed in the text. Appropriately adapted to the context of the responses, word clouds can help TRAI identify the significance of different aspects and entities in the responses to each of the questions, as well as in the responses as a whole.
A cursory examination of the 20 questions shows that many of them are in fact sets of interdependent sub-questions. It is therefore worth constructing a simple logical hierarchy of questions. This would allow the responses to be analyzed in sections, which may address different sub-questions, as well as in their entirety.
Any effective use of a word cloud requires the removal of designated “stop words” that are not relevant to the particular analysis (e.g. ‘the’). In the body of responses to a particular question, there is a high likelihood that words/phrases that appear in the question itself will appear frequently. Whether the purpose of identifying the key concerns of the respondents is better served by treating these words as “stop words” or by keeping them can be left to human judgment. For example, Question 7 asks how OTT players should ensure the security, safety and privacy of the consumer. Here it is important to understand the relative weight that respondents place on the words in the question (security, safety, privacy). On the other hand, Question 3 asks how the growth of the Internet and the OTT players is affecting the revenue of the telecom operators. The words/phrases internet, OTT, revenue and telecom operators will inevitably be frequent in responses to this question, though their relative frequencies will arguably provide little insight into what the respondents are saying.
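The counting step behind such a word cloud can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is a tiny made-up sample, the function names (`tokenize`, `word_frequencies`) and the two example responses are hypothetical, and rendering the counts as font sizes is left to a visualization tool.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real analysis
# would use a much fuller list.
BASE_STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is", "in", "for", "when"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(responses, question=None, exclude_question_words=False):
    """Count words across all responses, skipping stop words.

    If exclude_question_words is True, every word of the question text
    is treated as an additional stop word.
    """
    stop_words = set(BASE_STOP_WORDS)
    if exclude_question_words and question:
        stop_words.update(tokenize(question))
    counts = Counter()
    for response in responses:
        counts.update(t for t in tokenize(response) if t not in stop_words)
    return counts


# Made-up example data, loosely modelled on Question 3.
question = ("How is the growth of the Internet and OTT players "
            "affecting the revenue of telecom operators?")
responses = [
    "OTT players increase data revenue for telecom operators.",
    "Telecom operators earn data revenue when OTT usage grows.",
]

with_question_words = word_frequencies(responses)
without_question_words = word_frequencies(
    responses, question, exclude_question_words=True
)
print(without_question_words.most_common(3))
```

Keeping the question words (as for Question 7) or dropping them (as for Question 3) is then a single flag, which leaves the judgment call with the analyst.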
The individuals or groups that have responded to the questions may have specific interests or motives in taking part in the debate. They may use words/phrases relevant to such interests or motivations consistently across questions though not necessarily frequently enough to be noticed in the word clouds for each question. A word cloud that visualizes the aggregate of all responses to all questions where words/phrases in the questions have been excluded may serve to highlight underlying patterns of interest of the respondents.
Some questions ask for the opinion of the respondent on different matters. As an example, Question 1 asks whether it is too early to establish a regulatory framework for OTT services. Responses to such a question can vary from simple yes/no answers to long and cogent paragraphs that take a positive or a negative position on the question. Even though these answers contain different arguments, they can be broadly divided into “yes” and “no” answers. For Question 1 there could be thousands of respondents requesting a regulatory framework and thousands opposing one. We can utilize the opinion mining techniques that ecommerce websites use to extract the opinions their users express in product reviews. By grouping the answers by polarity into “yes” and “no” answers, we can look for the motives behind supporting or opposing such a framework. We hope that generating a word cloud for each group of answers to each question would highlight the underlying motives of the respondents.
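As a rough sketch of the grouping step, the snippet below sorts responses into “yes”, “no” and “unclear” buckets using a small list of cue words. This is a deliberately crude stand-in for real opinion mining, which would normally rely on a trained classifier; the cue lists, function names and example responses are all assumptions made for illustration, and the approach ignores negation entirely.

```python
import re
from collections import defaultdict

# Hypothetical cue words; a production system would learn these from data.
YES_CUES = {"yes", "agree", "support", "favour", "favor"}
NO_CUES = {"no", "disagree", "oppose", "against"}


def polarity(response):
    """Label a response 'yes', 'no' or 'unclear' by counting cue words."""
    tokens = re.findall(r"[a-z]+", response.lower())
    score = sum(t in YES_CUES for t in tokens) - sum(t in NO_CUES for t in tokens)
    if score > 0:
        return "yes"
    if score < 0:
        return "no"
    return "unclear"


def group_by_polarity(responses):
    """Group responses into lists keyed by their polarity label."""
    groups = defaultdict(list)
    for response in responses:
        groups[polarity(response)].append(response)
    return dict(groups)


# Made-up answers to Question 1.
answers = [
    "Yes, it is too early, and I support waiting.",
    "No, I oppose any further delay in regulation.",
    "Regulation depends on market maturity.",
]
groups = group_by_polarity(answers)
print({label: len(items) for label, items in groups.items()})
```

Each bucket can then be fed into its own word cloud. Responses that land in “unclear” are the ones most likely to need a human reader.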
Identifying constructive input
Quite likely, not all responses to the questions will be equally constructive. One may be written by someone with a comprehensive understanding of the context. Another may be from an ordinary user of OTT services who is simply concerned about his/her continued ability to use those services. Whilst knowing the opinions of both is important, identifying constructive and nuanced responses will be important for something as complex as what is being discussed in the consultation paper.
While NLP has not advanced sufficiently to tackle an abstract concept like constructiveness directly, quality of writing, which is easily quantifiable, is often used as a proxy. A compound measure of quality may take into account a range of aspects of the responses, including simple measures such as average word length and rate of grammar errors, as well as the overall cohesion of the writing at the paragraph level. Cohesion itself may potentially be estimated by considering the semantic similarity of consecutive sentences in a paragraph, as well as that of the introductory sentence with every other sentence in the paragraph.
This measure of quality of writing can be used as a filtering mechanism to focus on different types of responses. When attempting to extract broad patterns in opinion, one can use the entire set of responses. The quality measure can then be used to identify the responses that may be worth greater scrutiny for the nuanced input needed to navigate a complex subject.
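Two of the ingredients mentioned above, average word length and paragraph cohesion, can be sketched as follows. Note the assumption: true semantic similarity would need word embeddings or a similar model, so this sketch substitutes plain word overlap (Jaccard similarity) between sentences. The function names are hypothetical, and grammar-error rate is left out because it needs an external checker.

```python
import re


def words(text):
    """Lower-case word tokens of a text."""
    return re.findall(r"[a-z']+", text.lower())


def sentences(paragraph):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def average_word_length(text):
    """Mean number of characters per word (0.0 for empty text)."""
    tokens = words(text)
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0


def jaccard(a, b):
    """Word-overlap similarity; a stand-in for semantic similarity."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cohesion(paragraph):
    """Average similarity of consecutive sentences, plus that of the
    introductory sentence with every other sentence."""
    sents = [words(s) for s in sentences(paragraph)]
    if len(sents) < 2:
        return 0.0
    consecutive = [jaccard(x, y) for x, y in zip(sents, sents[1:])]
    with_intro = [jaccard(sents[0], s) for s in sents[1:]]
    scores = consecutive + with_intro
    return sum(scores) / len(scores)


on_topic = "Regulation affects OTT services. OTT services need fair regulation."
off_topic = "Regulation affects OTT services. Mangoes are sweet."
print(cohesion(on_topic), cohesion(off_topic))
```

A compound score could weight these values together, and a threshold on that score would give the filter described above. Both the weights and the threshold would have to be tuned on a hand-checked sample.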
It is likely that the majority of the responses that TRAI has received so far came through organized online movements such as http://www.savetheinternet.in/, which provide a set of stock answers that people can use as is or modify to suit their needs. It may be of interest to identify how much these movements contribute to the overall picture that emerges when the responses are analyzed as a whole. Simple text matching and probability measures based on the semantic similarity of responses can be used to identify responses that are either identical or derived from a common source. This could potentially provide a clear understanding of the groups of respondents with whom further discourse can be carried out on specific points.
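One simple way to approximate this matching is to compare overlapping word sequences (“shingles”) between a response and a known stock answer. The sketch below is an assumption-laden illustration: the template text is invented, the names `shingles`, `similarity` and `match_template` are hypothetical, and the 0.5 threshold is arbitrary. It measures surface overlap only, so heavily paraphrased answers would need a semantic measure.

```python
import re


def shingles(text, k=3):
    """Set of overlapping k-word sequences in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of the two texts' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def match_template(response, templates, threshold=0.5):
    """Return (template_name, score) for the closest stock answer,
    or None if no template reaches the threshold."""
    best = max(
        ((name, similarity(response, text)) for name, text in templates.items()),
        key=lambda pair: pair[1],
        default=None,
    )
    if best is not None and best[1] >= threshold:
        return best
    return None


# Invented stock answer, for illustration only.
templates = {
    "stock": "We request TRAI to protect net neutrality and treat all "
             "internet traffic equally.",
}
modified = ("We strongly request TRAI to protect net neutrality and treat "
            "all internet traffic equally.")
print(match_template(modified, templates))
print(match_template("Regulate OTT apps now.", templates))
```

Counting how many responses match each template gives a first estimate of how much of the total volume each campaign accounts for.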
If applied, the techniques discussed above can provide significant summary statistics that give a broad understanding of the submitted responses. However, the actual interpretation, especially of constructive input, is mostly up to the human using those techniques. Recent advances in NLP provide the ability to carry out a partial analysis of meaning in text. Methods such as semantic role labeling allow bits of meaning, or semantic frames, to be extracted from a whole sentence. By weighing how prominent a semantic frame is in a single respondent’s answer to a question against how often that expression or concept appears in the set of all responses, it may be possible to identify which elements best characterize that particular response. Extending this to all responses and visualizing the result as a word cloud (which now becomes a semantic cloud) will potentially provide a more detailed bird’s-eye view of the most significant sentiments of the respondents in relation to different questions.
The techniques we mention here are meant to simplify and facilitate the work that TRAI has ahead in making sense of the responses. However, at the end of the day, human intuition and intervention are still very much needed for effective interpretation.
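The idea of weighing an element’s prominence within one response against its frequency across all responses is essentially what TF-IDF weighting does. The sketch below applies that weighting to plain words, purely as a stand-in: extracting actual semantic frames would require a trained semantic role labeling model, which is not shown here. The function names and example responses are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def characteristic_terms(responses, index, top_n=3):
    """Rank the terms of responses[index] by TF-IDF: frequent in this
    response, rare across the whole set of responses."""
    docs = [Counter(tokenize(r)) for r in responses]
    n_docs = len(docs)
    # Number of responses in which each term appears at least once.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(doc.keys())
    target = docs[index]
    scores = {
        term: count * math.log(n_docs / doc_freq[term])
        for term, count in target.items()
    }
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in ranked[:top_n]]


# Made-up responses to a single question.
sample = [
    "Privacy matters and privacy rules must bind OTT apps",
    "OTT pricing matters",
    "OTT revenue matters",
]
print(characteristic_terms(sample, 0, top_n=1))
```

Words that every respondent uses, such as “OTT” here, score zero, so what surfaces is whatever sets one response apart. With frames in place of words, the same ranking would feed the semantic cloud described above.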