A socioeconomic index, also known as a deprivation or poverty index, is a single numerical figure, derived from multiple indicators, that gauges the socioeconomic status of a predefined area. It allows direct comparison of socioeconomic status between regions and is useful for identifying patterns and correlations between socioeconomic status and other attributes.
However, such an index is not easy to construct: there are many indicators to choose from (income, expenditure, education, occupation, durable assets, etc.), and it is difficult to objectively justify their relative importance.
Several governments and organisations have developed socioeconomic indices for their respective regions that have been widely accepted as official:
- The National Statistics Socio-Economic Classification (NS-SEC) for the United Kingdom
- The European Deprivation Index (EDI) for Europe
- The Socio-Economic Indexes for Areas (SEIFA) for Australia
- The New Zealand Deprivation Index (NZDep) for New Zealand
- The Global Multidimensional Poverty Index (MPI)
Unfortunately, Sri Lanka does not yet have such an index. This article presents our attempt at creating a socioeconomic index for Sri Lanka using principal component analysis (PCA) on the 2011 national census datasets. Justifying our choice of PCA is beyond the scope of this article; however, PCA is routinely used in the creation of socioeconomic indices (Vyas and Kumaranayake, 2006).
Principal Component Analysis
This section contains a brief introduction to using PCA to develop a socioeconomic index. We strongly recommend reading Vyas and Kumaranayake (2006) for a thorough description, justification and a worked example.
PCA is a statistical technique used to reduce a set of possibly correlated variables to a smaller set of uncorrelated (i.e. orthogonal) components, where each component is a linear, weighted combination of the initial variables. The first principal component accounts for as much of the variance of the dataset as possible, and each succeeding component accounts for as much of the remaining variance as possible while being orthogonal to the preceding components.
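The properties described above can be checked on a small synthetic example. The following is a minimal sketch, not the code used for the index: it assumes NumPy, computes PCA through an eigendecomposition of the covariance matrix, and uses made-up data and a hypothetical helper named pca.

```python
import numpy as np

def pca(X):
    """Return (components, variances); components are rows, sorted by descending variance."""
    Xc = X - X.mean(axis=0)                  # centre each variable
    cov = np.cov(Xc, rowvar=False)           # covariance between variables
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # reorder to descending variance
    return eigvecs[:, order].T, eigvals[order]

# Synthetic data (an assumption for illustration): three variables driven by one latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))
X = latent @ np.array([[1.0, 0.8, -0.6]]) + 0.3 * rng.normal(size=(500, 3))

components, variances = pca(X)
scores = (X - X.mean(axis=0)) @ components.T  # data expressed in component coordinates

print("share of variance per component:", variances / variances.sum())
print("covariance of component scores:\n", np.cov(scores, rowvar=False).round(6))
```

The off-diagonal entries of the printed covariance matrix are zero up to rounding, which is what "uncorrelated components" means in practice, and the variance shares are listed from largest to smallest.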
We make the assumption that the first principal component resulting from the application of PCA on a dataset of socioeconomic indicators is the socioeconomic index. However, the reliability of this index is contingent on the careful selection of variables to include in the PCA.
Curating the Dataset
The 2011 national census datasets are only available as a summary of counts at the Grama Niladhari Division (GND) level. The original categorical variables surveyed at the household level have been converted to binary variables and aggregated for each GND. This would not necessarily be a problem if each GND were homogeneous with respect to socioeconomic status, but reality is not as organized as we would like it to be. Consequently, the aggregation obscures certain correlations between variables. Consider a GND where half the houses have granite flooring and tile roofing, and the remaining half have cement flooring and asbestos roofing. In actual fact, granite flooring should be correlated with tile roofing; in the aggregated counts, however, granite flooring would be equally correlated with tile roofing, cement flooring and asbestos roofing.
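The loss of information can be illustrated with a toy GND of 100 households. This is a sketch under assumed data, with hypothetical column names: two household-level datasets with opposite floor and roof pairings produce identical GND-level counts, so the counts alone cannot reveal which pairing is real.

```python
import numpy as np

# Columns (hypothetical): granite floor, cement floor, tile roof, asbestos roof.
# "matched": granite goes with tile, cement goes with asbestos (the case in the text).
matched = np.array([[1, 0, 1, 0]] * 50 + [[0, 1, 0, 1]] * 50)
# "crossed": the opposite pairing, granite with asbestos and cement with tile.
crossed = np.array([[1, 0, 0, 1]] * 50 + [[0, 1, 1, 0]] * 50)

def gnd_counts(households):
    """Aggregate binary household rows into one row of counts, as in the census summary."""
    return households.sum(axis=0)

# Household-level correlation between granite flooring and tile roofing.
corr_matched = np.corrcoef(matched[:, 0], matched[:, 2])[0, 1]
corr_crossed = np.corrcoef(crossed[:, 0], crossed[:, 2])[0, 1]

print("counts (matched):", gnd_counts(matched))
print("counts (crossed):", gnd_counts(crossed))
print("household-level correlations:", corr_matched, corr_crossed)
```

Both datasets aggregate to the same four counts, although the household-level correlation between granite flooring and tile roofing is perfectly positive in one and perfectly negative in the other.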
The variables we considered were from the 2011 national census (after two weeks of painstaking data cleaning). There were a total of 109 variables to curate. Our objective was to discard any variables that were redundant or not indicative of socioeconomic status.
As such, we discarded the following variables:
- All variables in the housing category on the basis of redundancy – the official concepts and definitions revealed that they were merely a classification based on floor, roof and wall materials.
- All variables in the waste disposal, age and gender categories on the basis that they were not indicative of socioeconomic status.
- All “other” variables from all categories on the basis that they were not indicative of socioeconomic status – “other” is inherently ambiguous and could potentially cover a wide range of socioeconomic levels.
Thus we were left with 61 variables in the household dataset and 9 variables in the population dataset. We then combined these remaining variables into a single dataset before proceeding.
Ideally, we would have run the PCA on a household-level dataset of binary variables. For a given household in such a dataset, only a single variable within each category would have a value of 1, with the remaining variables having a value of 0. In order to emulate this, we normalized the variables within each category so that they represented the proportion of households in a GND possessing a certain attribute of a category. We then standardized each variable and ran PCA on the dataset.
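The normalization and standardization steps might look like the following sketch. The counts, the category names and both helper functions are hypothetical. Dividing by the sum of the retained columns in each category is also an assumption, since the exact denominator used is not stated here.

```python
import numpy as np

# Hypothetical household counts for 4 GNDs.
# Columns 0-1 belong to a "floor" category, columns 2-3 to a "roof" category.
counts = np.array([
    [30, 10, 25, 15],
    [ 5, 45, 10, 40],
    [20, 20, 20, 20],
    [50, 10, 45, 15],
], dtype=float)
categories = {"floor": [0, 1], "roof": [2, 3]}

def proportions_within_category(counts, categories):
    """Convert counts to proportions so that each category sums to 1 in every GND."""
    props = np.empty_like(counts)
    for cols in categories.values():
        # Assumes every GND has at least one household counted in the category.
        totals = counts[:, cols].sum(axis=1, keepdims=True)
        props[:, cols] = counts[:, cols] / totals
    return props

def standardize(X):
    """Scale each column to zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

props = proportions_within_category(counts, categories)
Z = standardize(props)
print(props.round(3))
print(Z.round(3))
```

Working with proportions removes the effect of GND size, and standardizing puts all variables on the same scale so that none dominates the PCA purely because of its variance.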
We multiplied each standardized variable by its weight in the resulting first principal component and summed each row to produce a score for each GND. This score was to serve as the socioeconomic index.
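The scoring step can be sketched as below, again under stated assumptions: the helper first_component_scores is hypothetical, the input is random stand-in data rather than census variables, and the first component is obtained from an eigendecomposition of the covariance matrix.

```python
import numpy as np

def first_component_scores(Z):
    """Return (scores, weights) for the first principal component of standardized data Z."""
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    weights = eigvecs[:, np.argmax(eigvals)]  # eigenvector with the largest eigenvalue
    scores = (Z * weights).sum(axis=1)        # weight each variable, then sum each row
    return scores, weights

# Stand-in data (an assumption): 200 areas, 4 variables, two of them correlated.
rng = np.random.default_rng(1)
raw = rng.normal(size=(200, 4))
raw[:, 1] += raw[:, 0]
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0)

scores, weights = first_component_scores(Z)
print("weights:", weights.round(3))
print("first five scores:", scores[:5].round(3))
```

The sign of a principal component is arbitrary, so the direction of the resulting score should be checked against a variable whose relationship to socioeconomic status is known, and flipped if necessary.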
The resulting choropleth map exhibits an expected socioeconomic distribution, in spite of the suboptimal dataset. We anticipate even better results with a household level dataset.
The relevant datasets, code and results can be found at this GitHub repository.