TY - JOUR
T1 - School Dropout Prediction and Feature Importance Exploration in Malawi Using Household Panel Data
T2 - Machine Learning Approach
AU - Çolak, Hazal
AU - Güven, Çiçek
AU - Nápoles, Gonzalo
PY - 2022/12/13
Y1 - 2022/12/13
AB - Designing early warning systems through machine learning (ML) models to identify students at risk of dropout can improve targeting mechanisms and lead to efficient social policy interventions in education. School dropout is a culmination of various factors that drive children to leave school, and timely policy responses are needed to address these underlying factors and improve school retention over time. However, applying ML approaches to school dropout prediction is a significant challenge, especially in low-income countries, where data collection and management systems are more prone to financial and technical constraints. For this reason, this study suggests using already collected household panel data to predict the probability of school dropout and explore feature importance for primary school children in Malawi through ML models. A rich set of variables is obtained from the household data and used to build Random Forest (RF), Least Absolute Shrinkage and Selection Operator (LASSO), Ridge and Multilayer Neural Network (MNN) models. The study further explores how performance metrics differ when the training samples' weights, which represent frequency in the sampling design, are embedded into the cost function of these ML models, in order to discuss the implications of using household data in computational social science. LASSO and MNN models trained with sample weights stand out for their higher recall rates of 80.6% and 78.8%, respectively. Compared to the baseline model trained with sample weights, recall improves by roughly 56 percentage points with LASSO and 54 percentage points with MNN. Also, comparing LASSO and MNN trained with and without sample weights reveals that training with sample weights increases the recall rate by roughly 11 percentage points for LASSO and 12 percentage points for MNN.
Lastly, the paper provides a comprehensive and unified approach to better interpret the models by using a game-theoretic approach – SHapley Additive exPlanations (SHAP) – to quantify feature importance. The results show that socio-economic characteristics of children, such as working in household farming and father's education level, are among the most important features contributing to the probability of school dropout in the ML models. This study argues that the weighted sample structure of household data and its wide range of variables can enrich the literature and the SHAP method for feature importance and yield valuable results to harness data science for society.
KW - Machine learning
KW - Feature importance
KW - School dropout prediction
KW - Sample weight
KW - Educational data mining
U2 - 10.1007/s42001-022-00195-3
DO - 10.1007/s42001-022-00195-3
M3 - Article
SN - 2432-2717
JO - Journal of Computational Social Science
JF - Journal of Computational Social Science
ER -