Abstract
In this paper, we investigate whether Software Development Effort Estimation (SDEE) predictions can be improved using commonly used machine learning algorithms such as Linear Regression, Decision Tree Regression, Random Forest Regression, XGBoost Regression, CatBoost Regression, and LightGBM Regression.
To prevent data leakage and to enhance the TAWOS agile open-source software project dataset, we use a Tabular Variational Autoencoder (TVAE) and a truncated normal data distribution; we also apply additional scaling.
Hyperparameter optimization with Optuna was conducted on 21 model-data combinations based on 5-fold cross-validated adjusted R², mean squared prediction error (MSPE), and Pearson’s correlation coefficient.
The Random Forest Regressor trained on TVAE-augmented data achieved the best results, with an adjusted R² of 0.59, a Pearson’s correlation of 0.81, and an MSPE of 140011, indicating strong predictive accuracy. The CatBoost Regressor on regular data ranked second, with an adjusted R² of 0.39, a Pearson’s correlation of 0.74, and an MSPE of 200011. The Decision Tree Regressor, despite a high training correlation, performed the worst, with an adjusted R² of 0.35, a Pearson’s correlation of 0.76, and an MSPE of 234500, indicating weaker performance. Ultimately, we aimed to reduce the gap between expected and actual software development efforts, thereby minimizing associated risks. The results of this study can significantly enhance software development project planning and management.
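The abstract names the three evaluation metrics but does not define them. As a minimal sketch, assuming the standard textbook definitions (the paper's exact formulas may differ), they can be computed as follows. The function names and the `n_features` parameter (the number of predictors in the model) are illustrative assumptions, not taken from the paper.

```python
from math import sqrt


def mspe(y_true, y_pred):
    """Mean squared prediction error: average squared residual."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R²: R² penalized for the number of predictors.

    n_features is an assumed parameter (number of model inputs).
    """
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)


def pearson_r(y_true, y_pred):
    """Pearson's correlation between actual and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    cov = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))
    var_true = sum((t - mean_true) ** 2 for t in y_true)
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    return cov / sqrt(var_true * var_pred)


# Toy data (not from the paper): predictions are off by a constant 1.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [2.0, 3.0, 4.0, 5.0, 6.0]

print(mspe(actual, predicted))
print(adjusted_r2(actual, predicted, n_features=1))
print(pearson_r(actual, predicted))
```

In the toy data the predictions track the actual values perfectly apart from a constant offset, so the correlation is maximal while the adjusted R² is well below 1. Pearson's correlation is insensitive to such offsets and rescalings, which is one reason a model can show a correlation noticeably higher than its adjusted R².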
Original language | English |
---|---|
Title of host publication | SQAMIA2024: 11th Workshop on Software Quality Analysis, Monitoring, Improvement, and Applications |
Publisher | ceur-ws.org |
Number of pages | 12 |
Publication status | Accepted/In press - 2024 |
Keywords
- software estimation
- regression models
- synthetic data generation
- hyperparameter optimization