Research Article

Zaman Serisi Tabanlı Makine Öğrenmesi Modelleri ile GitHub Projelerindeki Programlama Dili Popülerliğinin Tahmini (2011–2021)

Year 2025, Volume: 11, Issue: 2, 351–365, 31.12.2025
https://doi.org/10.34186/klujes.1790613
https://izlik.org/JA48CS98AJ

Abstract

In this study, the popularity of programming languages was predicted with time-series-based machine learning methods, using repository, pull request (PR), and issue data for different programming languages on the GitHub platform over the 2011–2021 period. The dataset, integrated from three different sources, contains PR, issue, and repository counts at the language–year–quarter level; because the metrics from the different sources are merged into a single timeline, each language can be modeled from quarterly observations. After feature engineering, logistic regression, decision trees, random forest, support vector machines (SVM), and gradient boosting were applied. The findings show that Logistic Regression (AUC = 0.996), Random Forest (AUC = 0.994), and SVM (AUC = 0.988) provide strong discriminative power, whereas Decision Trees and Gradient Boosting remain weaker in terms of ROC-AUC despite their high accuracy values. Reporting accuracy together with ROC-AUC therefore makes the difference in discriminative power between the methods more visible. The analyses also confirmed the long-term rise of languages such as Python and JavaScript, and decision trees and gradient boosting gave more balanced results in capturing languages that stand out only in rare periods.
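The integration step described in the abstract can be pictured with a minimal sketch. It assumes each source is a list of (language, year, quarter, count) rows; the function name `merge_sources`, the sample rows, and the choice to fill missing counts with 0 are all illustrative assumptions, not the authors' actual pipeline or data.

```python
from collections import defaultdict

# Hypothetical rows: (language, year, quarter, count). All values are made up.
repos = [("Python", 2020, 1, 120), ("Python", 2020, 2, 150), ("Ruby", 2020, 1, 40)]
prs = [("Python", 2020, 1, 900), ("Ruby", 2020, 1, 200), ("Ruby", 2020, 2, 180)]
issues = [("Python", 2020, 1, 300), ("Python", 2020, 2, 310)]


def merge_sources(**sources):
    """Outer-join the sources on (language, year, quarter).

    Assumption for this sketch: a count missing from a source becomes 0.
    """
    names = list(sources)
    table = defaultdict(lambda: dict.fromkeys(names, 0))
    for name, rows in sources.items():
        for language, year, quarter, count in rows:
            table[(language, year, quarter)][name] += count
    # Sort by key so each language's quarters appear in chronological order.
    return dict(sorted(table.items()))


timeline = merge_sources(repos=repos, prs=prs, issues=issues)
for key, counts in timeline.items():
    print(key, counts)
```

Each key of `timeline` is one language-quarter observation carrying all three counts, which is the shape a quarterly model per language would consume.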

Ethical Statement

There is no conflict of interest among the authors.

Supporting Institution

The study was not supported by any institution.

Project Number

The study was not supported by any project.

References

  • Biau, G., & Scornet, E. (2016). A random forest guided tour. Test, 25(2), 197–227. https://doi.org/10.1007/s11749-016-0481-7
  • Bissyandé, T. F., Lo, D., Jiang, L., Réveillère, L., Klein, J., & Le Traon, Y. (2013). Got issues? Who cares about it? A large scale investigation of issue trackers from GitHub. 2013 IEEE 24th International Symposium on Software Reliability Engineering (ISSRE), 188–197. IEEE. https://doi.org/10.1109/ISSRE.2013.6698917
  • Borges, H., Hora, A., & Valente, M. T. (2016). Predicting the popularity of GitHub repositories. Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE 2016), 1–10. ACM. https://doi.org/10.1145/2972958.2972966
  • Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Wadsworth International Group.
  • Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
  • Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794. https://doi.org/10.1145/2939672.2939785
  • Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297. https://doi.org/10.1007/BF00994018
  • Cox, D. R. (1958). The regression analysis of binary sequences. Journal of the Royal Statistical Society: Series B (Methodological), 20(2), 215–242.
  • Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.
  • GitHub Staff. (2024, October 29). Octoverse: AI leads Python to top language as the number of global developers surges. GitHub Blog. https://github.blog/news-insights/octoverse/octoverse-2024/
  • Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). Wiley.
  • Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., ... & Liu, T. Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems (NeurIPS), 30, 3146–3154.
  • Menard, S. (2002). Applied Logistic Regression Analysis (2nd ed.). Sage.
  • Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106. https://doi.org/10.1007/BF00116251
  • Rahman, M. M., & Roy, C. K. (2014). An insight into the pull requests of GitHub. Proceedings of the 11th Working Conference on Mining Software Repositories (MSR 2014), 364–367. ACM. https://doi.org/10.1145/2597073.2597076
  • Ray, B., Posnett, D., Devanbu, P., & Filkov, V. (2017). A large-scale study of programming languages and code quality in GitHub. Communications of the ACM, 60(10), 91–100. https://doi.org/10.1145/3126905
  • Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
  • Wen, I. (2021). GitHub Programming Languages Data (2011–2021) [Dataset]. Kaggle. https://www.kaggle.com/datasets/isaacwen/github-programming-languages-data
  • Wessel, M., Vargovich, J., Gerosa, M. A., & Treude, C. (2023). GitHub Actions: The impact on the pull request process. Empirical Software Engineering, 28(131). https://doi.org/10.1007/s10664-023-10369-w

Time Series–Based Machine Learning Models for Forecasting Programming Language Popularity on the GitHub Projects (2011–2021)

Abstract

In this study, the popularity of programming languages was predicted by applying time-series-based machine learning methods to GitHub data covering 2011–2021. The integrated dataset, compiled from three different sources, contains counts of repositories, pull requests (PRs), and issues at the language–year–quarter level; consolidating these metrics into a single timeline allows each language to be modeled from quarterly observations. Following feature engineering, logistic regression, decision trees, random forests, support vector machines (SVM), and gradient boosting were applied. The findings indicate that Logistic Regression (AUC = 0.996), Random Forest (AUC = 0.994), and SVM (AUC = 0.988) discriminate strongly between classes, whereas Decision Trees and Gradient Boosting remain weaker in terms of ROC-AUC despite achieving high accuracy. Reporting accuracy together with ROC-AUC therefore makes the differences in discriminative power across methods more apparent. The analyses also confirm the long-term rise of languages such as Python and JavaScript, and decision trees and gradient boosting yield more balanced results in capturing languages that become prominent only in rare periods.
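Why accuracy and ROC-AUC can disagree is easiest to see on a toy example. The sketch below uses made-up labels and scores (not the study's data) and two hypothetical models: `model_a` outputs graded scores, while `model_b` outputs only hard 0/1 labels, as an unpruned tree might. Both metric functions are written out in plain Python for illustration; AUC is computed as the probability that a random positive is ranked above a random negative.

```python
def accuracy(y_true, scores, threshold=0.5):
    """Share of correct labels after cutting the scores at `threshold`."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, imbalanced labels: 2 "popular" language-quarters out of 10 (made up).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# Model A ranks both positives above every negative,
# but one positive falls below the 0.5 cutoff.
model_a = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.30, 0.45, 0.90]

# Model B emits hard labels only: it finds one positive and misses the other.
model_b = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]

for name, scores in (("A", model_a), ("B", model_b)):
    print(name, accuracy(y_true, scores), roc_auc(y_true, scores))
```

Both toy models reach the same accuracy, yet their AUC values differ because only model A orders every positive above every negative. That is the kind of gap that reporting the two metrics side by side exposes.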

Ethical Statement

There is no conflict of interest among the authors.

Supporting Institution

The study was not supported by any institution.

Project Number

The study was not supported by any project.

References

There are 19 citations in total.

Details

Primary Language Turkish
Subjects Computer Software, Programming Languages, Reinforcement Learning, Software Engineering (Other)
Journal Section Research Article
Authors

Bora Uğurlu 0000-0001-6769-9563

Bahadir Karasulu 0000-0001-8524-874X

Project Number The study was not supported by any project.
Submission Date September 24, 2025
Acceptance Date November 27, 2025
Publication Date December 31, 2025
DOI https://doi.org/10.34186/klujes.1790613
IZ https://izlik.org/JA48CS98AJ
Published in Issue Year 2025 Volume: 11 Issue: 2

Cite

APA Uğurlu, B., & Karasulu, B. (2025). Zaman Serisi Tabanlı Makine Öğrenmesi Modelleri ile GitHub Projelerindeki Programlama Dili Popülerliğinin Tahmini (2011–2021). Kirklareli University Journal of Engineering and Science, 11(2), 351-365. https://doi.org/10.34186/klujes.1790613