Research Article

Zaman Serisi Tabanlı Makine Öğrenmesi Modelleri ile GitHub Projelerindeki Programlama Dili Popülerliğinin Tahmini (2011–2021)

Year 2025, Volume: 11 Issue: 2, 351 - 365, 31.12.2025
https://doi.org/10.34186/klujes.1790613

Abstract

In this study, the popularity of programming languages was predicted with time-series-based machine learning methods, using repository, pull request (PR), and issue data for different programming languages on the GitHub platform over the 2011–2021 period. The dataset, integrated from three different sources, contains PR, issue, and repository counts at the language–year–quarter level; merging the metrics from the different sources into a single timeline makes it possible to model each language from quarterly observations. After feature engineering, logistic regression, decision trees, random forest, support vector machines, and gradient boosting were applied. The findings show that Logistic Regression (AUC = 0.996), Random Forest (AUC = 0.994), and SVM (AUC = 0.988) provide strong discriminative power, whereas Decision Trees and Gradient Boosting remain weaker in terms of ROC-AUC despite high accuracy values. In this respect, reporting accuracy together with ROC-AUC makes the difference in discriminative power between the methods more visible. In addition, the analyses confirmed the long-term rise of languages such as Python and JavaScript, and decision trees and gradient boosting gave more balanced results in capturing languages that stand out in rare periods.
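The integration of three sources at the language–year–quarter level can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the sample rows are made up, the function name `merge_sources` is hypothetical, and filling a missing metric with 0 is a simplification that the study may not share.

```python
from collections import defaultdict

# Hypothetical rows: (language, year, quarter, count). The values are invented.
repos = [("Python", 2020, 1, 120), ("Python", 2020, 2, 150), ("Go", 2020, 1, 40)]
prs = [("Python", 2020, 1, 900), ("Go", 2020, 1, 300), ("Python", 2020, 2, 1100)]
issues = [("Python", 2020, 1, 450), ("Python", 2020, 2, 500)]


def merge_sources(**sources):
    """Combine per-metric rows into one record per (language, year, quarter)."""
    metrics = sorted(sources)
    # Assumption: a metric missing for a given quarter is recorded as 0.
    table = defaultdict(lambda: dict.fromkeys(metrics, 0))
    for metric, rows in sources.items():
        for language, year, quarter, count in rows:
            table[(language, year, quarter)][metric] += count
    # Sort by key so each language's quarters appear in time order.
    return dict(sorted(table.items()))


timeline = merge_sources(repos=repos, prs=prs, issues=issues)
for key, counts in timeline.items():
    print(key, counts)
```

Each key of `timeline` is one quarterly observation for one language, which is the unit the modeling step would then consume.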

Ethical Statement

There is no conflict of interest among the authors.

Supporting Institution

The study was not supported by any institution.

Project Number

The study was not supported by any project.


Time Series–Based Machine Learning Models for Forecasting Programming Language Popularity on the GitHub Projects (2011–2021)


Abstract

In this study, popularity trends of programming languages were predicted using time-series-based machine learning methods on GitHub data covering 2011–2021. The integrated dataset, compiled from three different sources, contains counts of repositories, pull requests (PRs), and issues at the language–year–quarter level; consolidating metrics from multiple sources into a single timeline enables quarterly modeling for each language. Following feature engineering, logistic regression, decision trees, random forests, support vector machines (SVM), and gradient boosting were applied. The findings indicate that Logistic Regression (AUC = 0.996), Random Forest (AUC = 0.994), and SVM (AUC = 0.988) provide strong discriminative performance, whereas Decision Trees and Gradient Boosting remain weaker in terms of ROC-AUC despite achieving high accuracy. Reporting accuracy together with ROC-AUC therefore makes differences in discriminative power across methods more apparent. Moreover, the analyses confirm the long-term rise of languages such as Python and JavaScript; decision trees and gradient boosting yield more balanced results in capturing languages that become prominent only in rare periods.
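Why accuracy and ROC-AUC can disagree is easier to see on a toy case. The sketch below is not the study's evaluation code: the labels and scores are invented, and the helper names `accuracy` and `roc_auc` are hypothetical. It shows one way two models can reach the same accuracy at a 0.5 threshold while ranking the examples very differently.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented data: 8 negatives followed by 2 positives (an imbalanced task).
y_true = [0] * 8 + [1] * 2

# Model A gives graded scores; every positive ranks above every negative.
scores_a = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.90]
# Model B gives only hard 0/1 scores, so one positive ties with all negatives.
scores_b = [0.0] * 8 + [0.0, 1.0]

pred_a = [int(s >= 0.5) for s in scores_a]
pred_b = [int(s >= 0.5) for s in scores_b]

print("A: accuracy", accuracy(y_true, pred_a), "AUC", roc_auc(y_true, scores_a))
print("B: accuracy", accuracy(y_true, pred_b), "AUC", roc_auc(y_true, scores_b))
```

Both toy models miss the same positive at the threshold, so their accuracy is identical, yet only Model A orders the examples correctly. Accuracy alone would hide that difference.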



Details

Primary Language Turkish
Subjects Computer Software, Programming Languages, Reinforcement Learning, Software Engineering (Other)
Section Research Article
Authors

Bora Uğurlu 0000-0001-6769-9563

Bahadir Karasulu 0000-0001-8524-874X

Project Number The study was not supported by any project.
Submission Date September 24, 2025
Acceptance Date November 27, 2025
Publication Date December 31, 2025
Published in Issue Year 2025 Volume: 11 Issue: 2

Cite

APA Uğurlu, B., & Karasulu, B. (2025). Zaman Serisi Tabanlı Makine Öğrenmesi Modelleri ile GitHub Projelerindeki Programlama Dili Popülerliğinin Tahmini (2011–2021). Kirklareli University Journal of Engineering and Science, 11(2), 351-365. https://doi.org/10.34186/klujes.1790613