Research Article


Analyzing and improving information gain of metrics used in software defect prediction in decision trees

Year 2018, Volume: 24 Issue: 5, 906 - 914, 12.10.2018

Abstract

The McCabe and Halstead method-level metrics are among the best-known and most widely used quantitative software metrics for measuring software quality in a concrete way. Software defect prediction estimates which of the sub-modules of the software under development are likely to be more prone to defects, so that losses of labor and time can be avoided. The datasets used for software defect prediction usually have an imbalanced class distribution, since records of the defective class are often far fewer than records of the non-defective class, and this imbalance adversely affects the results of machine learning methods. Information gain is employed in decision trees, in decision-tree-based rule classifiers, and in attribute selection methods. In this study, software metrics that provide important information for software defect prediction were investigated, and the CM1, JM1, KC1, and PC1 datasets from NASA's PROMISE software repository were balanced with the SMOTE synthetic over-sampling algorithm and thereby improved in terms of information gain. As a result, software defect prediction datasets with higher classification performance in decision trees, and software metric values with increased information gain ratio, were obtained.
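The two ideas the abstract combines can be sketched in a few lines. The following self-contained Python example (the metric values, labels, and threshold are invented toy data, not the NASA datasets; the study itself works on CM1, JM1, KC1, and PC1) computes the information gain of a split on a single numeric metric, then generates SMOTE-style synthetic minority samples by interpolating between minority neighbours. Real SMOTE interpolates between k-nearest neighbours in the full feature space; this sketch uses one dimension to keep the idea visible.

```python
import math
import random

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def info_gain(values, labels, threshold):
    """Information gain of a binary split on a numeric metric at `threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

def smote_like(minority, k=1, n_new=4, seed=0):
    """SMOTE idea in 1-D: new sample = point + rand * (neighbour - point)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted(minority, key=lambda y: abs(y - x))[1:k + 1]
        nb = rng.choice(neighbours)
        synthetic.append(x + rng.random() * (nb - x))
    return synthetic

# Toy data: a complexity-like metric; label 1 = defective (minority class)
metric = [2, 3, 4, 3, 2, 5, 4, 3, 12, 15]
label = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

ig_before = info_gain(metric, label, threshold=10)

# Over-sample the minority class, then re-evaluate the same split
new_vals = smote_like([12.0, 15.0])
ig_after = info_gain(metric + new_vals, label + [1] * len(new_vals), threshold=10)

print(round(ig_before, 3), round(ig_after, 3))  # prints: 0.722 0.985
```

Because the synthetic points lie between existing minority samples, they fall on the same side of a good split, so balancing the classes raises the parent entropy (and hence the attainable information gain) without blurring the split itself, which is the effect the study measures on the PROMISE datasets.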

References

  • Gupta D, Vinay K, Mittal GH. “Comparative study of soft computing techniques for software quality model”. International Journal of Software Engineering Research & Practices, 1(1), 33-37, 2011.
  • Hall T, Beecham S, Bowes D, Gray D, Counsell S. “A systematic literature review on fault prediction performance in software engineering”. IEEE Transactions on Software Engineering, 38(6), 1276-1304, 2012.
  • Catal C, Diri B. “A systematic review of software fault prediction studies”. Expert Systems with Applications, 36(4), 7346-7354, 2009.
  • Pal B, Hasan A, Aktar M, Shahdat N. “Cluster ensemble and probabilistic neural network modeling of class imbalance learning in software defect prediction”. Artificial Intelligence and Applications, In Press.
  • Shirabad S, Menzies TJ. School of Information Technology and Engineering, University of Ottawa. “The PROMISE repository of software engineering databases”. http://promise.site.uottawa.ca/SERepository (01.10.2017).
  • Koru A, Liu H. “Building effective defect-prediction models in practice”. IEEE Software, 22(6), 23-29, 2005.
  • Menzies T, Dekhtyar A, Distefano J, Greenwald J. “Problems with precision: A response to comments on data mining static code attributes to learn defect predictors”. IEEE Transactions on Software Engineering, 33(9), 637-640, 2007.
  • Sahana DC. Software Defect Prediction Based on Classification Rule Mining. MSc Thesis, National Institute of Technology Rourkela, Rourkela, India, 2013.
  • Menzies T, Greenwald J, Frank A. “Data mining static code attributes to learn defect predictors”. IEEE Transactions on Software Engineering, 33(1), 2-13, 2007.
  • Lessmann S, Baesens B, Mues C, Pietsch S. “Benchmarking classification models for software defect prediction: A proposed framework and novel findings”. IEEE Transactions on Software Engineering, 34(4), 485-496, 2008.
  • Mertik M, Lenic M, Stiglic G, Kokol P. “Estimating software quality with advanced data mining techniques”. International Conference on Software Engineering Advances, Tahiti, 29 October-3 November 2006.
  • Pelayo L, Dick S. “Applying novel resampling strategies to software defect prediction”. Fuzzy Information Processing Society NAFIPS ’07, San Diego, USA, 24-27 June, 2007.
  • Magal K, Jacob SG. “Improved random forest algorithm for software defect prediction through data mining techniques”. International Journal of Computer Applications, 117(23), 18-22, 2015.
  • Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. “SMOTE: Synthetic minority over-sampling technique”. Journal of Artificial Intelligence Research, 16, 321-357, 2002.
  • Quinlan JR. “Induction of decision trees”. Machine Learning, 1(1), 81-106, 1986.
  • Quinlan JR. C4.5: Programs for Machine Learning. San Francisco, USA, Morgan Kaufmann Publishers Inc., 1993.
  • Harris E. “Information gain versus gain Ratio: a study of split method biases”. 7th International Symposium on Artificial Intelligence and Mathematics, Florida, USA, 2-4 January 2002.
  • Frank E, Witten IH. “Generating accurate rule sets without global optimization”. 15th International Conference on Machine Learning, Wisconsin, USA, 24-27 July 1998.
  • Tan KC, Tay A, Lee TH, Heng CM. “Mining multiple comprehensible classification rules using genetic programming”. Proceedings of the Congress Evolutionary Computation, Hawaii, USA, 12-17 May 2002.
  • Li K, Zhang W, Lu Q, Fang X. “An improved SMOTE imbalanced data classification method based on support degree”. International Conference on Identification, Information and Knowledge in the Internet of Things, Beijing, China, 17-18 October 2014.
  • Jiang K, Lu J, Xia K. “A novel algorithm for imbalance data classification based on genetic algorithm improved SMOTE”. Arabian Journal for Science and Engineering, 41(8), 3255-3266, 2016.
  • Hu Y, Guo D, Fan Z, Dong C, Huang Q, Xie S, Liu G, Tan J, Li B, Xie Q. “An improved algorithm for imbalanced data and small sample size classification”. Journal of Data Analysis and Information Processing, 3, 27-33, 2015.
  • Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. “The WEKA Data Mining Software: An Update”. SIGKDD Explorations, 11(1), 10-18, 2009.
  • Tan PN, Steinbach M, Kumar V. Introduction to Data Mining, 1st ed. Boston, USA, Addison-Wesley Longman Publishing Co. Inc. 2005.
  • Watson AH, Mccabe TJ. Structured Testing: A Testing Methodology Using the Cyclomatic Complexity Metric. Washington, USA, National Institute of Standards and Technology Special Publication 500-235, 1996.
  • Tomar D, Agarwal S. “Prediction of Defective Software Modules Using Class Imbalance Learning”. Applied Computational Intelligence and Soft Computing, Article ID 7658207, 12 pages, 2016.
  • Stone M. “Cross-validatory choice and assessment of statistical predictions”. Journal of the Royal Statistical Society, 36(2), 111-147, 1974.
  • Paramshetti P, Phalke DA. “Survey on software defect prediction using machine learning techniques”. International Journal of Science and Research, 3, 1394-1397, 2014.
  • Hall T, Beecham S, Bowes D, Gray D, Counsell S. “A systematic literature review on fault prediction performance in software engineering”. IEEE Transactions on Software Engineering, 38(6), 1276-1304, 2012.
  • Wang S, Yao X. “Using class imbalance learning for software defect prediction”. IEEE Transactions on Reliability, 62(2), 434-443, 2013.
  • Aleem S, Capretz LF, Ahmed F. “Benchmarking machine learning techniques for software defect detection”. International Journal of Software Engineering & Applications, 6(3), 11-23, 2015.
  • Prasad M, Florence L, Arya A. “A study on software metrics based software defect prediction using data mining and machine learning techniques”. International Journal of Database Theory and Application, 8(3), 179-190, 2015.
  • Menzies T, Krishna R, Pryor D. North Carolina State University, Department of Computer Science. “The PROMISE repository of empirical software engineering data”. http://openscience.us/repo (01.10.2017).
  • Frank E, Witten IH. “Generating accurate rule sets without global optimization”. 15th International Conference on Machine Learning, San Francisco, USA, 24-27 July 1998.
  • Martin B. Instance-Based learning: Nearest Neighbor With Generalization. MSc Thesis, University of Waikato, Hamilton, New Zealand, 1995.
  • Roy S. Nearest Neighbor with Generalization. MSc Thesis, University of Canterbury, Christchurch, New Zealand, 2002.
  • Cendrowska J. “PRISM: An algorithm for inducing modular rules”. International Journal of Man-Machine Studies, 27(4), 349-370, 1987.
  • Japkowicz N, Stephen S. “The class imbalance problem: A systematic study”. Intelligent Data Analysis, 6(5), 429-449, 2002.
  • Batista G, Prati R, Monard M, “A Study of the Behavior of several methods for balancing machine learning training data”. ACM SIGKDD Explorations Special issue on learning from imbalanced datasets, 6(1), 20-29, 2004.
  • He H, Bai Y, Garcia EA, Li S. “ADASYN: Adaptive synthetic sampling approach for imbalanced learning”. 2008 IEEE International Joint Conference on Neural Networks, Hong Kong, China, 1-8 June 2008.
  • Holte RC. “Very simple classification rules perform well on most commonly used datasets”. Machine Learning, 11, 63-90, 1993.
  • John GH, Langley P. “Estimating Continuous Distributions in Bayesian Classifiers”. 11th Conference on Uncertainty in Artificial Intelligence, Montreal, Canada, 18-20 August 1995.
  • Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees, England, Taylor & Francis, 1984.
  • Wang S, Yao X. “Multiclass imbalance problems: analysis and potential solutions”. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 42(4), 1119-1130, 2012.
  • Breiman L. “Random Forests”. Machine Learning, 45(1), 5-32, 2001.
  • Specht DF. “Probabilistic neural networks”, Neural Networks, 3(1), 109-118, 1990.
  • Catal C, Diri B. “Investigating the effect of dataset size, metrics sets and feature selection techniques on software fault prediction problem”. Information Sciences, 179(8), 1040-1058, 2009.
  • Catal C, Diri B. “Software defect prediction using artificial immune recognition system”. The IASTED International Conference on Software Engineering, Innsbruck, Austria, 13-15 February 2007.
  • Cohen WW. “Fast effective rule induction”. 12th International Conference on Machine Learning, California, USA, 09-12 July 1995.
  • Han J, Kamber M, Pei J. Data Mining: Concepts and Techniques, 3rd ed. Massachusetts, USA, Morgan Kaufmann, 2011.
There are 50 citations in total.

Details

Primary Language Turkish
Subjects Engineering
Journal Section Research Article
Authors

İbrahim Berkan Aydilek 0000-0001-8037-8625

Publication Date October 12, 2018
Published in Issue Year 2018 Volume: 24 Issue: 5

Cite

APA Aydilek, İ. B. (2018). Yazılım hata tahmininde kullanılan metriklerin karar ağaçlarındaki bilgi kazançlarının incelenmesi ve iyileştirilmesi. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi, 24(5), 906-914.
AMA Aydilek İB. Yazılım hata tahmininde kullanılan metriklerin karar ağaçlarındaki bilgi kazançlarının incelenmesi ve iyileştirilmesi. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi. October 2018;24(5):906-914.
Chicago Aydilek, İbrahim Berkan. “Yazılım Hata Tahmininde kullanılan Metriklerin Karar ağaçlarındaki Bilgi kazançlarının Incelenmesi Ve iyileştirilmesi”. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi 24, no. 5 (October 2018): 906-14.
EndNote Aydilek İB (October 1, 2018) Yazılım hata tahmininde kullanılan metriklerin karar ağaçlarındaki bilgi kazançlarının incelenmesi ve iyileştirilmesi. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi 24 5 906–914.
IEEE İ. B. Aydilek, “Yazılım hata tahmininde kullanılan metriklerin karar ağaçlarındaki bilgi kazançlarının incelenmesi ve iyileştirilmesi”, Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi, vol. 24, no. 5, pp. 906–914, 2018.
ISNAD Aydilek, İbrahim Berkan. “Yazılım Hata Tahmininde kullanılan Metriklerin Karar ağaçlarındaki Bilgi kazançlarının Incelenmesi Ve iyileştirilmesi”. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi 24/5 (October 2018), 906-914.
JAMA Aydilek İB. Yazılım hata tahmininde kullanılan metriklerin karar ağaçlarındaki bilgi kazançlarının incelenmesi ve iyileştirilmesi. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi. 2018;24:906–914.
MLA Aydilek, İbrahim Berkan. “Yazılım Hata Tahmininde kullanılan Metriklerin Karar ağaçlarındaki Bilgi kazançlarının Incelenmesi Ve iyileştirilmesi”. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi, vol. 24, no. 5, 2018, pp. 906-14.
Vancouver Aydilek İB. Yazılım hata tahmininde kullanılan metriklerin karar ağaçlarındaki bilgi kazançlarının incelenmesi ve iyileştirilmesi. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi. 2018;24(5):906-14.





Creative Commons License
This journal is licensed under a Creative Commons Attribution 4.0 International License.