Research Article

Outlier Detection in Multiple Regression Models Using Genetic Algorithms and Bayesian Information Criteria

Year 2008, Volume: 6 Issue: 1, 38 - 51, 15.07.2008

Abstract

Statistical models, particularly regression models, are among the most useful tools for extracting and understanding the essential features of datasets. However, most real-world datasets contain some abnormal values, generally termed outliers. Accurate identification of outliers plays a significant role in statistical analysis, especially in regression modeling. Nevertheless, many classical statistical models are applied blindly to datasets that contain outliers, and the results can be misleading at best. The presence of outliers can degrade the fit of multiple regression models. The aim of this study is to define an outlier detection method that uses Genetic Algorithms (GA) with the Bayesian Information Criterion (BIC), and to illustrate the algorithm with real and simulated data. The algorithm uses a fitness function based on BIC. A smaller value of the criterion indicates a model that fits the data better; the presence of one or more outliers degrades the regression model and results in larger BIC values.
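The idea of a BIC-based fitness function inside a genetic algorithm can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the exact BIC formula or the GA settings, so the sketch assumes the common mean-shift formulation, in which every flagged observation costs one extra parameter, BIC = n ln(RSS/n) + (p + m) ln(n), where RSS is computed on the unflagged rows, p is the number of regression coefficients and m is the number of flagged rows. The function names `bic_fitness` and `ga_outliers`, the population size, the mutation rate, and the tournament selection scheme are all hypothetical choices made for this example.

```python
import numpy as np

def bic_fitness(mask, X, y):
    """BIC of a mean-shift outlier model (assumed form, lower is better).

    mask is a boolean chromosome: True marks an observation as an outlier.
    """
    n, p = X.shape
    keep = ~mask
    m = int(mask.sum())
    if keep.sum() <= p + 1:              # too few rows left to fit the model
        return np.inf
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    resid = y[keep] - X[keep] @ beta
    rss = float(resid @ resid)
    # Each flagged row acts like one dummy variable, hence the (p + m) penalty.
    return n * np.log(rss / n) + (p + m) * np.log(n)

def ga_outliers(X, y, pop_size=40, generations=60, p_mut=0.02, seed=0):
    """Search over outlier masks with a simple elitist GA (illustrative only)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    pop = rng.random((pop_size, n)) < 0.1    # sparse random initial masks
    pop[0] = False                           # always include "no outliers"
    for _ in range(generations):
        fit = np.array([bic_fitness(ind, X, y) for ind in pop])
        pop = pop[np.argsort(fit)]           # best (lowest BIC) first
        children = [pop[0].copy()]           # elitism: keep the best mask
        while len(children) < pop_size:
            # Binary tournament on the sorted population: lower index wins.
            a = pop[min(rng.integers(pop_size, size=2))]
            b = pop[min(rng.integers(pop_size, size=2))]
            cut = rng.integers(1, n)         # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n) < p_mut   # bit-flip mutation
            children.append(child)
        pop = np.array(children)
    fit = np.array([bic_fitness(ind, X, y) for ind in pop])
    best = int(np.argmin(fit))
    return pop[best], float(fit[best])
```

Calling `ga_outliers(X, y)` with a design matrix `X` (including an intercept column) and a response vector `y` returns the best mask found and its BIC value. Because the all-False mask is in the initial population and the best individual is always carried over, the returned BIC can never be worse than that of the plain regression with no observations flagged.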

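Çoklu Regresyon Modellerinde Genetik Algoritma ve Bayes Bilgi Kriteri Kullanarak Sapan Değerlerin Belirlenmesi (Determination of Outliers in Multiple Regression Models Using a Genetic Algorithm and the Bayesian Information Criterion)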

Abstract

Statistical models, especially regression models, are among the most widely used tools for understanding and revealing the important features of datasets. However, many real-world datasets may contain a certain number of abnormal values, usually called outliers. Correctly detecting outliers plays an important role in statistical analysis, particularly in regression models. Despite this, many classical statistical models are also applied to datasets that contain outliers, and the results are ultimately misleading. Outliers also make it harder to determine the appropriate multiple regression model.

The aim of this study is to define an outlier detection method using the Genetic Algorithm (GA) and the Bayesian Information Criterion (BIC), and to demonstrate the algorithm on real and simulated data. A BIC-based fitness function is used in the genetic algorithm. The BIC value indicates the most suitable model for the data; when one or more outliers are present, the regression model is adversely affected by these observations and yields larger BIC values.

There are 21 citations in total.

Details

Primary Language English
Subjects Statistics
Journal Section Research Articles
Authors

Özlem Gürünlü Alma

Serdar Kurt

Aybars Uğur

Publication Date July 15, 2008
Published in Issue Year 2008 Volume: 6 Issue: 1

Cite

APA Gürünlü Alma, Ö., Kurt, S., & Uğur, A. (2008). Outlier Detection in Multiple Regression Models Using Genetic Algorithms and Bayesian Information Criteria. İstatistik Araştırma Dergisi, 6(1), 38-51.