Research Article


A Robust Initial Basic Subset Selection Method for Outlier Detection Algorithms in Linear Regression

Year 2024, Issue: 10, 76 - 85
https://doi.org/10.52693/jsas.1512794

Abstract

The main motivation of this study is to develop an efficient algorithm for diagnosing and detecting outliers in linear regression up to a reasonable level of contamination. The algorithm first obtains a robust version of the hat matrix at the linear-algebra level. The basic subset obtained in this first stage is then improved through concentration steps, as defined in the fast-LTS (Least Trimmed Squares) regression algorithm. The method can also be plugged into another algorithm as its basic-subset selection stage. The algorithm is effective against outliers in both the X and Y directions up to a contamination rate of 25%. Its complexity grows linearly with the numbers of observations and parameters, and the algorithm is quite fast as it does not require iterative calculations. The success of the algorithm at this contamination level is demonstrated through simulations.
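The refinement stage mentioned in the abstract — the concentration (C-)steps of the fast-LTS algorithm of Rousseeuw and Van Driessen [2] — can be sketched as follows. This is a minimal Python/NumPy illustration, not the authors' implementation (their software is the Julia package LinRegOutliers [19]): the paper's robust, hat-matrix-based initial subset selection is not reproduced here, so a random subset of size h stands in for it, and the function name `c_steps` and the toy data are hypothetical choices for illustration only.

```python
import numpy as np

def c_steps(X, y, subset, h, max_iter=20):
    """Refine a basic subset via fast-LTS concentration steps:
    fit OLS on the current subset, then keep the h observations
    with the smallest squared residuals as the next subset."""
    subset = np.asarray(subset)
    for _ in range(max_iter):
        beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
        resid2 = (y - X @ beta) ** 2
        new_subset = np.argsort(resid2)[:h]
        if set(new_subset) == set(subset):   # subset stabilized: converged
            break
        subset = new_subset
    return subset, beta

# Toy data: a line with gross outliers in 20% of the responses.
rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])        # design matrix with intercept
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, n)
y[:10] += 50.0                               # contaminate the Y direction

h = (n + 2 + 1) // 2                         # common choice h = floor((n + p + 1) / 2)
subset0 = rng.choice(n, size=h, replace=False)  # stand-in for the paper's robust start
subset, beta = c_steps(X, y, subset0, h)
```

Each C-step is guaranteed not to increase the trimmed sum of squared residuals, so the subset converges in a few iterations; the quality of the final fit, however, depends heavily on the initial subset — which is exactly the problem the proposed selection method addresses.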

References

  • [1] X. Gao and Y. Feng, “Penalized weighted least absolute deviation regression,” Statistics and Its Interface, vol. 11, no. 1, pp. 79–89, 2018.
  • [2] P. J. Rousseeuw and K. Van Driessen, “Computing LTS regression for large data sets,” Data Mining and Knowledge Discovery, vol. 12, pp. 29–45, 2006.
  • [3] D. C. Hoaglin and R. E. Welsch, “The hat matrix in regression and ANOVA,” The American Statistician, vol. 32, no. 1, pp. 17–22, 1978.
  • [4] J. W. Tukey, Exploratory Data Analysis. Reading, MA: Addison-Wesley, 1977, vol. 2.
  • [5] A. S. Hadi and J. S. Simonoff, “Procedures for the identification of multiple outliers in linear models,” Journal of the American Statistical Association, vol. 88, no. 424, pp. 1264–1272, 1993.
  • [6] N. Billor, A. S. Hadi, and P. F. Velleman, “BACON: blocked adaptive computationally efficient outlier nominators,” Computational Statistics & Data Analysis, vol. 34, no. 3, pp. 279–298, 2000.
  • [7] N. Billor, S. Chatterjee, and A. S. Hadi, “A re-weighted least squares method for robust regression estimation,” American Journal of Mathematical and Management Sciences, vol. 26, no. 3–4, pp. 229–252, 2006.
  • [8] D. A. Belsley, E. Kuh, and R. E. Welsch, Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons, 2005.
  • [9] S. Barratt, G. Angeris, and S. Boyd, “Minimizing a sum of clipped convex functions,” Optimization Letters, vol. 14, pp. 2443–2459, 2020.
  • [10] S. Chatterjee and M. Mächler, “Robust regression: A weighted least squares approach,” Communications in Statistics-Theory and Methods, vol. 26, no. 6, pp. 1381–1394, 1997.
  • [11] M. H. Satman, “A new algorithm for detecting outliers in linear regression,” International Journal of Statistics and Probability, vol. 2, no. 3, p. 101, 2013.
  • [12] L. Huo, T.-H. Kim, and Y. Kim, “Robust estimation of covariance and its application to portfolio optimization,” Finance Research Letters, vol. 9, no. 3, pp. 121–134, 2012.
  • [13] P. J. Rousseeuw, “Least median of squares regression,” Journal of the American Statistical Association, vol. 79, no. 388, pp. 871–880, 1984.
  • [14] D. M. Hawkins and D. Olive, “Applications and algorithms for least trimmed sum of absolute deviations regression,” Computational Statistics & Data Analysis, vol. 32, no. 2, pp. 119–134, 1999.
  • [15] D. M. Hawkins, D. Bradu, and G. V. Kass, “Location of several outliers in multiple-regression data using elemental sets,” Technometrics, vol. 26, no. 3, pp. 197–208, 1984.
  • [16] D. De Menezes, D. M. Prata, A. R. Secchi, and J. C. Pinto, “A review on robust M-estimators for regression analysis,” Computers & Chemical Engineering, vol. 147, p. 107254, 2021.
  • [17] P. Rousseeuw and V. Yohai, “Robust regression by means of S-estimators,” in Robust and Non-linear Time Series Analysis: Proceedings of a Workshop Organized by the Sonderforschungsbereich 123 “Stochastische Mathematische Modelle”, Heidelberg 1983. Springer, pp. 256–272, 1984.
  • [18] M. H. Satman, “A genetic algorithm based modification on the LTS algorithm for large data sets,” Communications in Statistics-Simulation and Computation, vol. 41, no. 5, pp. 644–652, 2012.
  • [19] M. H. Satman, S. Adiga, G. Angeris, and E. Akadal, “LinRegOutliers: A Julia package for detecting outliers in linear regression,” Journal of Open Source Software, vol. 6, no. 57, p. 2892, 2021.
  • [20] J. Bezanson, S. Karpinski, V. B. Shah, and A. Edelman, “Julia: A fast dynamic language for technical computing,” arXiv preprint arXiv:1209.5145, 2012.
There are 20 citations in total.

Details

Primary Language English
Subjects Econometric and Statistical Methods
Journal Section Research Articles
Authors

Mehmet Hakan Satman 0000-0002-9402-1982

Early Pub Date December 24, 2024
Publication Date
Submission Date July 8, 2024
Acceptance Date December 16, 2024
Published in Issue Year 2024 Issue: 10

Cite

IEEE M. H. Satman, “A Robust Initial Basic Subset Selection Method for Outlier Detection Algorithms in Linear Regression”, JSAS, no. 10, pp. 76–85, December 2024, doi: 10.52693/jsas.1512794.