An Empirical Study on the Performance of the Distance Metrics

Fatih Aydın

doi:10.35414/akufemubid.1325843

Research Article

An Empirical Study on the Performance of the Distance Metrics

Year 2023, Volume: 23 Issue: 6, 1445 - 1457, 28.12.2023

Fatih Aydın

https://doi.org/10.35414/akufemubid.1325843

Abstract

Metrics are used to measure the distance, similarity, or dissimilarity between two points in a metric space. Metric learning algorithms perform the finding task of data points that are closest or furthest to a query point in m-dimensional metric space. Some metrics take into account the assumption that the whole dimensions are of equal importance, and vice versa. However, this assumption does not incorporate a number of real-world problems that classification algorithms tackle. In this research, the existing information gain, the information gain ratio, and some well-known conventional metrics have been compared by each other. The 1-Nearest Neighbor algorithm taking these metrics as its meta-parameter has been applied to forty-nine benchmark datasets. Only the accuracy rate criterion has been employed in order to quantify the performance of the metrics. The experimental results show that each metric is successful on datasets corresponding to its own domain. In other words, each metric is favorable on datasets overlapping its own assumption. In addition, there also exists incompleteness in classification tasks for metrics just like there is for learning algorithms.

Keywords

Machine learning , Metric learning , Information gain , No free lunch theorems , K-nearest neighbors

References

Aha, D.W., 1998. Feature Weighting for Lazy Learning Algorithms. In: H. Liu and H. Motoda, eds. Feature Extraction, Construction and Selection. Springer, Boston, MA, 13–32.
Aydın, F., 2022. A class-driven approach to dimension embedding. Expert Systems with Applications, 195, 116650.
Bellet, A., Habrard, A., and Sebban, M., 2013. A Survey on Metric Learning for Feature Vectors and Structured Data.
Bellet, A., Habrard, A., and Sebban, M., 2015. Nonlinear and Local Metric Learning. In: Metric Learning. Springer, Cham., 33–42.
Beyer, K.S., Goldstein, J., Ramakrishnan, R., and Shaft, U., 1999. When Is ‘“Nearest Neighbor”’ Meaningful? In: ICDT ’99 Proceedings of the 7th International Conference on Database Theory. London, UK: Springer-Verlag, 217–235.
Brown, T. and Koplowitz, J., 1979. The weighted nearest neighbor rule for class dependent sample sizes (Corresp.). IEEE Transactions on Information Theory, 25 (5), 617–619.
Cover, T. and Hart, P., 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13 (1), 21–27.
Duneja, A. and Puyalnithi, T., 2017. Enhancing Classification Accuracy of K-Nearest Neighbours Algorithm Using Gain Ratio. International Research Journal of Engineering and Technology, 4 (9), 1385–1388.
Fix, E. and Hodges, J.L., 1951. Discriminatory Analysis - Nonparametric Discrimination: Consistency Properties. Texas.
Fukunaga, K. and Hostetler, L., 1973. Optimization of k nearest neighbor density estimates. IEEE Transactions on Information Theory, 19 (3), 320–326.
Grabczewski, K. and Jankowski, N., 2005. Feature selection with decision tree criterion. In: N. Nedjah, L. de M. Mourelle, M. Vellasco, A. Abraham, and M. Köppen, eds. Fifth International Conference on Hybrid Intelligent Systems (HIS’05). Rio de Janeiro, Brazil: IEEE, 212–217.
Grabczewski, K. and Jankowski, N., 2006. Mining for Complex Models Comprising Feature Selection and Classification. In: I. Guyon, S. Gunn, M. Nikravesh, and L.A. Zadeh, eds. Feature Extraction. Springer, Berlin, Heidelberg, 471–488.
Gu, X., Angelov, P.P., Kangin, D., and Principe, J.C., 2017. A new type of distance metric and its use for clustering. Evolving Systems, 8 (3), 167–177.
Guyon, I. and Elisseeff, A.A., 2003. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 3 (2003), 1157–1182.
Hall, M.A., 1999. Correlation-based feature subset selection for machine learning (Ph.D. thesis). The University of Waikato, Waikato, New Zealand.
Han, J. and Kamber, M., 2006. Data Mining: Concepts and Techniques. 2nd ed. San Francisco: Morgan Kaufmann.
Hancock, J.M., 2004. Jaccard Distance (Jaccard Index, Jaccard Similarity Coefficient). In: Dictionary of Bioinformatics and Computational Biology. Chichester, UK: John Wiley & Sons, Ltd.
Hechenbichler, K. and Schliep, K., 2004. Weighted k-Nearest-Neighbor Techniques and Ordinal Classification. Collaborative Research Center 386, 399.
Jankowski, N. and Usowicz, K., 2011. Analysis of Feature Weighting Methods Based on Feature Ranking Methods for Classification. In: B.L. Lu, L. Zhang, and J. Kwok, eds. Neural Information Processing. Springer, Berlin, Heidelberg, 238–247.
Jia, H., Cheung, Y., and Liu, J., 2016. A New Distance Metric for Unsupervised Learning of Categorical Data. IEEE Transactions on Neural Networks and Learning Systems, 27 (5), 1065–1079.
Jiang, S., Xu, Y., Song, H., Wu, Q., Ng, M.K., Min, H., and Qiu, S., 2018. Multi-instance transfer metric learning by weighted distribution and consistent maximum likelihood estimation. Neurocomputing, 321, 49–60.
Korenius, T., Laurikkala, J., and Juhola, M., 2007. On principal component analysis, cosine and Euclidean measures in information retrieval. Information Sciences, 177 (22), 4893–4905.
De Maesschalck, R., Jouan-Rimbaud, D., and Massart, D.L., 2000. The Mahalanobis distance. Chemometrics and Intelligent Laboratory Systems, 50 (1), 1–18.
Manning, C.D. and Raghavan, P., 2009. An Introduction to Information Retrieval. In: Online. 1.
Monjardet, B., 1998. On the comparison of the Spearman and Kendall metrics between linear orders. Discrete Mathematics, 192 (1–3), 281–292.
Munkres, J.R., 2017. Topology. 2nd ed. Pearson, 119-121.
Nilsson, N.J., 1965. Learning Machines: Foundations of trainable pattern-classifying systems. New York: McGraw-Hill.
Norouzi, M., Fleet, D.J., and Salakhutdinov, R.R., 2012. Hamming Distance Metric Learning. In: Pereira F., C.J.C. Burges, L. Bottou, and K.Q. Weinberger, eds. Advances in Neural Information Processing Systems 25 (NIPS 2012). Curran Associates, 1061–1069.
Parmar, J., Chouhan, S.S., Raychoudhury, V., and Rathore, S.S., 2021. Open-world Machine Learning: Applications, Challenges, and Opportunities. ACM Computing Surveys, 55 (10), 1–37.
Peng, Y., Hu, L., Ying, S., and Shen, C., 2018. Global Nonlinear Metric Learning by Gluing Local Linear Metrics. In: Proceedings of the 2018 SIAM International Conference on Data Mining. Philadelphia, PA: Society for Industrial and Applied Mathematics, 423–431.
Rudin, W., 1976. Principles of Mathematical Analysis. 3rd ed. McGraw-Hill Education, 30-32.
Short, R. and Fukunaga, K., 1981. The optimal distance measure for nearest neighbor classification. IEEE Transactions on Information Theory, 27 (5), 622–627.
Spearman, C., 1904. The Proof and Measurement of Association between Two Things. The American Journal of Psychology, 15 (1), 72.
Székely, G.J., Rizzo, M.L., and Bakirov, N.K., 2007. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35 (6), 2769–2794.
Taneja, S., Gupta, C., Goyal, K., and Gureja, D., 2014. An Enhanced K-Nearest Neighbor Algorithm Using Information Gain and Clustering. In: 2014 Fourth International Conference on Advanced Computing & Communication Technologies. Rohtak, India: IEEE, 325–329.
Utkin, L. V. and Ryabinin, M.A., 2019. Discriminative Metric Learning with Deep Forest. International Journal on Artificial Intelligence Tools, 28 (02), 1950007.
Vivencio, D.P., R. Hruschka, E., do Carmo Nicoletti, M., dos Santos, E.B., and Galvao, S.D.C.O., 2007. Feature-weighted k-Nearest Neighbor Classifier. In: 2007 IEEE Symposium on Foundations of Computational Intelligence. Honolulu, HI, USA: IEEE, 481–486.
Wolpert, D.H., 1996. The Lack of A Priori Distinctions Between Learning Algorithms. Neural Computation, 8 (7), 1341–1390.
Wolpert, D.H. and Macready, W.G., 1997. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1 (1), 67–82.
Wolpert, D.H. and Macready, W.G., 2005. Coevolutionary Free Lunches. IEEE Transactions on Evolutionary Computation, 9 (6), 721–735.
Zabihzadeh, D., Monsefi, R., and Yazdi, H.S., 2019. Sparse Bayesian approach for metric learning in latent space. Knowledge-Based Systems, 178, 11–24.
Zhang, W., Yan, Z., Xiao, G., Zhang, H., and Zuo, W., 2019. Learning Distance Metric for Support Vector Machine: A Multiple Kernel Learning Approach. Neural Processing Letters, 50, 2899–2923.
Zhou, Z.-H. and Feng, J., 2019. Deep forest. National Science Review, 6 (1), 74–86.
UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, (28 Feb 2021).
Machine Learning Benchmark Problems, https://www.rdocumentation.org/packages/mlbench/versions/2.1-1, (7 Jul 2019).
MATLAB Sample Data Sets, https://www.mathworks.com/help/stats/sample-data-sets.html, (7 Jul 2019)

Uzaklık Metriklerinin Performansı Üzerine Ampirik Bir Çalışma

Year 2023, Volume: 23 Issue: 6, 1445 - 1457, 28.12.2023

Fatih Aydın

https://doi.org/10.35414/akufemubid.1325843

Abstract

Metrik, bir metrik uzayda iki nokta arasındaki mesafeyi, benzerliği veya farklılığı ölçmek için kullanılır. Metrik öğrenme algoritmaları, m boyutlu metrik uzayda bir sorgulama noktasına en yakın veya en uzak olan veri noktalarını bulma görevini gerçekleştirir. Bazı metrikler, tüm boyutların eşit öneme sahip olduğu varsayımını dikkate alır ve bunun tersi de geçerlidir. Ancak bu varsayım, sınıflandırma algoritmalarının üstesinden geldiği bazı gerçek dünya problemleriyle örtüşmez. Bu araştırmada; mevcut bilgi kazanımı, bilgi kazanım oranı ve bazı iyi bilinen konvansiyonel metrikler birbirleri ile karşılaştırılmıştır. Bu metrikleri meta parametresi olarak alan 1-En Yakın Komşular algoritması 49 veri kümesine uygulanmıştır. Metriklerin performansını ölçmek için sadece doğruluk oranı ölçütü kullanılmıştır. Deneysel sonuçlar, her metriğin kendi domainine karşılık gelen veri setlerinde başarılı olduğunu göstermektedir. Başka bir deyişle; her metrik, kendi varsayımıyla örtüşen veri kümelerinin lehinedir. Ayrıca öğrenme algoritmalarında olduğu gibi metrikler için de sınıflandırma görevlerinde eksiklikler mevcuttur.

Keywords

Makine öğrenmesi , Metrik öğrenme , Bilgi kazancı , No free lunch teoremleri , K-en yakın komşular

References

Aha, D.W., 1998. Feature Weighting for Lazy Learning Algorithms. In: H. Liu and H. Motoda, eds. Feature Extraction, Construction and Selection. Springer, Boston, MA, 13–32.
Aydın, F., 2022. A class-driven approach to dimension embedding. Expert Systems with Applications, 195, 116650.
Bellet, A., Habrard, A., and Sebban, M., 2013. A Survey on Metric Learning for Feature Vectors and Structured Data.
Bellet, A., Habrard, A., and Sebban, M., 2015. Nonlinear and Local Metric Learning. In: Metric Learning. Springer, Cham., 33–42.
Beyer, K.S., Goldstein, J., Ramakrishnan, R., and Shaft, U., 1999. When Is ‘“Nearest Neighbor”’ Meaningful? In: ICDT ’99 Proceedings of the 7th International Conference on Database Theory. London, UK: Springer-Verlag, 217–235.
Brown, T. and Koplowitz, J., 1979. The weighted nearest neighbor rule for class dependent sample sizes (Corresp.). IEEE Transactions on Information Theory, 25 (5), 617–619.
Cover, T. and Hart, P., 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13 (1), 21–27.
Duneja, A. and Puyalnithi, T., 2017. Enhancing Classification Accuracy of K-Nearest Neighbours Algorithm Using Gain Ratio. International Research Journal of Engineering and Technology, 4 (9), 1385–1388.
Fix, E. and Hodges, J.L., 1951. Discriminatory Analysis - Nonparametric Discrimination: Consistency Properties. Texas.
Fukunaga, K. and Hostetler, L., 1973. Optimization of k nearest neighbor density estimates. IEEE Transactions on Information Theory, 19 (3), 320–326.
Grabczewski, K. and Jankowski, N., 2005. Feature selection with decision tree criterion. In: N. Nedjah, L. de M. Mourelle, M. Vellasco, A. Abraham, and M. Köppen, eds. Fifth International Conference on Hybrid Intelligent Systems (HIS’05). Rio de Janeiro, Brazil: IEEE, 212–217.
Grabczewski, K. and Jankowski, N., 2006. Mining for Complex Models Comprising Feature Selection and Classification. In: I. Guyon, S. Gunn, M. Nikravesh, and L.A. Zadeh, eds. Feature Extraction. Springer, Berlin, Heidelberg, 471–488.
Gu, X., Angelov, P.P., Kangin, D., and Principe, J.C., 2017. A new type of distance metric and its use for clustering. Evolving Systems, 8 (3), 167–177.
Guyon, I. and Elisseeff, A.A., 2003. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 3 (2003), 1157–1182.
Hall, M.A., 1999. Correlation-based feature subset selection for machine learning (Ph.D. thesis). The University of Waikato, Waikato, New Zealand.
Han, J. and Kamber, M., 2006. Data Mining: Concepts and Techniques. 2nd ed. San Francisco: Morgan Kaufmann.
Hancock, J.M., 2004. Jaccard Distance (Jaccard Index, Jaccard Similarity Coefficient). In: Dictionary of Bioinformatics and Computational Biology. Chichester, UK: John Wiley & Sons, Ltd.
Hechenbichler, K. and Schliep, K., 2004. Weighted k-Nearest-Neighbor Techniques and Ordinal Classification. Collaborative Research Center 386, 399.
Jankowski, N. and Usowicz, K., 2011. Analysis of Feature Weighting Methods Based on Feature Ranking Methods for Classification. In: B.L. Lu, L. Zhang, and J. Kwok, eds. Neural Information Processing. Springer, Berlin, Heidelberg, 238–247.
Jia, H., Cheung, Y., and Liu, J., 2016. A New Distance Metric for Unsupervised Learning of Categorical Data. IEEE Transactions on Neural Networks and Learning Systems, 27 (5), 1065–1079.
Jiang, S., Xu, Y., Song, H., Wu, Q., Ng, M.K., Min, H., and Qiu, S., 2018. Multi-instance transfer metric learning by weighted distribution and consistent maximum likelihood estimation. Neurocomputing, 321, 49–60.
Korenius, T., Laurikkala, J., and Juhola, M., 2007. On principal component analysis, cosine and Euclidean measures in information retrieval. Information Sciences, 177 (22), 4893–4905.
De Maesschalck, R., Jouan-Rimbaud, D., and Massart, D.L., 2000. The Mahalanobis distance. Chemometrics and Intelligent Laboratory Systems, 50 (1), 1–18.
Manning, C.D. and Raghavan, P., 2009. An Introduction to Information Retrieval. In: Online. 1.
Monjardet, B., 1998. On the comparison of the Spearman and Kendall metrics between linear orders. Discrete Mathematics, 192 (1–3), 281–292.
Munkres, J.R., 2017. Topology. 2nd ed. Pearson, 119-121.
Nilsson, N.J., 1965. Learning Machines: Foundations of trainable pattern-classifying systems. New York: McGraw-Hill.
Norouzi, M., Fleet, D.J., and Salakhutdinov, R.R., 2012. Hamming Distance Metric Learning. In: Pereira F., C.J.C. Burges, L. Bottou, and K.Q. Weinberger, eds. Advances in Neural Information Processing Systems 25 (NIPS 2012). Curran Associates, 1061–1069.
Parmar, J., Chouhan, S.S., Raychoudhury, V., and Rathore, S.S., 2021. Open-world Machine Learning: Applications, Challenges, and Opportunities. ACM Computing Surveys, 55 (10), 1–37.
Peng, Y., Hu, L., Ying, S., and Shen, C., 2018. Global Nonlinear Metric Learning by Gluing Local Linear Metrics. In: Proceedings of the 2018 SIAM International Conference on Data Mining. Philadelphia, PA: Society for Industrial and Applied Mathematics, 423–431.
Rudin, W., 1976. Principles of Mathematical Analysis. 3rd ed. McGraw-Hill Education, 30-32.
Short, R. and Fukunaga, K., 1981. The optimal distance measure for nearest neighbor classification. IEEE Transactions on Information Theory, 27 (5), 622–627.
Spearman, C., 1904. The Proof and Measurement of Association between Two Things. The American Journal of Psychology, 15 (1), 72.
Székely, G.J., Rizzo, M.L., and Bakirov, N.K., 2007. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35 (6), 2769–2794.
Taneja, S., Gupta, C., Goyal, K., and Gureja, D., 2014. An Enhanced K-Nearest Neighbor Algorithm Using Information Gain and Clustering. In: 2014 Fourth International Conference on Advanced Computing & Communication Technologies. Rohtak, India: IEEE, 325–329.
Utkin, L. V. and Ryabinin, M.A., 2019. Discriminative Metric Learning with Deep Forest. International Journal on Artificial Intelligence Tools, 28 (02), 1950007.
Vivencio, D.P., R. Hruschka, E., do Carmo Nicoletti, M., dos Santos, E.B., and Galvao, S.D.C.O., 2007. Feature-weighted k-Nearest Neighbor Classifier. In: 2007 IEEE Symposium on Foundations of Computational Intelligence. Honolulu, HI, USA: IEEE, 481–486.
Wolpert, D.H., 1996. The Lack of A Priori Distinctions Between Learning Algorithms. Neural Computation, 8 (7), 1341–1390.
Wolpert, D.H. and Macready, W.G., 1997. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1 (1), 67–82.
Wolpert, D.H. and Macready, W.G., 2005. Coevolutionary Free Lunches. IEEE Transactions on Evolutionary Computation, 9 (6), 721–735.
Zabihzadeh, D., Monsefi, R., and Yazdi, H.S., 2019. Sparse Bayesian approach for metric learning in latent space. Knowledge-Based Systems, 178, 11–24.
Zhang, W., Yan, Z., Xiao, G., Zhang, H., and Zuo, W., 2019. Learning Distance Metric for Support Vector Machine: A Multiple Kernel Learning Approach. Neural Processing Letters, 50, 2899–2923.
Zhou, Z.-H. and Feng, J., 2019. Deep forest. National Science Review, 6 (1), 74–86.
UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, (28 Feb 2021).
Machine Learning Benchmark Problems, https://www.rdocumentation.org/packages/mlbench/versions/2.1-1, (7 Jul 2019).
MATLAB Sample Data Sets, https://www.mathworks.com/help/stats/sample-data-sets.html, (7 Jul 2019)

There are 46 citations in total.

Details

Primary Language	English
Subjects	Computer Vision and Multimedia Computation (Other)
Journal Section	Research Article
Authors	Fatih Aydın 0000-0001-9679-0403
Submission Date	July 11, 2023
Early Pub Date	December 22, 2023
Publication Date	December 28, 2023
Published in Issue	Year 2023 Volume: 23 Issue: 6

Cite

APA	Aydın, F. (2023). An Empirical Study on the Performance of the Distance Metrics. Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi, 23(6), 1445-1457. https://doi.org/10.35414/akufemubid.1325843
AMA	Aydın F. An Empirical Study on the Performance of the Distance Metrics. Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi. December 2023;23(6):1445-1457. doi:10.35414/akufemubid.1325843
Chicago	Aydın, Fatih. “An Empirical Study on the Performance of the Distance Metrics”. Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi 23, no. 6 (December 2023): 1445-57. https://doi.org/10.35414/akufemubid.1325843.
EndNote	Aydın F (December 1, 2023) An Empirical Study on the Performance of the Distance Metrics. Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi 23 6 1445–1457.
IEEE	F. Aydın, “An Empirical Study on the Performance of the Distance Metrics”, Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi, vol. 23, no. 6, pp. 1445–1457, 2023, doi: 10.35414/akufemubid.1325843.
ISNAD	Aydın, Fatih. “An Empirical Study on the Performance of the Distance Metrics”. Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi 23/6 (December2023), 1445-1457. https://doi.org/10.35414/akufemubid.1325843.
JAMA	Aydın F. An Empirical Study on the Performance of the Distance Metrics. Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi. 2023;23:1445–1457.
MLA	Aydın, Fatih. “An Empirical Study on the Performance of the Distance Metrics”. Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi, vol. 23, no. 6, 2023, pp. 1445-57, doi:10.35414/akufemubid.1325843.
Vancouver	Aydın F. An Empirical Study on the Performance of the Distance Metrics. Afyon Kocatepe Üniversitesi Fen Ve Mühendislik Bilimleri Dergisi. 2023;23(6):1445-57.

Article Files

Full Text

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.