Research Article
A Distributed K Nearest Neighbor Classifier for Big Data

Year 2018, Volume: 6 Issue: 2, 105 - 111, 30.04.2018
https://doi.org/10.17694/bajece.419551

Abstract

The K-Nearest Neighbor (K-NN) classifier is a well-known and widely applied method in data mining applications. Nevertheless, its high computation and memory costs make the classical K-NN infeasible for today's Big Data analysis applications. To overcome the cost drawbacks of the known data mining methods, several distributed-environment alternatives have emerged. Among these alternatives, the Hadoop MapReduce distributed ecosystem has attracted significant attention. Recently, several K-NN based classification algorithms have been proposed; these are distributed methods, tested in the Hadoop environment, that suit emerging data analysis needs. In this work, a new distributed Z-KNN algorithm is proposed, which improves the classification accuracy of the well-known K-NN algorithm by benefiting from the representativeness relationship of the instances belonging to different data classes. The proposed algorithm relies on data class representations derived from the Z data instances of each class that are closest to the test instance. The Z-KNN algorithm was tested on a physical Hadoop cluster using several real datasets from different application areas. The performance results acquired after extensive experiments are presented in this paper, and they show that the proposed Z-KNN algorithm is a competitive alternative to other studies recently proposed in the literature.
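
The class-representation idea described above can be sketched on a single machine as follows. This is an illustrative interpretation, not the paper's distributed implementation: each class is scored by its Z training instances closest to the test point, and the mean of those Z distances is assumed here to be the class score (the exact scoring rule, and its Hadoop MapReduce distribution, are defined in the paper itself).

```python
import math
from collections import defaultdict

def z_knn_predict(train, labels, test_point, z):
    """Hypothetical Z-KNN sketch: score each class by the mean distance
    of its own z training instances closest to the test point."""
    # Group distances from the test point to every training instance by class.
    by_class = defaultdict(list)
    for x, y in zip(train, labels):
        by_class[y].append(math.dist(x, test_point))  # Euclidean distance
    # Each class is represented by its z nearest members; the class whose
    # representatives lie closest (smallest mean distance) is predicted.
    scores = {
        y: sum(sorted(ds)[:z]) / min(z, len(ds))
        for y, ds in by_class.items()
    }
    return min(scores, key=scores.get)
```

With z = 1 this degenerates to comparing each class's single nearest instance; larger z makes the prediction depend on a broader representation of each class, which is the effect the abstract attributes to Z-KNN.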

References

[1] K. Schwab, "The Fourth Industrial Revolution", Crown Business, 2017.
[2] D. Singh and C. K. Reddy, "A survey on platforms for big data analytics", Journal of Big Data, vol. 1, no. 8, 2014.
[3] P. Tan, M. Steinbach and V. Kumar, "Introduction to Data Mining", 1st ed., Reading, MA: Addison-Wesley, 2005.
[4] J. Dean and S. Ghemawat, "MapReduce: A Flexible Data Processing Tool", Communications of the ACM, vol. 53, no. 1, pp. 72-77, 2010.
[5] X. Wu et al., "Top 10 algorithms in data mining", Knowledge and Information Systems, vol. 14, no. 1, pp. 1-37, 2008.
[6] A. Fahad et al., "A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis", IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, pp. 267-279, 2014.
[7] S. Zhang, M. Zong and D. Cheng, "Learning k for kNN Classification", ACM Transactions on Intelligent Systems and Technology, vol. 8, no. 3, pp. 43:1-43:19, 2017.
[8] K. Niu, F. Zhao and S. Zhang, "A Fast Classification Algorithm for Big Data Based on KNN", Journal of Applied Sciences, vol. 13, no. 12, pp. 2208-2212, 2013.
[9] A. Bifet, J. Read, B. Pfahringer and G. Holmes, "Efficient Data Stream Classification via Probabilistic Adaptive Windows", in Proc. 28th Annual ACM Symposium on Applied Computing, 2013, pp. 801-806.
[10] S. S. Labib, "A Comparative Study to Classify Big Data Using Fuzzy Techniques", in Proc. 5th International Conference on Electronic Devices, Systems and Applications (ICEDSA), 2016.
[11] M. El Bakry, S. Safwat and O. Hegazy, "A MapReduce Fuzzy Technique of Big Data Classification", in Proc. SAI Computing Conference, 2016, pp. 118-128.
[12] B. Quost and T. Denoeux, "Clustering and classification of fuzzy data using the fuzzy EM algorithm", Fuzzy Sets and Systems, vol. 286, pp. 134-156, 2016.
[13] Z. Deng, X. Zhu, D. Cheng, M. Zong and S. Zhang, "Efficient kNN classification algorithm for big data", Neurocomputing, vol. 195, pp. 143-148, 2016.
[14] S. Zhang, D. Cheng, M. Zong and L. Gao, "Self-representation nearest neighbour search for classification", Neurocomputing, vol. 195, pp. 137-142, 2016.
[15] G. Song, J. Rochas, L. El Beze, F. Huet and F. Magoules, "K Nearest Neighbour Joins for Big Data on MapReduce: A Theoretical and Experimental Analysis", IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 9, pp. 2376-2392, 2016.
[16] J. Maillo, S. Ramirez, I. Triguero and F. Herrera, "kNN-IS: An Iterative Spark-based design of the k-Nearest Neighbours classifier for big data", Knowledge-Based Systems, vol. 117, pp. 3-15, 2017.
[17] T. Tulgar, A. Haydar and İ. Erşan, "Data Distribution Aware Classification Algorithm based on K-Means", International Journal of Advanced Computer Science and Applications, Article in Press, 2017.
[18] T. White, "Hadoop: The Definitive Guide", 4th ed., O'Reilly, 2015.
[19] J. Gosling, B. Joy, G. Steele, G. Bracha and A. Buckley. (2017, Aug. 1). The Java Language Specification, Java SE 8 Edition [Online]. Available: https://docs.oracle.com/javase/specs/jls/se8/html/index.html
[20] UCI Center for Machine Learning and Intelligent Systems. (2017, Aug. 1). UC Irvine Machine Learning Repository [Online]. Available: https://archive.ics.uci.edu/ml/
[21] O. L. Mangasarian, W. N. Street and W. H. Wolberg, "Breast cancer diagnosis and prognosis via linear programming", Operations Research, vol. 43, no. 4, pp. 570-577, July-August 1995.
[22] M. Charytanowicz, J. Niewczas, P. Kulczycki, P. A. Kowalski, S. Lukasik and S. Zak, "A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images", in Information Technologies in Biomedicine, Berlin-Heidelberg: Springer-Verlag, 2010, pp. 15-24.
[23] F. Alimoglu and E. Alpaydin, "Methods of Combining Multiple Classifiers Based on Different Representations for Pen-based Handwriting Recognition", in Proc. Fifth Turkish Artificial Intelligence and Artificial Neural Networks Symposium (TAINN 96), June 1996.

Details

Primary Language English
Subjects Engineering
Journal Section Research Articles
Authors

Tamer Tulgar

Ali Haydar

İbrahim Erşan

Publication Date April 30, 2018
Published in Issue Year 2018 Volume: 6 Issue: 2

Cite

APA Tulgar, T., Haydar, A., & Erşan, İ. (2018). A Distributed K Nearest Neighbor Classifier for Big Data. Balkan Journal of Electrical and Computer Engineering, 6(2), 105-111. https://doi.org/10.17694/bajece.419551

All articles published by BAJECE are licensed under the Creative Commons Attribution 4.0 International License. This permits anyone to copy, redistribute, remix, transmit and adapt the work, provided the original work and source are appropriately cited.