Balkan Journal of Electrical and Computer Engineering

2147-284X 2147-284X

MUSA YILMAZ

10.17694/bajece.419551

Engineering

A Distributed K Nearest Neighbor Classifier for Big Data

Tamer Tulgar, Ali Haydar, İbrahim Erşan

04 30 2018

Vol. 6, No. 2, pp. 105-111; 08 29 2015; 11 16 2017

2013

The K-Nearest Neighbor (K-NN) classifier is a well-known and widely applied method in data mining applications. Nevertheless, its high computation and memory costs make the classical K-NN infeasible for today's Big Data analysis applications. To overcome the cost drawbacks of known data mining methods, several distributed environment alternatives have emerged. Among these alternatives, the Hadoop MapReduce distributed ecosystem has attracted significant attention. Recently, several K-NN based classification algorithms have been proposed as distributed methods that are tested in the Hadoop environment and are suitable for emerging data analysis needs. In this work, a new distributed Z-KNN algorithm is proposed, which improves the classification accuracy of the well-known K-NN algorithm by exploiting the representativeness relationship of the instances belonging to different data classes. The proposed algorithm relies on data class representations derived from the Z data instances of each class that are closest to the test instance. The Z-KNN algorithm was tested on a physical Hadoop cluster using several real datasets from different application areas. The performance results obtained from extensive experiments are presented in this paper, and they show that the proposed Z-KNN algorithm is a competitive alternative to other methods recently proposed in the literature.
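The idea of selecting the Z closest instances from each class and deriving a class representation from them can be illustrated with a small single-machine sketch. The abstract does not state how the representation is computed or how the final decision is made, so the code below assumes, purely for illustration, that each class is represented by the mean of its Z nearest instances and that the test instance is assigned to the class whose representation is closest in Euclidean distance. All function names are hypothetical, and the MapReduce distribution of the work is not modeled.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def z_nearest_per_class(train, test_point, z):
    """For each class label, return the z training instances closest to test_point."""
    by_class = defaultdict(list)
    for features, label in train:
        by_class[label].append(features)
    return {
        label: sorted(points, key=lambda p: euclidean(p, test_point))[:z]
        for label, points in by_class.items()
    }


def class_representation(points):
    """ASSUMPTION: represent a class by the mean of its selected instances.

    The abstract only says the representation is derived from the Z
    nearest instances; the mean is an illustrative choice.
    """
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def classify(train, test_point, z):
    """ASSUMPTION: assign the class whose representation is nearest to test_point."""
    nearest = z_nearest_per_class(train, test_point, z)
    reps = {label: class_representation(pts) for label, pts in nearest.items()}
    return min(reps, key=lambda label: euclidean(reps[label], test_point))


# Toy data: two well-separated classes in two dimensions.
train = [
    ([1.0, 1.0], "a"), ([1.2, 0.8], "a"), ([0.9, 1.1], "a"),
    ([5.0, 5.0], "b"), ([5.2, 4.9], "b"), ([4.8, 5.1], "b"),
]
predicted = classify(train, [1.1, 1.0], z=2)
```

In a Hadoop setting, one plausible arrangement would have mappers compute per-class candidate neighbors on their data splits and a reducer merge them into the global Z nearest per class; that arrangement is likewise an assumption and is not taken from the abstract.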

Big Data, Classification, Hadoop, K-Nearest Neighbor, MapReduce
