Year 2021, Volume 8, Issue 2, Pages 103-121, 2021-06-30

Gaining New Insight into Machine-Learning Datasets via Multiple Binary-Feature Frequency Ranks with a Mobile Benign/Malware Apps Example

Gürol CANBEK [1]


Researchers often compare their Machine Learning (ML) classification performance with that of other studies without examining and comparing the datasets used in training, validation, and testing. One reason is that few convenient methods give initial insight into datasets beyond the descriptive statistics applied to individual continuous or quantitative features. After demonstrating initial manual analysis techniques, this study proposes a novel adaptation of the Kruskal-Wallis statistical test to compare a group of datasets over multiple prominent binary features, which are very common in today’s datasets. As an illustrative example, the new method was tested on six benign/malign mobile application datasets over the frequencies of prominent binary features to explore the dissimilarity of the datasets per class. The feature vector consists of over a hundred “application permission requests”, binary flags for the Android platform’s primary access control, which provides privacy and secures data and information on mobile devices. Permissions are also the leading transparent features for ML-based malware classification. The proposed data-analytical methodology can be applied in any domain through that domain’s prominent features of interest. The results, which are also visualized in three new ways, show that the proposed method gives the degree of dissimilarity among the datasets. Specifically, the conducted test shows that the frequencies in the aggregated dataset and in some of the datasets are not substantially different from each other, and that they are even in close agreement in the positive-class datasets. The proposed domain-independent method is expected to give researchers useful initial insight when comparing different datasets.
Machine learning, Binary classification, Dataset comparison, Malware analysis, Feature engineering, Quantitative analysis
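As a rough illustration of the idea in the abstract, the sketch below applies the standard Kruskal-Wallis H statistic (with the usual tie correction) to per-dataset frequencies of binary features. It is not the adapted procedure proposed in the article, whose details are not given here; the function names, the three dataset labels, and the frequency values are all hypothetical.

```python
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Standard Kruskal-Wallis H statistic for a list of samples."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)

    # Sum of (rank sum)^2 / (group size) over all groups.
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

    # Tie correction: 1 - sum(t^3 - t) / (n^3 - n) over tied value counts t.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    return h / correction


# Hypothetical data: fraction of apps requesting each of five permissions
# in three made-up malware datasets (one list per dataset).
frequencies = {
    "dataset_a": [0.97, 0.81, 0.64, 0.52, 0.33],
    "dataset_b": [0.95, 0.84, 0.60, 0.49, 0.35],
    "dataset_c": [0.71, 0.42, 0.30, 0.18, 0.09],
}
h_stat = kruskal_wallis_h(list(frequencies.values()))
print(f"H = {h_stat:.3f} with {len(frequencies) - 1} degrees of freedom")
```

Under the usual chi-square approximation with two degrees of freedom, an H above about 5.99 would suggest at the 5% level that at least one dataset's frequencies differ from the others, although that approximation is coarse for groups this small.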
Primary Language en
Subjects Engineering
Journal Section Research Article
Authors

Orcid: 0000-0002-9337-097X
Author: Gürol CANBEK
Institution: ASELSAN
Country: Turkey


Dates

Application Date : January 8, 2021
Acceptance Date : March 8, 2021
Publication Date : June 30, 2021

Cite as (APA): Canbek, G. (2021). Gaining New Insight into Machine-Learning Datasets via Multiple Binary-Feature Frequency Ranks with a Mobile Benign/Malware Apps Example. Hittite Journal of Science and Engineering, 8(2), 103-121. DOI: 10.17350/HJSE19030000221