A New Approach to Minimize Memory Requirements of Frequent Subgraph Mining Algorithms
Year 2021,
, 237 - 246, 01.03.2021
Turgay Bilgin
,
Murat Oğuz
Abstract
Frequent subgraph mining (FSM) is a subsection of graph mining domain which is extensively used for graph classification and graph clustering purposes. Over the past decade, many efficient FSM algorithms have been developed. The improvements generally focus on reducing time complexity by changing the algorithm structure or using parallel programming techniques. FSM algorithms have another problem to solve, which is the high memory consumption. In this study, a new approach called Predictive Dynamic Sized Structure Packing (PDSSP) have been proposed to minimize the memory requirement of FSM algorithms. Proposed approach redesigns the internal data structures of FSM algorithms without any algorithmic modifications. PDSSP has two contributions. The first one is the Dynamic Sized Integer Type (ds_Int) which is a newly designed unsigned integer data type. The second contribution is “Data Structure packaging” component that uses a data structure packing technique which changes the behaviour of the compiler. A number of experiments have been conducted to examine the effectiveness and efficiency of the PDSSP approach by embedding it into two state-of-art algorithms called gSpan and Gaston. Proposed implementation have been compared to the official one. Almost all results show that the proposed implementation consumes less memory on each support level. As a result, PDSSP extensions can save memory and the peak memory usage may decrease up to 38% depending on the dataset.
References
- [1] Thusoo A, Shao Z, Anthony S, Borthakur D, Jain N, Sen Sarma J, Murthy R, Liu H. “Data warehousing and analytics infrastructure at Facebook”. In: ACM SIGMOD/PODS 2010 Conference; Indianapolis, Indiana, USA: ACM. pp. 1013–1020, 2010.
- [2] Wang L, Xiao Y, Shao B, Wang H. “How to partition a billion-node graph”. In: IEEE 2014 International Conference on Data Engineering; Chicago, IL, USA: IEEE; pp. 568–579, 2014
- [3] Muttipati AS, Padmaja P. “Analysis of large graph partitioning and frequent subgraph mining on graph data”. International Journal of Advanced Research in Computer Science 2015; pp. 6-7, 2015.
- [4] G. V. Carlos, M. Esteban, “Comparative Analysis of de Bruijn Graph Parallel Genome Assemblers”, 2018 IEEE International Work Conference on Bioinspired Intelligence (IWOBI), IEEE, pp 1-8, 2018.
- [5] Talukder, N. and Zaki, M.J., “A distributed approach for graph mining in massive networks”. Data Mining and Knowledge Discovery, 30(5), pp.1024-1052, 2016.
- [6] Di Fatta, G., & Berthold, M. R.. “High performance subgraph mining in molecular compounds”. In International Conference on High Performance Computing and Communications, Springer, Berlin, Heidelberg.,pp. 866-877, (2005, September).
- [7] Stratikopoulos A, Chrysos G, Papaefstathiou I, Dollas A. “HPC-gSpan: An FPGA-based parallel system for frequent subgraph mining”. In IEEE 2014 24th International Conference on Field Programmable Logic and Applications (FPL); Munich, Germany, IEEE. pp. 1-4, 2014.
- [8] Aggarwal CC, Bhuiyan MA, Al Hasan M. “Frequent pattern mining algorithms: A survey”. In: Aggarwal CC, Han J, editors. Frequent Pattern Mining. Switzerland: Springer International Publishing, pp. 19–64, 2014.
- [9] Anastasiu DC, Iverson J, Smith S, Karypis G. “Big data frequent pattern mining”. In: Aggarwal CC, Han J, editors. Frequent Pattern Mining. Switzerland: Springer International Publishing, pp. 225–259, 2014.
- [10] Yan X, Han J. “gSpan: Graph-based substructure pattern mining”. In: IEEE 2003 International Conference on Data Mining; Melbourne, Florida, USA: IEEE. pp. 721–724, 2003.
- [11] Nijssen S, Kok JN. “The Gaston tool for frequent subgraph mining”. Electronic Notes in Theoretical Computer Science, pp 127;77–87, 2005.
- [12] Yan X, Han J. “CloseGraph: Mining closed frequent graph patterns”. In: ACM SIGKDD 2003 International Conference on Knowledge Discovery and Data Mining; Washington, DC, USA: ACM. pp. 286–295, 2003.
- [13] Lakshmi K, Meyyappan DT. “A comparative study of frequent subgraph mining algorithms”. IJITCS 2012; 2; 2.
- [14] Borgelt C, Berthold MR. “Mining molecular fragments: Finding relevant substructures of molecules”. In: IEEE 2003 International Conference on Data Mining; Melbourne, Florida, USA: IEEE. pp. 51–58, 2003.
- [15] B. Guan, X.Z. Zan, B.Y. Xiao, R.N. Ma, F.Y. Zhang and W.B.Liu, "Detecting dense subgraphs in complex networks based on edge density coefficient", Chinese Journal of Electronics, Vol.22, No.3, pp.517–520, 2013.
- [16] Kuramochi M, Karypis G. “Frequent subgraph discovery”. In: IEEE 2001 International Conference on Data Mining; San Jose, California, USA: IEEE. pp. 313–320., 2001.
- [17] Vanetik N, Gudes E, Shimony SE. “Computing frequent graph patterns from semistructured data”. In: IEEE 2002 International Conference on Data Mining; Maebashi City, Japan: IEEE. pp. 458–465, 2002.
- [18] Rehman, S.U., Asghar, S. and Fong, S.J., “Optimized and Frequent Subgraphs: How Are They Related?”. IEEE Access, 6, pp.37237-37249, 2018.
- [19] Yang, S., Guo, R., Liu, R., Liao, X., Zou, Q., Shi, B. and Peng, S., “cmFSM: a scalable CPU-MIC coordinated drug-finding tool by frequent subgraph mining”. BMC bioinformatics, 19(4), p.98, 2018.
- [20] Horton I. “Working with fundamental data types”. In: Anglin S, lead editor. Beginning C++. New York, NY, USA: Apress Press, pp. 55–77, 2014.
- [21] Bader DA, Meyerhenke H, Sanders P, Wagner D. “Benchmarking for graph clustering and partitioning”. Encyclopedia of Social Network Analysis and Mining 2012; pp. 73–84, 2012.
- [22] Randal E. Bryant, David R. O'Hallaron, "Computer Systems: A programmer's Perspective", 3rd edition,Pearson, 2016.
- [23] Microsoft, "Working with Packing Structures", https://docs.microsoft.com/en-us/previous-versions/ms253935 (December 8, 2017).
- [24] "Valgrind and Massif Visualizer tools", http://valgrind.org/info/tools.html, (December 8, 2017).
- [25] Wale, Nikil, Ian A. Watson, and George Karypis. "Comparison of descriptor spaces for chemical compound retrieval and classification." Knowledge and Information Systems, pp: 347-375, 2008.
- [26] P. D. Dobson and A. J. Doig. “Distinguishing enzyme structures from non-enzymes without alignments”. Journal of Molecular Biology, Vol 330(4):771–783, 2003.
- [27] D. Wajdi, A. Sabeur, N. E. Mephu, “MR-SimLab: Scalable subgraph selection with label similarity for big data”. Information Systems, Elsevier, Vol 69, pp: 155-163, 2017.
- [28] Thoma M, Cheng H, Gretton A, Han J, Kriegel HP, Smola A, Song L, Yu PS, Yan X, Borgwardt KM. “Discriminative frequent subgraph mining with optimality guarantees”. Statistical Analysis and Data Mining. The ASA Data Science Journal; Vol 3: 302–18, 2010.
- [29] "Gaston official web site", http://liacs.leidenuniv.nl/~nijssensgr/gaston/index.html, (September 8, 2017).
- [30] Wörlein, M., Meinl, T., Fischer, I., & Philippsen, M., “A quantitative comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston.”, In: European Conference on Principles of Data Mining and Knowledge Discovery. Springer, Berlin, Heidelberg, p. 392-403, 2005.
A New Approach to Minimize Memory Requirements of Frequent Subgraph Mining Algorithms
Year 2021,
, 237 - 246, 01.03.2021
Turgay Bilgin
,
Murat Oğuz
Abstract
Frequent subgraph mining (FSM) is a subsection of graph mining domain which is extensively used for graph classification and graph clustering purposes. Over the past decade, many efficient FSM algorithms have been developed. The improvements generally focus on reducing time complexity by changing the algorithm structure or using parallel programming techniques. FSM algorithms have another problem to solve, which is the high memory consumption. In this study, a new approach called Predictive Dynamic Sized Structure Packing (PDSSP) have been proposed to minimize the memory requirement of FSM algorithms. Proposed approach redesigns the internal data structures of FSM algorithms without any algorithmic modifications. PDSSP has two contributions. The first one is the Dynamic Sized Integer Type (ds_Int) which is a newly designed unsigned integer data type. The second contribution is “Data Structure packaging” component that uses a data structure packing technique which changes the behaviour of the compiler. A number of experiments have been conducted to examine the effectiveness and efficiency of the PDSSP approach by embedding it into two state-of-art algorithms called gSpan and Gaston. Proposed implementation have been compared to the official one. Almost all results show that the proposed implementation consumes less memory on each support level. As a result, PDSSP extensions can save memory and the peak memory usage may decrease up to 38% depending on the dataset.
References
- [1] Thusoo A, Shao Z, Anthony S, Borthakur D, Jain N, Sen Sarma J, Murthy R, Liu H. “Data warehousing and analytics infrastructure at Facebook”. In: ACM SIGMOD/PODS 2010 Conference; Indianapolis, Indiana, USA: ACM. pp. 1013–1020, 2010.
- [2] Wang L, Xiao Y, Shao B, Wang H. “How to partition a billion-node graph”. In: IEEE 2014 International Conference on Data Engineering; Chicago, IL, USA: IEEE; pp. 568–579, 2014
- [3] Muttipati AS, Padmaja P. “Analysis of large graph partitioning and frequent subgraph mining on graph data”. International Journal of Advanced Research in Computer Science 2015; pp. 6-7, 2015.
- [4] G. V. Carlos, M. Esteban, “Comparative Analysis of de Bruijn Graph Parallel Genome Assemblers”, 2018 IEEE International Work Conference on Bioinspired Intelligence (IWOBI), IEEE, pp 1-8, 2018.
- [5] Talukder, N. and Zaki, M.J., “A distributed approach for graph mining in massive networks”. Data Mining and Knowledge Discovery, 30(5), pp.1024-1052, 2016.
- [6] Di Fatta, G., & Berthold, M. R.. “High performance subgraph mining in molecular compounds”. In International Conference on High Performance Computing and Communications, Springer, Berlin, Heidelberg.,pp. 866-877, (2005, September).
- [7] Stratikopoulos A, Chrysos G, Papaefstathiou I, Dollas A. “HPC-gSpan: An FPGA-based parallel system for frequent subgraph mining”. In IEEE 2014 24th International Conference on Field Programmable Logic and Applications (FPL); Munich, Germany, IEEE. pp. 1-4, 2014.
- [8] Aggarwal CC, Bhuiyan MA, Al Hasan M. “Frequent pattern mining algorithms: A survey”. In: Aggarwal CC, Han J, editors. Frequent Pattern Mining. Switzerland: Springer International Publishing, pp. 19–64, 2014.
- [9] Anastasiu DC, Iverson J, Smith S, Karypis G. “Big data frequent pattern mining”. In: Aggarwal CC, Han J, editors. Frequent Pattern Mining. Switzerland: Springer International Publishing, pp. 225–259, 2014.
- [10] Yan X, Han J. “gSpan: Graph-based substructure pattern mining”. In: IEEE 2003 International Conference on Data Mining; Melbourne, Florida, USA: IEEE. pp. 721–724, 2003.
- [11] Nijssen S, Kok JN. “The Gaston tool for frequent subgraph mining”. Electronic Notes in Theoretical Computer Science, pp 127;77–87, 2005.
- [12] Yan X, Han J. “CloseGraph: Mining closed frequent graph patterns”. In: ACM SIGKDD 2003 International Conference on Knowledge Discovery and Data Mining; Washington, DC, USA: ACM. pp. 286–295, 2003.
- [13] Lakshmi K, Meyyappan DT. “A comparative study of frequent subgraph mining algorithms”. IJITCS 2012; 2; 2.
- [14] Borgelt C, Berthold MR. “Mining molecular fragments: Finding relevant substructures of molecules”. In: IEEE 2003 International Conference on Data Mining; Melbourne, Florida, USA: IEEE. pp. 51–58, 2003.
- [15] B. Guan, X.Z. Zan, B.Y. Xiao, R.N. Ma, F.Y. Zhang and W.B.Liu, "Detecting dense subgraphs in complex networks based on edge density coefficient", Chinese Journal of Electronics, Vol.22, No.3, pp.517–520, 2013.
- [16] Kuramochi M, Karypis G. “Frequent subgraph discovery”. In: IEEE 2001 International Conference on Data Mining; San Jose, California, USA: IEEE. pp. 313–320., 2001.
- [17] Vanetik N, Gudes E, Shimony SE. “Computing frequent graph patterns from semistructured data”. In: IEEE 2002 International Conference on Data Mining; Maebashi City, Japan: IEEE. pp. 458–465, 2002.
- [18] Rehman, S.U., Asghar, S. and Fong, S.J., “Optimized and Frequent Subgraphs: How Are They Related?”. IEEE Access, 6, pp.37237-37249, 2018.
- [19] Yang, S., Guo, R., Liu, R., Liao, X., Zou, Q., Shi, B. and Peng, S., “cmFSM: a scalable CPU-MIC coordinated drug-finding tool by frequent subgraph mining”. BMC bioinformatics, 19(4), p.98, 2018.
- [20] Horton I. “Working with fundamental data types”. In: Anglin S, lead editor. Beginning C++. New York, NY, USA: Apress Press, pp. 55–77, 2014.
- [21] Bader DA, Meyerhenke H, Sanders P, Wagner D. “Benchmarking for graph clustering and partitioning”. Encyclopedia of Social Network Analysis and Mining 2012; pp. 73–84, 2012.
- [22] Randal E. Bryant, David R. O'Hallaron, "Computer Systems: A programmer's Perspective", 3rd edition,Pearson, 2016.
- [23] Microsoft, "Working with Packing Structures", https://docs.microsoft.com/en-us/previous-versions/ms253935 (December 8, 2017).
- [24] "Valgrind and Massif Visualizer tools", http://valgrind.org/info/tools.html, (December 8, 2017).
- [25] Wale, Nikil, Ian A. Watson, and George Karypis. "Comparison of descriptor spaces for chemical compound retrieval and classification." Knowledge and Information Systems, pp: 347-375, 2008.
- [26] P. D. Dobson and A. J. Doig. “Distinguishing enzyme structures from non-enzymes without alignments”. Journal of Molecular Biology, Vol 330(4):771–783, 2003.
- [27] D. Wajdi, A. Sabeur, N. E. Mephu, “MR-SimLab: Scalable subgraph selection with label similarity for big data”. Information Systems, Elsevier, Vol 69, pp: 155-163, 2017.
- [28] Thoma M, Cheng H, Gretton A, Han J, Kriegel HP, Smola A, Song L, Yu PS, Yan X, Borgwardt KM. “Discriminative frequent subgraph mining with optimality guarantees”. Statistical Analysis and Data Mining. The ASA Data Science Journal; Vol 3: 302–18, 2010.
- [29] "Gaston official web site", http://liacs.leidenuniv.nl/~nijssensgr/gaston/index.html, (September 8, 2017).
- [30] Wörlein, M., Meinl, T., Fischer, I., & Philippsen, M., “A quantitative comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston.”, In: European Conference on Principles of Data Mining and Knowledge Discovery. Springer, Berlin, Heidelberg, p. 392-403, 2005.