Research Article

Classification of colorectal cancer based on gene sequencing data with XGBoost model: An application of public health informatics

Year 2022, Volume 47, Issue 3, pp. 1179-1186, 30.09.2022
https://doi.org/10.17826/cumj.1128653

Abstract

Purpose: This study aims to classify open-access colorectal cancer gene data and to identify important genes using XGBoost, a machine learning method.
Materials and Methods: An open-access colorectal cancer gene dataset was used. The dataset comprised gene sequencing results from 10 mucosal samples from healthy controls and colonic mucosal samples from 12 patients with colorectal cancer. XGBoost was used to classify the disease. Model performance was evaluated using accuracy, balanced accuracy, sensitivity, specificity, positive predictive value, and negative predictive value.
Results: Seventeen genes were selected by the variable selection method and used as input variables for modeling. The accuracy, balanced accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and F1 score of the model were 95.5%, 95.8%, 91.7%, 100%, 100%, 90.9%, and 95.7%, respectively. According to the variable importance obtained from the XGBoost model, the CYR61, NR4A, FOSB, and NR4A2 genes can be employed as biomarkers for colorectal cancer.
Conclusion: This research identified genes that may be linked to colorectal cancer and candidate genetic biomarkers for the disease. In future work, the reliability of the detected genes can be verified, therapeutic procedures can be developed based on them, and their usefulness in clinical practice can be documented.
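
The performance metrics reported in the Results can be reproduced from a single confusion matrix, which may help when checking the figures. The sketch below is illustrative only: the abstract does not give the confusion matrix, so the counts (11 true positives, 1 false negative, 10 true negatives, 0 false positives) are an assumption inferred from the 12 patients, 10 controls, and the reported percentages. The names `ConfusionMatrix`, `metrics`, and `ASSUMED` are hypothetical and do not come from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Binary confusion matrix; 'positive' means colorectal cancer."""
    tp: int  # cancer samples classified as cancer
    fn: int  # cancer samples classified as healthy
    tn: int  # healthy samples classified as healthy
    fp: int  # healthy samples classified as cancer


def metrics(cm: ConfusionMatrix) -> dict:
    """Compute the standard metrics named in the abstract as proportions."""
    total = cm.tp + cm.fn + cm.tn + cm.fp
    sensitivity = cm.tp / (cm.tp + cm.fn)
    specificity = cm.tn / (cm.tn + cm.fp)
    return {
        "accuracy": (cm.tp + cm.tn) / total,
        "balanced accuracy": (sensitivity + specificity) / 2,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "positive predictive value": cm.tp / (cm.tp + cm.fp),
        "negative predictive value": cm.tn / (cm.tn + cm.fn),
        "F1 score": 2 * cm.tp / (2 * cm.tp + cm.fp + cm.fn),
    }


# ASSUMPTION: counts inferred from the abstract (12 patients, 10 controls),
# not reported by the authors.
ASSUMED = ConfusionMatrix(tp=11, fn=1, tn=10, fp=0)

for name, value in metrics(ASSUMED).items():
    print(f"{name}: {value:.1%}")
```

Under these assumed counts, a specificity and positive predictive value of 100% correspond to a proportion of 1, which is consistent with the other reported values.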

There are 39 citations in total.

Details

Primary Language: English
Subjects: Clinical Sciences
Journal Section: Research
Authors

Sami Akbulut 0000-0002-6864-7711

Zeynep Küçükakçalı 0000-0001-7956-9272

Cemil Çolak 0000-0001-5406-098X

Publication Date: September 30, 2022
Acceptance Date: July 25, 2022
Published in Issue: Year 2022, Volume 47, Issue 3

Cite

MLA Akbulut, Sami et al. “Classification of Colorectal Cancer Based on Gene Sequencing Data With XGBoost Model: An Application of Public Health Informatics”. Cukurova Medical Journal, vol. 47, no. 3, 2022, pp. 1179-86, doi:10.17826/cumj.1128653.