A Machine Learning–Based Pilot Study for the Classification of Thalassemia Subtypes Using Routine Laboratory Parameters
Abstract
Objective: Thalassemia is a hereditary hemoglobinopathy and remains a significant public health problem, particularly in Mediterranean regions. Although genetic testing represents the gold standard for subtype classification, access to such testing is limited in many clinical settings. This pilot study aimed to explore the feasibility of using machine learning models based on routinely available clinical and laboratory parameters to support the differentiation of thalassemia subtypes in the absence of genetic testing.
Methods: This retrospective cross-sectional study included 83 individuals (57 thalassemia major, 11 thalassemia intermedia, and 15 healthy controls). Demographic, clinical, and laboratory variables were analyzed using the R programming language. A supervised Random Forest algorithm was applied for multiclass classification. Model performance was assessed using accuracy, class-specific sensitivity, specificity, and area under the receiver operating characteristic curve (AUC). To further evaluate the distinction between thalassemia major and intermedia, a simplified logistic regression model was constructed, and Firth logistic regression was applied to address the small sample size and class imbalance.
Results: The Random Forest model demonstrated an overall test-set accuracy of 85.7%. Sensitivity was 80% for thalassemia major and 100% for both thalassemia intermedia and healthy controls. Variable importance analysis identified red cell distribution width (RDW), hematocrit, ferritin, and hemoglobin as the most influential predictors. In the simplified logistic regression model distinguishing thalassemia major from intermedia, RDW was the only variable reaching statistical significance (p = 0.0476). Model performance metrics, including high AUC values, should be interpreted cautiously given the limited sample size.
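The class-specific metrics named in the Methods (accuracy, sensitivity, specificity) can all be derived from a multiclass confusion matrix by treating each class in turn as "positive" and the other two as "negative". The sketch below illustrates that arithmetic in Python. The abstract does not report the test-set size or the confusion matrix, so the counts used here are hypothetical: they were chosen only because they are consistent with the reported 85.7% accuracy and 80% sensitivity for thalassemia major, and they are not the study's data. The function and variable names are likewise illustrative.

```python
from typing import Dict, List

# Class order used for both rows (true class) and columns (predicted class).
CLASSES = ["major", "intermedia", "control"]

# HYPOTHETICAL counts, not taken from the study: 14 test cases in total,
# with two thalassemia major cases misclassified as intermedia.
CONFUSION: List[List[int]] = [
    [8, 2, 0],  # true major
    [0, 2, 0],  # true intermedia
    [0, 0, 2],  # true control
]


def accuracy(cm: List[List[int]]) -> float:
    """Fraction of all cases on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def one_vs_rest(cm: List[List[int]], k: int) -> Dict[str, float]:
    """Sensitivity and specificity for class k against all other classes."""
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]                          # class k predicted as k
    fn = sum(cm[k]) - tp                   # class k predicted as something else
    fp = sum(row[k] for row in cm) - tp    # other classes predicted as k
    tn = total - tp - fn - fp              # everything else
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


print(f"accuracy: {accuracy(CONFUSION):.3f}")
for idx, name in enumerate(CLASSES):
    metrics = one_vs_rest(CONFUSION, idx)
    print(f"{name}: sensitivity={metrics['sensitivity']:.3f}, "
          f"specificity={metrics['specificity']:.3f}")
```

With so few cases per class, a single reclassified sample moves these values by large steps, which is one reason small-sample performance metrics call for cautious interpretation.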
Details
Primary Language
English
Subjects
Cardiovascular Medicine and Haematology (Other)
Journal Section
Research Article
Authors
Volkan Karakuş
0000-0001-9178-2850
Türkiye
Ayşegül Kurtoğlu
0000-0002-6033-4139
Türkiye
Erdal Kurtoğlu
0000-0002-6867-6053
Türkiye
Publication Date
March 6, 2026
Submission Date
December 30, 2025
Acceptance Date
January 25, 2026
Published in Issue
Year 2026 Volume: 8