The International Classification of Diseases (ICD) standardizes diagnosis codes globally, aiding payments, research, planning, and quality management. Its complexity leads to longer exams, higher training costs, increased workforce needs, coding errors, and unreliable data. Automated ICD coding systems based on machine learning (ML) address these issues. Long medical notes complicate ML, making feature extraction crucial for efficient ICD classification. Despite numerous studies, no systematic analysis of feature extraction methods exists, especially for deep learning (DL).
The MIMIC-III dataset is used with two preprocessing combinations, fundamental and advanced. The TF-IDF, word2vec, GloVe, fastText, and BERT feature extraction methods are compared using DL models such as NN, CNN, and BiLSTM. For word2vec and fastText, the CBOW and skip-gram architectures are also compared. ROC-AUC, F1-score, precision, and recall are calculated to evaluate the DL models. Advanced preprocessing improves performance for all feature extraction and DL methods. The best results with advanced preprocessing are a micro ROC-AUC of 91.74% (BiLSTM+fastText (skip-gram)), a macro ROC-AUC of 88.58% (BiLSTM+word2vec (CBOW)), micro F1/precision of 64.84%/62.34% (BiLSTM+word2vec (CBOW)), a micro recall of 68.16% (BiLSTM+fastText (skip-gram)), macro F1/precision of 59.67%/57.71% (BiLSTM+word2vec (CBOW)), and a macro recall of 63.38% (BiLSTM+fastText (skip-gram)). FastText is the most successful feature extraction method in DL models with fundamental preprocessing. However, with well-implemented preprocessing, other feature extraction methods perform better and run faster. As DL model performance improves, the differences between feature extraction methods diminish. Although this study does not aim for the best possible results, CNN and BiLSTM with word2vec, GloVe, and fastText are competitive with current studies. Lastly, if computing power is limited, CNN may be preferable to BiLSTM with these feature extraction methods.
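The micro- and macro-averaged scores reported above weight labels differently, which matters for ICD coding because a few codes are very frequent and most are rare. The following is a minimal sketch of that distinction in plain Python. The toy label matrices and the helper names `prf` and `micro_macro` are illustrative assumptions, not the study's evaluation code, and macro F1 is taken here as the mean of per-label F1 scores, which may differ from the convention used in the paper.

```python
from typing import Dict, List, Sequence

Scores = Dict[str, float]


def prf(tp: int, fp: int, fn: int) -> Scores:
    """Precision, recall and F1 from raw counts (0.0 when a denominator is zero)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def micro_macro(
    y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]
) -> Dict[str, Scores]:
    """Micro and macro averages for multi-label 0/1 indicator matrices."""
    n_labels = len(y_true[0])
    counts: List[tuple] = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))

    # Macro: score each label separately, then take the unweighted mean.
    per_label = [prf(*c) for c in counts]
    macro = {
        key: sum(s[key] for s in per_label) / n_labels
        for key in ("precision", "recall", "f1")
    }

    # Micro: pool the counts over all labels, then score once.
    micro = prf(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )
    return {"micro": micro, "macro": macro}


# Hypothetical toy data: 4 notes, 3 codes; the first code is frequent, the others rare.
y_true = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 0, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 1, 0]]

scores = micro_macro(y_true, y_pred)
print("micro:", scores["micro"])
print("macro:", scores["macro"])
```

In this toy case the frequent code is predicted perfectly and both rare codes are missed, so the micro scores stay high while the macro scores drop sharply. The gap between the micro and macro figures quoted above follows the same pattern.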
deep learning (DL), natural language processing (NLP), feature extraction, international classification of diseases (ICD), MIMIC-III, medical notes
This work was supported by The Scientific and Technological Research Council of Türkiye (TUBITAK) - International Postdoctoral Research Fellowship Program (2219) of 2023. [grant number 1059B192302269].
Primary Language | English |
---|---|
Subjects | Applied Mathematics (Other) |
Journal Section | Research Articles |
Authors | |
Project Number | This work was supported by The Scientific and Technological Research Council of Türkiye (TUBITAK) - International Postdoctoral Research Fellowship Program (2219) of 2023. [grant number 1059B192302269]. |
Early Pub Date | July 15, 2025 |
Publication Date | June 30, 2025 |
Submission Date | March 26, 2025 |
Acceptance Date | June 29, 2025 |
Published in Issue | Year 2025 Volume: 5 Issue: 2 |