Medical Text Classification Using Semisupervised Learning and Bert-Based Models

Fatih Soygazi; Damla Oğuz

doi:10.46387/bjesr.1597329

Year 2025, Volume: 7 Issue: 1, 60 - 69, 30.04.2025

Abstract

Medical text classification organizes complex medical texts but faces challenges such as insufficient training data. This paper proposes a novel method for categorizing medical texts based on a dataset of health problem abstracts and their labels. We applied data representation techniques to our labeled dataset and employed various machine learning algorithms for text classification. Initial results were unsatisfactory because the labeled data were limited. To improve them, we applied data augmentation techniques to an unlabeled dataset, using BERT-based models (BioBERT, ClinicalBERT) to enrich the labeled data. Two voting mechanisms, hard voting and soft voting, were employed to validate the newly labeled records before adding them to the dataset. After augmenting the labeled data, we re-applied the machine learning algorithms. The results demonstrated that our approach significantly improves the performance of medical text classification, effectively addressing the challenges posed by limited labeled data and enhancing overall accuracy.
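
The abstract names hard and soft voting but does not spell out the rule used to accept a newly labeled record. The sketch below shows one minimal way such voting could work, assuming each model (for example, classifiers built on BioBERT and ClinicalBERT) outputs class probabilities for every unlabeled record. The function names, the toy probabilities, and the acceptance thresholds are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def hard_vote(probs):
    """Majority vote over each model's top class.

    probs has shape (n_models, n_samples, n_classes).
    Returns the winning label per sample and the fraction of models
    that voted for it. Ties go to the lowest class index.
    """
    n_models, _, n_classes = probs.shape
    preds = probs.argmax(axis=2)  # (n_models, n_samples)
    counts = np.stack(
        [(preds == c).sum(axis=0) for c in range(n_classes)], axis=1
    )  # (n_samples, n_classes)
    return counts.argmax(axis=1), counts.max(axis=1) / n_models


def soft_vote(probs, weights=None):
    """Average the class probabilities across models, optionally weighted.

    Returns the label with the highest mean probability per sample
    and that mean probability.
    """
    mean_probs = np.average(probs, axis=0, weights=weights)
    return mean_probs.argmax(axis=1), mean_probs.max(axis=1)


def select_pseudo_labels(labels, scores, threshold):
    """Keep only the samples whose vote score reaches the threshold."""
    keep = scores >= threshold
    return np.flatnonzero(keep), labels[keep]


# Toy probabilities (assumed values): 2 models, 3 unlabeled records, 2 classes.
probs = np.array([
    [[0.90, 0.10], [0.40, 0.60], [0.55, 0.45]],  # hypothetical model A
    [[0.80, 0.20], [0.30, 0.70], [0.20, 0.80]],  # hypothetical model B
])

# Hard voting: accept a record only when every model agrees.
hard_labels, agreement = hard_vote(probs)
hard_idx, hard_kept = select_pseudo_labels(hard_labels, agreement, 1.0)

# Soft voting: accept a record only when the mean probability is high.
soft_labels, confidence = soft_vote(probs)
soft_idx, soft_kept = select_pseudo_labels(soft_labels, confidence, 0.8)
```

With these toy values, hard voting accepts records 0 and 1, where both models agree, while soft voting with a 0.8 threshold accepts only record 0. Accepted records would then be appended to the labeled set before the classifiers are retrained.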

Keywords

BioBERT, ClinicalBERT, Clinical Text Classification, Data Augmentation, Voting Mechanisms

The article has 31 references in total.

Details

Primary Language	English
Subjects	Deep Learning, Semi- and Unsupervised Learning, Machine Learning (Other)
Section	Research Article
Authors	Fatih Soygazi 0000-0001-8426-2283; Damla Oğuz 0000-0001-6556-7444
Submission Date	December 6, 2024
Acceptance Date	February 13, 2025
Early View Date	April 28, 2025
Publication Date	April 30, 2025
Published in Issue	Year 2025 Volume: 7 Issue: 1

Cite

APA	Soygazi, F., & Oğuz, D. (2025). Medical Text Classification Using Semisupervised Learning and Bert-Based Models. Mühendislik Bilimleri ve Araştırmaları Dergisi, 7(1), 60-69. https://doi.org/10.46387/bjesr.1597329
AMA	Soygazi F, Oğuz D. Medical Text Classification Using Semisupervised Learning and Bert-Based Models. Müh.Bil.ve Araş.Dergisi. April 2025;7(1):60-69. doi:10.46387/bjesr.1597329
Chicago	Soygazi, Fatih, and Damla Oğuz. “Medical Text Classification Using Semisupervised Learning and Bert-Based Models”. Mühendislik Bilimleri ve Araştırmaları Dergisi 7, no. 1 (April 2025): 60-69. https://doi.org/10.46387/bjesr.1597329.
EndNote	Soygazi F, Oğuz D (01 April 2025) Medical Text Classification Using Semisupervised Learning and Bert-Based Models. Mühendislik Bilimleri ve Araştırmaları Dergisi 7 1 60–69.
IEEE	F. Soygazi and D. Oğuz, “Medical Text Classification Using Semisupervised Learning and Bert-Based Models”, Müh.Bil.ve Araş.Dergisi, vol. 7, no. 1, pp. 60–69, 2025, doi: 10.46387/bjesr.1597329.
ISNAD	Soygazi, Fatih - Oğuz, Damla. “Medical Text Classification Using Semisupervised Learning and Bert-Based Models”. Mühendislik Bilimleri ve Araştırmaları Dergisi 7/1 (April 2025), 60-69. https://doi.org/10.46387/bjesr.1597329.
JAMA	Soygazi F, Oğuz D. Medical Text Classification Using Semisupervised Learning and Bert-Based Models. Müh.Bil.ve Araş.Dergisi. 2025;7:60–69.
MLA	Soygazi, Fatih and Damla Oğuz. “Medical Text Classification Using Semisupervised Learning and Bert-Based Models”. Mühendislik Bilimleri ve Araştırmaları Dergisi, vol. 7, no. 1, 2025, pp. 60-69, doi:10.46387/bjesr.1597329.
Vancouver	Soygazi F, Oğuz D. Medical Text Classification Using Semisupervised Learning and Bert-Based Models. Müh.Bil.ve Araş.Dergisi. 2025;7(1):60-9.
