Discovering Latent Themes in Heart Disease Article Abstracts: A Topic Modeling Approach

Burcu Baştürk; Aytuğ Onan

doi:10.21205/deufmd.2025278007

Research Article

Year 2025, Volume: 27 Issue: 80, 216 - 223, 23.05.2025

Abstract

Heart disease is a global public health problem, and its extensive literature requires in-depth analysis to uncover specific themes and relationships. This study aimed to identify latent themes in 5,000 heart disease-related abstracts retrieved from PubMed using topic modeling techniques, and to calculate the coherence of those themes. The original abstracts were paraphrased using ChatGPT and NLTK (Natural Language Toolkit), followed by extensive preprocessing, including tokenization, stop-word removal, stemming, and lemmatization. For effective feature extraction, the text data were vectorized using TF-IDF (term frequency-inverse document frequency). Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and Non-Negative Matrix Factorization (NMF) were applied to reveal key thematic structures. Coherence scores were calculated and compared across different numbers of topics (5 to 50) for each model and annotation method. This approach provides a valuable methodology for summarizing large amounts of information, allowing researchers to efficiently navigate the complex landscape of heart disease literature and identify critical areas of focus. The findings aim to improve understanding of heart disease and support future research in this vital area.
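The pipeline described above (preprocessing, TF-IDF vectorization, topic extraction, coherence comparison across topic counts) can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. The four toy abstracts, the short stop-word list, and all function names are hypothetical. Stemming, lemmatization, LDA, and LSA are omitted for brevity, NMF is fitted with the multiplicative updates of Lee and Seung, and UMass coherence is assumed as the coherence measure because the abstract does not name one.

```python
import math
import random
import re

# Tiny illustrative stop-word list (an assumption; NLTK ships a fuller one).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "to", "with", "for"}

def preprocess(text):
    """Lowercase, tokenize on letter runs, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def tfidf(docs):
    """Return (vocabulary, matrix) using relative term frequency and a
    smoothed IDF, ln((1 + N) / (1 + df)) + 1."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = {t: sum(t in d for d in docs) for t in vocab}
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    return vocab, [[d.count(t) / len(d) * idf[t] for t in vocab] for d in docs]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def frobenius_error(v, w, h):
    """Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return math.sqrt(sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw)))

def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factor V ~ W H (all entries non-negative) with multiplicative updates."""
    rng = random.Random(seed)
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in v]
    h = [[rng.random() + 0.1 for _ in v[0]] for _ in range(k)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(len(h[0]))]
             for i in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(len(w))]
    return w, h

def top_terms(h_row, vocab, n=3):
    """Terms with the largest weights in one topic row of H."""
    order = sorted(range(len(vocab)), key=lambda j: h_row[j], reverse=True)
    return [vocab[j] for j in order[:n]]

def umass_coherence(top_words, docs):
    """UMass coherence: sum over ordered pairs of log((D(wm, wl) + 1) / D(wl)),
    where D counts the documents containing the given word(s)."""
    doc_sets = [set(d) for d in docs]
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co = sum(top_words[m] in s and top_words[l] in s for s in doc_sets)
            dl = sum(top_words[l] in s for s in doc_sets)
            score += math.log((co + 1) / dl)
    return score

# Hypothetical toy abstracts, invented for illustration only.
abstracts = [
    "Statin therapy lowers cholesterol in patients with coronary artery disease",
    "Cholesterol and statin dose in coronary patients",
    "Heart failure patients show reduced ejection fraction",
    "Ejection fraction predicts outcome in heart failure",
]
docs = [preprocess(a) for a in abstracts]
vocab, v = tfidf(docs)

# The study sweeps 5 to 50 topics; the toy corpus only supports a few.
for k in (2, 3):
    w, h = nmf(v, k)
    topics = [top_terms(row, vocab) for row in h]
    mean_coherence = sum(umass_coherence(t, docs) for t in topics) / k
    print(k, topics, round(mean_coherence, 3))
```

On a real corpus, the same loop would typically run over library implementations of LDA, LSA, and NMF, and the topic count with the highest mean coherence would be a candidate for closer inspection.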

Keywords

Heart Disease, Topic Modeling, Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), Non-Negative Matrix Factorization (NMF), Coherence Scores, Natural Language Processing (NLP)

Details

Primary Language	English
Subjects	Performance Evaluation
Journal Section	Research Article
Authors	Burcu Baştürk (ORCID 0009-0005-4781-353X); Aytuğ Onan (ORCID 0000-0002-9434-5880)
Early Pub Date	May 12, 2025
Publication Date	May 23, 2025
Submission Date	June 19, 2024
Acceptance Date	August 12, 2024
Published in Issue	Year 2025 Volume: 27 Issue: 80

Cite

APA	Baştürk, B., & Onan, A. (2025). Discovering Latent Themes in Heart Disease Article Abstracts: A Topic Modeling Approach. Dokuz Eylül Üniversitesi Mühendislik Fakültesi Fen Ve Mühendislik Dergisi, 27(80), 216-223. https://doi.org/10.21205/deufmd.2025278007
AMA	Baştürk B, Onan A. Discovering Latent Themes in Heart Disease Article Abstracts: A Topic Modeling Approach. DEUFMD. May 2025;27(80):216-223. doi:10.21205/deufmd.2025278007
Chicago	Baştürk, Burcu, and Aytuğ Onan. “Discovering Latent Themes in Heart Disease Article Abstracts: A Topic Modeling Approach”. Dokuz Eylül Üniversitesi Mühendislik Fakültesi Fen Ve Mühendislik Dergisi 27, no. 80 (May 2025): 216-23. https://doi.org/10.21205/deufmd.2025278007.
EndNote	Baştürk B, Onan A (May 1, 2025) Discovering Latent Themes in Heart Disease Article Abstracts: A Topic Modeling Approach. Dokuz Eylül Üniversitesi Mühendislik Fakültesi Fen ve Mühendislik Dergisi 27 80 216–223.
IEEE	B. Baştürk and A. Onan, “Discovering Latent Themes in Heart Disease Article Abstracts: A Topic Modeling Approach”, DEUFMD, vol. 27, no. 80, pp. 216–223, 2025, doi: 10.21205/deufmd.2025278007.
ISNAD	Baştürk, Burcu - Onan, Aytuğ. “Discovering Latent Themes in Heart Disease Article Abstracts: A Topic Modeling Approach”. Dokuz Eylül Üniversitesi Mühendislik Fakültesi Fen ve Mühendislik Dergisi 27/80 (May 2025), 216-223. https://doi.org/10.21205/deufmd.2025278007.
JAMA	Baştürk B, Onan A. Discovering Latent Themes in Heart Disease Article Abstracts: A Topic Modeling Approach. DEUFMD. 2025;27:216–223.
MLA	Baştürk, Burcu and Aytuğ Onan. “Discovering Latent Themes in Heart Disease Article Abstracts: A Topic Modeling Approach”. Dokuz Eylül Üniversitesi Mühendislik Fakültesi Fen Ve Mühendislik Dergisi, vol. 27, no. 80, 2025, pp. 216-23, doi:10.21205/deufmd.2025278007.
Vancouver	Baştürk B, Onan A. Discovering Latent Themes in Heart Disease Article Abstracts: A Topic Modeling Approach. DEUFMD. 2025;27(80):216-23.
