Fabricated or accurate? Ethical concerns and citation hallucination in AI-generated scientific writing on musculoskeletal topics

Ertuğrul Safran; Adem Çalı

doi:10.38053/acmj.1746227

Fabricated or real? Ethical problems and citation hallucinations in AI-generated scientific writing on musculoskeletal topics

Abstract

Aim: Large language models (LLMs) such as ChatGPT are increasingly used in academic and clinical writing. Although these tools can produce coherent, domain-specific text, concerns persist about the accuracy of automatically generated references. In musculoskeletal rehabilitation, a field that relies heavily on current evidence, citation reliability is especially critical. However, studies that systematically evaluate citation accuracy in AI-generated scientific content are lacking. This study aimed to evaluate the accuracy of references in scientific texts generated by ChatGPT (GPT-4) in response to prompts on musculoskeletal rehabilitation topics, and to determine whether a structured verification process improves reference accuracy. Methods: ChatGPT was asked to generate four scientific paragraphs, each containing 10 DOIs, on manual therapy, anterior cruciate ligament (ACL) reconstruction, low back pain, and rotator cuff repair. A total of 40 references were rated on a 3-point scale: 0 = fabricated, 1 = partially correct, 2 = fully accurate. After the initial evaluation, ChatGPT was asked to verify and correct its references. Initial and final scores were compared descriptively. Results: In the initial generation, only 7.5% of references were fully accurate, while 42.5% were entirely fabricated. The remaining 50% were partially correct. After the verification process, the proportion of fully accurate references rose to 77.5%. The most common errors were invalid DOIs, fabricated article titles, and mismatched metadata. Conclusion: Although ChatGPT can generate coherent scientific text, its references are frequently inaccurate or fabricated. Post-generation verification substantially improves reference accuracy. Caution is needed when AI tools are used in academic and clinical musculoskeletal contexts, particularly with regard to citation validity.

Abstract

Aims: Large language models (LLMs) such as ChatGPT are increasingly used in academic and clinical writing. While these tools can generate coherent and domain-specific text, concerns persist regarding the accuracy of their automatically generated references. In musculoskeletal rehabilitation, a field heavily reliant on current evidence, the reliability of citations is especially critical. Yet systematic evaluations of citation accuracy in AI-generated scientific content are lacking. This study aimed to evaluate the reference accuracy of scientific texts generated by ChatGPT (GPT-4) in response to musculoskeletal rehabilitation prompts, and to determine whether reference accuracy improves following structured post-generation verification.
Methods: ChatGPT was prompted to generate four scientific paragraphs on musculoskeletal rehabilitation topics (manual therapy, ACL reconstruction, low back pain, and rotator cuff repair), each including 10 references with DOIs. A total of 40 references were scored for citation quality on a 3-point scale (0 = fabricated, 1 = partially correct, 2 = fully accurate). After the initial evaluation, ChatGPT was asked to verify and revise its references. Scores before and after this step were compared descriptively and with Wilcoxon signed-rank tests, and effect sizes (r) were calculated to estimate the magnitude of improvement.
Results: Only 7.5% of references were fully accurate in the initial generation, while 42.5% were completely fabricated. The remaining 50% were partially correct. After verification, the proportion of fully accurate references rose to 77.5%. Wilcoxon signed-rank testing confirmed a statistically significant improvement in accuracy across all prompts (W=561.0, p<0.001, r=0.60). The most common errors included invalid DOIs, fabricated article titles, and mismatched metadata.
Conclusion: ChatGPT can generate coherent scientific content, but its initial references are frequently inaccurate or fabricated. Structured post-generation verification significantly improves reference accuracy, as confirmed by statistical testing. These findings suggest that LLMs may be integrated as drafting tools in academic and clinical musculoskeletal contexts, but only when accompanied by strict human-led verification of citations.
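
The scoring and comparison described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code. The function names `score_distribution` and `wilcoxon_signed_rank` are hypothetical, and the paired scores in the usage example are made-up placeholders. The effect size is assumed to be r = |Z| / sqrt(N), with N taken as the number of pairs and Z taken from the normal approximation without tie or continuity correction; the abstract does not say which conventions were used. The only values taken from the source are the 40 references and the reported initial percentages, which correspond to 3, 17, and 20 references.

```python
import math
from collections import Counter

# 3-point scale from the abstract.
SCORE_LABELS = {0: "fabricated", 1: "partially correct", 2: "fully accurate"}


def score_distribution(scores):
    """Return the percentage of references at each score level (0, 1, 2)."""
    counts = Counter(scores)
    total = len(scores)
    return {label: 100.0 * counts.get(level, 0) / total
            for level, label in SCORE_LABELS.items()}


def wilcoxon_signed_rank(before, after):
    """Return (W_plus, z, r) for paired scores.

    Assumptions (not confirmed by the source): zero differences are dropped,
    tied absolute differences share their average rank, z comes from the
    normal approximation with no tie or continuity correction, and
    r = |z| / sqrt(number of pairs).
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    r = abs(z) / math.sqrt(len(before))
    return w_plus, z, r


# Counts implied by the reported initial percentages of 40 references:
# 7.5% -> 3 fully accurate, 42.5% -> 17 fabricated, 50% -> 20 partially correct.
# The ordering of the list is arbitrary.
initial_scores = [2] * 3 + [0] * 17 + [1] * 20
print(score_distribution(initial_scores))

# Hypothetical paired scores, for illustration only (not the study data).
toy_before = [0, 0, 1, 1, 2, 1]
toy_after = [2, 1, 2, 2, 2, 1]
print(wilcoxon_signed_rank(toy_before, toy_after))
```

With so few levels on the scale, most differences tie, so a full analysis would normally rely on a statistics package that applies a tie correction; the sketch only shows where W, Z, and r come from.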

Details

Primary Language

English

Subjects

Physiotherapy

Journal Section

Research Article

Authors

Ertuğrul Safran*
0000-0002-6835-5428
Türkiye

Adem Çalı
0000-0001-7846-4647
Türkiye

Publication Date

September 15, 2025

Submission Date

July 19, 2025

Acceptance Date

September 10, 2025

Published in Issue

Year 2025 Volume: 7 Number: 5

DOI

https://doi.org/10.38053/acmj.1746227

IZ

https://izlik.org/JA39GJ56ZX

Cite

APA

Safran, E., & Çalı, A. (2025). Fabricated or accurate? Ethical concerns and citation hallucination in AI-generated scientific writing on musculoskeletal topics. Anatolian Current Medical Journal, 7(5), 695-702. https://doi.org/10.38053/acmj.1746227
