Evaluating the Impact of Temperature and Instruction Strategies on Hallucination in Large Language Models
Abstract
Large Language Models (LLMs) have demonstrated impressive generative and reasoning abilities, yet their tendency to produce factually incorrect or fabricated information—so-called hallucinations—remains a key limitation. This study systematically examines how temperature and system instruction strategies affect hallucination behavior in open-source LLMs executed through the Ollama framework. Three representative models—Gemma 2B, Mistral 7B Instruct, and Phi-3 Mini—were evaluated on the TruthfulQA benchmark using zero-shot, few-shot, and “say-I-don’t-know” prompting paradigms. Performance was measured through exact match, token-level F1, semantic similarity, and embedding-based similarity metrics. Two-way ANOVA and3 Tukey post-hoc analyses revealed that system instruction significantly influenced factual accuracy across all models, while temperature effects were comparatively minor. Few-shot prompting achieved the highest mean F1 score (0.1889), indicating that example conditioning effectively constrained hallucinations. Conversely, “say-I-don’t-know” prompts increased semantic alignment but reduced precision, suggesting a conservative refusal bias. Embedding-based similarity analyses confirmed higher semantic consistency for zero-shot responses. The results highlight that prompt design exerts a stronger and more interpretable influence on hallucination than sampling stochasticity, offering practical guidance for improving the factual reliability of open-source LLMs.
Keywords
References
- Abdin, M., Aneja, J., Awadalla, H., Awadallah, A., Awan, A. A., Bach, N., Bahree, A., Bakhtiari, A., Bao, J., Behl, H., Benhaim, A., Bilenko, M., Bjorck, J., Bubeck, S., Cai, M., Cai, Q., Chaudhary, V., Chen, D., Chen, D., … Zhou, X. (2024). Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. https://doi.org/10.48550/arXiv.2404.14219
- Chen, C., & Shu, K. (2023). Can LLM-Generated Misinformation Be Detected?. In: Proceedings of the 12th International Conference on Learning Representations (ICLR 2024). https://doi.org/10.48550/arXiv.2309.13788
- Coeckelbergh, M. (2025). LLMs, Truth, and Democracy: An Overview of Risks. Science and Engineering Ethics, 31(1), 4. https://doi.org/10.1007/S11948-025-00529-0
- Dahl, M., Magesh, V., Suzgun, M., & Ho, D. E. (2024). Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models. Journal of Legal Analysis, 16(1), 64–93. https://doi.org/10.1093/JLA/LAAE003
- Datasets. (2025). Hugging Face. https://huggingface.co/docs/datasets/en/index
- DeepSeek. (2024). deepseek-llm. DeepSeek. https://ollama.com/library/deepseek-llm?utm_source=chatgpt.com
- Du, W., Yang, Y., & Welleck, S. (2025, July 13-19). Optimizing Temperature for Language Models with Multi-Sample Inference. In: A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, & J. Zhu (Eds.), Proceedings of the 42nd International Conference on Machine Learning (PMLR 267), (pp. 14648–14668), Vancouver, Canada. https://proceedings.mlr.press/v267/du25f.html
- Dziri, N., Milton, S., Yu, M., Zaiane, O., & Reddy, S. (2022). On the Origin of Hallucinations in Conversational Models: Is it the Datasets or the Models?. In: M. Carpuat, M.-C. de Marneffe, & I. V. M. Ruiz (Eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2022), (pp. 5271–5285), Seattle, United States. https://doi.org/10.18653/v1/2022.naacl-main.387
Details
Primary Language
English
Subjects
Knowledge Representation and Reasoning, Natural Language Processing
Journal Section
Research Article
Authors
Early Pub Date
January 16, 2026
Publication Date
January 16, 2026
Submission Date
November 6, 2025
Acceptance Date
January 1, 2026
Published in Issue
Year 2026 Volume: 13 Number: 1
APA
Kabakuş, A. T. (2026). Evaluating the Impact of Temperature and Instruction Strategies on Hallucination in Large Language Models. Gazi University Journal of Science Part A: Engineering and Innovation, 13(1), 13-32. https://doi.org/10.54287/gujsa.1819131
AMA
1.Kabakuş AT. Evaluating the Impact of Temperature and Instruction Strategies on Hallucination in Large Language Models. GU J Sci, Part A. 2026;13(1):13-32. doi:10.54287/gujsa.1819131
Chicago
Kabakuş, Abdullah Talha. 2026. “Evaluating the Impact of Temperature and Instruction Strategies on Hallucination in Large Language Models”. Gazi University Journal of Science Part A: Engineering and Innovation 13 (1): 13-32. https://doi.org/10.54287/gujsa.1819131.
EndNote
Kabakuş AT (March 1, 2026) Evaluating the Impact of Temperature and Instruction Strategies on Hallucination in Large Language Models. Gazi University Journal of Science Part A: Engineering and Innovation 13 1 13–32.
IEEE
[1]A. T. Kabakuş, “Evaluating the Impact of Temperature and Instruction Strategies on Hallucination in Large Language Models”, GU J Sci, Part A, vol. 13, no. 1, pp. 13–32, Mar. 2026, doi: 10.54287/gujsa.1819131.
ISNAD
Kabakuş, Abdullah Talha. “Evaluating the Impact of Temperature and Instruction Strategies on Hallucination in Large Language Models”. Gazi University Journal of Science Part A: Engineering and Innovation 13/1 (March 1, 2026): 13-32. https://doi.org/10.54287/gujsa.1819131.
JAMA
1.Kabakuş AT. Evaluating the Impact of Temperature and Instruction Strategies on Hallucination in Large Language Models. GU J Sci, Part A. 2026;13:13–32.
MLA
Kabakuş, Abdullah Talha. “Evaluating the Impact of Temperature and Instruction Strategies on Hallucination in Large Language Models”. Gazi University Journal of Science Part A: Engineering and Innovation, vol. 13, no. 1, Mar. 2026, pp. 13-32, doi:10.54287/gujsa.1819131.
Vancouver
1.Abdullah Talha Kabakuş. Evaluating the Impact of Temperature and Instruction Strategies on Hallucination in Large Language Models. GU J Sci, Part A. 2026 Mar. 1;13(1):13-32. doi:10.54287/gujsa.1819131