Research Article

Evaluating the Impact of Temperature and Instruction Strategies on Hallucination in Large Language Models

Volume: 13 Number: 1 January 16, 2026

Abstract

Large Language Models (LLMs) have demonstrated impressive generative and reasoning abilities, yet their tendency to produce factually incorrect or fabricated information—so-called hallucinations—remains a key limitation. This study systematically examines how temperature and system instruction strategies affect hallucination behavior in open-source LLMs executed through the Ollama framework. Three representative models—Gemma 2B, Mistral 7B Instruct, and Phi-3 Mini—were evaluated on the TruthfulQA benchmark using zero-shot, few-shot, and “say-I-don’t-know” prompting paradigms. Performance was measured through exact match, token-level F1, semantic similarity, and embedding-based similarity metrics. Two-way ANOVA and Tukey post-hoc analyses revealed that system instruction significantly influenced factual accuracy across all models, while temperature effects were comparatively minor. Few-shot prompting achieved the highest mean F1 score (0.1889), indicating that example conditioning effectively constrained hallucinations. Conversely, “say-I-don’t-know” prompts increased semantic alignment but reduced precision, suggesting a conservative refusal bias. Embedding-based similarity analyses confirmed higher semantic consistency for zero-shot responses. The results highlight that prompt design exerts a stronger and more interpretable influence on hallucination than sampling stochasticity, offering practical guidance for improving the factual reliability of open-source LLMs.
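The token-level F1 metric named in the abstract can be sketched as below. This is a minimal illustration in the style of SQuAD-like evaluation (whitespace tokenization, bag-of-tokens overlap), not the authors' exact implementation; the function name and tokenization choice are assumptions for the example.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a model answer and a reference answer.

    Precision = overlapping tokens / predicted tokens,
    Recall    = overlapping tokens / reference tokens,
    F1        = harmonic mean of the two.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Degenerate case: one side empty -> F1 is 1.0 only if both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# A verbose but partially correct answer scores between 0 and 1:
print(round(token_f1("the eiffel tower in paris", "eiffel tower"), 3))  # → 0.571
```

Under such a metric, refusal-style answers ("I don't know") share no tokens with the reference and score 0, which is consistent with the abstract's observation that "say-I-don't-know" prompting reduces precision-based scores even when it avoids fabrication.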

Keywords

Details

Primary Language

English

Subjects

Knowledge Representation and Reasoning, Natural Language Processing

Journal Section

Research Article

Early Pub Date

January 16, 2026

Publication Date

January 16, 2026

Submission Date

November 6, 2025

Acceptance Date

January 1, 2026

Published in Issue

Year 2026 Volume: 13 Number: 1

APA
Kabakuş, A. T. (2026). Evaluating the Impact of Temperature and Instruction Strategies on Hallucination in Large Language Models. Gazi University Journal of Science Part A: Engineering and Innovation, 13(1), 13-32. https://doi.org/10.54287/gujsa.1819131
AMA
1. Kabakuş AT. Evaluating the Impact of Temperature and Instruction Strategies on Hallucination in Large Language Models. GU J Sci, Part A. 2026;13(1):13-32. doi:10.54287/gujsa.1819131
Chicago
Kabakuş, Abdullah Talha. 2026. “Evaluating the Impact of Temperature and Instruction Strategies on Hallucination in Large Language Models”. Gazi University Journal of Science Part A: Engineering and Innovation 13 (1): 13-32. https://doi.org/10.54287/gujsa.1819131.
EndNote
Kabakuş AT (March 1, 2026) Evaluating the Impact of Temperature and Instruction Strategies on Hallucination in Large Language Models. Gazi University Journal of Science Part A: Engineering and Innovation 13 1 13–32.
IEEE
[1] A. T. Kabakuş, “Evaluating the Impact of Temperature and Instruction Strategies on Hallucination in Large Language Models”, GU J Sci, Part A, vol. 13, no. 1, pp. 13–32, Mar. 2026, doi: 10.54287/gujsa.1819131.
ISNAD
Kabakuş, Abdullah Talha. “Evaluating the Impact of Temperature and Instruction Strategies on Hallucination in Large Language Models”. Gazi University Journal of Science Part A: Engineering and Innovation 13/1 (March 1, 2026): 13-32. https://doi.org/10.54287/gujsa.1819131.
JAMA
1. Kabakuş AT. Evaluating the Impact of Temperature and Instruction Strategies on Hallucination in Large Language Models. GU J Sci, Part A. 2026;13:13–32.
MLA
Kabakuş, Abdullah Talha. “Evaluating the Impact of Temperature and Instruction Strategies on Hallucination in Large Language Models”. Gazi University Journal of Science Part A: Engineering and Innovation, vol. 13, no. 1, Mar. 2026, pp. 13-32, doi:10.54287/gujsa.1819131.
Vancouver
1. Abdullah Talha Kabakuş. Evaluating the Impact of Temperature and Instruction Strategies on Hallucination in Large Language Models. GU J Sci, Part A. 2026 Mar. 1;13(1):13-32. doi:10.54287/gujsa.1819131