Evaluating the Impact of Prompt Formats on Llama2 and Phi3 Using Turkish Language Instruction Dataset
Year 2026, Volume 14, Issue 1, pp. 49–59, 21.01.2026
Emir Öztürk
Abstract
Effective fine-tuning methods allow language models to be trained and customized by anyone, making them versatile tools across many domains. Although model performance is usually assessed with loss or accuracy metrics, the choice of prompt format also plays a crucial role. In this study, a large language model and a small language model are fine-tuned on two different datasets using various prompt formats, and their performance results are compared. Additionally, a custom prompt format tailored to the selected Turkish datasets is introduced. The results indicate that changes in prompt format significantly affect the performance of large language models, and the custom prompt format achieved the best loss values for both models and the best generation-metric results for the large language model.
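The abstract does not reproduce the templates themselves; as a rough illustration of what comparing prompt formats means in instruction fine-tuning, the sketch below renders one instruction/response pair with two hypothetical templates: an Alpaca-style English template and a minimal Turkish one. The template strings and field names are assumptions for illustration, not the formats evaluated in this paper.

```python
# Illustrative only: render an instruction/response pair with two
# hypothetical prompt templates before tokenization for fine-tuning.

ALPACA_STYLE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{response}"
)

# Hypothetical minimal Turkish template (not the paper's custom format).
TURKISH_MINIMAL = "### Talimat:\n{instruction}\n\n### Cevap:\n{response}"

def render(template: str, instruction: str, response: str) -> str:
    """Fill a prompt template; the resulting string is what the model
    is actually trained on, so the template choice changes every example."""
    return template.format(instruction=instruction, response=response)

example = {"instruction": "Türkiye'nin başkenti neresidir?", "response": "Ankara"}
for name, tpl in [("alpaca", ALPACA_STYLE), ("turkish", TURKISH_MINIMAL)]:
    print(f"--- {name} ---")
    print(render(tpl, **example))
```

In a fine-tuning run, every record of the dataset would be passed through one such `render` call before tokenization, so swapping the template is the only change between the compared configurations.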
Ethical Statement
This study does not involve human or animal participants. All procedures followed scientific and ethical principles, and all referenced studies are appropriately cited.
Supporting Institution
This research received no external funding.
Thanks
The author does not wish to acknowledge any individual or institution.
References
- Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., Harrison, M., Hewett, R. J., Javaheripi, M., Kauffmann, P., Lee, J. R., Lee, Y. T., Li, Y., Liu, W., Mendes, C. C. T., Nguyen, A., Price, E., de Rosa, G., Saarikivi, O., … Zhang, Y. (2024). Phi-4 technical report. arXiv. https://arxiv.org/abs/2412.08905
- Ateş, M., & Başarslan, M. S. (2025). Performance comparison of traditional and contextual representations for cryptocurrency sentiment analysis on Twitter. Düzce University Journal of Science and Technology, 13(3), 1431–1444.
- Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., & Sutton, C. (2021). Program synthesis with large language models. arXiv. https://arxiv.org/abs/2108.07732
- Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv. https://arxiv.org/abs/1412.3555
- Dang, N. C., Moreno-García, M. N., & De la Prieta, F. (2020). Sentiment analysis based on deep learning: A comparative study. Electronics, 9(3), Article 483. https://doi.org/10.3390/electronics9030483
- Erdi, B., Şahin, E. A., Toydemir, M. S., & Dökeroğlu, T. (2021). Makine öğrenmesi algoritmaları ile trol hesapların tespiti [Detection of troll accounts with machine learning algorithms]. Düzce Üniversitesi Bilim ve Teknoloji Dergisi, 9(1), 430–442.
- Esmer, S., Uçar, M. K., Çil, İ., & Bozkurt, M. R. (2020). Parkinson hastalığı teşhisi için makine öğrenmesi tabanlı yeni bir yöntem [A new machine learning-based method for the diagnosis of Parkinson's disease]. Düzce University Journal of Science and Technology, 8(3), 1877–1893. https://doi.org/10.29130/dubited.688223
- Goyal, M., Tatwawadi, K., Chandak, S., & Ochoa, I. (2021). DZip: Improved general-purpose lossless compression based on novel neural network modeling. In 2021 Data Compression Conference (DCC) (pp. 153–162). IEEE. https://doi.org/10.1109/DCC50243.2021.00023
- He, J., Rungta, M., Koleczek, D., Sekhon, A., Wang, F. X., & Hasan, S. (2024). Does prompt formatting have any impact on LLM performance? arXiv. https://arxiv.org/abs/2411.10541
- Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
- Karri, S. P. R., & Kumar, B. S. (2020). Deep learning techniques for implementation of chatbots. In 2020 International Conference on Computer Communication and Informatics (ICCCI) (pp. 1–5).
- Koç, O., Yücedağ, İ., & Şentürk, Ü. (2025). The impact of artificial intelligence enhanced no-code software development platforms on software processes: A literature review. Düzce University Journal of Science and Technology, 13(1), 383–401. https://doi.org/10.29130/dubited.1554356
- Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out (pp. 74–81).
- Lyu, K., Zhao, H., Gu, X., Yu, D., Goyal, A., & Arora, S. (2024). Keeping LLMs aligned after fine-tuning: The crucial role of prompt templates. Advances in Neural Information Processing Systems, 37, 118603–118631. https://doi.org/10.52202/079017-3766
- NovusResearch. (2025, June 4). turkish_instructions [Data set]. Hugging Face. https://huggingface.co/datasets/NovusResearch/turkish_instructions
- Öztürk, E. (2024). XCompress: LLM assisted Python-based text compression toolkit. SoftwareX, 27, Article 101847. https://doi.org/10.1016/j.softx.2024.101847
- Öztürk, E., Mesut, A. Ş., & Fıdan, Ö. A. (2024). A character-based steganography using masked language modeling. IEEE Access, 12, 14248–14259. https://doi.org/10.1109/ACCESS.2024.3354710
- Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311–318).
- Schick, T., & Schütze, H. (2020). It’s not just size that matters: Small language models are also few-shot learners. arXiv. https://arxiv.org/abs/2009.07118
- Sclar, M., Choi, Y., Tsvetkov, Y., & Suhr, A. (2023). Quantifying language models’ sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. arXiv. https://arxiv.org/abs/2310.11324
- Shleifer, S., & Rush, A. M. (2020). Pre-trained summarization distillation. arXiv. https://arxiv.org/abs/2010.13002
- Şahin, Ö. (2025). Are large language models rational or behavioral? A comparative analysis of investor behavior interpretation. Düzce University Journal of Science and Technology, 13(4), 1556–1582. https://doi.org/10.29130/dubited.1711955
- Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Canton Ferrer, C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., … Scialom, T. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv. https://arxiv.org/abs/2307.09288
- Tüfekci, P., & Önal, Ç. M. (2024). Kötü amaçlı yazılım tespiti için makine öğrenmesi algoritmalarının kullanımı [Use of machine learning algorithms for malware detection]. Düzce University Journal of Science and Technology, 12(1), 307–319. https://doi.org/10.29130/dubited.1287453
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
- Voronov, A., Wolf, L., & Ryabinin, M. (2024). Mind your format: Towards consistent evaluation of in-context learning improvements. arXiv. https://arxiv.org/abs/2401.06766
- Xu, L., Xie, H., Qin, S.-Z. J., Tao, X., & Wang, F. L. (2023). Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment. arXiv. https://arxiv.org/abs/2312.12148
- Yuan, A., Coenen, A., Reif, E., & Ippolito, D. (2022). Wordcraft: Story writing with large language models. In Proceedings of the 27th International Conference on Intelligent User Interfaces (pp. 841–852).
- Yudum. (2025). turkish-instruct-dataset [Data set]. Hugging Face. https://huggingface.co/datasets/Yudum/turkish-instruct-dataset
- Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2019). BERTScore: Evaluating text generation with BERT. arXiv. https://arxiv.org/abs/1904.09675
- Zhang, T., Ladhak, F., Durmus, E., Liang, P., McKeown, K., & Hashimoto, T. B. (2024). Benchmarking large language models for news summarization. Transactions of the Association for Computational Linguistics, 12, 39–57. https://doi.org/10.1162/tacl_a_00632
- Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., Du, Y., Yang, C., Chen, Y., Chen, Z., Jiang, J., Ren, R., Li, Y., Tang, X., Liu, Z., … Wen, J.-R. (2023). A survey of large language models. arXiv. https://arxiv.org/abs/2303.18223