Evaluating the Impact of Prompt Formats on Llama2 and Phi3 Using Turkish Language Instruction Dataset
Year 2026, Volume 14, Issue 1, pp. 49–59, 21.01.2026
Emir Öztürk
Abstract
Effective fine-tuning methods allow language models to be trained and customized by anyone, making them versatile tools across many domains. Although model performance is usually assessed with loss or accuracy metrics, the choice of prompt format also plays a crucial role. In this study, a large language model and a small language model are fine-tuned on two different datasets using various prompt formats, and their performance results are compared. Additionally, a custom prompt format tailored to the selected Turkish datasets is introduced. The results indicate that changes in prompt format significantly affect the performance of large language models, and the custom prompt format achieved the best loss values for both models and the best generation-metric results for the large language model.
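The abstract does not reproduce the templates themselves; as a rough illustration of what comparing prompt formats means in instruction fine-tuning, the sketch below renders one instruction/response pair with two hypothetical templates: an Alpaca-style English template and a minimal Turkish one. The template strings and field names are assumptions for illustration, not the formats evaluated in this paper.

```python
# Illustrative only: render an instruction/response pair with two
# hypothetical prompt templates before tokenization for fine-tuning.

ALPACA_STYLE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{response}"
)

# Hypothetical minimal Turkish template (not the paper's custom format).
TURKISH_MINIMAL = "### Talimat:\n{instruction}\n\n### Cevap:\n{response}"

def render(template: str, instruction: str, response: str) -> str:
    """Fill a prompt template; the resulting string is what the model
    is actually trained on, so the template choice changes every example."""
    return template.format(instruction=instruction, response=response)

example = {"instruction": "Türkiye'nin başkenti neresidir?", "response": "Ankara"}
for name, tpl in [("alpaca", ALPACA_STYLE), ("turkish", TURKISH_MINIMAL)]:
    print(f"--- {name} ---")
    print(render(tpl, **example))
```

In a fine-tuning run, every record of the dataset would be passed through one such `render` call before tokenization, so swapping the template is the only change between the compared configurations.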
Ethical Statement
This study does not involve human or animal participants. All procedures followed scientific and ethical principles, and all referenced studies are appropriately cited.
Supporting Institution
This research received no external funding.
Thanks
The author does not wish to acknowledge any individual or institution.
References
- Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., Harrison, M., Hewett, R. J., Javaheripi, M., Kauffmann, P., Lee, J. R., Lee, Y. T., Li, Y., Liu, W., Mendes, C. C. T., Nguyen, A., Price, E., de Rosa, G., Saarikivi, O., … Zhang, Y. (2024). Phi-4 technical report. arXiv. https://arxiv.org/abs/2412.08905
- Ateş, M., & Başarslan, M. S. (2025). Performance comparison of traditional and contextual representations for cryptocurrency sentiment analysis on Twitter. Düzce University Journal of Science and Technology, 13(3), 1431–1444.
- Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., & Sutton, C. (2021). Program synthesis with large language models. arXiv. https://arxiv.org/abs/2108.07732
- Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv. https://arxiv.org/abs/1412.3555
- Dang, N. C., Moreno-García, M. N., & De la Prieta, F. (2020). Sentiment analysis based on deep learning: A comparative study. Electronics, 9(3), Article 483. https://doi.org/10.3390/electronics9030483
- Erdi, B., Şahin, E. A., Toydemir, M. S., & Dökeroğlu, T. (2021). Makine öğrenmesi algoritmaları ile trol hesapların tespiti [Detection of troll accounts with machine learning algorithms]. Düzce Üniversitesi Bilim ve Teknoloji Dergisi, 9(1), 430–442.
- Esmer, S., Uçar, M. K., Çil, İ., & Bozkurt, M. R. (2020). Parkinson hastalığı teşhisi için makine öğrenmesi tabanlı yeni bir yöntem [A new machine learning-based method for the diagnosis of Parkinson's disease]. Düzce University Journal of Science and Technology, 8(3), 1877–1893. https://doi.org/10.29130/dubited.688223
- Goyal, M., Tatwawadi, K., Chandak, S., & Ochoa, I. (2021). DZip: Improved general-purpose lossless compression based on novel neural network modeling. In 2021 Data Compression Conference (DCC) (pp. 153–162). IEEE. https://doi.org/10.1109/DCC50243.2021.00023
- He, J., Rungta, M., Koleczek, D., Sekhon, A., Wang, F. X., & Hasan, S. (2024). Does prompt formatting have any impact on LLM performance? arXiv. https://arxiv.org/abs/2411.10541
- Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
- Karri, S. P. R., & Kumar, B. S. (2020). Deep learning techniques for implementation of chatbots. In 2020 International Conference on Computer Communication and Informatics (ICCCI) (pp. 1–5).
- Koç, O., Yücedağ, İ., & Şentürk, Ü. (2025). The impact of artificial intelligence enhanced no-code software development platforms on software processes: A literature review. Düzce University Journal of Science and Technology, 13(1), 383–401. https://doi.org/10.29130/dubited.1554356
- Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out (pp. 74–81).
- Lyu, K., Zhao, H., Gu, X., Yu, D., Goyal, A., & Arora, S. (2024). Keeping LLMs aligned after fine-tuning: The crucial role of prompt templates. Advances in Neural Information Processing Systems, 37, 118603–118631. https://doi.org/10.52202/079017-3766
- NovusResearch. (2025, June 4). turkish_instructions [Data set]. Hugging Face. https://huggingface.co/datasets/NovusResearch/turkish_instructions
- Öztürk, E. (2024). XCompress: LLM assisted Python-based text compression toolkit. SoftwareX, 27, Article 101847. https://doi.org/10.1016/j.softx.2024.101847
- Öztürk, E., Mesut, A. Ş., & Fıdan, Ö. A. (2024). A character-based steganography using masked language modeling. IEEE Access, 12, 14248–14259. https://doi.org/10.1109/ACCESS.2024.3354710
- Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311–318).
- Schick, T., & Schütze, H. (2020). It’s not just size that matters: Small language models are also few-shot learners. arXiv. https://arxiv.org/abs/2009.07118
- Sclar, M., Choi, Y., Tsvetkov, Y., & Suhr, A. (2023). Quantifying language models’ sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. arXiv. https://arxiv.org/abs/2310.11324
- Shleifer, S., & Rush, A. M. (2020). Pre-trained summarization distillation. arXiv. https://arxiv.org/abs/2010.13002
- Şahin, Ö. (2025). Are large language models rational or behavioral? A comparative analysis of investor behavior interpretation. Düzce University Journal of Science and Technology, 13(4), 1556–1582. https://doi.org/10.29130/dubited.1711955
- Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Canton Ferrer, C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., … Scialom, T. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv. https://arxiv.org/abs/2307.09288
- Tüfekci, P., & Önal, Ç. M. (2024). Kötü amaçlı yazılım tespiti için makine öğrenmesi algoritmalarının kullanımı [Use of machine learning algorithms for malware detection]. Düzce University Journal of Science and Technology, 12(1), 307–319. https://doi.org/10.29130/dubited.1287453
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
- Voronov, A., Wolf, L., & Ryabinin, M. (2024). Mind your format: Towards consistent evaluation of in-context learning improvements. arXiv. https://arxiv.org/abs/2401.06766
- Xu, L., Xie, H., Qin, S.-Z. J., Tao, X., & Wang, F. L. (2023). Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment. arXiv. https://arxiv.org/abs/2312.12148
- Yuan, A., Coenen, A., Reif, E., & Ippolito, D. (2022). Wordcraft: Story writing with large language models. In Proceedings of the 27th International Conference on Intelligent User Interfaces (pp. 841–852).
- Yudum. (2025). turkish-instruct-dataset [Data set]. Hugging Face. https://huggingface.co/datasets/Yudum/turkish-instruct-dataset
- Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2019). BERTScore: Evaluating text generation with BERT. arXiv. https://arxiv.org/abs/1904.09675
- Zhang, T., Ladhak, F., Durmus, E., Liang, P., McKeown, K., & Hashimoto, T. B. (2024). Benchmarking large language models for news summarization. Transactions of the Association for Computational Linguistics, 12, 39–57. https://doi.org/10.1162/tacl_a_00632
- Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., Du, Y., Yang, C., Chen, Y., Chen, Z., Jiang, J., Ren, R., Li, Y., Tang, X., Liu, Z., … Wen, J.-R. (2023). A survey of large language models. arXiv. https://arxiv.org/abs/2303.18223