Research Article

Using large language models to evaluate ethical persuasion text: A measurement modeling approach

Volume: 13 Number: 1 January 2, 2026

Abstract

As AI becomes prevalent at every stage of assessment, it is essential to develop procedures that ensure its use supports ethical and psychometrically defensible measurement. In this study, we consider how measurement principles can be directly incorporated into an ethical reasoning performance assessment in which Large Language Models (LLMs) serve as raters. We demonstrate how a measurement approach can yield defensible measures of LLM-generated ethics-related text, of prompts designed to elicit text-based ethical persuasion responses, and of individual learners. We also show how measurement quality indicators can serve as guardrails that help mitigate AI-related risks to learners, such as hallucinations and errors. We describe a novel approach to designing, implementing, and evaluating performance assessments with AI, with the goal of enabling effective personalized learning experiences.


Details

Primary Language

English

Subjects

Measurement Theories and Applications in Education and Psychology

Journal Section

Research Article

Authors

Matt Barney
0000-0002-4538-3194
United States

Vaishak Krishna
0009-0001-1951-9801
United States

Publication Date

January 2, 2026

Submission Date

October 7, 2025

Acceptance Date

December 7, 2025

Published in Issue

Year 2026 Volume: 13 Number: 1

APA
Barney, M., Wind, S., & Krishna, V. (2026). Using large language models to evaluate ethical persuasion text: A measurement modeling approach. International Journal of Assessment Tools in Education, 13(1), 224-247. https://doi.org/10.21449/ijate.1788563
