Research Article

Using large language models to evaluate ethical persuasion text: A measurement modeling approach

Year 2026, Volume: 13, Issue: 1, 224–247, 02.01.2026
https://doi.org/10.21449/ijate.1788563

Abstract

As AI becomes prevalent in all stages of assessment, it is essential to develop procedures that ensure its use supports ethical and psychometrically defensible measurement. In this study, we consider how measurement principles can be directly incorporated into an ethical reasoning performance assessment in which Large Language Models (LLMs) serve as raters. We demonstrate how a measurement approach can be used to obtain defensible measures of LLM-generated ethics-related text, of prompts designed to elicit text-based ethical persuasion responses, and of individual learners. We also show how measurement quality indicators can serve as guardrails that help mitigate potential AI-related risks to learners, such as hallucinations or errors. We describe a novel approach to designing, implementing, and evaluating performance assessments with AI, with the goal of enabling effective personalized learning experiences.
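
The abstract does not state the scoring model explicitly, but a "measurement modeling approach" in which LLMs serve as raters alongside prompts and learners is consistent with a many-facet Rasch model. A minimal sketch in log-odds form, assuming three facets (learners, prompts, and LLM raters) and a common rating scale, with illustrative notation rather than the authors' own:

\ln\left( \frac{P_{nijk}}{P_{nij(k-1)}} \right) = \theta_n - \delta_i - \lambda_j - \tau_k

Here P_{nijk} is the probability that learner n receives rating category k (rather than k-1) from LLM rater j on prompt i; \theta_n is the learner's ability, \delta_i the prompt's difficulty, \lambda_j the LLM rater's severity, and \tau_k the threshold between adjacent categories. Under this kind of model, severity and fit statistics for the \lambda_j estimates are one plausible form of the measurement quality indicators the abstract describes as guardrails against rater-level risks such as hallucinations or errors.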



Details

Primary Language English
Subjects Measurement Theories and Applications in Education and Psychology
Journal Section Research Article
Authors

Matt Barney 0000-0002-4538-3194

Stefanie Wind 0000-0002-1599-375X

Vaishak Krishna 0009-0001-1951-9801

Submission Date October 7, 2025
Acceptance Date December 7, 2025
Publication Date January 2, 2026
Published in Issue Year 2026 Volume: 13 Issue: 1

Cite

APA Barney, M., Wind, S., & Krishna, V. (2026). Using large language models to evaluate ethical persuasion text: A measurement modeling approach. International Journal of Assessment Tools in Education, 13(1), 224-247. https://doi.org/10.21449/ijate.1788563
