Research Article

Using large language models to evaluate ethical persuasion text: A measurement modeling approach

Volume: 13 Number: 1 January 2, 2026

Abstract

As AI becomes prevalent at every stage of assessment, it is essential to develop procedures that ensure its use supports ethical and psychometrically defensible measurement. In this study, we consider how measurement principles can be directly incorporated into an ethical reasoning performance assessment in which Large Language Models (LLMs) serve as raters. We demonstrate how a measurement approach can yield defensible measures of LLM-generated ethics-related text, of prompts designed to elicit text-based ethical persuasion responses, and of individual learners. We also show how measurement quality indicators can serve as guardrails that help mitigate AI-related risks to learners, such as hallucinations and errors. We describe a novel approach to designing, implementing, and evaluating performance assessments with AI, with the goal of enabling effective personalized learning experiences.


Details

Primary Language

English

Subjects

Measurement Theories and Applications in Education and Psychology

Journal Section

Research Article

Authors

Matt Barney
0000-0002-4538-3194
United States

Vaishak Krishna
0009-0001-1951-9801
United States

Publication Date

January 2, 2026

Submission Date

October 7, 2025

Acceptance Date

December 7, 2025

Published in Issue

Year 2026 Volume: 13 Number: 1

APA
Barney, M., Wind, S., & Krishna, V. (2026). Using large language models to evaluate ethical persuasion text: A measurement modeling approach. International Journal of Assessment Tools in Education, 13(1), 224-247. https://doi.org/10.21449/ijate.1788563
