Investigating Sampling Impacts on an LLM-Based AI Scoring Approach: Prediction Accuracy and Fairness

Mo Zhang; Matthew Johnson; Chunyi Ruan

doi:10.21031/epod.1561580

Abstract

AI scoring capabilities are commonly implemented in educational assessments as a supplement to or replacement for human scoring, and there is significant interest in leveraging large language models for scoring. To use AI scoring responsibly, the AI scores should be accurate and fair. In this study, we explored one approach to potentially mitigating bias in AI scoring: using equal-allocation stratified sampling to select the AI model training data. The data set included 13 open-ended short-response items from a K-12 state science assessment. Empirical results suggested that stratification neither improved nor worsened the fairness evaluations of the AI models. BERT-based AI scoring models trained on stratified samples, which contained much less data, performed comparably to models trained on simple random samples in terms of overall prediction accuracy and subgroup-level fairness. Limitations and future research are also discussed.
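To make the two sampling designs named in the abstract concrete, the following is a minimal sketch of simple random sampling versus equal-allocation stratified sampling. It is an illustration only, not the authors' implementation: the function names, the toy records, the group labels "A", "B", and "C", and the sample sizes are all hypothetical, since the abstract does not state the strata or the sizes used in the study.

```python
import random
from collections import Counter, defaultdict


def simple_random_sample(records, n, seed=0):
    """Draw n records uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(records, n)


def equal_allocation_stratified_sample(records, stratum_of, n_per_stratum, seed=0):
    """Draw the same number of records from every stratum.

    A stratum with fewer than n_per_stratum records contributes all of them.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)

    sample = []
    for key in sorted(strata):  # fixed order keeps the draw reproducible
        members = strata[key]
        k = min(n_per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample


def group_counts(sample):
    """Count how many sampled records fall in each group."""
    return dict(Counter(rec["group"] for rec in sample))


# Hypothetical, deliberately imbalanced pool of scored responses.
records = (
    [{"id": i, "group": "A"} for i in range(800)]
    + [{"id": 800 + i, "group": "B"} for i in range(150)]
    + [{"id": 950 + i, "group": "C"} for i in range(50)]
)

srs = simple_random_sample(records, 300)
strat = equal_allocation_stratified_sample(records, lambda r: r["group"], 50)

print("simple random sample:", group_counts(srs))
print("equal allocation:    ", group_counts(strat))
```

In this toy setup the simple random sample roughly mirrors the imbalance of the pool, while the equal-allocation sample contains the same number of records per group and is therefore much smaller overall. That mirrors the abstract's comparison, in which the stratified models were trained on much less data.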

Details

Primary Language

English

Subjects

Modelling

Journal Section

Research Article

Authors

Mo Zhang*
0000-0003-2689-2089
United States

Matthew Johnson
0000-0003-3157-4165
United States

Chunyi Ruan
0009-0009-3073-229X
United States

Publication Date

December 30, 2024

Submission Date

October 4, 2024

Acceptance Date

November 12, 2024

Published in Issue

Year: 2024, Volume: 15, Number: Special Issue

DOI

https://doi.org/10.21031/epod.1561580

IZ

https://izlik.org/JA95ZW44GU

Cite

APA

Zhang, M., Johnson, M., & Ruan, C. (2024). Investigating Sampling Impacts on an LLM-Based AI Scoring Approach: Prediction Accuracy and Fairness. Journal of Measurement and Evaluation in Education and Psychology, 15(Special Issue), 348-360. https://doi.org/10.21031/epod.1561580

Cited By

AI-feedback in education: user experience analysis

Педагогика и просвещение

https://doi.org/10.7256/2454-0676.2025.3.75129

Answer-based and reference-based BERT models for automatic scoring of Turkish short answers: The decisive role of task complexity

International Journal of Assessment Tools in Education

https://doi.org/10.21449/ijate.1687429