Research Article

Investigating Sampling Impacts on an LLM-Based AI Scoring Approach: Prediction Accuracy and Fairness

Volume: 15 Number: Special Issue December 30, 2024

Abstract

AI scoring capabilities are commonly implemented in educational assessments as a supplement to or replacement for human scoring, and there is significant interest in leveraging large language models for scoring. To be used responsibly, AI scores should be accurate and fair. In this study, we explored one approach to potentially mitigating bias in AI scoring: using equal-allocation stratified sampling to select the AI model training data. The data set included 13 open-ended short-response items from a K-12 state science assessment. Empirical results suggested that stratification neither improved nor worsened the fairness evaluations of the AI models. BERT-based AI scoring models trained on stratified samples, which contained much less data, performed comparably to models trained on simple random samples in terms of overall prediction accuracy and subgroup-level fairness. Limitations and future research are also discussed.
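
As a rough illustration of the sampling idea named in the abstract, the sketch below contrasts equal-allocation stratified sampling (the same number of responses drawn from every subgroup) with simple random sampling. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `group` field, and the toy data are all hypothetical, and the study's actual subgroup definitions and sample sizes are not given here.

```python
import random
from collections import defaultdict


def equal_allocation_sample(records, group_key, per_group, seed=0):
    """Draw the same number of records from every subgroup (equal allocation)."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for rec in records:
        by_group[rec[group_key]].append(rec)

    sample = []
    for group in sorted(by_group):
        members = by_group[group]
        # Cap at the subgroup size so a small group contributes all it has.
        k = min(per_group, len(members))
        sample.extend(rng.sample(members, k))
    return sample


def simple_random_sample(records, n, seed=0):
    """Draw n records without regard to subgroup membership."""
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))


# Hypothetical toy data: 90 responses from group "A", 10 from group "B".
responses = [{"id": i, "group": "A" if i < 90 else "B"} for i in range(100)]

stratified = equal_allocation_sample(responses, "group", per_group=10)
counts = {g: sum(r["group"] == g for r in stratified) for g in ("A", "B")}
print(counts)  # {'A': 10, 'B': 10}
```

In this toy setup the stratified sample holds only 20 of the 100 responses, which mirrors the abstract's point that the stratified training sets were much smaller than the simple random ones; a simple random sample of the same size would be expected to contain mostly group "A" responses.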

Details

Primary Language

English

Subjects

Modelling

Journal Section

Research Article

Publication Date

December 30, 2024

Submission Date

October 4, 2024

Acceptance Date

November 12, 2024

Published in Issue

Year 2024 Volume: 15 Number: Special Issue

APA
Zhang, M., Johnson, M., & Ruan, C. (2024). Investigating Sampling Impacts on an LLM-Based AI Scoring Approach: Prediction Accuracy and Fairness. Journal of Measurement and Evaluation in Education and Psychology, 15(Special Issue), 348-360. https://doi.org/10.21031/epod.1561580
