Investigating Sampling Impacts on an LLM-Based AI Scoring Approach: Prediction Accuracy and Fairness
Year 2024, Volume: 15, Issue: Special Issue, 348–360, 30.12.2024
Mo Zhang, Matthew Johnson, Chunyi Ruan
Abstract
AI scoring capabilities are commonly implemented in educational assessments as a supplement to or replacement for human scoring, and there is significant interest in leveraging large language models (LLMs) for scoring. To use AI scoring responsibly, the resulting scores should be both accurate and fair. In this study, we explored one approach to potentially mitigating bias in AI scoring: using equal-allocation stratified sampling to select AI model training data. The data set included 13 open-ended short-response items from a K-12 state science assessment. Empirical results suggested that stratification neither improved nor worsened fairness evaluations of the AI models. BERT-based scoring models built with the stratified sampling method, although trained on much less data, performed comparably to models built with simple random sampling in terms of overall prediction accuracy and subgroup-level fairness. Limitations and directions for future research are also discussed.
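To illustrate the two training-sample designs compared in the abstract, the sketch below contrasts simple random sampling with equal-allocation stratified sampling over subgroups. This is a minimal illustration, not the authors' implementation; the column names (`subgroup`), file name, and sample sizes are hypothetical.

```python
# Minimal sketch (not the authors' code) of the two sampling designs compared
# in the study. Column names, file names, and sample sizes are hypothetical.
import pandas as pd


def simple_random_sample(pool: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Draw n scored responses uniformly at random from the full pool."""
    return pool.sample(n=n, random_state=seed)


def equal_allocation_stratified_sample(pool: pd.DataFrame, n_per_group: int,
                                       group_col: str = "subgroup",
                                       seed: int = 0) -> pd.DataFrame:
    """Draw the same number of responses from each subgroup, irrespective of
    how prevalent that subgroup is in the population."""
    return (
        pool.groupby(group_col, group_keys=False)
            .apply(lambda g: g.sample(n=min(n_per_group, len(g)), random_state=seed))
    )


# Hypothetical usage: each training set would be used to fine-tune a BERT-based
# scoring model, and the models compared on a common held-out evaluation set
# for overall accuracy (e.g., quadratic weighted kappa) and subgroup fairness.
# pool = pd.read_csv("scored_responses_item01.csv")
# srs_train = simple_random_sample(pool, n=2000)
# strat_train = equal_allocation_stratified_sample(pool, n_per_group=250)
```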
References
- Ali, S., Abuhmed, T., El-Sappagh, S., et al. (2023). Explainable artificial intelligence (XAI): What we know and what is left to attain trustworthy artificial intelligence. Information Fusion, 99, Article 101805. https://doi.org/10.1016/j.inffus.2023.101805
- American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
- Ayoub, N. F., Balakrishnan, K., Ayoub, M. S., Barrett, T. F., David, A. P., & Gray, S. T. (2024). Inherent bias in large language models: A random sampling analysis. Mayo Clinic Proceedings: Digital Health, 2, 186–191. https://doi.org/10.1016/j.mcpdig.2024.03.003
- Bai, X., Wang, A., Sucholutsky, I., & Griffiths, T. L. (2024). Measuring implicit bias in explicitly unbiased large language models. arXiv. https://arxiv.org/pdf/2402.04105
- Bennett, R. E., & Zhang, M. (2016). Validity and automated scoring. In F. Drasgow (Ed.), Technology in testing: Measurement issues (pp. 142–173). Taylor & Francis.
- Caton, S., & Haas, C. (2024). Fairness in machine learning: A survey. ACM Computing Surveys, 56(7), Article 166. https://doi.org/10.1145/3616865
- Chamieh, I., Zesch, T., & Giebermann, K. (2024). LLMs in short answer scoring: Limitations and promise of zero-shot and few-shot approaches. In E. Kochmar et al. (Eds.), Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (pp. 309–315). Association for Computational Linguistics. https://aclanthology.org/2024.bea-1.25.pdf
- Chhabra, A., Singla, A., & Mohapatra, P. (2022). Fair clustering using antidote data. In J. Schrouff, A. Dieng, M. Rateike, K. Kwegyir-Aggrey, & G. Farnadi (Eds.), Proceedings of the Algorithmic Fairness through the Lens of Causality and Robustness (Vol. 171, pp. 19–39). PMLR. https://proceedings.mlr.press/v171/chhabra22a.html
- Chu, Z., Wang, Z., & Zhang, W. (2024). Fairness in large language models: A taxonomic survey. ACM SIGKDD Explorations Newsletter, 26(1), 34–48. https://doi.org/10.1145/3682112.3682117
- Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213–220. https://doi.org/10.1037/h0026256
- Ferrara, C., Sellitto, G., Ferrucci, F., et al. (2024). Fairness-aware machine learning engineering: How far are we? Empirical Software Engineering, 29(9). https://doi.org/10.1007/s10664-023-10402-y
- Haberman, S. J. (1984). Adjustment by minimum discriminant information. Annals of Statistics, 12(3), 971–988. https://www.jstor.org/stable/2240973
- Haberman, S. J. (2019). Measures of agreement versus measures of prediction accuracy (Research Report No. RR-19-20). https://doi.org/10.1002/ets2.12258
- Haberman, S. J., & Sinharay, S. (2008). Sample-size requirements for automated essay scoring (Research Report No. RR-08-32). https://doi.org/10.1002/j.2333-8504.2008.tb02118.x
- Heilman, M., & Madnani, N. (2015). The impact of training data on automated short answer scoring performance. In J. Tetreault, J. Burstein, & C. Leacock (Eds.), Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 81–85). Association for Computational Linguistics. https://doi.org/10.3115/v1/W15-0610
- Johnson, M. S., Liu, X., & McCaffrey, D. F. (2022). Psychometric methods to evaluate measurement and algorithmic bias in automated scoring. Journal of Educational Measurement, 59, 338–361. https://doi.org/10.1111/jedm.12335
- Johnson, M. S., & McCaffrey, D. F. (2023). Evaluating fairness of automated scoring in educational measurement. In V. Yaneva & M. von Davier (Eds.), Advancing natural language processing in educational assessment. Routledge.
- Johnson, M. S., & Zhang, M. (2024). Examining the responsible use of zero-shot AI approaches to scoring essays. Manuscript submitted for publication.
- Kortemeyer, G. (2024). Performance of the pre-trained large language model GPT-4 on automated short answer grading. Discover Artificial Intelligence, 4(47). https://doi.org/10.1007/s44163-024-00147-y
- Kumar, A., Dikshit, S., & de Albuquerque, V. (2021). Explainable artificial intelligence for sarcasm detection in dialogues. Wireless Communications and Mobile Computing, 2021, Article 2939334. https://doi.org/10.1155/2021/2939334
- Lee, G.-G., Latif, E., Wu, X., Liu, N., & Zhai, X. (2024). Applying large language models and chain-of-thought for automatic scoring. Computers and Education: Artificial Intelligence, 6, 100213. https://doi.org/10.1016/j.caeai.2024.100213
- Lohr, S. L. (2021). Sampling: Design and analysis (3rd ed.). Chapman and Hall/CRC. https://doi.org/10.1201/9780429298899
- Loukina, A., Madnani, N., Cahill, A., Yao, L., Johnson, M. S., Riordan, B., & McCaffrey, D. F. (2020). Using PRMSEs to evaluate automated scoring systems in the presence of label noise. In J. Burstein, E. Kochmar, C. Leacock, N. Madnani, I. Pilán, H. Yannakoudakis, & T. Zesch (Eds.), Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 18–29). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.bea-1.2
- Lubis, F. F. M., Putri, A. W. D., et al. (2021). Automated short-answer grading using semantic similarity based on word embedding. International Journal of Technology, 12(3), 571–581. https://doi.org/10.14716/ijtech.v12i3.4651
- Ma, W., Scheible, H., Wang, B., & Veeramachaneni, G. (2023). Deciphering stereotypes in pre-trained language models. In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 11328–11345). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.697
- Manvi, R., Khanna, S., Burke, M., Lobell, D., & Ermon, S. (2024). Large language models are geographically biased. arXiv. https://arxiv.org/abs/2402.02680
- McCaffrey, D. F., Casabianca, J., Ricker-Pedley, K. L., Lawless, R., & Wendler, C. (2022). Best practices for constructed-response scoring (Research Report No. RR-22-17). https://doi.org/10.1002/ets2.12358
- Navigli, R., Conia, S., & Ross, B. (2023). Biases in large language models: Origins, inventory, and discussion. Journal of Data and Information Quality, 15(2), 1–21. https://doi.org/10.1145/3597307
- Oka, R., Kusumi, T., & Utsumi, A. (2024). Performance evaluation of automated scoring for the descriptive similarity response task. Scientific Reports, 14, Article 6228. https://doi.org/10.1038/s41598-024-56743-6
- Whitmer, J., Deng, E. Y., Blankenship, C., Beiting-Parrish, M., Zhang, T., & Bailey, P. (2021). Results of NAEP reading item automated scoring data challenge (fall 2021). EdArXiv. https://osf.io/preprints/edarxiv/2hevq
- Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–13. https://doi.org/10.1111/j.1745-3992.2011.00223.x
- Zhang, M. (2013). The impact of sampling approach on population invariance in automated scoring of essays (Research Report No. RR-13-18). https://doi.org/10.1002/j.2333-8504.2013.tb02325.x