Investigating Sampling Impacts on an LLM-Based AI Scoring Approach: Prediction Accuracy and Fairness
Year 2024, Volume: 15, Issue: Special Issue, 348–360, 30.12.2024
Mo Zhang, Matthew Johnson, Chunyi Ruan
Abstract
AI scoring capabilities are commonly implemented in educational assessments as a supplement to or replacement for human scoring, and there is significant interest in leveraging large language models (LLMs) for scoring. To use AI scoring responsibly, the resulting scores should be both accurate and fair. In this study, we explored one approach to potentially mitigating bias in AI scoring: using equal-allocation stratified sampling to select AI model training data. The data set included 13 open-ended short-response items from a K-12 state science assessment. Empirical results suggested that stratification neither improved nor worsened fairness evaluations of the AI models. BERT-based scoring models built with the stratified sampling method, although trained on much less data, performed comparably to models built with simple random sampling in terms of overall prediction accuracy and subgroup-level fairness. Limitations and directions for future research are also discussed.
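To illustrate the two training-sample designs compared in the abstract, the sketch below contrasts simple random sampling with equal-allocation stratified sampling over subgroups. This is a minimal illustration, not the authors' implementation; the column names (`subgroup`), file name, and sample sizes are hypothetical.

```python
# Minimal sketch (not the authors' code) of the two sampling designs compared
# in the study. Column names, file names, and sample sizes are hypothetical.
import pandas as pd


def simple_random_sample(pool: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Draw n scored responses uniformly at random from the full pool."""
    return pool.sample(n=n, random_state=seed)


def equal_allocation_stratified_sample(pool: pd.DataFrame, n_per_group: int,
                                       group_col: str = "subgroup",
                                       seed: int = 0) -> pd.DataFrame:
    """Draw the same number of responses from each subgroup, irrespective of
    how prevalent that subgroup is in the population."""
    return (
        pool.groupby(group_col, group_keys=False)
            .apply(lambda g: g.sample(n=min(n_per_group, len(g)), random_state=seed))
    )


# Hypothetical usage: each training set would be used to fine-tune a BERT-based
# scoring model, and the models compared on a common held-out evaluation set
# for overall accuracy (e.g., quadratic weighted kappa) and subgroup fairness.
# pool = pd.read_csv("scored_responses_item01.csv")
# srs_train = simple_random_sample(pool, n=2000)
# strat_train = equal_allocation_stratified_sample(pool, n_per_group=250)
```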
References
- Ali, S., Abuhmed, T., El-Sappagh, S., et al. (2023). Explainable artificial intelligence (XAI): What we know and what is left to attain trustworthy artificial intelligence. Information Fusion, 99, Article 101805. https://doi.org/10.1016/j.inffus.2023.101805
- American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
- Ayoub, N. F., Balakrishnan, K., Ayoub, M. S., Barrett, T. F., David, A. P., & Gray, S. T. (2024). Inherent bias in large language models: A random sampling analysis. Mayo Clinic Proceedings: Digital Health, 2, 186–191. https://doi.org/10.1016/j.mcpdig.2024.03.003
- Bai, X., Wang, A., Sucholutsky, I., & Griffiths, T. L. (2024). Measuring implicit bias in explicitly unbiased large language models. arXiv. https://arxiv.org/pdf/2402.04105
- Bennett, R. E., & Zhang, M. (2016). Validity and automated scoring. In F. Drasgow (Ed.), Technology in testing: Measurement issues (pp. 142–173). Taylor & Francis.
- Caton, S., & Haas, C. (2024). Fairness in machine learning: A survey. ACM Computing Surveys, 56(7), Article 166. https://doi.org/10.1145/3616865
- Chamieh, I., Zesch, T., & Giebermann, K. (2024). LLMs in short answer scoring: Limitations and promise of zero-shot and few-shot approaches. In E. Kochmar et al. (Eds.), Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (pp. 309–315). Association for Computational Linguistics. https://aclanthology.org/2024.bea-1.25.pdf
- Chhabra, A., Singla, A., & Mohapatra, P. (2022). Fair clustering using antidote data. In J. Schrouff, A. Dieng, M. Rateike, K. Kwegyir-Aggrey, & G. Farnadi (Eds.), Proceedings of the Algorithmic Fairness through the Lens of Causality and Robustness (Vol. 171, pp. 19–39). PMLR. https://proceedings.mlr.press/v171/chhabra22a.html
- Chu, Z., Wang, Z., & Zhang, W. (2024). Fairness in large language models: A taxonomic survey. ACM SIGKDD Explorations Newsletter, 26(1), 34–48. https://doi.org/10.1145/3682112.3682117
- Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213–220. https://doi.org/10.1037/h0026256
- Ferrara, C., Sellitto, G., Ferrucci, F., et al. (2024). Fairness-aware machine learning engineering: How far are we? Empirical Software Engineering, 29(9). https://doi.org/10.1007/s10664-023-10402-y
- Haberman, S. J. (1984). Adjustment by minimum discriminant information. Annals of Statistics, 12(3), 971–988. https://www.jstor.org/stable/2240973
- Haberman, S. J. (2019). Measures of agreement versus measures of prediction accuracy (Research Report No. RR-19-20). https://doi.org/10.1002/ets2.12258
- Haberman, S. J., & Sinharay, S. (2008). Sample-size requirements for automated essay scoring (Research Report No. RR-08-32). https://doi.org/10.1002/j.2333-8504.2008.tb02118.x
- Heilman, M., & Madnani, N. (2015). The impact of training data on automated short answer scoring performance. In J. Tetreault, J. Burstein, & C. Leacock (Eds.), Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 81–85). Association for Computational Linguistics. https://doi.org/10.3115/v1/W15-0610
- Johnson, M. S., Liu, X., & McCaffrey, D. F. (2022). Psychometric methods to evaluate measurement and algorithmic bias in automated scoring. Journal of Educational Measurement, 59, 338–361. https://doi.org/10.1111/jedm.12335
- Johnson, M. S., & McCaffrey, D. F. (2023). Evaluating fairness of automated scoring in educational measurement. In V. Yaneva & M. von Davier (Eds.), Advancing natural language processing in educational assessment. Routledge.
- Johnson, M. S., & Zhang, M. (2024). Examining the responsible use of zero-shot AI approaches to scoring essays. Manuscript submitted for publication.
- Kortemeyer, G. (2024). Performance of the pre-trained large language model GPT-4 on automated short answer grading. Discover Artificial Intelligence, 4(47). https://doi.org/10.1007/s44163-024-00147-y
- Kumar, A., Dikshit, S., & de Albuquerque, V. (2021). Explainable artificial intelligence for sarcasm detection in dialogues. Wireless Communications and Mobile Computing, 2021, Article 2939334. https://doi.org/10.1155/2021/2939334
- Lee, G.-G., Latif, E., Wu, X., Liu, N., & Zhai, X. (2024). Applying large language models and chain-of-thought for automatic scoring. Computers and Education: Artificial Intelligence, 6, 100213. https://doi.org/10.1016/j.caeai.2024.100213
- Lohr, S. L. (2021). Sampling: Design and analysis (3rd ed.). Chapman and Hall/CRC. https://doi.org/10.1201/9780429298899
- Loukina, A., Madnani, N., Cahill, A., Yao, L., Johnson, M. S., Riordan, B., & McCaffrey, D. F. (2020). Using PRMSEs to evaluate automated scoring systems in the presence of label noise. In J. Burstein, E. Kochmar, C. Leacock, N. Madnani, I. Pilán, H. Yannakoudakis, & T. Zesch (Eds.), Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 18–29). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.bea-1.2
- Lubis, F. F. M., Putri, A. W. D., et al. (2021). Automated short-answer grading using semantic similarity based on word embedding. International Journal of Technology, 12(3), 571–581. https://doi.org/10.14716/ijtech.v12i3.4651
- Ma, W., Scheible, H., Wang, B., & Veeramachaneni, G. (2023). Deciphering stereotypes in pre-trained language models. In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 11328–11345). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-main.697
- Manvi, R., Khanna, S., Burke, M., Lobell, D., & Ermon, S. (2024). Large language models are geographically biased. arXiv. https://arxiv.org/abs/2402.02680
- McCaffrey, D. F., Casabianca, J., Ricker-Pedley, K. L., Lawless, R., & Wendler, C. (2022). Best practices for constructed-response scoring (Research Report No. RR-22-17). https://doi.org/10.1002/ets2.12358
- Navigli, R., Conia, S., & Ross, B. (2023). Biases in large language models: Origins, inventory, and discussion. Journal of Data and Information Quality, 15(2), 1–21. https://doi.org/10.1145/3597307
- Oka, R., Kusumi, T., & Utsumi, A. (2024). Performance evaluation of automated scoring for the descriptive similarity response task. Scientific Reports, 14, Article 6228. https://doi.org/10.1038/s41598-024-56743-6
- Whitmer, J., Deng, E. Y., Blankenship, C., Beiting-Parrish, M., Zhang, T., & Bailey, P. (2021). Results of NAEP reading item automated scoring data challenge (fall 2021). EdArXiv. https://osf.io/preprints/edarxiv/2hevq
- Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–13. https://doi.org/10.1111/j.1745-3992.2011.00223.x
- Zhang, M. (2013). The impact of sampling approach on population invariance in automated scoring of essays (Research Report No. RR-13-18). https://doi.org/10.1002/j.2333-8504.2013.tb02325.x