Ask NAEP: A Generative AI Assistant for Querying Assessment Information

Ting Zhang; Luke Patterson; Maggie Beiting-parrish; Blue Webb; Bhashithe Abeysinghe; Paul Bailey; Emmanuel Sikali

doi:10.21031/epod.1548128

Abstract

Ask NAEP is a chatbot built with Retrieval-Augmented Generation (RAG) that aims to provide accurate and comprehensive responses to queries about publicly available information on the National Assessment of Educational Progress (NAEP). This study evaluates the quality of the chatbot’s responses. We conducted a series of experiments to explore the impact of adding a retrieval component to the GPT-3.5 and GPT-4o large language models (LLMs) and evaluated the combined retrieval and generative processes. We present a multidimensional evaluation framework that uses an ordinal scale to assess three dimensions of chatbot performance: correctness, completeness, and communication. Human evaluators assessed the quality of responses across various NAEP subjects. GPT-4o consistently outperformed GPT-3.5, with statistically significant improvements across all three dimensions. Incorporating retrieval into the pipeline further enhanced performance, and the RAG approach resulted in high-quality responses. Ask NAEP reduced the occurrence of hallucinations: the share of questions passing the correctness measure rose from 85.5% to 92.7%, roughly a 50% reduction in non-passing responses. The study demonstrates that pairing an LLM such as GPT-4o with a robust RAG technique significantly improves the quality of responses generated by the Ask NAEP chatbot. These enhancements can help users navigate the extensive NAEP documentation more effectively by providing accurate responses to their queries.
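
The "50% reduction in non-passing responses" can be checked from the two correctness rates reported above. The following is a minimal sketch of that arithmetic, assuming both percentages are pass rates on the correctness dimension; the function name `relative_reduction` is illustrative and does not come from the study.

```python
def relative_reduction(pass_before: float, pass_after: float) -> float:
    """Relative drop in the non-passing share, given two pass rates in percent."""
    fail_before = 100.0 - pass_before  # non-passing share before (14.5)
    fail_after = 100.0 - pass_after    # non-passing share after (about 7.3)
    return (fail_before - fail_after) / fail_before


# Pass rates taken from the abstract: 85.5% and 92.7%.
reduction = relative_reduction(85.5, 92.7)
print(f"{reduction:.1%}")  # prints 49.7%
```

The non-passing share falls from 14.5% to 7.3% of questions, which the abstract rounds to a 50% reduction.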

Supporting Institution

This project has been funded at least in part with Federal funds from the U.S. Department of Education under contract numbers ED-IES-12-D-0002/0004, 91990022C0053, and 91990023D0006/91990023F0350. The content of this publication does not necessarily reflect the views or policies of the U.S. Department of Education nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.

Thanks

We would like to express our gratitude to Joseph Wilson for his review of the manuscript and constructive feedback. We also thank Jillian Harrison and Martin Hahn for their assistance with editing and formatting.

Details

Primary Language

English

Subjects

Testing, Assessment and Psychometrics (Other)

Journal Section

Research Article

Authors

Ting Zhang *
0009-0001-1724-6141
United States

Luke Patterson
0009-0000-2612-0375
United States

Maggie Beiting-parrish
0000-0002-3998-8672
United States

Blue Webb
0009-0004-4080-9864
United States

Bhashithe Abeysinghe
0009-0006-4107-8615
United States

Paul Bailey
0000-0003-0989-8729
United States

Emmanuel Sikali
0009-0007-5325-0475
United States

Publication Date

December 30, 2024

Submission Date

October 10, 2024

Acceptance Date

December 2, 2024

Published in Issue

Year: 2024, Volume: 15, Number: Special Issue

DOI

https://doi.org/10.21031/epod.1548128

IZ

https://izlik.org/JA76MM67KR

Cite

APA

Zhang, T., Patterson, L., Beiting-parrish, M., Webb, B., Abeysinghe, B., Bailey, P., & Sikali, E. (2024). Ask NAEP: A Generative AI Assistant for Querying Assessment Information. Journal of Measurement and Evaluation in Education and Psychology, 15(Special Issue), 378-394. https://doi.org/10.21031/epod.1548128