Research Article

Ask NAEP: A Generative AI Assistant for Querying Assessment Information

Volume: 15 Number: Special Issue December 30, 2024

Abstract

Ask NAEP, a chatbot built with the Retrieval-Augmented Generation (RAG) technique, aims to provide accurate and comprehensive responses to queries about publicly available information on the National Assessment of Educational Progress (NAEP). This study evaluates the chatbot’s performance in generating high-quality responses. We conducted a series of experiments to explore the impact of incorporating a retrieval component into the GPT-3.5 and GPT-4o large language models (LLMs) and evaluated the combined retrieval and generative processes. We present a multidimensional evaluation framework that uses an ordinal scale to assess three dimensions of chatbot performance: correctness, completeness, and communication. Human evaluators assessed the quality of responses across various NAEP subjects. The findings revealed that GPT-4o consistently outperformed GPT-3.5, with statistically significant improvements across all three dimensions. Incorporating retrieval into the pipeline further enhanced performance, and the RAG approach resulted in high-quality responses. Ask NAEP reduced the occurrence of hallucinations: the share of questions passing the correctness measure rose from 85.5% to 92.7%, roughly a 50% reduction in non-passing responses. The study demonstrates that pairing an LLM such as GPT-4o with a robust RAG technique significantly improves the quality of responses generated by the Ask NAEP chatbot. These enhancements can help users navigate the extensive NAEP documentation more effectively by providing accurate responses to their queries.
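
The 50% figure in the abstract can be checked with simple arithmetic. The sketch below assumes that "non-passing" means the complement of the correctness rate (100% minus the share of questions passing); the function name `relative_reduction` is hypothetical and used only for illustration, not taken from the study.

```python
def relative_reduction(before_pass_pct: float, after_pass_pct: float) -> float:
    """Relative drop in the non-passing share, returned as a fraction.

    Assumption: non-passing share = 100 - passing share (in percent).
    """
    before_fail = 100.0 - before_pass_pct  # 14.5% non-passing before
    after_fail = 100.0 - after_pass_pct    # 7.3% non-passing after
    return (before_fail - after_fail) / before_fail


# Correctness rates reported in the abstract: 85.5% and 92.7%.
reduction = relative_reduction(85.5, 92.7)
print(f"{reduction:.1%}")  # prints 49.7%
```

The non-passing share falls from 14.5% to 7.3%, a relative reduction of about 49.7%, which the abstract rounds to 50%.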

Supporting Institution

This project has been funded at least in part with Federal funds from the U.S. Department of Education under contract numbers ED-IES-12-D-0002/0004, 91990022C0053, and 91990023D0006/91990023F0350. The content of this publication does not necessarily reflect the views or policies of the U.S. Department of Education nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.

Thanks

We would like to express our gratitude to Joseph Wilson for his review of the manuscript and constructive feedback. We also thank Jillian Harrison and Martin Hahn for their assistance with editing and formatting.

Details

Primary Language

English

Subjects

Testing, Assessment and Psychometrics (Other)

Journal Section

Research Article

Authors

Luke Patterson
0009-0000-2612-0375
United States

Maggie Beiting-parrish
0000-0002-3998-8672
United States

Blue Webb
0009-0004-4080-9864
United States

Paul Bailey
0000-0003-0989-8729
United States

Emmanuel Sikali
0009-0007-5325-0475
United States

Publication Date

December 30, 2024

Submission Date

October 10, 2024

Acceptance Date

December 2, 2024

Published in Issue

Year 2024 Volume: 15 Number: Special Issue

APA
Zhang, T., Patterson, L., Beiting-parrish, M., Webb, B., Abeysinghe, B., Bailey, P., & Sikali, E. (2024). Ask NAEP: A Generative AI Assistant for Querying Assessment Information. Journal of Measurement and Evaluation in Education and Psychology, 15(Special Issue), 378-394. https://doi.org/10.21031/epod.1548128
AMA
1. Zhang T, Patterson L, Beiting-parrish M, et al. Ask NAEP: A Generative AI Assistant for Querying Assessment Information. JMEEP. 2024;15(Special Issue):378-394. doi:10.21031/epod.1548128
Chicago
Zhang, Ting, Luke Patterson, Maggie Beiting-parrish, et al. 2024. “Ask NAEP: A Generative AI Assistant for Querying Assessment Information”. Journal of Measurement and Evaluation in Education and Psychology 15 (Special Issue): 378-94. https://doi.org/10.21031/epod.1548128.
EndNote
Zhang T, Patterson L, Beiting-parrish M, Webb B, Abeysinghe B, Bailey P, Sikali E (December 1, 2024) Ask NAEP: A Generative AI Assistant for Querying Assessment Information. Journal of Measurement and Evaluation in Education and Psychology 15 Special Issue 378–394.
IEEE
[1] T. Zhang et al., “Ask NAEP: A Generative AI Assistant for Querying Assessment Information”, JMEEP, vol. 15, no. Special Issue, pp. 378–394, Dec. 2024, doi: 10.21031/epod.1548128.
ISNAD
Zhang, Ting - Patterson, Luke - Beiting-parrish, Maggie - Webb, Blue - Abeysinghe, Bhashithe - Bailey, Paul - Sikali, Emmanuel. “Ask NAEP: A Generative AI Assistant for Querying Assessment Information”. Journal of Measurement and Evaluation in Education and Psychology 15/Special Issue (December 1, 2024): 378-394. https://doi.org/10.21031/epod.1548128.
JAMA
1. Zhang T, Patterson L, Beiting-parrish M, Webb B, Abeysinghe B, Bailey P, Sikali E. Ask NAEP: A Generative AI Assistant for Querying Assessment Information. JMEEP. 2024;15:378–394.
MLA
Zhang, Ting, et al. “Ask NAEP: A Generative AI Assistant for Querying Assessment Information”. Journal of Measurement and Evaluation in Education and Psychology, vol. 15, no. Special Issue, Dec. 2024, pp. 378-94, doi:10.21031/epod.1548128.
Vancouver
1. Ting Zhang, Luke Patterson, Maggie Beiting-parrish, Blue Webb, Bhashithe Abeysinghe, Paul Bailey, Emmanuel Sikali. Ask NAEP: A Generative AI Assistant for Querying Assessment Information. JMEEP. 2024 Dec. 1;15(Special Issue):378-94. doi:10.21031/epod.1548128