Research Article

A data pipeline for e-large-scale assessments: Better automation, quality assurance, and efficiency

Year 2023, Volume: 10 Issue: Special Issue, 116 - 131, 27.12.2023
https://doi.org/10.21449/ijate.1321061

Abstract

The increasing volume of large-scale assessment data poses a challenge for testing organizations seeking to manage data and conduct psychometric analyses efficiently. Traditional psychometric software presents barriers, such as a lack of functionality for managing data and conducting various standard psychometric analyses efficiently, and these barriers drive up the cost of achieving the desired research and analysis outcomes. To address these challenges, we designed and implemented a modernized data pipeline that allows psychometricians and statisticians to efficiently manage data, conduct psychometric analyses, generate technical reports, and perform quality assurance to validate the required outputs. This pipeline has proven to scale with large databases, decrease human error by reducing manual processes, make complex workloads repeatable, ensure high-quality outputs, and reduce the overall cost of psychometric analysis of large-scale assessment data. This paper aims to provide information that supports the modernization of current psychometric analysis practices. We share details on the workflow design and functionalities of our modernized data pipeline, which provides a universal interface to large-scale assessments, and we also discuss methods for developing non-technical, user-friendly interfaces.
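
To make the abstract's workflow concrete, the sketch below chains the four stages it names (data management, psychometric analysis, quality assurance, and report generation) in R, using packages that appear in the reference list (mirt, rmarkdown). It is a minimal illustration only: the input file, column layout, 2PL model choice, QA threshold, and report template are assumptions for demonstration, not the authors' actual implementation.

    # Minimal sketch of one pipeline run. Assumes a hypothetical scored-response
    # file "responses.csv" with an examinee "id" column followed by 0/1 item columns.
    library(mirt)

    # Data management: load responses and drop examinees with no valid answers.
    responses <- read.csv("responses.csv")
    items <- responses[, setdiff(names(responses), "id")]
    items <- items[rowSums(!is.na(items)) > 0, ]

    # Psychometric analysis: calibrate a unidimensional 2PL IRT model.
    fit <- mirt(items, model = 1, itemtype = "2PL", verbose = FALSE)
    params <- coef(fit, IRTpars = TRUE, simplify = TRUE)$items

    # Quality assurance: flag items with implausible discrimination estimates.
    flagged <- rownames(params)[params[, "a"] < 0.3 | params[, "a"] > 3]

    # Reporting: render a technical report from a hypothetical parameterized
    # R Markdown template that accepts these objects.
    rmarkdown::render("technical_report.Rmd",
                      params = list(item_params = params, flagged_items = flagged))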

References

  • Addey, C., & Sellar, S. (2018). Why do countries participate in PISA? Understanding the role of international large-scale assessments in global education policy. In A. Verger, H.K. Altinyelken, & M. Novelli (Eds.), Global education policy and international development: New agendas, issues and policies (3rd ed., pp. 97–117). Bloomsbury Publishing.
  • Allaire, J., Xie, Y., McPherson, J., Luraschi, J., Ushey, K., Atkins, A., ... & Iannone, R. (2022). rmarkdown: Dynamic Documents for R. R package version, 1(11).
  • Ansari, G.A., Parvez, M.T., & Al Khalifah, A. (2017). Cross-organizational information systems: A case for educational data mining. International Journal of Advanced Computer Science and Applications, 8(11), 170-175. http://dx.doi.org/10.14569/IJACSA.2017.081122
  • Azab, A. (2017, April). Enabling docker containers for high-performance and many-task computing. In 2017 IEEE International Conference on Cloud Engineering (IC2E) (pp. 279-285). IEEE.
  • Bezanson, J., Karpinski, S., Shah, V.B., & Edelman, A. (2012). Julia: A fast dynamic language for technical computing. ArXiv Preprint ArXiv:1209.5145.
  • Bertolini, R., Finch, S.J., & Nehm, R.H. (2021). Enhancing data pipelines for forecasting student performance: Integrating feature selection with cross-validation. International Journal of Educational Technology in Higher Education, 18(1), 1-23. https://doi.org/10.1186/s41239-021-00279-6
  • Bertolini, R., Finch, S.J., & Nehm, R.H. (2022). Quantifying variability in predictions of student performance: Examining the impact of bootstrap resampling in data pipelines. Computers and Education: Artificial Intelligence, 3, 100067. https://doi.org/10.1016/j.caeai.2022.100067
  • Bryant, W. (2019). Developing a strategy for using technology-enhanced items in large-scale standardized tests. Practical Assessment, Research, and Evaluation, 22(1), 1. https://doi.org/10.7275/70yb-dj34
  • Camara, W.J., & Harris, D.J. (2020). Impact of technology, digital devices, and test timing on score comparability. In M.J. Margolis, R.A. Feinberg (Eds.), Integrating timing considerations to improve testing practices (pp. 104-121). Routledge.
  • Chalmers, R.P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1-29. https://doi.org/10.18637/jss.v048.i06
  • Croudace, T., Ploubidis, G., & Abbott, R. (2005). BILOG-MG, MULTILOG, PARSCALE and TESTFACT. British Journal of Mathematical & Statistical Psychology, 58(1), 193. https://doi.org/10.1348/000711005X37529
  • Desjardins, C.D., & Bulut, O. (2018). Handbook of educational measurement and psychometrics using R. CRC Press.
  • Dogaru, I., & Dogaru, R. (2015, May). Using Python and Julia for efficient implementation of natural computing and complexity related algorithms. In 2015 20th International Conference on Control Systems and Computer Science (pp. 599-604). IEEE.
  • Dowle, M., & Srinivasan, A. (2023). data.table: Extension of 'data.frame'. https://r-datatable.com, https://Rdatatable.gitlab.io/data.table
  • du Toit, M. (2003). IRT from SSI: BILOG-MG, MULTILOG, PARSCALE, TESTFACT. Scientific Software International.
  • Embretson, S.E., & Reise, S.P. (2000). Item response theory for psychologists. Erlbaum.
  • Hambleton, R.K., Swaminathan, H., & Rogers, H.J. (1991). Fundamentals of item response theory (Vol. 2). Sage.
  • Kamens, D.H., & McNeely, C.L. (2010). Globalization and the growth of international educational testing and national assessment. Comparative education review, 54(1), 5-25. https://doi.org/10.1086/648471
  • Goodman, D.P., & Hambleton, R.K. (2004). Student test score reports and interpretive guides: Review of current practices and suggestions for future research. Applied Measurement in Education, 17(2), 145-220. https://doi.org/10.1207/s15324818ame1702_3
  • Liu, O.L., Brew, C., Blackmore, J., Gerard, L., Madhok, J., & Linn, M.C. (2014). Automated scoring of constructed‐response science items: Prospects and obstacles. Educational Measurement: Issues and Practice, 33(2), 19-28. https://doi.org/10.1111/emip.12028
  • Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores. Addison Wesley, Reading, MA.
  • Lynch, S. (2022). Adapting paper-based tests for computer administration: Lessons learned from 30 years of mode effects studies in education. Practical Assessment, Research, and Evaluation, 27(1), 22.
  • IBM (2020). IBM SPSS Statistics for Windows, Version 27.0. IBM Corp.
  • Martinková, P., & Drabinová, A. (2018). ShinyItemAnalysis for teaching psychometrics and to enforce routine analysis of educational tests. R Journal, 10(2), 503-515.
  • Merkel, D. (2014). Docker: Lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2.
  • Microsoft Corporation. (2018). Microsoft Excel. Retrieved from https://office.microsoft.com/excel
  • Moncaleano, S., & Russell, M. (2018). A historical analysis of technological advances to educational testing: A drive for efficiency and the interplay with validity. Journal of Applied Testing Technology, 19(1), 1–19.
  • Morandat, F., Hill, B., Osvald, L., & Vitek, J. (2012). Evaluating the design of the R language: Objects and functions for data analysis. In ECOOP 2012–Object-Oriented Programming: 26th European Conference, Beijing, China, June 11-16, 2012. Proceedings 26 (pp. 104-131). Springer Berlin Heidelberg.
  • Muraki, E., & Bock, R.D. (2003). PARSCALE 4 for Windows: IRT based test scoring and item analysis for graded items and rating scales [Computer software]. Scientific Software International, Inc.
  • Oranje, A., & Kolstad, A. (2019). Research on psychometric modeling, analysis, and reporting of the national assessment of educational progress. Journal of Educational and Behavioral Statistics, 44(6), 648-670. https://doi.org/10.3102/1076998619867105
  • R Core Team (2022). R: Language and environment for statistical computing. (Version 4.2.1) [Computer software]. Retrieved from https://cran.r-project.org.
  • Reise, S.P., Ainsworth, A.T., & Haviland, M.G. (2005). Item response theory: Fundamentals, applications, and promise in psychological research. Current directions in psychological science, 14(2), 95-101.
  • Rupp, A.A. (2003). Item response modeling with BILOG-MG and MULTILOG for Windows. International Journal of Testing, 3(4), 365-384. https://doi.org/10.1207/S15327574IJT0304_5
  • Russell, M. (2016). A framework for examining the utility of technology-enhanced items. Journal of Applied Testing Technology, 17(1), 20-32.
  • Rutkowski, L., Gonzalez, E., Joncas, M., & Von Davier, M. (2010). International large-scale assessment data: Issues in secondary analysis and reporting. Educational Researcher, 39(2), 142-151. https://doi.org/10.3102/0013189X10363170
  • Scalise, K., & Gifford, B. (2006). Computer-based assessment in e-learning: A framework for constructing "intermediate constraint" questions and tasks for technology platforms. The Journal of Technology, Learning and Assessment, 4(6).
  • Schauberger, P., & Walker, A. (2022). openxlsx: Read, Write and Edit xlsx Files. https://ycphs.github.io/openxlsx/index.html, https://github.com/ycphs/openxlsx
  • Schleiss, J., Günther, K., & Stober, S. (2022). Protecting student data in ML Pipelines: An overview of privacy-preserving ML. In International Conference on Artificial Intelligence in Education (pp. 532-536). Springer, Cham.
  • Schloerke, B., & Allen, J. (2023). plumber: An API Generator for R. https://www.rplumber.io, https://github.com/rstudio/plumber
  • Schumacker, R. (2019). Psychometric packages in R. Measurement: Interdisciplinary Research and Perspectives, 17(2), 106-112. https://doi.org/10.1080/15366367.2018.1544434
  • Skiena, S.S. (2017). The data science design manual. Springer.
  • Sung, K.H., Noh, E.H., & Chon, K.H. (2017). Multivariate generalizability analysis of automated scoring for short answer items of social studies in large-scale assessment. Asia Pacific Education Review, 18, 425-437. https://doi.org/10.1007/s12564-017-9498-1
  • Thissen, D., Chen, W-H, & Bock, R.D. (2003). MULTILOG 7 for Windows: Multiple category item analysis and test scoring using item response theory [Computer software]. Scientific Software International, Inc.
  • Van Rossum, G., & Drake Jr, F.L. (1995). Python reference manual. Centrum voor Wiskunde en Informatica Amsterdam.
  • Volante, L., & Ben Jaafar, S. (2008). Educational assessment in Canada. Assessment in Education: Principles, Policy & Practice, 15(2), 201-210. https://doi.org/10.1080/09695940802164226
  • Weber, B. G. (2020). Data science in production: Building scalable model pipelines with Python. CreateSpace Independent Publishing.
  • Wickham, H. (2022). stringr: Simple, consistent wrappers for common string operations. https://stringr.tidyverse.org.
  • Wickham, H., François, R., Henry, L., & Müller, K. (2022). dplyr: A grammar of data manipulation. Retrieved from https://dplyr.tidyverse.org.
  • Wickham, H., & Girlich, M. (2022). tidyr: Tidy messy data. Retrieved from https://tidyr.tidyverse.org
  • Wise, S.L. (2018). Computer-based testing. In The SAGE Encyclopedia of Educational Research, Measurement, and Evaluation (pp. 341–344). SAGE Publications, Inc.
  • Ysseldyke, J., & Nelson, J.R. (2002). Reporting results of student performance on large-scale assessments. In G. Tindal & T.M. Haladyna (Eds.), Large-scale assessment programs for all students: Validity, technical adequacy, and implementation (pp. 467-483). Routledge.
  • Zenisky, A.L., & Sireci, S.G. (2002). Technological innovations in large-scale assessment. Applied Measurement in Education, 15(4), 337-362. https://doi.org/10.1207/S15324818AME1504_02

There are 52 citations in total.

Details

Primary Language English
Subjects Measurement Theories and Applications in Education and Psychology
Journal Section Special Issue 2023
Authors

Ryan Schwarz 0009-0004-5867-3176

Hatice Cigdem Bulut 0000-0003-2585-3686

Charles Anifowose 0009-0006-2524-9613

Publication Date December 27, 2023
Submission Date June 30, 2023
Published in Issue Year 2023 Volume: 10 Issue: Special Issue

Cite

APA Schwarz, R., Bulut, H. C., & Anifowose, C. (2023). A data pipeline for e-large-scale assessments: Better automation, quality assurance, and efficiency. International Journal of Assessment Tools in Education, 10(Special Issue), 116-131. https://doi.org/10.21449/ijate.1321061
