Automated Essay Scoring Effect on Test Equating Errors in Mixed-format Test
Year 2021, 222-238, 10.06.2021
İbrahim Uysal, Nuri Doğan
Abstract
Scoring constructed-response items is often difficult, time-consuming, and costly in practice. Advances in computer technology have made automated scoring of constructed-response items possible. However, applying automated scoring without investigating its effect on test equating can lead to serious problems. The goal of this study was to score the constructed-response items in mixed-format tests automatically, using different training/test data splits, and to investigate the indirect effect of these scores on test equating in comparison with human raters. Bidirectional long short-term memory (BLSTM) was selected as the automated scoring method because it showed the best performance. In the test equating process, methods based on classical test theory and item response theory were used. For most of the equating methods, the equating errors resulting from automated scoring were close to those obtained when equating was based on human raters' scores. It was concluded that automated scoring is applicable because it performs adequately with respect to equating.
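To make the scoring side of the abstract concrete, the following is a minimal sketch of a BLSTM regression scorer in Keras. It is an illustration only, not the model reported in the study: the vocabulary size, embedding dimension, layer widths, and training settings are all assumptions, and the input is taken to be tokenized, integer-encoded, padded responses.

```python
# Minimal BLSTM scorer sketch (illustrative; not the study's architecture).
# Assumes responses are already tokenized, integer-encoded, and padded.
import tensorflow as tf

VOCAB_SIZE = 10_000  # assumed tokenizer vocabulary size
EMBED_DIM = 64       # assumed embedding dimension

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(1),  # regression head: predicted item score
])
model.compile(optimizer="adam", loss="mse")

# Fit on a given training/test split, e.g. 80/20 (names are placeholders):
# model.fit(x_train, y_train, validation_split=0.1, epochs=10)
```

The equating comparison can be illustrated in a similarly reduced form. The snippet below applies classical linear equating under a random-groups assumption and reports the root-mean-square difference between the equating functions obtained from human-rated and automated scores. This is only a schematic stand-in for the study's analyses, which also used equipercentile and IRT methods; all data here are synthetic.

```python
# Classical linear equating, l(x) = (s_y/s_x)(x - m_x) + m_y, and a simple
# RMSD index comparing equatings based on human vs. automated scores.
# Synthetic data throughout; not the study's design or results.
import numpy as np

rng = np.random.default_rng(42)

def linear_equate(x, y):
    """Return a function mapping Form X scores onto the Form Y scale."""
    m_x, m_y = x.mean(), y.mean()
    s_x, s_y = x.std(ddof=1), y.std(ddof=1)
    return lambda score: s_y / s_x * (score - m_x) + m_y

# Form Y scores, and Form X scores under the two scoring conditions.
form_y = rng.normal(25, 6, 2000).round().clip(0, 40)
form_x_human = rng.normal(23, 6, 2000).round().clip(0, 40)
form_x_auto = (form_x_human + rng.normal(0, 1, 2000)).round().clip(0, 40)

eq_human = linear_equate(form_x_human, form_y)
eq_auto = linear_equate(form_x_auto, form_y)

# RMSD between the two equating functions over the score scale: one crude
# index of the extra equating error introduced by automated scoring.
scale = np.arange(0, 41)
diff = eq_human(scale) - eq_auto(scale)
print("RMSD between equating functions:", float(np.sqrt(np.mean(diff**2))))
```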
References
- Adesiji, K. M., Agbonifo, O. C., Adesuyi, A. T., & Olabode, O. (2016). Development of an automated descriptive text-based scoring system. British Journal of Mathematics & Computer Science, 19(4), 1-14. https://doi.org/10.9734/BJMCS/2016/27558
- Albano, A. D. (2016). equate: An R package for observed-score linking and equating. Journal of Statistical Software, 74(8), 1-36. https://doi.org/10.18637/jss.v074.i08
- Almond, R. G. (2014). Using automated essay scores as an anchor when equating constructed-response writing tests. International Journal of Testing, 14(1), 73-91. https://doi.org/10.1080/15305058.2013.816309
- Angoff, W. H. (1984). Scales, norms and equivalent scores. Educational Testing Service.
- Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® v.2. Journal of Technology, Learning, and Assessment, 4(3), 1-30. http://www.jtla.org
- Barendse, M. T., Oort, F. J., & Timmerman, M. E. (2015). Using exploratory factor analysis to determine the dimensionality of discrete responses. Structural Equation Modeling: A Multidisciplinary Journal, 22(1), 87-101. https://doi.org/10.1080/10705511.2014.934850
- Bartlett, M. S. (1950). Tests of significance in factor analysis. British Journal of Psychology, Statistical Section, 3(2), 77-85. https://doi.org/10.1111/j.2044-8317.1950.tb00285.x
- Chen, H., Xu, J., & He, B. (2014). Automated essay scoring by capturing relative writing quality. The Computer Journal, 57(9), 1318-1330. https://doi.org/10.1093/comjnl/bxt117
- Cliff, N. (1993). Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological Bulletin, 114(3), 494-509. https://doi.org/10.1037/0033-2909.114.3.494
- Cliff, N. (1996). Ordinal methods for behavioral data analysis. Routledge.
- Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78(1), 98-104. https://doi.org/10.1037/0021-9010.78.1.98
- Creswell, J. W. (2012). Educational research: Planning, conducting and evaluating quantitative and qualitative research (4th ed.). Pearson.
- Deng, W., & Monfils, R. (2017). Long-term impact of valid case criterion on capturing population-level growth under Item Response Theory equating (Research Report 17-17). Educational Testing Service. https://doi.org/10.1002/ets2.12144
- Fritz, C. O., Morris, P. E., & Richler, J. J. (2012). Effect size estimates: Current use, calculations, and interpretation. Journal of Experimental Psychology: General, 141(1), 2-18. https://doi.org/10.1037/a0024338
- González, J., & Wiberg, M. (2017). Applying test equating methods: Using R. Springer.
- Hagge, S. L., & Kolen, M. J. (2011). Equating mixed-format tests with format representative and non-representative common items. In M. J. Kolen & W-C. Lee (Eds.), Mixed-format tests: Psychometric properties with a primary focus on equating (Vol. 1, pp. 95-135). Center for Advanced Studies in Measurement and Assessment, The University of Iowa.
- Hagge, S. L., Liu, C., He, Y., Powers, S. J., Wang, W., & Kolen, M. J. (2011). A comparison of IRT and traditional equipercentile methods in mixed-format equating. In M. J. Kolen & W-C. Lee (Eds.), Mixed-format tests: Psychometric properties with a primary focus on equating (Vol. 1, pp. 19-50). Center for Advanced Studies in Measurement and Assessment, The University of Iowa.
- He, Y. (2011). Evaluating equating properties for mixed-format tests [Doctoral dissertation, University of Iowa]. https://ir.uiowa.edu/etd/981/
- Kaiser, H. F. (1970). A second generation little jiffy. Psychometrika, 35(4), 401-415. https://doi.org/10.1007/BF02291817
- Kaiser, H. F., & Rice, J. (1974). Little jiffy, mark IV. Educational and Psychological Measurement, 34(1), 111-117. https://doi.org/10.1177/001316447403400115
- Keller, L. A., & Keller, R. R. (2011). The long-term sustainability of different item response theory scaling methods. Educational and Psychological Measurement, 71(2), 362-379. https://doi.org/10.1177/0013164410375111
- Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and linking: Methods and practices (3rd ed.). Springer.
- LaFlair, G. T., Isbell, D., May, L. D. N., Arvizu, M. N. G., & Jamieson, J. (2017). Equating in small-scale language testing programs. Language Testing, 34(1), 127-144. https://doi.org/10.1177/0265532215620825
- Lee, E., Lee, W-C., & Brennan, R. L. (2012). Exploring equity properties in equating using AP® examinations (Report No. 2012-4). College Board.
- Liu, C., & Kolen, M. J. (2011). A comparison among IRT equating methods and traditional equating methods for mixed-format tests. In M. J. Kolen & W-C. Lee (Eds.), Mixed-format tests: Psychometric properties with a primary focus on equating (Vol. 1, pp. 75-94). Center for Advanced Studies in Measurement and Assessment, The University of Iowa.
- Lorenzo-Seva, U., & Ferrando, P. J. (2006). FACTOR: A computer program to fit the exploratory factor analysis model. Behavior Research Methods, 38(1), 88-91. https://doi.org/10.3758/BF03192753
- Martinez, M. E. (1999). Cognition and the question of test item format. Educational Psychologist, 34(4), 207-218. https://doi.org/10.1207/s15326985ep3404_2
- Messick, S. (1993). Trait equivalence as construct validity of score interpretation across multiple methods of measurement. In R. E. Bennett & W. C. Ward (Eds.), Construction versus choice in cognitive measurement: Issues in constructed-response, performance testing, and portfolio assessment (pp. 61-73). Lawrence Erlbaum Associates, Inc.
- MoNE. (2017). Monitoring and evaluation of academic skills (ABİDE) 2016 8th grade report. https://odsgm.meb.gov.tr/meb_iys_dosyalar/2017_11/30114819_iY-web-v6.pdf
- Moses, T. P., & von Davier, A. A. (2006). An SAS macro for loglinear smoothing: Applications and implications (Report No. 06-05). Educational Testing Service.
- Moses, T., von Davier, A. A., & Casabianca, J. (2004). Loglinear smoothing: An alternative numerical approach using SAS (Research Report No. 04-27). Educational Testing Service.
- Muthén, L. K., & Muthén, B. O. (2012). Mplus user’s guide (7th ed.). Muthén & Muthén.
- Olgar, S. (2015). The integration of automated essay scoring systems into the equating process for mixed format tests [Doctoral dissertation, Florida State University]. http://diginole.lib.fsu.edu/islandora/object/fsu%3A253122
- Page, E. B. (1966). The imminence of grading essays by computers. Phi Delta Kappan, 47(5), 238-243. http://www.jstor.org/stable/20371545
- Pang, X., Madera, E., Radwan, N., & Zhang, S. (2010). A comparison of four test equating methods (Research Report). EQAO.
- R Core Team. (2018). R: A language and environment for statistical computing (Version 3.5.2) [Computer software]. R Foundation for Statistical Computing.
- Rodriguez, M. C. (2002). Choosing an item format. In G. Tindal & T. M. Haladyna (Eds.), Large scale assessment programs for all students: Validity, technical adequacy and implementation (pp. 213-231). Lawrence Erlbaum Associates.
- SAS Institute. (2015). Statistical analysis software (Version 9.4) [Computer software]. SAS Institute.
- Spence, P. D. (1996). The effect of multidimensionality on unidimensional equating with item response theory [Doctoral dissertation, University of Florida]. https://www.proquest.com/docview/304315473
- Tankersley, K. (2007). Tests that teach: Using standardized tests to improve instruction. Association for Supervision and Curriculum Development.
- Tate, R. (2000). Performance of a proposed method for the linking of mixed-format tests with constructed-response and multiple-choice items. Journal of Educational Measurement, 37(4), 329-346. http://www.jstor.org/stable/1435244
- Timmerman, M. E., & Lorenzo-Seva, U. (2011). Dimensionality assessment of ordered polytomous items with parallel analysis. Psychological Methods, 16(2), 209-220. https://doi.org/10.1037/a0023353
- Torchiano, M. (2020). effsize: Efficient Effect Size Computation (Version 0.8.1) [Computer software]. https://CRAN.R-project.org/package=effsize
- Weiss, D. J., & von Minden, S. (2012). A comparison of item parameter estimates from Xcalibre 4.1 and BILOG-MG (Technical Report). Assessment Systems Corporation.
- Wolf, R. (2013). Assessing the impact of characteristics of the test, common-items, and examinees on the preservation of equity properties in mixed-format test equating [Doctoral dissertation, University of Pittsburgh]. https://core.ac.uk/download/pdf/19441049.pdf
- Yoes, M. E. (1996). User's manual for the XCALIBRE marginal maximum-likelihood estimation program [Computer software]. Assessment Systems Corporation.
- Zu, J., & Liu, J. (2010). Observed score equating using discrete and passage-based anchor items. Journal of Educational Measurement, 47(4), 395-412. https://doi.org/10.1111/j.1745-3984.2010.00120.x