Automated Essay Scoring Effect on Test Equating Errors in Mixed-format Test
Year 2021, 222-238, 10.06.2021
İbrahim Uysal, Nuri Doğan
Abstract
Scoring constructed-response items is often difficult, time-consuming, and costly in practice. Advances in computer technology have made automated scoring of constructed-response items possible. However, applying automated scoring without investigating its effect on test equating can lead to serious problems. The goal of this study was to score the constructed-response items in mixed-format tests automatically, using different training/test data splits, and to investigate the indirect effect of these scores on test equating in comparison with human raters. Bidirectional long short-term memory (BLSTM) was selected as the automated scoring method because it showed the best performance. In the test equating process, methods based on classical test theory and item response theory were used. For most of the equating methods, the equating errors resulting from automated scoring were close to those obtained when equating was based on human raters' scores. It was concluded that automated scoring is applicable because it performs adequately with respect to equating.
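To make the scoring side of the abstract concrete, the following is a minimal sketch of a BLSTM regression scorer in Keras. It is an illustration only, not the model reported in the study: the vocabulary size, embedding dimension, layer widths, and training settings are all assumptions, and the input is taken to be tokenized, integer-encoded, padded responses.

```python
# Minimal BLSTM scorer sketch (illustrative; not the study's architecture).
# Assumes responses are already tokenized, integer-encoded, and padded.
import tensorflow as tf

VOCAB_SIZE = 10_000  # assumed tokenizer vocabulary size
EMBED_DIM = 64       # assumed embedding dimension

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(1),  # regression head: predicted item score
])
model.compile(optimizer="adam", loss="mse")

# Fit on a given training/test split, e.g. 80/20 (names are placeholders):
# model.fit(x_train, y_train, validation_split=0.1, epochs=10)
```

The equating comparison can be illustrated in a similarly reduced form. The snippet below applies classical linear equating under a random-groups assumption and reports the root-mean-square difference between the equating functions obtained from human-rated and automated scores. This is only a schematic stand-in for the study's analyses, which also used equipercentile and IRT methods; all data here are synthetic.

```python
# Classical linear equating, l(x) = (s_y/s_x)(x - m_x) + m_y, and a simple
# RMSD index comparing equatings based on human vs. automated scores.
# Synthetic data throughout; not the study's design or results.
import numpy as np

rng = np.random.default_rng(42)

def linear_equate(x, y):
    """Return a function mapping Form X scores onto the Form Y scale."""
    m_x, m_y = x.mean(), y.mean()
    s_x, s_y = x.std(ddof=1), y.std(ddof=1)
    return lambda score: s_y / s_x * (score - m_x) + m_y

# Form Y scores, and Form X scores under the two scoring conditions.
form_y = rng.normal(25, 6, 2000).round().clip(0, 40)
form_x_human = rng.normal(23, 6, 2000).round().clip(0, 40)
form_x_auto = (form_x_human + rng.normal(0, 1, 2000)).round().clip(0, 40)

eq_human = linear_equate(form_x_human, form_y)
eq_auto = linear_equate(form_x_auto, form_y)

# RMSD between the two equating functions over the score scale: one crude
# index of the extra equating error introduced by automated scoring.
scale = np.arange(0, 41)
diff = eq_human(scale) - eq_auto(scale)
print("RMSD between equating functions:", float(np.sqrt(np.mean(diff**2))))
```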
References
- Adesiji, K. M., Agbonifo, O. C., Adesuyi, A. T., & Olabode, O. (2016). Development of an automated descriptive text-based scoring system. British Journal of Mathematics & Computer Science, 19(4), 1-14. https://doi.org/10.9734/BJMCS/2016/27558
- Albano, A. D. (2016). equate: An R package for observed-score linking and equating. Journal of Statistical Software, 74(8), 1-36. https://doi.org/10.18637/jss.v074.i08
- Almond, R. G. (2014). Using automated essay scores as an anchor when equating constructed-response writing tests. International Journal of Testing, 14(1), 73-91. https://doi.org/10.1080/15305058.2013.816309
- Angoff, W. H. (1984). Scales, norms and equivalent scores. Educational Testing Service.
- Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® v.2. Journal of Technology, Learning, and Assessment, 4(3), 1-30. http://www.jtla.org
- Barendse, M. T., Oort, F. J., & Timmerman, M. E. (2015). Using exploratory factor analysis to determine the dimensionality of discrete responses. Structural Equation Modeling: A Multidisciplinary Journal, 22(1), 87-101. https://doi.org/10.1080/10705511.2014.934850
- Bartlett, M. S. (1950). Tests of significance in factor analysis. British Journal of Psychology, Statistical Section, 3(2), 77-85. https://doi.org/10.1111/j.2044-8317.1950.tb00285.x
- Chen, H., Xu, J., & He, B. (2014). Automated essay scoring by capturing relative writing quality. The Computer Journal, 57(9), 1318-1330. https://doi.org/10.1093/comjnl/bxt117
- Cliff, N. (1993). Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological Bulletin, 114(3), 494-509. https://doi.org/10.1037/0033-2909.114.3.494
- Cliff, N. (1996). Ordinal methods for behavioral data analysis. Routledge.
- Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78(1), 98-104. https://doi.org/10.1037/0021-9010.78.1.98
- Creswell, J. W. (2012). Educational research: Planning, conducting and evaluating quantitative and qualitative research (4th ed.). Pearson.
- Deng, W., & Monfils, R. (2017). Long-term impact of valid case criterion on capturing population-level growth under Item Response Theory equating (Research Report 17-17). Educational Testing Service. https://doi.org/10.1002/ets2.12144
- Fritz, C. O., Morris, P. E., & Richler, J. J. (2012). Effect size estimates: Current use, calculations, and interpretation. Journal of Experimental Psychology: General, 141(1), 2-18. https://doi.org/10.1037/a0024338
- González, J., & Wiberg, M. (2017). Applying test equating methods: Using R. Springer.
- Hagge, S. L., & Kolen, M. J. (2011). Equating mixed-format tests with format representative and non-representative common items. In M. J. Kolen & W-C. Lee (Eds.), Mixed-format tests: Psychometric properties with a primary focus on equating (Vol. 1, pp. 95-135). Center for Advanced Studies in Measurement and Assessment, The University of Iowa.
- Hagge, S. L., Liu, C., He, Y., Powers, S. J., Wang, W., & Kolen, M. J. (2011). A comparison of IRT and traditional equipercentile methods in mixed-format equating. In M. J. Kolen & W-C. Lee (Eds.), Mixed-format tests: Psychometric properties with a primary focus on equating (Vol. 1, pp. 19-50). Center for Advanced Studies in Measurement and Assessment, The University of Iowa.
- He, Y. (2011). Evaluating equating properties for mixed-format tests [Doctoral dissertation, University of Iowa]. https://ir.uiowa.edu/etd/981/
- Kaiser, H. F. (1970). A second generation little jiffy. Psychometrika, 35(4), 401-415. https://doi.org/10.1007/BF02291817
- Kaiser, H. F., & Rice, J. (1974). Little jiffy, mark IV. Educational and Psychological Measurement, 34(1), 111-117. https://doi.org/10.1177/001316447403400115
- Keller, L. A., & Keller, R. R. (2011). The long-term sustainability of different item response theory scaling methods. Educational and Psychological Measurement, 71(2), 362-379. https://doi.org/10.1177/0013164410375111
- Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and linking: Methods and practices (3rd ed.). Springer.
- LaFlair, G. T., Isbell, D., May, L. D. N., Arvizu, M. N. G., & Jamieson, J. (2017). Equating in small-scale language testing programs. Language Testing, 34(1), 127-144. https://doi.org/10.1177/0265532215620825
- Lee, E., Lee, W-C., & Brennan, R. L. (2012). Exploring equity properties in equating using AP® examinations (Report No. 2012-4). College Board.
- Liu, C., & Kolen, M. J. (2011). A comparison among IRT equating methods and traditional equating methods for mixed-format tests. In M. J. Kolen & W-C. Lee (Eds.), Mixed-format tests: Psychometric properties with a primary focus on equating (Vol. 1, pp. 75-94). Center for Advanced Studies in Measurement and Assessment, The University of Iowa.
- Lorenzo-Seva, U., & Ferrando, P. J. (2006). FACTOR: A computer program to fit the exploratory factor analysis model. Behavior Research Methods, 38(1), 88-91. https://doi.org/10.3758/BF03192753
- Martinez, M. E. (1999). Cognition and the question of test item format. Educational Psychologist, 34(4), 207-218. https://doi.org/10.1207/s15326985ep3404_2
- Messick, S. (1993). Trait equivalence as construct validity of score interpretation across multiple methods of measurement. In R. E. Bennett & W. C. Ward (Eds.), Construction versus choice in cognitive measurement: Issues in constructed-response, performance testing, and portfolio assessment (pp. 61-73). Lawrence Erlbaum Associates, Inc.
- MoNE. (2017). Monitoring and evaluation of academic skills (ABİDE) 2016 8th grade report. https://odsgm.meb.gov.tr/meb_iys_dosyalar/2017_11/30114819_iY-web-v6.pdf
- Moses, T. P., & von Davier, A. A. (2006). An SAS macro for loglinear smoothing: Applications and implications (Report No. 06-05). Educational Testing Service.
- Moses, T., von Davier, A. A., & Casabianca, J. (2004). Loglinear smoothing: An alternative numerical approach using SAS (Research Report No. 04-27). Educational Testing Service.
- Muthén, L. K., & Muthén, B. O. (2012). Mplus user’s guide (7th ed.). Muthén & Muthén.
- Olgar, S. (2015). The integration of automated essay scoring systems into the equating process for mixed format tests [Doctoral dissertation, Florida State University]. http://diginole.lib.fsu.edu/islandora/object/fsu%3A253122
- Page, E. B. (1966). The imminence of grading essays by computers. Phi Delta Kappan, 47(5), 238-243. http://www.jstor.org/stable/20371545
- Pang, X., Madera, E., Radwan, N., & Zhang, S. (2010). A comparison of four test equating methods (Research Report). EQAO.
- R Core Team. (2018). R: A language and environment for statistical computing (Version 3.5.2) [Computer software]. R Foundation for Statistical Computing.
- Rodriguez, M. C. (2002). Choosing an item format. In G. Tindal & T. M. Haladyna (Eds.), Large scale assessment programs for all students: Validity, technical adequacy and implementation (pp. 213-231). Lawrence Erlbaum Associates.
- SAS Institute. (2015). Statistical analysis software (Version 9.4) [Computer software]. SAS Institute.
- Spence, P. D. (1996). The effect of multidimensionality on unidimensional equating with item response theory [Doctoral dissertation, University of Florida]. https://www.proquest.com/docview/304315473
- Tankersley, K. (2007). Tests that teach: Using standardized tests to improve instruction. Association for Supervision and Curriculum Development.
- Tate, R. (2000). Performance of a proposed method for the linking of mixed-format tests with constructed-response and multiple-choice items. Journal of Educational Measurement, 37(4), 329-346. http://www.jstor.org/stable/1435244
- Timmerman, M. E., & Lorenzo-Seva, U. (2011). Dimensionality assessment of ordered polytomous items with parallel analysis. Psychological Methods, 16(2), 209-220. https://doi.org/10.1037/a0023353
- Torchiano, M. (2020). effsize: Efficient Effect Size Computation (Version 0.8.1) [Computer software]. https://CRAN.R-project.org/package=effsize
- Weiss, D. J., & von Minden, S. (2012). A comparison of item parameter estimates from Xcalibre 4.1 and BILOG-MG (Technical Report). Assessment Systems Corporation.
- Wolf, R. (2013). Assessing the impact of characteristics of the test, common-items, and examinees on the preservation of equity properties in mixed-format test equating [Doctoral dissertation, University of Pittsburgh]. https://core.ac.uk/download/pdf/19441049.pdf
- Yoes, M. E. (1996). User's manual for the XCALIBRE marginal maximum-likelihood estimation program [Computer software]. Assessment Systems Corporation.
- Zu, J., & Liu, J. (2010). Observed score equating using discrete and passage-based anchor items. Journal of Educational Measurement, 47(4), 395-412. https://doi.org/10.1111/j.1745-3984.2010.00120.x