A Comparison of IRT Vertical Scaling Methods in Determining of the Increase in Achievement of Science Education

Aylin Albayrak Sarı; Hülya Kelecioğlu

doi:10.21031/epod.286221

Research Article

Year 2017, Volume: 8 Issue: 1, 98 - 111, 03.04.2017

Abstract

This study builds a vertical scale under Item Response Theory (IRT) and compares the vertical scaling results obtained with different proficiency estimation methods and calibration methods. The resulting vertical scales were evaluated against three criteria: grade-to-grade growth, grade-to-grade variability, and the separation of grade distributions. The data come from a total of 1500 students in twelve primary schools in the province of Ankara, which differ in their level of socio-economic and cultural development. Comparing the findings for the first and second sub-problems shows that separate calibration produced lower mean differences and lower standard deviations than concurrent calibration. The effect sizes obtained under separate calibration were likewise lower than those obtained under concurrent calibration. Under concurrent calibration, the results for all three criteria were ordered ML < MAP < EAP: maximum likelihood (ML) estimation gave the lowest values and expected a posteriori (EAP) estimation the highest, with maximum a posteriori (MAP) estimation in between. Under separate calibration, by contrast, the ordering varied with the criterion applied.
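
The three evaluation criteria named in the abstract can be illustrated with a short Python sketch. It is not the authors' procedure: it assumes that proficiency (theta) estimates for two adjacent grades are already on a common scale, and it assumes that separation of grade distributions is summarized by an effect size equal to the mean difference divided by the square root of the average of the two grade variances, a convention often used in the vertical scaling literature. The function name `scale_criteria` and the theta values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def scale_criteria(lower_grade, upper_grade):
    """Summarize two adjacent grades on a common theta scale.

    Assumed definitions (not taken from the article):
      growth      = mean(upper) - mean(lower)
      variability = sample standard deviation within each grade
      effect_size = growth / sqrt((var_lower + var_upper) / 2)
    """
    growth = mean(upper_grade) - mean(lower_grade)
    sd_lower = stdev(lower_grade)
    sd_upper = stdev(upper_grade)
    pooled_sd = sqrt((sd_lower ** 2 + sd_upper ** 2) / 2)
    return {
        "growth": growth,
        "sd_lower": sd_lower,
        "sd_upper": sd_upper,
        "effect_size": growth / pooled_sd,
    }


# Hypothetical theta estimates for two adjacent grades.
grade_a = [-1.0, 0.0, 1.0]
grade_b = [0.0, 1.0, 2.0]
print(scale_criteria(grade_a, grade_b))
```

Running the same function on the estimates produced by each calibration method and each proficiency estimator would give the kind of side-by-side values that the abstract compares.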

Keywords

Item response theory, vertical scaling, calibration methods, proficiency estimation methods

Details

Journal Section	Articles
Authors	Aylin Albayrak Sarı, Hülya Kelecioğlu
Publication Date	April 3, 2017
Acceptance Date	March 3, 2017
Published in Issue	Year 2017, Volume: 8 Issue: 1

Cite

APA	Albayrak Sarı, A., & Kelecioğlu, H. (2017). A Comparison of IRT Vertical Scaling Methods in Determining of the Increase in Achievement of Science Education. Journal of Measurement and Evaluation in Education and Psychology, 8(1), 98-111. https://doi.org/10.21031/epod.286221
