TY - JOUR
TT - A Comparison of IRT Vertical Scaling Methods in Determining the Increase in Science Achievement
AU - Albayrak Sarı, Aylin
AU - Kelecioğlu, Hülya
PY - 2017
DA - March
Y2 - 2017
DO - 10.21031/epod.286221
JF - Journal of Measurement and Evaluation in Education and Psychology
JO - JMEEP
PB - Association for Measurement and Evaluation in Education and Psychology
WT - DergiPark
SN - 1309-6575
SP - 98
EP - 111
VL - 8
IS - 1
KW - Item response theory
KW - vertical scaling
KW - calibration methods
KW - proficiency estimation methods
N2 - This study is based on a vertical scaling implemented with reference to Item Response Theory, and involves a comparison of vertical scaling results obtained through the application of proficiency estimation methods and calibration methods. The vertical scales thus developed were assessed with reference to the criteria of grade-to-grade growth, grade-to-grade variability, and the separation of grade distributions. The data used in the study come from a dataset of 1500 students from twelve primary schools in the province of Ankara, characterized by different levels of socio-economic and cultural development. The comparison of the findings pertaining to the first and second sub-problems reveals that the mean differences found through separate calibration were lower than those obtained through concurrent calibration, and the standard deviations found in the case of separate calibration were likewise lower than the values established through concurrent calibration. Furthermore, the effect size in the case of separate calibration was again lower than the values applicable to concurrent calibration. The results reached for all three criteria using the concurrent calibration method were ranked in the order ML < MAP < EAP, with ML leading to the lowest value and EAP producing the highest. In the case of separate calibration, on the other hand, the ranking of results was found to vary with the criteria applied.
CR - Boughton, K. A., Lorie, W., & Yao, L. (2005). A multidimensional multi-group IRT model for vertical scales with complex test structure: An empirical evaluation of student growth using real data. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, Canada.
CR - Creswell, J. W. (2013). Research design: Qualitative, quantitative and mixed methods approaches (4th ed.). Thousand Oaks, CA: Sage.
CR - Çetin, E. (2009). Dikey ölçeklemede klasik test ve madde tepki kuramına dayalı yöntemlerin karşılaştırılması [A comparison of methods based on classical test theory and item response theory in vertical scaling]. Unpublished doctoral dissertation, Hacettepe University, Ankara.
CR - Dongyang, L. (2009). Developing a common scale for testlet model parameter estimates under the common-item nonequivalent groups design. Unpublished doctoral dissertation, University of Maryland.
CR - Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer.
CR - Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
CR - Hanson, B. A., & Béguin, A. A. (2002). Obtaining a common scale for item response theory item parameters using separate versus concurrent estimation in the common-item equating design. Applied Psychological Measurement, 26, 3-24.
CR - Hanson, B. A., Zeng, L., & Chien, Y. (2004). ST: A computer program for IRT scale transformation [Computer software]. Retrieved January 24, 2005, from http://www.education.uiowa.edu/casma
CR - Harris, D. J. (2003). Equating the multistate bar examination. The Bar Examiner, 72(3), 12-18.
CR - Holland, P. W., & Dorans, N. J. (2006). Linking and equating. In R. L. Brennan (Ed.), Educational measurement (pp. 187-220). Westport, CT: Praeger Publishers.
CR - Karkee, T. B., & Wright, K. R. (2004). Evaluation of linking methods for placing three-parameter logistic item parameter estimates onto a one-parameter scale. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.
CR - Kim, J. (2007). A comparison of calibration methods and proficiency estimators for creating IRT vertical scales. Unpublished doctoral dissertation, University of Iowa.
CR - Kim, S., & Kolen, M. J. (2006). Robustness to format effects of IRT linking methods for mixed-format tests. Applied Measurement in Education, 19(4), 357-381.
CR - Kim, J., Lee, W. C., Kim, D., & Kelley, K. (2009). Investigation of vertical scaling using the Rasch model. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.
CR - Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). New York: Springer-Verlag.
CR - Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
CR - Loyd, B. H., & Hoover, H. D. (1980). Vertical equating using the Rasch model. Journal of Educational Measurement, 17, 179-193.
CR - McBride, J., & Wise, L. (2001). Developing the vertical scale for the Florida Comprehensive Assessment Test (FCAT). Paper presented at the annual meeting of Harcourt Educational Measurement, San Antonio, TX.
CR - Meng, H. (2007). A comparison study of IRT calibration methods for mixed-format tests in vertical scaling. Unpublished doctoral dissertation, University of Iowa.
CR - Meng, H., Kolen, M. J., & Lohman, D. (2006). An empirical investigation of IRT scaling methods: How different IRT models, parameter estimation procedures, proficiency estimation methods, and estimation programs affect the results of vertical scaling for the Cognitive Abilities Test. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA.
CR - Nandakumar, R. (1994). Assessing dimensionality of a set of item responses: Comparison of different approaches. Journal of Educational Measurement, 31(1), 17-35.
CR - Schermelleh-Engel, K., Moosbrugger, H., & Müller, H. (2003). Evaluating the fit of structural equation models: Tests of significance and descriptive goodness-of-fit measures. Methods of Psychological Research Online, 8(2), 23-74.
CR - Sinharay, S., & Holland, P. W. (2007). Is it necessary to make anchor tests mini versions of the tests being equated or can some restrictions be relaxed? Journal of Educational Measurement, 44(3), 249-275.
CR - Tong, Y. (2005). Comparison of methodologies and results in vertical scaling for educational achievement tests. Unpublished doctoral dissertation, University of Iowa.
CR - Tong, Y., & Kolen, M. J. (2007). Comparison of methodologies and results in vertical scaling for educational achievement tests. Applied Measurement in Education, 20(2), 227-253.
CR - Tong, Y., & Kolen, M. J. (2010). Scaling: An ITEMS module. Educational Measurement: Issues and Practice, 29(4), 39-48.
CR - von Davier, A. A., Holland, P. W., & Thayer, D. T. (2004). The kernel method of test equating. New York: Springer.
CR - von Davier, A. A., & Wilson, C. (2008). Investigating the population sensitivity assumption of item response theory true-score equating across two subgroups of examinees and two test formats. Applied Psychological Measurement, 32(1), 11-26.
CR - Yen, W. M. (1984). Obtaining maximum likelihood trait estimates from number-correct scores for the three-parameter logistic model. Journal of Educational Measurement, 21, 93-111.
CR - Zhu, W. (1998). Test equating: What, why, who? Research Quarterly for Exercise and Sport, 69(1), 11-23.
UR - https://doi.org/10.21031/epod.286221
L1 - https://dergipark.org.tr/en/download/article-file/268891
ER -