Research Article


Examination of Scale Transformation and Test Equating Methods in Testlet Based Tests

Year 2025, Volume: 33 Issue: 3, 658 - 671, 25.07.2025
https://doi.org/10.24106/kefdergi.1750267

Abstract

Purpose: This study examines test equating performance under various item response theory models and sample-size conditions in testlet-based tests.
Design/Methodology/Approach: Using data from the eTIMSS 2019 science test, the study compares scale transformation methods and test equating results under the Unidimensional Item Response Theory (UIRT), Testlet Response Theory (TRT), and bifactor models across varying sample sizes. The mean-sigma and Stocking-Lord scale transformation methods, together with true score and observed score equating methods, were applied within a common-item nonequivalent groups design. Equating performance was evaluated with RMSE and BIAS values.
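The mean-sigma method mentioned above places the new form's item parameters on the old form's scale using the means and standard deviations of the common items' difficulty (b) parameters. A minimal Python sketch (an illustration only, not the authors' implementation; function names are hypothetical):

```python
import statistics

def mean_sigma_coefficients(b_old, b_new):
    """Mean-sigma linking coefficients from common-item b parameters:
    A = SD(b_old) / SD(b_new),  B = mean(b_old) - A * mean(b_new)."""
    A = statistics.pstdev(b_old) / statistics.pstdev(b_new)
    B = statistics.fmean(b_old) - A * statistics.fmean(b_new)
    return A, B

def transform_parameters(a_new, b_new, A, B):
    """Rescale new-form parameters: a* = a / A,  b* = A * b + B."""
    return [a / A for a in a_new], [A * b + B for b in b_new]

# Toy common-item difficulties on the two forms
A, B = mean_sigma_coefficients([-1.0, 0.0, 1.0], [-0.5, 0.5, 1.5])
a_star, b_star = transform_parameters([1.2], [1.5], A, B)
```

The Stocking-Lord method instead chooses A and B by minimizing the squared distance between the two forms' test characteristic curves over the common items, which typically requires numerical optimization rather than a closed-form solution.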
Findings: In this science test, which exhibited low testlet effects, scale transformation based on the UIRT model and test equating based on the bifactor model produced lower error values. Moreover, errors in parameter estimates generally decreased as sample size increased, with the TRT model in particular requiring a sample size of more than 500 for robust estimation.
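The RMSE and BIAS criteria used to compare the methods are the root mean squared difference and the mean signed difference between estimated and criterion values. A short sketch (a generic illustration of these standard formulas, not the authors' code):

```python
import math

def bias(estimates, criterion):
    # BIAS: mean signed difference between estimated and criterion values
    return sum(e - c for e, c in zip(estimates, criterion)) / len(estimates)

def rmse(estimates, criterion):
    # RMSE: square root of the mean squared difference
    return math.sqrt(
        sum((e - c) ** 2 for e, c in zip(estimates, criterion)) / len(estimates)
    )

# Toy example: three equated scores versus their criterion values
print(bias([1.0, 2.0, 3.0], [1.0, 1.0, 3.0]))  # 0.333...
print(rmse([1.0, 2.0, 3.0], [1.0, 1.0, 3.0]))  # 0.577...
```

BIAS captures systematic over- or under-estimation (errors of opposite sign cancel), while RMSE reflects overall error magnitude, so the two are reported together.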

Highlights: By accounting for testlet effects, the bifactor model yields more accurate and stable results, enabling fair and reliable score equating. Conducted with real data, this study concretely demonstrates the practical impact of testlet effects in tests containing testlets.


Details

Primary Language English
Subjects Measurement and Evaluation in Education (Other)
Journal Section Research Article
Authors

Harun Dilek 0000-0001-5671-6858

Kübra Atalay Kabasakal 0000-0002-3580-5568

Sebahat Gören 0000-0002-6453-3258

Publication Date July 25, 2025
Submission Date December 25, 2024
Acceptance Date July 25, 2025
Published in Issue Year 2025 Volume: 33 Issue: 3

Cite

APA Dilek, H., Atalay Kabasakal, K., & Gören, S. (2025). Examination of Scale Transformation and Test Equating Methods in Testlet Based Tests. Kastamonu Education Journal, 33(3), 658-671. https://doi.org/10.24106/kefdergi.1750267
