Madde tepki kuramına dayalı test eşitlemede ölçek dönüştürme yöntemlerinin, ortak madde oranının ve madde ayırt ediciliğinin eşitleme hatasına etkisi

Yıldız Yıldırım; Tuba Gündüz; Fazilet Gül İnce Aracı

doi:10.21764/maeuefd.1366213

The effect of calibration methods, common item ratio and item discrimination on equating error in test equating based on item response theory

Abstract

The purpose of this research is to examine the effects of common item ratio, item discrimination, and calibration method on equating error in test equating based on Item Response Theory. This basic research comprises 24 simulation conditions [two item discrimination levels (medium (a_log-mean = 0.00) and high (a_log-mean = 0.50)) × three common item ratios (10%, 20%, and 30%) × four calibration methods (Stocking & Lord, Haebara, Mean-Sigma, and Mean-Mean)], with 100 replications run for each condition. The calibration method with the lowest equating error was Stocking & Lord's method, and the method with the highest equating error was the mean-mean method. Equating error decreased as both the common item ratio and item discrimination increased. The lowest equating error was found when the calibration method was Stocking & Lord, the discrimination level was high, and the common item ratio was 30%. The highest equating error was observed when the calibration method was mean-mean, the discrimination level was medium, and the common item ratio was 10%. Finally, equating error was affected more by item discrimination than by the common item ratio. Based on these results, recommendations are made to researchers and test developers.
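
The crossed design and the two moment-based methods named in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. The factor levels are taken from the abstract, while the mean-sigma and mean-mean coefficients use the standard textbook forms, with slope A and intercept B defined so that b_base = A * b_new + B and a_base = a_new / A. The function names, the toy item parameters, and the values A = 1.2 and B = 0.5 are assumptions made for the example. Stocking & Lord and Haebara are omitted because they require numerical minimization over characteristic curves.

```python
from itertools import product
from statistics import mean, pstdev

# Design factors as reported in the abstract.
DISCRIMINATION = {"medium": 0.00, "high": 0.50}  # log-mean of the a-parameters
COMMON_ITEM_RATIOS = (0.10, 0.20, 0.30)
METHODS = ("Stocking-Lord", "Haebara", "mean-sigma", "mean-mean")
REPLICATIONS = 100

# Fully crossed design: 2 x 3 x 4 = 24 conditions.
conditions = list(product(DISCRIMINATION, COMMON_ITEM_RATIOS, METHODS))


def mean_sigma(b_new, b_base):
    """Mean-sigma coefficients from common-item difficulties only."""
    A = pstdev(b_base) / pstdev(b_new)
    B = mean(b_base) - A * mean(b_new)
    return A, B


def mean_mean(a_new, a_base, b_new, b_base):
    """Mean-mean coefficients: slope from mean discriminations."""
    A = mean(a_new) / mean(a_base)
    B = mean(b_base) - A * mean(b_new)
    return A, B


# Hypothetical common-item parameters on the new form's scale.
a_new = [0.8, 1.0, 1.3, 1.6]
b_new = [-1.0, -0.2, 0.4, 1.1]

# Place the same items on the base scale with an assumed A and B (no error).
true_A, true_B = 1.2, 0.5
a_base = [a / true_A for a in a_new]
b_base = [true_A * b + true_B for b in b_new]

print(len(conditions), "conditions,", len(conditions) * REPLICATIONS, "runs")
print("mean-sigma:", mean_sigma(b_new, b_base))
print("mean-mean: ", mean_mean(a_new, a_base, b_new, b_base))
```

With error-free common-item parameters, both methods recover the same A and B. In a simulation they diverge because the estimated a- and b-parameters carry sampling error, and that divergence is what an equating-error comparison measures.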

Ethical Statement

Since this study does not use data belonging to real individuals and the data were generated through simulation, it falls within the scope of studies that do not require ethics committee approval.

Details

Primary Language

Turkish

Subjects

Measurement Theories and Applications in Education and Psychology

Journal Section

Research Article

Authors

Yıldız Yıldırım*
ORCID: 0000-0001-8434-5062
Türkiye

Tuba Gündüz
ORCID: 0000-0002-0921-9290
Türkiye

Fazilet Gül İnce Aracı
Türkiye

Publication Date

April 30, 2026

Submission Date

September 25, 2023

Acceptance Date

February 9, 2026

Published in Issue

Year: 2026, Number: 78

DOI

https://doi.org/10.21764/maeuefd.1366213

IZ

https://izlik.org/JA33CF85UC

Cite

APA

Yıldırım, Y., Gündüz, T., & İnce Aracı, F. G. (2026). Madde tepki kuramına dayalı test eşitlemede ölçek dönüştürme yöntemlerinin, ortak madde oranının ve madde ayırt ediciliğinin eşitleme hatasına etkisi. Mehmet Akif Ersoy University Journal of Education Faculty, 78, 1-17. https://doi.org/10.21764/maeuefd.1366213