Research Article

Examining the impact of violations of local item independence assumption on test equating methods

Volume: 12 Number: 3 September 4, 2025

Abstract

This study investigates how violating the local item independence assumption, operationalized by loading certain items onto a second dimension, affects equating error in unidimensional, dichotomously scored tests. The research was designed as a simulation study, using data generated from the PISA 2018 mathematics exam. Analyses were conducted under 36 conditions, crossing sample size (250, 1000, and 5000), test length (20, 40, and 60 items), and the proportion of items loaded onto the second dimension (0%, 15%, 30%, and 50%). A random groups design was used, and 100 replications per condition yielded 3600 datasets. The results revealed that equating methods based on classical test theory (CTT) showed varying levels of error depending on the error type and condition. Among the item response theory (IRT) scale transformation methods, the Stocking-Lord method produced the smallest errors and was the least affected by violations of the local independence assumption. In addition, the observed score equating method yielded lower root mean square error (RMSE) values than the true score equating method and was less affected by local independence violations. The SS-MIRT observed score equating method produced lower RMSE values than the other methods and was more robust against violation of the local independence assumption.
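The methods above are compared via RMSE between equated scores and a criterion equating. As a minimal illustrative sketch (not the authors' code; the score values below are hypothetical), the RMSE for one replication can be computed as:

```python
import numpy as np

def rmse(equated, criterion):
    """Root mean square error between equated scores and a criterion equating."""
    equated = np.asarray(equated, dtype=float)
    criterion = np.asarray(criterion, dtype=float)
    return float(np.sqrt(np.mean((equated - criterion) ** 2)))

# Hypothetical equated scores from one replication vs. the criterion equating.
equated = [10.2, 19.8, 30.5, 41.0]
criterion = [10.0, 20.0, 30.0, 40.0]
print(round(rmse(equated, criterion), 4))  # prints 0.5766
```

In a simulation such as this one, the per-replication values would then be averaged over the 100 replications within each condition before comparing methods.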


Details

Primary Language

English

Subjects

Measurement Theories and Applications in Education and Psychology, Simulation Study

Journal Section

Research Article

Early Pub Date

July 21, 2025

Publication Date

September 4, 2025

Submission Date

October 7, 2024

Acceptance Date

February 21, 2025

Published in Issue

Year 2025 Volume: 12 Number: 3

APA
Doğuyurt, M. F., & Tan, Ş. (2025). Examining the impact of violations of local item independence assumption on test equating methods. International Journal of Assessment Tools in Education, 12(3), 629-661. https://doi.org/10.21449/ijate.1562627
