The eTIMSS and TIMSS Measurement Invariance Study: Multigroup Factor Analyses and Differential Item Functioning Analyses with the 2019 Cycle
Year 2024, Volume: 15 Issue: 2, 94-119, 30.06.2024
Murat Yalçınkaya, Hakan Atılgan, Selim Daşçıoğlu, Burak Aydın
Abstract
In this study, measurement invariance and differential item functioning (DIF) analyses of the TIMSS 2019 4th- and 8th-grade mathematics and science achievement tests were conducted for the groups of countries participating in both TIMSS and eTIMSS. The sample consisted of 9,560 respondents to the first booklet of the 2019 cycle. Multiple-Group Confirmatory Factor Analysis (MGCFA) was used to test measurement invariance, and the Mantel-Haenszel (MH), Logistic Regression (LR), and SIBTEST methods were used for the DIF analyses. The measurement invariance results indicated strict invariance between groups for all tests, which comprised 111 items in total. In the DIF analyses of the 4th- and 8th-grade mathematics tests, only three items showed moderate DIF with MH, and four items showed DIF with SIBTEST. For the 4th-grade science test, one item showed moderate DIF with both MH and SIBTEST. In the 8th-grade science test, by contrast, no items showed DIF with the MH and LR methods, while four items showed moderate DIF with SIBTEST. Overall, the MH and SIBTEST techniques were in agreement, whereas the LR method produced inconsistent results and disagreed with these two methods. The results of the measurement invariance analysis and the LR method were consistent and indicated the equivalence of TIMSS and eTIMSS scores.
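To make the "moderate DIF with MH" label concrete, the sketch below computes the Mantel-Haenszel common odds ratio for one item across matched total-score strata, converts it to the ETS delta metric (MH D-DIF = -2.35 ln α_MH), and assigns a simplified A/B/C label. This is an illustration, not the authors' code: the abstract does not state the exact classification rule, so treating "moderate" as ETS category B (1 ≤ |Δ| < 1.5) is an assumption, and the significance tests that the full ETS rule also requires are omitted. The `Stratum` class, the function names, and the choice of paper-based TIMSS as reference group and eTIMSS as focal group are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Stratum:
    """Hypothetical 2x2 table for one matched total-score level k."""
    ref_correct: int    # A_k: reference group (assumed: paper TIMSS), correct
    ref_wrong: int      # B_k: reference group, incorrect
    focal_correct: int  # C_k: focal group (assumed: eTIMSS), correct
    focal_wrong: int    # D_k: focal group, incorrect


def mh_odds_ratio(strata: Sequence[Stratum]) -> float:
    """Mantel-Haenszel common odds ratio: sum(A*D/N) / sum(B*C/N)."""
    num = 0.0
    den = 0.0
    for s in strata:
        n = s.ref_correct + s.ref_wrong + s.focal_correct + s.focal_wrong
        if n == 0:
            continue  # an empty score level contributes nothing
        num += s.ref_correct * s.focal_wrong / n
        den += s.ref_wrong * s.focal_correct / n
    if num == 0.0 or den == 0.0:
        raise ValueError("odds ratio is undefined or zero for these strata")
    return num / den


def mh_delta(alpha: float) -> float:
    """ETS delta metric; negative values mean the item favors the reference group."""
    return -2.35 * math.log(alpha)


def ets_category(delta: float) -> str:
    """Simplified ETS label from |delta| only (significance tests omitted)."""
    size = abs(delta)
    if size < 1.0:
        return "A (negligible)"
    if size < 1.5:
        return "B (moderate)"
    return "C (large)"


if __name__ == "__main__":
    # Made-up counts for two score levels, for illustration only.
    item = [Stratum(32, 18, 25, 25), Stratum(40, 10, 35, 15)]
    alpha = mh_odds_ratio(item)
    delta = mh_delta(alpha)
    print(f"alpha_MH={alpha:.3f}  MH D-DIF={delta:.3f}  {ets_category(delta)}")
```

In practice the strata are usually formed from the total test score (often including the studied item), and established implementations such as the R `difR` package add the MH chi-square test before assigning B or C.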
Ethical Statement
The information presented here was obtained through analyses of the open-access data of the TIMSS 2019 administration.