An Investigation of the Factors Affecting the Vertical Scaling of Multidimensional Mixed-Format Tests

Akif Avcu; Hülya Kelecioğlu

doi:10.21031/epod.394659

Research Article

Year 2018, Volume: 9 Issue: 4, 326 - 338, 28.12.2018

Abstract

Introduction

Test scores are among the main sources of information for important decisions
made in many areas. Whatever the decision at stake, test scores should provide
the most precise information possible, because more precise information
supports better decisions. In practice, however, for reasons such as test
security and the need to track student growth, different forms of the same
test are used, or tests administered at different times are scaled by means of
common items. The scores obtained from the different forms are then equated or
scaled. Carrying out this process without error is important for making
examinations fairer and for making sound decisions about students' futures.
Accordingly, the vertical scaling methods applied to tests whose scores inform
important decisions must be psychometrically defensible. The theoretical
studies on which practitioners base their scaling decisions are therefore of
great importance, and different methods need to be compared to identify those
that produce the least error under different conditions.

The use of mixed-format tests, which combine dichotomously and polytomously
scored items, is steadily increasing. Likewise, the use of multiple forms is
becoming more widespread in large-scale test administrations that inform
important decisions about students. For scores from different test forms to be
comparable, a functional link must be established between the forms. When this
link is established between forms for different grades (or forms that differ
in difficulty), the process is called vertical scaling. Vertical scaling
resembles equating in that different test forms are linked to one another. The
forms differ in content and difficulty, however, because they reflect progress
across grades or with age. Therefore, although vertical scaling is used to
compare different test forms, scores at different levels cannot be used
interchangeably. The main purpose of test scaling is to compare scores at
different levels. A difference in level may stem from a student's grade, the
point reached in the school year, or age. Vertical scaling is generally used
to compare, over time, the scores that the same individuals obtain at
different levels. Designs of this kind are called common-item nonequivalent
groups (CINEG) designs.

This study examined how three factors affect scaling results when
multidimensional tests composed of mixed-format items are scaled under the
common-item nonequivalent groups (CINEG) design: the structure of the common
item set (a set consisting only of dichotomous items versus a set containing
both dichotomous and polytomous items), ability shrinkage (reduced ability
variance in the upper ability group versus equal variance), and the parameter
estimation method (EM versus MHRM). Whether these conditions interact was also
examined.

Method

The study was carried out with simulated data. Measurement error and bias
values were used to evaluate the quality of the scaling. The response matrices
were generated so that they contained both dichotomous and polytomous items.
Parameters were estimated under the three-parameter logistic model (3PLM) for
the dichotomous items and under the graded response model (GRM) for the
polytomous items. The data generation and analysis process was replicated 50
times. The R program was used for data generation, test calibration, and
scaling. The two-way and three-way analyses used to examine interactions were
conducted in SPSS.
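As an illustration of the kind of data generation described above, the
following is a minimal sketch of simulating a mixed-format response matrix
with 3PL dichotomous items and graded-response polytomous items. It is written
in Python rather than R, and it assumes a compensatory multidimensional model
in slope-intercept form; the function names, sample size, item counts, and
parameter distributions are all hypothetical and are not taken from the study.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prob_3pl(theta, a, d, c):
    """P(correct) under an assumed compensatory multidimensional 3PL model.
    theta: (n_persons, n_dims); a: (n_items, n_dims); d, c: (n_items,)."""
    z = theta @ a.T + d
    return c + (1.0 - c) * sigmoid(z)

def prob_grm(theta, a, d):
    """Category probabilities for one graded-response item.
    a: (n_dims,); d: (n_cats - 1,) intercepts in decreasing order."""
    cum = sigmoid((theta @ a)[:, None] + d[None, :])    # P(X >= k), k = 1..K-1
    n = theta.shape[0]
    upper = np.hstack([np.ones((n, 1)), cum])           # P(X >= 0) = 1
    lower = np.hstack([cum, np.zeros((n, 1))])          # P(X >= K) = 0
    return upper - lower                                # P(X = k), k = 0..K-1

def simulate(theta, a_dich, d_dich, c_dich, a_poly, d_poly, rng):
    """Return a response matrix: dichotomous columns first, then polytomous."""
    p = prob_3pl(theta, a_dich, d_dich, c_dich)
    dich = (rng.random(p.shape) < p).astype(int)
    poly_cols = []
    for a_j, d_j in zip(a_poly, d_poly):
        cdf = np.cumsum(prob_grm(theta, a_j, d_j), axis=1)
        u = rng.random((theta.shape[0], 1))
        # The sampled category is the number of cumulative bounds u exceeds.
        poly_cols.append((u > cdf[:, :-1]).sum(axis=1))
    return np.hstack([dich, np.column_stack(poly_cols)])

# Hypothetical setup: 500 examinees, 2 dimensions, 10 dichotomous items,
# 3 four-category polytomous items.
rng = np.random.default_rng(0)
theta = rng.normal(size=(500, 2))
a_dich = rng.uniform(0.5, 1.5, size=(10, 2))
d_dich = rng.normal(size=10)
c_dich = np.full(10, 0.2)
a_poly = rng.uniform(0.5, 1.5, size=(3, 2))
d_poly = np.array([[1.0, 0.0, -1.0]] * 3)
responses = simulate(theta, a_dich, d_dich, c_dich, a_poly, d_poly, rng)
```

A shrinkage condition could be mimicked in this sketch by drawing `theta` for
the upper group with a smaller standard deviation, though how the study
implemented that condition is not stated here.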

Results and Discussion

The results showed that the structure of the common item set substantially
affected the amount of error and bias produced by the scaling procedure.
Specifically, in mixed-format tests, a common item set consisting only of
dichotomously scored items increased scaling error, with a few exceptions.
This finding was observed consistently, regardless of the other conditions.

When the effect of variance shrinkage was examined, differences were observed
for the ability parameter and for the a parameters of the polytomously scored
items. These differences concerned the bias values. For the a parameters of
the polytomously scored items, differences were found in the error values. For
both parameters, better results were obtained when the variance was reduced.

When the effect of the estimation method was examined, bias values for some
dimensions were lower under the Metropolis-Hastings Robbins-Monro (MHRM)
estimation method. In addition, in some cases the estimation method affected
the error and bias values for the a and b parameters of the dichotomously
scored items and for the threshold parameters of the polytomously scored
items. The a parameter of the polytomously scored items was not affected by
the estimation method.

Finally, interactions were examined. For the ability parameter, the bias
values showed two-way and three-way interactions under some conditions. For
the a and b parameters of the dichotomous items, variance shrinkage and
estimation method interacted in some dimensions of the test for the error and
bias values of the b parameter. For the error values of the a parameters of
the dichotomously scored items, the three conditions interacted in the first
dimension. No interaction was observed for the a parameters or the threshold
parameters of the polytomously scored items. For all three dimensions, an
interaction was observed between the common item structure and estimation
method conditions.

In conclusion, among the conditions examined, the structure of the common item
set had the greatest effect on the scaling results.

Abstract

This study examined the effects of the structure of the common item set
(dichotomous-only versus mixed-format common items), the parameter estimation
method, and scale shrinkage on vertical scaling results when multidimensional
datasets were used under the common-item nonequivalent groups (CINEG) design.
Interactions among these variables were also investigated. The study was
performed using simulated data. Measurement error and bias indices were used
to evaluate the quality of vertical scaling. All data analysis procedures were
replicated 50 times to increase the generalizability of the results. The R
program was used for data generation, parameter calibration, and vertical
scaling. Possible interactions were investigated with factorial analysis of
variance in SPSS. The results showed a consistent effect of the common item
format across all conditions. In addition, some interactions among the
variables were observed. These findings are discussed and some recommendations
are provided.
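To make the evaluation criteria concrete, the sketch below computes bias and
an error index for parameter estimates collected over replications. It assumes
that error is summarized as root mean squared error (RMSE) against the
generating values; the abstract does not give the exact formulas, and the
function name and the toy numbers are hypothetical.

```python
import numpy as np

def bias_and_rmse(estimates, true_values):
    """Per-parameter bias and RMSE across replications.
    estimates: (n_replications, n_params); true_values: (n_params,)."""
    errors = estimates - true_values
    bias = errors.mean(axis=0)                    # mean signed error
    rmse = np.sqrt((errors ** 2).mean(axis=0))    # root mean squared error
    return bias, rmse

# Toy data: 50 replications of 3 difficulty estimates with a small
# systematic shift (0.1) plus random noise. These are not study results.
rng = np.random.default_rng(1)
true_b = np.array([-1.0, 0.0, 1.0])
estimates = true_b + 0.1 + rng.normal(scale=0.2, size=(50, 3))
bias, rmse = bias_and_rmse(estimates, true_b)
```

Because the squared RMSE equals the squared bias plus the variance of the
estimates, RMSE is never smaller than the absolute bias, which is one reason
the two indices are usually reported together.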

Keywords

vertical scaling, multidimensional tests

There are 30 citations in total.

Details

Primary Language	English
Journal Section	Articles
Authors	Akif Avcu 0000-0003-1977-7592 Hülya Kelecioğlu
Publication Date	December 28, 2018
Acceptance Date	October 13, 2018
Published in Issue	Year 2018 Volume: 9 Issue: 4

Cite

APA	Avcu, A., & Kelecioğlu, H. (2018). An Investigation of the Factors Affecting the Vertical Scaling of Multidimensional Mixed-Format Tests. Journal of Measurement and Evaluation in Education and Psychology, 9(4), 326-338. https://doi.org/10.21031/epod.394659
