Toplam Test ve Alt Test Puanlarının Kestiriminin Hiyerarşik Madde Tepki Kuramı Modelleri ile Karşılaştırılması

Sümeyra Soysal; Hülya Kelecioğlu

doi:10.21031/epod.404089

Comparison of Estimation of Total Score and Subscores with Hierarchical Item Response Theory Models

Abstract

This study examined the relationship between subtest and total test scores using hierarchical item response theory (IRT) models, with the aim of contributing to reliable estimation of both. The RMSE and reliability of the total test score and subtest scores estimated by the Higher Order, Bi-factor, and hierarchical multidimensional IRT (MIRT) models were compared across conditions that varied the number of subtests, the subtest length, and the size of the correlations between subtests. The performance of the three models was also examined on TEOG 2015 data. In almost all conditions, as the subtest length and the correlation between subtests increased, the RMSE of the ability estimates decreased and the reliability increased for the total test score under all three models. Under all conditions, the Hierarchical MIRT model yielded the lowest RMSE values and the highest reliability values for both subtest score and total test score recovery. In addition, when the correlation between subtests was 0.8, all models produced similar RMSE and reliability values for total test score recovery. For subtest scores in two- and three-dimensional data, the RMSE of the ability estimates decreased as subtest length increased and was unaffected by the correlation between subtests in the Hierarchical MIRT model; decreased as both subtest length and the correlation between subtests increased in the Higher Order model; and decreased as subtest length increased but rose substantially as the correlation between subtests increased in the Bi-factor model.
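The two evaluation criteria named in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not define reliability, so the squared correlation between true and estimated abilities is assumed here, and the higher-order structure (each subtest ability equals a common loading times the overall ability plus a residual) is likewise an assumption. The sample size, seed, noise level, and all function names are hypothetical; only the 0.8 correlation and the three-subtest condition come from the abstract.

```python
import math
import random


def rmse(true_vals, est_vals):
    """Root mean squared error between true and estimated abilities."""
    n = len(true_vals)
    return math.sqrt(sum((e - t) ** 2 for t, e in zip(true_vals, est_vals)) / n)


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def reliability(true_vals, est_vals):
    """Assumed definition: squared correlation of true and estimated abilities."""
    return pearson(true_vals, est_vals) ** 2


def simulate_higher_order(n_persons, n_subtests, rho, rng):
    """Draw overall and subtest abilities under an assumed higher-order structure.

    Each subtest ability is lam * theta + sqrt(1 - lam**2) * residual with
    lam = sqrt(rho), so any two subtest abilities correlate rho in the population.
    """
    lam = math.sqrt(rho)
    resid_sd = math.sqrt(1.0 - rho)
    overall = []
    subtests = [[] for _ in range(n_subtests)]
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)
        overall.append(theta)
        for d in range(n_subtests):
            subtests[d].append(lam * theta + resid_sd * rng.gauss(0.0, 1.0))
    return overall, subtests


rng = random.Random(2018)  # hypothetical seed
overall, subtests = simulate_higher_order(5000, 3, 0.8, rng)

# Stand-in for model output: true overall ability plus noise (hypothetical SD of 0.4).
estimates = [t + rng.gauss(0.0, 0.4) for t in overall]

print("RMSE:", round(rmse(overall, estimates), 3))
print("Reliability:", round(reliability(overall, estimates), 3))
print("Subtest correlation:", round(pearson(subtests[0], subtests[1]), 3))
```

In an actual recovery study, the noisy stand-in would be replaced by the ability estimates each model returns, and both criteria would be averaged over replications within each condition.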

Keywords

subtest scoring, overall test scoring, hierarchical item response theory models, higher order model, bi-factor model

Toplam Test ve Alt Test Puanlarının Kestiriminin Hiyerarşik Madde Tepki Kuramı Modelleri ile Karşılaştırılması

Abstract

This study set out to examine the relationship between subtests and the total test using hierarchical item response theory models, in order to contribute to reliable estimation of subtest and total test scores. The RMSE and reliability values of the total test score and subtest scores estimated with the Higher Order, Bi-factor, and hierarchical multidimensional item response theory (MIRT) models were compared under conditions defined by the number of subtests, subtest length, and the size of the correlations between subtests. The performance of the three estimation models was also examined on TEOG 2015 data. In two- and three-dimensional data, in almost all conditions, as subtest length and the correlation between subtests increased, the estimation error of the ability parameters for the total test score obtained from the three models decreased and the estimation reliability increased. For total test scores, the Hierarchical MIRT model yielded the lowest RMSE and the highest reliability in all conditions. In addition, at a correlation of 0.8, all models estimated the total test score with similar RMSE and reliability values. In two- and three-dimensional data, the RMSE values of the ability parameters estimated for subtest scores decreased as subtest length increased but were unaffected by the level of correlation between subtests in the Hierarchical MIRT model; decreased as subtest length and the correlation between subtests increased in the Higher Order model; and decreased as subtest length increased but increased substantially as the correlation between subtests increased in the Bi-factor model.

Keywords

Subtest score estimation, total test score estimation, hierarchical item response theory models, higher order model, bi-factor model


Details

Primary Language

Turkish

Journal Section

Research Article

Authors

Sümeyra Soysal ^*
Hacettepe Üniversitesi
0000-0002-7304-1722
Türkiye

Hülya Kelecioğlu
Hacettepe Üniversitesi
0000-0002-0741-9934

Publication Date

June 30, 2018

Submission Date

March 11, 2018

Acceptance Date

June 13, 2018

Published in Issue

Year 2018 Volume: 9 Number: 2

DOI

https://doi.org/10.21031/epod.404089

IZ

https://izlik.org/JA27HM66DD

Cite

APA

Soysal, S., & Kelecioğlu, H. (2018). Toplam Test ve Alt Test Puanlarının Kestiriminin Hiyerarşik Madde Tepki Kuramı Modelleri ile Karşılaştırılması. Journal of Measurement and Evaluation in Education and Psychology, 9(2), 178-201. https://doi.org/10.21031/epod.404089

AMA

1.Soysal S, Kelecioğlu H. Toplam Test ve Alt Test Puanlarının Kestiriminin Hiyerarşik Madde Tepki Kuramı Modelleri ile Karşılaştırılması. JMEEP. 2018;9(2):178-201. doi:10.21031/epod.404089

Chicago

Soysal, Sümeyra, and Hülya Kelecioğlu. 2018. “Toplam Test Ve Alt Test Puanlarının Kestiriminin Hiyerarşik Madde Tepki Kuramı Modelleri Ile Karşılaştırılması”. Journal of Measurement and Evaluation in Education and Psychology 9 (2): 178-201. https://doi.org/10.21031/epod.404089.

EndNote

Soysal S, Kelecioğlu H (June 1, 2018) Toplam Test ve Alt Test Puanlarının Kestiriminin Hiyerarşik Madde Tepki Kuramı Modelleri ile Karşılaştırılması. Journal of Measurement and Evaluation in Education and Psychology 9 2 178–201.

IEEE

[1]S. Soysal and H. Kelecioğlu, “Toplam Test ve Alt Test Puanlarının Kestiriminin Hiyerarşik Madde Tepki Kuramı Modelleri ile Karşılaştırılması”, JMEEP, vol. 9, no. 2, pp. 178–201, June 2018, doi: 10.21031/epod.404089.

ISNAD

Soysal, Sümeyra - Kelecioğlu, Hülya. “Toplam Test Ve Alt Test Puanlarının Kestiriminin Hiyerarşik Madde Tepki Kuramı Modelleri Ile Karşılaştırılması”. Journal of Measurement and Evaluation in Education and Psychology 9/2 (June 1, 2018): 178-201. https://doi.org/10.21031/epod.404089.

JAMA

1.Soysal S, Kelecioğlu H. Toplam Test ve Alt Test Puanlarının Kestiriminin Hiyerarşik Madde Tepki Kuramı Modelleri ile Karşılaştırılması. JMEEP. 2018;9:178–201.

MLA

Soysal, Sümeyra, and Hülya Kelecioğlu. “Toplam Test Ve Alt Test Puanlarının Kestiriminin Hiyerarşik Madde Tepki Kuramı Modelleri Ile Karşılaştırılması”. Journal of Measurement and Evaluation in Education and Psychology, vol. 9, no. 2, June 2018, pp. 178-201, doi:10.21031/epod.404089.

Vancouver

1.Sümeyra Soysal, Hülya Kelecioğlu. Toplam Test ve Alt Test Puanlarının Kestiriminin Hiyerarşik Madde Tepki Kuramı Modelleri ile Karşılaştırılması. JMEEP. 2018 Jun. 1;9(2):178-201. doi:10.21031/epod.404089

Cited By

Simultaneous Estimation of Overall Score and Subscores Using MIRT, HO-IRT and Bi-factor Model on TIMSS Data

Eğitimde ve Psikolojide Ölçme ve Değerlendirme Dergisi

https://doi.org/10.21031/epod.645478

Comparison of item response theory ability and item parameters according to classical and Bayesian estimation methods

International Journal of Assessment Tools in Education

https://doi.org/10.21449/ijate.1290831

Comparison of Models for Simultaneous Estimation of Overall Score and Subscores: Estimation Accuracy, Reliability, and Classification Accuracy

Eğitimde ve Psikolojide Ölçme ve Değerlendirme Dergisi

https://doi.org/10.21031/epod.1748835