<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.4 20241031//EN"
        "https://jats.nlm.nih.gov/publishing/1.4/JATS-journalpublishing1-4.dtd">
<article  article-type="research-article"        dtd-version="1.4">
            <front>

                <journal-meta>
                                                                <journal-id>tred</journal-id>
            <journal-title-group>
                                                                                    <journal-title>Trakya Eğitim Dergisi</journal-title>
            </journal-title-group>
                            <issn pub-type="ppub">2630-6301</issn>
                                                                                                        <publisher>
                    <publisher-name>Trakya Üniversitesi</publisher-name>
                </publisher>
                    </journal-meta>
                <article-meta>
                                        <article-id pub-id-type="doi">10.24315/tred.1665684</article-id>
                                                                <article-categories>
                                            <subj-group  xml:lang="en">
                                                            <subject>Computer Based Exam Applications</subject>
                                                    </subj-group>
                                            <subj-group  xml:lang="tr">
                                                            <subject>Bilgisayar Tabanlı Sınav Uygulamaları</subject>
                                                    </subj-group>
                                    </article-categories>
                                                                                                                                                        <title-group>
                                                                                                                        <trans-title-group xml:lang="en">
                                    <trans-title>Comparative Study of Fixed and On-the-Fly Computerized Multistage Testing: Implications for Measurement Accuracy and Item Security</trans-title>
                                </trans-title-group>
                                                                                                                                                                                                                                    <article-title>SABİT VE ANINDA BİREYSELLEŞTİRİLMİŞ ÇOK AŞAMALI TESTLERİN KARŞILAŞTIRMALI İNCELENMESİ: ÖLÇME KESİNLİĞİ VE MADDE GÜVENLİĞİNE İLİŞKİN ÇIKARIMLAR</article-title>
                                                                                                    </title-group>
            
                                                    <contrib-group content-type="authors">
                                                                        <contrib contrib-type="author">
                                                                    <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-2896-0201</contrib-id>
                                                                <name>
                                    <surname>Yiğiter</surname>
                                    <given-names>Mahmut Sami</given-names>
                                </name>
                                                                    <aff>ANKARA SOSYAL BİLİMLER ÜNİVERSİTESİ, UZAKTAN EĞİTİM UYGULAMA VE ARAŞTIRMA MERKEZİ</aff>
                                                            </contrib>
                                                    <contrib contrib-type="author">
                                                                    <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0001-6274-2016</contrib-id>
                                                                <name>
                                    <surname>Doğan</surname>
                                    <given-names>Nuri</given-names>
                                </name>
                                                                    <aff>HACETTEPE ÜNİVERSİTESİ, EĞİTİM BİLİMLERİ ENSTİTÜSÜ</aff>
                                                            </contrib>
                                                                                </contrib-group>
                        
                                        <pub-date pub-type="pub" iso-8601-date="2026-04-25">
                    <day>25</day>
                    <month>04</month>
                    <year>2026</year>
                </pub-date>
                                        <volume>16</volume>
                                        <issue>2</issue>
                                        <fpage>766</fpage>
                                        <lpage>820</lpage>
                        
                        <history>
                                    <date date-type="received" iso-8601-date="2025-03-25">
                        <day>25</day>
                        <month>03</month>
                        <year>2025</year>
                    </date>
                                                    <date date-type="accepted" iso-8601-date="2025-10-27">
                        <day>27</day>
                        <month>10</month>
                        <year>2025</year>
                    </date>
                            </history>
                                        <permissions>
                    <copyright-statement>Copyright © 2011, Trakya Eğitim Dergisi</copyright-statement>
                    <copyright-year>2011</copyright-year>
                    <copyright-holder>Trakya Eğitim Dergisi</copyright-holder>
                </permissions>
            
                                                                                                <trans-abstract xml:lang="en">
                            <p>In recent years, adaptive testing techniques such as Computerized Adaptive Testing (CAT) and Computerized Multistage Testing (MST) have been increasingly incorporated into large-scale evaluations. This study aims to compare Fixed-MST (F-MST) and On-the-Fly MST (O-MST), a novel approach in which items are grouped into modules based on the participant’s ability level, in terms of measurement precision and item security across various simulation scenarios. The simulations were carried out using item parameter distributions derived from the 3PL model applied in TIMSS. A total of 72 different conditions were analyzed to compare O-MST with F-MST. The findings on measurement precision reveal that O-MST performs better than F-MST, especially when the test lengths are shorter, where O-MST shows substantially higher measurement precision. Moreover, when examining ability distributions, O-MST demonstrates better measurement precision compared to F-MST, particularly in cases of non-normal distributions. A significant result from this study is that the measurement precision of O-MST improves as the length of the final module increases, whereas the measurement precision of F-MST becomes more similar to O-MST as the length of the initial module increases. Regarding item security, O-MST employed a greater number of items and exhibited a lower item exposure rate compared to F-MST in all conditions. The favorable results in terms of measurement precision and item security for O-MST are discussed within the framework of large-scale assessments and relevant literature.</p></trans-abstract>
                                                                                                                                                            <abstract><p>Son yıllarda, Bireyselleştirilmiş Bilgisayarlı Testler (BBT) ve Bireyselleştirilmiş Çok Aşamalı Testler (BÇAT) gibi uyarlanabilir test teknikleri, büyük ölçekli değerlendirmelere giderek daha fazla dahil edilmektedir. Bu çalışmanın amacı, maddelerin katılımcının yetenek düzeyine göre modüller halinde gruplandırıldığı yeni bir yaklaşım olan Sabit-BÇAT (S-BÇAT) ve Anında BÇAT&#039;ı (A-BÇAT) çeşitli simülasyon senaryolarında ölçme kesinliği ve madde güvenliği açısından karşılaştırmaktır. Simülasyonlar, TIMSS&#039;te uygulanan maddelerin 3PL modelinden türetilen madde parametre dağılımları kullanılarak gerçekleştirilmiştir. A-BÇAT ile S-BÇAT&#039;ı karşılaştırmak için toplam 72 farklı koşul analiz edilmiştir. Ölçme kesinliğine ilişkin bulgular, A-BÇAT&#039;ın S-BÇAT&#039;tan daha iyi performans gösterdiğini, özellikle de test uzunlukları daha kısa olduğunda, A-BÇAT&#039;ın önemli ölçüde daha yüksek ölçme kesinliği gösterdiğini ortaya koymaktadır. Ayrıca, yetenek dağılımları incelendiğinde, A-BÇAT, özellikle normal olmayan dağılımlarda S-BÇAT&#039;a kıyasla daha iyi ölçme kesinliği göstermektedir. Bu çalışmadan elde edilen önemli bir sonuç, A-BÇAT&#039;ın ölçme kesinliğinin son modülün uzunluğu arttıkça iyileşmesi, S-BÇAT&#039;ın ölçme kesinliğinin ise başlangıç modülünün uzunluğu arttıkça A-BÇAT&#039;a daha çok benzemesidir. Madde güvenliği ile ilgili olarak, A-BÇAT daha fazla sayıda madde kullanmış ve tüm koşullarda S-BÇAT&#039;a kıyasla daha düşük bir madde maruz kalma oranı sergilemiştir. A-BÇAT için ölçme kesinliği ve madde güvenliği açısından olumlu sonuçlar tartışılmaktadır.</p></abstract>
                                                            
            
                                                                                                                    <kwd-group xml:lang="tr">
                                                    <kwd>Madde Güvenliği</kwd>
                                                    <kwd>Bireyselleştirilmiş Çok Aşamalı Testler</kwd>
                                                    <kwd>Bilgisayarlı Testler</kwd>
                                                    <kwd>Madde Teşhir Oranı</kwd>
                                            </kwd-group>
                            
                                                <kwd-group xml:lang="en">
                                                    <kwd>Computerized Multistage Testing</kwd>
                                                    <kwd>Adaptive Testing</kwd>
                                                    <kwd>Item Security</kwd>
                                                    <kwd>Item Exposure Rate</kwd>
                                            </kwd-group>
                                                                                                                                                                    </article-meta>
    </front>
    <back>
                            <ref-list>
                                    <ref id="ref1">
                        <label>1</label>
                        <mixed-citation publication-type="journal">Arvey, R. D., Strickland, W., Drauden, G., &amp; Martin, C. (1990). Motivational components of test taking. Personnel Psychology, 43(4), 695–716. https://doi.org/10.1111/j.1744-6570.1990.tb00679.x</mixed-citation>
                    </ref>
                                    <ref id="ref2">
                        <label>2</label>
                        <mixed-citation publication-type="journal">Bergstrom, B. A., Lunz, M. E., &amp; Gershon, R. C. (1992). Altering the level of difficulty in computer adaptive testing. Applied Measurement in Education, 5(2), 137–149. https://doi.org/10.1207/s15324818ame0502_4</mixed-citation>
                    </ref>
                                    <ref id="ref3">
                        <label>3</label>
                        <mixed-citation publication-type="journal">Boztunç Öztürk, N. (2019). How the length and characteristics of routing module affect ability estimation in ca-MST? Universal Journal of Educational Research, 7(1), 164–170. https://doi.org/10.13189/ujer.2019.070121</mixed-citation>
                    </ref>
                                    <ref id="ref4">
                        <label>4</label>
                        <mixed-citation publication-type="journal">Breithaupt, K. J., Mills, C. N., &amp; Melican, G. J. (2006). Facing the opportunities of the future. Computer-based testing and the Internet: Issues and advances, 219-251.</mixed-citation>
                    </ref>
                                    <ref id="ref5">
                        <label>5</label>
                        <mixed-citation publication-type="journal">Bulut, O. (2021). Beyond multiple-choice with digital assessments. ELearn, 2021(Special Issue), 1–10. https://doi.org/10.1145/3472394</mixed-citation>
                    </ref>
                                    <ref id="ref6">
                        <label>6</label>
                        <mixed-citation publication-type="journal">Bulut, O., &amp; Sünbül, Ö. (2017). R Programlama Dili ile Madde Tepki Kuramında Monte Carlo Simülasyon Çalışmaları. Egitimde ve Psikolojide Olcme ve Degerlendirme Dergisi, 8(3), 266–287. https://doi.org/10.21031/epod.305821</mixed-citation>
                    </ref>
                                    <ref id="ref7">
                        <label>7</label>
                        <mixed-citation publication-type="journal">Cai, L., Albano, A. D., &amp; Roussos, L. A. (2021). An investigation of item calibration methods in multistage testing. Measurement: Interdisciplinary Research and Perspectives, 19(3), 163–178. 
https://doi.org/10.1080/15366367.2021.1878778</mixed-citation>
                    </ref>
                                    <ref id="ref8">
                        <label>8</label>
                        <mixed-citation publication-type="journal">Carlson, S. (2000). ETS finds flaws in the way online GRE rates some students. Chronicle of Higher Education, 47(8), A47.</mixed-citation>
                    </ref>
                                    <ref id="ref9">
                        <label>9</label>
                        <mixed-citation publication-type="journal">Cetin-Berber, D. D., Sari, H. I., &amp; Huggins-Manley, A. C. (2019). Imputation methods to deal with missing responses in computerized adaptive multistage testing. Educational and psychological measurement, 79(3), 495-511.</mixed-citation>
                    </ref>
                                    <ref id="ref10">
                        <label>10</label>
                        <mixed-citation publication-type="journal">Chang, H.-H. (2004). Understanding computerized adaptive testing: From Robbins-Monro to Lord and beyond. In D. Kaplan (Ed.), The Sage handbook of quantitative methodology for the social sciences (pp. 117-133). Thousand Oaks, CA: Sage.</mixed-citation>
                    </ref>
                                    <ref id="ref11">
                        <label>11</label>
                        <mixed-citation publication-type="journal">Chang, H.-H. (2015). Psychometrics behind Computerized Adaptive Testing. Psychometrika, 80(1), 1–20. https://doi.org/10.1007/s11336-014-9401-5</mixed-citation>
                    </ref>
                                    <ref id="ref12">
                        <label>12</label>
                        <mixed-citation publication-type="journal">Chang, H.-H., &amp; Ying, Z. (2008). To weight or not to weight? Balancing influence of initial items in adaptive testing. Psychometrika, 73(3), 441–450.</mixed-citation>
                    </ref>
                                    <ref id="ref13">
                        <label>13</label>
                        <mixed-citation publication-type="journal">Choi, S. W., &amp; van der Linden, W. J. (2018). Ensuring content validity of patient-reported outcomes: a shadow-test approach to their adaptive measurement. Quality of Life Research, 27(7), 1683-1693.</mixed-citation>
                    </ref>
                                    <ref id="ref14">
                        <label>14</label>
                        <mixed-citation publication-type="journal">Choi, S. W., Lim, S., &amp; van der Linden, W. J. (2021). TestDesign: an optimal test design approach to constructing fixed and adaptive tests in R. Behaviormetrika, 1-39.</mixed-citation>
                    </ref>
                                    <ref id="ref15">
                        <label>15</label>
                        <mixed-citation publication-type="journal">Choi, S. W., Moellering, K. T., Li, J., &amp; van der Linden, W. J. (2016). Optimal reassembly of shadow tests in CAT. Applied psychological measurement, 40(7), 469-485. https://doi.org/10.1177/0146621616654597.</mixed-citation>
                    </ref>
                                    <ref id="ref16">
                        <label>16</label>
                        <mixed-citation publication-type="journal">Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.</mixed-citation>
                    </ref>
                                    <ref id="ref17">
                        <label>17</label>
                        <mixed-citation publication-type="journal">Drasgow, F., Luecht, R. M., &amp; Bennett, R. E. (2006). Technology and testing. Educational measurement, 4, 471-515.</mixed-citation>
                    </ref>
                                    <ref id="ref18">
                        <label>18</label>
                        <mixed-citation publication-type="journal">Demir, H., &amp; Gelbal, S. (2025). A systematic review on Computerized Adaptive Testing. Journal of Education Faculty, 27(1), 137–150. https://doi.org/10.17556/erziefd.1577880</mixed-citation>
                    </ref>
                                    <ref id="ref19">
                        <label>19</label>
                        <mixed-citation publication-type="journal">Ebenbeck, N., &amp; Gebhardt, M. (2022). Simulating computerized adaptive testing in special education based on inclusive progress monitoring data. Frontiers in Education, 7. https://doi.org/10.3389/feduc.2022.945733</mixed-citation>
                    </ref>
                                    <ref id="ref20">
                        <label>20</label>
                        <mixed-citation publication-type="journal">Feinberg, R. A., &amp; Rubright, J. D. (2016). Conducting simulation studies in psychometrics. Educational Measurement: Issues and Practice, 35(2), 36-49.</mixed-citation>
                    </ref>
                                    <ref id="ref21">
                        <label>21</label>
                        <mixed-citation publication-type="journal">Fleishman, A. I. (1978). A method for simulating non-normal distributions. Psychometrika, 43(4), 521-532.</mixed-citation>
                    </ref>
                                    <ref id="ref22">
                        <label>22</label>
                        <mixed-citation publication-type="journal">Fraenkel, J. R., Wallen, N. E., &amp; Hyun, H. H. (2012). How to design and evaluate research in education. McGram-Hill Publishing.</mixed-citation>
                    </ref>
                                    <ref id="ref23">
                        <label>23</label>
                        <mixed-citation publication-type="journal">Gür, R., &amp; Gülleroğlu, H. (2020). The effect of item exposure control methods on measurement precision and test security under different measurement conditions in computerized adaptive testing. TED EĞİTİM VE BİLİM, 45(202), 113–139. https://doi.org/10.15390/eb.2020.8256</mixed-citation>
                    </ref>
                                    <ref id="ref24">
                        <label>24</label>
                        <mixed-citation publication-type="journal">Hambleton, R. K., &amp; Xing, D. (2006). Optimal and nonoptimal computer-based test designs for making pass–fail decisions. Applied Measurement in Education, 19(3), 221-239.</mixed-citation>
                    </ref>
                                    <ref id="ref25">
                        <label>25</label>
                        <mixed-citation publication-type="journal">Han, K. C. T., &amp; Guo, F. (2016). Multistage testing by shaping modules on the fly. In Computerized Multistage Testing (pp. 157-172). Chapman and Hall/CRC.</mixed-citation>
                    </ref>
                                    <ref id="ref26">
                        <label>26</label>
                        <mixed-citation publication-type="journal">Han, K. T. (2007). WinGen: Windows software that generates item response theory parameters and item responses. Applied Psychological Measurement, 31(5), 457–459. https://doi.org/10.1177/0146621607299271</mixed-citation>
                    </ref>
                                    <ref id="ref27">
                        <label>27</label>
                        <mixed-citation publication-type="journal">Harwell, M., Stone, C. A., Hsu, T. C. &amp; Kirisci, L. (1996). Monte Carlo studies in item response theory. Applied 
Psychological Measurement, 20(2), 101-125. doi: 10.1177/014662169602000201</mixed-citation>
                    </ref>
                                    <ref id="ref28">
                        <label>28</label>
                        <mixed-citation publication-type="journal">Harwell, M., Stone, C. A., Hsu, T.-C., &amp; Kirisci, L. (1996). Monte Carlo studies in item response theory. Applied Psychological Measurement, 20(2), 101–125. https://doi.org/10.1177/014662169602000201</mixed-citation>
                    </ref>
                                    <ref id="ref29">
                        <label>29</label>
                        <mixed-citation publication-type="journal">Hendrickson, A. (2007). An NCME instructional module on multistage testing. Educational Measurement Issues and Practice, 26(2), 44–52. https://doi.org/10.1111/j.1745-3992.2007.00093.x</mixed-citation>
                    </ref>
                                    <ref id="ref30">
                        <label>30</label>
                        <mixed-citation publication-type="journal">Khorramdel, L., Pokropek, A., Joo, S. H., Kirsch, I., &amp; Halderman, L. (2020). Examining gender DIF and gender differences in the PISA 2018 reading literacy scale: A partial invariance approach. Psychological Test and Assessment Modeling, 62(2), 179-231.</mixed-citation>
                    </ref>
                                    <ref id="ref31">
                        <label>31</label>
                        <mixed-citation publication-type="journal">Kim, H., &amp; Plake, B. (1993). Monte Carlo simulation comparison of two-stage testing and computer adaptive testing. Unpublished doctoral dissertation, University of Nebraska, Lincoln.</mixed-citation>
                    </ref>
                                    <ref id="ref32">
                        <label>32</label>
                        <mixed-citation publication-type="journal">Kirsch, I., &amp; Lennon, M. L. (2017). PIAAC: a new design for a new era. Large-scale Assessments in Education, 5(1), 1-22.</mixed-citation>
                    </ref>
                                    <ref id="ref33">
                        <label>33</label>
                        <mixed-citation publication-type="journal">Ling, G., Attali, Y., Finn, B., &amp; Stone, E. A. (2017). Is a computerized adaptive test more motivating than a fixed-item test? Applied Psychological Measurement, 41(7), 495–511. https://doi.org/10.1177/0146621617707556</mixed-citation>
                    </ref>
                                    <ref id="ref34">
                        <label>34</label>
                        <mixed-citation publication-type="journal">Lord, F. M. (1971). A theoretical study of two-stage testing. Psychometrika, 36(3), 227-242. https://doi.org/10.1007/BF02297844</mixed-citation>
                    </ref>
                                    <ref id="ref35">
                        <label>35</label>
                        <mixed-citation publication-type="journal">Luo, X., &amp; Kim, D. (2018). A top‐down approach to designing the computerized adaptive multistage test. Journal of Educational Measurement, 55(2), 243-263.</mixed-citation>
                    </ref>
                                    <ref id="ref36">
                        <label>36</label>
                        <mixed-citation publication-type="journal">Magis, D., Yan, D., &amp; Von Davier, A. A. (2017). Computerized adaptive and multistage testing with R: Using packages catr and mstr. Springer.</mixed-citation>
                    </ref>
                                    <ref id="ref37">
                        <label>37</label>
                        <mixed-citation publication-type="other">Makhorin, A. (2017). GNU Linear Programming Kit. Version 4.61, URL http://www.gnu.org/software/glpk/glpk.html.</mixed-citation>
                        <mixed-citation publication-type="journal">Martin, A. J., &amp; Lazendic, G. (2018). Computer-adaptive testing: Implications for students’ achievement, motivation, engagement, and subjective test experience. Journal of Educational Psychology, 110(1), 27–45. https://doi.org/10.1037/edu0000205</mixed-citation>
                    </ref>
                                    <ref id="ref38">
                        <label>38</label>
                        <mixed-citation publication-type="journal">Mead, A. D. (2006). An introduction to multistage testing. Applied Measurement in Education, 19(3), 185–187. https://doi.org/10.1207/s15324818ame1903_1</mixed-citation>
                    </ref>
                                    <ref id="ref39">
                        <label>39</label>
                        <mixed-citation publication-type="journal">MEB (2021). 2021 Ortaöğretim Kurumlarına İlişkin Merkezi Sınav Raporu. Milli Eğitim Bakanlığı.</mixed-citation>
                    </ref>
                                    <ref id="ref40">
                        <label>40</label>
                        <mixed-citation publication-type="journal">Morris, T. P., White, I. R., &amp; Crowther, M. J. (2019). Using simulation studies to evaluate statistical methods. Statistics in medicine, 38(11), 2074-2102.</mixed-citation>
                    </ref>
                                    <ref id="ref41">
                        <label>41</label>
                        <mixed-citation publication-type="journal">OECD (2023). PISA 2022 Results (Volume I): The State of Learning and Equity in Education, OECD Publishing, Paris, https://doi.org/10.1787/53f23881-en.</mixed-citation>
                    </ref>
                                    <ref id="ref42">
                        <label>42</label>
                        <mixed-citation publication-type="journal">OECD (2024), PISA 2022 Technical Report, PISA, OECD Publishing, Paris, https://doi.org/10.1787/01820d6d-en.</mixed-citation>
                    </ref>
                                    <ref id="ref43">
                        <label>43</label>
                        <mixed-citation publication-type="journal">Ortner, T. M., Weißkopf, E., &amp; Koch, T. (2014). I will probably fail: Higher ability students’ motivational experiences during adaptive achievement testing. European Journal of Psychological Assessment: Official Organ of the European Association of Psychological Assessment, 30(1), 48–56. https://doi.org/10.1027/1015-5759/a000168</mixed-citation>
                    </ref>
                                    <ref id="ref44">
                        <label>44</label>
                        <mixed-citation publication-type="journal">Patsula, L. N., &amp; Hambleton, R. K. (1999). A comparative study of ability estimates obtained from computer-adaptive and multi-stage testing. In annual meeting of the National Council on Measurement in Education, Montreal, Quebec.</mixed-citation>
                    </ref>
                                    <ref id="ref45">
                        <label>45</label>
                        <mixed-citation publication-type="journal">Pine, S. M., Church, A. T., Gialluca, K. A., &amp; Weiss, D. J. (1979). Effects of Computerized Adaptive Testing on Black and White Students. Minnesota Univ Minneapolis Dept Of Psychology.</mixed-citation>
                    </ref>
                                    <ref id="ref46">
                        <label>46</label>
                        <mixed-citation publication-type="journal">Saatçi̇oğlu, F. M., &amp; Atar, H. Y. (2022). Investigation of the effect of parameter estimation and classification accuracy in mixture IRT models under different conditions. International Journal of Assessment Tools in Education, 9(4), 1013–1029. https://doi.org/10.21449/ijate.1164590</mixed-citation>
                    </ref>
                                    <ref id="ref47">
                        <label>47</label>
                        <mixed-citation publication-type="journal">Stark, S., &amp; Chernyshenko, O. S. (2006). Multistage testing: Widely or narrowly applicable?. Applied Measurement in Education, 19(3), 257-260.</mixed-citation>
                    </ref>
                                    <ref id="ref48">
                        <label>48</label>
                        <mixed-citation publication-type="journal">Şahin, A., &amp; Weiss, D. J. (2015). Effects of calibration sample size and item bank size on ability estimation in computerized adaptive testing. Educational Sciences Theory &amp; Practice, 15(6). https://doi.org/10.12738/estp.2015.6.0102</mixed-citation>
                    </ref>
                                    <ref id="ref49">
                        <label>49</label>
                        <mixed-citation publication-type="other">Tay, P. H. (2015). On-the-fly assembled multistage adaptive testing. University of Illinois at Urbana-Champaign.</mixed-citation>
                        <mixed-citation publication-type="journal">Tomashev, M. V., Avdeev, A. S., &amp; Krasnova, M. V. (2018). Adaptive testing as a tool for managing quality of education. Informatics and Education, 9, 27–33. https://doi.org/10.32517/0234-0453-2018-33-9-27-33</mixed-citation>
                    </ref>
                                    <ref id="ref50">
                        <label>50</label>
                        <mixed-citation publication-type="journal">van der Linden, W. J. (2009). Constrained adaptive testing with shadow tests. In Elements of adaptive testing (pp. 31-55). Springer, New York, NY.</mixed-citation>
                    </ref>
                                    <ref id="ref51">
                        <label>51</label>
                        <mixed-citation publication-type="journal">van der Linden, W. J. (2010). Elements of adaptive testing. C. A. Glas (Ed.). New York, NY: Springer.</mixed-citation>
                    </ref>
                                    <ref id="ref52">
                        <label>52</label>
                        <mixed-citation publication-type="journal">van der Linden, W. J. (2018). Optimal test design. Handbook of item response theory: Vol. 3. Applications, 167-195.</mixed-citation>
                    </ref>
                                    <ref id="ref53">
                        <label>53</label>
                        <mixed-citation publication-type="journal">van der Linden, W. J. (2021). Review of the shadow-test approach to adaptive testing. Behaviormetrika, 1-22.</mixed-citation>
                    </ref>
                                    <ref id="ref54">
                        <label>54</label>
                        <mixed-citation publication-type="journal">van der Linden, W. J., &amp; Diao, Q. (2016). Using a universal shadow-test assembler with multistage testing. Computerized multistage testing: Theory and applications, 101-118.</mixed-citation>
                    </ref>
                                    <ref id="ref55">
                        <label>55</label>
                        <mixed-citation publication-type="journal">van der Linden, W. J., &amp; Veldkamp, B. P. (2004). Constraining item exposure in computerized adaptive testing with shadow tests. Journal of Educational and Behavioral Statistics: A Quarterly Publication Sponsored by the American Educational Research Association and the American Statistical Association, 29(3), 273–291. https://doi.org/10.3102/10769986029003273</mixed-citation>
                    </ref>
                                    <ref id="ref56">
                        <label>56</label>
                        <mixed-citation publication-type="journal">van der Linden, W. J., Breithaupt, K., Chuah, S. C., &amp; Zhang, Y. (2007). Detecting differential speededness in multistage testing. Journal of Educational Measurement, 44(2), 117–130. https://doi.org/10.1111/j.1745-3984.2007.00030.x</mixed-citation>
                    </ref>
                                    <ref id="ref57">
                        <label>57</label>
                        <mixed-citation publication-type="journal">Xu, L., Jiang, Z., Han, Y., Liang, H., &amp; Ouyang, J. (2023). Developing computerized Adaptive Testing for a national health professionals exam: An attempt from psychometric simulations. Perspectives on Medical Education, 12(1), 462–471. https://doi.org/10.5334/pme.855</mixed-citation>
                    </ref>
                                    <ref id="ref58">
                        <label>58</label>
                        <mixed-citation publication-type="journal">Yamamoto, K., Shin, H. J., &amp; Khorramdel, L. (2018). Multistage adaptive testing design in international large-scale assessments. Educational Measurement Issues and Practice, 37(4), 16–27. https://doi.org/10.1111/emip.12226</mixed-citation>
                    </ref>
                                    <ref id="ref59">
                        <label>59</label>
                        <mixed-citation publication-type="book">Yan, D., Von Davier, A. A., &amp; Lewis, C. (Eds.). (2016). Computerized multistage testing: Theory and applications. CRC Press.</mixed-citation>
                    </ref>
                                    <ref id="ref60">
                        <label>60</label>
                        <mixed-citation publication-type="journal">Yasuda, J. I., Mae, N., Hull, M. M., &amp; Taniguchi, M. A. (2021). Optimizing the length of computerized adaptive testing for the force concept inventory. Physical review physics education research, 17(1), 1-15.</mixed-citation>
                    </ref>
                                    <ref id="ref61">
                        <label>61</label>
                        <mixed-citation publication-type="journal">Yasuda, J.-I., Mae, N., Hull, M. M., &amp; Taniguchi, M.-A. (2021). Optimizing the length of computerized adaptive testing for the Force Concept Inventory. Physical Review Physics Education Research, 17(1). https://doi.org/10.1103/physrevphyseducres.17.010115</mixed-citation>
                    </ref>
                                    <ref id="ref62">
                        <label>62</label>
                        <mixed-citation publication-type="journal">Yigiter, M. S., &amp; Dogan, N. (2023). Computerized multistage testing: Principles, designs and practices with R. Measurement: Interdisciplinary Research and Perspectives, 21(4), 254–277. https://doi.org/10.1080/15366367.2022.2158017</mixed-citation>
                    </ref>
                                    <ref id="ref63">
                        <label>63</label>
                        <mixed-citation publication-type="journal">Yiğiter, M. S., &amp; Boduroğlu, E. (2024). Item Response Theory assumptions: A comprehensive review of studies with document analysis. International Journal of Educational Studies and Policy, 5(2), 119-138. https://doi.org/10.5281/ZENODO.14016086</mixed-citation>
                    </ref>
                                    <ref id="ref64">
                        <label>64</label>
                        <mixed-citation publication-type="journal">Yiğiter, M. S., &amp; Doğan, N. (2023). The effect of test design on misrouting in computerized multistage testing. International Journal of Turkish Education Sciences, 2023(21), 549–587. https://doi.org/10.46778/goputeb.1267319</mixed-citation>
                        <mixed-citation publication-type="thesis">Zheng, W. (2016). Making test batteries adaptive by using multistage testing techniques (Doctoral dissertation, University of North Carolina, Greensboro, NC).</mixed-citation>
                    </ref>
                                    <ref id="ref65">
                        <label>65</label>
                        <mixed-citation publication-type="journal">Zheng, Y., &amp; Chang, H.-H. (2015). On-the-fly assembled multistage adaptive testing. Applied Psychological Measurement, 39(2), 104–118. https://doi.org/10.1177/0146621614544519</mixed-citation>
                    </ref>
                            </ref-list>
                    </back>
    </article>
