Research Article

Comparative Study of Fixed and On-the-Fly Computerized Multistage Testing: Implications for Measurement Accuracy and Item Security

Year 2026, Volume: 16, Issue: 2, 766-820, 25.04.2026
https://doi.org/10.24315/tred.1665684
https://izlik.org/JA63WW33FJ

Abstract

In recent years, adaptive testing techniques such as Computerized Adaptive Testing (CAT) and Computerized Multistage Testing (MST) have been increasingly incorporated into large-scale assessments. This study compares Fixed MST (F-MST) with On-the-Fly MST (O-MST), a novel approach in which items are assembled into modules according to the examinee's current ability level, in terms of measurement precision and item security across a range of simulation scenarios. The simulations were carried out using item parameter distributions derived from the 3PL model as applied in TIMSS, and a total of 72 conditions were analyzed. With respect to measurement precision, O-MST outperformed F-MST, and its advantage was most pronounced at shorter test lengths. Across ability distributions, O-MST likewise yielded better precision than F-MST, particularly under non-normal distributions. A notable finding is that the precision of O-MST improved as the final module was lengthened, whereas the precision of F-MST approached that of O-MST as the initial module was lengthened. Regarding item security, O-MST drew on a larger number of items from the bank and exhibited lower item exposure rates than F-MST under all conditions. These favorable results for O-MST in terms of measurement precision and item security are discussed within the framework of large-scale assessments and the relevant literature.
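For readers who wish to prototype this kind of comparison, the base-R sketch below contrasts the two designs in a deliberately simplified setting. It is a minimal illustration, not the study's simulation code: the item parameter distributions, the two-stage structure (a 10-item routing module followed by one 10-item second-stage module), the EAP estimator, and the maximum-information rule used to assemble O-MST modules are all assumptions made for the example, not the TIMSS-based settings of the 72 conditions summarized above.

set.seed(1)

# 3PL response probability with scaling constant D = 1.7
p3pl <- function(theta, a, b, c) c + (1 - c) / (1 + exp(-1.7 * a * (theta - b)))

# Fisher information of a 3PL item at theta
info3pl <- function(theta, a, b, c) {
  p <- p3pl(theta, a, b, c)
  (1.7 * a)^2 * ((p - c) / (1 - c))^2 * (1 - p) / p
}

# EAP ability estimate from responses u to items (a, b, c), N(0, 1) prior
eap <- function(u, a, b, c, grid = seq(-4, 4, length.out = 81)) {
  post <- sapply(grid, function(t) {
    p <- p3pl(t, a, b, c)
    exp(sum(u * log(p) + (1 - u) * log(1 - p)))
  }) * dnorm(grid)
  sum(grid * post) / sum(post)
}

# Illustrative 300-item bank (not the TIMSS-derived parameters of the study)
n_bank <- 300
bank <- data.frame(a = rlnorm(n_bank, 0, 0.3),
                   b = rnorm(n_bank),
                   c = rbeta(n_bank, 5, 20))

# Routing module: the 10 items with difficulty closest to 0
route <- order(abs(bank$b))[1:10]
rest  <- setdiff(seq_len(n_bank), route)
srt   <- rest[order(bank$b[rest])]
# F-MST: three pre-assembled 10-item second-stage modules (easy/medium/hard)
f_mods <- list(srt[1:10], srt[141:150], srt[281:290])

administer <- function(theta, on_the_fly) {
  u1  <- rbinom(10, 1, p3pl(theta, bank$a[route], bank$b[route], bank$c[route]))
  th1 <- eap(u1, bank$a[route], bank$b[route], bank$c[route])  # interim estimate
  mod2 <- if (on_the_fly) {
    # O-MST: build the next module on the fly from the most informative items
    rest[order(-info3pl(th1, bank$a[rest], bank$b[rest], bank$c[rest]))][1:10]
  } else {
    # F-MST: route to the fixed module whose mean difficulty is closest to th1
    f_mods[[which.min(abs(sapply(f_mods, function(m) mean(bank$b[m])) - th1))]]
  }
  u2    <- rbinom(10, 1, p3pl(theta, bank$a[mod2], bank$b[mod2], bank$c[mod2]))
  items <- c(route, mod2)
  list(est = eap(c(u1, u2), bank$a[items], bank$b[items], bank$c[items]),
       items = items)
}

thetas <- rnorm(500)  # simulated examinees
for (otf in c(FALSE, TRUE)) {
  res      <- lapply(thetas, administer, on_the_fly = otf)
  est      <- sapply(res, `[[`, "est")
  exposure <- tabulate(unlist(lapply(res, `[[`, "items")), n_bank) / length(thetas)
  cat(if (otf) "O-MST" else "F-MST",
      "| RMSE:", round(sqrt(mean((est - thetas)^2)), 3),
      "| max exposure:", round(max(exposure), 2),
      "| distinct items used:", sum(exposure > 0), "\n")
}

Even in this toy setup, the security pattern described in the abstract tends to appear: the on-the-fly condition rebuilds its second-stage module for each examinee and therefore spreads administrations over far more of the bank, while the fixed design reuses the same three pre-assembled modules and concentrates exposure on 40 items.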

References

  • Arvey, R. D., Strickland, W., Drauden, G., & Martin, C. (1990). Motivational components of test taking. Personnel Psychology, 43(4), 695–716. https://doi.org/10.1111/j.1744-6570.1990.tb00679.x
  • Bergstrom, B. A., Lunz, M. E., & Gershon, R. C. (1992). Altering the level of difficulty in computer adaptive testing. Applied Measurement in Education, 5(2), 137–149. https://doi.org/10.1207/s15324818ame0502_4
  • Boztunç Öztürk, N. (2019). How the length and characteristics of routing module affect ability estimation in ca-MST? Universal Journal of Educational Research, 7(1), 164–170. https://doi.org/10.13189/ujer.2019.070121
  • Breithaupt, K. J., Mills, C. N., & Melican, G. J. (2006). Facing the opportunities of the future. In D. Bartram & R. K. Hambleton (Eds.), Computer-based testing and the Internet: Issues and advances (pp. 219-251). Wiley.
  • Bulut, O. (2021). Beyond multiple-choice with digital assessments. ELearn, 2021(Special Issue), 1–10. https://doi.org/10.1145/3472394
  • Bulut, O., & Sünbül, Ö. (2017). R Programlama Dili ile Madde Tepki Kuramında Monte Carlo Simülasyon Çalışmaları. Eğitimde ve Psikolojide Ölçme ve Değerlendirme Dergisi, 8(3), 266–287. https://doi.org/10.21031/epod.305821
  • Cai, L., Albano, A. D., & Roussos, L. A. (2021). An investigation of item calibration methods in multistage testing. Measurement: Interdisciplinary Research and Perspectives, 19(3), 163–178. https://doi.org/10.1080/15366367.2021.1878778
  • Carlson, S. (2000). ETS finds flaws in the way online GRE rates some students. Chronicle of Higher Education, 47(8), A47.
  • Cetin-Berber, D. D., Sari, H. I., & Huggins-Manley, A. C. (2019). Imputation methods to deal with missing responses in computerized adaptive multistage testing. Educational and Psychological Measurement, 79(3), 495-511.
  • Chang, H.-H. (2004). Understanding computerized adaptive testing: From Robbins-Monro to Lord and beyond. In D. Kaplan (Ed.), The Sage handbook of quantitative methodology for the social sciences (pp. 117-133). Thousand Oaks, CA: Sage.
  • Chang, H.-H. (2015). Psychometrics behind Computerized Adaptive Testing. Psychometrika, 80(1), 1–20. https://doi.org/10.1007/s11336-014-9401-5
  • Chang, H.-H., & Ying, Z. (2008). To weight or not to weight? Balancing influence of initial items in adaptive testing. Psychometrika, 73(3), 441–450.
  • Choi, S. W., & van der Linden, W. J. (2018). Ensuring content validity of patient-reported outcomes: a shadow-test approach to their adaptive measurement. Quality of Life Research, 27(7), 1683-1693.
  • Choi, S. W., Lim, S., & van der Linden, W. J. (2021). TestDesign: an optimal test design approach to constructing fixed and adaptive tests in R. Behaviormetrika, 1-39.
  • Choi, S. W., Moellering, K. T., Li, J., & van der Linden, W. J. (2016). Optimal reassembly of shadow tests in CAT. Applied Psychological Measurement, 40(7), 469-485. https://doi.org/10.1177/0146621616654597
  • Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
  • Drasgow, F., Luecht, R. M., & Bennett, R. E. (2006). Technology and testing. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 471-515). American Council on Education/Praeger.
  • Demir, H., & Gelbal, S. (2025). A systematic review on Computerized Adaptive Testing. Erzincan University Journal of Education Faculty, 27(1), 137–150. https://doi.org/10.17556/erziefd.1577880
  • Ebenbeck, N., & Gebhardt, M. (2022). Simulating computerized adaptive testing in special education based on inclusive progress monitoring data. Frontiers in Education, 7. https://doi.org/10.3389/feduc.2022.945733
  • Feinberg, R. A., & Rubright, J. D. (2016). Conducting simulation studies in psychometrics. Educational Measurement: Issues and Practice, 35(2), 36-49.
  • Fleishman, A. I. (1978). A method for simulating non-normal distributions. Psychometrika, 43(4), 521-532.
  • Fraenkel, J. R., Wallen, N. E., & Hyun, H. H. (2012). How to design and evaluate research in education. McGraw-Hill Publishing.
  • Gür, R., & Gülleroğlu, H. (2020). The effect of item exposure control methods on measurement precision and test security under different measurement conditions in computerized adaptive testing. Eğitim ve Bilim, 45(202), 113–139. https://doi.org/10.15390/eb.2020.8256
  • Hambleton, R. K., & Xing, D. (2006). Optimal and nonoptimal computer-based test designs for making pass–fail decisions. Applied Measurement in Education, 19(3), 221-239.
  • Han, K. C. T., & Guo, F. (2016). Multistage testing by shaping modules on the fly. In D. Yan, A. A. von Davier, & C. Lewis (Eds.), Computerized multistage testing: Theory and applications (pp. 157-172). Chapman and Hall/CRC.
  • Han, K. T. (2007). WinGen: Windows software that generates item response theory parameters and item responses. Applied Psychological Measurement, 31(5), 457–459. https://doi.org/10.1177/0146621607299271
  • Harwell, M., Stone, C. A., Hsu, T.-C., & Kirisci, L. (1996). Monte Carlo studies in item response theory. Applied Psychological Measurement, 20(2), 101–125. https://doi.org/10.1177/014662169602000201
  • Hendrickson, A. (2007). An NCME instructional module on multistage testing. Educational Measurement: Issues and Practice, 26(2), 44–52. https://doi.org/10.1111/j.1745-3992.2007.00093.x
  • Khorramdel, L., Pokropek, A., Joo, S. H., Kirsch, I., & Halderman, L. (2020). Examining gender DIF and gender differences in the PISA 2018 reading literacy scale: A partial invariance approach. Psychological Test and Assessment Modeling, 62(2), 179-231.
  • Kim, H., & Plake, B. (1993). Monte Carlo simulation comparison of two-stage testing and computer adaptive testing. Unpublished doctoral dissertation, University of Nebraska, Lincoln.
  • Kirsch, I., & Lennon, M. L. (2017). PIAAC: a new design for a new era. Large-scale Assessments in Education, 5(1), 1-22.
  • Ling, G., Attali, Y., Finn, B., & Stone, E. A. (2017). Is a computerized adaptive test more motivating than a fixed-item test? Applied Psychological Measurement, 41(7), 495–511. https://doi.org/10.1177/0146621617707556
  • Lord, F. M. (1971). A theoretical study of two-stage testing. Psychometrika, 36(3), 227-242. https://doi.org/10.1007/BF02297844
  • Luo, X., & Kim, D. (2018). A top‐down approach to designing the computerized adaptive multistage test. Journal of Educational Measurement, 55(2), 243-263.
  • Magis, D., Yan, D., & von Davier, A. A. (2017). Computerized adaptive and multistage testing with R: Using packages catR and mstR. Springer.
  • Makhorin, A. (2017). GNU Linear Programming Kit (Version 4.61). http://www.gnu.org/software/glpk/glpk.html
  • Martin, A. J., & Lazendic, G. (2018). Computer-adaptive testing: Implications for students’ achievement, motivation, engagement, and subjective test experience. Journal of Educational Psychology, 110(1), 27–45. https://doi.org/10.1037/edu0000205
  • Mead, A. D. (2006). An introduction to multistage testing. Applied Measurement in Education, 19(3), 185–187. https://doi.org/10.1207/s15324818ame1903_1
  • MEB (2021). 2021 Ortaöğretim Kurumlarına İlişkin Merkezi Sınav Raporu. Milli Eğitim Bakanlığı.
  • Morris, T. P., White, I. R., & Crowther, M. J. (2019). Using simulation studies to evaluate statistical methods. Statistics in Medicine, 38(11), 2074-2102.
  • OECD (2023). PISA 2022 results (Volume I): The state of learning and equity in education. OECD Publishing, Paris. https://doi.org/10.1787/53f23881-en
  • OECD (2024). PISA 2022 technical report. OECD Publishing, Paris. https://doi.org/10.1787/01820d6d-en
  • Ortner, T. M., Weißkopf, E., & Koch, T. (2014). I will probably fail: Higher ability students’ motivational experiences during adaptive achievement testing. European Journal of Psychological Assessment, 30(1), 48–56. https://doi.org/10.1027/1015-5759/a000168
  • Patsula, L. N., & Hambleton, R. K. (1999). A comparative study of ability estimates obtained from computer-adaptive and multi-stage testing. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, Quebec, Canada.
  • Pine, S. M., Church, A. T., Gialluca, K. A., & Weiss, D. J. (1979). Effects of computerized adaptive testing on Black and White students. Minneapolis: University of Minnesota, Department of Psychology.
  • Saatçioğlu, F. M., & Atar, H. Y. (2022). Investigation of the effect of parameter estimation and classification accuracy in mixture IRT models under different conditions. International Journal of Assessment Tools in Education, 9(4), 1013–1029. https://doi.org/10.21449/ijate.1164590
  • Stark, S., & Chernyshenko, O. S. (2006). Multistage testing: Widely or narrowly applicable? Applied Measurement in Education, 19(3), 257-260.
  • Şahin, A., & Weiss, D. J. (2015). Effects of calibration sample size and item bank size on ability estimation in computerized adaptive testing. Educational Sciences: Theory & Practice, 15(6). https://doi.org/10.12738/estp.2015.6.0102
  • Tay, P. H. (2015). On-the-fly assembled multistage adaptive testing (Doctoral dissertation). University of Illinois at Urbana-Champaign.
  • Tomashev, M. V., Avdeev, A. S., & Krasnova, M. V. (2018). Adaptive testing as a tool for managing quality of education. Informatics and Education, 9, 27–33. https://doi.org/10.32517/0234-0453-2018-33-9-27-33
  • van der Linden, W. J. (2009). Constrained adaptive testing with shadow tests. In Elements of adaptive testing (pp. 31-55). Springer, New York, NY.
  • van der Linden, W. J., & Glas, C. A. W. (Eds.). (2010). Elements of adaptive testing. New York, NY: Springer.
  • van der Linden, W. J. (2018). Optimal test design. Handbook of item response theory: Vol. 3. Applications, 167-195.
  • van der Linden, W. J. (2021). Review of the shadow-test approach to adaptive testing. Behaviormetrika, 1-22.
  • van der Linden, W. J., & Diao, Q. (2016). Using a universal shadow-test assembler with multistage testing. Computerized multistage testing: Theory and applications, 101-118.
  • van der Linden, W. J., & Veldkamp, B. P. (2004). Constraining item exposure in computerized adaptive testing with shadow tests. Journal of Educational and Behavioral Statistics, 29(3), 273–291. https://doi.org/10.3102/10769986029003273
  • van der Linden, W. J., Breithaupt, K., Chuah, S. C., & Zhang, Y. (2007). Detecting differential speededness in multistage testing. Journal of Educational Measurement, 44(2), 117–130. https://doi.org/10.1111/j.1745-3984.2007.00030.x
  • Xu, L., Jiang, Z., Han, Y., Liang, H., & Ouyang, J. (2023). Developing computerized Adaptive Testing for a national health professionals exam: An attempt from psychometric simulations. Perspectives on Medical Education, 12(1), 462–471. https://doi.org/10.5334/pme.855
  • Yamamoto, K., Shin, H. J., & Khorramdel, L. (2018). Multistage adaptive testing design in international large-scale assessments. Educational Measurement: Issues and Practice, 37(4), 16–27. https://doi.org/10.1111/emip.12226
  • Yan, D., Von Davier, A. A., & Lewis, C. (Eds.). (2016). Computerized multistage testing: Theory and applications. CRC Press.
  • Yasuda, J.-I., Mae, N., Hull, M. M., & Taniguchi, M.-A. (2021). Optimizing the length of computerized adaptive testing for the Force Concept Inventory. Physical Review Physics Education Research, 17(1). https://doi.org/10.1103/physrevphyseducres.17.010115
  • Yigiter, M. S., & Dogan, N. (2023). Computerized multistage testing: Principles, designs and practices with R. Measurement: Interdisciplinary Research and Perspectives, 21(4), 254–277. https://doi.org/10.1080/15366367.2022.2158017
  • Yiğiter, M. S., & Boduroğlu, E. (2024). Item Response Theory assumptions: A comprehensive review of studies with document analysis. International Journal of Educational Studies and Policy, 5(2), 119-138. https://doi.org/10.5281/ZENODO.14016086
  • Yiğiter, M. S., & Doğan, N. (2023). The effect of test design on misrouting in computerized multistage testing. International Journal of Turkish Education Sciences, 2023(21), 549–587. https://doi.org/10.46778/goputeb.1267319
  • Zheng, W. (2016). Making test batteries adaptive by using multistage testing techniques (Doctoral dissertation). University of North Carolina at Greensboro.
  • Zheng, Y., & Chang, H.-H. (2015). On-the-fly assembled multistage adaptive testing. Applied Psychological Measurement, 39(2), 104–118. https://doi.org/10.1177/0146621614544519

Ethics Statement

The data used in this study were generated by computer programs in order to carry out model comparisons under specified conditions. Because no procedure was applied to any living subject, ethics committee approval was not required. The Hacettepe University ethics exemption statement form is attached.

Details

Primary Language: Turkish
Subjects: Computer-Based Testing Applications
Section: Research Article
Authors

Mahmut Sami Yiğiter 0000-0002-2896-0201

Nuri Doğan 0000-0001-6274-2016

Submission Date: March 25, 2025
Acceptance Date: October 27, 2025
Publication Date: April 25, 2026
DOI https://doi.org/10.24315/tred.1665684
IZ https://izlik.org/JA63WW33FJ
Published Issue: Year 2026, Volume: 16, Issue: 2

Cite

APA Yiğiter, M. S., & Doğan, N. (2026). SABİT VE ANINDA BİREYSELLEŞTİRİLMİŞ ÇOK AŞAMALI TESTLERİN KARŞILAŞTIRMALI İNCELENMESİ: ÖLÇME KESİNLİĞİ VE MADDE GÜVENLİĞİNE İLİŞKİN ÇIKARIMLAR. Trakya Eğitim Dergisi, 16(2), 766-820. https://doi.org/10.24315/tred.1665684