Research Article

The difference between estimated and perceived item difficulty: An empirical study

Year 2024, Volume: 11 Issue: 2, 368 - 387
https://doi.org/10.21449/ijate.1376160

Abstract

Test development is a complex process that requires attention to many factors, one of which is writing items of varying difficulty. Using items that span a range of difficulty levels is important to ensure that test results accurately reflect test-takers' abilities. Therefore, the factors affecting item difficulty should be identified, and item difficulties should be estimated before testing. This study investigates the factors that affect estimated and perceived item difficulty in the High School Entrance Examination in Türkiye and examines whether estimation accuracy can be improved by giving feedback to experts. First, item difficulty was estimated from response data for 40 items covering reading comprehension, grammar, and reasoning. Then, the experts' predictions were compared with the data-based difficulty estimates, and feedback was provided to improve the accuracy of their predictions. The study found that some item features (e.g., length and readability) did not affect the estimated difficulty but did affect the experts' perceptions of item difficulty. Based on these results, the study concludes that providing feedback to experts can refine the factors they attend to when judging item difficulty, improving the accuracy of their estimates and, in turn, the quality of future tests.
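To make the comparison described above concrete, the sketch below (not the authors' code; the response data, the five illustrative items, and the expert ratings are hypothetical) shows how data-based item difficulty can be computed as the proportion of correct responses per item and then contrasted with experts' perceived difficulty using a correlation and a mean absolute difference, the kind of agreement summary that could be fed back to item writers.

```python
# Minimal illustrative sketch: compare data-based item difficulty (proportion
# correct) with experts' perceived difficulty. All numbers are hypothetical.
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical scored responses: 1000 examinees x 5 items (1 = correct, 0 = incorrect).
true_p = [0.85, 0.70, 0.55, 0.40, 0.25]
responses = rng.binomial(1, p=true_p, size=(1000, 5))

# Classical, data-based difficulty: proportion of correct responses per item
# (higher p = easier item).
estimated_p = responses.mean(axis=0)

# Hypothetical expert judgments on the same proportion-correct scale,
# e.g., the mean of several raters' predictions for each item.
perceived_p = np.array([0.90, 0.60, 0.65, 0.35, 0.45])

# Agreement indices that could be reported back to the experts as feedback.
r = np.corrcoef(estimated_p, perceived_p)[0, 1]
mad = np.abs(estimated_p - perceived_p).mean()

for i, (est, per) in enumerate(zip(estimated_p, perceived_p), start=1):
    print(f"Item {i}: estimated p = {est:.2f}, perceived p = {per:.2f}, gap = {per - est:+.2f}")
print(f"Correlation = {r:.2f}, mean absolute difference = {mad:.2f}")
```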

Ethical Statement

Ethical approval was obtained from the Gazi University Ethics Committee (approval no. E77082166-604.01.02-711551, dated 02.08.2023).

References

  • Aljehani, D.K., Pullishery, F., Osman, O., & Abuzenada, B.M. (2020). Relationship of text length of multiple-choice questions on item psychometric properties–A retrospective study. Saudi J Health Sci, 9, 84-87. https://doi.org/10.4103/sjhs.sjhs_76_20
  • AlKhuzaey, S., Grasso, F., Payne, T.R., & Tamma, V. (2021). A Systematic Review of Data-Driven Approaches to Item Difficulty Prediction. In I. Roll, D. McNamara, S. Sosnovsky, R. Luckin, & V. Dimitrova (Eds.), Artificial Intelligence in Education. Springer, Cham. https://doi.org/10.1007/978-3-030-78292-4_3
  • Allalouf, A., Hambleton, R., & Sireci, S. (1999). Identifying the causes of DIF in translated verbal items. Journal of Educational Measurement, 36(3), 185-198. https://doi.org/10.1111/j.1745-3984.1999.tb00553.x
  • Attali, Y., Saldivia, L., Jackson, C., Schuppan, F., & Wanamaker, W. (2014). Estimating item difficulty with comparative judgments. ETS Research Report Series, 2014(2), 1-8. https://doi.org/10.1002/ets2.12042
  • Bejar, I.I. (1983). Subject matter experts' assessment of item statistics. Applied Psychological Measurement, 7(3), 303-310. https://doi.org/10.1002/j.2333-8504.1981.tb01274.x
  • Benton, T. (2020). How Useful Is Comparative Judgement of Item Difficulty for Standard Maintaining? Research Matters, 29, 27-35.
  • Berenbon, R., & McHugh, B. (2023). Do subject matter experts’ judgments of multiple-choice format suitability predict item quality? Educational Measurement: Issues and Practice, 42(3), 13-21. https://doi.org/10.1111/emip.12570
  • Berk, R.A. (1986). A consumer’s guide to setting performance standards on criterion-referenced tests. Review of Educational Research, 56(1), 137-172. https://doi.org/10.3102/00346543056001137
  • Bock, R.D., Muraki, E., & Pfeiffenberger, W. (1988). Item pool maintenance in the presence of item parameter drift. Journal of Educational Measurement, 25(4), 275-285. https://www.jstor.org/stable/1434961
  • Boldt, R.F. (1998). GRE analytical reasoning item statistics prediction study. ETS Research Report Series, 1998(2), i-23. https://doi.org/10.1002/j.2333-8504.1998.tb01786.x
  • Caldwell, D.J., & Pate, A.N. (2013). Effects of question formats on student and item performance. American Journal of Pharmaceutical Education, 77(4). https://doi.org/10.5688/ajpe77471
  • Choi, I.-C., & Moon, Y. (2020). Predicting the difficulty of EFL tests based on corpus linguistic features and expert judgment. Language Assessment Quarterly, 17(1), 18-42. https://doi.org/10.1080/15434303.2019.1674315
  • Dalum, J., Christidis, N., Myrberg, I.H., Karlgren, K., Leanderson, C., & Englund, G.S. (2022). Are we passing the acceptable? Standard setting of theoretical proficiency tests for foreign trained dentists. European Journal of Dental Education. https://doi.org/10.1111/eje.12851
  • Davies, E. (2021). Predicting item difficulty in the assessment of Welsh. Collated Papers for the ALTE 7th International Conference, Madrid, Spain.
  • El Masri, Y.H., Ferrara, S., Foltz, P.W., & Baird, J.-A. (2017). Predicting item difficulty of science national curriculum tests: the case of key stage 2 assessments. The Curriculum Journal, 28(1), 59-82. https://doi.org/10.1080/09585176.2016.1232201
  • Embretson, S., & Wetzel, C. (1987). Component latent trait models for paragraph comprehension tests. Applied Psychological Measurement, 11(2), 175-193. https://doi.org/10.1177/014662168701100207
  • Enright, M.K., Allen, N., & Kim, M.I. (1993). A Complexity Analysis of Items from a Survey of Academic Achievement in the Life Sciences. ETS Research Report Series, 1993(1), i-32. https://doi.org/10.1002/j.2333-8504.1993.tb01529.x
  • Fergadiotis, G., Swiderski, A., & Hula, W. (2018). Predicting confrontation naming item difficulty. Aphasiology, 33(6), 689-709. https://doi.org/10.1080/02687038.2018.1495310
  • Ferrara, S., Steedle, J.T., & Frantz, R.S. (2022). Response Demands of Reading Comprehension Test Items: A Review of Item Difficulty Modeling Studies. Applied Measurement in Education, 35(3), 237-253. https://doi.org/10.1080/08957347.2022.2103135
  • Förster, N., & Kuhn, J.-T. (2021). Ice is hot and water is dry: Developing equivalent reading tests using rule-based item design. European Journal of Psychological Assessment. https://doi.org/10.1027/1015-5759/a000691
  • Fortus, R., Coriat, R., & Fund, S. (2013). Prediction of item difficulty in the English Subtest of Israel's Inter-university psychometric entrance test. In Validation in language assessment (pp. 61-87). Routledge.
  • Fraenkel, J.R., & Wallen, N.E. (2006). How to Design and Evaluate Research in Education. McGraw-Hill Education, USA.
  • Freedle, R., & Kostin, I. (1993). The prediction of TOEFL reading comprehension item difficulty for expository prose passages for three item types: Main idea, inference, and supporting idea items. ETS Research Report Series, 1993(1), i-48. https://doi.org/10.1002/j.2333-8504.1993.tb01524.x
  • Gao, L., & Rogers, W. (2010). Use of tree-based regression in the analyses of L2 reading test items. Language Testing, 28(1), 77-104. https://doi.org/10.1177/0265532210364380
  • Giguère, G., Brouillette-Alarie, S., & Bourassa, C. (2022). A look at the difficulty and predictive validity of LS/CMI items with Rasch modeling. Criminal Justice and Behavior, 50(1), 118-138. https://doi.org/10.1177/00938548221131956
  • González-Brenes, J., Huang, Y., & Brusilovsky, P. (2014). General features in knowledge tracing to model multiple subskills, temporal item response theory, and expert knowledge. The 7th International Conference on Educational Data Mining (pp. 84–91), London. https://pdfs.semanticscholar.org/0002/fab1c9f0904105312031cdc18dce358863a6.pdf
  • Gorin, J.S., & Embretson, S.E. (2006). Item difficulty modeling of paragraph comprehension items. Applied Psychological Measurement, 30(5), 394-411. https://doi.org/10.1177/0146621606288554
  • Haladyna, T.M., Downing, S.M., & Rodriguez, M.C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied measurement in education, 15(3), 309-333. https://doi.org/10.1207/S15324818AME1503_5
  • Hamamoto Filho, P.T., Silva, E., Ribeiro, Z.M.T., Hafner, M.d.L.M.B., Cecilio-Fernandes, D., & Bicudo, A.M. (2020). Relationships between Bloom’s taxonomy, judges’ estimation of item difficulty and psychometric properties of items from a progress test: a prospective observational study. Sao Paulo Medical Journal, 138, 33-39. https://doi.org/10.1590/1516-3180.2019.0459.R1.19112019
  • Hambleton, R.K., & Jirka, S.J. (2011). Anchor-based methods for judgmentally estimating item statistics. In Handbook of test development (pp. 413-434). Routledge.
  • Hambleton, R.K., Sireci, S.G., Swaminathan, H., Xing, D., & Rizavi, S. (2003). Anchor-Based Methods for Judgmentally Estimating Item Difficulty Parameters. LSAC Research Report Series, Newtown, PA.
  • Herzog, M., Sari, M., Olkun, S., & Fritz, A. (2021). Validation of a model of sustainable place value understanding in Turkey. International Electronic Journal of Mathematics Education, 16(3), em0659. https://doi.org/10.29333/iejme/11295
  • Hontangas, P., Ponsoda, V., Olea, J., & Wise, S.L. (2000). The choice of item difficulty in self-adapted testing. European Journal of Psychological Assessment, 16(1), 3. https://doi.org/10.1027/1015-5759.16.1.3
  • Hsu, F.-Y., Lee, H.-M., Chang, T.-H., & Sung, Y.-T. (2018). Automated estimation of item difficulty for multiple-choice tests: An application of word embedding techniques. Information Processing & Management, 54(6), 969-984. https://doi.org/10.1016/j.ipm.2018.06.007
  • Huang, Z., Liu, Q., Chen, E., Zhao, H., Gao, M., Wei, S., Su, Y., & Hu, G. (2017). Question Difficulty Prediction for READING Problems in Standard Tests. Proceedings of the AAAI Conference on Artificial Intelligence, 31(1). https://doi.org/10.1609/aaai.v31i1.10740
  • Impara, J.C., & Plake, B.S. (1998). Teachers' ability to estimate item difficulty: A test of the assumptions in the Angoff standard setting method. Journal of Educational Measurement, 35(1), 69-81. https://doi.org/10.1111/j.1745-3984.1998.tb00528.x
  • Kibble, J.D., & Johnson, T. (2011). Are faculty predictions or item taxonomies useful for estimating the outcome of multiple-choice examinations? Advances in physiology education, 35(4), 396-401. https://doi.org/10.1152/advan.00062.2011
  • Kolen, M.J., & Brennan, R.L. (2004). Test equating, scaling, and linking methods and practices. Springer New York, NY. https://doi.org/10.1007/978-1-4939-0317-7
  • Le Hebel, F., Tiberghien, A., Montpied, P., & Fontanieu, V. (2019). Teacher prediction of student difficulties while solving a science inquiry task: example of PISA science items. International Journal of Science Education, 41(11), 1517-1540. https://doi.org/10.1080/09500693.2019.1615150
  • Lin, C.-S., Lu, Y.-L., & Lien, C.-J. (2021). Association between Test Item's Length, Difficulty, and Students' Perceptions: Machine Learning in Schools' Term Examinations. Universal Journal of Educational Research, 9(6), 1323-1332. https://doi.org/10.13189/ujer.2021.090622
  • Liu, X., & Read, J. (2021). Investigating the Skills Involved in Reading Test Tasks through Expert Judgement and Verbal Protocol Analysis: Convergence and Divergence between the Two Methods. Language Assessment Quarterly, 18(4), 357-381. https://doi.org/10.1080/15434303.2021.1881964
  • Lumley, T., Routitsky, A., Mendelovits, J., & Ramalingam, D. (2012). A framework for predicting item difficulty in reading tests. Proceedings of the annual meeting of the American Educational Research Association (AERA), Vancouver, BC, Canada.
  • MacGregor, D., Kenyon, D., Christenson, J., & Louguit, M. (2008). Predicting item difficulty: A rubrics-based approach. American Association of Applied Linguistics. March, Washington, DC. https://doi.org/10.1109/FIE.2015.7344299
  • Masri, Y., Baird, J., & Graesser, A. (2016). Language effects in international testing: the case of PISA 2006 science items. Assessment in Education: Principles, Policy & Practice, 23(4), 427-455. https://doi.org/10.1080/0969594x.2016.1218323
  • Mislevy, R.J., Sheehan, K.M., & Wingersky, M. (1993). How to equate tests with little or no data. Journal of Educational Measurement, 30(1), 55-78. https://www.jstor.org/stable/1435164
  • Noroozi, S., & Karami, H. (2022). A scrutiny of the relationship between cognitive load and difficulty estimates of language test items. Language Testing in Asia, 12(1). https://doi.org/10.1186/s40468-022-00163-8
  • Oliveri, M., & Ercikan, K. (2011). Do different approaches to examining construct comparability in multilanguage assessments lead to similar conclusions? Applied Measurement in Education, 24(4), 349-366. https://doi.org/10.1080/08957347.2011.607063
  • Rupp, A.A., Garcia, P., & Jamieson, J. (2001). Combining multiple regression and CART to understand difficulty in second language reading and listening comprehension test items. International Journal of Testing, 1(3-4), 185-216. https://doi.org/10.1080/15305058.2001.9669470
  • Sano, M. (2015). Automated capturing of psycho-linguistic features in reading assessment text. Annual meeting of the National Council on Measurement in Education, Chicago, IL, USA.
  • Santi, K.L., Kulesz, P.A., Khalaf, S., & Francis, D.J. (2015). Developmental changes in reading do not alter the development of visual processing skills: an application of explanatory item response models in grades K-2. Frontiers in Psychology, 6, 116. https://doi.org/10.3389/fpsyg.2015.00116
  • Segall, D.O., Moreno, K.E., & Hetter, R.D. (1997). Item pool development and evaluation. In Computerized adaptive testing: From inquiry to operation. (pp. 117-130). American Psychological Association. https://doi.org/10.1037/10244-012
  • Septia, N.W., Indrawati, I., Juriana, J., & Rudini, R. (2022). An Analysis of Students’ Difficulties in Reading Comprehension. EEdJ: English Education Journal, 2(1), 11-22. https://doi.org/10.55047/romeo
  • Stenner, A.J. (2022). Measuring reading comprehension with the Lexile framework. In Explanatory Models, Unit Standards, and Personalized Learning in Educational Measurement: Selected Papers by A. Jackson Stenner (pp. 63-88). Springer. https://doi.org/10.1007/978-981-19-3747-7_6
  • Stiller, J., Hartmann, S., Mathesius, S., Straube, P., Tiemann, R., Nordmeier, V., Krüger, D., & Upmeier zu Belzen, A. (2016). Assessing scientific reasoning: A comprehensive evaluation of item features that affect item difficulty. Assessment & Evaluation in Higher Education, 41(5), 721-732. https://doi.org/10.1080/02602938.2016.1164830
  • Sung, P.-J., Lin, S.-W., & Hung, P.-H. (2015). Factors Affecting Item Difficulty in English Listening Comprehension Tests. Universal Journal of Educational Research, 3(7), 451-459. https://doi.org/10.13189/ujer.2015.030704
  • Swaminathan, H., Hambleton, R.K., Sireci, S.G., Xing, D., & Rizavi, S.M. (2003). Small sample estimation in dichotomous item response models: Effect of priors based on judgmental information on the accuracy of item parameter estimates. Applied Psychological Measurement, 27(1), 27-51. https://doi.org/10.1177/0146621602239475
  • Sydorenko, T. (2011). Item writer judgments of item difficulty versus actual item difficulty: A case study. Language Assessment Quarterly, 8(1), 34-52. https://doi.org/10.1080/15434303.2010.536924
  • Toyama, Y. (2021). What Makes Reading Difficult? An Investigation of the Contributions of Passage, Task, and Reader Characteristics on Comprehension Performance. Reading Research Quarterly, 56(4), 633-642. https://doi.org/10.1002/rrq.440
  • Trace, J., Brown, J.D., Janssen, G., & Kozhevnikova, L. (2017). Determining cloze item difficulty from item and passage characteristics across different learner backgrounds. Language Testing, 34(2), 151-174. https://doi.org/10.1177/0265532215623581
  • Urhahne, D., & Wijnia, L. (2021). A review on the accuracy of teacher judgments. Educational Research Review, 32, 100374. https://doi.org/10.1016/j.edurev.2020.100374
  • Valencia, S.W., Wixson, K.K., Ackerman, T., & Sanders, E. (2017). Identifying text-task-reader interactions related to item and block difficulty in the National Assessment of Educational Progress reading assessment. San Mateo, CA: National Center for Education Statistics.
  • Van der Linden, W.J., & Pashley, P.J. (2009). Item selection and ability estimation in adaptive testing. In Elements of adaptive testing (pp. 3-30). Springer, New York, NY. https://doi.org/10.1007/978-0-387-85461-8_1
  • Wauters, K., Desmet, P., & Van Den Noortgate, W. (2012). Item difficulty estimation: An auspicious collaboration between data and judgment. Computers & Education, 58(4), 1183-1193. https://doi.org/10.1016/j.compedu.2011.11.020
  • Ying-hui, H. (2006). An investigation into the task features affecting EFL listening comprehension test performance. The Asian EFL Journal Quarterly, 8(2), 33-54.

Acknowledgment

This research was presented as an oral presentation at the NCME 2023 Annual Meeting, April 12-15, 2023, Chicago, IL, USA.

Details

Primary Language English
Subjects Measurement Theories and Applications in Education and Psychology
Journal Section Articles
Authors

Ayfer Sayın 0000-0003-1357-5674

Okan Bulut 0000-0001-5853-1267

Early Pub Date May 22, 2024
Publication Date
Submission Date October 15, 2023
Acceptance Date May 2, 2024
Published in Issue Year 2024 Volume: 11 Issue: 2

Cite

APA Sayın, A., & Bulut, O. (2024). The difference between estimated and perceived item difficulty: An empirical study. International Journal of Assessment Tools in Education, 11(2), 368-387. https://doi.org/10.21449/ijate.1376160
