On the Statistical and Heuristic Difficulty Estimates of a High Stakes Test in Iran

Ali Darabi Bazvand; Sheila Kheirzade; Alireza Ahmadi

doi:10.21449/ijate.546709

Research Article

Year 2019, Volume 6, Issue 3, 330 - 343, 15.10.2019

Cited By: 2

Abstract

Previous research on how well stakeholders’ perceptions of item difficulty agree with statistical estimates has not produced consistent findings. Furthermore, most research shows that teachers’ estimates of item difficulty are unreliable, since teachers tend to overestimate the difficulty of easy items and underestimate the difficulty of difficult items. The present study therefore analyzes a high-stakes test in terms of heuristic difficulty (the test takers’ standpoint) and statistical difficulty (classical true score theory, CTT, and item response theory, IRT) and investigates the extent to which the findings from the two perspectives converge. Results indicate that 1) the whole test, along with its subtests, is difficult, which might undermine the test’s validity; 2) the respondents’ ratings of the difficulty of the total test largely converge with the difficulty values indicated by IRT and CTT, except for two subtests whose difficulty the students underestimated; and 3) CTT difficulty estimates converge with IRT difficulty estimates. It can therefore be concluded that students’ perceptions of item difficulty might be a better estimate of test difficulty, and that a combination of test takers’ perceptions and statistical difficulty might provide a better picture of item difficulty in assessment contexts.
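To make the comparison between the two statistical perspectives and the heuristic one concrete, the following is a minimal sketch and not the authors' procedure. It assumes a small, invented 0/1 response matrix and invented perceived-difficulty ratings on a 1 to 5 scale. The CTT difficulty is the proportion of correct answers per item; the "IRT-style" value is only a crude log-odds approximation of a Rasch difficulty, not a fitted IRT model. All function and variable names are hypothetical.

```python
import math

def ctt_difficulty(responses):
    """CTT item difficulty: proportion of correct answers (higher = easier)."""
    n_persons = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_persons for j in range(n_items)]

def logit_difficulty(p_values):
    """Crude Rasch-style difficulty: log-odds of an incorrect answer,
    centred so the item difficulties average to zero (higher = harder).
    This is an approximation, not a maximum-likelihood IRT estimate."""
    raw = [math.log((1 - p) / p) for p in p_values]
    mean = sum(raw) / len(raw)
    return [b - mean for b in raw]

def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented data: 6 test takers (rows) answering 4 items (columns), 1 = correct.
responses = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 0],
]
# Invented mean perceived difficulty per item (1 = very easy, 5 = very hard).
perceived = [1.5, 2.0, 3.5, 4.5]

p_values = ctt_difficulty(responses)
b_values = logit_difficulty(p_values)

# CTT p-values and logit difficulties run in opposite directions,
# so a strong negative correlation indicates convergence.
r_ctt_irt = pearson(p_values, b_values)
# Perceived and logit difficulty both increase with hardness.
r_perceived_irt = pearson(perceived, b_values)

print("CTT p-values:", [round(p, 3) for p in p_values])
print("Logit difficulties:", [round(b, 3) for b in b_values])
print("r(CTT, logit):", round(r_ctt_irt, 3))
print("r(perceived, logit):", round(r_perceived_irt, 3))
```

In a real analysis the logit step would be replaced by a properly estimated IRT model, and the perceived ratings would come from the test takers' questionnaire responses.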

Keywords

Classical true score theory, Heuristic difficulty, High stakes test, Item response theory, Statistical difficulty



Details

Primary Language	English
Subjects	Studies on Education
Journal Section	Articles
Authors	Ali Darabi Bazvand (ORCID: 0000-0002-2620-4648), Sheila Kheirzade (ORCID: 0000-0003-4665-0554), Alireza Ahmadi
Publication Date	October 15, 2019
Submission Date	March 29, 2019
Published in Issue	Year 2019

Cite

APA	Darabi Bazvand, A., Kheirzade, S., & Ahmadi, A. (2019). On the Statistical and Heuristic Difficulty Estimates of a High Stakes Test in Iran. International Journal of Assessment Tools in Education, 6(3), 330-343. https://doi.org/10.21449/ijate.546709

Cited By

Dimensionality, discrimination power and difficulty of English test items: the case of graduate exam for healthcare applicants

Journal of Medical Education Development

https://doi.org/10.61186/edcj.17.55.108

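Comparison of Difficulty Indices Predicted by Experts and Calculated Empirically for Multiple-Choice Items (original title: Çoktan Seçmeli Maddelerde Uzmanlarca Öngörülen ve Ampirik Olarak Hesaplanan Güçlük İndekslerinin Karşılaştırılması)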

Journal of Computer and Education Research

https://doi.org/10.18009/jcer.1000934
