Is This Reliable Enough? Examining Classification Consistency and Accuracy in a Criterion-Referenced Test

Susanne Alger

doi:10.21449/ijate.245198

Research Article

Year 2016, Volume 3, Issue 2, 137-150, 01.07.2016

Abstract

One important step in assessing the quality of a test is to examine the reliability of test score interpretation. Which aspect of reliability is most relevant depends on the type of test and on how the scores are to be used. For criterion-referenced tests, and certification tests in particular, where students are classified into performance categories, the primary focus need not be on the size of the error but on the impact of this error on classification. This impact can be described in terms of classification consistency and classification accuracy. In this article, selected methods from classical test theory for estimating classification consistency and classification accuracy were applied to the theory part of the Swedish driving licence test, a high-stakes criterion-referenced test that is rarely studied in terms of reliability of classification. The results for this particular test indicated a level of classification consistency that falls slightly short of the recommended level, which is why lengthening the test should be considered. More evidence should also be gathered as to whether the placement of the cut-off score is appropriate, since this has implications for the validity of classifications.
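
To make the notion of classification consistency concrete, the following is a minimal sketch rather than the method used in the article. The article applies single-administration estimates from classical test theory, whereas this sketch assumes, purely for illustration, that scores from two parallel administrations are available. It computes the raw agreement index p0 (the proportion of test takers given the same pass/fail decision both times) and the chance-corrected coefficient kappa. The function name, the scores, and the cut-off score of 50 are all hypothetical.

```python
def classification_consistency(scores_a, scores_b, cut_score):
    """Return (p0, kappa) for pass/fail decisions on two parallel forms.

    Hypothetical helper for illustration only: it assumes two
    administrations, which the article's single-administration
    methods do not require.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equally long, non-empty score lists")

    n = len(scores_a)
    pass_a = [s >= cut_score for s in scores_a]
    pass_b = [s >= cut_score for s in scores_b]

    # p0: proportion of test takers given the same decision on both forms.
    p0 = sum(a == b for a, b in zip(pass_a, pass_b)) / n

    # pc: agreement expected by chance, from the marginal pass rates.
    rate_a = sum(pass_a) / n
    rate_b = sum(pass_b) / n
    pc = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)

    # kappa: agreement beyond chance, undefined when pc equals 1.
    kappa = (p0 - pc) / (1 - pc) if pc < 1 else float("nan")
    return p0, kappa


# Made-up scores for eight test takers and a made-up cut-off score.
form_a = [40, 55, 52, 60, 49, 58, 51, 63]
form_b = [42, 53, 50, 61, 53, 57, 54, 62]
p0, kappa = classification_consistency(form_a, form_b, cut_score=50)
print(f"p0 = {p0:.3f}, kappa = {kappa:.3f}")
```

In this made-up data set, seven of the eight test takers receive the same decision on both forms. Kappa is lower than p0 because a high pass rate alone already produces substantial agreement by chance.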

Keywords

reliability, criterion-referenced test, driving licence test, classification consistency, decision consistency, single administration


There are 39 citations in total.

Details

Primary Language	English
Subjects	Studies on Education
Journal Section	Articles
Authors	Susanne Alger
Publication Date	July 1, 2016
Submission Date	January 15, 2016
Published in Issue	Year 2016, Volume 3, Issue 2

Cite

APA	Alger, S. (2016). Is This Reliable Enough? Examining Classification Consistency and Accuracy in a Criterion-Referenced Test. International Journal of Assessment Tools in Education, 3(2), 137-150. https://doi.org/10.21449/ijate.245198
