Year 2016, Volume 3, Issue 2, Pages 137-150, 2016-07-01

Is This Reliable Enough? Examining Classification Consistency and Accuracy in a Criterion-Referenced Test

Susanne Alger

One important step in assessing the quality of a test is to examine the reliability of test score interpretation. Which aspect of reliability is most relevant depends on the type of test and on how the scores are to be used. For criterion-referenced tests, and in particular certification tests, where students are classified into performance categories, the primary focus need not be on the size of the measurement error but rather on the impact of this error on classification. This impact can be described in terms of classification consistency and classification accuracy. In this article, selected methods from classical test theory for estimating classification consistency and classification accuracy were applied to the theory part of the Swedish driving licence test, a high-stakes criterion-referenced test that is rarely studied in terms of reliability of classification. The results for this particular test indicated a level of classification consistency slightly below the recommended level, which is why lengthening the test should be considered. More evidence should also be gathered on whether the placement of the cut-off score is appropriate, since this has implications for the validity of the classifications.

Keywords: reliability, criterion-referenced test, driving licence test, classification consistency, decision consistency, single administration
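As a rough illustration of what classification consistency measures, the sketch below computes raw agreement and chance-corrected agreement (Cohen's kappa) for dichotomous pass/fail decisions from two hypothetical parallel administrations. Note that this is not the single-administration procedure applied in the article; the data and function names are invented for illustration only.

```python
# Illustrative sketch (not the article's exact method): estimate classification
# consistency from pass/fail decisions on two parallel test administrations.
def classification_consistency(decisions_a, decisions_b):
    """decisions_a, decisions_b: lists of 0 (fail) / 1 (pass) per examinee.

    Returns (p0, kappa): raw agreement and chance-corrected agreement.
    """
    n = len(decisions_a)
    # Raw agreement: proportion of examinees classified the same way twice.
    p0 = sum(a == b for a, b in zip(decisions_a, decisions_b)) / n
    # Chance agreement: probability of the same classification by chance,
    # given each form's marginal pass rate.
    pa = sum(decisions_a) / n
    pb = sum(decisions_b) / n
    pc = pa * pb + (1 - pa) * (1 - pb)
    kappa = (p0 - pc) / (1 - pc) if pc < 1 else 1.0
    return p0, kappa

# Hypothetical data: 10 examinees classified on two administrations.
form_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
form_b = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
p0, kappa = classification_consistency(form_a, form_b)
```

In practice, single-administration estimates (such as those studied by Subkoviak and by Huynh) replace the second administration with a model-based prediction, since administering two parallel forms is usually infeasible.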
Subjects: Education, Scientific Disciplines
Journal Section: Articles



Publication Date: July 1, 2016

APA: Alger, S. (2016). Is This Reliable Enough? Examining Classification Consistency and Accuracy in a Criterion-Referenced Test. International Journal of Assessment Tools in Education, 3(2), 137-150. DOI: 10.21449/ijate.245198