Detection of aberrant testing behaviour in unproctored CAT via a verification test

Ebru Balta; Arzu Uçar

doi:10.21449/ijate.1598330

Research Article

Year 2025, Volume: 12 Issue: 3, 681 - 700, 04.09.2025

Abstract

Unproctored Computerized Adaptive Testing (CAT) is gaining traction due to its convenience, flexibility, and scalability, particularly in high-stakes assessments. However, the lack of a proctor can give rise to aberrant testing behavior, which can impair the validity of test scores. This paper explores the use of a verification test to detect aberrant testing behavior in unproctored CAT environments. The study uses multiple measures to detect aberrant response patterns in CAT by means of a paper-and-pencil (P&P) verification test, and it compares the sensitivity and specificity of the l_z person-fit statistic (PFS) under a no-stage method and a two-stage method, in which l_z is applied after the Kullback–Leibler divergence (KLD) measure, across different conditions. Three factors were manipulated: the aberrance percentage, the aberrance scenario, and the aberrant examinee's ability range. In all scenarios, the specificity of l_z in classifying examinees was higher than its sensitivity in both the no-stage and the two-stage analyses. However, the sensitivity of l_z was higher in the two-stage analysis than in the no-stage analysis.

Keywords

Aberrant testing behaviour, l_z person-fit statistic, Divergence measure, Unproctored CAT, Verification test
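
For readers unfamiliar with the quantities named in the abstract, the following is a minimal sketch of the standardized log-likelihood statistic l_z and of sensitivity and specificity. It is not the authors' implementation. The two-parameter logistic (2PL) response model, the function names, the example item parameters, and the -1.645 cutoff are all assumptions made for illustration; the abstract does not state which model or cutoff the study used.

```python
import math


def p_2pl(theta, a, b):
    """Probability of a correct response under an assumed 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def lz(responses, theta, items):
    """Standardized log-likelihood person-fit statistic l_z.

    responses: list of 0/1 item scores
    theta:     ability estimate for the examinee
    items:     list of (a, b) pairs, one per item (hypothetical parameters)
    Large negative values suggest a misfitting response pattern.
    """
    log_lik = expected = variance = 0.0
    for u, (a, b) in zip(responses, items):
        p = p_2pl(theta, a, b)
        q = 1.0 - p
        log_lik += u * math.log(p) + (1 - u) * math.log(q)
        expected += p * math.log(p) + q * math.log(q)
        variance += p * q * math.log(p / q) ** 2
    # Note: variance is zero if every item has p == 0.5; not handled here.
    return (log_lik - expected) / math.sqrt(variance)


def sensitivity_specificity(flagged, truly_aberrant):
    """Return (sensitivity, specificity) from parallel lists of booleans."""
    tp = sum(f and t for f, t in zip(flagged, truly_aberrant))
    fn = sum((not f) and t for f, t in zip(flagged, truly_aberrant))
    tn = sum((not f) and (not t) for f, t in zip(flagged, truly_aberrant))
    fp = sum(f and (not t) for f, t in zip(flagged, truly_aberrant))
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Hypothetical four-item test taken by an examinee with theta = 0.
items = [(1.0, -2.0), (1.0, -1.0), (1.0, 1.0), (1.0, 2.0)]
consistent = lz([1, 1, 0, 0], 0.0, items)  # easy items right, hard items wrong
aberrant = lz([0, 0, 1, 1], 0.0, items)    # easy items wrong, hard items right

CUTOFF = -1.645  # assumed one-sided 5% normal cutoff, for illustration only
flags = [consistent < CUTOFF, aberrant < CUTOFF]
```

In this sketch the reversed pattern yields a strongly negative l_z and is flagged, while the model-consistent pattern is not. In a two-stage design like the one the abstract describes, l_z would be computed only for examinees already flagged by the KLD measure.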

References

Aguado, D., Vidal, A., Olea, J., Ponsoda, V., Barrada, J.R., & Abad, F.J. (2018). Cheating on unproctored internet test applications: An analysis of a verification test in a real personnel selection context. The Spanish Journal of Psychology, 21, E62. https://doi.org/10.1017/sjp.2018.50
Armstrong, R.D., Kung, M.T., & Roussos, R.A. (2010). A method to determine targets for multi-stage adaptive tests using integer programming. European Journal of Operatinal Research, 3, 709-718. https://doi.org/10.1016/j.ejor.2009.12.009
Armstrong, R., & Shi, M. (2009). A parametric cumulative sum statistic for person fit. Applied Psychological Measurement, 33(5), 391-410. https://doi.org/10.1177/0146621609331961
Armstrong R.D., Stoumbos, Z.G., Kung, M.T., & Shi, M. (2007). On the performance of the 〖 l〗_z person fit statistic. Practical Assessment Research & Evaluation, 12(16). https://doi.org/10.7275/xz5d-7j62
Baker, F.B., & Kim, S.H. (2004). Item response theory: Parameter estimation techniques. Marcel Bekker Inc
Balta, E., & Dogan, C. D. (2024). Investigation of preknowledge cheating via joint hierarchical modeling patterns of response accuracy and response time. SAGE Open, 14(4), 1-15. https://doi.org/10.1177/21582440241297946
Balta, E., & Ucar, A. (2022). Bilgisayar ortamında bireye uyarlanmış test uygulamalarında ölçme kesinliğinin ve test uzunluğunun farklı koşullar altında incelenmesi [Investigation of measurement precision and test length in computerized adaptive testing under different conditions]. E International Journal of Educational Research, 13(1), 51 68. https://doi.org/10.19160/e-ijer.1023098
Barrada J.R., Abad F.J., & Veldkamp B.P. (2009). Comparison of methods for controlling maximum exposure rates in computerized adaptive testing. Psicothema, 21(2), 313-320.
Barrada, J.R., Mazuela, P., & Olea, J. (2006). Maximum information stratification method for controlling item exposure in computerized adaptive testing. Psicothema, 18(1), 156- 159.
Belov, D.I. (2011). Detection of answer copying based on the structure of a high-stakes test. Applied Psychological Measurement, 35(7), 495 517. https://doi.org/10.1177/0146621611420705
Belov, D.I. (2013). Detection of test collusion via Kullback–Leibler divergence. Journal of Educational Measurement, 50(2), 141–163. https://doi.org/10.1111/jedm.12008
Belov, D.I. (2014). Detecting item preknowledge in computerized adaptive testing using information theory and combinatorial optimization. Journal of Computerized Adaptive Testing, 2(3), 37–58. https://doi.org/10.7333/1410-0203037
Belov, D.I. (2016). Comparing the performance of eight item preknowledge detection statistics. Applied Psychological Measurement, 40(2), 83 97. https://doi.org/10.1177/0146621615603327
Belov, D.I., & Armstrong, R.D. (2010). Automatic detection of answer copying via Kullback–Leibler divergence and K-index. Applied Psychological Measurement, 34(6), 379–392. https://doi.org/10.1177/0146621610370453
Belov, D., Pashley, P., Lewis, C., & Armstrong, R. (2007). Detecting aberrant responses with Kullback–Leibler distance. In K. Shigemasu, A. Okada, T. Imaizumi, & T. Hoshino (Eds.), New trends in psychometrics (pp. 7–14). Universal Academy Press.
Bock, R.D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443 459. https://link.springer.com/article/10.1007/BF02293801
Bradlow, E.T., Weiss, R.E., & Cho, M. (1998). Bayesian identification of outliers in computerized adaptive testing. Journal of the American Statistical Association, 93(443), 910-919. https://doi.org/10.1080/01621459.1998.10473747
Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer.
Chapman, D.S., & Webster, J. (2003). The use of technologies in the recruiting, screening, and selection processes for job candidates. International Journal of Selection and Assessment, 11(2), 113–120. https://doi.org/10.1111/1468-2389.00234
Chang, H., & Zhang, J. (2002). Hypergeometric family and item overlap rates in computerized adaptive testing. Psychometrika, 67 (3), 387-398. https://doi.org/10.1007/BF02294991
Chang, H., & Zhang, J. (2003, December, 3-5). Assessing CAT security breaches by the item pooling index [Oral presentation]. The Annual Meeting of National Council on Measurement in Education, Chicago, IL, USA.
Chao, H.Y., Chen, J.H., & Chen, S.Y. (2011, July,19-22). Applying Kullback-Leibler divergence to detect examinees with item pre-knowledge in computerized adaptive testing [Oral presentation]. The 17th International Meeting of the Psychometric Society, Hong Kong.
Choe, E.M., Zhang, J., & Chang, H.H. (2018). Sequential detection of compromised items using response times in computerized adaptive testing. Psychometrika, 83(3), 650-673. https://doi.org/10.1007/s11336-017-9596-3
Cizek, G., & Wollack, J. (2017). Identification of item preknowledge by the methods of information theory and combinatorial optimization. In G. Cizek, & J. Wollack (Eds.), Handbook of quantitative methods for detecting cheating on tests (pp.217–233). R outledge.
Coyne, I., & International Test Commission. (2006). International Guidelines on Computer-Based and Internet-Delivered Testing. International Journal of Testing, 6(2), 143–171. https://doi.org/10.1207/s15327574ijt0602_4
Cui, Z. (2022). On measuring adaptivity of an adaptive test. Measurement: Interdisciplinary Research and Perspectives,20(1),21-33. https://doi.org/10.1080/15366367.2021.1922232
Davey, T., & Nering, N. (2002). Controlling item exposure and maintaining item security. In C.N. Mills, M.T. Potenza, J.J. Fremer., & W.C. Ward (Eds.), Computer-based testing: Building the foundation for future assessments (pp. 165-191). Lawrence Erlbaum Associates.
Deng, H., Ansley, T., & Chang, H. (2010). Stratified and maximum information item selection procedures in computer adaptive testing. Journal of Educational Measurement, 47(2), 202-226. https://www.jstor.org/stable/20778948
Dimitrov, D.M., & Smith, R.M. (2006). Adjusted rasch person-fit statistics. Journal of Applied Measurement, 7(2), 170-183.
Drasgow, F., Levine, M.V., & McLaughlin, M.E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11(1), 59–79. https://doi.org/10.1177/0146621687011001
Drasgow, F., Levine, M., & Williams, E. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38(1), 67-86. https://doi.org/10.1111/j.2044-8317.1985.tb00817.x
Egberink, I., Meijer, R., Veldkamp, B., Schakel, L., & Smid, N. (2010). Detection of aberrant item score patterns in a computerized adaptive test: An empirical example using the CUSUM. Personality and Individual Differences, 48(8), 921 925. https://doi.org/10.1016/j.paid.2010.02.023
Eggen, T. (2004). Contributions to the theory and practice of computerized adaptive testing (Publication No. 305136454) [Doctoral dissertation, University of Twente]. ProQuest Dissertations Publishing. https://www.proquest.com/docview/305136454/C97B190BA46B4519PQ/1?accountid=135193&sourcetype=Dissertations%20&%20Theses
Embretson, S.E., & Reise, S.P. (2000). Item response theory for psychologists. Lawrence Erlbaum Associates.
Erdem-Kara, B., & Dogan, N. (2022). The effect of ratio of items indicating differential item functioning on computer adaptive and multi-stage tests. International Journal of Assessment Tools in Education, 9(3), 682-696. https://doi.org/10.21449/ijate.1105769
Foster, D. (2013). Security issues in technology-based testing. In J.A. Wollack, & J.J. Fremer (Eds.), Handbook of test security (pp. 39–83). Routledge.
Fox, J.-P., & Marianti, S. (2017). Person-fit statistics for joint models for accuracy and speed. Journal of Educational Measurement, 54(2), 243–262. https://www.jstor.org/stable/45148424
Glas, C.A., & Linden, W. (2003). Computerized adaptive testing with item cloning. Applied Psychological Measurement, 27(4), 247 261. https://doi.org/10.1177/0146621603027004001
Goren, S., Kara, H., Erdem-Kara, B., & Kelecioglu, H. (2022). The effect of aberrant responses on ability estimation in computer adaptive tests. Journal of Measurement and Evaluation in Education and Psychology, 13(3), 256-268. https://doi.org/10.21031/epod.1067307
Guo, J., & Drasgow, F. (2010). Identifying cheating on unproctored internet tests: The Z-test and the likelihood ratio test. International Journal of Selection and Assessment, 18(4), 351-364. https://doi.org/10.1111/j.1468-2389.2010.00518.x
Guo, J., Tay, L., & Drasgow, F. (2009). Conspiracies and test compromise: An evaluation of the resistance of test systems to small-scale cheating. International Journal of Testing, 9(4), 283–309. https://doi.org/10.1080/15305050903351901
Haberman, S., & Lee, Y. (2017). A statistical procedure for testing unusually frequent exactly matching responses and nearly matching responses. Educational Testing Service, (Research Report No:RR 17 23). https://www.ets.org/research/policy_research_reports/publications/report/2017/jxrq.html
Haladyna M.T. (2011). Handbook of Test Development. Taylor and Francis Press.
Hambleton, R.K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Sage Publications.
Hambleton, R.K., & Xing, D. (2006). Optimal and nonoptimal computer-based test designs for making pass-fail decisions. Applied Measurement in Education, 19(3), 221–239. https://doi.org/10.1207/s15324818ame1903_4
Han, K.T. (2009, June,2-3). A gradual maximum information ratio approach to item selection in computerized adaptive testing [Oral presentation]. The 2009 Conference on Computerized Adaptive Testing, Minnesota, USA.
Ho, T. (2010). A comparison of item selection procedures using different ability estimation methods in computerized adaptive testing based on generalized partial credit model (Publication No.3428993) [Doctoral dissertation, The State University of Texas]. ProQuest Dissertations Publishing. https://www.proquest.com/docview/760034367/CBFB3EA847AA4FB3PQ/1?accountid=135193&sourcetype=Dissertations%20&%20Theses
Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person fit statistics. Applied Measurement in Education, 16(4), 277 298. https://doi.org/10.1207/S15324818AME1604_2
Kolen, M.J., & Brennan, R.L. (2008). Test equating, scaling, and linking: Methods and Practices. Springer.
Klauer, K.C. (1991). An exact and optimal standardized person test for assessing consistency with the Rasch model. Psychometrika, 56(2), 213 228. https://doi.org/10.1007/BF02294459
Klauer, K.C., & Rettig, K. (1990). An approximately standardized person test for assessing consistency with a latent trait model. British Journal of Mathematical and Statistical Psychology, 43(2), 193–206. https://doi.org/10.1111/j.2044-8317.1990.tb00935.x
Kingston, N., & Clark, A. (2014). Test fraud: Statistical detection and methodology. Routledge.
Kullback, S., & Leibler, R.A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22(1), 79–86. https://doi.org/10.1214/aoms/1177729694
Lee, S.Y. (2018). A mixture model approach to detect examinees with item Preknowledge (Publication No.10830593) [Doctoral dissertation, The University of Wisconsin-Madison]. University of Wisconsin Madison Library. https://asset.library.wisc.edu/1711.dl/FJW23RSLFRKJK8X/R/file-e9109.pdf
Lee, Y.H., & Chen, H. (2011). A review of recent response-time analyses in educationaltesting. Psychological Test and Assessment Modeling, 53(3), 359–379.
Lee, Y., & Haberman, S. (2016). Investigating test-taking behaviors using timing and processdata. International Journal of Testing, 16(3), 240 267. https://doi.org/10.1080/15305058.2015.1085385
Lee S.Y., & Wollack J. (2017, September, 6-8). A mixture model to detect item preknowledge using item responses and response times [Oral presentation]. The 2017 Conference on Test Security, Madison, USA.
Levine, M.V., & Drasgow, F. (1988). Optimal appropriateness measurement. Psychometrika, 53(2), 161–176. https://doi.org/10.1007/BF02294130
Levine, M.V., & Rubin, D.B. (1979). Measuring the appropriateness of multiple-choice test scores. Journal of Educational Statistics, 4(4), 269–290. https://doi.org/10.2307/1164595
Lord, F.M. (1980). Applications of Item Response Theory to Practical Testing Problems. Lawrence Erlbaum Associates.
Li, X., Huang, C., & Harris, D. (2014). Examining individual and cluster test irregularities in mixed-format testing [Oral presentation]. The 2014 Conference on Test Security. Iowa City, USA.
Li, M.F., & Olejnik, S. (1997). The power of Rasch person–fit statistics in detecting unusual response patterns. Applied Psychological Measurement, 21(3), 215 231. https://doi.org/10.1177/01466216970213002
Lievens, F., & Burke, E. (2011). Dealing with the threats inherent in unproctored Internet testing of cognitive ability: Results from a large-scale operational test program. Journal of Occupational and Organizational Psychology, 84(4), 817 824. https://doi.org/10.1348/096317910X522672
Liu, X. (2019, June, 10-13). Detecting aberrant behavior in CAT: The lognormal response time model [Oral presentation]. The Annual Meeting of the International Association for Computerized Adaptive Testing, Minnesota, USA.
Liu, C., Han, K.T., & Li, J. (2019). Compromised item detection for computerized adaptive testing. Frontiers in Psychology, 10 (829), 1-16. https://doi.org/10.3389/fpsyg.2019.00829
Magis, D., & Barrada, J.R. (2017). Computerized adaptive testing with R: Recent updates of the package catR. Journal of Statistical Software, 76(1), 1 19. https://doi.org/10.18637/jss.v076.c01
Magis, D., & Raîche, G. (2012). Random generation of response patterns under computerized adaptive testing with the R package catR. Journal of Statistical Software, 48(8), 1-31. https://www.jstatsoft.org/article/view/v048i08
Magis, D., Yan, D., & Von Davier, A.A. (2017). Computerized adaptive and multistage testing with R: Using packages catR and mstR. Springer.
Man, K., Harring, J.R., Ouyang, Y., & Thomas, S.L. (2018) Response time based nonparametric Kullback-Leibler Divergence Measure for detecting aberrant test-taking behavior. International Journal of Testing, 18(2), 155 177. https://doi.org/10.1080/15305058.2018.1429446
Marianti, S., Fox, J.-P., Avetisyan, M., Veldkamp, B.P., & Tijmstra, J. (2014). Testing for aberrant behavior in response time modeling. Journal of Educational and Behavioral Statistics, 39 (6), 426–451. https://doi.org/10.3102/1076998614559412
Maynes, D.D. (2005). M_4: A new answer-copying index. Caveon Test Security, Midvale, UT. https://www.caveon.com/
Maynes, D.D. (2014b). A method for measuring performance inconsistency by using score differences. In N. M. Kingston, & A.K., Clark, (Eds.), Test Fraud: Statistical detection and methodology, (pp 186-199). Routledge.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13 23. https://doi.org/10.3102/0013189X023002013
McLeod, L.D., & Lewis, C. (1999). Detecting item memorization in the CAT environment. Applied Psychological Measurement, 23(2), 147 160. https://doi.org/10.1177/01466219922031275
McLeod, L.D., Lewis, C., & Thissen, D. (2003). A Bayesian method for the detection of item preknowledge in computerized adaptive testing. Applied Psychological Measurement, 27(2), 121–137. https://doi.org/10.1177/0146621602250534
Meijer, R., & Sijtsma, K. (1995). Detection of aberrant item score patterns: A Review of recent developments. Applied Measurement in Education, 8(3), 261 272. https://doi.org/10.1207/s15324818ame0803_5
Meijer, R. (2002). Outlier detection in high-stakes certification testing. Journal of Educational Measurement, 39(3), 219-233. https://doi.org/10.1111/j.1745-3984.2002.tb01175.x
Meijer, R.R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25(2), 107-135. https://doi.org/10.1177/01466210122031957
Meijer, R.R., & Tendeiro, J.N. (2014). The use of person-fit scores in high stakes educational testing: How to use them and what they tell us. Law School Admission Council. (LSAC Research Report 14-03). https://www.lsac.org/data-research/research/use-person-fit-scores-high-stakes-educational-testing-how-use-them-and-what
Meyer, J.P. (2010). A mixture Rasch model with item response time components. Applied Psychological Measurement, 34(7), 521-538. https://doi.org/10.1177/0146621609355451
Molenaar, I.W., & Hoijtink, H. (1990). The many null distributions of person fit indices. Psychometrika, 55(1), 75–106. https://doi.org/10.1007/BF02294745
Murphy, K.P. (2012). Machine Learning: A Probabilistic Perspective. The MIT Press.
Naglieri, J.A., Drasgow, F., Schmit, M., Handler, L., Prifitera, A., Margolis, A., & Velasquez, R. (2004). Psychological testing on the internet: New problems, old issues. The American Psychologist, 59(3), 150–162. https://doi.org/10.1037/0003-066X.59.3.150
Nering, M.L. (1995). The distribution of person fit using true and estimated person parameters. Applied Psychological Measurement, 19(2), 121 129. https://doi.org/10.1177/014662169501900201
Nering, M.L. (1997). The distribution of indexes of person fit within the computerized adaptive testing environment. Applied Psychological Measurement, 21(2), 115 127. https://doi.org/10.1177/01466216970212002
Nering, M.L., & Meijer, R.R. (1998). A comparison of the person response function and the lz person fit statistic. Applied Psychological Measurement, 22(1), 53 69. https://doi.org/10.1177/01466216980221004
Nye, C.D., Do, B., Drasgow, F., & Fine, S. (2008). Two-step testing in employee selection: Is score inflation a problem? International Journal of Selection and Assessment, 16(2), 112–120. https://doi.org/10.1111/j.1468-2389.2008.00416.x
Pan, Y., Sinharay, S., Livne, O., & Wollack, J.A. (2022). A machine learning approach for detecting item compromise and preknowledge in computerized adaptive testing. Psychological Test and Assessment Modeling, 64(4), 385 424. https://doi.org/10.31234/osf.io/hk35a
Pardo, L. (2006). Statistical Inference Based on Divergence Measures. Chapman & Hall.
Parshall, C.G., Spray, J.A., Kalohn, J.C., & Davey, T. (2002). Practical considerations in computer-based testing (statistics for social and behavioral sciences). Springer.
Partchev, I. (2017). ‘irtoys: Simple interface to the estimation and plotting of IRT Models’ (R package version 0.2.1). https://cran.rproject.org/web/packages/irtoys/irtoys.pdf
Pearlman, K. (2009). Unproctored internet testing: Practical, legal, and ethical concerns. Industrial and Organizational Psychology: Perspectives on Science and Practice, 2(1), 14–19. https://doi.org/10.1111/j.1754-9434.2008.01099.x
Raton-Lopez, M., Rodriquez-Alvarez, X.M., Suarez- Cadarso, C., & Sampedro-Gude, F. (2014). OptimalCutpoints: Computing optimal cutpoints in diagnostic tests. (R package version 1.1.5). https://cran.rproject.org/web/packages/OptimalCutpoints/OptimalCutpoints.pdf
Reise, S.P. (1995). Scoring method and the detection of person misfit in a personality assessment context. Applied Psychological Measurement, 19(3), 213 229. https://doi.org/10.1177/014662169501900301
Reise, S.P., & Due, A.M. (1991). The influence of test characteristics on the detection of aberrant response patterns. Applied Psychological Measurement, 15(3), 217 226. https://doi.org/10.1177/014662169101500301
Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14(3), 271 282. https://doi.org/10.1177/014662169001400305
Rizavi, S.M. (2001). The effect of test characteristics on aberrant response patterns in computer adaptive testing (Publication No.3027247) [Doctoral dissertation, University of Massachusetts Amherst]. ProQuest Dissertations Publishing. https://www.proquest.com/docview/304699823/1BC249C4F0834BF2PQ/1?accountid=135193&sourcetype=Dissertations%20&%20Theses
Rizavi, S., & Swaminathan, H. (2001, April, 10-14). The effect of test and examinee characteristics on the occurrence of aberrant response patterns in a computerized adaptive test [Oral presentation]. The Annual Meeting of the American Educational Research Association, Seattle, USA.
Ryan, A.M., Inceoglu, I., Bartram, D., Golubovich, J., Grand, J., Reeder, M., Derous, E., Nikolaou, I., & Yao, X. (2015). Trends in testing: Highlights of a global survey. In I. Nikolaou., & J.K. Oostrom (Eds.). Employee Recruitment, Selection, and Assessment: Contemporary Issues for Theory and Practice, (pp. 136–153). Routledge.
Sanz, S., Luzardo, M., García, C., & Abad, F.J. (2020). Detecting cheating methods on unproctored internet tests. Psicothema, 32(4), 549 558. https://doi.org/10.7334/psicothema2020.86
Sarı, H.I. (2019). Investigating consequences of using item pre-knowledge in computerized multistage testing. Gazi University Journal of Gazi Educational Faculty, 39(2), 1113-1134. https://doi.org/10.17152/gefad.535376
Schnipke, D.L., & Scrams, D.J. (1997). Modeling item response times with a two-state mixture model: A new method of measuring speededness. Journal of Educational Measurement, 34(3), 213–232. https://doi.org/10.1111/j.1745-3984.1997.tb00516.x
Segall, D.O. (2001, April, 10-14). Detecting test compromise in high-stakes computerized adaptive testing: A verification testing approach [Oral presentation]. The Annual Meeting of the National Council on Measurement in Education, Seattle, USA.
Segall, D.O. (2004). Computerized adaptive testing. In Kempf-Leanard (Eds.), The encyclopedia of social measurement (pp. 429–438). Academic Press.
Shu, Z. (2010). Detecting test cheating using a deterministic, gated item response theory model, (Publication No. 3434164) [Doctoral dissertation, The University of North Carolina at Greensboro]. ProQuest Dissertations Publishing. https://www.proquest.com/docview/845237696/4015D86F0434446BPQ/1?accountid=135193&sourcetype=Dissertations%20&%20Theses
Shu, Z., Henson, R., & Luecht, R. (2013). Using deterministic, gated item response theory model to detect test cheating due to item compromise. Psychometrika, 78(3), 481-497. https://doi.org/10.1007/s11336-012-9311-3
Smith, R.M. (1985). A comparison of Rasch person analysis and robust estimators. Educational and Psychological Measurement, 45(3), 433 444. https://doi.org/10.1177/001316448504500
Sinharay, S. (2017a). Detection of item preknowledge using likelihood ratio test and score test. Journal of Educational and Behavioral Statistics, 42(1), 46 68. https://doi.org/10.3102/1076998616673
Sinharay, S. (2017b). Which statistic should be used to detect item preknowledge when the set of compromised items is known? Applied Psychological Measurement, 41(6), 403–421. https://doi.org/10.1177/0146621617698453
Sinharay, S. (2020). Detection of item preknowledge using response times. Applied Psychological Measurement, 44(5), 376–392. https://doi.org/10.1177/0146621620909893
Snijders, T.A.B. (2001). Asymptotic null distribution of person fit statistics with estimated person parameter. Psychometrika, 66(3), 331-342. https://doi.org/10.1007/BF02294437
Sotaridona, L.S., & Meijer, R.R. (2002). Statistical properties of the K-index for detecting answer copying in a multiple-choice test. Journal of Educational Measurement, 39(2), 115–132. https://www.jstor.org/stable/1435251
Statisticat, L.L.C. (2016). ‘LaplacesDemon: Complete environment for bayesian inference’ (R package version 16.0.1) https://cran.r project.org/web/packages/LaplacesDemon/vignettes/LaplacesDemonTutorial.pdf
Steinkamp, S. (2017). Identifying aberrant responding: Use of multiple measures [Doctoral dissertation, The University of Minnesota]. University Digital Conservancy. https://hdl.handle.net/11299/188885.
Stocking, M.L. (1992). Controlling item exposure rates in a realistic adaptive testing paradigm. Educational Testing Service. (Research Report No. 93 2). https://onlinelibrary.wiley.com/doi/pdf/10.1002/j.2333-8504.1993.tb01513.x
St-Onge, C., Valois, P., Abdous, B., & Germain, S. (2011). Accuracy of person-fit statistics: A Monte Carlo study of the influence of aberrance rates. Applied Psychological Measurement, 35(6), 419–432. https://doi.org/10.1177/0146621610391777
Sunbul, O., & Yormaz, S. (2018). Investigating the performance of omega index according to item parameters and ability levels. Eurasian Journal of Educational Research, 74, 207–226. https://doi.org/10.14689/ejer.2018.74.11
Tatsuoka, K. (1984). Caution indices based on item response theory. Psychometrika, 49(1), 95–110. https://doi.org/10.1007/BF02294208
Tendeiro, J.N., Meijer, R.R., & Niessen, A.S.M. (2016). PerFit: An R package for person-fit analysis in IRT. Journal of Statistical Software, 74(5), 1 27. https://doi.org/10.18637/jss.v074.i05
Thompson, N.A. (2007b, June,7). Computerized classification testing with composite hypotheses [Oral presentation]. The GMAC Conference on Computerized Adaptive Testing, Minneapolis, USA.
Thompson, N.A., & Weiss, D.A. (2011). A framework for the development of computerized adaptive tests. Practical Assessment, Research & Evaluation, 16(1),1 9. http://pareonline.net/getvn.asp?v=16&n=1
Thiessen, B. (2008). Relationship between test security policies and test score manipulations (Publication No.3347249) [Doctoral dissertation, University of Iowa]. ProQuest Dissertations Publishing. https://www.proquest.com/docview/304633912/793D439F6D09431APQ/1?accountid=135193&sourcetype=Dissertations%20&%20Theses
Tippins, N.T., Beaty, J., Drasgow, F., Gibson, W.M., Pearlman, K., Segall, D.O., & Shepherd, W. (2006). Unproctored internet testing in employment settings. Personnel Psychology, 59 (1), 189–225. https://doi.org/10.1111/j.1744-6570.2006.00909.x
Trabin, T.E., & Weiss, D.J. (1983). The person response curve: Fit of individuals to item response theory models. In D.J. Weiss (Eds.), New horizons in testing, (pp. 83–108). Academic Press.
Ucar, A. (2021). Kopya belirlemede benzerlik indeklerinin birey-uyum istatistikleri aracılığıylaaşamalı kullanımının I.tip hatalarının ve gücünün belirlenmesi [Investigation of type-I-error and power of similarity indices by using two-stage analysis via person-fit statistics] [Doctoral dissertation, Ankara University]. National Thesis Center. https://tez.yok.gov.tr/UlusalTezMerkezi/tarama.jsp
Ucar, A., & Dogan, C. D. (2021). Defining cut point for Kullback-Leibler divergence to detect answer copying. International Journal of Assessment Tools in Education, 8(1), 156–166. https://doi.org/10.21449/ijate.864078
van der Linden, W.J. (2006). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31(2), 181 204. https://doi.org/10.3102/10769986031002181
van der Linden, W.J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72 (3), 287–308. https://doi.org/10.1007/s11336-006-1478-z
van der Linden, W.J. (2008). Using response times for item selection in adaptive testing. Journal of Educational and Behavioral Statistics, 33(1), 5-20. https://doi.org/10.3102/1 076998607302626
van der Linden, W.J., & Guo, F. (2008). Bayesian procedures for identifying aberrant response-time patterns in adaptive testing. Psychometrika, 73(3), 365 384. https://doi.org/10.1007/s11336-007-9046-8
van der Linden, W.J., Klein Entink, R.H., & Fox, J.-P. (2010). IRT parameter estimation with response times as collateral information. Applied Psychological Measurement, 34(5), 327–347. https://doi.org/10.1177/0146621609349800
van der Linden, W.J., & Pashley, P.J. (2010). Item selection and ability estimation in adaptivetesting. In W.J. van der Linden, & C.A.W. Glas (Ed.), Elements of adaptive testing (pp. 429 – 438). Springer.
van der Linden, W.J., & Sotaridona, L. (2006). Detecting answer copying when the regular response process follows a known response model. Journal of Educational and Behavioral Statistics, 31(3), 283-304. https://www.jstor.org/stable/4122441
van der Linden, W.J., & van Krimpen-Stoop, E.M. (2003). Using response times to detect aberrant responses in computerized adaptive testing. Psychometrika, 68(2), 251-265. https://doi.org/10.1007/BF02294800
van Krimpen-Stoop, E.M.L.A., & Meijer, R.R. (2002). Detection of person misfit in computerized adaptive tests with polytomous items. Applied Psychological Measurement, 26(2), 164-180. https://doi.org/10.1177/01421602026002004
Veldkamp, B.P. (2012). Ensuring the future of computerized adaptive testing. In T.J.H.M. Eggen., & B.P. Veldkamp (Eds.). Psychometrics in practice at RCEC, (pp.39-50). RCEC.
Veldkamp, B.P., & van der Linden. W.J. (2010). Designing item pools for adaptive testing. In W.J. van der Linden., & C.A.W. Glas (Eds.). Computerized adaptive testing: Theory and practice, (pp.149-162). Springer.
von Davier, M., & Rost, J. (2007). Mixture distribution item response models. In C.R. Rao., & S. Sinharay (Eds.) Handbook of Statistics, (pp.643-661). Elsevier.
Wang, K. (2017). A fair comparison of the performance of computerized adaptive testing and multistage adaptive testing (Publication No. 10273809) [Doctoral dissertation, Michigan State University]. ProQuest Dissertations Publishing. https://www.proquest.com/docview/1901897901/
Wang, C., & Xu, G. (2015). A mixture hierarchical model for response times and response accuracy. British Journal of Mathematical and Statistical Psychology, 68(3), 456-477. https://doi.org/10.1111/bmsp.12054
Wang, C., Xu, G., Shang, Z., & Kuncel, N. (2018). Detecting aberrant behavior and item preknowledge: A comparison of mixture modeling method and residual method. Journal of Educational and Behavioral Statistics, 43(4), 469 501. https://doi.org/10.3102/1076998618767123
Wollack, J.A. (1997). A nominal response model approach for detecting answer copying. Applied Psychological Measurement, 21(4), 307 320. https://doi.org/10.1177/01466216970214002
Wollack, J.A. (2006). Simultaneous use of multiple answer copying indexes to improve detection rates. Applied Measurement in Education, 19(4), 265 288. https://doi.org/10.1207/s15324818ame1904_3
Wollack, J.A., & Maynes, D. (2011, April, 7-11). Detection of test collusion using item response data [Oral presentation]. The 2011 Annual Meeting of the National Council on Measurement in Education, New Orleans, USA.
Wright, B., & Masters, G. (1982). Rating Scale Analysis: Rasch Measurement. MESA Press.
Wright, N.A., Meade, A.W., & Gutierrez, S.L. (2014). Using invariance to examine cheating in unproctored ability tests. International Journal of Selection and Assessment, 22(1), 12–22. https://doi.org/10.1111/ijsa.12053
Wright, B., & Stone, M. (1979). Best test design: Rasch measurement. MESA Press.
Wise, S. (2023). Expanding the meaning of adaptive testing to enhance validity. Journal of Computerized Adaptive Testing, 10(2), 22-31. https://doi.org/10.7333/2305-1002022
Wunder, R.S., Thomas, L.L., & Luo, Z. (2010). Administering assessments and decision-making. In J.L. Farr., & N.T. Tippins (Eds.). Handbook of Employee Selection, (pp. 377–398). Routledge.
Yan, D. (2020). Multistage testing in practice. In H. Jiao., & R.W. Lissitz (Eds.). Application of Artificial Intelligence to Assessment, (pp.141-160). Information Age Publications.
Yormaz, S. (2019). Test güvenliği açısından bireyler arasındaki olası iş birliğinin incelenmesi [Investigation of possible collusion between examinees in terms of test securtiy] [Doctoral dissertation, Mersin University]. National Thesis Center. https://tez.yok.gov.tr/UlusalTezMerkezi/tarama.jsp
Yormaz, S., & Sunbul, O. (2017). Determination of type I error rates and power of answer copying ındices under various conditions. Educatıonal Sciences: Theory & Practıce, 17(1), 5-26. https://doi.org/10.12738/estp.2017.1.0105
Yi, Q., Zhang, J., & Chang, H.H. (2006). Severity of organized item theft in computerized adaptive testing: An empirical study. Educational Testing Service. (ETS Research Report, RR-06-22). http://dx.doi.org/10.1002/j.2333-8504.2006.tb02028.x
Yi, Q., Zhang, J., & Chang, H.H. (2008). Severity of organized item theft in computerized adaptive testing: A simulation study. Applied Psychological Measurement, 32(7), 543-558. https://doi.org/10.1177/0146621607311336
Zhan, P., Jiao, H., Wang, W.-C., & Man, K. (2018). A multidimensional hierarchical framework for modeling speed and ability in computer-based multidimensional tests. arXiv preprintarXiv:1807.04003. https://doi.org/10.48550/arXiv.1807.04003
Zhang, J. (2014). A sequential procedure for detecting compromised items in the item pool of a CAT system. Applied Psychological Measurement, 38(2), 87 104. https://doi.org/10.1177/0146621613510062
Zhang, J., & Li, J. (2016). Monitoring items in real time to enhance CAT security. Journal of Educational Measurement, 53(2), 131-151. https://doi.org/10.1111/jedm.12104
Zhang, Y., Searcy, C.A., & Horn, L. (2011, April, 9-11). Mapping clusters of aberrant patterns in item responses [Oral presentation]. The Annual Meeting of the National Council on Measurement in Education, New Orleans, USA.
Zhong, W. (2022). Using item response theory to detect potential aberrant behaviors in a multi-stage test: An example of the norwegian language test (Publication No. 304) [Master thesis, The University of Oslo]. CEMO Centre for Educational Measurement. https://www.duo.uio.no/handle/10852/55851/discover?rpp=100&sort_by=dc.date.issued_dt&order=DESC
Zopluoğlu, C. (2016). Classification performance of answer-copying indices under different types of IRT models. Applied Psychological Measurement, 40(8), 592–607. https://doi.org/10.1177/0146621616664724

Aguado, D., Vidal, A., Olea, J., Ponsoda, V., Barrada, J.R., & Abad, F.J. (2018). Cheating on unproctored internet test applications: An analysis of a verification test in a real personnel selection context. The Spanish Journal of Psychology, 21, E62. https://doi.org/10.1017/sjp.2018.50
Armstrong, R.D., Kung, M.T., & Roussos, R.A. (2010). A method to determine targets for multi-stage adaptive tests using integer programming. European Journal of Operatinal Research, 3, 709-718. https://doi.org/10.1016/j.ejor.2009.12.009
Armstrong, R., & Shi, M. (2009). A parametric cumulative sum statistic for person fit. Applied Psychological Measurement, 33(5), 391-410. https://doi.org/10.1177/0146621609331961
Armstrong R.D., Stoumbos, Z.G., Kung, M.T., & Shi, M. (2007). On the performance of the 〖 l〗_z person fit statistic. Practical Assessment Research & Evaluation, 12(16). https://doi.org/10.7275/xz5d-7j62
Baker, F.B., & Kim, S.H. (2004). Item response theory: Parameter estimation techniques. Marcel Bekker Inc
Balta, E., & Dogan, C. D. (2024). Investigation of preknowledge cheating via joint hierarchical modeling patterns of response accuracy and response time. SAGE Open, 14(4), 1-15. https://doi.org/10.1177/21582440241297946
Balta, E., & Ucar, A. (2022). Bilgisayar ortamında bireye uyarlanmış test uygulamalarında ölçme kesinliğinin ve test uzunluğunun farklı koşullar altında incelenmesi [Investigation of measurement precision and test length in computerized adaptive testing under different conditions]. E International Journal of Educational Research, 13(1), 51 68. https://doi.org/10.19160/e-ijer.1023098
Barrada J.R., Abad F.J., & Veldkamp B.P. (2009). Comparison of methods for controlling maximum exposure rates in computerized adaptive testing. Psicothema, 21(2), 313-320.
Barrada, J.R., Mazuela, P., & Olea, J. (2006). Maximum information stratification method for controlling item exposure in computerized adaptive testing. Psicothema, 18(1), 156- 159.
Belov, D.I. (2011). Detection of answer copying based on the structure of a high-stakes test. Applied Psychological Measurement, 35(7), 495 517. https://doi.org/10.1177/0146621611420705
Belov, D.I. (2013). Detection of test collusion via Kullback–Leibler divergence. Journal of Educational Measurement, 50(2), 141–163. https://doi.org/10.1111/jedm.12008
Belov, D.I. (2014). Detecting item preknowledge in computerized adaptive testing using information theory and combinatorial optimization. Journal of Computerized Adaptive Testing, 2(3), 37–58. https://doi.org/10.7333/1410-0203037
Belov, D.I. (2016). Comparing the performance of eight item preknowledge detection statistics. Applied Psychological Measurement, 40(2), 83 97. https://doi.org/10.1177/0146621615603327
Belov, D.I., & Armstrong, R.D. (2010). Automatic detection of answer copying via Kullback–Leibler divergence and K-index. Applied Psychological Measurement, 34(6), 379–392. https://doi.org/10.1177/0146621610370453
Belov, D., Pashley, P., Lewis, C., & Armstrong, R. (2007). Detecting aberrant responses with Kullback–Leibler distance. In K. Shigemasu, A. Okada, T. Imaizumi, & T. Hoshino (Eds.), New trends in psychometrics (pp. 7–14). Universal Academy Press.
Bock, R.D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443 459. https://link.springer.com/article/10.1007/BF02293801
Bradlow, E.T., Weiss, R.E., & Cho, M. (1998). Bayesian identification of outliers in computerized adaptive testing. Journal of the American Statistical Association, 93(443), 910-919. https://doi.org/10.1080/01621459.1998.10473747
Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer.
Chapman, D.S., & Webster, J. (2003). The use of technologies in the recruiting, screening, and selection processes for job candidates. International Journal of Selection and Assessment, 11(2), 113–120. https://doi.org/10.1111/1468-2389.00234
Chang, H., & Zhang, J. (2002). Hypergeometric family and item overlap rates in computerized adaptive testing. Psychometrika, 67 (3), 387-398. https://doi.org/10.1007/BF02294991
Chang, H., & Zhang, J. (2003, December, 3-5). Assessing CAT security breaches by the item pooling index [Oral presentation]. The Annual Meeting of National Council on Measurement in Education, Chicago, IL, USA.
Chao, H.Y., Chen, J.H., & Chen, S.Y. (2011, July,19-22). Applying Kullback-Leibler divergence to detect examinees with item pre-knowledge in computerized adaptive testing [Oral presentation]. The 17th International Meeting of the Psychometric Society, Hong Kong.
Choe, E.M., Zhang, J., & Chang, H.H. (2018). Sequential detection of compromised items using response times in computerized adaptive testing. Psychometrika, 83(3), 650-673. https://doi.org/10.1007/s11336-017-9596-3
Cizek, G., & Wollack, J. (2017). Identification of item preknowledge by the methods of information theory and combinatorial optimization. In G. Cizek, & J. Wollack (Eds.), Handbook of quantitative methods for detecting cheating on tests (pp.217–233). R outledge.
Coyne, I., & International Test Commission. (2006). International Guidelines on Computer-Based and Internet-Delivered Testing. International Journal of Testing, 6(2), 143–171. https://doi.org/10.1207/s15327574ijt0602_4
Cui, Z. (2022). On measuring adaptivity of an adaptive test. Measurement: Interdisciplinary Research and Perspectives,20(1),21-33. https://doi.org/10.1080/15366367.2021.1922232
Davey, T., & Nering, N. (2002). Controlling item exposure and maintaining item security. In C.N. Mills, M.T. Potenza, J.J. Fremer., & W.C. Ward (Eds.), Computer-based testing: Building the foundation for future assessments (pp. 165-191). Lawrence Erlbaum Associates.
Deng, H., Ansley, T., & Chang, H. (2010). Stratified and maximum information item selection procedures in computer adaptive testing. Journal of Educational Measurement, 47(2), 202-226. https://www.jstor.org/stable/20778948
Dimitrov, D.M., & Smith, R.M. (2006). Adjusted rasch person-fit statistics. Journal of Applied Measurement, 7(2), 170-183.
Drasgow, F., Levine, M.V., & McLaughlin, M.E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11(1), 59–79. https://doi.org/10.1177/0146621687011001
Drasgow, F., Levine, M., & Williams, E. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38(1), 67-86. https://doi.org/10.1111/j.2044-8317.1985.tb00817.x
Egberink, I., Meijer, R., Veldkamp, B., Schakel, L., & Smid, N. (2010). Detection of aberrant item score patterns in a computerized adaptive test: An empirical example using the CUSUM. Personality and Individual Differences, 48(8), 921 925. https://doi.org/10.1016/j.paid.2010.02.023
Eggen, T. (2004). Contributions to the theory and practice of computerized adaptive testing (Publication No. 305136454) [Doctoral dissertation, University of Twente]. ProQuest Dissertations Publishing. https://www.proquest.com/docview/305136454/C97B190BA46B4519PQ/1?accountid=135193&sourcetype=Dissertations%20&%20Theses
Embretson, S.E., & Reise, S.P. (2000). Item response theory for psychologists. Lawrence Erlbaum Associates.
Erdem-Kara, B., & Dogan, N. (2022). The effect of ratio of items indicating differential item functioning on computer adaptive and multi-stage tests. International Journal of Assessment Tools in Education, 9(3), 682-696. https://doi.org/10.21449/ijate.1105769
Foster, D. (2013). Security issues in technology-based testing. In J.A. Wollack, & J.J. Fremer (Eds.), Handbook of test security (pp. 39–83). Routledge.
Fox, J.-P., & Marianti, S. (2017). Person-fit statistics for joint models for accuracy and speed. Journal of Educational Measurement, 54(2), 243–262. https://www.jstor.org/stable/45148424
Glas, C.A., & Linden, W. (2003). Computerized adaptive testing with item cloning. Applied Psychological Measurement, 27(4), 247 261. https://doi.org/10.1177/0146621603027004001
Goren, S., Kara, H., Erdem-Kara, B., & Kelecioglu, H. (2022). The effect of aberrant responses on ability estimation in computer adaptive tests. Journal of Measurement and Evaluation in Education and Psychology, 13(3), 256-268. https://doi.org/10.21031/epod.1067307
Guo, J., & Drasgow, F. (2010). Identifying cheating on unproctored internet tests: The Z-test and the likelihood ratio test. International Journal of Selection and Assessment, 18(4), 351-364. https://doi.org/10.1111/j.1468-2389.2010.00518.x
Guo, J., Tay, L., & Drasgow, F. (2009). Conspiracies and test compromise: An evaluation of the resistance of test systems to small-scale cheating. International Journal of Testing, 9(4), 283–309. https://doi.org/10.1080/15305050903351901
Haberman, S., & Lee, Y. (2017). A statistical procedure for testing unusually frequent exactly matching responses and nearly matching responses. Educational Testing Service, (Research Report No:RR 17 23). https://www.ets.org/research/policy_research_reports/publications/report/2017/jxrq.html
Haladyna M.T. (2011). Handbook of Test Development. Taylor and Francis Press.
Hambleton, R.K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Sage Publications.
Hambleton, R.K., & Xing, D. (2006). Optimal and nonoptimal computer-based test designs for making pass-fail decisions. Applied Measurement in Education, 19(3), 221–239. https://doi.org/10.1207/s15324818ame1903_4
Han, K.T. (2009, June,2-3). A gradual maximum information ratio approach to item selection in computerized adaptive testing [Oral presentation]. The 2009 Conference on Computerized Adaptive Testing, Minnesota, USA.
Ho, T. (2010). A comparison of item selection procedures using different ability estimation methods in computerized adaptive testing based on generalized partial credit model (Publication No.3428993) [Doctoral dissertation, The State University of Texas]. ProQuest Dissertations Publishing. https://www.proquest.com/docview/760034367/CBFB3EA847AA4FB3PQ/1?accountid=135193&sourcetype=Dissertations%20&%20Theses
Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person fit statistics. Applied Measurement in Education, 16(4), 277 298. https://doi.org/10.1207/S15324818AME1604_2
Kolen, M.J., & Brennan, R.L. (2008). Test equating, scaling, and linking: Methods and Practices. Springer.
Klauer, K.C. (1991). An exact and optimal standardized person test for assessing consistency with the Rasch model. Psychometrika, 56(2), 213 228. https://doi.org/10.1007/BF02294459
Klauer, K.C., & Rettig, K. (1990). An approximately standardized person test for assessing consistency with a latent trait model. British Journal of Mathematical and Statistical Psychology, 43(2), 193–206. https://doi.org/10.1111/j.2044-8317.1990.tb00935.x
Kingston, N., & Clark, A. (2014). Test fraud: Statistical detection and methodology. Routledge.
Kullback, S., & Leibler, R.A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22(1), 79–86. https://doi.org/10.1214/aoms/1177729694
Lee, S.Y. (2018). A mixture model approach to detect examinees with item Preknowledge (Publication No.10830593) [Doctoral dissertation, The University of Wisconsin-Madison]. University of Wisconsin Madison Library. https://asset.library.wisc.edu/1711.dl/FJW23RSLFRKJK8X/R/file-e9109.pdf
Lee, Y.H., & Chen, H. (2011). A review of recent response-time analyses in educationaltesting. Psychological Test and Assessment Modeling, 53(3), 359–379.
Lee, Y., & Haberman, S. (2016). Investigating test-taking behaviors using timing and processdata. International Journal of Testing, 16(3), 240 267. https://doi.org/10.1080/15305058.2015.1085385
Lee S.Y., & Wollack J. (2017, September, 6-8). A mixture model to detect item preknowledge using item responses and response times [Oral presentation]. The 2017 Conference on Test Security, Madison, USA.
Levine, M.V., & Drasgow, F. (1988). Optimal appropriateness measurement. Psychometrika, 53(2), 161–176. https://doi.org/10.1007/BF02294130
Levine, M.V., & Rubin, D.B. (1979). Measuring the appropriateness of multiple-choice test scores. Journal of Educational Statistics, 4(4), 269–290. https://doi.org/10.2307/1164595
Lord, F.M. (1980). Applications of Item Response Theory to Practical Testing Problems. Lawrence Erlbaum Associates.
Li, X., Huang, C., & Harris, D. (2014). Examining individual and cluster test irregularities in mixed-format testing [Oral presentation]. The 2014 Conference on Test Security. Iowa City, USA.
Li, M.F., & Olejnik, S. (1997). The power of Rasch person–fit statistics in detecting unusual response patterns. Applied Psychological Measurement, 21(3), 215 231. https://doi.org/10.1177/01466216970213002
Lievens, F., & Burke, E. (2011). Dealing with the threats inherent in unproctored Internet testing of cognitive ability: Results from a large-scale operational test program. Journal of Occupational and Organizational Psychology, 84(4), 817 824. https://doi.org/10.1348/096317910X522672
Liu, X. (2019, June, 10-13). Detecting aberrant behavior in CAT: The lognormal response time model [Oral presentation]. The Annual Meeting of the International Association for Computerized Adaptive Testing, Minnesota, USA.
Liu, C., Han, K.T., & Li, J. (2019). Compromised item detection for computerized adaptive testing. Frontiers in Psychology, 10 (829), 1-16. https://doi.org/10.3389/fpsyg.2019.00829
Magis, D., & Barrada, J.R. (2017). Computerized adaptive testing with R: Recent updates of the package catR. Journal of Statistical Software, 76(1), 1 19. https://doi.org/10.18637/jss.v076.c01
Magis, D., & Raîche, G. (2012). Random generation of response patterns under computerized adaptive testing with the R package catR. Journal of Statistical Software, 48(8), 1-31. https://www.jstatsoft.org/article/view/v048i08
Magis, D., Yan, D., & Von Davier, A.A. (2017). Computerized adaptive and multistage testing with R: Using packages catR and mstR. Springer.
Man, K., Harring, J.R., Ouyang, Y., & Thomas, S.L. (2018) Response time based nonparametric Kullback-Leibler Divergence Measure for detecting aberrant test-taking behavior. International Journal of Testing, 18(2), 155 177. https://doi.org/10.1080/15305058.2018.1429446
Marianti, S., Fox, J.-P., Avetisyan, M., Veldkamp, B.P., & Tijmstra, J. (2014). Testing for aberrant behavior in response time modeling. Journal of Educational and Behavioral Statistics, 39 (6), 426–451. https://doi.org/10.3102/1076998614559412
Maynes, D.D. (2005). M_4: A new answer-copying index. Caveon Test Security, Midvale, UT. https://www.caveon.com/
Maynes, D.D. (2014b). A method for measuring performance inconsistency by using score differences. In N. M. Kingston, & A.K., Clark, (Eds.), Test Fraud: Statistical detection and methodology, (pp 186-199). Routledge.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13 23. https://doi.org/10.3102/0013189X023002013
McLeod, L.D., & Lewis, C. (1999). Detecting item memorization in the CAT environment. Applied Psychological Measurement, 23(2), 147 160. https://doi.org/10.1177/01466219922031275
McLeod, L.D., Lewis, C., & Thissen, D. (2003). A Bayesian method for the detection of item preknowledge in computerized adaptive testing. Applied Psychological Measurement, 27(2), 121–137. https://doi.org/10.1177/0146621602250534
Meijer, R., & Sijtsma, K. (1995). Detection of aberrant item score patterns: A Review of recent developments. Applied Measurement in Education, 8(3), 261 272. https://doi.org/10.1207/s15324818ame0803_5
Meijer, R. (2002). Outlier detection in high-stakes certification testing. Journal of Educational Measurement, 39(3), 219-233. https://doi.org/10.1111/j.1745-3984.2002.tb01175.x
Meijer, R.R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25(2), 107-135. https://doi.org/10.1177/01466210122031957
Meijer, R.R., & Tendeiro, J.N. (2014). The use of person-fit scores in high stakes educational testing: How to use them and what they tell us. Law School Admission Council. (LSAC Research Report 14-03). https://www.lsac.org/data-research/research/use-person-fit-scores-high-stakes-educational-testing-how-use-them-and-what
Meyer, J.P. (2010). A mixture Rasch model with item response time components. Applied Psychological Measurement, 34(7), 521-538. https://doi.org/10.1177/0146621609355451
Molenaar, I.W., & Hoijtink, H. (1990). The many null distributions of person fit indices. Psychometrika, 55(1), 75–106. https://doi.org/10.1007/BF02294745
Murphy, K.P. (2012). Machine Learning: A Probabilistic Perspective. The MIT Press.
Naglieri, J.A., Drasgow, F., Schmit, M., Handler, L., Prifitera, A., Margolis, A., & Velasquez, R. (2004). Psychological testing on the internet: New problems, old issues. The American Psychologist, 59(3), 150–162. https://doi.org/10.1037/0003-066X.59.3.150
Nering, M.L. (1995). The distribution of person fit using true and estimated person parameters. Applied Psychological Measurement, 19(2), 121 129. https://doi.org/10.1177/014662169501900201
Nering, M.L. (1997). The distribution of indexes of person fit within the computerized adaptive testing environment. Applied Psychological Measurement, 21(2), 115 127. https://doi.org/10.1177/01466216970212002
Nering, M.L., & Meijer, R.R. (1998). A comparison of the person response function and the lz person fit statistic. Applied Psychological Measurement, 22(1), 53 69. https://doi.org/10.1177/01466216980221004
Nye, C.D., Do, B., Drasgow, F., & Fine, S. (2008). Two-step testing in employee selection: Is score inflation a problem? International Journal of Selection and Assessment, 16(2), 112–120. https://doi.org/10.1111/j.1468-2389.2008.00416.x
Pan, Y., Sinharay, S., Livne, O., & Wollack, J.A. (2022). A machine learning approach for detecting item compromise and preknowledge in computerized adaptive testing. Psychological Test and Assessment Modeling, 64(4), 385 424. https://doi.org/10.31234/osf.io/hk35a
Pardo, L. (2006). Statistical Inference Based on Divergence Measures. Chapman & Hall.
Parshall, C.G., Spray, J.A., Kalohn, J.C., & Davey, T. (2002). Practical considerations in computer-based testing (statistics for social and behavioral sciences). Springer.
Partchev, I. (2017). ‘irtoys: Simple interface to the estimation and plotting of IRT Models’ (R package version 0.2.1). https://cran.rproject.org/web/packages/irtoys/irtoys.pdf
Pearlman, K. (2009). Unproctored internet testing: Practical, legal, and ethical concerns. Industrial and Organizational Psychology: Perspectives on Science and Practice, 2(1), 14–19. https://doi.org/10.1111/j.1754-9434.2008.01099.x
Raton-Lopez, M., Rodriquez-Alvarez, X.M., Suarez- Cadarso, C., & Sampedro-Gude, F. (2014). OptimalCutpoints: Computing optimal cutpoints in diagnostic tests. (R package version 1.1.5). https://cran.rproject.org/web/packages/OptimalCutpoints/OptimalCutpoints.pdf
Reise, S.P. (1995). Scoring method and the detection of person misfit in a personality assessment context. Applied Psychological Measurement, 19(3), 213 229. https://doi.org/10.1177/014662169501900301
Reise, S.P., & Due, A.M. (1991). The influence of test characteristics on the detection of aberrant response patterns. Applied Psychological Measurement, 15(3), 217 226. https://doi.org/10.1177/014662169101500301
Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14(3), 271 282. https://doi.org/10.1177/014662169001400305
Rizavi, S.M. (2001). The effect of test characteristics on aberrant response patterns in computer adaptive testing (Publication No.3027247) [Doctoral dissertation, University of Massachusetts Amherst]. ProQuest Dissertations Publishing. https://www.proquest.com/docview/304699823/1BC249C4F0834BF2PQ/1?accountid=135193&sourcetype=Dissertations%20&%20Theses
Rizavi, S., & Swaminathan, H. (2001, April, 10-14). The effect of test and examinee characteristics on the occurrence of aberrant response patterns in a computerized adaptive test [Oral presentation]. The Annual Meeting of the American Educational Research Association, Seattle, USA.
Ryan, A.M., Inceoglu, I., Bartram, D., Golubovich, J., Grand, J., Reeder, M., Derous, E., Nikolaou, I., & Yao, X. (2015). Trends in testing: Highlights of a global survey. In I. Nikolaou., & J.K. Oostrom (Eds.). Employee Recruitment, Selection, and Assessment: Contemporary Issues for Theory and Practice, (pp. 136–153). Routledge.
Sanz, S., Luzardo, M., García, C., & Abad, F.J. (2020). Detecting cheating methods on unproctored internet tests. Psicothema, 32(4), 549 558. https://doi.org/10.7334/psicothema2020.86
Sarı, H.I. (2019). Investigating consequences of using item pre-knowledge in computerized multistage testing. Gazi University Journal of Gazi Educational Faculty, 39(2), 1113-1134. https://doi.org/10.17152/gefad.535376
Schnipke, D.L., & Scrams, D.J. (1997). Modeling item response times with a two-state mixture model: A new method of measuring speededness. Journal of Educational Measurement, 34(3), 213–232. https://doi.org/10.1111/j.1745-3984.1997.tb00516.x
Segall, D.O. (2001, April, 10-14). Detecting test compromise in high-stakes computerized adaptive testing: A verification testing approach [Oral presentation]. The Annual Meeting of the National Council on Measurement in Education, Seattle, USA.
Segall, D.O. (2004). Computerized adaptive testing. In Kempf-Leanard (Eds.), The encyclopedia of social measurement (pp. 429–438). Academic Press.
Shu, Z. (2010). Detecting test cheating using a deterministic, gated item response theory model, (Publication No. 3434164) [Doctoral dissertation, The University of North Carolina at Greensboro]. ProQuest Dissertations Publishing. https://www.proquest.com/docview/845237696/4015D86F0434446BPQ/1?accountid=135193&sourcetype=Dissertations%20&%20Theses
Shu, Z., Henson, R., & Luecht, R. (2013). Using deterministic, gated item response theory model to detect test cheating due to item compromise. Psychometrika, 78(3), 481-497. https://doi.org/10.1007/s11336-012-9311-3
Smith, R.M. (1985). A comparison of Rasch person analysis and robust estimators. Educational and Psychological Measurement, 45(3), 433 444. https://doi.org/10.1177/001316448504500
Sinharay, S. (2017a). Detection of item preknowledge using likelihood ratio test and score test. Journal of Educational and Behavioral Statistics, 42(1), 46 68. https://doi.org/10.3102/1076998616673
Sinharay, S. (2017b). Which statistic should be used to detect item preknowledge when the set of compromised items is known? Applied Psychological Measurement, 41(6), 403–421. https://doi.org/10.1177/0146621617698453
Sinharay, S. (2020). Detection of item preknowledge using response times. Applied Psychological Measurement, 44(5), 376–392. https://doi.org/10.1177/0146621620909893
Snijders, T.A.B. (2001). Asymptotic null distribution of person fit statistics with estimated person parameter. Psychometrika, 66(3), 331-342. https://doi.org/10.1007/BF02294437
Sotaridona, L.S., & Meijer, R.R. (2002). Statistical properties of the K-index for detecting answer copying in a multiple-choice test. Journal of Educational Measurement, 39(2), 115–132. https://www.jstor.org/stable/1435251
Statisticat, L.L.C. (2016). ‘LaplacesDemon: Complete environment for bayesian inference’ (R package version 16.0.1) https://cran.r project.org/web/packages/LaplacesDemon/vignettes/LaplacesDemonTutorial.pdf
Steinkamp, S. (2017). Identifying aberrant responding: Use of multiple measures [Doctoral dissertation, The University of Minnesota]. University Digital Conservancy. https://hdl.handle.net/11299/188885.
Stocking, M.L. (1992). Controlling item exposure rates in a realistic adaptive testing paradigm. Educational Testing Service. (Research Report No. 93 2). https://onlinelibrary.wiley.com/doi/pdf/10.1002/j.2333-8504.1993.tb01513.x
St-Onge, C., Valois, P., Abdous, B., & Germain, S. (2011). Accuracy of person-fit statistics: A Monte Carlo study of the influence of aberrance rates. Applied Psychological Measurement, 35(6), 419–432. https://doi.org/10.1177/0146621610391777
Sunbul, O., & Yormaz, S. (2018). Investigating the performance of omega index according to item parameters and ability levels. Eurasian Journal of Educational Research, 74, 207–226. https://doi.org/10.14689/ejer.2018.74.11
Tatsuoka, K. (1984). Caution indices based on item response theory. Psychometrika, 49(1), 95–110. https://doi.org/10.1007/BF02294208
Tendeiro, J.N., Meijer, R.R., & Niessen, A.S.M. (2016). PerFit: An R package for person-fit analysis in IRT. Journal of Statistical Software, 74(5), 1 27. https://doi.org/10.18637/jss.v074.i05
Thompson, N.A. (2007b, June,7). Computerized classification testing with composite hypotheses [Oral presentation]. The GMAC Conference on Computerized Adaptive Testing, Minneapolis, USA.
Thompson, N.A., & Weiss, D.A. (2011). A framework for the development of computerized adaptive tests. Practical Assessment, Research & Evaluation, 16(1),1 9. http://pareonline.net/getvn.asp?v=16&n=1
Thiessen, B. (2008). Relationship between test security policies and test score manipulations (Publication No.3347249) [Doctoral dissertation, University of Iowa]. ProQuest Dissertations Publishing. https://www.proquest.com/docview/304633912/793D439F6D09431APQ/1?accountid=135193&sourcetype=Dissertations%20&%20Theses
Tippins, N.T., Beaty, J., Drasgow, F., Gibson, W.M., Pearlman, K., Segall, D.O., & Shepherd, W. (2006). Unproctored internet testing in employment settings. Personnel Psychology, 59 (1), 189–225. https://doi.org/10.1111/j.1744-6570.2006.00909.x
Trabin, T.E., & Weiss, D.J. (1983). The person response curve: Fit of individuals to item response theory models. In D.J. Weiss (Eds.), New horizons in testing, (pp. 83–108). Academic Press.
Ucar, A. (2021). Kopya belirlemede benzerlik indeklerinin birey-uyum istatistikleri aracılığıylaaşamalı kullanımının I.tip hatalarının ve gücünün belirlenmesi [Investigation of type-I-error and power of similarity indices by using two-stage analysis via person-fit statistics] [Doctoral dissertation, Ankara University]. National Thesis Center. https://tez.yok.gov.tr/UlusalTezMerkezi/tarama.jsp
Ucar, A., & Dogan, C. D. (2021). Defining cut point for Kullback-Leibler divergence to detect answer copying. International Journal of Assessment Tools in Education, 8(1), 156–166. https://doi.org/10.21449/ijate.864078
van der Linden, W.J. (2006). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31(2), 181 204. https://doi.org/10.3102/10769986031002181
van der Linden, W.J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72 (3), 287–308. https://doi.org/10.1007/s11336-006-1478-z
van der Linden, W.J. (2008). Using response times for item selection in adaptive testing. Journal of Educational and Behavioral Statistics, 33(1), 5-20. https://doi.org/10.3102/1 076998607302626
van der Linden, W.J., & Guo, F. (2008). Bayesian procedures for identifying aberrant response-time patterns in adaptive testing. Psychometrika, 73(3), 365 384. https://doi.org/10.1007/s11336-007-9046-8
van der Linden, W.J., Klein Entink, R.H., & Fox, J.-P. (2010). IRT parameter estimation with response times as collateral information. Applied Psychological Measurement, 34(5), 327–347. https://doi.org/10.1177/0146621609349800
van der Linden, W.J., & Pashley, P.J. (2010). Item selection and ability estimation in adaptivetesting. In W.J. van der Linden, & C.A.W. Glas (Ed.), Elements of adaptive testing (pp. 429 – 438). Springer.
van der Linden, W.J., & Sotaridona, L. (2006). Detecting answer copying when the regular response process follows a known response model. Journal of Educational and Behavioral Statistics, 31(3), 283-304. https://www.jstor.org/stable/4122441
van der Linden, W.J., & van Krimpen-Stoop, E.M. (2003). Using response times to detect aberrant responses in computerized adaptive testing. Psychometrika, 68(2), 251-265. https://doi.org/10.1007/BF02294800
van Krimpen-Stoop, E.M.L.A., & Meijer, R.R. (2002). Detection of person misfit in computerized adaptive tests with polytomous items. Applied Psychological Measurement, 26(2), 164-180. https://doi.org/10.1177/01421602026002004
Veldkamp, B.P. (2012). Ensuring the future of computerized adaptive testing. In T.J.H.M. Eggen, & B.P. Veldkamp (Eds.), Psychometrics in practice at RCEC (pp. 39-50). RCEC.
Veldkamp, B.P., & van der Linden, W.J. (2010). Designing item pools for adaptive testing. In W.J. van der Linden, & C.A.W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 149-162). Springer.
von Davier, M., & Rost, J. (2007). Mixture distribution item response models. In C.R. Rao, & S. Sinharay (Eds.), Handbook of Statistics (pp. 643-661). Elsevier.
Wang, K. (2017). A fair comparison of the performance of computerized adaptive testing and multistage adaptive testing (Publication No. 10273809) [Doctoral dissertation, Michigan State University]. ProQuest Dissertations Publishing. https://www.proquest.com/docview/1901897901/
Wang, C., & Xu, G. (2015). A mixture hierarchical model for response times and response accuracy. British Journal of Mathematical and Statistical Psychology, 68(3), 456-477. https://doi.org/10.1111/bmsp.12054
Wang, C., Xu, G., Shang, Z., & Kuncel, N. (2018). Detecting aberrant behavior and item preknowledge: A comparison of mixture modeling method and residual method. Journal of Educational and Behavioral Statistics, 43(4), 469–501. https://doi.org/10.3102/1076998618767123
Wollack, J.A. (1997). A nominal response model approach for detecting answer copying. Applied Psychological Measurement, 21(4), 307–320. https://doi.org/10.1177/01466216970214002
Wollack, J.A. (2006). Simultaneous use of multiple answer copying indexes to improve detection rates. Applied Measurement in Education, 19(4), 265–288. https://doi.org/10.1207/s15324818ame1904_3
Wollack, J.A., & Maynes, D. (2011, April 7-11). Detection of test collusion using item response data [Oral presentation]. The 2011 Annual Meeting of the National Council on Measurement in Education, New Orleans, USA.
Wright, B., & Masters, G. (1982). Rating Scale Analysis: Rasch Measurement. MESA Press.
Wright, N.A., Meade, A.W., & Gutierrez, S.L. (2014). Using invariance to examine cheating in unproctored ability tests. International Journal of Selection and Assessment, 22(1), 12–22. https://doi.org/10.1111/ijsa.12053
Wright, B., & Stone, M. (1979). Best test design: Rasch measurement. MESA Press.
Wise, S. (2023). Expanding the meaning of adaptive testing to enhance validity. Journal of Computerized Adaptive Testing, 10(2), 22-31. https://doi.org/10.7333/2305-1002022
Wunder, R.S., Thomas, L.L., & Luo, Z. (2010). Administering assessments and decision-making. In J.L. Farr, & N.T. Tippins (Eds.), Handbook of Employee Selection (pp. 377–398). Routledge.
Yan, D. (2020). Multistage testing in practice. In H. Jiao, & R.W. Lissitz (Eds.), Application of Artificial Intelligence to Assessment (pp. 141-160). Information Age Publications.
Yormaz, S. (2019). Test güvenliği açısından bireyler arasındaki olası iş birliğinin incelenmesi [Investigation of possible collusion between examinees in terms of test security] [Doctoral dissertation, Mersin University]. National Thesis Center. https://tez.yok.gov.tr/UlusalTezMerkezi/tarama.jsp
Yormaz, S., & Sunbul, O. (2017). Determination of type I error rates and power of answer copying indices under various conditions. Educational Sciences: Theory & Practice, 17(1), 5-26. https://doi.org/10.12738/estp.2017.1.0105
Yi, Q., Zhang, J., & Chang, H.H. (2006). Severity of organized item theft in computerized adaptive testing: An empirical study. Educational Testing Service. (ETS Research Report, RR-06-22). http://dx.doi.org/10.1002/j.2333-8504.2006.tb02028.x
Yi, Q., Zhang, J., & Chang, H.H. (2008). Severity of organized item theft in computerized adaptive testing: A simulation study. Applied Psychological Measurement, 32(7), 543-558. https://doi.org/10.1177/0146621607311336
Zhan, P., Jiao, H., Wang, W.-C., & Man, K. (2018). A multidimensional hierarchical framework for modeling speed and ability in computer-based multidimensional tests. arXiv preprint arXiv:1807.04003. https://doi.org/10.48550/arXiv.1807.04003
Zhang, J. (2014). A sequential procedure for detecting compromised items in the item pool of a CAT system. Applied Psychological Measurement, 38(2), 87–104. https://doi.org/10.1177/0146621613510062
Zhang, J., & Li, J. (2016). Monitoring items in real time to enhance CAT security. Journal of Educational Measurement, 53(2), 131-151. https://doi.org/10.1111/jedm.12104
Zhang, Y., Searcy, C.A., & Horn, L. (2011, April 9-11). Mapping clusters of aberrant patterns in item responses [Oral presentation]. The Annual Meeting of the National Council on Measurement in Education, New Orleans, USA.
Zhong, W. (2022). Using item response theory to detect potential aberrant behaviors in a multi-stage test: An example of the Norwegian language test (Publication No. 304) [Master's thesis, The University of Oslo]. CEMO Centre for Educational Measurement. https://www.duo.uio.no/handle/10852/55851/discover?rpp=100&sort_by=dc.date.issued_dt&order=DESC
Zopluoğlu, C. (2016). Classification performance of answer-copying indices under different types of IRT models. Applied Psychological Measurement, 40(8), 592–607. https://doi.org/10.1177/0146621616664724

There are 160 citations in total.

Details

Primary Language	English
Subjects	Computer Based Exam Applications, Simulation Study
Journal Section	Research Article
Authors	Ebru Balta 0000-0002-2173-7189 Arzu Uçar 0000-0002-0099-1348
Submission Date	December 8, 2024
Acceptance Date	June 18, 2025
Early Pub Date	July 21, 2025
Publication Date	September 4, 2025
Published in Issue	Year 2025 Volume: 12 Issue: 3

Cite

APA	Balta, E., & Uçar, A. (2025). Detection of aberrant testing behaviour in unproctored CAT via a verification test. International Journal of Assessment Tools in Education, 12(3), 681-700. https://doi.org/10.21449/ijate.1598330
