Investigation of Classification Accuracy, Test Length and Measurement Precision at Computerized Adaptive Classification Tests *

This study compares the Sequential Probability Ratio Test (SPRT) and Confidence Interval (CI) classification criteria and the Maximum Fisher Information item selection methods based on estimated ability (MFI-EB) and on the cut-point (MFI-CB) in Computerized Adaptive Classification Testing (CACT), with Weighted Likelihood Estimation (WLE) as the ability estimation method. The comparison is made in terms of Average Classification Accuracy (ACA), Average Test Length (ATL), and measurement precision under content balancing (Constrained Computerized Adaptive Testing: CCAT and Modified Multinomial Model: MMM) and item exposure control (Sympson-Hetter Method: SH and Item Eligibility Method: IE) when examinees are classified into two, three, or four categories using a unidimensional pool of dichotomous items. Forty-eight conditions were created in a Monte Carlo (MC) simulation; the data, comprising 500 items and 5000 examinees, were generated in R, and the results were calculated over 30 replications. The results showed that CI performs better in terms of ATL, whereas SPRT performs better in terms of ACA and of the correlation, bias, Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) values; MFI-EB is more useful than MFI-CB. MMM is more successful in content balancing, whereas CCAT is better in terms of test efficiency (ATL and ACA); IE is superior in terms of item exposure control, whereas SH is more beneficial for test efficiency. In addition, increasing the number of classification categories increases ATL and decreases ACA, but it gives better results in terms of the correlation, bias, RMSE, and MAE values.


INTRODUCTION
Testing in education might have various objectives. These objectives include increasing the effectiveness of education, assessing students individually, making selection or placement decisions, certification, monitoring learning progress, and testing for diagnostic purposes. To achieve these objectives, it seems to be critical to have access to timely and accurate information about learners' level of ability. In this regard, Computerized Adaptive Testing (CAT) is one of the greatest reflections of developments in information and communication technologies in the field of education and contributes to making more qualified and effective evaluations.
Unlike traditional paper-pencil tests, a CAT system assembles a different test form for each examinee in real time, based on that examinee's performance, in order to test individuals with different levels of ability (Bao, Shen, Wang, & Bradshaw, 2021). The goal of CAT is to estimate each individual's latent ability and select the most appropriate test items (i.e., the most informative items) from the item pool for an individual based on his or her current performance (Eggen & Straetmans, 2000). At the end of the process, CAT provides more reliable estimates of ability using fewer items compared to traditional tests (Bao et al., 2021; Fan, Wang, Chang, & Douglas, 2012; Thompson, 2009). These advantages of CAT can be seen as the main reason for preferring large-scale CAT applications such as the Graduate Management Admission Test (GMAT), the Graduate Record Examination (GRE), and the National Assessment of Educational Progress (NAEP). The main purpose of testing individuals may sometimes be the accuracy of classifications, such as passed or failed, apart from the effective estimate of ability. In that case, a Computerized Adaptive Classification Test (CACT) is preferred. Since important decisions are made based on the classification (e.g., retention, high school graduation, career selection), efficient and accurate classification is of critical importance (Thompson & Ro, 2007).
Additionally, test effectiveness is important for both CATs and CACTs. High test effectiveness in CAT applications with a unidimensional item pool means fewer items and lower standard errors for ability estimation (van der Linden & Hambleton, 1996 as cited in Thompson, 2009). Unlike CATs, CACTs use as few items as possible and aim at low classification errors to achieve test effectiveness (Thompson, 2009).

Purpose of the Study
An extensive review of literature on CACT applications revealed that most of the studies considered classification in only two categories (e.g., Gündeğer & Doğan, 2018a; Lau, 1996; Reckase, 1983; Spray & Reckase, 1996), and content balancing and item exposure control were not taken into account. Furthermore, classification criteria (e.g., Kingsbury & Weiss, 1980; Spray & Reckase, 1996; Thompson, 2009) and item selection methods were mostly compared (e.g., Gündeğer & Doğan, 2018b; Eggen, 1999; Lin & Spray, 2000), and the performance of different item selection methods was examined by crossing the item selection methods with classification criteria (e.g., Eggen & Straetmans, 2000; Thompson & Ro, 2007). Besides, there are a few studies that compared the performance of classification criteria in terms of Average Classification Accuracy (ACA) and Average Test Length (ATL) according to different item exposure control methods (Huebner, 2012; Lau & Wang, 1999). A study used the Sympson-Hetter (SH) item exposure control method together with the spiral method for content balancing (Huebner & Li, 2012). Considering the contribution of accurate classifications to selecting, monitoring, or placing individuals based on the test results, there seems to be a need for new research in CACT using different research designs. It is thus thought that this study will contribute to a deeper understanding of CACT applications.
The main purpose of this study was to examine the performance of different classification criteria and item selection methods used in CACT applications when weighted likelihood estimation (WLE) is used for ability estimation under various conditions of classification category numbers, content balancing, and item exposure control methods in terms of average classification accuracy, average test length, the correlation between true and estimated ability levels, bias, root mean squared error (RMSE), and mean absolute error (MAE). The research problem is as follows: Given that WLE is the ability estimation method, and that the sequential probability ratio test (SPRT) with an indifference region (IR) constant of δ = .20 and the confidence interval (CI) with a 90% confidence level are the classification criteria, how do the values of average classification accuracy, average test length, the correlation between true and estimated ability levels, bias, RMSE, and MAE change in two-, three-, or four-category classifications under the following combinations? The first subproblem crosses the classification criteria with the MFI-EB and MFI-CB item selection methods; the second subproblem additionally incorporates the CCAT and MMM content balancing methods and the SH and IE item exposure control methods.

METHOD
In this study, Monte Carlo (MC) simulations were performed, and CACT application results were compared using simulated datasets. Whereas other research methods answer the questions "What happened, and how, and why?", simulation studies help answer the question "What if ...?". Simulation studies make it possible to examine more complex systems because different possible future conditions can be created (Dooley, 2002). The datasets used were generated in the R program (R Core Team, 2013) based on the conditions examined in the study. The dependent variables of the study were ACA, ATL, the correlation between true and estimated ability values (r), bias, RMSE, and MAE. The independent variables were classification criteria (SPRT and CI), item selection methods (MFI-EB and MFI-CB), content balancing methods (CCAT and MMM), item exposure control methods (SH and IE), and the number of classification categories (two, three, and four). Therefore, the study had 48 simulation conditions: 2 classification criteria x 2 item selection methods x 2 content balancing methods x 2 item exposure control methods x 3 classification category numbers.
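The fully crossed design described above can be enumerated programmatically. The following is a minimal sketch in Python (the study itself used R); the dictionary keys are illustrative names, not identifiers from the authors' scripts.

```python
from itertools import product

# Levels of each independent variable, as listed in the Method section.
factors = {
    "classification_criterion": ["SPRT", "CI"],
    "item_selection": ["MFI-EB", "MFI-CB"],
    "content_balancing": ["CCAT", "MMM"],
    "exposure_control": ["SH", "IE"],
    "n_categories": [2, 3, 4],
}

# Fully crossed design: one dictionary per simulation condition.
conditions = [dict(zip(factors, levels)) for levels in product(*factors.values())]

print(len(conditions))  # 2 * 2 * 2 * 2 * 3 = 48
```

Each element of `conditions` could then be passed to a simulation routine and run for the desired number of replications.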

Data Generation
The data used in this study were generated by simulation in accordance with certain properties.

Generation of item and ability parameters for Monte Carlo (MC) simulation
This study was conducted as an MC simulation study by taking Thompson's (2011) study into consideration. The item pool was composed of 500 items under the Item Response Theory (IRT) three-parameter logistic model (3PLM) for each of 30 replications. Since both estimate-based and cut score-based item selection methods (MFI-EB and MFI-CB) were used and two-, three-, or four-category classifications were made, the item pool was composed of items that provide a high amount of information at and around the cut-point θ = 0 and cover the ability level range (-3, 3). For the items in the pool, the a parameter was generated from a uniform distribution U[0.5, 2.0] to represent medium and high levels of discrimination considering the study of Kingsbury and Weiss (1980), the b parameter was generated from a normal distribution N(-0.5, 1.5) to be close to the actual values in applications as pointed out in Thompson (2009) and Warm (1989), and the c parameter was generated from a normal distribution N(0.20, 0.05) again to be close to an actual application in keeping with Thompson (2009). In addition, ability parameters of 5000 examinees were generated from a normal distribution N(0, 1) within a range of (-3, +3) for each of 30 replications.
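The parameter generation described above can be sketched as follows. This is an illustrative Python version, not the authors' R code, and it rests on several assumptions that the text does not settle: the second argument of each normal distribution is treated as a standard deviation, the c parameter is clipped to [0, 1], and abilities outside (-3, 3) are redrawn. The function names and seeds are hypothetical.

```python
import random

def generate_item_pool(n_items=500, seed=1):
    """Draw 3PLM item parameters (a, b, c) from the stated distributions."""
    rng = random.Random(seed)
    pool = []
    for _ in range(n_items):
        a = rng.uniform(0.5, 2.0)        # discrimination, U[0.5, 2.0]
        b = rng.gauss(-0.5, 1.5)         # difficulty; 1.5 treated as an SD (assumption)
        c = rng.gauss(0.20, 0.05)        # guessing; 0.05 treated as an SD (assumption)
        c = min(max(c, 0.0), 1.0)        # keep c a valid probability (assumption)
        pool.append((a, b, c))
    return pool

def generate_abilities(n_examinees=5000, low=-3.0, high=3.0, seed=2):
    """Draw abilities from N(0, 1), redrawing values outside (low, high)."""
    rng = random.Random(seed)
    thetas = []
    while len(thetas) < n_examinees:
        theta = rng.gauss(0.0, 1.0)
        if low < theta < high:           # rejection step (assumption)
            thetas.append(theta)
    return thetas

item_pool = generate_item_pool()
abilities = generate_abilities()
```

In a replication loop, a different seed would be used for each of the 30 replications so that a new pool and a new examinee sample are drawn each time.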

CACT Simulation Conditions
The CACT simulation conditions used in this study are explained in detail under the subheadings below.

Starting point
Available prior information about examinees can be used as the starting point in CACT (Weiss & Kingsbury, 1984; Yang, Poggio, & Glasnapp, 2006). Although not used very often, the population mean can also be defined as the starting point (Thompson, 2007b). In this research, the starting point for all conditions was determined as θ = 0.

Item selection
Intelligent item selection methods, where the computer program evaluates the unused items in the pool and decides which would be the best item to use next, are generally classified into two groups: estimate-based and cut score-based (Thompson, 2007b). When IRT is used as the psychometric model, the cut score-based methods such as MFI, maximum Kullback-Leibler information (KLI), and log-odds ratio methods can be preferred (Lin & Spray, 2000). Traditionally, an item selection method that maximizes Fisher information at the cut-point is used with SPRT. SPRT is expected to yield better results, especially as the indifference region increases (Eggen, 1999). MFI-EB and MFI-CB methods were used for item selection in this study.
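The difference between the two item selection methods lies only in the point at which Fisher information is evaluated: the current ability estimate for MFI-EB, and the cut-point for MFI-CB. A minimal Python sketch under the 3PLM follows; the scaling constant D = 1, the three-item pool, and the function names are illustrative assumptions.

```python
import math

def p_3pl(theta, a, b, c, D=1.0):
    """Probability of a correct response under the 3PLM (D = 1 is an assumption)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def fisher_info(theta, a, b, c, D=1.0):
    """Item Fisher information under the 3PLM."""
    p = p_3pl(theta, a, b, c, D)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def select_item(pool, used, point):
    """Index of the unused item with maximum information at `point`."""
    candidates = [i for i in range(len(pool)) if i not in used]
    return max(candidates, key=lambda i: fisher_info(point, *pool[i]))

# Hypothetical three-item pool of (a, b, c) tuples.
pool = [(1.0, -1.0, 0.2), (1.5, 0.0, 0.2), (1.2, 1.0, 0.2)]
theta_hat, cut_point = 1.5, 0.0

item_eb = select_item(pool, used=set(), point=theta_hat)  # MFI-EB: evaluate at the estimate
item_cb = select_item(pool, used=set(), point=cut_point)  # MFI-CB: evaluate at the cut-point
```

With these values the two methods pick different items: the estimate-based rule prefers the item located near the examinee, whereas the cut-point-based rule prefers the item located near θ = 0.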

Ability estimation
Based on the literature, there are several ability estimation methods for binary scoring (1-0) and unidimensional item response theory modeling. The most common and widely used ability estimation methods include Maximum Likelihood Estimation (MLE), Marginal Maximum Likelihood Estimation (MMLE), Weighted Likelihood Estimation (WLE), and Bayesian estimation methods such as Owen's Bayesian sequential method, Maximum A Posteriori (MAP), and Expected A Posteriori (EAP). Warm (1989) noted that all these methods can produce some biased estimates. Bias affects the accuracy of classification decisions systematically (Wang & Wang, 2001). Additionally, Warm (1989) concluded that, especially in fixed-length tests, estimations made by WLE had less bias compared to estimations made by MLE and MAP. He reported that when WLE is used in adaptive tests of various lengths, the test behaves similarly to one based on MAP but ends with fewer items than one based on MLE, and he proposed the WLE method, which is a modified version of MLE, for ability estimation. This estimation method may reduce item exposure and test time, thereby enhancing the usefulness of the test. Thus, using WLE can be considered an advantage for CACT and CAT applications. WLE is a method that reduces bias and works on the basis of item parameters and a weighting function specific to ability levels (Warm, 1989). WLE is most often preferred in CACT applications (Eggen & Straetmans, 2000; Nydick, Nozawa, & Zhu, 2012; Wouda & Eggen, 2009; Yang et al., 2006). Considering its advantages and its position in the classification literature, WLE was used as the ability estimation method in this study. The WLE ability estimation method was kept constant across all simulation conditions.
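To illustrate how WLE modifies MLE, the sketch below solves Warm's (1989) weighted likelihood equation, in which the term J(θ) / (2 I(θ)) is added to the usual score function. It is a simplified Python illustration rather than the authors' R implementation: the logistic metric with D = 1, the search interval [-4, 4], the bisection solver, and all function names are assumptions, and at least one item is required.

```python
import math

def _derivs(theta, a, b, c):
    """P, P' and P'' of the 3PLM at theta (logistic metric, D = 1; assumption)."""
    L = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    p = c + (1.0 - c) * L
    d1 = (1.0 - c) * a * L * (1.0 - L)
    d2 = (1.0 - c) * a * a * L * (1.0 - L) * (1.0 - 2.0 * L)
    return p, d1, d2

def wle_equation(theta, items, responses):
    """Score function plus Warm's correction term J / (2 I)."""
    score = info = j_term = 0.0
    for (a, b, c), u in zip(items, responses):
        p, d1, d2 = _derivs(theta, a, b, c)
        pq = p * (1.0 - p)
        score += (u - p) * d1 / pq
        info += d1 * d1 / pq
        j_term += d1 * d2 / pq
    return score + j_term / (2.0 * info)

def wle(items, responses, low=-4.0, high=4.0, tol=1e-6):
    """Bisection search for a root of the weighted likelihood equation."""
    if wle_equation(low, items, responses) <= 0.0:
        return low                      # no sign change: report the lower bound
    if wle_equation(high, items, responses) >= 0.0:
        return high                     # no sign change: report the upper bound
    while high - low > tol:
        mid = (low + high) / 2.0
        if wle_equation(mid, items, responses) > 0.0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0

# Hypothetical items (a, b, c) with c = 0 for a simple check.
items = [(1.0, -1.0, 0.0), (1.0, -0.5, 0.0), (1.0, 0.5, 0.0), (1.0, 1.0, 0.0)]
theta_perfect = wle(items, [1, 1, 1, 1])   # finite, unlike the MLE for a perfect score
theta_middle = wle(items, [1, 1, 0, 0])    # symmetric case, estimate close to 0
```

The perfect-score example shows one practical consequence of the weighting: the estimate stays finite where the unweighted likelihood has no maximum.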

Classification criteria
There are three basic classification criteria based on IRT in CACT applications: SPRT, CI, and Bayesian decision theory. All three classification criteria require fewer items than traditional fixed-form tests and provide a similar level of classification accuracy (Kingsbury & Weiss, 1983). Previous research has shown that CI is more effective in estimate-based item selections, while SPRT is more effective in cut score-based item selections (Eggen & Straetmans, 2000; Spray & Reckase, 1996; Thompson, 2009). It has also been shown that SPRT is more effective than CI, especially in terms of classification accuracy (Eggen & Straetmans, 2000). Furthermore, as Thompson (2009) pointed out, the most used classification criterion in CACT studies is SPRT. Against this background, the classification criteria were determined as SPRT (δ = .20) and CI (90%) in this study.
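The two criteria used in this study can be sketched for a single cut-point as follows. This Python illustration is not the authors' code. The value δ = .20 and the 90% level come from the text, but the nominal error rates α = β = .05 for SPRT are assumptions, since the text does not report them, as are D = 1 and the function names. With several cut-points, the same decision would be evaluated at each one.

```python
import math

def p_3pl(theta, a, b, c):
    """3PLM probability of a correct response (D = 1; assumption)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def sprt_decision(items, responses, cut, delta=0.20, alpha=0.05, beta=0.05):
    """Compare the log likelihood ratio at cut + delta vs cut - delta with Wald's bounds."""
    lo, hi = cut - delta, cut + delta
    log_lr = 0.0
    for (a, b, c), u in zip(items, responses):
        p_hi, p_lo = p_3pl(hi, a, b, c), p_3pl(lo, a, b, c)
        if u == 1:
            log_lr += math.log(p_hi / p_lo)
        else:
            log_lr += math.log((1.0 - p_hi) / (1.0 - p_lo))
    if log_lr >= math.log((1.0 - beta) / alpha):
        return "above"
    if log_lr <= math.log(beta / (1.0 - alpha)):
        return "below"
    return "continue"

def ci_decision(theta_hat, se, cut, z=1.645):
    """Classify when the 90% interval around the estimate excludes the cut-point."""
    if theta_hat - z * se > cut:
        return "above"
    if theta_hat + z * se < cut:
        return "below"
    return "continue"

# Hypothetical test of 30 identical items located at the cut-point.
items = [(1.5, 0.0, 0.2)] * 30
decision_all_correct = sprt_decision(items, [1] * 30, cut=0.0)
decision_all_wrong = sprt_decision(items, [0] * 30, cut=0.0)
decision_ci = ci_decision(theta_hat=1.0, se=0.3, cut=0.0)
```

A "continue" result means that another item is administered, subject to the minimum and maximum test lengths.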

Content balancing
In content-balanced CACT applications, examinees are measured by a test that represents each of the content areas as appropriately as possible and has higher validity. The most commonly used content balancing methods in CACT studies are the spiralling method (Kingsbury & Zara, 1989) (e.g., Finkelman, 2008; Huebner, 2012) and the constrained CAT (CCAT) method (e.g., Eggen & Straetmans, 2000; Huebner & Li, 2012). Lin (2011) used a modified multinomial model (MMM) for content balancing. However, no research has been found that compares CCAT and MMM in the literature. Therefore, in this study, unlike the previous studies, two different content balancing methods, namely CCAT and MMM, were used. The minimum number of items to be used before terminating the test was set at 10, and the maximum number of items was set at 70 to ensure content balancing conditions. In cases where CCAT and MMM were included in the study conditions, the item pool generated with 500 items in the R program was divided into four content areas using random item assignment. Then, items were selected using the functions and loops written by the researcher in line with these content areas. The target proportions of the four content areas were set at 40%, 30%, 20%, and 10%, respectively.
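The logic of the two content balancing methods can be sketched as below. The authors' own R functions are not shown in the text, so this Python version follows the way the methods are commonly described and should be read as an assumption: CCAT picks the content area whose administered proportion falls furthest below its target, while MMM draws the content area at random with the target proportions as probabilities and drops an area once its quota is filled. The quota rule (target proportion times the maximum test length of 70) and all names are illustrative.

```python
import random

TARGETS = [0.40, 0.30, 0.20, 0.10]   # target proportions of the four content areas

def ccat_next_area(counts, targets=TARGETS):
    """CCAT: area with the largest gap between target and administered proportion."""
    n = sum(counts)
    if n == 0:
        return max(range(len(targets)), key=lambda k: targets[k])
    return max(range(len(targets)), key=lambda k: targets[k] - counts[k] / n)

def mmm_next_area(counts, max_length, rng, targets=TARGETS):
    """MMM: multinomial draw over the areas whose quota is not yet filled."""
    quotas = [round(t * max_length) for t in targets]
    open_areas = [k for k in range(len(targets)) if counts[k] < quotas[k]]
    if not open_areas:                      # every quota filled (fallback; assumption)
        open_areas = list(range(len(targets)))
    weights = [targets[k] for k in open_areas]
    return rng.choices(open_areas, weights=weights, k=1)[0]

rng = random.Random(0)
first_area = ccat_next_area([0, 0, 0, 0])        # no items given yet: largest target
lagging_area = ccat_next_area([4, 3, 2, 0])      # area 3 has received no items so far
mmm_area = mmm_next_area([28, 0, 0, 0], max_length=70, rng=rng)
```

Once the content area is chosen, the item selection method (MFI-EB or MFI-CB) would be applied only to the unused items of that area.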

Item exposure
In CAT applications in which item exposure control is not used, the selection of items based only on maximum information could result in overexposure of items. On the other hand, both test security and more balanced use of the item pool are considered while maintaining measurement precision when item exposure control techniques are implemented (Leroux et al., 2019). A search of the literature showed that the most used item exposure control methods in CACT applications are the random item selection method based on randomness strategies and the SH method (Sympson & Hetter, 1985) based on conditional selection strategies. Because randomness strategies are believed to be not effective under realistic test conditions, this research focused on the SH method and the IE method (van der Linden & Veldkamp, 2004), which is based on the same approach as the SH method. The maximum desired item exposure rate for the SH and IE methods used in the item exposure control was taken as rmax = .20 (Leung, Chang, & Hau, 2002), which is a frequently used value in line with the studies of Huebner (2012) and Huebner and Li (2012).
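The conditional selection idea behind the SH method can be sketched as a probabilistic filter applied after item selection: a selected item is administered only with probability equal to its exposure control parameter, and otherwise the next best item is considered. The Python sketch below shows only this administration step. The exposure control parameters `k` are hypothetical values (in practice they are obtained through iterative simulations aimed at rmax), the fallback rule is an assumption, and the IE method is not sketched here.

```python
import random

def sh_administer(ranked_items, k, rng):
    """Walk down the items ranked by information; administer item i with probability k[i]."""
    for item in ranked_items:
        if rng.random() < k[item]:
            return item
    return ranked_items[-1]   # nothing passed the filter: give the last candidate (assumption)

rng = random.Random(0)

# Hypothetical exposure control parameters for three candidate items.
k = {"item_A": 0.0, "item_B": 1.0, "item_C": 0.5}
administered = sh_administer(["item_A", "item_B", "item_C"], k, rng)
```

Here the most informative item is never administered because its parameter is 0, so the second candidate is always given.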

Number of classification categories
Much of the research in CACT so far has used only two categories, such as failed-passed, and a single cut-point. A two-category classification such as failed-passed was used in Huebner (2012), Lin and Spray (2000), Reckase (1983), Sie, Finkelman, Riley, and Smits (2015), Thompson (2009), and van Groen, Eggen, and Veldkamp (2016). Both two- and three-category classifications were used in Eggen (1999) and Thompson (2007a). A three-category classification was used in Nydick et al. (2012). Both three- and five-category classifications were used in Yang et al. (2006). This research used two-, three-, and four-category classifications to compare the changes. The ability parameters generated in R for the examinees were utilized to determine the cut-points for the classifications. The generated ability parameters were ranked from the low ability level to the high ability level. Through the method used in Eggen and Straetmans (2000), a cut-point was determined for the two-category classification, two cut-points were determined for the three-category classification, and three cut-points were determined for the four-category classification. In the two-category classification, the first half of the ability levels ranked from low to high were coded as Level 1 and the second half as Level 2. Then, the cut-point (CP = 0.00) was determined by taking 70% of the highest ability level in Level 1. Similarly, in the three-category classification, the ranked ability levels were encoded as Level 1, Level 2, and Level 3, and the cut-points were defined as CP1 = -0.29 and CP2 = 0.31. In the four-category classification, the ability levels were encoded as Level 1, Level 2, Level 3, and Level 4, and the cut-points were defined as CP1 = -0.47, CP2 = -0.01, and CP3 = 0.48.
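Given the cut-points reported above, assigning a level to an ability value and computing classification accuracy is straightforward. The Python sketch below is illustrative; placing an examinee whose ability equals a cut-point in the upper level is an assumption, and the function names are hypothetical.

```python
from bisect import bisect_right

# Cut-points reported in the text for two-, three-, and four-category classification.
CUT_POINTS = {2: [0.00], 3: [-0.29, 0.31], 4: [-0.47, -0.01, 0.48]}

def level(theta, n_categories):
    """Level (1-based) of an ability value; ties go to the upper level (assumption)."""
    return bisect_right(CUT_POINTS[n_categories], theta) + 1

def classification_accuracy(true_thetas, assigned_levels, n_categories):
    """Proportion of examinees whose assigned level matches the level of their true ability."""
    hits = sum(level(t, n_categories) == lv for t, lv in zip(true_thetas, assigned_levels))
    return hits / len(true_thetas)

# Hypothetical true abilities and CACT decisions for four examinees.
true_thetas = [-1.0, -0.2, 0.1, 0.9]
assigned = [1, 2, 3, 3]
accuracy = classification_accuracy(true_thetas, assigned, n_categories=4)
```

Averaging this proportion over replications gives a quantity of the kind reported as ACA.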

Data Analysis
Thirty replications were conducted for each of the 48 simulation conditions generated within the scope of the research, and the values of the dependent variables were obtained by calculating the average of the replications.The value of the correlation between true and estimated ability levels was calculated using the Pearson correlation coefficient (PCC), while the bias, RMSE, and MAE values were calculated following formulas written in the R program.
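Bias is calculated using the formula below, where the sum of the differences between the last estimated ability level ($\hat{\theta}_i$) and the true ability level ($\theta_i$) is divided by the number of examinees ($n$) (Miller & Miller, 2004):

$$\mathrm{Bias} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{\theta}_i - \theta_i\right)$$

RMSE is equal to the square root of the sum of the squared differences between $\hat{\theta}_i$ and $\theta_i$ divided by $n$:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{\theta}_i - \theta_i\right)^{2}}$$

MAE is calculated by dividing the sum of the absolute values of the differences between $\hat{\theta}_i$ and $\theta_i$ by $n$:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{\theta}_i - \theta_i\right|$$

Additionally, functions and loops were written in the R program, in addition to the item selection method, for content balancing and item exposure control.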
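The three error measures described in this section translate directly into code. The following Python sketch mirrors the verbal definitions (the authors computed them in R); the example values are hypothetical.

```python
import math

def bias(estimates, true_values):
    """Mean signed difference between estimated and true abilities."""
    return sum(e - t for e, t in zip(estimates, true_values)) / len(true_values)

def rmse(estimates, true_values):
    """Square root of the mean squared difference."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, true_values)) / len(true_values))

def mae(estimates, true_values):
    """Mean absolute difference."""
    return sum(abs(e - t) for e, t in zip(estimates, true_values)) / len(true_values)

# Hypothetical final estimates and true abilities for three examinees.
theta_hat = [0.5, -0.5, 1.0]
theta_true = [0.0, 0.0, 1.0]

print(bias(theta_hat, theta_true), rmse(theta_hat, theta_true), mae(theta_hat, theta_true))
```

In this example the positive and negative errors cancel, so the bias is zero even though RMSE and MAE are not, which is why the three measures are reported together.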

RESULTS
The results obtained for each subproblem of the study are presented under subheadings.

Results on the First Subproblem
Table 1 shows the values calculated by averaging 30 replications performed for each simulation condition related to the first research subproblem.
Note. SPRT = sequential probability ratio test; CI = confidence interval; MFI-EB = maximum Fisher information method on the basis of estimated ability; MFI-CB = maximum Fisher information method on the basis of cut-point.
As seen in Table 1, in the two-, three-, and four-category classifications, the ACA values were quite high and ranged from .82 to .94, and the ATL values ranged from 22.95 to 42.88 when SPRT was used for classification. On the other hand, when CI was used for classification, the ACA values were relatively lower and ranged from .71 to .90, and the ATL values ranged from 11.33 to 13.82. When the item selection methods MFI-EB and MFI-CB were used with the same classification criteria, similar results were obtained in terms of test effectiveness. In addition, an increase in the number of classification categories caused the test effectiveness to decrease for both classification criteria. In other words, it increased the ATL but reduced the ACA.
The values of the correlation (r) between the examinees' estimated and true ability levels ranged from .90 to .96 for SPRT and .87 to .91 for CI. With respect to the conditions in which the classification criteria were crossed by the item selection methods, higher correlations were calculated for both classification criteria in the conditions in which MFI-EB was used compared to the conditions in which MFI-CB was used. Additionally, similar correlation values were obtained in response to the increase in the number of classification categories. The bias calculated for the condition where SPRT and MFI-EB were used together (ranging from -0.014 to -0.011) was lower compared to that calculated for the condition where SPRT and MFI-CB were used together (ranging from 0.012 to 0.019). Similarly, the bias calculated for the condition where CI and MFI-EB were used together (ranging from 0.015 to 0.016) was lower compared to that calculated for the condition where CI and MFI-CB were used together (ranging from 0.017 to 0.020). The case is similar for the RMSE value, which takes into account the standard error of the estimation along with the bias, and for the MAE value. Accordingly, it can be said that lower bias, RMSE, and MAE values were found when the SPRT classification criterion or the MFI-EB item selection method was used. Furthermore, the increase in the number of categories did not exert a great effect on the bias but relatively decreased the RMSE and MAE values.

Results on the Second Subproblem
Table 2 demonstrates the values calculated by averaging 30 replications performed for each condition related to the second research subproblem, which incorporated CCAT and MMM for content balancing and SH and IE for item exposure control.
As seen in Table 2, in all conditions where the MMM content balancing method was used, the used content rates achieved the desired content rates (40%, 30%, 20%, and 10%, respectively). In the conditions where the CCAT content balancing method was used, the used content rates were above or below the desired content rates. For example, as seen in Table 2, in the condition where SPRT was used with MFI-CB, item exposure was controlled using IE, and a four-category classification was made, the CCAT content rates were found to be approximately 32%, 28%, 23%, and 16%, respectively. In addition, in the conditions where the IE item exposure control method was used, the proportion of items overexposed (OEX) was lower and the mean exposure rate of overexposed items (MOEX) achieved the desired rmax = .20. On the other hand, in the conditions where SH was used, OEX was higher, and MOEX was considerably higher than the desired rmax = .20. For example, as seen in Table 2, when SPRT and MFI-EB were used together, content balancing was done using CCAT, and a four-category classification was made, the OEX value calculated for item exposure controlled using SH was approximately .25, and the MOEX value was .29. In other words, approximately 25% of the items were above the maximum item exposure rate (rmax = .20), and the mean exposure rate of these overexposed items was calculated to be approximately .29.
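The two exposure indices discussed above can be computed from the number of times each item was administered. The Python sketch below follows the verbal definitions in the text; taking the proportion over all items in the pool, rather than only over administered items, is an assumption, and the counts are hypothetical.

```python
def exposure_summary(admin_counts, n_examinees, r_max=0.20):
    """Return (OEX, MOEX): proportion of overexposed items and their mean exposure rate."""
    rates = [count / n_examinees for count in admin_counts]
    over = [r for r in rates if r > r_max]
    oex = len(over) / len(rates)                   # share of the pool above r_max (assumption)
    moex = sum(over) / len(over) if over else 0.0  # mean rate among overexposed items
    return oex, moex

# Hypothetical administration counts for a four-item pool and 1000 examinees.
oex, moex = exposure_summary([300, 250, 100, 0], n_examinees=1000)
```

In this toy example two of the four items exceed rmax = .20, so OEX is .50 and MOEX is the mean of their exposure rates.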
As seen in Table 2, another comparison using the same classification criteria and item selection method showed that although the CCAT content balancing method performed slightly better in terms of test effectiveness, it generally produced similar results to MMM. In addition, the SH item exposure control method performed better compared to IE in terms of test effectiveness. The best result in terms of ATL (ATL = 11.13 and ACA = .88) was recorded in the condition where CI, MFI-EB, CCAT, and SH were used together, and a two-category classification was made, while the worst result (ATL = 51.93 and ACA = .75) was recorded in the condition where SPRT, MFI-CB, MMM, and IE were used together, and a four-category classification was made. To put it differently, between the best and worst results, ATL was nearly five times higher, while ACA declined considerably.

DISCUSSION and CONCLUSION
Because the primary focus of this study is on classification accuracy, the ACA values calculated under different conditions are of great importance in interpreting the findings. In line with the research findings, high ACA values were calculated under all research conditions. The SPRT classification criterion performed better than CI and achieved a higher rate of classifying examinees into the accurate categories. On the other hand, the CI classification criterion performed better in terms of ATL under all research conditions and required fewer items to classify examinees compared to SPRT. This finding is in agreement with those obtained by Gündeğer and Doğan (2018a), Nydick et al. (2012), Thompson (2009), and Thompson and Ro (2007). These studies, in general, reported that the classifications made using CI ended with lower ATL and ACA compared to those made using SPRT. Therefore, comparing the SPRT and CI classification criteria used in the research in terms of classification accuracy, it may be suggested to prefer SPRT, which yielded higher ACA values. On the other hand, comparing SPRT and CI in terms of ATL, CI seems to be preferable as it requires fewer items to classify examinees and terminate the test. Nevertheless, it should be noted that with respect to high-risk tests (e.g., tests applied in the field of medicine and directly related to human life), it is of key importance to choose the method which achieves a higher classification accuracy despite the increasing number of items. In CACTs, ATL and ACA are often evaluated together for test effectiveness. If a decision is to be made to choose the best performing classification criterion in terms of test effectiveness, it may be suggested to use CI for conditions where both classification criteria achieve a good level of classification accuracy.
This research found that the SPRT classification criterion performed better than CI, and the MFI-EB item selection method performed better than MFI-CB in terms of measurement precision. Accordingly, under the conditions where the SPRT classification criterion or the MFI-EB item selection method was used, the values of correlation between examinees' true and estimated ability levels were higher while the bias, RMSE, and MAE values were lower. It can thus be said that examinees' final ability estimates were more precise and closer to their true ability levels when the classification criterion was SPRT or when the item selection method was MFI-EB. A possible explanation of this result might be that the item pool was composed of items that provide great information at and around the cut-point θ = 0. Additionally, the MFI-EB item selection method achieved relatively better results compared to MFI-CB in terms of test effectiveness. In other words, when MFI-EB was used, lower ATL values and similar ACA values were obtained.
The analysis results showed that the values of correlation between examinees' true and estimated ability levels were quite high, especially when the WLE ability estimation method was used together with the SPRT classification criterion and the MFI-EB item selection method. It can thus be said that the WLE method performs successfully.
Comparing the findings presented in Table 1 and Table 2, it can be seen that relatively higher ATL and lower ACA values were obtained in line with expectations when content balancing and item exposure control were added to the research conditions. According to Thompson (2007b), content

Demir, S., Atar, B. / Investigation of Classification Accuracy, Test Length and Measurement Precision at Computerized Adaptive Classification Tests
ISSN: 1309-6575 Eğitimde ve Psikolojide Ölçme ve Değerlendirme Dergisi Journal of Measurement and Evaluation in Education and Psychology

Table 1.
Comparison of the Classification Criteria (CC) and Item Selection Methods (ISM) According to the Average Test Length (ATL), Average Classification Accuracy (ACA), and Measurement Precision With Correlation (r), Bias, Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) Values When the Number of Classification Categories (NCC) Is Two, Three, or Four

SPRT yielded better results in terms of ACA, and CI yielded better results in terms of ATL.

Table 2.
Comparison of the Classification Criteria (CC) and Item Selection Methods (ISM) According to the Average Test Length (ATL), Average Classification Accuracy (ACA), and Measurement Precision With Correlation (r), Bias, Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) Values

The correlation (r) values ranged from .90 to .96 in the conditions where SPRT was used, while they ranged from .85 to .90 in the conditions where CI was used. The bias values ranged from -0.018 to 0.009 in the conditions where SPRT was used, while they ranged from 0.004 to 0.016 in the conditions where CI was used. The highest RMSE value (0.52) and the highest MAE value (0.41) were observed when CI, MFI-CB, CCAT (or MMM), and IE were used together, and a two-category classification was made. On the other hand, the lowest RMSE value (0.30) was observed when SPRT, MFI-EB, CCAT (or MMM), and SH were used together with four-category classification, and the lowest MAE value (0.22) was observed when SPRT, MFI-EB, CCAT, and SH were used together with four-category classification. In summary, parallel to the findings in Table 1, CI performed better in terms of ATL, while SPRT performed better in terms of ACA. As the number of classification categories increased, ATL increased but ACA decreased. With respect to the correlation (r), bias, RMSE, and MAE values, SPRT performed better than CI, and MFI-EB performed better than MFI-CB. Furthermore, in response to the increased number of categories, the correlation and bias resulted in similar values, while the RMSE and MAE values were relatively lower.