An Evaluation of 4PL IRT and DINA Models for Estimating Pseudo-Guessing and Slipping Parameters

In an achievement test, examinees who have the knowledge and skills required by a test item are expected to answer it correctly, whereas examinees who lack the necessary knowledge are expected to answer it incorrectly. However, an examinee can answer a multiple-choice item correctly through guessing, or can occasionally answer an easy item incorrectly because of anxiety or carelessness. Either case may bias the estimates of examinee abilities and item parameters. The four-parameter logistic item response theory (4PL IRT) model and the deterministic inputs, noisy "and" gate (DINA) model can be used to mitigate these negative effects on parameter estimation. The current simulation study compares the pseudo-guessing and slipping parameters estimated by the 4PL IRT model and the DINA model under several study conditions. The DINA model was used to simulate the datasets. The results showed that the bias of the estimated slipping and guessing parameters was reasonably small in general for both models, although the estimates were more biased when the datasets were analyzed with the 4PL IRT model rather than the DINA model (i.e., the average bias for both guessing and slipping parameters was .00 for the DINA model but .08 for the 4PL IRT model). Accordingly, both the 4PL IRT and DINA models can be considered for analyzing datasets contaminated with guessing and slipping effects.


INTRODUCTION
Psychological and educational tests are usually used for observing a sample of examinees' behaviors. Many of them focus on measuring the abilities and skills of examinees. Therefore, it is important to know how an examinee's ability determines the correctness of an answer on an item (Lord, 2012). In an achievement test, a correct response is expected from an examinee with the required knowledge on the item, whereas an examinee without the necessary knowledge is expected to give an incorrect answer (Rowley & Traub, 1977). However, this assumption may not hold for multiple-choice test items. In a test with multiple-choice items, an examinee's response may reflect true ability, guessing behavior, or an unexpected incorrect response (i.e., slipping effect) due to anxiety or carelessness. In the presence of guessing and slipping effects, the estimates of examinees' abilities and item parameters might be biased. These two effects can be modeled using item response theory (IRT) models and cognitive diagnostic models (CDMs). IRT models explain the relationship between an examinee's observed test performance and the underlying latent abilities through a mathematical function (Hambleton & Swaminathan, 1985). On the other hand, CDMs are used for determining whether an examinee has the set of attributes needed to solve a problem correctly in a test (de la Torre, 2009). CDMs have many aspects in common with IRT models. For example, Junker (2001) used deterministic inputs, noisy "and" gate (DINA; Haertel, 1989; Junker & Sijtsma, 2001) models as an initial tool for
The DINA Model
The DINA model, proposed by Junker and Sijtsma (2001), requires a Q-matrix (Tatsuoka, 1983), as other CDMs do. The Q-matrix is a J × K matrix of 1s and 0s, with items in the rows and attributes in the columns. The element in the jth row and kth column of the matrix is denoted qjk. If qjk equals 1, an examinee must possess the corresponding attribute in order to answer the item correctly. If the attribute is not required for answering the item correctly, qjk is 0. Assume the vector yi represents the observed scores of examinee i on the J items, and the elements of yi are statistically independent given the examinee's attribute vector αi = {αi1, αi2, ... , αiK}. Using the Q-matrix and the examinee's attribute vector, the DINA model produces ηij as in Equation 1.
$$\eta_{ij} = \prod_{k=1}^{K} \alpha_{ik}^{q_{jk}} \qquad (1)$$
In Equation 1, if an examinee possesses all attributes necessary for a correct answer on the item, ηij = 1; otherwise, ηij = 0. The DINA model allows an examinee who possesses all required attributes to miss an item (slip) and an examinee who lacks at least one of the required attributes to answer the item correctly (guess). The DINA model includes a guess (g) and a slip (s) parameter for each test item. The parameter gj is defined by gj = P(Yij = 1 | ηij = 0), and the parameter sj is defined by sj = P(Yij = 0 | ηij = 1). Accordingly, the probability of a correct response on item j for examinee i with attribute profile αi is formulated as in Equation 2.
$$P(Y_{ij} = 1 \mid \alpha_i) = (1 - s_j)^{\eta_{ij}} \, g_j^{\,1 - \eta_{ij}} \qquad (2)$$
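To make the DINA response rule described above concrete, the following Python sketch computes the latent response ηij from an attribute profile and one Q-matrix row, and then the probability of a correct response from the guess and slip parameters. It is a minimal illustration only; the function names and the example values are hypothetical and are not taken from the study or from any software package.

```python
from typing import Sequence


def dina_eta(alpha: Sequence[int], q_row: Sequence[int]) -> int:
    """Latent response: 1 if every attribute required by the item
    (q = 1) is mastered (alpha = 1), otherwise 0."""
    return int(all(a == 1 for a, q in zip(alpha, q_row) if q == 1))


def dina_prob_correct(alpha: Sequence[int], q_row: Sequence[int],
                      g: float, s: float) -> float:
    """P(Y = 1 | alpha): 1 - s when eta = 1 (no slip), g when eta = 0 (guess)."""
    eta = dina_eta(alpha, q_row)
    return (1.0 - s) ** eta * g ** (1 - eta)


# Hypothetical item requiring attributes 1 and 3 out of K = 3.
q_row = [1, 0, 1]
master = [1, 0, 1]      # has both required attributes
non_master = [1, 1, 0]  # lacks attribute 3

p_master = dina_prob_correct(master, q_row, g=0.15, s=0.10)
p_non_master = dina_prob_correct(non_master, q_row, g=0.15, s=0.10)
```

In this sketch an examinee with all required attributes answers correctly with probability 1 - s, and any other examinee answers correctly with probability g, regardless of how many attributes are missing.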
The DINA model can be implemented in software programs including OxEdit (Doornik, 2018), LatentGold (Vermunt & Magidson, 2016), Mplus (Muthén & Muthén, 1998-2017), the "CDM" package (Robitzsch, Kiefer, George, & Uenlue, 2019), and the "GDINA" package (Ma & de la Torre, 2020) available in the R program (R Core Team, 2017). However, the implementation of the DINA model is not limited to these programs.

The 4PL IRT Model
Barton and Lord (1981) proposed the 4PL IRT model, which adds a parameter for the upper asymptote of the item characteristic curve. This model accounts for unexpected incorrect responses (slips) by examinees with a high ability level due to anxiety and carelessness. In the general form of this model, the probability of a correct response given the ability level is formulated as in Equation 3.
$$P(X_{ij} = 1 \mid \Theta_i) = c_j + (d_j - c_j)\,\frac{\exp\left(\sum_{k} a_{jk}\theta_{ik} + b_j\right)}{1 + \exp\left(\sum_{k} a_{jk}\theta_{ik} + b_j\right)} \qquad (3)$$
In Equation 3, Xij is the observed score of examinee i on item j, k indexes the latent factors, Θi is the vector of abilities of examinee i (with elements θik), cj is the pseudo-guessing parameter of item j, dj is the upper asymptote parameter (i.e., slipping parameter) of item j, ajk is the discrimination parameter of item j on latent factor k, and bj is the intercept of item j, which in this form corresponds to the product of item discrimination and item difficulty with the sign reversed (see Barton & Lord, 1981; de Ayala, 2009). Although Barton and Lord (1981) proposed using a common upper asymptote across all test items, the general form of the 4PL model allows a different upper asymptote to be estimated for each test item. One-, two-, and three-parameter logistic (1PL, 2PL, and 3PL) IRT models for dichotomous items have attracted great attention in the last decade (Magis, 2013). On the other hand, the 4PL IRT model was not commonly used among practitioners and researchers until recent years because there was no clear indication of its benefit, the upper asymptote was difficult to estimate, and accessible software for fitting the model was unavailable (Barton & Lord, 1981; Hambleton & Swaminathan, 1985; Loken & Rulison, 2010). However, the 4PL IRT model has become more popular in recent years, especially in the literature on IRT and computerized adaptive testing (CAT), with the development of powerful software such as the "mirt" package in the R program (Chalmers, 2012; Magis, 2013; Meng et al., 2019). Many studies have contributed to the improvement of the 4PL IRT model regarding its application in the field and its parameter estimation (e.g., Culpepper, 2016; Liao et al., 2012; Loken & Rulison, 2010; Magis, 2013; Meng et al., 2019; Rulison & Loken, 2009).
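As an illustration of the 4PL response function described in this section, the following Python sketch evaluates the probability of a correct response for one item. It assumes a slope-intercept parameterization in which the linear predictor is the sum of ajk times the ability on factor k plus the intercept bj; the function and argument names are hypothetical and do not reproduce the interface of the "mirt" package.

```python
import math
from typing import Sequence


def p_4pl(theta: Sequence[float], a: Sequence[float],
          b: float, c: float, d: float) -> float:
    """Probability of a correct response under a multidimensional 4PL model.

    theta: abilities on each latent factor
    a:     discriminations of the item on each factor
    b:     item intercept (assumed slope-intercept form)
    c:     lower asymptote (pseudo-guessing)
    d:     upper asymptote
    """
    z = sum(a_k * t_k for a_k, t_k in zip(a, theta)) + b
    return c + (d - c) / (1.0 + math.exp(-z))


# Hypothetical two-factor item that loads only on the first factor.
low = p_4pl(theta=[-6.0, 0.0], a=[2.0, 0.0], b=0.0, c=0.15, d=0.90)
mid = p_4pl(theta=[0.0, 0.0], a=[2.0, 0.0], b=0.0, c=0.15, d=0.90)
high = p_4pl(theta=[6.0, 0.0], a=[2.0, 0.0], b=0.0, c=0.15, d=0.90)
```

The curve approaches c for very low abilities and d for very high abilities. When the two models are compared, c therefore plays the role of the DINA guess parameter g, and the distance of d from 1 plays the role of the DINA slip parameter s.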
Although conventional IRT models allow test-takers' abilities to be scaled and ordered on one or more continuous latent factors, these models, including the 4PL IRT model, are not useful for assessing test-takers' strengths and weaknesses on the latent factors because they do not indicate whether specific behaviors related to the latent factors (attributes) have been mastered. Unlike IRT models, CDMs were proposed with the purpose of identifying test-takers' strengths and weaknesses by assessing the presence or absence of the attributes necessary to solve the problems in a test. Although the literature has many studies investigating the factors that are important for accurate estimation of item parameters in IRT models and CDMs separately, only a few studies directly compare the item parameters from IRT models and CDMs in the same research (e.g., 2PL vs. pG-DINA in Yakar, 2017). In addition, some studies employ the 4PL IRT model within CAT (e.g., Liao et al., 2012). However, it is also important to investigate parameter recovery in the 4PL IRT model for a fixed (non-adaptive) test via a simulation study because fixed tests are commonly used in educational and psychological assessments. Given the similarity between IRT models and the DINA model, a restricted latent model (Culpepper, 2016; Hoijtink & Molenaar, 1997; Junker, 2001; Junker & Sijtsma, 2001; Meng et al., 2019), the current study may help the field by showing the similarities and differences between the 4PL IRT model and the DINA model, and the study design factors that matter for accurate estimation of the guessing and slipping parameters. Accordingly, the current simulation study aims to compare the estimated c-g and d-s parameters from the 4PL IRT model and the DINA model, using datasets simulated from the DINA model under several study conditions.

Simulation Study Design
All data were generated and analyzed in the R program (R Core Team, 2017). The DINA model was used for data generation. In the literature, the test length was usually between 20 and 40 items (e.g., Chiu, 2008; de la Torre, 2008, 2009, 2011; de la Torre & Douglas, 2004; de la Torre & Lee, 2010, 2013; Henson & Douglas, 2005). In the data generation, test length was therefore fixed at J = 20 or 40 items. The review of the literature also showed that the studied g and s parameters tend to be between .0 and .45 (e.g., Chiu, 2008; de la Torre & Douglas, 2004; DeMars, 2007; Henson & Douglas, 2005). In addition, the intervals of these parameters corresponding to the low, moderate, and high levels differed across studies. In this study, three levels of the g and s parameters were manipulated in the data generation: .0 -.15 (low), .15 -.30 (moderate), and .30 -.45 (high). These levels were then crossed between the g and s parameters. The values of the g and s parameters were equally spaced with an increment of .0075 and .00375 for the conditions with 20 and 40 items, respectively. Specifically, the increment was obtained as the ratio of the interval width to the test length (e.g., for the test with 20 items and parameter values between .0 and .15, .15/20 = .0075). For example, when the test length was 20 and both the g and s parameters were low (.0 -.15), the values were fixed to g = s = .0075 for the first item, .015 for the second item, and so on up to .15 for the last item. Different values have been chosen across studies in the literature for the weak, moderate, and strong levels of correlation among factors/attributes.
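The spacing rule described above can be sketched in a few lines of Python. The helper name is hypothetical, and applying the same rule to the moderate and high levels (starting one increment above the lower bound) is an assumption, since the text spells out only the low level.

```python
from typing import List


def item_parameter_values(lower: float, upper: float, n_items: int) -> List[float]:
    """Equally spaced g (or s) values for one level.

    The increment is the interval width divided by the test length;
    item 1 gets lower + increment and the last item gets upper.
    """
    step = (upper - lower) / n_items
    return [lower + step * (j + 1) for j in range(n_items)]


low_20 = item_parameter_values(0.0, 0.15, 20)   # increment .0075
low_40 = item_parameter_values(0.0, 0.15, 40)   # increment .00375
# Assumption: moderate and high levels follow the same rule.
moderate_20 = item_parameter_values(0.15, 0.30, 20)
```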
In this study, the correlation among the attributes was fixed to r = .2 (weak), .5 (moderate), or .8 (strong) considering the studies by Finch (2010),   (2017) found that a minimum sample size of 1000 is necessary to obtain accurate ability estimates in the 4PL IRT model. Therefore, in this study, the sample size was fixed to N = 3000 so that it would be adequate for the parameter estimates to converge to a solution. The number of attributes is usually between 4 and 8 in the literature (e.g., Chiu, 2008). Because many simulation conditions were included in this study and the use of a large number of attributes in a simulation study is very time consuming (de la Torre & Douglas, 2004), the number of attributes was fixed to K = 3 or 5. Four different Q-matrices were used in the data generation (2 test lengths x 2 numbers of attributes). Each item was linked to one attribute in all Q-matrices (one-attribute items), and the items were distributed across the attributes as evenly as possible. Overall, there were a total of 108 conditions for data generation (3 g levels x 3 s levels x 3 correlation levels x 2 test lengths x 2 numbers of attributes). The number of replications for each condition was 100.
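The design described above can be sketched as follows: a one-attribute-per-item Q-matrix and the full crossing of the five design factors. This is an illustrative Python sketch rather than the R code used in the study. Assigning items to attributes in round-robin order is an assumption (the text states only that items were spread as evenly as possible), and all names are hypothetical.

```python
from itertools import product
from typing import List


def build_q_matrix(n_items: int, n_attributes: int) -> List[List[int]]:
    """J x K Q-matrix in which each item requires exactly one attribute.

    Assumption: items are assigned to attributes in round-robin order,
    which spreads them as evenly as possible.
    """
    q = []
    for j in range(n_items):
        row = [0] * n_attributes
        row[j % n_attributes] = 1
        q.append(row)
    return q


G_LEVELS = ["low", "moderate", "high"]
S_LEVELS = ["low", "moderate", "high"]
CORRELATIONS = [0.2, 0.5, 0.8]
TEST_LENGTHS = [20, 40]
ATTRIBUTE_COUNTS = [3, 5]

# Fully crossed design: 3 x 3 x 3 x 2 x 2 conditions.
conditions = list(product(G_LEVELS, S_LEVELS, CORRELATIONS,
                          TEST_LENGTHS, ATTRIBUTE_COUNTS))

# One Q-matrix per (test length, number of attributes) pair: four in total.
q_matrices = {(j, k): build_q_matrix(j, k)
              for j in TEST_LENGTHS for k in ATTRIBUTE_COUNTS}
items_per_attribute = [sum(col) for col in zip(*q_matrices[(20, 3)])]
```

With 20 items and 3 attributes the items cannot be split equally, so in this sketch two attributes are measured by 7 items and one by 6.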

Data Analysis
Each dataset was analyzed using a multidimensional 4PL IRT model and a DINA model. Before the datasets were analyzed with the multidimensional 4PL IRT model, their dimensionality was investigated via Factor 9.2 (Lorenzo-Seva & Ferrando, 2006). Parallel analysis with the tetrachoric correlation indicated that the dimensionality assumption was met for the use of the multidimensional IRT model (i.e., the result was in line with the factor structure used to generate the datasets via the DINA model). Local independence was assumed to hold because testing it is outside the scope of this study. The expectation-maximization (EM) algorithm was used to estimate the item parameters of the 4PL IRT and DINA models because it was the default estimation method in the R packages used for these models in the study. Specifically, the analysis of datasets was conducted in the "CDM" package (Robitzsch et al., 2019) for the DINA model and the "mirt" package (Chalmers, 2012)

RESULTS
Results were summarized using the average RMSE of the item parameters and its 95% confidence intervals for the 4PL IRT and DINA models across the study conditions. The RMSEs of the guessing parameters for the 4PL IRT and DINA models are presented in Figure 1. When the DINA model was fit to the data, the RMSE of the guessing parameters was almost zero across all levels of c-g parameters (c-g parameters = .0, .15, and .3; see Figure 1a), all levels of d-s parameters (d-s parameters = .0, .15, and .3; see Figure 1b), all levels of the correlation among factors/attributes (r = .2, .5, and .8; see Figure 1c), all numbers of attributes (K = 3 and 5; see Figure 1d), and all test lengths (J = 20 and 40; see Figure 1e). In addition, its 95% confidence intervals were so narrow across all these study conditions that they did not appear in any figure for the DINA model. However, the average RMSE of the guessing parameters became larger across all study conditions when the 4PL IRT model was fit to the data in lieu of the DINA model (see Figure 1a, 1b, 1c, 1d, and 1e). Furthermore, the RMSE of the guessing parameters was larger for the 4PL IRT model under the conditions with a larger c-g parameter in the data generation (the 95% confidence interval of the average RMSE for the guessing parameters was between .04 and .05 when c-g parameters = .0, between .08 and .12 when c-g parameters = .15, and between .13 and .17 when c-g parameters = .3; see Figure 1a). Similarly, for the 4PL IRT model, the average RMSE of the guessing parameters became larger when the number of factors/attributes was greater, the test was shorter, d-s parameters were higher, and the correlation among factors/attributes was weaker, as expected (see Figure 1b, 1c, 1d, and 1e).
However, among these four study conditions, the number of factors/attributes was the only one that made a significant difference in the RMSE of the guessing parameters from the 4PL IRT model when the overlap between the 95% confidence intervals was considered (the 95% confidence interval of the average RMSE for the guessing parameters was between .05 and .07 when K = 3, and between .11 and .15 when K = 5; see Figure 1d). Overall, similar results were found for the RMSE of the slipping parameters (see Figure 2). The average RMSE of the slipping parameters and its confidence interval were almost identical to those of the guessing parameters across all study conditions for both the DINA and 4PL IRT models, with one exception (see Figure 2b, 2c, 2d, and 2e). The RMSE of the slipping parameters became larger for the 4PL IRT model under the conditions with a larger d-s parameter, rather than a larger c-g parameter, in the data generation, considering the confidence intervals of the average RMSEs across the study conditions (the 95% confidence interval of the average RMSE for the slipping parameters was between .04 and .05 when d-s parameters = .0, between .08 and .12 when d-s parameters = .15, and between .13 and .17 when d-s parameters = .3; see Figure 2a).
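The significance criterion used in this section, whether two 95% confidence intervals overlap, can be expressed as a small check. The Python sketch below applies it to the intervals reported above for K = 3 and K = 5; the function name is hypothetical, and the interval endpoints are the rounded values from the text.

```python
from typing import Tuple

Interval = Tuple[float, float]


def intervals_overlap(ci_a: Interval, ci_b: Interval) -> bool:
    """True if the two closed intervals share at least one point."""
    (lo_a, hi_a), (lo_b, hi_b) = ci_a, ci_b
    return lo_a <= hi_b and lo_b <= hi_a


# 95% CIs of the average RMSE of the guessing parameters (4PL IRT model).
ci_k3 = (0.05, 0.07)
ci_k5 = (0.11, 0.15)

# Non-overlapping intervals are read as a significant difference.
k_is_significant = not intervals_overlap(ci_k3, ci_k5)
```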
The bias of the guessing and slipping parameters was calculated as the expectation of the difference between the item parameters estimated from the DINA or 4PL IRT model and their corresponding true values in the data generation. Results were summarized using the average bias of the item parameters and its 95% confidence intervals for the 4PL IRT and DINA models across the study conditions. The bias of the guessing parameters for the 4PL IRT and DINA models is presented in Figure 3. As expected from the RMSEs of the guessing parameters, when the guessing parameters were estimated through the DINA model, their bias was almost zero with a very narrow confidence interval across all levels of c-g parameters (c-g = .0, .15, and .3; see Figure 3a), all levels of d-s parameters (d-s = .0, .15, and .3; see Figure 3b), all levels of the correlation among factors/attributes (r = .2, .5, and .8; see Figure 3c), all numbers of attributes (K = 3 and 5; see Figure 3d), and all test lengths (J = 20 and 40; see Figure 3e) in the study. Unlike the DINA model, the 4PL IRT model overestimated the guessing parameters across all study conditions (see Figure 3a, 3b, 3c, 3d, and 3e).
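The two recovery measures used in the results, bias and RMSE, can be sketched as follows for a single item parameter estimated over several replications. This is a Python illustration with hypothetical names and made-up estimates. The normal-approximation confidence interval (mean ± 1.96 standard errors) is an assumption, because the text does not state how the 95% intervals were constructed.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def bias(estimates: Sequence[float], true_value: float) -> float:
    """Average signed difference between the estimates and the true value."""
    return mean(e - true_value for e in estimates)


def rmse(estimates: Sequence[float], true_value: float) -> float:
    """Root mean squared difference between the estimates and the true value."""
    return math.sqrt(mean((e - true_value) ** 2 for e in estimates))


def mean_ci95(values: Sequence[float]) -> Tuple[float, float]:
    """Assumed normal-approximation 95% CI for the mean of the values."""
    m = mean(values)
    half_width = 1.96 * stdev(values) / math.sqrt(len(values))
    return m - half_width, m + half_width


# Made-up guessing estimates for one item whose true value is .15.
estimates = [0.18, 0.22, 0.26]
item_bias = bias(estimates, true_value=0.15)   # positive: overestimation
item_rmse = rmse(estimates, true_value=0.15)
ci_low, ci_high = mean_ci95([0.1, 0.2, 0.3])
```

A positive bias indicates overestimation. The RMSE is never smaller than the absolute bias because it also reflects the variability of the estimates across replications.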
In addition, the overestimation of the guessing parameters became more severe for the 4PL IRT model under the  However, among these study conditions, the value of the c-g parameter and the number of factors/attributes in the data generation were the only study conditions that made a significant difference in the bias of the guessing parameters from the 4PL IRT model, considering the overlap between the 95% confidence intervals (the 95% confidence interval of the average bias for guessing parameters was between .03 and .04 when c-g parameters = .0, between .07 and .11 when c-g parameters = .15, and between .10 and .15 when c-g parameters = .3; between .04 and .05 when K = 3, and between .10 and .14 when K = 5; see Figure 3a and Figure 3d, respectively). Similar results were found for the bias of the slipping parameters (see Figure 4). However, as with the RMSE of the slipping parameters, the overestimation of the slipping parameters was more severe under the conditions with a larger d-s parameter, rather than a larger c-g parameter, in the data generation when the 95% confidence intervals of the average bias for the slipping parameters were taken into consideration across the study conditions (i.e., the 95% confidence interval of the average bias for slipping parameters was between .03 and .04 when d-s parameters = .0, between .07 and .11 when d-s parameters = .15, and between .10 and .15 when d-s parameters = .3; but the 95% confidence interval of the average bias for slipping parameters was between .05 and .09 when c-g parameters = .0, between .06 and .10 when c-g parameters = .15, and between .08 and .12 when c-g parameters = .3; see Figure 4a and 4b).

Journal of Measurement and Evaluation in Education and

DISCUSSION and CONCLUSION
Multiple-choice items are a popular item type in educational and psychological assessments. However, in a test with multiple-choice items, some test-takers may guess a correct answer (i.e., guessing effect) or miss it because of anxiety or carelessness (i.e., slipping effect). The estimates of item parameters and test-takers' abilities might be biased when the guessing effect and/or the slipping effect are not modeled in data analyses. The DINA model and the 4PL IRT model account for the guessing and slipping effects by including a parameter for the guessing effect (i.e., the g parameter in the DINA model and the c parameter in the 4PL IRT model) and a parameter for the slipping effect (i.e., the s parameter in the DINA model and the d parameter in the 4PL IRT model) when analyzing data and estimating model parameters such as item parameters and test-takers' abilities. The current simulation study aimed to compare the estimated c-g and d-s parameters from the 4PL IRT model and the DINA model by manipulating the number of attributes, the level of correlation among attributes, test length, the level of the g parameter, and the level of the s parameter.
The research findings indicate that the guessing and slipping parameters were estimated correctly across all study conditions when the DINA model was used to analyze the datasets in the study (e.g., the RMSEs of the guessing and slipping parameters were almost zero across all study conditions). The good performance of the DINA model is consistent with the results in the literature (e.g., Chiu, 2008;de la Torre & Lee, 2010). However, an important limitation of the current study is the use of the DINA model for data generation. Fitting the correct model (i.e., DINA model) might be a possible reason for the estimation of slipping and guessing parameters correctly. Thus, it might be helpful to use an empirical dataset for the evaluation of guessing and slipping parameters estimated via 4PL IRT and DINA models in a future study.
A typical test length is 15 or 20 items to estimate the model parameters accurately in CDMs, and the model parameters are estimated more accurately via the DINA model as the sample size becomes larger (de la Torre, 2009). In the current study, the test length was fixed at 20 or 40 items, and the sample size was fixed at 3000 in the data generation. The large sample size and the long test length might be other possible reasons for the accurate estimation of the slipping and guessing parameters via the DINA model. Future work may consider investigating the impact of a shorter test length (e.g., < 15 or 20) and a smaller sample size (e.g., < 3000) on the accuracy of the guessing and slipping parameters estimated via the 4PL IRT and DINA models.
Both guessing and slipping parameters were overestimated when the 4PL IRT model was chosen to estimate these two item parameters in lieu of the DINA model. The number of attributes made a significant difference in the overestimation of both guessing and slipping parameters when the 4PL IRT model was fit to the data. The overestimation of the guessing and slipping parameters from the 4PL IRT model became more severe when the number of attributes was greater in the data generation. As the number of attributes increased for conditions with the same test length, there were fewer items per attribute. Parameter estimates tend to be more biased for a shorter test (Hulin, Lissak, & Drasgow, 1982). This might be a possible reason for the more severe overestimation of the guessing and slipping parameters under the conditions with a greater number of attributes given the same test length.
The value of the guessing parameters in the data generation was another significant study condition for the estimation of guessing parameters through the 4PL IRT model. The guessing parameters were overestimated more under the conditions with a larger guessing parameter in the data generation. This was not consistent with the results from DeMars' (2007) study, where the overestimation was more severe for the conditions with a lower guessing parameter. DeMars fit a unidimensional 3PL IRT model to datasets that followed a multidimensional 3PL IRT model, whereas we analyzed datasets with a multidimensional factor structure and a slipping effect by fitting a multidimensional 4PL IRT model. In addition, due to the small sample size (i.e., 1000), the estimated guessing parameters were biased towards the mean of the prior distribution (i.e., .2) in DeMars' study (i.e., the bias = .05, .02, .01, -.01, and -.03 for c = .10, .15, .20, .25, and .30, respectively). However, a relatively larger sample size (i.e., 3000) was used in the current study. These might be some possible reasons for the difference between the findings. Although the average bias of the guessing parameters became larger for the 4PL IRT model under the conditions with a higher slipping parameter, a weaker correlation among attributes, and a shorter test in the data generation, the bias difference was not significant considering the overlap between the 95% confidence intervals. This is consistent with the findings in the literature considering the impact of test length and correlation among attributes (e.g., Hulin et al., 1982; Svetina, Valdivia, Underhill, Dai, & Wang, 2017).
When the slipping parameters were estimated through the 4PL IRT model, the overestimation of slipping parameters was more severe under the conditions with a greater slipping parameter in the data generation. However, the bias of the slipping parameters from the 4PL IRT model did not differ across the different levels of the guessing parameters, the correlation among attributes, and the test length in the data generation when the 95% confidence interval of the average bias was taken into consideration. The findings related to the estimated slipping parameters may not generalize to other study conditions, and more studies are needed to investigate parameter recovery in the 4PL IRT model under different study conditions. For example, as mentioned before, the sample size was not manipulated in the current study, and the chosen sample size was limited to 3000 for data generation. However, it is common to use a sample size of less than 3000 in the literature (see Conway & Huffcutt, 2003; Henson & Roberts, 2006; Jackson, Gillaspy, & Purc-Stephenson, 2009). Although it is recommended that the sample size for running a 3PL model or a DINA model should be larger than 1000 to obtain accurate parameter estimates, there is no rule of thumb for the required sample size of the 4PL IRT model (Hulin et al., 1982). Accordingly, the sample size (e.g., < 3000) might be manipulated in future work to investigate the lower limit of the sample size for running a 4PL IRT model. In addition, it might be helpful to study whether the manipulation of sample size makes a difference in the estimation of slipping and guessing parameters by interacting with other study conditions such as test length and the correlation among attributes.
Although the estimated slipping and guessing parameters were more biased when datasets were analyzed through the 4PL IRT model than through the DINA model, the bias of the estimated slipping and guessing parameters from both models was reasonably small in general. Overall, the average bias of both guessing and slipping parameters was smaller than .1 across all study conditions, except the conditions with a high guessing/slipping parameter or a large number of attributes in the data generation. Accordingly, both the 4PL IRT and DINA models can be considered for analyzing datasets contaminated with guessing and slipping effects. However, it is important to consider the aforementioned limitations of the current simulation study before deciding whether the study results can be generalized to other study settings.