The Impact of Q-matrix Misspecification and Model Misuse on Classification Accuracy in the Generalized DINA Model*

This simulation study explored the impact of Q-matrix misspecification and model misuse on examinees’ classification accuracy within the generalized deterministic input, noisy “and” gate (G-DINA) model framework under different conditions. The data were generated from the saturated G-DINA model. Along with the generating model, two reduced models were fit to the data: the additive CDM (A-CDM) and the DINA model. The manipulated conditions included the number of respondents, the attribute correlation, and test length. Two types of classification accuracy were examined: overall classification accuracy and class-specific classification accuracy. Results showed that Q-matrix misspecification reduced classification accuracy more severely than model misuse. The proportion of examinees classified correctly in each latent class was related to the type of Q-matrix misspecification. Increasing the number of test items improved classification accuracy more than increasing the number of respondents.


INTRODUCTION
Researchers and educational stakeholders have increasingly demanded more formative test information (Mislevy, 2006; Roberts & Gierl, 2010; Rupp & Templin, 2008). They often wish to classify respondents with respect to their skills. Teachers, students, and parents often want to know an individual's level of skill mastery in order to facilitate that individual's development. Cognitive diagnosis models (CDMs) are used to measure respondents' knowledge structures and multiple attributes for the purpose of making classification-based decisions (Rupp, Templin, & Henson, 2010). [...] data, the Q-matrix may be misspecified. Previous research has shown that parameter estimates and classification accuracy are affected by misspecification of the Q-matrix (e.g., Rupp & Templin, 2008; Kunina-Habenicht, Rupp, & Wilhelm, 2012). Specifically, Rupp and Templin (2008) examined different types of Q-matrix misspecification under the DINA model. They found that Q-matrix misspecification led to biased parameter estimates and lower classification accuracy for the corresponding latent classes of examinees. However, questions remain, such as whether these results generalize to more general contexts. The purpose of this study is to estimate the effects of specific types of Q-matrix misspecification on examinee classification accuracy under the G-DINA model.
The rest of the manuscript is structured as follows. The theoretical framework provides an overview of the Q-matrix, the types of Q-matrix misspecification, and the generalized DINA model. The method section describes the simulation design, the model estimation, and the outcome assessment. Next, the findings of the study are presented. Lastly, the manuscript closes with a discussion of the findings.

Q-matrix
A critical step in cognitive diagnostic modeling is developing the Q-matrix, because the Q-matrix is an essential component of any CDM. The Q-matrix defines the attribute structure measured by an assessment. A J x K Q-matrix can be written as follows:
$$Q = \begin{bmatrix} q_{11} & q_{12} & \cdots & q_{1K} \\ q_{21} & q_{22} & \cdots & q_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ q_{J1} & q_{J2} & \cdots & q_{JK} \end{bmatrix}$$
where j indexes items and k indexes attributes. The element $q_{jk}$ is specified as 1 if item j requires attribute k to be answered correctly; otherwise, $q_{jk}$ is specified as 0.
The Q-matrix thus reflects the loading structure of the multiple attributes on the items. It is specified by content experts, and this specification process is a subjective activity (Rupp, Templin, & Henson, 2010). Hence, the quality of the Q-matrix determines the quality of the diagnostic information obtained from the CDM analysis.

The Generalized DINA Model Framework
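The generalized DINA model, like all other CDMs, requires a J x K Q-matrix. For item j, the G-DINA model partitions the latent classes into $2^{K_j^*}$ latent groups, where $K_j^* = \sum_{k=1}^{K} q_{jk}$ is the number of attributes required for item j. Each latent group is represented by a reduced attribute vector $\boldsymbol{\alpha}_{lj}^*$. In this study, it suffices to use the reduced vector $\boldsymbol{\alpha}_{lj}^* = (\alpha_{l1}^*, \ldots, \alpha_{lK_j^*}^*)$ instead of the full vector $\boldsymbol{\alpha}_l = (\alpha_{l1}, \ldots, \alpha_{lK})$. Each latent group has a probability of answering the item correctly, denoted $P(\boldsymbol{\alpha}_{lj}^*)$. The item response function (IRF) for the G-DINA model can be written as
$$P(\boldsymbol{\alpha}_{lj}^*) = \delta_{j0} + \sum_{k=1}^{K_j^*} \delta_{jk}\alpha_{lk} + \sum_{k'=k+1}^{K_j^*}\sum_{k=1}^{K_j^*-1} \delta_{jkk'}\alpha_{lk}\alpha_{lk'} + \cdots + \delta_{j12\ldots K_j^*}\prod_{k=1}^{K_j^*}\alpha_{lk}$$
where $\delta_{j0}$ is the intercept, $\delta_{jk}$ is the main effect due to $\alpha_k$, $\delta_{jkk'}$ is the interaction effect due to $\alpha_k$ and $\alpha_{k'}$, and $\delta_{j12\ldots K_j^*}$ is the interaction effect due to $\alpha_1, \ldots, \alpha_{K_j^*}$.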
One special case of the G-DINA model is the A-CDM, which contains only the intercept and the main effect of each attribute. The IRF for the A-CDM is defined as follows:
$$P(\boldsymbol{\alpha}_{lj}^*) = \delta_{j0} + \sum_{k=1}^{K_j^*} \delta_{jk}\alpha_{lk}$$
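The relation between the saturated and reduced item response functions can be illustrated with a small Python sketch. It assumes the identity-link form of the G-DINA model, in which the success probability is the sum of every effect whose attributes are all mastered. The function names and the effect values are hypothetical and chosen only for illustration; they are not the parameters used in this study. Dropping terms from one set of effects only shows which terms each model retains; in practice, each reduced model is estimated from the data on its own.

```python
def success_probability(alpha, delta):
    """Identity-link success probability for one item.

    alpha: tuple of 0/1 mastery indicators for the item's required attributes.
    delta: dict mapping a tuple of attribute positions to its effect;
           the empty tuple () holds the intercept.
    """
    # An effect contributes only when every attribute it involves is mastered.
    return sum(effect for subset, effect in delta.items()
               if all(alpha[k] == 1 for k in subset))


def keep_terms(delta, orders):
    """Keep only the effects whose order (number of attributes) is in `orders`."""
    return {subset: effect for subset, effect in delta.items()
            if len(subset) in orders}


# Hypothetical effects for a two-attribute item (illustrative values only).
saturated = {(): 0.2, (0,): 0.1, (1,): 0.1, (0, 1): 0.5}
additive = keep_terms(saturated, {0, 1})      # A-CDM: intercept + main effects
conjunctive = keep_terms(saturated, {0, 2})   # DINA: intercept + highest-order interaction

for alpha in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(alpha,
          round(success_probability(alpha, saturated), 3),
          round(success_probability(alpha, additive), 3),
          round(success_probability(alpha, conjunctive), 3))
```

In this toy item, the DINA-type restriction gives every group that lacks at least one required attribute the same intercept-level probability, whereas the additive restriction lets each mastered attribute raise the probability separately.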

Simulation Study
The simulation study aimed to examine the effects of Q-matrix misspecification and CDM misuse on classification accuracy. All data generation and estimation were conducted in R (R Core Team, 2016). The data were generated using the saturated G-DINA model. The number of respondents, the correlation between attributes, and the number of items in a test were manipulated, resulting in 12 data-generating conditions with 1000 replications each. Each generated dataset was analyzed with three CDMs within the G-DINA framework: the G-DINA, A-CDM, and DINA models. Six Q-matrices were examined: 1 correctly specified Q-matrix and 5 misspecified Q-matrices. In total, there were 216 analysis settings, formed by crossing the 18 estimation settings (3 models by 6 Q-matrices) with the 12 data-generating conditions.
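The counts of conditions stated above can be checked by enumerating the design. The following Python sketch only crosses the factor levels reported in this study; the variable names are illustrative, and the study itself was run in R.

```python
from itertools import product

# Factor levels as reported in the simulation design.
sample_sizes = [500, 1000, 5000]
correlations = [0.4, 0.8]
test_lengths = [14, 28]
models = ["G-DINA", "A-CDM", "DINA"]
q_matrices = ["qt", "qu3", "qu2", "qo1", "qo2", "qm"]

# Data-generating conditions: N x rho x J.
generating_conditions = list(product(sample_sizes, correlations, test_lengths))
# Estimation settings: fitted model x Q-matrix.
estimation_settings = list(product(models, q_matrices))
# Every estimation setting is applied to every generating condition.
analysis_cells = list(product(generating_conditions, estimation_settings))

print(len(generating_conditions), len(estimation_settings), len(analysis_cells))
```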

Number of respondents
Three levels of the number of respondents, reflecting small, moderate, and large samples, were investigated in this study: N = 500, 1000, and 5000. Previous research has shown that sample size is a relevant factor that influences model fit, parameter estimates, and classification (Chen, de la Torre, & Zhang, 2013; Cui, Gierl, & Chang, 2012; de la Torre, 2009; de la Torre & Douglas, 2004; Shu, Henson, & Willse, 2013). Several studies have shown that the number of respondents should be at least 500 in order to obtain acceptable model fit and relatively accurate parameter estimates, even when a reduced model is used as the generating model (Chen et al., 2013; Cui et al., 2012; Shu et al., 2013). The pilot study indicated that the model fit reached an acceptable level when the sample size increased to 500.

Number of attributes
This study focused on a single level of the number of attributes, K = 4. A review of CDM simulation studies indicates that assessments are usually designed with three to eight attributes, which also reflects the number of attributes in application examples (Cheng, 2009; Chen et al., 2013; DeCarlo, 2012; de la Torre, 2009; de la Torre & Douglas, 2004; Huebner & Wang, 2011; Kunina-Habenicht et al., 2012). Considering all the other factors manipulated in the simulation and the fairly large estimation effort, the number of attributes was fixed at four in this study.

Marginal attribute difficulty
Respondents' true attribute patterns were generated from a multivariate normal distribution for the latent attributes, defined by a mean vector and a correlation matrix. In this study, the mean vector (0, 0, 0, 0) was used for the four-attribute test; this led to the same marginal mastery proportion of .50 for all attributes. This mean vector is also called the marginal attribute difficulty.

Correlation between attributes
Two levels of attribute correlation, .4 and .8, were used to represent moderate and high correlation, respectively (Henson, Roussos, Douglas, & He, 2008). A range of .3 to .9 for the tetrachoric correlation is typical in educational assessment and CDM research (Cui et al., 2012; Henson, Templin, & Douglas, 2007; Kunina-Habenicht et al., 2012). A weakly correlated level could have been included as a contrast, but we chose not to do so in order to keep the overall simulation and estimation manageable. The correlations were set to be equal across all attribute pairs in the correlation matrix.
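A minimal sketch of how attribute patterns with these properties could be generated is given below. It rests on two assumptions that are not stated in the text: each latent normal variable is dichotomized at zero (which yields a marginal mastery proportion of .50 for a zero mean), and the equal correlations are produced with a single common factor, which is one standard way to obtain an equicorrelated multivariate normal vector. The function name is hypothetical, and rho here is the correlation of the latent normal variables, not of the dichotomized attributes.

```python
import math
import random


def simulate_attribute_patterns(n, k=4, rho=0.4, seed=0):
    """Draw n binary attribute patterns from equicorrelated latent normals.

    Each latent variable is z_k = sqrt(rho) * f + sqrt(1 - rho) * e_k with
    independent standard normal f and e_k, so every pair has correlation rho.
    An attribute counts as mastered when its latent variable exceeds 0 (assumed cutoff).
    """
    rng = random.Random(seed)
    loading = math.sqrt(rho)
    residual = math.sqrt(1.0 - rho)
    patterns = []
    for _ in range(n):
        common = rng.gauss(0.0, 1.0)
        latent = [loading * common + residual * rng.gauss(0.0, 1.0)
                  for _ in range(k)]
        patterns.append(tuple(1 if z > 0.0 else 0 for z in latent))
    return patterns


patterns = simulate_attribute_patterns(5000, k=4, rho=0.8, seed=1)
mastery = [sum(p[a] for p in patterns) / len(patterns) for a in range(4)]
print([round(m, 3) for m in mastery])
```

With a zero mean vector, each marginal mastery proportion should be close to .50, and a higher rho concentrates respondents in the patterns 0000 and 1111.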

Q-matrix specification
The number of items in a test was set to two levels in this study: J = 14 and 28. The number of items and the number of attributes measured in a test are associated. For K = 4, the number of all possible attribute patterns is $2^4 = 16$, of which 15 are nonzero. Considering the computational time, we set the maximum number of attributes assessed by an item to three. Items 1 to 14 in Table 1 show the Q-matrix used for data generation when J = 14. This simulation design also covered conditions where the test length is equal to and greater than the number of possible attribute patterns. The Q-matrix for J = 28 was formed by duplicating the Q-matrix for J = 14. The correctly specified Q-matrix for J = 28 is also shown in Table 1; the Q-matrix for J = 14 is embedded in it as a subset.
Three types of Q-matrix misspecification were investigated: under-fitting the Q-matrix (specifying a 1 as 0), over-fitting the Q-matrix (specifying a 0 as 1), and a balanced misfit (exchanging 0s and 1s). As shown in Table 2, taking the test with J = 14 items as an example, qt-14 was the true Q-matrix used for data generation. Two under-specified Q-matrices were created: qu3-14 changed all 3-attribute items into 2-attribute items, and qu2-14 changed all 2-attribute items into 1-attribute items; in both cases, the attribute to be deleted was selected at random for each item. Similarly, two over-specified Q-matrices, qo1-14 and qo2-14, were created by randomly selecting the attribute to be added. To create the balanced misfit Q-matrix (qm-14), the items to be altered were first selected at random; then the attributes to be altered were selected at random for each item.
The assessment with J = 28 items contained twice as many items as the assessment with J = 14. The misspecification of the Q-matrix for J = 28 occurred only in items 1 to 14; items 15 to 28 always remained the same as in the true Q-matrix (qt-28). In this way, the number of misspecified items for J = 28 was the same as for J = 14 within each type of misspecification, which made the results comparable across test lengths.
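The construction of the under- and over-specified Q-matrices can be sketched in Python as follows. The sketch assumes that the 14 generating items are the 14 q-vectors that measure one, two, or three of the four attributes, and that qo2 adds an attribute to the two-attribute items; the actual rows and their order are those given in Tables 1 and 2. The function names and the seed are hypothetical, and the balanced misfit qm is omitted because the number of items altered is not specified here.

```python
import random
from itertools import combinations


def build_q_matrix(k=4, max_attrs=3):
    """All q-vectors measuring 1..max_attrs of k attributes (14 rows for k=4)."""
    rows = []
    for size in range(1, max_attrs + 1):
        for subset in combinations(range(k), size):
            rows.append([1 if a in subset else 0 for a in range(k)])
    return rows


def under_specify(q, n_attrs, rng):
    """Delete one randomly chosen attribute from every item measuring n_attrs attributes."""
    out = [row[:] for row in q]
    for row in out:
        if sum(row) == n_attrs:
            drop = rng.choice([a for a, v in enumerate(row) if v == 1])
            row[drop] = 0
    return out


def over_specify(q, n_attrs, rng):
    """Add one randomly chosen attribute to every item measuring n_attrs attributes."""
    out = [row[:] for row in q]
    for row in out:
        if sum(row) == n_attrs:
            add = rng.choice([a for a, v in enumerate(row) if v == 0])
            row[add] = 1
    return out


rng = random.Random(2016)  # arbitrary seed for reproducibility
qt_14 = build_q_matrix()
qu3_14 = under_specify(qt_14, 3, rng)
qu2_14 = under_specify(qt_14, 2, rng)
qo1_14 = over_specify(qt_14, 1, rng)
qo2_14 = over_specify(qt_14, 2, rng)  # assumption: qo2 alters the two-attribute items

# J = 28: only items 1-14 are misspecified; items 15-28 stay as in the true Q-matrix.
qt_28 = qt_14 + [row[:] for row in qt_14]
qu3_28 = qu3_14 + [row[:] for row in qt_14]
```

Because only the first 14 rows differ between qt_28 and qu3_28, the number of misspecified items is the same for both test lengths, as described above.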

Item parameter specification for data generation
The parameter settings were based on an empirical study (Basokcu, Ogretmen, & Kelecioglus, 2013). The true item parameters used in this simulation study ranged from 0.12 to 0.68; the detailed values are presented in Table 3. For simplicity, all one-attribute items used the same parameter setting, and the same approach was followed for the two- and three-attribute items.

Model selection
Each of the generated datasets was analyzed with three CDMs within the G-DINA framework. The true generating model was the G-DINA model. In addition to the true model, two misused CDMs were used to analyze the data. Misuse of a CDM refers to an incorrect parameterization in the modeling process. Of the two comparison models, the A-CDM contained only the intercept and main effects for each item, and the DINA model contained only the intercept and the highest-order interaction effect for each item.
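The three parameterizations can be written side by side as restrictions of the saturated item response function. The notation below follows the common identity-link presentation of the G-DINA framework and is an assumption of this sketch: $K_j^*$ is the number of attributes required by item j, $\alpha_{lk}$ indicates whether latent group l has mastered attribute k, and the $\delta$ terms are the intercept, main effects, and interaction effects of item j.

```latex
\begin{align*}
\text{G-DINA:}\quad P(\boldsymbol{\alpha}_{lj}^{*})
  &= \delta_{j0} + \sum_{k=1}^{K_j^{*}} \delta_{jk}\alpha_{lk}
   + \sum_{k'=k+1}^{K_j^{*}} \sum_{k=1}^{K_j^{*}-1} \delta_{jkk'}\alpha_{lk}\alpha_{lk'}
   + \cdots + \delta_{j12\ldots K_j^{*}} \prod_{k=1}^{K_j^{*}} \alpha_{lk} \\
\text{A-CDM:}\quad P(\boldsymbol{\alpha}_{lj}^{*})
  &= \delta_{j0} + \sum_{k=1}^{K_j^{*}} \delta_{jk}\alpha_{lk} \\
\text{DINA:}\quad P(\boldsymbol{\alpha}_{lj}^{*})
  &= \delta_{j0} + \delta_{j12\ldots K_j^{*}} \prod_{k=1}^{K_j^{*}} \alpha_{lk}
\end{align*}
```

For a one-attribute item ($K_j^* = 1$), the three forms coincide, so the models differ only on items that require two or more attributes.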

Outcome Measures
Classification accuracy (CA) is defined as the degree to which the classification of examinees into latent classes based on observed data agrees with the examinees' true latent classes (Cui et al., 2012). The simulated examinee attribute patterns were used as the examinees' true latent classes; the attribute patterns estimated from the response data using maximum likelihood estimation (MLE) were used as the estimated latent classes. The simulated and estimated latent classes were then compared for each examinee. If they were consistent, a value of "1" was assigned to the examinee to represent accurate classification; otherwise, a value of "0" was assigned to represent inaccurate classification. Averaging the 0/1 values over all examinees and all replications gave the overall correct classification rate for each condition, referred to as overall classification accuracy (OCA). Averaging the 0/1 values over the examinees within each latent class gave the class-specific correct classification rates, referred to as class-specific classification accuracy (CCA). To simplify the interpretation of the findings, the CCA was calculated for one generating condition (N = 5000, ρ = .4, and J = 14) fitted with the various CDMs and Q-matrices. The OCA and CCA were then compared across all the estimation settings.
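The two outcome measures can be sketched in Python for a single replication as follows. The function name and the toy patterns are illustrative assumptions; averaging over replications, as done in the study, would be an additional step.

```python
from collections import defaultdict


def classification_accuracy(true_patterns, est_patterns):
    """Return (OCA, CCA) for one replication.

    OCA is the proportion of examinees whose estimated pattern equals the
    true pattern; CCA is the same proportion within each true latent class.
    """
    if len(true_patterns) != len(est_patterns):
        raise ValueError("pattern lists must have the same length")
    # 1 = classified accurately, 0 = classified inaccurately.
    hits = [int(t == e) for t, e in zip(true_patterns, est_patterns)]
    oca = sum(hits) / len(hits)
    by_class = defaultdict(list)
    for t, h in zip(true_patterns, hits):
        by_class[t].append(h)
    cca = {cls: sum(h) / len(h) for cls, h in by_class.items()}
    return oca, cca


# Toy example with two attributes and four examinees (made-up data).
true_patterns = [(1, 1), (1, 1), (0, 0), (0, 1)]
est_patterns = [(1, 1), (1, 0), (0, 0), (0, 0)]
oca, cca = classification_accuracy(true_patterns, est_patterns)
print(oca, cca)
```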

RESULTS
In CDM estimation, classification is usually of primary interest because decisions about examinees are based on it (Rupp, Templin, & Henson, 2010). Two types of classification accuracy are reported in this section: the OCA and the CCA. The overall classification rates are presented in Table 4, and the impact of CDM misuse and Q-matrix misspecification on OCA is examined in Figure 1. First, as shown in Table 4, the correct overall classification rates were much higher when test length increased. For example, in the G-DINA model with the qt matrix, the correct overall classification rate rose from .711 to .886 as test length increased from J = 14 to 28, holding ρ = .4 and N = 500. This is expected because the items provide more pieces of information on each dimension for determining the classification. Second, as the sample size increased, the overall classification rates increased slightly in all conditions. For example, again in the G-DINA model with the qt matrix, the overall classification accuracy increased from .886 to .893 as sample size increased from 500 to [...]. Comparing the effects of J and N on classification accuracy, we can see that adding items to a test matters more than adding examinees for improving classification accuracy. Third, an increase in attribute correlation slightly increased the overall classification accuracy, with a few exceptions.

Figure 1. Overall Classification Accuracy (OCA) by CDM and Q-matrices
To investigate the effects of CDM and Q-matrix misspecification, the correct overall classification rates are shown in Figure 1. For simpler illustration, the classification rates in this figure were collapsed over the other factors N, J, and ρ.
For CDM misuse, Figure 1 showed that the overall classification accuracy was highest in the G-DINA model regardless of which Q-matrix was used. This makes sense because G-DINA was the generating model. Of the other two CDMs, the A-CDM had higher classification rates than the DINA model. The A-CDM yielded overall classification rates very similar to those of the true G-DINA model, even though it contained only the main effects of the attributes and omitted all interactions. The DINA model, which contained only the highest-order interaction among attributes, showed the lowest classification rates of the three CDMs.
For Q-matrix misspecification, the condition qt was the correct Q-matrix and served as the baseline in each CDM. Figure 1 showed that the OCA for qt was higher than for the misspecified Q-matrices in all three CDMs. The effects of the misspecified Q-matrices on classification accuracy were then compared with the true Q-matrix within each CDM. The classification rates in the G-DINA model and the A-CDM showed similar patterns across the Q-matrix misspecifications. Within these two models, the OCA for the conditions qu3, qo1, and qo2 was close to the rates for qt. The misspecified qu2 had lower overall classification rates, and the misspecified qm showed the lowest overall classification rates. This is not surprising because qm included all types of misspecification. In the DINA model, the OCA was highest for qt; the condition qu3 yielded almost the same results as qt; and the lowest classification rates again occurred for qm. The Q-matrices qo2 and qu2 yielded moderate classification rates in the DINA model.

Class-specific Classification Accuracy (CCA)
When we examined the respondents' classification at each latent class level, it was worthwhile to note that classes with more attributes tended to have generally higher classification accuracy in various CDMs and Q-matrices (Table 5). For example, in the G-DINA and qt condition, the CCA ranged from 0.587 to 1 for the class with no attribute to the class with all attributes. The attribute class in which all attributes were mastered (attribute pattern 1111) maintained very high correct classification rates no matter which CDM and Q-matrix were used. Especially in the DINA model, the misclassification of examinees in this attribute class never occurred.
Comparing the different CDMs, the G-DINA model yielded the highest CCA in almost all latent classes, with few exceptions. When qt was used in the G-DINA model, the correct classification rates for one-attribute mastery classes were at least 65%, and these rates reached at least 80% and 90% for two- and three-attribute mastery classes, respectively. The A-CDM performed better than the DINA model in the classes with zero, one, and two attributes. The CCA for qt with the A-CDM was approximately .6 for one-attribute mastery classes, .75 for two-attribute mastery classes, and .87 for three-attribute mastery classes.
However, the DINA model had higher than expected classification accuracy in three- and four-attribute mastery classes, even with misspecified Q-matrices. More specifically, for the three-attribute mastery classes, the CCA values of the DINA model using qt were .922, .908, .908, and .925, and the G-DINA model using qt had almost the same classification accuracy. For qu2, qo1, and qo2, the CCA of the DINA model was slightly lower than that of the G-DINA model and higher than that of the A-CDM in the three-attribute classes. For qu3, the DINA model even performed best among the three CDMs in the classification accuracy of the three-attribute latent classes (.887, .906, .906, and .925).
Considering Q-matrix misspecification, the class-specific classification rates were related to the type of misspecified Q-matrix (under-, over-, or mixed misspecification). The G-DINA model and the A-CDM showed a similar pattern: the over-specified Q-matrices (qo1 and qo2) did not have much impact on the class-specific classification accuracy, whereas the under-specified Q-matrices, especially qu2, had much lower CCA. In the DINA model, by contrast, the misspecified qu2 and qo2 seemed to have a more severe impact on CCA, and qo1 mainly affected the correct classification rates of the classes with fewer attributes. The misspecified qm showed the lowest classification rates in all three fitted models, and the low class-specific classification rates occurred in almost all attribute classes.
Furthermore, we noticed that the low class-specific classification rates occurred in the attribute classes that matched the manipulated items. For example, for the misspecified qu2, where two-attribute items were changed into one-attribute items, the correct classification rates of two-attribute mastery classes (e.g., attribute class [1100]) dropped a great deal compared with the qt condition. The correct classification rates of one-attribute mastery classes (e.g., attribute class [0001]) decreased as well in all three CDMs. Unlike in the G-DINA model and the A-CDM, in the condition qo1, where one-attribute items were changed into two-attribute items, the classification rates of the one-attribute classes in the DINA model were very low, which matched the manipulated items. In the condition qo2, the CCA of two-attribute mastery classes was low as well in the DINA model.

DISCUSSION and CONCLUSION
The G-DINA model offers a flexible framework for investigating issues in examinees' diagnostic classification. The specification of the Q-matrix and the choice of CDM play a critical role in achieving better classification accuracy. This study helps to better understand the effects of CDM misuse and Q-matrix misspecification on classification accuracy under various conditions. Different factors, such as the number of test items, the number of examinees, and the attribute correlation, all have some impact on examinees' classification. The outcome of CDMs provides meaningful formative test information about each examinee's proficiency on the multiple attributes measured. Although this study is sufficiently complex, it can clearly be extended by using a broader range of design conditions.
This simulation study contributed in the following four aspects. First, the G-DINA model was used as a framework, in line with the trend in CDM development. The data were simulated from the saturated model and fitted with two reduced models as well as the saturated model, which aligns better with the practice of real data analysis. Second, Q-matrix misspecification and CDM misuse were investigated both separately and jointly. Third, the under-, over-, and mixed misspecified Q-matrices allowed us to detect more specific effects of Q-matrix misspecification under various conditions in a generalized CDM framework. Fourth, the overall classification accuracy and the class-specific classification rates (often the primary interest in CDM analysis) were investigated under different conditions.
Both the number of respondents and test length showed clear positive effects on classification accuracy. Regardless of the model selected and the Q-matrix specified, increasing the number of respondents and/or test items always increased the correct classification rates. One noticeable finding is that increasing test length improved classification accuracy more dramatically than increasing sample size. This gives practitioners useful guidance on which factors to manipulate in order to improve examinees' classification accuracy effectively.
Our results also demonstrated that model misuse does not noticeably affect the overall classification accuracy, even though the G-DINA model still maintained the highest level of classification accuracy. We simulated data from the saturated G-DINA model to mimic a complex empirical situation. When estimating the data with various CDMs, we found that the models performed differently across examinees' latent classes. For examinees who have mastered fewer attributes (e.g., one or two attributes), the G-DINA and A-CDM models yielded more accurate classification than the DINA model. The A-CDM showed better classification accuracy in the class with no mastered attributes. This may be due to the structure of the G-DINA and A-CDM models, which contain the main effects. For examinees who have mastered more attributes (e.g., three or four attributes), the DINA model, which contained only the highest-order interaction, had higher than expected classification accuracy even with Q-matrix misspecification. Given these findings, although the A-CDM is easier to interpret in practice, it may be worth considering higher-order interaction effects when there is a large number of attributes.
One important finding of this study is that misspecification of the Q-matrix affected the overall classification accuracy more noticeably than model misuse. In practical applications, the true Q-matrix is unknown, and the Q-matrix may be misspecified during the design process. As expected, the true Q-matrix yielded the most accurate classification. In general, the under-specified Q-matrices had a more severe impact on CA than the over-specified Q-matrices, especially in the models with main effects. The misspecified Q-matrix qm was the most problematic because the correct classification rates were low in almost all conditions. Although the number of attributes was held constant in qm, a large number of misspecifications occurred. The qm contained all types of misspecification and represented the most severe misspecification. Thus, it is not only the number of misspecified items that matters but also the type of misspecification. The attribute structure, rather than the number of attributes per item, is a much more important component in the diagnosis process. In practice, we may face uncertainty in determining whether an item measures a given attribute. We suggest that over-specification may be better than under-specification.
Besides the effect on overall classification rates, the different types of misspecified Q-matrices also affected the corresponding latent classes. When a certain attribute combination is not represented in the Q-matrix, the respondents who have mastered that attribute combination are more likely to be misclassified. A typical example in all three CDMs is the misspecified qu2, where two-attribute items were changed into one-attribute items. The classification rates decreased noticeably in the corresponding two-attribute mastery classes in all three CDMs. Thus, inferences about examinees in the associated classes should be made more cautiously.
Moreover, the effects of the differently specified Q-matrices on classification accuracy varied across the three CDMs. For example, the over-specified Q-matrices (qo1 and qo2) affected the DINA model more severely than the G-DINA model and the A-CDM. The balanced misfit Q-matrix qm had a more dramatic negative effect on the classification rates in the DINA model than in the other two models. This may be due to the different features of the three CDMs. The saturated G-DINA model contains the main effects and all orders of interaction, the A-CDM contains the main effects only, and the DINA model includes only the highest-order interaction. In sum, the G-DINA model had a more stable performance across all latent classes under Q-matrix misspecification, although the A-CDM performed well in zero- and one-attribute mastery classes and the DINA model showed high classification accuracy in three- and four-attribute mastery classes.
Regardless of the type of CDM and Q-matrix, it was noteworthy that examinees in the latent classes with more attributes had higher classification accuracy, whereas examinees in the latent classes with fewer attributes could not be classified accurately. This becomes an important consideration in practice when applying these CDMs to identify mastery and non-mastery of multiple attributes, especially for examinees at the lower end. The attribute class mastering all attributes almost never showed any misclassification, while the attribute class with no attributes had low correct classification rates. To address the possible reasons for this phenomenon, future research may examine the impact of item difficulty and the distribution of attribute patterns.
In practice, the importance of a diagnostic test development framework and of Q-matrix validation methods should be emphasized. After the Q-matrix is designed, we recommend validating it using the methods proposed in de la Torre (2008) and de la Torre and Chiu (2016) to check for possible misspecification. Yet it is not easy to evaluate the correctness of the Q-matrix because of its subjective nature and its complexity when applied to the model. When there is uncertainty in determining whether an item measures a given attribute, over-specification may be better than under-specification. When classifying examinees into latent groups, the choice of CDM may depend on which group of examinees is of greater concern. The saturated model usually yields more stable classification accuracy across all latent classes. A model with higher-order interactions should be considered when there is a large number of attributes, although a model with only main effects is easier to interpret. We hope that the findings of this study will provide some insights for practitioners and researchers in determining the Q-matrix and the cognitive diagnostic model in various situations.