Examining Invariant Item Ordering Using Mokken Scale Analysis for Polytomously Scored Items

The aim of the present study is to identify and compare the number of items violating the item ordering, the total number of item pairs causing violation, the average test statistics, and the H^T values of the overall test obtained from three Mokken IIO methods in simulated datasets generated by the graded response model. The simulation design comprised 108 cells: 3 (minimum coefficient of a violation) x 2 (item discrimination levels) x 3 (sample sizes) x 2 (number of items) x 3 (response categories). The MIIO, MSCPM and IT methods were used for data analysis. Overall, the MIIO method yielded the most stable values, because it was not affected by the lowest violation coefficient and was affected only slightly by the other simulation conditions. The MIIO method is recommended for identifying item ordering, especially in conditions where the violation coefficient is 0.03 (the default value in the Mokken package). Although the MSCPM method yielded findings similar to those of the IT method, it generated more stable findings, particularly with large sample sizes. The MSCPM method is recommended in conditions where sample size, number of items and item discrimination are high.


INTRODUCTION
A high score on a psychological test measuring personality or interests generally indicates positive responses regarding the related trait, while a high score on a cognitive test measuring ability indicates a better performance on the related cognitive trait. For example, an arithmetic question such as [...] It is possible to think that the latter indicates introversion more than the former does. However, in practice, many people prefer to do their work on their own, although they are not introverts. Such cases show that it is wrong to establish the item order by considering item means alone. However, it is possible for a group of items to have an invariant item ordering (IIO) and to have a structure by identifying a level of grouping (Ligtvoet et al., 2010, p. 2).
IIO was developed to overcome the problems that can stem from ordering test items based solely on item difficulty (Sijtsma and Junker, 1996). IIO is the situation in which the order of the items is the same for all participants. The benefits of IIO have been demonstrated from various perspectives. IIO is defined within the scope of item response theory (IRT). To determine the IIO of test items, the items should satisfy the assumptions of IRT models. Sijtsma and Junker (1996) showed that IIO holds only in IRT models in which the item response functions (IRFs) do not intersect. In dichotomously scored datasets, IIO applies only to the Rasch (1960) model and the double monotonicity model (DMM) (Mokken and Lewis, 1982). In polytomously scored datasets, on the other hand, IIO applies only to the rating scale model (Andrich, 1978) and the restricted graded response model (Muraki, 1990).
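Stated formally, the IIO property described above can be written as a chain of inequalities. The notation is assumed here for illustration and is not taken from the source: X_i is the score on item i, θ is the latent variable, and k is the number of items, numbered in their invariant order.

```latex
% Items are numbered so that the ordering of the conditional expected
% item scores is the same at every value of the latent variable.
E(X_1 \mid \theta) \le E(X_2 \mid \theta) \le \dots \le E(X_k \mid \theta)
\quad \text{for all } \theta
```

Ties are allowed, but no pair of items may change order across θ, which is why intersecting IRFs rule out IIO.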
The IIO methods are the manifest invariant item ordering (MIIO) method, the manifest scale of the cumulative probability model (MSCPM) method and the increasingness in transposition (IT) method, all of which are addressed within the scope of Mokken scale analysis (MSA) (Van der Ark, 2012). These are nonparametric methods that require very few assumptions (unidimensionality, latent monotonicity, non-intersection). Each method can generate a fixed item order and identify the items that violate this order (Ligtvoet, Van der Ark, Bergsma, and Sijtsma, 2011). The MIIO method was developed to identify whether or not the item response functions of polytomously scored items intersect. MSCPM examines the manifest item step response functions for each item pair. However, this method has some disadvantages in practice. Because it compares each item pair individually, it yields an excessive number of comparisons. For this reason, it tends to suggest that all the items lead to violations. Compared to the other methods, the MSCPM method has the potential to yield a higher number of violating items (McGrory, 2015). In the related literature, there is very limited information regarding the details of these methods.
The items violating IIO are first identified and then removed from the test one at a time. This process continues until no IIO-violating items remain in the test. Subsequently, the person scalability coefficient (H^T), which is a measure of individuals' fit, is calculated. This coefficient resembles the H coefficient, but it is obtained from the transposed data matrix. The H^T coefficient, which takes values in the range 0 ≤ H^T ≤ 1, was developed by Sijtsma and Meijer (1992) to determine the model-data fit of the DMM. High values obtained under the DMM indicate that the person ordering is invariant. In other words, the order of the items does not depend on the group of individuals; it is invariant. Negative H^T values indicate a violation of the non-intersection assumption (Ligtvoet et al., 2010, 2011). According to Sijtsma, Meijer and Van der Ark (2011), the H^T coefficient is as important as the other scalability coefficients (H, Hi, Hij) because it shows to what extent the person ordering is independent of the Guttman error. However, it is more sensitive than the other scalability coefficients in many respects. IIO values are obtained in situations where IRFs are not close to each other. This shows that the H^T coefficient should not be used for the purpose of evaluating the quality of a measurement.
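To make the relation between H and H^T concrete, the following is a minimal sketch, assuming the usual covariance-ratio form of the scalability coefficient (sum of observed covariances divided by the sum of the maximum covariances attainable given the marginals). The function names are hypothetical, and the Mokken package may handle details such as respondents with constant scores differently.

```python
from itertools import combinations


def covariance(x, y):
    """Population covariance of two equally long score vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n


def max_covariance(x, y):
    """Largest covariance attainable with the same marginals.

    Pairing the sorted values of both vectors maximizes the covariance.
    """
    return covariance(sorted(x), sorted(y))


def scalability(vectors):
    """Ratio of summed covariances to summed maximum covariances."""
    observed = 0.0
    maximum = 0.0
    for x, y in combinations(vectors, 2):
        observed += covariance(x, y)
        maximum += max_covariance(x, y)
    return observed / maximum if maximum > 0 else float("nan")


def h_items(data):
    """H: the vectors are item columns (scores across persons)."""
    return scalability([list(column) for column in zip(*data)])


def h_transposed(data):
    """H^T: same formula on the transposed matrix, so the vectors are
    person rows (scores across items)."""
    return scalability([list(row) for row in data])
```

In this sketch a perfect Guttman pattern gives a value of 1, whereas two respondents who order a pair of items in opposite ways give a negative value, in line with the interpretation of negative H^T values given above.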
MIIO is the default IIO method in the Mokken package in R software. There are numerous studies in which MIIO is applied to various scales to determine the invariant item ordering (Ahmadi, Reidpath, Allotey, and Hassali, 2016; Gibbons, Small, Rick, Burt, Hann, and Bower, 2017; Lee, Chen, Jiang, Chu, Chiu, Chen, and Chen, 2016; Ligtvoet, van der Ark, and Sijtsma, 2008; Saiepour, Najman, Clavarino, Baker, Ware, and Williams, 2014; Stewart, Allison, Baron-Cohen, and Watson, 2015; Stochl, Jones, and Croudace, 2012; Van der Graaf, Segers, and Verhoeven, 2015; Yoon, Shaffer, and Bakken, 2015). However, there are no studies in the literature regarding the use of the other two methods for IIO. Sijtsma and Meijer (1992) supported the research in which they developed the H^T coefficient with a simulation study. In that research, conducted on dichotomously scored datasets, the higher the item difficulty and item discrimination coefficients were, the higher the H^T coefficient turned out to be. Sample size and test length had a limited effect. The other properties of the item response functions and the ability parameter distributions were held constant.
The only study that compared and discussed these three methods based on real data is that of Ligtvoet et al. (2011). In this study, two small datasets were used to compare the MIIO, MSCPM and IT methods. For the eight items of the first dataset, MIIO yielded a violation in two of the 28 item pairs. Since the fifth item was common to these two item pairs, it was recommended that this item be removed from the test. The MSCPM method found violations in seven of the 63 item pairs. It was recommended that the third and sixth items be removed. The IT method was applied to the remaining five items. Violations were observed in two of the 60 item pairs. It was recommended that the first item be removed. In the second dataset, the IRFs of six item pairs were examined. While the MIIO method did not yield any violations, the IT method yielded one and the MSCPM method yielded two violations. Furthermore, in this study, Ligtvoet et al. (2011) conducted a simulation study on the sensitivity and specificity of MIIO and on the H^T coefficient. The findings of this simulation constitute the foundation of the present study.
In a pilot study on MIIO, MSCPM and IT (Ligtvoet et al., 2011), it was found that each of these methods indicated different items to be removed. When a situation contradictory to IIO emerged, MSCPM was more sensitive and generally proposed more items for removal than MIIO and IT did. The item ordering obtained from IT is expected to be stricter than that of the other methods; thus, IT would be expected to indicate more items for removal. For this reason, these preliminary findings are surprising. Another point is that these methods are not hierarchically related; that is, they examine different features of the dataset. Hence, it is normal that they flag different items for removal (Van der Ark, 2012). This finding reported by Van der Ark (2012) seems to be the result of a single study comparing these methods. Hence, further studies comparing these methods are clearly needed.

Purpose of the Study
The aim of the present study is to identify and compare the number of items violating the item ordering, the total number of item pairs causing violation, the average test statistics (t, z and χ² values) and the H^T values of the overall test obtained from three Mokken IIO methods in simulated datasets generated by the graded response model.

Data Simulation Procedures
In polytomously scored datasets, only the rating scale model (Andrich, 1978) and the restricted graded response model (Muraki, 1990) can show IIO. The study by Ligtvoet et al. (2010) showed that IRFs almost always intersected in dense regions of the latent variable y, so that it seemed safe to use the graded response model. Therefore, the graded response model was used to generate data in the present study. The simulation conditions were defined and the model was used to produce the datasets. The simulation conditions were as follows:
1. Minimum coefficient of a violation: This value, which is 0.03 by default, was set to 0.03, 0.27 and 0.45. A value of 0.00 indicates that even the slightest violation is treated as significant, whereas a value of 0.45 indicates that only a very large violation is treated as significant (Ligtvoet et al., 2011). In other words, this value serves as a criterion. A value at or near 0.00 leads to an increase in the number of items proposed for removal, and a value at or near 0.45 leads to a decrease in the number of items proposed for removal.
2. Item discrimination levels: Two item discrimination levels, namely low and high, were defined. Low discrimination values were drawn from a normal distribution with a mean of 0.5 and a variance of 1; high discrimination values were drawn from a normal distribution with a mean of 1.5 and a variance of 1. These coefficients were chosen based on the studies by Desa (2012) and Dodeen (2004). The item difficulty coefficients were drawn from a normal distribution with a mean of 0 and a variance of 1.

3. Sample size: In the present study, sample sizes were set to 100, 250 and 500. In simulation studies based on nonparametric item response theory, the sample size has been approximately 200 (Van Abswoude, Van der Ark and Sijtsma, 2004; Van Abswoude, Vermunt, Hemker, and . In the present study, sample sizes both larger and smaller than this value were therefore defined. The ability values were drawn from normal distributions.
The dependent variables of the present study were the number of items violating the order, the total number of item pairs causing violation, the average test statistics, and the H^T values of the overall test. Data generation was performed with the WINGEN 2.0 software program.
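The data generation step can be illustrated with a minimal sketch of sampling from a graded response model. This is not the WINGEN procedure. The function names are hypothetical, and several choices are assumptions of the sketch: a logistic form for the cumulative category probabilities, thresholds obtained by sorting independent draws from N(0, 1), a standard normal ability distribution, and redrawing non-positive discriminations so that the cumulative probabilities stay ordered.

```python
import math
import random


def grm_response(theta, a, thresholds, rng):
    """Draw one graded response score in 0..len(thresholds).

    P(X >= k | theta) = 1 / (1 + exp(-a * (theta - b_k))), with the
    thresholds b_k sorted in ascending order and a > 0.
    """
    u = rng.random()
    score = 0
    for b in thresholds:
        p_at_least = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        if u < p_at_least:
            score += 1
    return score


def simulate_grm(n_persons, n_items, n_categories, mean_a, seed=0):
    """Simulate a persons-by-items matrix of polytomous scores."""
    rng = random.Random(seed)
    items = []
    for _ in range(n_items):
        # Discrimination: N(mean_a, variance 1); redraw non-positive
        # values (an assumption of this sketch, not stated in the text).
        a = rng.gauss(mean_a, 1.0)
        while a <= 0:
            a = rng.gauss(mean_a, 1.0)
        # Thresholds: sorted draws from N(0, 1) (also an assumption).
        thresholds = sorted(rng.gauss(0.0, 1.0)
                            for _ in range(n_categories - 1))
        items.append((a, thresholds))
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)  # ability from a standard normal
        data.append([grm_response(theta, a, b, rng) for a, b in items])
    return data
```

For example, `simulate_grm(250, 15, 5, 1.5)` would correspond to one replication of a cell with 250 respondents, 15 items, five response categories and high discrimination.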

Data Analysis
The simulation design comprised 108 test conditions: 3 (minimum coefficient of a violation) x 2 (item discrimination levels) x 3 (sample sizes) x 2 (number of items) x 3 (response categories). By applying the MIIO, MSCPM and IT methods, which are addressed within the scope of MSA, the number of items violating the order, the total number of item pairs causing violation, the average test statistics, and the H^T values of the overall test were identified for each cell. The analyses were performed with the Mokken 2.8.10 (Van der Ark, 2007) package in R software.
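The fully crossed design can be enumerated as a quick check on the cell count. In this sketch the dictionary keys are hypothetical labels, and the levels for number of items (5 and 15) and response categories (3, 5 and 7) are taken from the values mentioned in the Results rather than from an explicit list in the design description.

```python
from itertools import product

# Factor levels of the simulation design (labels are illustrative).
design = {
    "min_violation": (0.03, 0.27, 0.45),
    "discrimination": ("low", "high"),
    "sample_size": (100, 250, 500),
    "n_items": (5, 15),
    "n_categories": (3, 5, 7),
}

# One dictionary per cell of the fully crossed design.
cells = [dict(zip(design, levels)) for levels in product(*design.values())]

print(len(cells))  # 108
```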
The H^T coefficient for dichotomously scored datasets was developed by Sijtsma and Meijer (1992). For polytomously scored items, Ligtvoet et al. (2011) developed the H^T coefficient, which is the primary dependent variable of the present study, by generalizing the interpretation of the H scalability coefficient. When IIO is examined in a dataset that can show IIO, an H^T coefficient of 0.3 or below indicates an inaccurate item ordering. A coefficient between 0.3 and 0.4 shows a low degree of accuracy in item ordering, a coefficient between 0.4 and 0.5 indicates a moderate degree of accuracy in item ordering, and one above 0.5 indicates a high degree of accuracy in item ordering (Ligtvoet et al., 2011).
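These interpretation bands can be expressed as a small lookup, sketched below. The function name and labels are hypothetical. The text assigns 0.3 to the lowest band and values above 0.5 to the highest band; assigning exactly 0.4 to the "low" band is an assumption of the sketch, since the text does not specify it.

```python
def classify_ht(ht):
    """Map an H^T value to the accuracy bands described in the text."""
    if ht <= 0.3:
        return "inaccurate"
    if ht <= 0.4:  # boundary handling at 0.4 is an assumption
        return "low"
    if ht <= 0.5:
        return "moderate"
    return "high"
```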
For IIO to be identified, the number of items leading to significant violations according to the specified lowest violation coefficient first needs to be identified. If no item causes a violation, IIO is taken to hold for all k items; otherwise, the item causing the most violations is removed from the test. Subsequently, the same procedure is repeated for the remaining (k-1)(k-2)/2 item pairs. If another item also needs to be removed, the procedure is repeated for the remaining (k-2)(k-3)/2 item pairs. This process is repeated until no items cause violations. If two or more items have the same number of violations, the item to be removed is identified by means of two different techniques. The first is to remove the item that has the lowest item scalability coefficient (Hi). The second is to decide by considering the content of the items (Ligtvoet et al., 2011; Sijtsma and Molenaar, 2002).
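The stepwise removal procedure can be sketched as a loop. The functions `n_pairs` and `backward_removal` and the `count_violations` callback are hypothetical names introduced for illustration; the callback stands in for whichever method (MIIO, MSCPM or IT) supplies the number of significant violations per item. Ties are broken by the lowest Hi only, and the Hi values are treated as fixed rather than recomputed after each removal, which is a simplification.

```python
def n_pairs(k):
    """Number of item pairs among k items: k(k-1)/2."""
    return k * (k - 1) // 2


def backward_removal(items, count_violations, h_i):
    """Remove items one at a time until no significant violations remain.

    items            : list of item labels
    count_violations : callable taking the remaining items and returning
                       a dict {item: number of significant violations}
    h_i              : dict {item: item scalability coefficient Hi}
    """
    remaining = list(items)
    removed = []
    while len(remaining) > 1:
        counts = count_violations(remaining)
        worst = max(counts.values())
        if worst == 0:
            break  # no violating items left
        tied = [item for item in remaining if counts[item] == worst]
        # Tie-break: drop the item with the lowest Hi.
        drop = min(tied, key=lambda item: h_i[item])
        remaining.remove(drop)
        removed.append(drop)
    return remaining, removed
```

As a check on the pair counts, `n_pairs(8)` gives 28, and after one removal `n_pairs(7)` gives 21, which matches (k-1)(k-2)/2 for k = 8.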
In studies where the MIIO, MSCPM and IT methods are used simultaneously, the items to be removed are those that violate the common order. The size of a violation is judged against the lowest violation coefficient, which is 0.03 by default. A decrease in this value means that even slighter violations are counted. The significance of a violation is determined with a t test (t values) in the MIIO method, a z test (z values) in the MSCPM method and a chi-square test (χ² values) in the IT method. The items causing statistically significant violations should be removed from the test sequentially; if more than one item causes a high degree of violation, the item with the lowest scalability coefficient is removed from the test (Ligtvoet, 2010).
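How a violation is compared against the lowest violation coefficient can be illustrated for the MIIO idea of comparing item means within rest-score groups. The sketch below is a deliberate simplification under stated assumptions: the function name is hypothetical, the threshold is applied in raw score units, small rest-score groups are not merged, and the t test is omitted, so it flags reversals by size only.

```python
from itertools import combinations


def mean(values):
    return sum(values) / len(values)


def miio_reversals(data, minvi):
    """List (higher_item, lower_item, rest_score, reversal) tuples.

    For each item pair, the item with the larger overall mean is
    expected to keep the larger mean in every rest-score group, where
    the rest score is the total score without the two items. A reversal
    larger than minvi is flagged.
    """
    n_items = len(data[0])
    item_means = [mean([row[i] for row in data]) for i in range(n_items)]
    flagged = []
    for i, j in combinations(range(n_items), 2):
        hi, lo = (i, j) if item_means[i] >= item_means[j] else (j, i)
        groups = {}
        for row in data:
            rest = sum(row) - row[hi] - row[lo]
            groups.setdefault(rest, []).append((row[hi], row[lo]))
        for rest, pairs in sorted(groups.items()):
            reversal = mean([p[1] for p in pairs]) - mean([p[0] for p in pairs])
            if reversal > minvi:
                flagged.append((hi, lo, rest, reversal))
    return flagged
```

Raising `minvi` in this sketch can only shrink the list of flagged reversals, which mirrors the role of the lowest violation coefficient described above.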

RESULTS
The findings regarding the number of items violating the order are presented in Table 1. The IT method could not yield findings in conditions with a sample size of 100. In almost all simulation conditions, the number of items violating the order yielded by the MSCPM and IT methods was higher than that yielded by the MIIO method. Furthermore, while both the MSCPM and IT methods were significantly affected by a change in the lowest violation coefficient, IT was the more strongly affected of the two. In conditions where the violation coefficient was 0.45, IT yielded hardly any items for removal. For example, in one simulation condition where the lowest violation coefficient was 0.03, the IT method yielded an average of 12.40 of 15 items for removal, while in another condition where the lowest violation coefficient was 0.27, it yielded an average of 1.60 items for removal. Similar examples were present for the MSCPM method as well. However, for the MIIO method, the number of items yielded for removal was quite close for the lowest and highest violation coefficients.
The number of items violating the order was high for all methods across all sample sizes in conditions where the number of items was 15 and the number of response categories was 5 or 7. However, in conditions where the number of items was 15, the number of response categories was 7, and the item discrimination level was low, the methods, particularly MIIO, yielded very few items to be removed. In these conditions, the MIIO method yielded an average of 0.05, 1.00 and 1.45 items to be removed for sample sizes of 100, 250 and 500, respectively. These findings are quite surprising. While the number of items yielded for removal increased as the sample size increased, no effect of number of items, number of response categories, or item discrimination on the number of items to be removed for violating the item ordering was observed.
The findings regarding the number of item pairs causing violation are presented in Table 2. In all simulation conditions, the number of item pairs causing violation identified by the IT method was higher than that yielded by the other methods. Especially in conditions where the number of items was 15 and the number of response categories was 5 or 7, more than 1000 item pairs causing violation were detected. However, these high counts, obtained where the lowest violation coefficient was 0.03, dropped to rather low values (0.00-74.10) in conditions where the lowest violation coefficient was 0.27 or 0.45. Thus, IT was significantly affected by the lowest violation coefficient in these conditions as well. The MSCPM and IT methods identified a higher number of item pairs causing violation than the MIIO method did. As the number of these item pairs has an impact on the number of items yielded for removal, it is to be expected that this finding is similar to those presented in Table 1.
As the sample size increased, the number of item pairs causing violation identified by all the methods also increased. In the MSCPM and IT methods, the number of item pairs causing violation also increased as the number of response categories increased. However, the same situation was [...]
The findings regarding the average test statistics are presented in Table 3. Because each method uses different hypotheses to identify the items to be removed for violating the item ordering, each method yielded different test statistics (t, z and χ² values). For this reason, a direct comparison of these methods is not possible, and each method was examined only against itself. In the MIIO method with a sample size of 100, the obtained statistics were very close to zero. However, as the sample size increased, these values also increased. The test statistics varied between 0.00 and 5.87. An increase in the lowest violation coefficient had almost no effect on the test statistics. The highest statistics yielded by the MSCPM method were obtained in conditions where the sample size was 100 and the number of items was 5. The higher the sample size and number of items were, the more stable the obtained values were. No pattern was observed in the findings yielded by the IT method. With the MSCPM method, the values obtained as the lowest violation coefficient increased were very close to zero. However, in the IT method, especially in conditions where the sample size was 500, the number of items was 15, the item discrimination was high, and the number of response categories was 5 or 7, the χ² values were very high even in conditions with a lowest violation coefficient of 0.45. Almost all the χ² values yielded by the IT method were at unexpected levels.
The findings regarding the H^T values are presented in Table 4. While the H^T values yielded by the MSCPM and IT methods were very close to each other, they were higher than those yielded by the MIIO method. However, the findings obtained from these two methods did not display any clear pattern. As the number of items increased, so did the H^T values yielded by all the methods. With a sample size of 250, higher H^T values were obtained in conditions where item discrimination was high. However, a similar pattern was not observed in the other simulation conditions. Consistent with the other findings, the MSCPM and IT methods were not affected by the lowest violation coefficient. The highest H^T values were yielded by the MSCPM and IT methods in conditions where the lowest violation coefficient was 0.03. On the other hand, the lowest H^T values were obtained in conditions where the sample size was 500, the number of items was 15, the number of response categories was 3 and the item discrimination was low.
Accordingly, almost all the H^T values yielded by the MIIO method indicated that the item ordering was not accurate. On the other hand, the MSCPM and IT methods can produce a moderately or highly accurate item ordering, especially in conditions where the lowest violation coefficient was 0.03. In conditions where the lowest violation coefficient was between 0.27 and 0.45, it was frequently observed, as in the MIIO method, that the item ordering was not accurate.

DISCUSSION and CONCLUSION
This area of research, initiated by Ligtvoet (2010) and Ligtvoet et al. (2011) with the methods they developed for invariant item ordering in polytomously scored items, is relatively new. Although some empirical studies have appeared in the literature since these methods were developed, there are no technical or theoretical studies. This implies that practitioners in particular will be confused and will have difficulty deciding which method to use in which conditions and how to interpret the obtained coefficients. Especially in test administrations where items are ordered by item difficulty, from easy to difficult, identifying the fixed item ordering is highly important for the interpretation of the test scores, particularly where items reflect the developmental traits of the measured cognitive stages or where item sets are clustered or hierarchical.
The most important findings obtained in the identification of invariant item ordering are the number of items violating the item ordering, the total number of item pairs causing violation, the average test statistics, and the H^T values of the overall test (Ligtvoet, 2010). Hence, the present study focused on these values. The number of items violating the ordering and the total number of item pairs causing violation yielded by the MSCPM and IT methods were higher than those yielded by the MIIO method. This finding is inconsistent with that reported by Van der Ark (2012), where the MIIO and IT methods yielded a similar number of items to be removed. Moreover, Ligtvoet (2010) indicated that in a condition where the number of items was 20 and the number of response categories was five, the IT method yielded 900 different violations in ordering. In the present study, the IT method yielded more than 1300 violations, many more than the other methods identified. These two findings are consistent.
While the MIIO method produced stable test statistics in all simulation conditions, the MSCPM method produced stable values in conditions where the sample size was 250 or above. However, the test statistics yielded by the IT method did not present any clear pattern. The fact that a condition where the lowest violation coefficient was 0.45 yielded much higher values than those produced with a coefficient of 0.03 indicates that the values obtained via the IT method entail a high degree of error. Although this is not consistent with the findings above, the H^T values obtained via the MSCPM and IT methods were found to be higher. Almost all the H^T values obtained by means of the MIIO method indicated that the item ordering was inaccurate. Overall, the MIIO method yielded the most stable values, because it was not affected by the lowest violation coefficient and was affected only slightly by the other simulation conditions. The MIIO method is recommended for identifying item ordering, especially in conditions where the violation coefficient is 0.03 (the default value in the Mokken package). Although the MSCPM method yields findings similar to those of the IT method, it generates more stable findings, particularly with large sample sizes. The MSCPM method is recommended in conditions where sample size, number of items and item discrimination are high. However, further studies need to be conducted on the IT method. The use of the IT method is not recommended due to the lack of theoretical information.

Journal of Measurement and Evaluation in Education and Psychology
In this relatively new field of study, there is a need for further theoretical and empirical studies. Further studies on error values for invariant item ordering, Type I error and power are recommended. There is also a need to conduct similar studies on real datasets. The MIIO method in particular should be used as a scaling procedure in scale development, person ordering, item ordering and validity studies.