Comparison of Person-Fit Statistics for Polytomous Items in Different Test Conditions

The validity of individual test scores is an important issue that needs to be studied in psychological and educational assessment. An important factor affecting the validity of individual test scores is aberrant item response behavior. Aberrant item scores may increase or decrease individuals' scores, so that ability is estimated above or below its true level. Person-fit statistics (PFS) are useful tools to detect aberrant behavior, and a great number of parametric and nonparametric PFS are available in the literature. The general purpose of this study is to examine the effectiveness of parametric and nonparametric PFS in data sets consisting of polytomous items. The study is fundamental research aimed at determining the effectiveness of PFS using simulated data sets. According to the results, as expected, detection rates (power) increased as the nominal Type I error rates (significance levels) increased. In general, detection rates also increased as the number of misfitting item score vectors and the number of items increased. Nonparametric PFS (N-PFS), especially Gp, generally detected more aberrant individuals than the parametric PFS (P-PFS) lzp; however, in some test conditions lzp detected more aberrant individuals than the N-PFS for longer tests, particularly at the high aberrancy level. Detection rates for the nonparametric U3p and GNp statistics were very close to each other. Overall, the N-PFS outperformed the P-PFS in most test conditions. When the empirical Type I error rates were examined, they did not exceed their nominal levels in most test conditions. Only for U3p was the empirical Type I error rate equal to its nominal level of α = .01 for the large sample and low aberrancy; for high aberrancy, all empirical Type I error rates were smaller than their nominal levels.


INTRODUCTION
It is known that psychological and educational tests are important in making decisions about individuals and identifying their learning problems, developmental problems, and psychological disturbances. It is clear that test users will focus on individual scores, especially in psychological diagnoses and treatments (Emons, 2003, 2009). Therefore, the validity of individual test scores is an important issue that needs to be studied in psychological and educational assessment. An important factor that affects the validity of individual scores is aberrant item response behavior. For example, an individual may give incorrect answers to easy items in an exam because of being anxious during the test. This situation can lead to the person's ability being estimated below her/his true ability. Another example is a situation in which low-skilled individuals copy correct answers from highly skilled individuals sitting around them. This situation can lead to the person's ability being estimated above her/his true ability. Not taking the test seriously, lacking motivation, concentration problems in cognitive tests, and giving fake responses in personality tests also form the basis for aberrant item responses. Thus, the validity of individuals' ability estimates can be negatively affected (Emons, 2003, 2008; Sijtsma & Molenaar, 2002).
Aberrant item scores may increase or decrease individuals' scores, and as a result individuals' estimated ability will be above or below their true ability. Accordingly, the abilities of cheaters and lucky guessers are estimated spuriously high, while the abilities of examinees who are confused at the beginning of the test, who never reach the items towards the end, or who have language deficiencies are estimated lower than their actual ability levels (Meijer, 1996). Moreover, random guessers, examinees who respond without any idea about the item content, creatives (examinees who interpret items in a creative way), and examinees who misalign their answer sheets also produce aberrant item scores. In the literature, the log-likelihood-based lz statistic is the most frequently studied statistic for binary items (Rupp, 2013). The most frequently used P-PFS for polytomous items is lzp, whereas popular N-PFS include Gp, GNp, and U3p (Emons, 2008; Rupp, 2013; Syu, 2013).
Statistic lzp is the extended version of lz for polytomous items developed by Drasgow, Levine, and Williams (1985). Statistic lzp is assumed to be standard normally distributed under the null model of no aberrance, where large negative values (say, less than -1.645) of lzp suggest aberrant response behavior (Meijer, 2003). One of the N-PFS is the number of Guttman errors (G). For dichotomous items, G is the number of item pairs for which the respondent passed the difficult item but failed the easy item. For polytomous items, G is also based on item pairs: a Guttman error occurs when a respondent passes difficult steps on one item but fails easy steps on another item (Meijer, 1996, 2003). The maximum value of G depends on the total score; its normalized version GNp corrects for this, with a maximum value of one indicating extreme misfit (Emons, 2008). Another N-PFS is U3p (Emons, 2008), the extended version of U3. The minimum value of U3p is zero, indicating no misfit; its maximum value is one, indicating extreme misfit (Emons, 2008).
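The step-pair counting behind G and GNp can be made concrete with a short Python sketch. This is a simplified illustration, not the PerFit implementation: each polytomous item is split into M item steps, steps are ordered by their sample pass rates, errors are (fail-easier-step, pass-harder-step) pairs, and the normalization uses the simple upper bound t(S − t) rather than the exact maximum Emons (2008) derives.

```python
import numpy as np

def guttman_errors(X, M):
    """Polytomous Guttman errors G and a normalized GN for each row of X (scores 0..M)."""
    X = np.asarray(X)
    n, J = X.shape
    # Item-step indicators: step (j, m) is "passed" when X[:, j] >= m, m = 1..M.
    steps = np.concatenate([(X >= m).astype(int) for m in range(1, M + 1)], axis=1)
    # Order the J*M steps from easiest (most often passed) to hardest.
    order = np.argsort(-steps.mean(axis=0), kind="stable")
    steps = steps[:, order]
    S = steps.shape[1]
    G = np.zeros(n)
    for i in range(n):
        v = steps[i]
        # A Guttman error is a 1 occurring after a 0 in the easiness ordering;
        # count, for each passed step, the easier steps that were failed.
        zeros_before = np.cumsum(1 - v) - (1 - v)
        G[i] = np.sum(v * zeros_before)
    # Normalize by the simple bound t * (S - t), with t = number of steps passed.
    t = steps.sum(axis=1)
    denom = np.maximum(t * (S - t), 1)
    GN = np.where(t * (S - t) > 0, G / denom, 0.0)
    return G, GN
```

A perfect Guttman pattern yields G = 0, while a respondent who passes only the hardest steps approaches the maximum, so GN near one signals extreme misfit, in line with the description above.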
N-PFS have a few advantages over P-PFS. N-PFS methods only require the fit of a nonparametric model and do not require the fit of more restrictive parametric models (Emons, 2003). In particular, for N-PFS it is sufficient that the data set fits the Mokken Homogeneity Model (MHM). This model assumes unidimensionality, local independence, and monotonicity (i.e., nondecreasing item characteristic curves). These assumptions should therefore be examined before using N-PFS (Emons, 2008).
However, the review also shows that person-fit analyses have often been studied for binary items, and only rarely for polytomous items. Hence, the literature review shows a paucity of research on polytomous PFS and a need for more studies on the effectiveness of polytomous PFS in various simulated test conditions, especially under small samples and skewed ability distributions.

Purpose of the Study
The general purpose of the study is to examine the effectiveness of parametric and nonparametric PFS in data sets which consist of polytomous items. In line with this overall objective, the following questions are addressed: 1. How does the proportion of detected individuals with aberrant item scores vary across test conditions such as sample size, distribution of ability, test length, and proportion of aberrancy, which depends on the manipulation of items and persons?
2. Which PFS performs best in different test conditions?

METHOD
This study is fundamental research aimed at determining the effectiveness of PFS using simulated data sets.

Data Simulation
In this study, data were simulated under Samejima's Graded Response Model (GRM), which is a suitable model for items with ordered answer categories. This model is defined by three basic assumptions: unidimensionality, local independence, and monotonicity between the latent trait and item responses (Hambleton, van der Linden & Wells, 2011; Meijer & Tendeiro, 2018).
To formally define the model, the following notation will be used. Let J be the number of items, indexed by j. Each item is assumed to have (M + 1) ordered answer categories. Let Xj be the random variable with realizations xj (0, …, M). The core of the GRM is the set of item-step response functions (ISRF), which are defined as:

P(Xj ≥ xj | θ) = exp[αj(θ − δjxj)] / (1 + exp[αj(θ − δjxj)]), xj = 1, …, M. (1)

In Equation 1, θ is person ability, αj is the item-slope parameter, and δjxj (xj = 1, …, M) is the location parameter. This means that each item is modeled by one common discrimination parameter and M location parameters. The location parameter δjxj shows where on the ability scale the probability of score xj or higher equals .50. Because the item-step response functions are defined by two parameters, the model is a generalized two-parameter logistic model (Embretson & Reise, 2000; Hambleton et al., 2011).
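Equation 1 can be checked numerically. The following Python sketch (an illustration, not part of the study's R code) computes the ISRFs and the category probabilities obtained by differencing adjacent ISRFs; at θ = δjxj the ISRF equals .50, as stated above.

```python
import numpy as np

def isrf(theta, alpha, delta):
    """ISRF of Equation 1: P(X_j >= x | theta) for the ordered locations in delta."""
    delta = np.asarray(delta, dtype=float)
    return 1.0 / (1.0 + np.exp(-alpha * (theta - delta)))

def category_probs(theta, alpha, delta):
    """Category probabilities P(X_j = x | theta), x = 0..M, by differencing
    adjacent ISRFs, with P(X_j >= 0) = 1 and P(X_j >= M+1) = 0."""
    p_ge = np.concatenate(([1.0], isrf(theta, alpha, delta), [0.0]))
    return p_ge[:-1] - p_ge[1:]
```

Because the M ISRFs of an item share one slope and have ordered locations, the differences are always nonnegative and sum to one, which is what makes the GRM a proper model for ordered categories.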
R software was employed to generate the simulated data. Using the "catIrt" package (Nydick, 2015) in R, data sets that fit the GRM were produced. Although the N-PFS are rooted in NIRT, the main reason data were generated under the GRM is that the GRM is a special case of the MHM, so data that fit the GRM also fit the MHM (Emons, 2008; Sijtsma, Emons, Bouwmeester, Nyklícek & Roorda, 2008). In addition, the "fungible" package (Waller & Jones, 2016) was used to generate skewed ability distributions. To compute lzp, one needs estimates of θ, which can be obtained using the weighted maximum likelihood (WML) estimation method (Wang, 2001; Warm, 1989). Dedicated algorithms in the R programming language were used for WML estimation. The accompanying R code was obtained from Emons and is available upon request.
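The generation step can be mimicked outside R. The following Python sketch is a hedged stand-in for what catIrt does internally (the study itself used catIrt, not this code): draw standard-normal abilities, compute the GRM step probabilities, and sample one ordered score per item by the inverse-CDF trick. The item parameters here are hypothetical draws from the ranges reported below (a in [1.50, 2.00], b in [-2.00, 1.50]).

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_grm(n, alphas, deltas, rng):
    """deltas: (J, M) ordered location parameters per item. Returns (n, J) scores."""
    theta = rng.standard_normal(n)  # standard-normal ability
    J, M = deltas.shape
    X = np.zeros((n, J), dtype=int)
    for j in range(J):
        # ISRFs P(X_j >= m | theta), m = 1..M, with one common slope per item
        p_ge = 1.0 / (1.0 + np.exp(-alphas[j] * (theta[:, None] - deltas[j])))
        # Inverse-CDF draw: score = number of steps whose ISRF exceeds one uniform
        X[:, j] = (rng.uniform(size=(n, 1)) < p_ge).sum(axis=1)
    return X

# Illustrative (hypothetical) parameters from the ranges reported in the text
J, M = 10, 4
alphas = rng.uniform(1.5, 2.0, J)                         # a in [1.50, 2.00]
deltas = np.sort(rng.uniform(-2.0, 1.5, (J, M)), axis=1)  # b in [-2.00, 1.50], ordered
scores = simulate_grm(250, alphas, deltas, rng)
```

Since the ISRFs are decreasing in m, counting how many of them exceed a single uniform draw is equivalent to sampling from the category probabilities, which keeps the sketch short.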

Design Factors
In this study, simulations were done as follows: 1. Data were generated under the null model according to the GRM using the envisaged test conditions.
2. According to the aim of the research, data were manipulated to mimic aberrant response behavior.
3. Extreme score patterns, in which respondents chose the same extreme response option (e.g., strongly agree or strongly disagree) for all items, were excluded from the analyses. As Emons (2008) emphasized, extreme scores do not provide adequate information for person-fit analyses.
4. Abilities were estimated using WML estimation. While estimating the abilities, true item parameters for generating the data were used.
5. PFS were computed to detect aberrancy in the different conditions with the "PerFit" package (Tendeiro, 2016) in R.
Test conditions are the independent variables of the study. Test conditions included different levels of sample size (100, 250, 500, and 1,000), different shapes for the distribution of person ability (normal, positively skewed, and negatively skewed), different levels of test length (J = 10 and J = 30 items), and two levels of aberrancy (low and high). For the low level of aberrancy, 20% of respondents showed aberrant response behavior on half of the items; for the high level of aberrancy, 30% of respondents showed aberrant response behavior on all items. Table 2 shows the descriptive statistics of the simulated ability distributions. For all ability distributions, the mean approximately equals zero and the standard deviation equals one. Inspection of the skewness coefficients shows that under the normal distribution these coefficients were very close to zero, between 0.54 and 0.61 for the positively skewed distribution, and between -0.58 and -0.55 for the negatively skewed distribution. To generate item responses under the GRM, the a parameters were chosen between 1.50 and 2.00 and the b parameters were, consistent with the literature, drawn from the uniform distribution between -2.00 and 1.50 (Bahry, 2012; Cohen, Kim, & Baker, 1993; DeMars, 2002; Jiang, Wang & Weiss, 2016; Syu, 2013). Table 3 shows the item parameters for the 10-item and 30-item tests. Previous studies convincingly showed that the power of PFS relates to the items' discrimination power (Emons, 2008; Meijer, Molenaar, & Sijtsma, 1994; Meijer & Sijtsma, 2001). Higher discrimination power may produce a higher detection rate (Emons, 2008).
There are many kinds of aberrant behavior that may affect test results. One of them is carelessness and inattention. In some test applications, individuals answer items randomly because they are careless, or a random pattern emerges due to misreading or not reading the questions, or due to alignment errors (Emons, 2008). Random-like response behavior is among the important types of aberrant behavior (Conijn et al., 2015) and is the subject of this study. To accomplish this goal, aberrant item response vectors were created by simulating random scores from the uniform distribution, similar to Emons's (2008) study.
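The manipulation just described can be sketched as follows. This is an assumed reading of the procedure (after Emons, 2008), not the study's own code: for a chosen proportion of respondents, scores on a chosen set of items are overwritten with draws from the discrete uniform distribution over the score categories.

```python
import numpy as np

def inject_random_responses(X, prop_persons, prop_items, M, rng):
    """Overwrite a random subset of persons x items with uniform random scores 0..M.

    Returns the manipulated matrix and the (sorted) indices of aberrant persons,
    so empirical Type I error and detection rates can later be computed."""
    X = np.array(X, copy=True)
    n, J = X.shape
    aberrant = rng.choice(n, size=int(round(prop_persons * n)), replace=False)
    items = rng.choice(J, size=int(round(prop_items * J)), replace=False)
    X[np.ix_(aberrant, items)] = rng.integers(0, M + 1, size=(len(aberrant), len(items)))
    return X, np.sort(aberrant)
```

For example, the low-aberrancy condition above corresponds to `inject_random_responses(X, 0.20, 0.50, M, rng)` (20% of respondents, half of the items), and the high-aberrancy condition to proportions 0.30 and 1.0.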
The selected test conditions are based on the literature (Lee, 2007; Lee, Wollack & Douglas, 2009; Liang, Wells & Hambleton, 2014; Ramsay, 1991; Syu, 2013). In particular, variation in the shape of the ability distribution, small sample sizes, and short tests are often seen in classroom measurement applications. One condition nevertheless consisted of a large sample size (1,000). This condition was chosen to see how PFS function in large samples and can be seen as a benchmark for the other results.
Data were generated using a fully factorial design including 4 (sample size) × 3 (ability distribution) × 2 (test length) × 2 (aberrancy levels) = 48 conditions. In total 100 replications were obtained for each test condition, thus in total 4800 data sets were simulated.
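The factorial design can be spelled out explicitly; the short sketch below simply enumerates the 4 × 3 × 2 × 2 = 48 design cells described above (the condition labels are illustrative names, not identifiers from the study).

```python
from itertools import product

sample_sizes = (100, 250, 500, 1000)
distributions = ("normal", "positive_skew", "negative_skew")
test_lengths = (10, 30)
aberrancy_levels = ("low", "high")

# 4 x 3 x 2 x 2 = 48 design cells; 100 replications each gives 4,800 data sets.
conditions = list(product(sample_sizes, distributions, test_lengths, aberrancy_levels))
n_datasets = len(conditions) * 100
```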

Data Analysis
Empirical Type I error rates and detection rates (power) are the dependent variables of the study. For each PFS (lzp, U3p, GNp, and Gp), the empirical Type I error rates and detection rates were evaluated at four theoretical Type I error rates (nominal significance levels): α = .01, α = .05, α = .10, and α = .20. The empirical Type I error rate is the observed proportion of non-aberrant persons identified as aberrant, and the detection rate is the proportion of aberrant persons correctly identified as aberrant (Voncken, 2014).
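Because the simulation truth is known, both dependent variables reduce to proportions over two groups. The following sketch (an illustration of the definitions above, with hypothetical flag vectors) computes them from a vector of flags and the known aberrance indicator.

```python
import numpy as np

def error_and_power(flagged, truly_aberrant):
    """Empirical Type I error = proportion of non-aberrant persons flagged;
    detection rate (power) = proportion of truly aberrant persons flagged."""
    flagged = np.asarray(flagged, dtype=bool)
    truly_aberrant = np.asarray(truly_aberrant, dtype=bool)
    type1 = flagged[~truly_aberrant].mean()
    power = flagged[truly_aberrant].mean()
    return type1, power
```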
The theoretical Type I error rates used in the study were chosen based on the literature. It is stated in the literature that large alpha levels (e.g., .05, .10, and .20) are preferable because PFS have relatively low power to detect aberrancy for small test lengths and low alpha levels (Emons, 2008; Emons, Glas, Meijer & Sijtsma, 2003; Meijer, 2003; Spoden, 2014; Voncken, 2014).
To decide whether a pattern shows significant misfit, one needs critical values. Certain rules are followed in the calculation of critical values for the PFS. In particular, the critical values for the parametric lzp were set, as in Voncken's (2014) study, to -2.32, -1.645, -1.28, and -0.84. These are critical values from the standard normal distribution for alphas of .01, .05, .10, and .20 (one-tailed tests). Because the N-PFS lack theoretical distributions, their critical values have to be determined differently. This study uses critical values of the N-PFS that were determined automatically by the PerFit package in a pilot study. These cut-off values were fixed for every simulation and replication. Researchers are strongly recommended to fix the random seed with the command set.seed() before identifying individuals with aberrant item patterns according to the cut-off score in the relevant package (Meijer, Niessen & Tendeiro, 2016; Tendeiro, 2016). Otherwise, slightly different critical values are obtained in each calculation.
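The two flagging rules just described can be sketched side by side. This is a hedged illustration, not PerFit's internal procedure: lzp uses the fixed standard-normal critical values quoted above (misfit in the lower tail), while an N-PFS (where large values signal misfit) is cut at an empirical quantile of a sample of values obtained under the null model.

```python
import numpy as np

# One-tailed standard-normal critical values for lz_p, as quoted in the text
LZ_CRIT = {.01: -2.32, .05: -1.645, .10: -1.28, .20: -0.84}

def flag_lzp(lz_values, alpha):
    """Flag persons whose lz_p falls below the standard-normal critical value."""
    return np.asarray(lz_values) < LZ_CRIT[alpha]

def flag_npfs(stat_values, null_values, alpha):
    """Flag persons whose N-PFS value exceeds the empirical (1 - alpha)
    quantile of a sample of values generated under the null model."""
    cut = np.quantile(np.asarray(null_values), 1.0 - alpha)
    return np.asarray(stat_values) > cut
```

The empirical cut-off is itself a random quantity, which is exactly why the text recommends fixing the seed: without it, each run of the pilot simulation would yield a slightly different `cut`.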

RESULTS
There are two levels of aberrancy in this study. PFS analysis results are given in Tables 4 to 9. Table 4 gives the findings for normally distributed ability for 10 items. Table 4 shows that as sample size increased, the detection rate increased in many test conditions. In almost all conditions, detection rates increased with increasing aberrancy levels. In general, Gp showed the best performance in detecting aberrancy. In addition, the detection rates of the nonparametric U3p and GNp statistics were very close to each other. When the empirical Type I error rates are examined, these values exceeded their nominal levels especially for the low aberrancy level at α = .01 and α = .05. Also, the empirical Type I error rates were smaller than their nominal levels in all conditions for the high aberrancy level except at α = .01. As aberrancy increased, the empirical Type I error rates decreased.
Table 5 gives the findings for the positively skewed ability distribution for 10 items: empirical Type I error rates and detection rates for the PFS for different sample sizes and low and high aberrancy levels. As expected, as the nominal Type I error rates increased, the detection rate increased. As sample size increased, the detection rate increased in many test conditions for the high aberrancy level. In almost all conditions, detection rates increased with the aberrancy level. In general, Gp showed the best performance in detecting aberrancy, and the detection rates of U3p and GNp were again very close to each other. The empirical Type I error rates were smaller than their nominal levels for both low and high aberrancy except at α = .01, where they were equal to or smaller than their nominal level. As aberrancy increased, the empirical Type I error rates decreased.
Table 6 gives the findings for the negatively skewed distribution for 10 items: detection rates for negatively skewed ability for different sample sizes and low and high aberrancy. As the nominal significance level increased, the detection rates increased in almost all test conditions. In general, as sample size increased, the detection rates increased. However, the detection rates of lzp decreased dramatically for the large sample at the low aberrancy level when α = .05. Detection rates increased with the aberrancy level in all test conditions. In general, Gp showed the best performance in detecting aberrancy, and the detection rates of the nonparametric U3p and GNp statistics were very close to each other. The empirical Type I error rates were, in general, smaller than their nominal levels for both low and high aberrancy except at α = .01, where they were equal to or smaller than their nominal level. As aberrancy increased, the empirical Type I error rates decreased.
Table 7 gives the findings for normally distributed ability for 30 items: detection rates for different sample sizes and aberrancy levels. As expected, as the nominal significance levels increased, the detection rates increased as well. There is no specific trend regarding the effect of sample size on the detection rates; however, across all test conditions, the highest detection rates were observed in the largest sample. For lzp, detection rates increased with increasing aberrancy levels at all nominal significance levels. In general, Gp showed the best performance in detecting aberrancy at the low aberrancy level, while lzp showed the best performance at the high aberrancy level. The detection rates of the nonparametric U3p and GNp statistics were again very close to each other. The empirical Type I error rates never exceeded their nominal levels in any test condition: they were smaller than or equal to their nominal level of α = .01 for low aberrancy, and all were smaller than their nominal levels for high aberrancy. As aberrancy increased, the empirical Type I error rates decreased.
Table 8 gives the findings for the positively skewed ability distribution for 30 items: detection rates for the PFS for different sample sizes and low and high aberrancy. In general, detection rates increased with increasing aberrancy levels. However, for the N-PFS the results show higher detection rates for the low aberrancy level than for the high aberrancy level.
This result is seen in the test conditions for sample size 100 at the α = .01 and α = .05 nominal levels, and for sample size 250 at the α = .01 nominal level. Statistic Gp showed the best performance in detecting aberrancy at the low aberrancy level except for sample size 100 at the α = .01 and α = .05 nominal levels, and for sample size 250 at the α = .01 nominal level. Statistic lzp showed the best performance in detecting aberrancy for all sample sizes and all Type I error rates at the high aberrancy level. In addition, the detection rates for the nonparametric U3p and GNp statistics were very close to each other. The empirical Type I error rates did not exceed their nominal levels in most test conditions. Only for U3p was the empirical Type I error rate equal to its nominal level of α = .01 for the large sample and low aberrancy. Also, all empirical Type I error rates were smaller than their nominal levels for high aberrancy.
Table 9 gives the findings for the negatively skewed distribution for 30 items: detection rates for the PFS for different sample sizes and low and high aberrancy levels. Table 9 shows that, as expected, as the nominal significance levels increased, the detection rates increased as well. In almost all conditions of low aberrancy, as sample size increased, the detection rate increased; at the high aberrancy level this held for all samples. In general, detection rates increased with the aberrancy level except at α = .01 and α = .05 for the N-PFS. Broadly speaking, across all conditions, Gp showed the best performance in detecting aberrancy at the low aberrancy level, while lzp showed the best performance at the high aberrancy level.
In addition, the detection rates of the nonparametric U3p and GNp statistics were very close to each other. The empirical Type I error rates did not exceed their nominal levels for high aberrancy, and for low aberrancy they were smaller than or equal to their nominal level at α = .01. As aberrancy increased, the empirical Type I error rates decreased.

DISCUSSION and CONCLUSION
The general purpose of the study is to examine the effectiveness of parametric and nonparametric PFS in data sets which consist of polytomous items. In line with this aim, data were simulated under different test conditions and the resulting data sets were analyzed.
The results confirmed several important effects of significance level, sample size, ability distribution, and aberrance level. As expected, the detection rates increased with increasing nominal significance levels (the theoretical Type I error rates) in all test conditions. Moreover, detection rates increased as the number of misfitting item score vectors and the number of misfitting items increased. The simulation results suggest that the shape of the ability distribution has little effect on the detection of aberrancy. Thus, it can be said that the shape of the ability distribution (within the test conditions determined in this study) is an unimportant factor for the effectiveness of PFS.
In general, sample size affected detection rates: in most test conditions, detection rates increased as sample size increased. However, this result conflicts with Syu (2013), who studied the parametric lzp and the nonparametric Gp and U3p and found only small differences in the detection rates across sample sizes for specific PFS. Syu (2013) stated, moreover, that those findings were tentative because the sample sizes were too small to provide sufficient calculations for the PFS.
In general, the empirical Type I error rates were smaller than their nominal levels (the theoretical Type I error rates). In all shapes of ability distributions for 10 and 30 items, the empirical Type I error rates were equal to or smaller than their nominal level at α = .01. The exception is the normally distributed sample for 10 items, where the empirical Type I error rates exceeded the nominal level at α = .01. In Voncken's (2014) study, detection rates were determined for binary items, and there too lz*'s empirical Type I error rate exceeded its nominal level at α = .01. Also, as aberrancy increased, the empirical Type I error rates decreased. These findings are consistent with Voncken (2014).
To summarize, as expected, the detection rates increased as the nominal significance level was set higher, as tests were longer, and as the aberrant proportions increased. These findings are consistent with other person-fit studies (Emons, 2008; Meijer & Sijtsma, 2001; Voncken, 2014).
A comparison of the effectiveness of the different PFS showed the following important trends. Detection rates were very close to each other for the P-PFS and N-PFS (especially U3p and GNp). However, in general, Gp was the most effective in detecting aberrant individuals and even performed better than lzp. These results are consistent with Emons (2008) and Syu (2013), who compared the same PFS as used in this study under different test conditions; in their studies, too, Gp showed the best performance in detecting aberrancy. Syu (2013) also stated that for small sample sizes the N-PFS perform better than the P-PFS.
For all test conditions, detection rates were sufficiently high except at α = .01; detection rates reached their maximum at α = .20. PFS may have very low detection rates at small significance levels such as α = .01, which questions their effectiveness at these levels. These findings are consistent with the literature. Therefore, it is suggested that researchers choose liberal significance levels (e.g., α = .20) to reach some power in detecting aberrancy (Emons, 2008; Meijer, 2003; Voncken, 2014).
Based on the results, the following general conclusions about the suitability of the different statistics can be drawn. The results showed that for detecting careless and inattentive aberrant behavior, long tests are more useful than short tests. However, long tests are not always feasible in practice. This renders PIRT models less useful in many applications, because they require large sample sizes and sufficiently long tests to obtain accurate estimates of the item parameters. NIRT models, and the accompanying N-PFS, do not suffer from these problems, as they use observed group statistics, and are therefore particularly useful in small samples and short tests (Junker & Sijtsma, 2001; Meijer, 2004; Molenaar, 2001). When PIRT and NIRT models are compared, NIRT models are less restrictive. The main difference between these models concerns the item characteristic curves. In PIRT models, these curves are determined by a postulated parametric model, such as a logistic or normal-ogive curve (Lee et al., 2009; Sodano & Tracey, 2011). In NIRT models, these curves do not require any parametric form; in particular, the MHM assumes only that they are monotonically nondecreasing in θ (Lee et al., 2009; Sijtsma & Molenaar, 2002). Thus, NIRT models are more flexible than PIRT models.
It must be emphasized that in practice, if researchers want to study aberrant response behavior with N-PFS, they should investigate the MHM assumptions. The MHM can fit skewed data (Şengül Avşar & Tavşancıl, 2017) and is an appropriate model for small samples (Junker & Sijtsma, 2001; Molenaar, 2001). These are important advantages of the MHM over its parametric counterparts. Of course, if researchers want to study response aberrancy with P-PFS, they should demonstrate the fit of the data to the parametric model assumptions. In general, if data do not fit PIRT models, researchers can often use NIRT models and N-PFS for detecting aberrant individuals.
This study assumed that all individuals answered all items; in other words, there were no missing data in the data sets. The effects of missing data on PFS, and which missing-data handling methods best recover PFS, can be investigated in future work. Apart from the test conditions determined in this study, the effectiveness of PFS can be determined by simulating different test conditions. Also, the PFS used in this study can be compared in real-data applications.