An Investigation of the Factors Affecting the Vertical Scaling of Multidimensional Mixed-Format Tests*

This study examined the effects of the structure of the common item set (only dichotomous common items vs. mixed-format common item sets), the parameter estimation method, and scale shrinkage on vertical scaling results when multidimensional datasets were used within the context of the Common Item Nonequivalent Group (CINEG) design. Interactions between these variables were also investigated. The study was performed using simulated data. Measurement error and bias indexes were used to evaluate the quality of the vertical scaling. All procedures used in the data analysis were replicated 50 times to increase the generalizability of the results. The R program was used for data generation, calibration of the parameters, and the vertical scaling procedures. Possible interactions were investigated with factorial analysis of variance using SPSS. The results showed a consistent effect of the common item format under all conditions. In addition, some interactions between the variables were observed. These findings are discussed and some recommendations are provided.


INTRODUCTION
Test scores are among the primary sources of information that educators and educational institutions use in making important decisions about students. Thus, test scores must provide accurate information to facilitate appropriate decisions (Kolen and Brennan, 2004). However, different forms of the same test are often used for reasons such as test security and the need to follow up on student development. A functional link between these forms needs to be established so that the scores from different test forms are comparable. This process is called test linking. Test linking is the process of establishing a relationship between different test forms. There is no requirement that the content and difficulty levels of the test forms be the same for test linking. Test equating is a special form of linking in which the aim is to use the scores from the different test forms interchangeably. Hence, the test forms should be similar in content and difficulty (Kolen and Brennan, 2004). Vertical scaling is similar to equating in that different test forms are linked to each other. However, the test forms differ in content and difficulty because they reflect progression between grades or age groups. Therefore, while vertical scaling is used to compare different test forms, the scores at each level cannot be used in place of each other. When the scores are put onto a common scale, students' grade-to-grade improvement can be seen. The main aim of vertical scaling is to observe student progress.
Dichotomous items were the most widely used item format in the 20th century (Koretz and Hamilton, 2006). Today, however, the use of mixed-format tests, which contain both dichotomously and polytomously scored items, is rapidly becoming widespread. Mixed-format tests offer many advantages. According to Livingston (2009), multiple-choice questions may be used to measure a test taker's ability with high reliability across a wide range of content, in a short time, and at low cost. On the other hand, open-ended questions measure higher-level cognitive skills more effectively, but tests made up of these items have narrower content coverage, are costly to score, and are prone to subjective scoring. Mixed-format tests offset the disadvantages of each format and increase the psychometric quality of the instruments.
One of the variables that this study examines is the effect of scale shrinkage, which becomes relevant when measurement tools are administered at different time points to detect students' progress. Scale shrinkage is the extent to which the variance and range of scores decrease in a later administration compared to an earlier one (Yen, 1985). As students continue through a program, they become more homogeneous in ability compared to the beginning, which causes their score variances to shrink in later test administrations. Scale shrinkage thus reflects this growing homogeneity. So far, there has been a lack of research on how scale shrinkage affects the results of vertical scaling.
Another important variable examined in this study is the structure of the common item set. Mixed-format tests can be scaled vertically using only dichotomous common items, mixed-format common items (including at least two different item formats in the common item set), or only polytomous common items. In this study, the only-dichotomous common item set and mixed-format common item set conditions were compared. Although positive outcomes have been obtained for mixed-format common item sets in vertical scaling applications (e.g., Kim and Lee, 2006), it could be valuable to see the results within the context of the current study, where a different combination of variables is included.
Most software programs routinely carry out estimations using expectation-maximization (EM) algorithms (Bock and Aitkin, 1981). Another method that has recently come into use is the Metropolis-Hastings Robbins-Monro (MHRM) method developed by Cai (2010). The performance of EM and MHRM has been compared in estimating multidimensional dichotomous models (Han and Paek, 2010) and multidimensional polytomous models (Kuo and Sheng, 2016). However, no study was found comparing their performance in the context of multidimensional mixed-format test scaling. A comparison of these estimation methods could contribute important insights to the literature. Thus, we also varied the estimation method across the study conditions.

Dichotomous Item Response Theory (IRT) Models
Dichotomous response models are based on the three-parameter logistic model (3PLM) developed by Birnbaum (1968). This model is expressed in formula 1 (Lord and Novick, 1968):

$$P_j(\theta) = c_j + (1 - c_j)\,\frac{1}{1 + \exp[-D a_j(\theta - b_j)]} \qquad (1)$$

Here $\theta$ corresponds to the individual's ability level, $a_j$ to the discrimination parameter, $b_j$ to the difficulty parameter, and $c_j$ to the so-called chance parameter of item j. When this model was first introduced, it was used for unidimensional tests, but since the 1980s it has been used for multidimensional models as well. The generalization of Birnbaum's (1968) model to multidimensional tests is given in the following section.
For example, let i = 1, …, N index the participants and j = 1, …, n the test items. Also, suppose there are m latent factors, so that $\theta_i = (\theta_{i1}, \ldots, \theta_{im})$ and the slope parameters associated with the dimensions are $a_j = (a_{j1}, \ldots, a_{jm})$. The probability of a correct response to a dichotomous item under the multidimensional 3PLM is then as presented in formula 2:

$$P_j(\theta_i) = c_j + (1 - c_j)\,\frac{1}{1 + \exp[-D(a_j^{\prime}\theta_i + d_j)]} \qquad (2)$$

Here, $d_j$ corresponds to the intercept parameter, $c_j$ to the "chance" parameter, and D to the scaling constant. This value is generally taken as 1.702, and it is used to transform the logistic metric to the traditional normal ogive metric (Reckase, 2009).
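To make formula 2 concrete, the response probability can be evaluated with a few lines of code. The study itself was implemented in R; the following Python sketch, with function and argument names of our own choosing, simply computes the M3PL expression:

```python
import numpy as np

def m3pl_prob(theta, a, d, c, D=1.702):
    """Probability of a correct response under the multidimensional 3PL:
    P = c + (1 - c) / (1 + exp(-D (a'theta + d)))."""
    z = D * (np.dot(a, theta) + d)
    return c + (1.0 - c) / (1.0 + np.exp(-z))
```

For an examinee at the origin of the ability space with d = 0 and c = 0.2, the probability reduces to 0.2 + 0.8 × 0.5 = 0.6, as expected from the formula.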

Polytomous IRT Models
Although there are different models for polytomous items in the literature, the graded response model (GRM) is preferred in this study. In this model, developed by Samejima (1969, 1972), the discrimination parameter is assumed to be constant across the response categories of an item. $P^*_{jk}(\theta_i)$ corresponds to the cumulative probability that a person i with ability level $\theta_i$ obtains a score in or beyond category k of item j. For categories k = 2, …, K, $P^*_{jk}$ can be expressed as follows:

$$P^*_{jk}(\theta_i) = \frac{1}{1 + \exp[-D a_j(\theta_i - b_{jk})]} \qquad (3)$$

Here, $a_j$ corresponds to the discrimination parameter, $b_{jk}$ to the difficulty (or threshold) parameter for categories from the second category to category K, and D to the scaling constant. By definition, $P^*_{j1} = 1$ and $P^*_{j,K+1} = 0$. The category response function, $P_{jk}$, corresponds to the difference between two adjacent cumulative probabilities and is expressed as follows:

$$P_{jk}(\theta_i) = P^*_{jk}(\theta_i) - P^*_{j,k+1}(\theta_i) \qquad (4)$$

Samejima (1969) and Carlson (1995) generalized the GRM to multidimensional situations. In the multidimensional model, the boundaries of the response categories for item j are expressed through the intercepts $d_{jk}$, k = 1, …, $K_j$ − 1, as follows:

$$P^*_{jk}(\theta_i) = \frac{1}{1 + \exp[-D(a_j^{\prime}\theta_i + d_{jk})]} \qquad (5)$$
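The difference of adjacent cumulative probabilities in the category response function can be sketched as follows (an illustrative Python re-implementation of the unidimensional case; the study itself used R, and the function name is our own):

```python
import numpy as np

def grm_category_probs(theta, a, b, D=1.702):
    """Category response probabilities under the graded response model.
    `b` holds the K-1 ordered threshold parameters of one item; returns
    the K category probabilities P_jk = P*_jk - P*_j,k+1."""
    b = np.asarray(b, dtype=float)
    # Cumulative probabilities of responding in or beyond each boundary.
    cum = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    # Pad with P* = 1 (lowest category) and P* = 0 (beyond the highest).
    star = np.concatenate(([1.0], cum, [0.0]))
    return star[:-1] - star[1:]
```

Because the boundaries are ordered, the differences are all positive and sum to 1 across the K categories.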

Vertical Scaling
In the literature, moment and characteristic curve methods are the most commonly preferred methods for vertical scaling. The moment methods, namely mean/sigma (Marco, 1977) and mean/mean (Loyd and Hoover, 1980), are the simplest methods; only the parameter estimates need to be known in order to estimate the linking constants. Alternatives to the moment methods are the characteristic curve methods developed by Haebara (1980) and Stocking and Lord (1983). These methods are based on minimizing the differences between item characteristic curves. A comprehensive analysis and comparison of these methods was provided by Kolen and Brennan (2004). These methods have been extended to link mixed-format tests; detailed information can be found in Kim and Lee (2006). This study uses the Haebara method.
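The logic of the Haebara criterion can be illustrated in a simplified dichotomous-only Python sketch (the study itself used the plink package in R; the grid-search minimizer and all names below are our own, and a 2PL curve stands in for the full mixed-format case). The criterion sums, over a theta grid, the squared distances between the common items' characteristic curves after the new-form parameters are rescaled by the linking constants A and B:

```python
import numpy as np

def p2pl(theta, a, b, D=1.702):
    """2PL item characteristic curve (dichotomous, no guessing)."""
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

def haebara_loss(A, B, a_new, b_new, a_old, b_old, grid):
    """Haebara criterion: squared distance between the common items'
    curves after placing new-form parameters on the old-form scale
    via a_j / A and A * b_j + B, summed over a theta grid."""
    loss = 0.0
    for a_n, b_n, a_o, b_o in zip(a_new, b_new, a_old, b_old):
        loss += np.sum((p2pl(grid, a_n / A, A * b_n + B)
                        - p2pl(grid, a_o, b_o)) ** 2)
    return loss

def haebara_grid_search(a_new, b_new, a_old, b_old, A_vals, B_vals, grid):
    """Crude grid-search minimizer for the linking constants (A, B)."""
    best_A, best_B, best_loss = None, None, np.inf
    for A in A_vals:
        for B in B_vals:
            L = haebara_loss(A, B, a_new, b_new, a_old, b_old, grid)
            if L < best_loss:
                best_A, best_B, best_loss = A, B, L
    return best_A, best_B

# Demonstration: parameters built from known constants A = 1.2, B = 0.5
# should be recovered by the search.
grid = np.linspace(-3, 3, 31)
a_old = np.array([0.8, 1.0, 1.4])
b_old = np.array([-0.5, 0.0, 0.8])
a_new = a_old * 1.2
b_new = (b_old - 0.5) / 1.2
A_hat, B_hat = haebara_grid_search(a_new, b_new, a_old, b_old,
                                   np.arange(0.8, 1.61, 0.05),
                                   np.arange(0.0, 1.01, 0.05), grid)
```

In practice a proper numerical optimizer replaces the grid search, but the minimized quantity is the same.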

Purpose of the Study
Based on the literature presented above, the aim of the current study is to investigate the effect of common item structure, scale shrinkage, and estimation method on the vertical scaling of multidimensional mixed-format tests.

Data Simulation
Simulated datasets were used in the study. Population parameters were simulated considering values that can be observed in real testing conditions. The dimensionality structure was prepared following the two-tier model proposed by Cai (2010). In two-tier models, main dimensions and special dimensions are used as the sources of dimensionality. The terms "main" and "special" do not imply that main dimensions are theoretically more important or that the variance/covariance structure between the dimensions is different. There is no theoretical relationship between main and special dimensions in a two-tier model. In addition, the special factors are mutually orthogonal, and each item loads on only one special dimension. On the other hand, the main factors may be related to each other. In the context of this study, content and item format were regarded as two sources of dimensionality in the data simulation. Accordingly, "content" was regarded as a special dimension source and "item format" as a main dimension source. The dual effect of content and item format on test dimensionality was investigated by Zhang (2016), but no study was found that takes both factors into consideration when conducting scaling studies using simulated data. Figure 1 shows the model used for the data simulation in the current study. As seen, the three content-based dimensions and the two item-format-based dimensions are intertwined in the model, in line with two-tier models. The variance-covariance matrix used for the data simulation was established based on this model and is shown in Figure 2. As seen in the figure, the correlation between the general factors is set to 0.75. Among the special factors, this value is 0. Likewise, the covariances between the general and special factors are assumed to be 0, indicating no relationship between them.
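The variance-covariance structure just described can be written down directly. The following Python sketch (illustrative only; the study used R) builds a 5 × 5 matrix with unit variances, a 0.75 correlation between the two main factors, and zeros everywhere a special factor is involved, and then draws latent abilities from it:

```python
import numpy as np

# Illustrative 5 x 5 latent variance-covariance matrix for the two-tier
# structure: dimensions 1-2 are the main (item-format) factors and
# dimensions 3-5 the special (content) factors.
n_main, n_special = 2, 3
cov = np.eye(n_main + n_special)   # unit variances, zero covariances
cov[0, 1] = cov[1, 0] = 0.75       # main-factor correlation

# Drawing latent abilities for 3000 examinees from this structure:
rng = np.random.default_rng(1)
theta = rng.multivariate_normal(np.zeros(5), cov, size=3000)
```

The empirical correlation between the first two columns of `theta` approximates 0.75, while the special-factor columns remain uncorrelated with the rest.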

Simulation of Person and Item Parameters
In order to obtain accurate and stable parameter estimates in Multidimensional Item Response Theory (MIRT), a sample size of 3000 was recommended (Yao and Boughton, 2009). Thus, in this study, the sample size consisted of 3000 simulated examinees. Theta scores were simulated from a normal distribution. The mean for the lower ability group was set to 0 and that for the higher ability group was set to 1, with different scale shrinkage levels. A one-point ability difference between the groups is an acceptable value that can be seen in real testing conditions (e.g., Kim, 2007). The theta vectors were simulated for each specific factor. Thus, final θ matrices of size 3 × 3000 were obtained as the population parameters for each group. In addition, the variances of the population ability parameters were controlled. For the scale shrinkage conditions, the variances were set to reflect a shrinkage of 65% for the higher ability group. This amount of shrinkage was selected based on the literature review provided by Yen (2005). This level of shrinkage was applied to half of the datasets, while for the rest, the variances were kept the same for both datasets used in the vertical scaling. In addition, the datasets were created to be composed of 108 items (90 dichotomous and 18 polytomously scored items). In this scenario, there were 54 items (45 dichotomous and 9 polytomous) loading on each main factor and 36 items (30 dichotomous and 6 polytomous) loading on each special factor.
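A minimal sketch of the theta simulation follows (Python for illustration; the study used R). One reading of "a shrinkage of 65%" is that the higher group's variance is reduced by 65%, i.e., to 0.35 — the exact operationalization in the study may differ, so the constant below is an assumption:

```python
import numpy as np

rng = np.random.default_rng(42)
n_examinees, n_dims = 3000, 3  # 3000 examinees, 3 specific factors

# Lower ability group: theta ~ N(0, 1) on each dimension.
theta_low = rng.normal(0.0, 1.0, size=(n_examinees, n_dims))

# Higher ability group: mean shifted to 1. Under the shrinkage
# condition we assume the variance drops by 65%, to 0.35; without
# shrinkage the variance would stay at 1.
shrunk_sd = np.sqrt(0.35)
theta_high = rng.normal(1.0, shrunk_sd, size=(n_examinees, n_dims))
```

Transposing these matrices yields the 3 × 3000 θ matrices described above.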

Journal of Measurement and Evaluation in Education and Psychology
Population a parameters for the generation of the data matrices were generated for each dimension (for each of the main and special dimensions). Thus, a final matrix of size 5 × 108 was obtained for each dataset. For the main factors, if an item belonged to a dimension, the mean a value was set to reflect high discrimination power and fixed at 1, with a standard deviation of 0.15. If an item did not belong to a dimension, the mean value was fixed at 0.2 and the standard deviation at 0.03, because these items were not expected to have a high level of discrimination. For the special factors, if an item was included in a dimension, the mean value was fixed at 1 and the standard deviation at 0.15, while if the item was not included in that dimension, all the a values were fixed at 0 because of the simple structure of the specific factors. All simulated discrimination parameters were drawn from normal distributions with these means and standard deviations. The difficulty parameters (b) were produced as 1 × 108 vectors for the dichotomous items. For the lower ability group, the mean was set to 0 with a standard deviation of 1. For the higher ability group, the mean was set to 1 with a standard deviation of 1. Polytomous items were configured as having a 5-point scoring format. For this reason, four threshold parameters were simulated for each item. The threshold values for the lower ability group were simulated with means ranging from -1.5 to 1.5, with a 1-point increase between adjacent thresholds. For the higher ability group, the same procedure was repeated, except that the range was set from -1 to 1. The threshold parameters were drawn from normal distributions with a standard deviation of 0.1. Data matrices were then simulated as described above, using the parameter matrices produced here as input to the calibration process.
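A sketch of the slope-matrix generation follows (Python for illustration; the study used R). The item-to-dimension assignments are hypothetical but consistent with the counts given above — 54 items per main factor, 36 per special factor:

```python
import numpy as np

rng = np.random.default_rng(7)
n_items = 108

# Hypothetical assignments: items 0-53 belong to main factor 1, items
# 54-107 to main factor 2; each block of 36 items loads on one of the
# three special factors (simple structure).
main_of = np.repeat([0, 1], 54)
special_of = np.repeat([0, 1, 2], 36)

a = np.zeros((5, n_items))  # rows: 2 main + 3 special dimensions
for j in range(n_items):
    for m in range(2):  # main factors
        if main_of[j] == m:
            a[m, j] = rng.normal(1.0, 0.15)  # item belongs: high slope
        else:
            a[m, j] = rng.normal(0.2, 0.03)  # item does not belong
    # Special factors: exactly one nonzero loading per item; the
    # remaining special-factor loadings stay fixed at 0.
    a[2 + special_of[j], j] = rng.normal(1.0, 0.15)
```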

Parameter Estimation
In this study, the 3PLM and the GRM were used to calibrate the mixed-format tests. This combination is preferred in many studies. Rosa, Swygert, Nelson, and Thissen (2001) pointed out that the 3PLM is preferred for calibration because more item parameters are taken into account and the model therefore provides more information. In the literature, the GRM and partial credit models are preferred for the calibration of polytomous response models (Kim and Cohen, 2002; Bastari, 2000; Tate, 2000). Dodd (1984) concluded that the two model types produce similar results despite being conceptually and mathematically different. Cao, Yin, and Gao (2007) also found that the two models yield similar results.
For theta estimation, MAP (maximum a posteriori) estimation was preferred in this study. Each dataset was calibrated separately so that the scaling process could be performed. For the EM cycles in the analysis of each dataset, the convergence criterion and the maximum number of iterations were set to 0.001 and 500, respectively. For the MH-RM estimation technique, the convergence criterion was set to 0.0001 and the number of iterations to 2000.

Evaluation Criteria
As the evaluation criteria, root mean square error (RMSE) and bias were used, in parallel with similar studies. RMSE shows the amount of random error in the scaling process. The computation of RMSE is given in formula 6:

$$\mathrm{RMSE}(\hat{\gamma}) = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\left(\hat{\gamma}_r - \gamma\right)^2} \qquad (6)$$

The bias values provide information on the systematic error detected during the scaling process and are calculated as described in formula 7:

$$\mathrm{Bias}(\hat{\gamma}) = \frac{1}{R}\sum_{r=1}^{R}\left(\hat{\gamma}_r - \gamma\right) \qquad (7)$$

Here, $\gamma$ is the true (generating) parameter, $\hat{\gamma}_r$ is its estimate in replication r, and R is the number of replications (50 in this study).
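Formulas 6 and 7 translate directly into code. A minimal Python version (function names are our own):

```python
import numpy as np

def rmse(estimates, true_value):
    """Root mean square error over R replications: formula (6)."""
    estimates = np.asarray(estimates, dtype=float)
    return np.sqrt(np.mean((estimates - true_value) ** 2))

def bias(estimates, true_value):
    """Mean signed error over R replications: formula (7)."""
    estimates = np.asarray(estimates, dtype=float)
    return np.mean(estimates - true_value)
```

Note that bias cancels across over- and under-estimates, while RMSE does not; symmetric errors around the true value therefore give zero bias but nonzero RMSE.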

Data Analysis
The data analysis was performed using the R statistical program (R Core Team, 2015). Different R packages were loaded to carry out the analyses. First, the "truncnorm" package developed by Trautmann et al. (2014) was used when the d matrices were derived. This package was preferred for controlling the upper and lower bounds of the generated population threshold parameters. In this way, it was ensured that the difference between successive threshold parameters did not fall below 0.3, and model-data misfit was prevented. The other population parameters were obtained by using the "rnorm" command in R.
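The constraint that successive thresholds stay at least 0.3 apart can also be enforced by simple redrawing, as sketched below in Python (an illustrative stand-in for the truncated-normal approach the study took in R; all names are our own):

```python
import numpy as np

def draw_thresholds(means, sd=0.1, min_gap=0.3, rng=None):
    """Draw one set of ordered threshold parameters, redrawing until
    every adjacent pair is at least `min_gap` apart -- the constraint
    the study enforced to prevent model-data misfit."""
    if rng is None:
        rng = np.random.default_rng()
    while True:
        b = rng.normal(means, sd)
        if np.all(np.diff(b) >= min_gap):
            return b

# Lower-ability-group threshold means: -1.5 to 1.5 in 1-point steps.
means = np.array([-1.5, -0.5, 0.5, 1.5])
b = draw_thresholds(means, rng=np.random.default_rng(3))
```

With threshold means 1 point apart and a standard deviation of 0.1, violations of the 0.3 gap are extremely rare, so the loop almost always terminates on the first draw.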
Later, the "mirt" package developed by Chalmers (2012) was used. Response matrices were produced with the "sim" command, and calibration was performed with the "mirt" command. Finally, the ability parameters were estimated using the "fscores" command. For the scaling, the "plink" package developed by Weeks (2010) was utilized. Each analysis was replicated 50 times. Then, the error and bias values of the parameters obtained from the 50 replications were compared across the conditions tested with analysis of variance (ANOVA), and a 2 × 2 × 2 factorial ANOVA was used to examine the interactions among the conditions.

RESULTS
This section presents the amount of error (RMSE), the bias values, and the results of the factorial ANOVA for each research question as the major findings of the study. The common item set was excluded from the calculation of the error and bias values. In addition, the values for the dichotomous and polytomous items were calculated separately. Note also that the error and bias values were calculated separately for each of the three dimensions. The values are presented in Table 1. Table 1 shows that the common item structure had a notable effect on some of the estimates. The error values in the threshold parameters of the polytomous items for the first dimension were higher in cases where mixed-format common item sets were used. Under all conditions, the a parameters of the polytomous items showed higher error where mixed-format common item sets were used. In addition, for the threshold parameters under scale shrinkage and MHRM, the mixed-format common item structure elicited more errors. In the third dimension, the error and bias values for mixed-format common items were lower, except for the a parameter of the polytomous items.
The scale shrinkage effect was examined next. For the first dimension, it was found that the amount of error and bias obtained for the threshold parameters in the tests using dichotomous common items was higher when the scale shrank. With the mixed-format common item structure, the MHRM estimation method, and scale shrinkage, the amount of error was lower for all the item parameters. For the second dimension, the bias values for the item parameters under the conditions using EM cycles and only dichotomous common items were lower with no scale shrinkage. Similarly, the bias values of the ability parameters for the datasets using mixed-format common items were also lower when the scale did not shrink. In the third dimension, when EM cycles and dichotomous common items were used, the error and bias values of the item parameters were generally lower in the cases of no scale shrinkage.
Regarding the estimation method, with no scale shrinkage the error and bias values for the a and b parameters of the dichotomous items in the first dimension were lower under the EM estimation method. With scale shrinkage, on the other hand, only the error values were lower under EM. In addition, the error and bias values obtained for the a parameters of the polytomous items, for the data in which only dichotomous items were included in the common item set, were lower with the EM estimation method. The values for the second dimension showed changes similar to those in the first dimension. Unlike the first dimension, for this dimension the bias values of the ability parameters were lower when the mixed-format common item structure was used and there was no scale shrinkage. Finally, the findings for the third dimension showed that EM cycles produced lower error and bias values for all the item parameters when there was no scale shrinkage and a dichotomous common item structure was used.
Later, the 2 × 2 × 2 factorial ANOVA results were examined to see whether the observed differences in the bias and error values were significant, and whether there was an interaction between the conditions investigated. The results are presented in Table 2.
Regarding the interactions for the first dimension, there was a significant interaction between the CIF and SS conditions for the bias values of the ability parameters (p < .05). According to the analyses performed to test whether the interactions of the CIF and EM conditions were meaningful, these two conditions interacted for the bias values of the threshold parameters of the polytomous items (p < .05). Finally, when the interactions among the three conditions were examined, a meaningful three-way interaction was found for the error values of the a parameters of the dichotomous items (p < .05). The second and third dimensions showed similar results. When all the results are considered together, it can be seen that, in addition to a clear effect of the common item format, the estimation method had an effect for at least some dimensions. Although some interactions were observed, they did not come together to provide a clear overall picture.

DISCUSSION
The findings showed that the common item structure significantly affected the amounts of error and bias obtained from the vertical scaling process. Specifically, for mixed-format tests, a common item set containing only dichotomous items produced higher amounts of error, with few exceptions. As stated by Kolen and Brennan (2004), the common item set needs to be a "mini version" of the total test in terms of content and statistical properties. This means that when polytomous items are placed in the common item set, the set becomes more similar to the total test, and this positively affects the scaling results.
In light of these findings, it is suggested that test developers prefer common item sets containing mixed-format items when vertical scaling is performed, even if this involves some practical difficulties, such as higher costs and a limited number of available polytomously scored items. Moreover, since this study was conducted using simulated data, caution should be exercised when generalizing to operational testing applications. In future studies, it is suggested that the current study be replicated using real data.