The Comparison of the Equated Tests Scores by Various Covariates using Bayesian Nonparametric Model

This research is based on obtaining equated scores by using covariates in the Bayesian nonparametric model. As covariates in the study, gender, mathematics self-efficacy scores, and common item scores were used. The distributions were obtained for all score groups. Hellinger Distance was calculated to obtain the distances between the distributions of equated scores by using covariates and the distribution of the target test scores. These distances were compared with the distributions of equated scores obtained from methods based on Item Response Theory. The study was conducted on Canadian and Italian samples of Programme for International Student Assessment (PISA) 2012. PARSCALE and IRTEQ were used for classical methods, and R was used for Bayesian nonparametric model. When gender, mathematics self-efficacy scores, and common item scores were used as covariates in the model, distance values of obtained equated scores to target test scores were close to each other, but their distributions were different. The closest distribution to target test scores was achieved when gender and mathematics self-efficacy scores were used together as covariates in the model, and the farthest distributions were obtained from item response theory methods. As a result of the research, it was determined that the model is more informative than the classical methods.


INTRODUCTION
It is very important to compare the scores of the individuals evaluated by the tests. Equating is used to compare the scores obtained from different test forms that serve the same purpose. One of the most important steps of equating is the selection of the equating method, which differs regarding the use of common items or common individuals. The methods involving common individuals can be classified as single group design, counterbalanced design, and equivalent group design, whereas the method involving common items in non-equivalent groups is named Non-Equivalent groups with Anchor Test (NEAT) (Branberg & Wiberg, 2011). NEAT is used when there is no chance of administering another questionnaire and the data required to reveal the difference between the groups are obtained from common items/tests (Moses, Deng, & Zhang, 2010). The selection of the common test is crucial in the design: the selected test should have a mean and item difficulty similar to those of the tests in question and should represent these tests in terms of content (Dorans, Moses, & Eignor, 2010; Kolen, 1988; Kolen & Brennan, 2014; Mittelhaeuser, Beguin, & Sijtsma, 2011; Wei, 2010). The common test should be unidimensional, should have a high correlation with the scores of the other tests to be equated, and should reflect the exact structure of the test forms (Wallin & Wiberg, 2017). In addition, the use of common tests that address the trends over time in NEAT design may be appropriate only for certain individuals, which may create a bias for equating. If the common tests/items fail to satisfy these conditions, covariates can be used in place of the anchor test, as in the Non-Equivalent groups with Covariates (NEC) design. The most important assumption of the NEC design is that the covariates are able to explain the difference between groups. A key requirement of this design is that the conditional distributions of the test scores, given the covariate categories, are the same in both groups. This indicates that the achievement of individuals is evaluated according to their categorical characteristics.
However, if the test scores to be equated were obtained at different time periods (i.e., equating a new test with an old test), this assumption may not be valid because the characteristics of the test scores and the covariates may have changed over time (Wiberg & Branberg, 2015). Although many researchers have described covariates in different terms, they emphasized that these variables are related to test scores and can explain the difference between groups (Branberg & Wiberg, 2011; Liou, 1998; Liou et al., 2001; Wiberg & Branberg, 2015; Wright & Dorans, 1993). In the literature, variables such as age, gender, and educational status have been included as covariates (Branberg & Wiberg, 2011; Gonzalez, Barrientos, & Quintana, 2015a; Liou et al., 2001; Wiberg & Branberg, 2015). The accuracy of the prediction may increase as more covariates are added, which makes the number of covariates an important choice. Another important issue is the number of covariate categories. As the number of covariate categories increases, the number of individuals falling into each category may decrease. Therefore, limiting the number of covariate categories will give more appropriate results and will strengthen the prediction (Wallin & Wiberg, 2017; Wiberg & Branberg, 2015).
Equating methods are based on various theories and assumptions, which are classified in the literature under Classical Test Theory and Item Response Theory (IRT). In recent years, however, the Bayesian approach has come to the fore in test-equating studies.

Bayesian Approach
In the classical approach, the p-value is used to test the significance of null hypotheses, and it varies according to the sample and the purpose of the researcher (Berger, Boukai, & Wang, 1997; Kruschke, 2010; Kruschke, Aguinis, & Joo, 2012; Lee & Boone, 2011; Rouder, Morey, Speckman, & Province, 2012). This can be considered a disadvantage because point estimation affects the outputs in terms of reaching an accurate result. The credible interval used in the Bayesian approach carries more information than a point estimate. Posterior inferences generated by the Bayesian approach can be summarized with the mean and the 95% highest density interval (HDI). The points falling within this interval are more credible than the points outside it (Kruschke, 2010). The parametric Bayesian approach works with a fixed number of parameters, but it has some limitations, whereas the flexible use of the number of parameters in the models constitutes the basis of the Bayesian nonparametric approach (De Iorio, Müller, Rosner, & MacEachern, 2004; Müller & Quintana, 2004; Orbanz & Teh, 2010; Shah & Ghahramani, 2013). The Dirichlet Process (DP) model is one of the models that have a central role in Bayesian nonparametric approaches (De Iorio et al., 2004; Gonzalez et al., 2015a; Petrone, 1999a). This model allows the inclusion of covariates in the equating process. The random effect of the variables on the distribution of the test scores appears as dependency, which is explained by the Dependent Dirichlet Process (DDP), an extension of the DP model (Barrientos, Jara, & Quintana, 2016; MacEachern, 1999, 2000). However, the selection of prior distributions in Bayesian nonparametric approaches is usually very difficult. Petrone (1999a, 1999b) suggested using the Bernstein-Dirichlet Prior (BDP) model to eliminate this limitation. Barrientos et al. (2016) expanded the model further and developed the Dependent Bernstein Polynomial Process (DBPP) model. Barrientos et al. (2012, 2016) discussed two specific types of DBPP. In this study, the DBPP involving a dependent stick-breaking process with common weights and predictor-dependent support points was employed. This type is called single-weight DBPP (wDBPP). Here, Z represents the covariate space, and F_z represents the covariate-dependent random probability distributions.
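The 95% HDI mentioned above can be approximated directly from posterior draws. The following is a minimal sketch, not the procedure used in the study: the function name `hdi` and the Beta-distributed draws are hypothetical stand-ins for real MCMC output, and the interval is found as the narrowest window that contains 95% of the sorted draws.

```python
import random

def hdi(samples, mass=0.95):
    """Return the shortest interval that contains `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, int(round(mass * n)))  # number of draws inside the interval
    # Slide a window of k consecutive sorted draws and keep the narrowest one.
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]

# Hypothetical posterior draws on the (0, 1) score scale (assumption, not study data).
rng = random.Random(0)
draws = [rng.betavariate(2, 8) for _ in range(20000)]

lo, hi = hdi(draws)
posterior_mean = sum(draws) / len(draws)
```

For a skewed posterior such as this one, the HDI is never wider than the equal-tailed interval with the same mass, which is one reason it is often preferred as a summary.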
In this study, the accuracy of the predictions and their contribution to the test equating process were analyzed by comparing the equated scores obtained from the Bayesian Nonparametric (BNP) model with various covariates in the NEC design.

METHOD
The research was conducted with real data. The distribution of equated scores obtained from the scaling methods based on IRT was compared with the distributions of equated scores obtained from the BNP model.

Sample
The data used in the research were obtained from PISA 2012. In order to carry out the equating process in non-equivalent groups, two countries with different success levels were selected. According to the PISA 2012 mathematics results, the data of Canada, which was ranked 13th with an average score of 518, and Italy, which was ranked 32nd with an average score of 485, were taken from the database published by OECD (http://www.oecd.org/pisa/data). The records with missing data were removed, and Italian data with a sample size of 908 and Canadian data with a sample size of 931 were used in the analysis.

Data Collection Tools
In PISA 2012, a cognitive test measuring students' mathematics literacy and a student questionnaire were used. The data of the research is comprised of the Italian students' responses to booklet 5 and Canadian students' responses to booklet 6 of the mathematics sub-test. Booklets 5 and 6 were selected to be used in the research because of the equal number of math questions and the high number of common items. There were 12 common items in the booklets.
Gender and mathematics self-efficacy score (MATHEFF) were used as covariates in the analysis, where gender is a two-category variable and MATHEFF is a continuous variable. In addition, the anchor item scores were taken as a covariate in the BNP model. MATHEFF was used because it is defined as a variable that explains mathematics achievement (Ayotola & Adedeji, 2009; Hackett & Betz, 1989; Koğar, 2015; Schulz, 2005; Siegle & McCoach, 2007; Thien & Darmawan, 2016). This variable was derived from the sum of the item scores, where a higher score indicates lower self-efficacy. MATHEFF scores varied between 8 and 32. Because the model works with scores in the 0-1 range, MATHEFF scores were also converted into the 0-1 range, so that differences are expressed within one unit.
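The conversion of MATHEFF to the 0-1 range can be illustrated with a min-max rescaling. This is a sketch under the assumption that a simple linear transformation was used; the function name `rescale_unit` is hypothetical, and only the bounds 8 and 32 come from the text.

```python
def rescale_unit(score, low=8, high=32):
    """Map a MATHEFF sum score from [low, high] onto [0, 1] linearly."""
    if not low <= score <= high:
        raise ValueError("score outside the expected MATHEFF range")
    return (score - low) / (high - low)

# Lowest, middle, and highest possible sum scores.
examples = [rescale_unit(s) for s in (8, 20, 32)]
```

Because higher MATHEFF sum scores indicate lower self-efficacy, values near 1 after rescaling correspond to the least self-efficacious students.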
Regarding the equating studies performed in non-equivalent groups, the number of common items in the tests should be equal to at least 20% of the number of questions to minimize the equating error (Angoff, 1971). The study was carried out with 24 items in the NEC design, and the total score of the common items was used as the covariate. In the NEAT design, 12 items were taken as external common items, and the study was carried out with 36 items. To prevent them from affecting the model as a different criterion, partially scored items in the booklets were converted into two-category scores.

Data Analysis
In the research, IRT-based scale conversion methods and the analyses using the BNP model were carried out separately. First, unidimensionality and local independence were tested for IRT. Factor 10.3 software was used to test unidimensionality, which was analyzed over the 36 items. The unidimensionality of the 36-item booklets was taken as evidence of the unidimensionality of the 24-item version. As a result of the factor analysis, the Kaiser-Meyer-Olkin (KMO) value of booklet 5 was found to be .95, whereas Bartlett's test statistic was 7086.60 (df = 630, p < .001). Regarding booklet 6, the KMO value was .94 and Bartlett's test statistic was 6427.00 (df = 630, p < .001). The KMO values indicated the sufficiency of the sample sizes for the analysis, and Bartlett's test indicated the factorizability of the data set. Based on these values and the factor analysis results, the tests were considered unidimensional.
The unidimensionality of the booklets provided insight about local independence assumption. Moreover, in order to test the local independence assumption, the correlation between the items was calculated for the top and bottom 27% of the data (Kelley, 1939). The correlation between the top and bottom groups was found to be lower than the overall correlation; therefore it was concluded that the local independence assumption was met.

Parameter estimation
The two test forms to be scaled in the study are parallel. The parameters obtained from these forms were estimated from different individuals, and the mean and standard deviations of the groups were different; therefore the estimations were made using separate calibration methods.
Equating in the NEAT design was performed using ability parameters. The -2 log-likelihood values obtained for the two-parameter logistic model (2PLM) and the three-parameter logistic model (3PLM) were compared with a chi-square test, and the difference was significant in favor of the 3PLM. Therefore, the parameters were estimated according to the 3PLM. The PARSCALE 4.1 program was used in the estimation of item parameters.
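The comparison of the two -2 log-likelihood values described above is usually carried out as a likelihood-ratio test for nested models. The notation below is an assumption introduced for illustration (the symbols q and G² do not appear in the source), and the degrees of freedom depend on how many parameters the 3PLM adds, typically one pseudo-guessing parameter per item.

```latex
G^{2} = \bigl(-2\ln L_{\mathrm{2PLM}}\bigr) - \bigl(-2\ln L_{\mathrm{3PLM}}\bigr)
\;\sim\; \chi^{2}_{df},
\qquad
df = q_{\mathrm{3PLM}} - q_{\mathrm{2PLM}}
```

Here q denotes the number of estimated item parameters in each model. A value of G² larger than the critical chi-square value indicates that the 3PLM fits significantly better than the 2PLM.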

Scale conversion
Common items were taken as external common items in the NEAT design to allow a comparison with the NEC design. IRTEQ software was used to place the parameters obtained from PARSCALE on the same scale. Since IRT true-score equating is more accurate and precise (Li, Jiang, & von Davier, 2012), this process was carried out on true scores. In the study, booklet 6 was taken as the target test, whereas booklet 5 was taken as the base test.
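The true score used in IRT true-score equating is the expected number-correct score at a given ability. The sketch below shows this quantity under the 3PLM. It is an illustration only: the item parameters are hypothetical, the scaling constant D = 1.7 is an assumption, and the code does not reproduce what IRTEQ does internally.

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def true_score(theta, items, D=1.7):
    """Expected number-correct score: sum of item probabilities at ability theta."""
    return sum(p_3pl(theta, a, b, c, D) for a, b, c in items)

# Hypothetical (a, b, c) parameters for three target-test items (assumption).
items_y = [(1.0, -0.5, 0.20), (1.2, 0.0, 0.25), (0.8, 0.7, 0.15)]
tau_low = true_score(-1.0, items_y)
tau_mid = true_score(0.0, items_y)
tau_high = true_score(1.0, items_y)
```

In true-score equating, the ability that yields a given true score on the base form is located first, and the true score on the target form at that same ability is taken as the equated score.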

Test equating by Bayes nonparametric approach
In order to make accurate statistical predictions, the Markov chain Monte Carlo (MCMC) sampling method was used to obtain a sample representing the population (Kruschke, 2015; StataCorp, 2015). In this study, the MCMC method was used to estimate the population parameters (k, γ, w) of the BNP model. General information about the population can be obtained using covariates. MCMC processes were performed separately for the Canada and Italy data sets. The covariates and the parameters compatible with the data were combined in the files prepared for MCMC sampling with the DBPP.
The covariates used in the research were added to the model as anonymous priors. This fact prevented the bias that may arise from the effects of these variables on the posterior distributions of the scores and ensured a more objective evaluation.
The scale matrix A belongs to a p-dimensional inverted-Wishart distribution with given degrees of freedom. The values that Gonzalez et al. (2015a) found to be significant in their study were also adopted in their 2015b study; therefore, the same values were used for the hyperparameters when generating the prior distribution. The MCMC algorithm was run to explore the posterior distribution of the wDBPP model and to obtain posterior samples of all model parameters.
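Posterior inference: All computations were coded and performed in R 3.2.1 statistical software. With $\Theta$ denoting the full set of model parameters, $y$ the observed scores, and $z$ the covariates, the posterior distribution is $p(\Theta \mid y, z)$, and the posterior predictive distribution of a new score $y^{*}$ was given by $p(y^{*} \mid y, z) = \int p(\Theta \mid y, z)\, p(y^{*} \mid \Theta)\, d\Theta$, where the integral shows the sum obtained over the identified distributions.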
The number of iterations was first set to 5000 to test the parameters in the generated files. Then, the number of MCMC iterations was set to 150 000, and the analyses were repeated 10 times for each file in order to obtain a proper distribution. The analyses of the test forms were carried out simultaneously and took around 10 hours and 23 minutes for each file.
The Gibbs and Metropolis-Hastings sampling algorithm, which was used to explore the posterior distribution obtained by combining the covariates with the model in the MCMC files, updated each block of parameters in turn. For each block, a candidate value is drawn from a proposal distribution conditional on the current value and is evaluated with the Metropolis-Hastings ratio; if the candidate is reasonable, it is accepted; if not, it is rejected and the current value is retained, and the process continues until the chain settles on appropriate values. This step was applied to three parameter blocks: one with 10 values, the γ block with 20 values, and one with a single value.
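The accept/reject step described above can be sketched with a generic random-walk Metropolis-Hastings sampler. This is an illustration under stated assumptions, not the wDBPP sampler used in the study: the target is a standard normal density chosen only because its mean and variance are known, and the function names, step size, and burn-in length are hypothetical.

```python
import math
import random

def metropolis_hastings(log_target, start, n_iter, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings for a one-dimensional target density."""
    rng = random.Random(seed)
    current = start
    current_lp = log_target(current)
    draws, accepted = [], 0
    for _ in range(n_iter):
        # Symmetric Gaussian proposal centred on the current value.
        candidate = current + rng.gauss(0.0, step)
        candidate_lp = log_target(candidate)
        # Accept with probability min(1, target(candidate) / target(current)).
        log_ratio = candidate_lp - current_lp
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current, current_lp = candidate, candidate_lp
            accepted += 1
        draws.append(current)  # a rejected candidate repeats the current value
    return draws, accepted / n_iter

def log_std_normal(x):
    """Log-density of a standard normal, up to an additive constant."""
    return -0.5 * x * x

draws, acc_rate = metropolis_hastings(log_std_normal, start=3.0, n_iter=20000)
kept = draws[2000:]  # discard an initial burn-in segment
```

In a Gibbs-within-Metropolis scheme, a step of this kind is applied to one parameter block at a time while the other blocks are held at their current values.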
After completing this stage, the equated scores were obtained using cumulative distributions of the test scores.
The transformation functions are defined in terms of the following quantities: T is the score distribution, x represents the scores obtained from test X, y represents the scores obtained from test Y, and z represents the covariates. The analyses conducted to obtain equated scores were completed in 7 days and 6 hours. The equating process was completed by placing the generated profile distributions into the percentiles determined for the covariate categories.
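Obtaining equated scores from cumulative distributions follows the equipercentile idea: a score on test X is mapped to the score on test Y that has the same cumulative probability. The sketch below uses empirical distributions and hypothetical scores as an assumption; in the study the cumulative distributions come from the fitted BNP model for each covariate profile z, so this is not the authors' implementation.

```python
from bisect import bisect_right

def equate(x, scores_x, scores_y):
    """Map x to the Y score with the same cumulative probability (equipercentile)."""
    xs, ys = sorted(scores_x), sorted(scores_y)
    rank = bisect_right(xs, x)  # number of X scores <= x, so F_X(x) = rank / len(xs)
    # Smallest Y score whose cumulative proportion is at least F_X(x):
    # index = ceil(rank * n_y / n_x) - 1, computed with integer arithmetic.
    idx = -(-rank * len(ys) // len(xs)) - 1
    return ys[min(len(ys) - 1, max(0, idx))]

# Hypothetical scores on the (0, 1) scale for one covariate profile (assumption).
scores_x = [0.1, 0.2, 0.3, 0.4, 0.5]
scores_y = [0.2, 0.4, 0.6, 0.8, 1.0]
equated = equate(0.3, scores_x, scores_y)
```

To condition on covariates, the same mapping would be applied separately within each covariate profile, using the X and Y score distributions of that profile.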
DBPP model defines continuous distribution functions in (0-1) range. Therefore, the score estimations were made in this range as Gonzalez et al. (2015b) have done in their study. After equating, the scores were converted to the scale-of-100 so that the highest score will be 100. This is considered as the best scaling method in equating studies involving the tests with different ranges (Livingston, 2004). Therefore, the continuous variables used in the distributions were converted and analyzed in (0-1) range, then the graphics and distributions obtained for equated scores were converted to the scale-of-100 and interpreted.

Comparison criteria
In traditional equating methods, standard criteria such as Root Mean Square Error (RMSE), Mean Square Error (MSE), bias, and standard errors (SE) are used to assess parameter estimation error. However, it is difficult to compare the results obtained by methods based on different models such as IRT and BNP (Wiberg & Gonzalez, 2016). Therefore, in this study, the comparison of the results using criteria such as RMSE and MSE was not possible. Instead, the Hellinger distance was used to compare the equated scores obtained by the BNP and IRT methods with the target test's scores. This distance aggregates the pointwise differences between two distributions. There are many forms of the Hellinger distance; the form used here to compute the distance between two distributions f and g follows Boone, Merrick, and Krachey (2012). The distances between the distributions of the scores were computed according to this method, and the distributions are shown graphically in the results section.
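One common discrete form of the Hellinger distance can be sketched as follows. Since the text notes that several forms exist and the exact formula used in the study is not reproduced here, this version (the square root of one minus the Bhattacharyya coefficient, bounded between 0 and 1) is an assumption, and the binned densities are hypothetical.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same bins."""
    if len(p) != len(q):
        raise ValueError("both distributions must use the same bins")
    sp, sq = sum(p), sum(q)  # normalise so that each set of weights sums to 1
    bc = sum(math.sqrt((a / sp) * (b / sq)) for a, b in zip(p, q))
    return math.sqrt(max(0.0, 1.0 - bc))  # guard against tiny negative rounding

# Hypothetical binned score densities (assumption, not study data).
target = [1, 4, 9, 4, 1]
equated_scores = [1, 5, 8, 4, 1]
distance = hellinger(target, equated_scores)
```

A value of 0 means the two distributions coincide, and a value of 1 means they share no common support, so small values such as those reported in the results indicate closely matching distributions.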

RESULTS
In PISA 2012, the mean score and standard deviation of the 908 Italian students who answered booklet 5 were 51.51 and 20.72, respectively, whereas those of the 931 Canadian students who answered booklet 6 were 52.27 and 22.06, respectively.
The equating errors resulting from scaling according to IRT methods in the NEAT design were computed, and the score distributions obtained from the various methods were analyzed.
The responses given by the two non-equivalent groups to the two booklets were used for scaling, and RMSE values were calculated. The lowest error was obtained from the Mean-Sigma method and the highest error from the Stocking-Lord method. New ability parameters were computed, and the item parameters of the target test were used for finding true scores. The probability density distributions of each method and their distances from the target test were calculated using Hellinger distances. Regarding the probability density distributions of the predicted scores in Figure 1, the distributions of the scores were observed to be similar and to lie at approximately similar distances from the target test's distribution according to the Hellinger distance. Although the Mean-Sigma method gave the lowest RMSE, the distributions obtained from the characteristic curve methods were closer to the distribution of the target test. According to the Hellinger distance, the Stocking-Lord method produced the closest distribution, with a distance of 0.029714.
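The Mean-Sigma method referred to above derives the linear scale transformation from the means and standard deviations of the common-item difficulty estimates. The sketch below shows the standard textbook form; the difficulty values are hypothetical, and the function names are introduced only for illustration rather than taken from IRTEQ.

```python
from statistics import mean, pstdev

def mean_sigma(b_base, b_target):
    """Slope A and intercept B that place base-form parameters on the target scale."""
    A = pstdev(b_target) / pstdev(b_base)
    B = mean(b_target) - A * mean(b_base)
    return A, B

def rescale_item(a, b, A, B):
    """Transform a discrimination and a difficulty onto the target scale."""
    return a / A, A * b + B

# Hypothetical common-item difficulties estimated in each group (assumption).
b_base = [-1.0, 0.0, 1.0]
b_target = [-1.5, 0.5, 2.5]
A, B = mean_sigma(b_base, b_target)
a_new, b_new = rescale_item(1.2, 0.0, A, B)
```

Abilities are transformed in the same way as difficulties, that is, A times the base-scale ability plus B. Characteristic curve methods such as Stocking-Lord instead choose A and B by minimizing differences between test characteristic curves.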

The distance between the distribution of equated scores obtained by using gender as covariate in the BNP model and the distribution of target test's scores
Gender was taken as covariate, and students' scores were gathered with this variable. Distributions were first examined according to the booklets. Figure 2 shows the distribution of the scores and confidence intervals that best reflect the population for each gender.
Figure 2. Score distributions by gender: female students for booklets 5 and 6; male students for booklet 5; male students for booklet 6. Note. The confidence interval for female students is shown in red because it was very narrow.
The distributions were observed to be similar. In particular, the distribution of female students was the same for both booklets. The accuracy of the score estimation was checked through confidence intervals. The confidence intervals of female students' score distributions were found to be quite narrow, whereas male students' confidence intervals were wide, which may indicate uncertainty in the estimation of these scores. The decrease in accuracy may be due to the low number of students in the sample used for the estimation of scores, or to the fact that the scores of students with the same profile were distributed over a wide range. The score equated with the gender covariate was calculated for each student. The distributions of the equated scores and the target test's scores were compared, and the distance between these distributions was calculated with the Hellinger distance. As can be seen from Figure 3, the distribution of equated scores was sharper than the distribution of the target test's scores. The distance between these two curves was 0.00532, which was approximately one-fifth of the distance obtained by IRT methods.

The distance between the distribution of equated scores obtained by using MATHEFF as covariate in the BNP model and the distribution of target test's scores
MATHEFF was taken as the covariate, and students' scores were associated with this variable. The score distributions that best reflect the population according to MATHEFF levels were computed. The distributions of scores at different MATHEFF levels were analyzed according to booklets. The test score distributions of booklets 5 and 6 according to the MATHEFF levels of the students were similar; therefore, they are shown in a single graph in Figure 4. As students' self-efficacy levels decrease (that is, for higher values of MATHEFF), the density of their scores decreases. Based on these distributions in each profile, it was also possible to see at which scores the students' distribution changed and how this change was affected for both booklets. In the BNP model, the distribution of equated scores was very close to the distribution of the target test's scores. The Hellinger distance was calculated as 0.005337. This distance is considerably lower than the distance obtained from IRT methods and is nearly identical to the distance of the model using gender (0.00532).
Compared to the BNP model using gender, the distributions were observed to approach and differentiate from the target test at different points. In the model using MATHEFF, the distribution of equated scores moved away from the target test at the ends, whereas in the model using gender, the distribution of equated scores differed from the target test in average values.

The distance between the distribution of equated scores obtained by using both gender and MATHEFF as covariates in the BNP model and the distribution of target test's scores
Students' MATHEFF scores were examined according to gender. The distributions obtained for female students were similar to those for male students in booklets 5 and 6; therefore, a single set of graphs is shown for both genders. Figures 6 and 7 show the distributions of the students for booklets 5 and 6. Regarding booklet 5, the density of high scores decreased for students of both genders with low mathematics self-efficacy. In booklet 6, the students of both genders with low mathematics self-efficacy were clustered around a score of 20. As can be seen from these distributions, the density of students around high scores decreased as MATHEFF scores increased, that is, as mathematics self-efficacy levels decreased.
So, it can be concluded that booklet 6 was easier than booklet 5 for both female and male students. In addition, the differentiation of the distributions between booklets may indicate that using these two covariates was effective in revealing the differences between the booklets. Equated scores were obtained using the cumulative forms of the distributions generated by combining covariates and individuals' scores. The probability distributions of the equated scores and the target test were examined together in Figure 8. The distribution of equated scores was very close to the target test when both covariates were included in the model; the Hellinger distance was also relatively small (0.002107) compared to the other models. From Figure 8, it can be seen that the equated scores obtained by using two covariates got closer to the target test. In particular, the closeness of the distributions at the extreme values might indicate that the model could be used to tolerate the error in extreme values.

The distance between the distribution of equated scores obtained by using common items as covariate in the BNP model and the distribution of target test's scores
In the first part of the study, equated scores were obtained from common items according to IRT scaling methods. In this section, the scores obtained from the sum of common items were used as a covariate. The distributions obtained from the combination of student scores and covariates are shown in Figures 9 and 10. In order to check whether common items reflect the tests or not, the correlation between common test scores and test scores was examined. These correlations were found to be .79 for booklet 5 and .75 for booklet 6. Accordingly, it can be said that common items represent the tests statistically.
According to Figure 9, if common item scores were not included in the model as covariate or they contributed to the model with very low scores in booklet 5, the density of students was observed to increase on average scores and densities towards the end scores decreased. With the increase of common item scores, the shapes of distributions differed from first distributions, and it was observed that low score densities decreased and high score densities increased.
Regarding Figure 10, which shows the analysis results for booklet 6, if common item scores were not included in the model as a covariate or contributed to the model with very low scores, students are concentrated around the mean. The distributions of students were quite similar for other score levels. Therefore, regarding the individuals with other scores than low common item scores, the distributions are similar for both booklets. The differences in common item scores failed to explain the difference in the math achievement of the students. Booklet 6 was observed to be easier than booklet 5.

Equated scores were obtained according to common item scores of students. The probability distributions of these scores and target test were examined together, and their distributions are given in Figure 11.

Figure 11. Distribution of the Target Test's Scores and The Scores Obtained from BNP Model with Common Items
The Hellinger distance between the distribution of equated scores obtained by using common items as the covariate and the distribution of target test scores was calculated as 0.006313. This distance was smaller than that of the IRT methods, but it was greater than the values obtained from the BNP models with other covariates. The distribution of equated scores obtained using common items is similar to the distribution of the equated scores obtained using gender. Both distributions diverged from the target test's distribution at the ends. Although the numerical Hellinger distances alone were not sufficient to characterize the differences, the shapes of the distributions supported the information given about them.

DISCUSSION and CONCLUSION
In this study, equated scores were computed using the BNP model, bringing a different perspective than classical methods. Gender, mathematics self-efficacy scores, and the sum of common items scores were used as covariates. Equated scores were computed for different covariates, and the distances between these scores' distributions and the distribution of the target test's scores were examined. The explanation of mathematics achievement by the variables and the differences between booklets were interpreted using the BNP model. The results obtained from IRT and BNP models and their interpretation are given below.
The scores taken from common items were considered as the external common test in IRT equating methods; the minimum error was obtained from Mean-Sigma method, whereas the maximum error from the Stocking-Lord method. Therefore, it was concluded that external common items caused more error than moment methods in reducing the difference between items' characteristic curves; and the difference between the discriminant parameters obtained from common tests applied to the groups was less than the difference in characteristic curves. Regarding the distances between the distribution of true scores obtained by IRT scaling methods and distribution of target test's scores, the closest distribution was obtained from Stocking-Lord method. This fact can be expressed as that Stocking-Lord method produced closer values, even though it generated more erroneous predictions than other IRT methods.
In the BNP model, similar score distributions were obtained from female and male students for each booklet when gender was considered as the only covariate. Although gender was insufficient in showing the difference between the booklets, it was found that booklet 6 comprised easier questions than booklet 5. In spite of the similar distributions, the confidence intervals of male and female students' distributions were different. Since the same distributions were obtained for the students of both genders, it was concluded that gender has no significant effect on mathematics achievement (Hall & Hoff, 1988; Lindberg, Hyde, Petersen, & Linn, 2010; Thien & Darmawan, 2016).
In the BNP model, when MATHEFF was taken as the covariate, the distributions of the students with medium and high scores were similar. The distributions of both booklets varied according to the MATHEFF level; therefore, it was found that MATHEFF was effective on mathematics achievement. Thus, it can be concluded that MATHEFF explains mathematics achievement. The literature contains studies showing that MATHEFF explains mathematics achievement (Ayotola & Adedeji, 2009;Ding, 2016;Hackett & Betz, 1989;Koğar, 2015;Schulz, 2005;Siegle & McCoach, 2007;Thien & Darmawan, 2016). In traditional equating, if the knowledge of individuals is not included, score distributions of each student group would be considered to be the same. In this study, the differentiation in the score distribution of the students in various sub-groups was kept under control, and equated scores of each sub-group were computed. Regarding the model in which MATHEFF was used, it was concluded that the distribution of equated scores approaches the distribution of target test's scores.
The most important assumption of NEC design is that the distribution categories obtained from covariates should be the same for the sub-groups (Wiberg & Branberg, 2015). The differences between booklets can be observed using this assumption. Since MATHEFF distributions were similar in both booklets, it was concluded that either this variable could not fully explain the difference between booklets, or the booklets were very similar. However, even in this case, it could be said that booklet 5 contained more difficult questions than booklet 6.
When both MATHEFF and gender were used as covariates in the BNP model, the information obtained from the model was more detailed than the models with a single covariate. If two covariates are used in the model, it is possible to distinguish the variables affecting the distributions of students' mathematics achievement and the magnitude of this effect. The distributions in booklets were the same for both genders. In our case, different distributions were obtained for different booklets and MATHEFF levels. The use of these variables together revealed that they could explain both the difference between booklets and mathematics achievement levels. The distribution of equated scores obtained using two covariates was observed to approach the distribution of target test's scores more than other models.
When the sum of common item scores was used as a covariate in the BNP model, only the distributions of low-score students varied, and the range was quite small. The distributions of medium- and high-score students were therefore observed to remain the same. In other words, it was concluded that common items were at the same level and uniform; otherwise they would change the distribution of test scores directly. The same result was obtained for both booklets. The correlation of common item scores was higher for booklet 5 and caused more distributional variations for this booklet. This showed that common items were more similar to the questions in booklet 5 and made more distinctions between the sub-groups with different scores in this booklet. Since the distributions obtained from common item scores did not differ significantly according to the booklets, it was concluded that common items do not adequately explain mathematics achievement. The distance between the distribution of the scores equated with common item scores and the distribution of the target test scores showed the effectiveness of the method, but using two covariates in the model was more effective. There are studies supporting the use of covariates for achieving better results in the equating process in cases where common items do not possess the properties required for equating or the assumptions of test equating are not satisfied (Dorans & Holland, 2000; Liou et al., 2001; Wright & Dorans, 1993).
When only MATHEFF or only gender was used as the covariate, the distributions did not differ significantly according to booklets. In the model where two covariates were used, distribution differences were observed according to booklets. In the model where the common item scores were used, distribution differences were observed in the low-score student group. This result suggested that in BNP models, common item scores explained the difference between the booklets more than MATHEFF scores did. Regardless of the covariate type used in the BNP models, booklet 6 was observed to be easier than booklet 5. Likewise, it is possible to say that the questions in booklet 5 were more distinctive.

Regarding the distributions of equated scores and the distances of these distributions to the target test, the comparison between IRT methods and BNP models was straightforward. The distributions of equated scores obtained from the BNP model were closer to the distribution of the target test; that is, the distances between the distributions of equated scores obtained from the BNP model and the distribution of the target test scores were smaller. The smallest distance was obtained from the BNP model using two covariates together. Therefore, it can be said that more precise estimates are obtained by using the BNP model. There are many studies supporting the view that the Bayesian method makes better predictions than classical methods and that it can be used to obtain a great deal of useful information (Kruschke et al., 2012; van de Schoot et al., 2013).
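To make the distance comparison concrete, the following is a minimal sketch of the Hellinger distance between two discrete score distributions, H(P, Q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2). It is written in Python for illustration only and is not the R code used in the study. The function name `hellinger` and the toy probability vectors are assumptions introduced here; they are not data or results from the PISA 2012 samples.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    p and q are sequences of non-negative weights over the same score
    points; each is normalised to sum to 1 before the distance is taken.
    The result lies in [0, 1]: 0 for identical distributions, 1 for
    distributions with no overlapping support.
    """
    if len(p) != len(q):
        raise ValueError("p and q must cover the same score points")
    if any(x < 0 for x in p) or any(x < 0 for x in q):
        raise ValueError("weights must be non-negative")
    total_p, total_q = sum(p), sum(q)
    if total_p == 0 or total_q == 0:
        raise ValueError("each distribution needs positive total mass")
    squared = sum(
        (math.sqrt(a / total_p) - math.sqrt(b / total_q)) ** 2
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / 2.0)


# Hypothetical toy distributions over five score points (illustration only).
target = [0.10, 0.20, 0.40, 0.20, 0.10]
equated_a = [0.12, 0.20, 0.38, 0.20, 0.10]  # close to the target
equated_b = [0.30, 0.30, 0.20, 0.10, 0.10]  # further from the target

print("A vs target:", hellinger(equated_a, target))
print("B vs target:", hellinger(equated_b, target))
```

A smaller value means the equated score distribution lies closer to the target distribution. Because the measure is bounded between 0 and 1, distances from different equating methods can be placed on a common scale.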
It was very difficult to compare BNP models that use different covariates on the basis of Hellinger distances alone. Even though the numerical values of the Hellinger distances between BNP models are not sufficient for decision making, the shapes of the distributions supported the information about the distance to the target test. Since the BNP model uses score distributions for equating, it does not impose restrictions such as requiring the same number of individuals in the basic test and the target test. Moreover, there is no need to limit the number of individuals in the sub-groups involved in the tests. In the study, the low number of individuals in some sub-groups and the inclusion of covariates in the model as missing information caused large confidence intervals. However, in spite of the large confidence intervals, BNP models would yield more useful and informative results.
Because the BNP model keeps group invariance under control, the irregularities and discontinuities of the distributions are eliminated. For this reason, there is no need for pre-smoothing, the selection of the bandwidth parameter, or the derivation of the standard error of equating used in other equating methods (Gonzalez et al., 2015b). This is an indication of the importance of the model.
In future research, the model may be used for test equating without any covariate. When a covariate is used in the model, a study can be carried out to identify items showing DIF (Differential Item Functioning) across the categories of the covariate(s). Equated scores can also be obtained with the model using different continuous and discrete covariates, such as socioeconomic status and age.