An Investigation of Group Invariance in Test Equating According to Gender

The aim of this study is to investigate whether the group invariance condition holds for Tucker and Levine observed score equating, two of the linear equating methods. The 4th and 6th booklets of the PISA 2012 Mathematics subtest were used. The booklets were equated for the whole group and for the gender sub-groups, and then the group invariance and WMSE values of each condition were calculated. Within this scope, the REMSD and RMSD(x) group invariance indexes were employed. When the WMSE values obtained with the two equating methods were compared, the Tucker observed score equating method produced the lowest error for both the whole group and the gender sub-groups. When the RMSD(x) and REMSD values obtained for the gender sub-groups were examined, the group invariance value was smaller than the criterion value for the Tucker equating method, while it was greater than the criterion value for the Levine equating method. In conclusion, the group invariance condition was met for Tucker observed score equating, but not for Levine observed score equating.


INTRODUCTION
PISA (Programme for International Student Assessment), which enables countries to compare their educational indicators, has been administered by the OECD every three years since 2000. PISA assesses the extent to which 15-year-old students are equipped with the basic mathematics, science and reading knowledge and skills they need to be a part of modern society. It aims to determine the extent to which students can use their knowledge and skills in real life, understand new situations, solve problems, make guesses about what they are unfamiliar with and make judgments. In PISA, students are required to take test item sets that consist of science, mathematics and reading items. The item sets are incorporated in 13 booklets, and there are some common items that link all the booklets (OECD, 2014). Therefore, the scores must be equated before scores obtained from different booklets can be compared.
Equating can be described as the statistical process that adjusts for the differences between test forms with the same content and difficulty level and enables the scores obtained from these forms to be used interchangeably (Kolen, 1988). The aim of test equating is to make sure that the difficulty of a test form does not create any advantage or disadvantage for the test taker. Certain conditions must be met in order to equate test forms. These conditions include equality, symmetry, group invariance and unidimensionality (Hambleton & Swaminathan, 1985). Among these conditions, group invariance means that the equating function is independent of the sub-groups, so that sub-group membership does not affect the equating (Kolen, 2004). For example, when two forms of a test are equated, the same equated scores are obtained for females and males only when the group invariance condition is met. When group invariance does not hold, students of different genders with the same skills can obtain different equated scores and are thus advantaged or disadvantaged because of their gender. In other words, it is fair to say that group invariance is related to equality and objectivity in assessment and evaluation (Dorans, 2004, 2008). In the literature, different group invariance criteria have been developed to assess the accuracy as well as the fairness of equated scores. These criteria are based on checking whether the principles of equating are satisfied (Petersen, Kolen & Hoover, 1989).
Test equating approaches are divided into two groups: traditional approaches and Item Response Theory approaches. Traditional equating methods include mean equating, linear equating and equipercentile equating (Kolen & Brennan, 2004). Mean equating is based on the assumption that test forms differ in difficulty and that this difference is constant across the whole scale. For example, under mean equating, the amount by which test takers in the upper group find form X easier than form Y is the same as for test takers in the lower group (Kolen & Brennan, 2014). The mean equating function is as follows:
\[ m_Y(x) = x - \mu(X) + \mu(Y) \tag{1} \]
where \mu(X) and \mu(Y) are the mean scores on forms X and Y.
If the score distributions of the reference form and the new form are not equal, the equipercentile equating method is used. Scores on forms X and Y that correspond to the same percentile rank are accepted as equal. Equipercentile equating consists of two steps. First, the cumulative frequencies of the two forms are tabulated. Second, the scores that correspond to the same percentile rank are equated. With the scores obtained via equipercentile equating, the score distributions of the new form and the reference form become similar (Livingston, 2004; Kolen, 1988).
Linear equating is used when the two test forms have the same features except for their means and standard deviations (Crocker & Algina, 1986; Kolen & Brennan, 2014). In other words, scores that correspond to the same standard scores (z scores) are accepted as equal. If the standard deviations of the test forms are equal, linear and mean equating yield the same results. If raw scores and equated scores are plotted on the same graph, their linear relationship can be seen. The linear equating function is presented in Equation 2:
\[ l_Y(x) = \sigma(Y)\left[\frac{x - \mu(X)}{\sigma(X)}\right] + \mu(Y) \tag{2} \]
where \mu(X), \mu(Y) are the means and \sigma(X), \sigma(Y) the standard deviations of the scores on forms X and Y.
In linear equating, if the groups that take the forms differ in skill, anchor items are used. Different linear equating methods have been developed to equate forms that have common items (Livingston, 2004).
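The mean and linear equating functions described above can be illustrated with a short Python sketch. It assumes the standard textbook forms of the two functions (a shift by the difference in means, and a match on z scores); the function names and the toy score lists are hypothetical and are not taken from the study data.

```python
from statistics import mean, pstdev


def mean_equate(x, scores_x, scores_y):
    """Mean equating: shift x by the difference between the form means."""
    return x - mean(scores_x) + mean(scores_y)


def linear_equate(x, scores_x, scores_y):
    """Linear equating: scores with the same z score are treated as equal."""
    mu_x, mu_y = mean(scores_x), mean(scores_y)
    sd_x, sd_y = pstdev(scores_x), pstdev(scores_y)
    return sd_y * (x - mu_x) / sd_x + mu_y


# Hypothetical toy data: form Y is uniformly one point easier than form X.
form_x = [10, 12, 14, 16, 18]
form_y = [11, 13, 15, 17, 19]

for raw in (10, 14, 18):
    print(raw, mean_equate(raw, form_x, form_y), linear_equate(raw, form_x, form_y))
```

Because the two toy forms have equal standard deviations, both functions return the same equated scores here, which mirrors the statement in the text.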

Linear Equating Methods for Non-Equivalent Groups
The Non-Equivalent groups Anchor Test (NEAT) pattern, also called the common items pattern, is used when it is not possible to administer more than one test form to the same group due to test reliability concerns. In the NEAT pattern, both forms incorporate some common items and the forms are administered to non-equivalent groups.
In linear equating for the NEAT pattern, the equating relationship is estimated over a single group formed by combining the non-equivalent Group 1 and Group 2. Braun and Holland (1982) and Angoff (1971) called this combined group the synthetic group. Within the synthetic group, Group 1 and Group 2 are weighted with w1 and w2.
Weighting follows two rules. The first is that the two weights sum to one (w1 + w2 = 1), and the second is that each weight is greater than or equal to zero (w1, w2 ≥ 0). Although equal weights (w1 = w2 = 0.5) are generally used, w1 = 1 and w2 = 0 are used when the synthetic group is defined only as the group that takes the new form (Topczewski, Cui, Woodruff, Chen & Fang, 2013; Kolen & Brennan, 2014). In this study, equal weights were used.
The linear equating function for the non-equivalent groups common items pattern, which is used to equate X observed scores to Y observed scores, is given in Equation 3 (s stands for the synthetic group):
\[ l_{Ys}(x) = \frac{\sigma_s(Y)}{\sigma_s(X)}\,[x - \mu_s(X)] + \mu_s(Y) \tag{3} \]
Here \mu_s(X) stands for the mean score of the new form in the synthetic group, \mu_s(Y) stands for the mean score of the reference form in the synthetic group, \sigma_s(Y) stands for the standard deviation of the reference form in the synthetic group, and \sigma_s(X) stands for the standard deviation of the new form in the synthetic group. The four parameters of the synthetic population in Equation 3 are expressed in terms of Group 1 and Group 2 by Equations 4, 5, 6 and 7.
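\[ \mu_s(X) = w_1\,\mu_1(X) + w_2\,\mu_2(X) \tag{4} \]
\[ \mu_s(Y) = w_1\,\mu_1(Y) + w_2\,\mu_2(Y) \tag{5} \]
\[ \sigma_s^2(X) = w_1\,\sigma_1^2(X) + w_2\,\sigma_2^2(X) + w_1 w_2\,[\mu_1(X) - \mu_2(X)]^2 \tag{6} \]
\[ \sigma_s^2(Y) = w_1\,\sigma_1^2(Y) + w_2\,\sigma_2^2(Y) + w_1 w_2\,[\mu_1(Y) - \mu_2(Y)]^2 \tag{7} \]
In the non-equivalent groups common items pattern, these parameters cannot all be calculated directly, since Group 1 (the new-form group) does not take form Y and Group 2 does not take form X. Therefore, some assumptions are required, depending on the equating method used (Kolen & Brennan, 2004).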
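As a minimal sketch of how the synthetic population parameters combine the two groups, the following Python function assumes the standard form of these equations: a weighted mean, and a weighted variance plus a term for the difference between the group means. The function name and the numbers in the example are hypothetical.

```python
def synthetic_moments(mu1, mu2, var1, var2, w1=0.5):
    """Mean and variance of one form in the synthetic group.

    mu1, var1: mean and variance of the form in Group 1
    mu2, var2: mean and variance of the form in Group 2
    w1: weight of Group 1; the weight of Group 2 is 1 - w1
    """
    if not 0.0 <= w1 <= 1.0:
        raise ValueError("w1 must lie between 0 and 1")
    w2 = 1.0 - w1
    mu_s = w1 * mu1 + w2 * mu2
    var_s = w1 * var1 + w2 * var2 + w1 * w2 * (mu1 - mu2) ** 2
    return mu_s, var_s


# Hypothetical values with equal weights, as used in the study.
mu_s, var_s = synthetic_moments(mu1=20.0, mu2=18.0, var1=16.0, var2=25.0)
print(mu_s, var_s)
```

With w1 = 1 the function simply returns the Group 1 moments, which corresponds to defining the synthetic group as the new-form group only.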
The linear equating methods used in the non-equivalent groups common items pattern include Tucker equating, Levine observed score equating, Levine true score equating, chained linear equating and Braun-Holland linear equating (Kolen & Brennan, 2014). Since each group can take only one form in the non-equivalent groups common items pattern, linear equating also requires strong statistical assumptions (Chen, Cui, Zhu & Gao, 2010). Since the Tucker and Levine observed score equating methods were used in this study, only these two methods are described here.

Tucker observed score equating
The Tucker method was described by Gulliksen in 1950 (Kolen & Brennan, 2014). The assumptions required for the Tucker observed score equating method concern regression and conditional variance. The first assumption requires that the regressions of total scores on common item scores are equal within both samples. The conditional variance assumption requires that the variances of total scores conditional on common item scores are equal for both samples (Chen et al., 2010; Kolen & Brennan, 2014).
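The way these assumptions are used can be sketched in Python. The sketch below follows the textbook Tucker formulation (regression slopes of total score on anchor score, used to adjust the synthetic group means and variances); it is an illustration under that assumption, not the authors' Excel implementation. All names (`pcov`, `tucker_linear`, `x1`, `v1`, `y2`, `v2`) and the toy data are hypothetical; Group 1 is assumed to take form X, Group 2 form Y, and V denotes the common item score.

```python
from statistics import mean, pvariance


def pcov(a, b):
    """Population covariance of two equally long score lists."""
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)


def tucker_linear(x1, v1, y2, v2, w1=0.5):
    """Return the Tucker linear equating function from X to the Y scale."""
    w2 = 1.0 - w1
    g1 = pcov(x1, v1) / pvariance(v1)   # slope of X on V in Group 1
    g2 = pcov(y2, v2) / pvariance(v2)   # slope of Y on V in Group 2
    d_mu = mean(v1) - mean(v2)          # anchor mean difference
    d_var = pvariance(v1) - pvariance(v2)
    mu_sx = mean(x1) - w2 * g1 * d_mu
    mu_sy = mean(y2) + w1 * g2 * d_mu
    var_sx = pvariance(x1) - w2 * g1**2 * d_var + w1 * w2 * g1**2 * d_mu**2
    var_sy = pvariance(y2) + w1 * g2**2 * d_var + w1 * w2 * g2**2 * d_mu**2
    slope = (var_sy / var_sx) ** 0.5
    return lambda x: slope * (x - mu_sx) + mu_sy


# Hypothetical toy data: same anchor performance, form Y two points easier.
x1, v1 = [5, 8, 12, 15, 20], [1, 2, 3, 4, 6]
y2, v2 = [7, 10, 14, 17, 22], [1, 2, 3, 4, 6]
equate = tucker_linear(x1, v1, y2, v2)
print(equate(10))
```

When the two groups perform identically on the common items, the adjustment terms vanish and the function reduces to ordinary linear equating.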

Levine observed score equating
Levine originally developed the method in 1955 without considering the concept of a synthetic population. After improvements, the method became more general than Levine's (1955) original.

I. X, Y and the common items measure the same characteristic, and the true scores on X, Y and the common items are perfectly related within both groups.

II. The regressions of the X and Y forms on the common items are linear and equal within both groups.
The aim of this study is to equate test scores with the Tucker and Levine observed score equating methods, two linear equating methods for the non-equivalent groups common items pattern, and to investigate whether the group invariance condition is met for each method with respect to gender sub-groups. In addition, the study illustrates with real data how group invariance can be applied in assessing score equating.

Sub-problems
The purpose of the study is to investigate the group invariance of the equated scores obtained from the Tucker and Levine observed score equating methods with respect to gender. For this purpose, these research questions were examined

Data Collection Tools
The data set consisted of the responses to the mathematical literacy items of the Turkish students who participated in PISA 2012. There were 13 booklets in PISA 2012. The 4th and 6th booklets were used in this study. The 4th booklet included 37 items and the 6th booklet included 36 items.
Since traditional equating methods were used in the present study, the most difficult item was excluded from the 4th booklet so that both booklets contained the same number of items. The data used in this study were downloaded from the official website of the OECD (http://pisa2012.acer.edu.au/). Correct answers, wrong answers and missing data were then coded as 1, 0 and 0, respectively. For the few partially scored items, both partially correct and fully correct answers were coded as 1. The data were thus made ready for analysis.

Data Analysis
Data analysis was conducted in four steps. In the first step, it was examined whether the booklets met the equating conditions. In the second step, the equated scores were obtained with the different equating methods. In the third step, group invariance indexes were calculated in order to see how the equating function obtained with each method differed across groups. In the final step, the error of each equating method was calculated.

Step I: In the first step of the data analysis, it was examined whether the equating conditions were met.
To this end, it was first tested whether the data were unidimensional. Principal components factor analysis based on tetrachoric correlations was used to assess unidimensionality. This analysis was conducted with the Factor 9.3 (2014) program developed by Lorenzo-Seva. The results of the factor analysis presented in Table 2 show that there is a more than fourfold decline between the 1st factor and the 4th factor and that the proportion of variance explained by the second factor was quite low. Therefore the booklets have a single general factor, which implies that the tests meet the unidimensionality assumption.
A ratio test was carried out in order to determine whether there was a significant difference between the average difficulties of the forms (Baykul, 1996). The results of the test comparing the average difficulty of the booklets are presented in Table 3. Table 3 shows that there is no statistically significant difference between the difficulty levels of the booklets (p>.05). Thus the equality of the average difficulty of the booklets to be equated, which is another condition for equating, is ensured. The KR-20 reliability coefficient was calculated in order to determine whether the booklets to be equated are equally reliable. Fisher's Z statistic was used to test whether the two reliability coefficients differed (Akhun, 1984). The findings regarding the differences in reliability coefficients are presented in Table 4. Table 4 shows that there is no significant difference between the reliabilities of the booklets at the .05 alpha level (p>.05). This demonstrates that the booklets meet the equal reliability condition.
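For readers unfamiliar with the KR-20 coefficient mentioned above, the following Python sketch shows how it is commonly computed from a matrix of 0/1 item scores. It uses the usual textbook formula with the population variance of total scores; the function name and the tiny response matrix are hypothetical and unrelated to the PISA data.

```python
def kr20(responses):
    """KR-20 reliability for dichotomous (0/1) item scores.

    responses: list of test takers, each a list of 0/1 item scores.
    """
    n = len(responses)
    k = len(responses[0])
    totals = [sum(person) for person in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    # Sum of p * q over items, where p is the proportion correct.
    sum_pq = 0.0
    for j in range(k):
        p = sum(person[j] for person in responses) / n
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Hypothetical responses of four test takers to three items.
toy = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
print(kr20(toy))
```

The coefficient would be computed separately for each booklet before the two values are compared.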
A t test and Levene's test were used to test the difference between the mean scores and the variances of the booklets, respectively. The findings regarding these analyses are presented in Table 5. Table 5 shows that there is no significant difference between the means or the variances of the booklets at the .05 level.
The analyses of the necessary conditions for equating showed that the tests are unidimensional and are equal in reliability, variance and average difficulty.

Step II: In the second step of the data analysis, equated scores were obtained with the Levine and Tucker equating methods. Tucker and Levine observed score equating was performed in Microsoft Excel.

Step III: In the third step of the data analysis, group invariance indexes were calculated in order to assess whether the equated scores obtained with each equating method differed between the female and male subgroups. The RMSD(x) and REMSD indexes developed by Dorans and Holland (2000) were used to assess group invariance.

RMSD(x) (Root Mean Square Difference):
The value of RMSD(x) denotes the distance between the subgroup equating functions and the total group equating function at score level x. In the literature, there are studies indicating that RMSD(x) can be adapted to other equating methods and patterns (von Davier, Holland & Thayer, 2004; von Davier & Wilson, 2008). These studies indicate that RMSD(x) can be reported for other equating methods and patterns in unstandardized form by dropping the denominator of the equation.
RMSD(x) is defined by Equation 8:
\[ RMSD(x) = \frac{\sqrt{\sum_j w_j\,[e_j(x) - e(x)]^2}}{\sigma_Q} \tag{8} \]
x: the score level of the test form
j: the subgroup
e_j(x) - e(x): the difference between the equated score calculated with the equating function of subgroup j at score level x and the equated score calculated with the total group equating function
w_j: the weight of subgroup j, determined by the proportion of test takers in that subgroup
\sigma_Q: the standard deviation of the scores in group Q (Q stands for the one and only group that is examined in the single-group or random groups pattern)
With the help of this equation, it is possible to determine group invariance for the single-group or equivalent groups equating pattern and a linear equating function (Dorans & Holland, 2000). (9)
Dorans and Holland (2000) described the score-level-independent form of RMSD(x) as REMSD (Root Expected Mean Square Difference):
\[ REMSD = \frac{\sqrt{E_P\left\{\sum_j w_j\,[e_j(X) - e(X)]^2\right\}}}{\sigma_Q} \tag{10} \]
In this equation, E_P stands for the mean, over the score distribution, of the differences between the equated scores. A group invariance study yields one REMSD. In the literature, there are studies stating that REMSD can be adapted to other equating methods and patterns (von Davier, Holland & Thayer, 2004; von Davier & Wilson, 2008). These studies indicate that REMSD, like RMSD(x), can be reported for other equating methods and patterns in unstandardized form by dropping the denominator of the equation.
In assessing group invariance in equating, the DTM (difference that matters) criterion is used; it is taken as half of a raw score unit and was recommended by Dorans, Holland, Thayer & Tateneni (2003) and Dorans (2004). Assessing group invariance against the DTM is not a strict rule. In this study, a difference smaller than 0.50 between the equated score of the whole group and the equated score of a sub-group was considered negligible, and a difference larger than 0.50 was considered significant (Kolen & Brennan, 2014).
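The logic of the two indexes and the DTM check can be sketched in Python as follows. The sketch assumes the weighted root-mean-square form attributed to Dorans and Holland (2000) and, by default, the unstandardized version (denominator set to 1). The function names, the dictionary layout and all numbers are hypothetical; they are not the values reported in this study.

```python
from math import sqrt

DTM = 0.5  # half of a raw score unit


def rmsd_x(sub_equated, total_equated, weights, sd=1.0):
    """RMSD(x) at every raw score level (sd=1.0 gives the unstandardized form)."""
    values = []
    for i, e_total in enumerate(total_equated):
        ss = sum(weights[g] * (sub_equated[g][i] - e_total) ** 2
                 for g in sub_equated)
        values.append(sqrt(ss) / sd)
    return values


def remsd(sub_equated, total_equated, weights, score_probs, sd=1.0):
    """REMSD: squared differences averaged over the raw score distribution."""
    ss = 0.0
    for i, (e_total, p) in enumerate(zip(total_equated, score_probs)):
        ss += p * sum(weights[g] * (sub_equated[g][i] - e_total) ** 2
                      for g in sub_equated)
    return sqrt(ss) / sd


# Hypothetical equated scores at three raw score levels.
total = [1.0, 2.0, 3.0]
subs = {"female": [0.8, 2.0, 3.6], "male": [1.2, 2.0, 2.4]}
w = {"female": 0.5, "male": 0.5}
probs = [0.25, 0.5, 0.25]  # relative frequency of each raw score

per_score = rmsd_x(subs, total, w)
flags = [value > DTM for value in per_score]  # True where the DTM is exceeded
overall = remsd(subs, total, w, probs)
print(per_score, flags, overall)
```

In practice the weights would be the subgroup proportions in the sample and the score probabilities would come from the observed raw score frequencies.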

Step IV: In the final step of the data analysis, the error of each equating method was calculated. The weighted mean squares error (WMSE) was used to assess equating error.

WMSE (Weighted Mean Squares Error):
WMSE is used to determine which method is the most suitable on the basis of the error in the scores equated with the different equating methods. It is calculated by comparing the equated scores corresponding to each raw score at the same skill level (Skaggs & Lissitz, 1986). Skaggs and Lissitz (1988) reported that the WMSE index is quite similar to the total error indexes available in other equating studies. The equated scores are presented in Table 6. The graphs of the raw scores and the equated scores obtained with both methods are given in the appendix.

Table 6 shows that the raw scores range from 0 to 36. The equated scores range from 0.945 to 36.603 for the whole group, from 0.722 to 36.028 for females and from 1.007 to 37.496 for males. According to the Tucker equating method, the equated scores for raw scores 26-29 and for raw score 31 are smaller than the raw scores, while the other equated scores are larger than the raw scores. For males, the equated scores are higher than the raw scores. Based on these findings, it can be said that the 6th booklet was more difficult than the 4th one for the whole group and for males. Although this was generally the case for females as well, the situation changes for raw scores 26-29 and 31.
As can be seen from Table 7, while the raw scores range from 0 to 36, the equated scores range from 1.167 to 36.836 for the whole group, from 1.159 to 35.877 for females and from 1 to 38.127 for males. The results of Levine observed score equating indicate that the raw scores for the whole group and for males are lower than the equated scores. For females, however, the raw scores are lower than the equated scores in the 0-23 raw score interval, while in the 24-36 interval they are lower for some scores and higher for others.
Linear equating adjusts for the difficulty difference between the forms across all scale scores. For both methods used in the study, there was a linear relationship between raw scores and equated scores for the whole group and for males. The relationship does not differ across the whole score scale; differences appear only in the 24-36 score interval. It is fair to say that, under Levine observed score equating, the 6th booklet was more difficult than the 4th one for the whole group and for males; although this was generally the case for females as well, the situation changes for raw scores in the 24-31 interval.
The graphs of the RMSD(x) index at each score level, which examine the group invariance of Tucker and Levine observed score equating with respect to the gender subgroups, are presented in Figure 1 and Figure 2, respectively. The RMSD(x) values are given in Table 1 in the appendix. Figure 1 and Table 1 in the appendix show that the RMSD(x) values for Tucker equating range between 0.061 and 0.811 and that these values generally increase with the score. However, this pattern changes at high scores. For the Tucker equating method, the highest RMSD(x) value was obtained at score level 31 and the lowest at score level 1. Figure 2 shows that the RMSD(x) values for Levine equating range between 0.034 and 1.257. Although the RMSD(x) values also increase with the score under the Levine equating method, the increase was not linear at extreme values. For this method, the highest RMSD(x) value was obtained at score level 34 and the lowest at score level 5.
According to the RMSD(x) values, there are some fluctuations at the extreme points of the scale in the graphs for both equating methods. When the score frequencies were examined, some extreme scores had lower frequencies than the others. Accordingly, the fluctuations at the extreme points may originate from these differences in frequency.
The RMSD(x) values calculated with the two methods were similar; however, the RMSD(x) values for Tucker were smaller than those for Levine.
On the other hand, the RMSD(x) values that correspond to scores between 1 and 18 for Tucker equating are lower than the DTM. This means that the difference between the equated scores in the whole group and the equated scores in the sub-groups is not significant. However, the RMSD(x) values that correspond to scores between 19 and 35 for Tucker equating are higher than the DTM, which means that the difference between the equated scores in the whole group and the equated scores in the sub-groups is significant. For Levine equating, the RMSD(x) values that correspond to scores between 1 and 15 are lower than the DTM. This means that the difference between the equated scores in the whole group and the equated scores in the sub-groups is not significant. However, the RMSD(x) values that correspond to scores between 16 and 35 are higher than the DTM. Therefore the difference between the equated scores in the whole group and the equated scores in the sub-groups is significant.
The RMSD(x) index at each score level, which examines the group invariance of the scores equated with Tucker and Levine observed score equating for the gender sub-groups, is given above. The REMSD values, which assess group invariance at the total score level, are presented in Table 8. As shown in Table 8, the REMSD value was calculated as 0.496 for Tucker equating and as 0.668 for the Levine equating method. The REMSD value obtained for Tucker is lower than the REMSD value obtained for Levine. In addition, the REMSD value for Tucker equating is lower than the DTM. This implies that the difference between the equated scores in the whole group and the equated scores in the sub-groups is not significant. For Levine equating, however, the REMSD value is higher than the DTM. This means that the difference between the equated scores in the whole group and the equated scores in the sub-groups is significant.
WMSE coefficients were calculated for the Tucker and Levine equating methods, for the whole group and for the gender sub-groups used in the invariance analysis, in order to find out whether Tucker or Levine is more suitable for the mathematics test in the 4th and 6th booklets of PISA 2012. The coefficients are presented in Table 9. Table 9 indicates that, for the whole group and for the sub-groups, the most suitable method for the mathematics sub-test in the 4th and 6th booklets of PISA 2012 is the Tucker equating method. It is striking that, under the Tucker equating method, the WMSE value obtained for males is considerably higher than the WMSE value obtained for females. It is fair to say that the WMSE coefficients obtained for males with the two methods are similar.
İnal, H., Akın Arıkan, Ç. / An Investigation of Group Invariance in Test Equating According to Gender

[...] subtest and, in order to assess the equitability of the scores, whether or not group invariance was met was investigated according to RMSD(x) for each score and REMSD coefficients for the total score.
When the scores obtained via linear equating were examined, it was seen that the scores obtained with Tucker and Levine observed score equating take values outside the raw score range. Livingston (2004) maintained that scores equated with linear equating can fall outside the raw score range, and that this is a characteristic specific to linear equating rather than a problem. Moreover, Livingston (2004) reported that the equated scores at very high and very low scores can exceed the score range. In this study, this was observed at high scores with both equating methods for the female sub-group.
When the WMSE values obtained with the Tucker and Levine observed score equating methods were compared, Tucker observed score equating produced the lowest error for both the whole group and the gender sub-groups. While the errors obtained with Tucker and Levine observed score equating differ for the whole group and the female sub-group, the errors obtained for the male sub-group are close. Similar results are found in past studies. Demir & Güler (2014) compared frequency estimation equipercentile equating and the Tucker, Levine and Braun-Holland linear equating methods and determined that the most appropriate method was the Tucker equating method; they also reported that the Levine observed score equating method had the highest error. Topczewski et al. (2013), in a study that used a version of Tucker, Angoff-Levine, congeneric Levine and a different version of congeneric Levine while addressing the differences between the skills of the groups, stated that the Tucker equating method was the most suitable when group variances are similar. Chen et al. (2003) applied the Tucker and Levine observed score equating methods with different skill distributions and tests of different difficulty levels and concluded that the results were similar when the differences between the groups and between the test forms were small.
When the RMSD(x) and REMSD values obtained for the gender sub-groups via linear equating were examined, the values based on Tucker were lower than the ones based on Levine. Moreover, the difference between the equated scores in the whole group and the scores equated for the sub-groups is not significant for the Tucker equating method, whereas it is significant for the Levine equating method. That is to say, while group invariance is at an acceptable level for the Tucker equating method, this is not the case for the Levine equating method. In the study by von Davier and Han (2004), which compared RMSD values with respect to gender with the Levine observed score and chained linear equating methods, the equating function that varied least belonged to the Levine method while the function that varied most belonged to the Tucker method. The results of the present study are thus not parallel to the results of that study. Dorans, Liu & Hammond (2008), who compared group invariance by gender with the Tucker, Levine and chained equating methods, reported that if the groups to be equated are similar in terms of average skills, the Tucker equating method gives better results than the Levine and chained equating methods. Yin, Brennan & Kolen (2004) investigated the group invariance of linear, parallel-linear and equipercentile equating of mathematics and science tests. They reported that lower REMSD values were obtained via the linear and parallel-linear equating methods for the mathematics tests, while lower REMSD values were obtained via equipercentile equating for the science test. The results of both studies support the current study.
Equitability of scores requires that equated scores have the same meaning regardless of when they are applied. Failure to achieve group invariance in the equating function indicates that the difficulty difference between the old and new test forms in the NEAT pattern is inconsistent across subgroups (Kim and Walker, 2009). Violation of the group invariance condition in equating causes individuals from different groups who are supposed to have the same score to receive different equated scores (Dorans, 2004, 2008). Group invariance is a prerequisite for equating. Failure to achieve group invariance is an indicator that equating has not succeeded completely. However, achieving group invariance does not necessarily mean that equated scores can be used interchangeably. This is because group invariance should not be taken as the only criterion in assessing the quality of the equating (Dorans, Liu & Hammond, 2008).

The use of group invariance indexes made it possible to decide which equating method performed better than the other. Based on the findings of the current study, the Tucker equating method was the best option for equating the 4th and 6th Mathematics booklets of PISA 2012 and in terms of group invariance.
The difference observed in group invariance might be attributed to the difference between the whole group and sub-group samples. The sample sizes of this study are 741, 381 and 360 for the whole group, males and females, respectively, and a sample size between 50 and 100 is sufficient for the Tucker and Levine observed score equating methods (Parshall, Du Bose Houghton & Kromrey, 1995; Skaggs, 2005; Babcock, Albano & Raymond, 2012). Since the sample sizes are sufficient in this study, it can be said that the difference in group invariance is not affected by the sample size.
In this study, the 4th and 6th booklets of the PISA 2012 mathematics subtest were equated with the Tucker and Levine observed score equating methods in the non-equivalent groups common items pattern, and it was investigated whether group invariance was achieved with regard to the gender sub-groups. A similar study can be carried out with different equating methods, equating patterns and samples. In addition, whether the group invariance condition was met with regard to the gender sub-groups was examined via the RMSD(x) and REMSD indexes. In other studies, different group invariance indexes can be used with different sub-groups (socioeconomic groups, ethnic groups, countries, etc.).

ISSN: 1309-6575 Eğitimde ve Psikolojide Ölçme ve Değerlendirme Dergisi / Journal of Measurement and Evaluation in Education and Psychology

The equating relationship between the test forms is established via common items. Common items are classified as internal or external. If the score obtained from the common items is included in the test taker's test score, the anchor is called internal; if not, it is called external.

There are three assumptions of Levine observed score equating.
The WMSE index is quite similar to the total error indexes available in other equating studies. The equation for the calculation of the WMSE coefficient is given below: (11)
k: the number of items in the Y test; the variance term is the variance of the raw scores in the Y test; X crit: the i-th raw score in the Y test; XE: the score obtained via the equating methods that corresponds to the i-th raw score in the X test; fi: the frequency of the i-th raw score in the Y test.

FINDINGS
The equated scores of the PISA 2012 Mathematics sub-test obtained with the Tucker and Levine observed score equating methods with respect to gender, together with the raw scores, are presented in Table 6.

Table 2. Results of the factor analysis
P.E.V (%): Proportion of explained variance

Table 3. Comparison of the average difficulty of the booklets

Table 4. Comparison of the reliability of the booklets

Table 5. Comparison of the means and variances of the booklets

Table 6. Raw scores and the corresponding scores obtained via the Tucker observed score equating method

Table 7. Raw scores and the corresponding scores obtained via the Levine observed score equating method

Table 8. Values for Levine and Tucker Equating Methods