A Study on the Identification of Latent Classes Using Mixture Item Response Theory Models: TIMSS 2015 Case Fatıma

Accepted: 24.06.2020

This study examined the existence of latent classes in TIMSS 2015 data from three countries: Singapore, Turkey and South Africa. The data were analyzed using Mixture Item Response Theory (MixIRT) models (Rasch, 1PL, 2PL and 3PL) on 18 multiple-choice items in the science subtest. Based on the findings, it was concluded that the data obtained from the TIMSS 2015 8th grade science subtest have a heterogeneous structure consisting of two latent classes. For Singapore, the item difficulty parameters showed that the items were considerably easy for the students in Class 1 and easy for the students in Class 2. For Turkey, the items were easy for the students in Class 1 and difficult for the students in Class 2. For South Africa, the items were somewhat easy for the students in Class 1 and considerably difficult for the students in Class 2. The findings were discussed in the context of the assumption of parameter invariance and test validity.


Introduction
Accurate understanding and analysis of data in education and related fields are important for obtaining reliable and valid measurements and evaluations. In particular, results from international large-scale assessments help make the education process more efficient and allow the academic achievements of student groups in one country to be compared with those in other countries (Cook, 2006). To ensure comparability of scores in international large-scale assessments, Item Response Theory (IRT) calibrations are carried out on student samples from all participating countries. This method ensures that each participating country contributes an equal amount to the calibration of item parameters (Oliveri & von Davier, 2011).
IRT models are used to accurately determine the success of students in international large-scale assessments (Martin, Mullis, & Hooper, 2016; Yamamoto & Kulick, 2000). IRT models describe the relationship between an examinee's ability (latent variable) and the probability of the examinee responding correctly to any item (Harris, 1989). Three different IRT models are used in the TIMSS assessment, depending on item type and scoring method. A three-parameter logistic model is used for multiple-choice items and a two-parameter logistic model for constructed-response items that are scored dichotomously. A generalized partial credit model is used for polytomously scored constructed-response items (Martin et al., 2016).
Although Item Response Theory models have many advantages, they rest on strict assumptions such as unidimensionality, parameter invariance, and local independence (Embretson & Reise, 2000; Hambleton, Swaminathan, & Rogers, 1991). To gather accurate evidence regarding the validity of the model used in the analysis, its assumptions must be met and there must be no biased items (Kreiner & Christensen, 2007). The advantages of IRT models therefore depend on the validity of the model, which in turn requires its assumptions to be met. However, these assumptions are quite difficult to meet in many types of research (von Davier, Rost, & Carstensen, 2007).
The parameter invariance assumption of Item Response Theory means that the estimated item parameter values do not change across different groups (Hambleton et al., 1991; Embretson & Reise, 2000; DeMars, 2010). However, in some cases, different groups can form because of the response strategies and techniques that individuals use to respond to the items correctly, and these groups are defined as latent classes (Embretson, 2007; Glück & Spiel, 2007). In other words, differences among individuals in problem-solving techniques, familiarity with item content, educational background, and so on can lead to the formation of different latent classes (Mislevy & Huang, 2007; Rijkes & Kelderman, 2006). The presence of multiple latent classes in test data means that the measured psychological construct varies among different groups, thus threatening test validity (Kreiner & Christensen, 2007; Messick, 1994; von Davier & Yamamoto, 2007; Toker, 2016). This is because assumptions such as unidimensionality, local independence, parameter invariance, and monotonicity, together with the absence of items showing differential item functioning (DIF), can be considered requirements of construct validity in standard IRT models (Kreiner & Christensen, 2007).
Several models, such as multidimensional IRT models (Reckase, 2009), multiple group IRT models (Bock & Zimowski, 1997) and Mixture IRT models (de Ayala & Santiago, 2017; Rost, 1990; Mislevy & Verhelst, 1990), have been developed for cases in which the assumptions of standard IRT models are violated or cannot be met. Unlike standard IRT models, Mixture IRT models do not require the parameter invariance assumption, and they allow item parameters to vary among latent classes (de Ayala & Santiago, 2017; von Davier & Yamamoto, 2007). The variation of item parameters between latent classes indicates the existence of homogeneous subgroups (Rost, 1990; de Ayala & Santiago, 2017). Analyzing heterogeneous datasets consisting of homogeneous subgroups with standard IRT models can lead to misinterpretation of the results (DeMars, 2010; Finch & French, 2012).
In Mixture IRT models, item parameters and individuals' ability distributions (mean and variance) can vary across latent classes (Rost, 1990; de Ayala & Santiago, 2017). Differences in item and ability parameters reveal that some characteristics of individuals in different classes, such as the strategies they employ and their familiarity with question types, vary (Kreiner & Christensen, 2007; von Davier & Yamamoto, 2007). As a result, the inclusion of individuals in different latent classes based on their abilities enables researchers to obtain more reliable and valid information about item and group traits (de Ayala & Santiago, 2017). This is because applying a single set of parameter estimates to all groups, even though the latent classes consist of individuals with different ability levels, causes loss of information. Furthermore, it is possible to obtain more information by simultaneously modeling both continuous (ability parameter) and categorical (latent class) variables using the Mixture IRT approach (de Ayala & Santiago, 2017).
The relevant literature contains many studies examining the existence of latent classes in international large-scale test data (Choi, Alexeev, & Cohen, 2015; Liu, Liu, & Li, 2018; Oliveri, Ercikan, Zumbo, & Lawless, 2014; Oliveri & von Davier, 2011; Park, Lee, & Xing, 2016; Sen, Cohen & Kim, 2016; Toker, 2016; Zhang, Orrill, & Campbell, 2015). Three latent classes were identified in a study examining the heterogeneity in response patterns of fourth-grade students from Taiwan, Hong Kong, Qatar and Kuwait who participated in PIRLS 2006. Choi et al. (2015) identified two latent classes using a three-parameter Mixture IRT model in an analysis of 11 multiple-choice and 15 open-ended (binary scored) items from the 4th grade mathematics data of seven countries that participated in TIMSS 2007 (Austria, Australia, El Salvador, Hong Kong, Qatar, Singapore, Slovakia). Zhang et al. (2015) conducted three separate analyses of the Chinese data from PISA 2009, covering the 15 items in the science subtest, the 16 items in the mathematics subtest, and the combination of these two subtests, to gather information about the classification of students in the domains of science and mathematics. They concluded that the data obtained from the Chinese students fitted the two-class Mixture Rasch model best in each subtest and when the subtests were analyzed together. Sen et al. (2016), on the other hand, concluded that the data obtained from the South Korean students, who achieved the highest success in the 8th grade mathematics subtest in TIMSS 2011, fitted the two-class Mixture Rasch model best.
This study is important as it examines the existence of latent classes in TIMSS 2015 data, interprets model outputs under the Mixture IRT model that fits the data best, and addresses the validity of the TIMSS assessment. Using unidimensional standard IRT models in large-scale tests such as TIMSS and PISA means that latent classes are ignored, in which case the parameter invariance assumption is violated (von Davier, Rost, & Carstensen, 2007). In this situation, biased results can be obtained in item parameter calibrations (DeMars & Lau, 2011). Furthermore, although it is emphasized that the parameter invariance assumption is necessary for cross-cultural comparisons (Hambleton & Rogers, 1989; Meredith, 1993; Millsap, 2011), the testing of assumptions for data obtained from international assessments is sometimes neglected (Park et al., 2016).
This study investigates the heterogeneity of data from Singapore, Turkey and South Africa, which achieved high, medium and low levels of success in TIMSS 2015, respectively. For this purpose, Mixture IRT models, which enable parameter calibration in the presence of latent classes, are needed. For the reasons stated above, examining whether latent classes are present in international large-scale assessments is important for obtaining reliable and valid results.

Purpose
This study aims to determine which model best fits, in the presence of latent classes, the data from Singapore, Turkey, and South Africa for booklet 7 of the 8th grade science subtest in TIMSS 2015, an international large-scale assessment, and thus to contribute evidence on the validity of the model. In this context, answers to the following research questions were sought: (1) Which Mixture IRT model (Rasch, 1PL, 2PL and 3PL) do the TIMSS 2015 science subtest items fit better for Singapore, Turkey and South Africa? (2) What are the item parameters under the best-fitting model for Singapore, Turkey and South Africa?

Study Group
The study group consists of 436 students from Singapore, 432 students from Turkey and 894 students from South Africa who participated in TIMSS 2015 at the 8th grade level and were administered the Booklet 7 science subtest. Table 1 shows the mean scores and standard deviations of the students for the three countries. The mean score of the students from Singapore was 12.41 with a standard deviation of 3.90. For the students from Turkey, the mean score was 8.57 and the standard deviation was 3.87, while for the students from South Africa the mean score was 5.12 and the standard deviation was 2.52.

Data Collection Tools
Different test booklets with common items are used to estimate student ability in international large-scale assessments (Xu, 2009). TIMSS 2015 had 14 different booklets organized according to content domain and cognitive domain at the 4th and 8th grade levels. The 8th grade science subtest of TIMSS 2015 had four content domains (physics, chemistry, biology and earth sciences) and three cognitive domains (knowing, applying, and reasoning). Each booklet was composed of similar proportions of item types, including multiple-choice and constructed-response items (Martin et al., 2016). The 18 multiple-choice items in booklet 7 of the science subtest were included in the analysis within the scope of this study. Correct answers were coded as 1 while wrong answers were coded as 0.
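The dichotomous coding described above can be illustrated with a minimal Python sketch. The answer key, the student's responses, the function name and the treatment of a non-response as missing (code 99, borrowed from the missing-data code mentioned under Data Analysis) are all hypothetical choices made for illustration; they are not taken from the TIMSS scoring procedures.

```python
MISSING_CODE = 99  # assumed missing-data code, for illustration only


def score_item_responses(raw_responses, answer_key, missing_code=MISSING_CODE):
    """Score multiple-choice responses: 1 = correct, 0 = wrong.

    A response of None (no answer recorded) is kept as the missing code.
    """
    if len(raw_responses) != len(answer_key):
        raise ValueError("responses and answer key must have the same length")
    scored = []
    for response, correct_option in zip(raw_responses, answer_key):
        if response is None:
            scored.append(missing_code)
        else:
            scored.append(1 if response == correct_option else 0)
    return scored


# Hypothetical three-item key and one student's answers (None = no response).
answer_key = ["B", "D", "A"]
student = ["B", "A", None]
print(score_item_responses(student, answer_key))
```

Whether an omitted item should be treated as missing or as incorrect is a scoring decision; the sketch simply keeps the two cases distinguishable.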

Data Analysis
The three-parameter Mixture IRT model, which includes the item parameters and the guessing parameter for each class, is shown in equation (1) (Choi et al., 2015):

$$P(x_{ij} = 1) = \sum_{g=1}^{G} \pi_g \left[ \gamma_{ig} + (1 - \gamma_{ig}) \frac{\exp[\alpha_{ig}(\theta_{jg} - \beta_{ig})]}{1 + \exp[\alpha_{ig}(\theta_{jg} - \beta_{ig})]} \right] \tag{1}$$

In this equation, $g = (1, 2, \ldots, G)$ indicates latent class membership for the three-parameter Mixture IRT model, $\beta_{ig}$ indicates the class-specific item difficulty parameter for item $i$, $\alpha_{ig}$ indicates the class-specific item discrimination for item $i$, $\gamma_{ig}$ indicates the lower asymptote, i.e. the chance parameter for item $i$, $\theta_{jg}$ indicates the ability parameter for individual $j$ in class $g$, and $\pi_g$ indicates the mixing proportion of individuals in a class. The probability that each individual belongs to one latent class and the mixing proportion of individuals in each class ($\pi_g$) are estimated with the $\sum_{g=1}^{G} \pi_g = 1$ and $0 \le \pi_g \le 1$ restriction (Rost, 1990; Sen et al., 2016).

Mixture IRT models are nested models. The three-parameter model turns into a Mixture 2-parameter model when the lower-asymptote parameter is equal to zero, i.e. when chance is eliminated. The two-parameter Mixture IRT model (Mix2PL) is shown in equation (2) (Finch & French, 2012):

$$P(x_{ij} = 1) = \sum_{g=1}^{G} \pi_g \frac{\exp[\alpha_{ig}(\theta_{jg} - \beta_{ig})]}{1 + \exp[\alpha_{ig}(\theta_{jg} - \beta_{ig})]} \tag{2}$$

The model is transformed into the 1-parameter form with the assumption that the chance parameter is equal to zero and the item discrimination parameter is equal for all classes, while it is transformed into the Mixture Rasch form with the assumption that the chance parameter is equal to zero and the item discrimination parameter is equal to 1. The Mixture Rasch model is shown in equation (3) (Rost, 1990):

$$P(x_{ij} = 1) = \sum_{g=1}^{G} \pi_g \frac{\exp(\theta_{jg} - \beta_{ig})}{1 + \exp(\theta_{jg} - \beta_{ig})} \tag{3}$$

In Mixture IRT models, the difficulty, discrimination, and guessing parameters have the same meaning as the parameters in the overall IRT framework.
Therefore, item difficulty provides information about the probability of an item to be answered correctly by the individual, discrimination indicates how well the item distinguishes between individuals with different levels of the measured construct, and the chance parameter is a measurement of the probability that the individual answers the item correctly by mere chance (de Ayala, 2009).
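The nesting of the Mixture IRT models can be made concrete with a short Python sketch. It is only an illustration of the logistic response functions, not the estimation routine used in the study; the function names are hypothetical, and the symbols alpha, beta, gamma, theta and pi stand for discrimination, difficulty, lower asymptote, ability and mixing proportion as described above.

```python
import math


def mix3pl_prob(theta, alpha, beta, gamma):
    """Probability of a correct response within one latent class (3PL form)."""
    logistic = 1.0 / (1.0 + math.exp(-alpha * (theta - beta)))
    return gamma + (1.0 - gamma) * logistic


def mix2pl_prob(theta, alpha, beta):
    """2PL form: the 3PL form with the lower asymptote fixed at zero."""
    return mix3pl_prob(theta, alpha, beta, gamma=0.0)


def mix_rasch_prob(theta, beta):
    """Rasch form: lower asymptote fixed at zero and discrimination fixed at 1."""
    return mix3pl_prob(theta, alpha=1.0, beta=beta, gamma=0.0)


def marginal_prob(thetas, alphas, betas, gammas, pis):
    """Mix the class-specific probabilities with the mixing proportions.

    Each argument holds one value per latent class for a single item and
    a single individual.
    """
    if abs(sum(pis) - 1.0) > 1e-9 or any(p < 0.0 or p > 1.0 for p in pis):
        raise ValueError("mixing proportions must lie in [0, 1] and sum to 1")
    return sum(
        p * mix3pl_prob(t, a, b, c)
        for p, t, a, b, c in zip(pis, thetas, alphas, betas, gammas)
    )


# Illustrative values only: an item that is easy in class 1 and hard in class 2.
print(mix_rasch_prob(theta=0.0, beta=-2.0))
print(mix_rasch_prob(theta=0.0, beta=2.0))
```

With the same ability value, a negative difficulty gives a probability above 0.5 and a positive difficulty gives a probability below 0.5, which is the sense in which items are called "easy" or "difficult" for a class.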
The Mixture IRT models were estimated using Mplus 7.4. In Mplus, parameters can be estimated using maximum likelihood (ML) or Bayesian estimation methods. Missing data are handled with the full information maximum likelihood (FIML) method by adding the "Missing All (99)" command (Muthén & Muthén, 2017). In this study, the ML method was used for parameter estimation and the FIML method was used for missing data.

Model Fit
An exploratory approach that starts with a one-class solution and adds additional classes until obtaining the model that best fits the data is adopted for model fit in Mixture IRT models.
In one-class IRT models, both likelihood ratio tests and relative fit indices can be used to determine the optimal model. The likelihood ratio test, however, is not suitable for model comparisons between Mixture IRT models (Li, Cohen, Kim, & Cho, 2009; Nylund, Asparouhov, & Muthén, 2007). Relative fit indices such as the Akaike Information Criterion (AIC; Akaike, 1974), the Bayesian Information Criterion (BIC; Schwarz, 1978), the sample size adjusted BIC (SABIC; Sclove, 1987), and the consistent AIC (CAIC; Bozdoğan, 1987) are suggested for assessing model-data fit in Mixture IRT models. Simulation studies indicate that the BIC tends to perform best among these indices (Nylund et al., 2007; Li et al., 2009; Sen, 2018). In this study, the BIC was used as the primary index of model-data fit and the AIC was considered as a supporting index.
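The relative fit indices named above follow standard formulas based on the log-likelihood, the number of estimated parameters and the sample size. The Python sketch below shows how they could be computed and compared; the log-likelihood values and parameter counts are hypothetical placeholders, not results from this study, and only the sample size of 436 is taken from the Singapore study group.

```python
import math


def information_criteria(log_likelihood, n_params, n_obs):
    """Return AIC, BIC, SABIC and CAIC for one fitted model (lower is better)."""
    deviance = -2.0 * log_likelihood
    return {
        "AIC": deviance + 2 * n_params,
        "BIC": deviance + n_params * math.log(n_obs),
        "SABIC": deviance + n_params * math.log((n_obs + 2) / 24),
        "CAIC": deviance + n_params * (math.log(n_obs) + 1),
    }


def best_model(fits, criterion="BIC"):
    """Return the name of the model with the lowest value of the criterion."""
    return min(fits, key=lambda name: fits[name][criterion])


# Hypothetical log-likelihoods and parameter counts, for illustration only.
n_students = 436
candidates = {
    "one-class": (-4300.0, 18),
    "two-class": (-4200.0, 38),
    "three-class": (-4170.0, 58),
}
fits = {
    name: information_criteria(ll, k, n_students)
    for name, (ll, k) in candidates.items()
}
print(best_model(fits, "BIC"), best_model(fits, "AIC"))
```

Because the BIC penalty grows with the logarithm of the sample size while the AIC penalty is fixed at 2 per parameter, the AIC can favor a solution with more classes than the BIC on the same data.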

Label Switching
Because there is no prior information about the number and nature of the classes in mixture models, the parameters calibrated for Class 1 are sometimes labeled as Class 2 or vice versa (McLachlan & Peel, 2000). This type of label switching can occur in both Bayesian and ML estimation (Finch & French, 2012). When class labels are exchanged between data sets, parameter estimates collected over potentially mislabeled classes become misleading. In this case, the label switching problem can be solved by taking the estimated item parameter values as starting values. Model-data fit index values are not affected by label switching (Kutscher, Eid, & Crayen, 2019).

Table 2 shows the information criteria obtained from the analyses aimed at determining which Mixture IRT model (Rasch, 1PL, 2PL, and 3PL) best fits the 8th grade TIMSS 2015 science subtest data for Singapore, Turkey, and South Africa. The model-data fit indices for the South Africa data indicate that the BIC value was lower for the two-class Mixture 1-parameter model while the AIC value was lower for the three-class Mixture 2-parameter model. Previous research has shown that the AIC tends to select the model with a higher number of classes (Preinerstorfer & Formann, 2012; Sen, 2018), and the lower AIC value for the three-class model is consistent with these findings. In this context, it can be said that the South Africa data fit the two-class Mixture 1-parameter model better, as the BIC performs better in terms of model-data fit (Li et al., 2009). From a model-based point of view, the BIC indices show that the two-class model fits the data better for the Mixture Rasch model and the Mixture 2-parameter model while the one-class model fits the data better for the Mixture 3-parameter model.
As a result, it can be said that the data from Singapore fit the two-class Mixture Rasch model better while the data from Turkey and South Africa fit the two-class Mixture 1-parameter model better.

Findings
To answer the second research question, Table 3 shows the item difficulty (β1 and β2) and item discrimination (α) parameters estimated for the two-class Mixture IRT models that better fit the data for Singapore, Turkey, and South Africa. As the Singapore data fitted the Mixture Rasch model, the discrimination parameter was fixed at 1, while the discrimination parameters for the data from Turkey and South Africa were estimated as 0.717 and 0.446 respectively, since they fitted the Mixture 1-parameter model. The mean item difficulties for the first latent classes were estimated as -2.18, -1.48 and -0.16 and for the second latent classes as -0.49, 0.89 and 1.87 in the Singapore, Turkey and South Africa data, respectively. When the estimated item difficulty parameters for the Singapore data are examined, it is observed that the item difficulty parameters in Class 1 vary between -9.481 (item 10) and 0.338 (item 11), meaning that these items are usually very easy for students in Class 1. The item difficulty parameters of the items in Class 2 vary between -2.663 (item 7) and 1.064 (item 1). As a result, it can be stated that the students in Class 2 had a slightly lower performance than the students in Class 1. The fact that the vast majority of the items in both classes have negative difficulty values could indicate that the items were very easy for the students in Singapore.
The analysis of the item difficulty parameters calibrated for the Turkey data reveals that the item difficulty parameters vary between -4.880 (item 7) and 1.699 (item 4) in Class 1 and between -2.119 (item 18) and 3.571 (item 12) in Class 2. In this case, it can be said that the items were easier for students in Class 1, who performed better, while the items were a little harder for the students in Class 2, who had a lower performance. When the item difficulty values in the latent classes in the Singapore and Turkey data are compared, it can be stated that the items were harder for the students in Turkey.
A label switching problem, which can be encountered in mixture models, was identified in the South Africa data. The analysis output revealed that the item parameters estimated for Class 1 were labeled as Class 2. This problem was solved by taking the estimated item parameter values as starting values (Kutscher et al., 2019). When the item difficulty parameters are examined for the South Africa data, it is observed that they range from -3.017 (item 3) to 2.847 (item 1) in Class 1 and from -0.639 (item 15) to 5.987 (item 1) in Class 2. In this case, it can be said that the items were usually a little easier for students in Class 1, who performed slightly better, while the items were harder for the students in Class 2, who had a lower performance. When the item difficulty values in the latent classes in the Singapore and Turkey data are compared with those in the South Africa data, it is observed that the items were highly difficult for the students in South Africa.

Percentages of students in latent classes for each country are given in Table 4. The conditional probability values for the latent classes given in Table 4 reveal that 59% of the students in Singapore were in Class 1 and 41% were in Class 2; 62% of the students in Turkey were in Class 1 and 38% were in Class 2; and 11% of the students in South Africa were in Class 1 and 89% were in Class 2. The high percentage of underperforming students in the South Africa data is consistent with the fact that South Africa ranked last in the 8th grade science test of TIMSS 2015. According to the student percentages in the latent classes, the data obtained from the students who took the 8th grade science subtest of TIMSS 2015 have a heterogeneous structure consisting of two homogeneous subclasses. This result therefore shows that Mixture IRT models are needed to detect latent classes in TIMSS 2015 data.
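Label switching and class percentages can be illustrated with a small Python sketch. Note that it uses a different and simpler remedy than the one applied in this study: instead of re-estimating the model with starting values, it relabels the classes after estimation so that the class with the lowest mean item difficulty is always called Class 1. The function names, the difficulty values and the posterior probabilities are hypothetical.

```python
def relabel_by_mean_difficulty(class_difficulties):
    """Order classes so that the class with the lowest mean difficulty is first.

    Returns (order, relabeled), where order[k] is the original index of the
    class that becomes class k + 1 after relabeling.
    """
    means = [sum(b) / len(b) for b in class_difficulties]
    order = sorted(range(len(means)), key=lambda g: means[g])
    return order, [class_difficulties[g] for g in order]


def class_proportions(posteriors, order):
    """Share of individuals per relabeled class, using modal assignment.

    posteriors holds one row per individual with the posterior probability
    of each class under the original (possibly switched) labels.
    """
    counts = [0] * len(order)
    for row in posteriors:
        modal_class = max(range(len(row)), key=lambda g: row[g])
        counts[order.index(modal_class)] += 1
    return [count / len(posteriors) for count in counts]


# Hypothetical output in which the easier class came out second.
difficulties = [[1.0, 2.0], [-1.0, 0.0]]
posteriors = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]]

order, relabeled = relabel_by_mean_difficulty(difficulties)
print(order, relabeled)
print(class_proportions(posteriors, order))
```

Ordering by mean difficulty is only one possible identifying rule; it matches the convention in the findings above, where Class 1 is the class for which the items are easier in all three countries.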

Discussion and Conclusion
Standard IRT models are used for calibration of item parameters and scaling of individual performances in international large-scale assessments such as TIMSS and PISA (Martin et al., 2016). The literature review revealed that latent classes are ignored in some studies that analyzed international large-scale test data with IRT models (Kreiner & Christensen, 2014; Oliveri & von Davier, 2011; Oliveri & von Davier, 2014; Park et al., 2016). In the presence of latent classes, the parameter invariance assumption of standard IRT models is violated and biased results can be obtained in item parameter calibrations (DeMars & Lau, 2011). In large-scale assessments, the invariance of item parameters is often tested within the context of DIF studies. In these studies, however, the existence of latent classes is not checked, and latent traits are often neglected (Park et al., 2016). Therefore, the analysis in this study was carried out using Mixture IRT models, which allow item parameters to vary among latent classes.
The results of the Mixture Item Response Theory analyses (Rasch, 1PL, 2PL and 3PL) showed that the Singapore data fitted the two-class Mixture Rasch model better while the Turkey and South Africa data fitted the two-class Mixture 1-parameter model better. Choi et al. (2015) found that TIMSS 2007 mathematics data fitted the two-class 3-parameter Mixture IRT model best, Zhang et al. (2015) found that the data obtained from Chinese students in the mathematics subtest of PISA 2009 fitted the two-class Mixture Rasch model best, and Sen et al. (2016) found that the data obtained from South Korea in the 8th grade mathematics subtest of TIMSS 2011 fitted the two-class Mixture Rasch model best. Taken together, the results of this study are similar to those obtained by applying Mixture IRT models to large-scale test data such as TIMSS and PISA.
Two latent classes were identified in the Singapore, Turkey and South Africa data. It was concluded that, in all three countries, the students in the first latent class performed better in answering the items than the students in the second latent class. These results indicate the presence of latent classes in the data of countries with high, medium and low performance, regardless of a country's performance ranking. The parameter invariance assumption, which is one of the assumptions of standard IRT models, is violated in the presence of latent classes (Park et al., 2016). As the parameter invariance assumption could not be met, it was concluded that the data obtained from the 8th grade science subtest of TIMSS 2015 fit Mixture IRT models. As a result, Mixture IRT models are needed for calibration on a subgroup basis in the TIMSS 2015 assessment. Accordingly, as stated in studies stressing the importance of meeting model assumptions (Goldstein, 2004; Grisay & Monseur, 2007; Kreiner & Christensen, 2014; Oliveri & von Davier, 2011), collecting sound evidence about the validity of results obtained from large-scale assessments will make it possible to reach more accurate conclusions when determining the strengths and weaknesses of countries and when reorganizing, improving, and evaluating education programs.
This study was conducted on dichotomously scored items in the 8th grade science subtest of TIMSS 2015. Researchers can also perform Mixture IRT model analyses with polytomously scored items. Moreover, although it can be stated that students in the classes obtained with Mixture IRT models show a low or high performance, no account can be provided of the TIMSS cognitive levels (knowing, applying and reasoning) in which the students in these classes are successful.
Researchers can obtain more detailed information about the formation of latent classes by cognitive domain levels using the Confirmatory Mixture IRT model approach. Furthermore, DIF studies can be conducted with Mixture IRT models using large-scale test data such as TIMSS and PISA.