Investigation of the Effect of Missing Data Handling Methods on Measurement Invariance of Multi-Dimensional Structures

The purpose of this study was to compare missing data handling methods in terms of their effect on the measurement invariance of multi-dimensional structures. For this purpose, the data of 10857 students from Turkey and Singapore who participated in the PISA 2015 administration and fully responded to the items related to affective characteristics of science literacy were used. Data sets with different percentages of missing data (5%, 10%, and 20%) were generated from the complete data set with the missing completely at random (MCAR) mechanism. In all data sets, missing data were handled with listwise deletion (LD), serial mean imputation (SMI), regression imputation (RI), expectation maximization (EM), and multiple imputation (MI) methods. Measurement invariance of the construct between countries was investigated on the completed data sets with multiple-group confirmatory factor analysis (MG-CFA). Findings from each data set were compared with reference values. According to the results, the RI and MI methods in the data set with 5% missing, the EM method in the data set with 10% missing, and the MI method in the data set with 20% missing gave results closer to the reference values than the other methods.


INTRODUCTION
Measurement instruments are of great importance in education systems. Decisions such as placing individuals in educational institutions and programs and making changes and improvements in educational systems, all aimed at training a qualified workforce in accordance with the needs of society, can be based on the findings obtained from measurement instruments. As a result of national and international assessment studies, countries can even change their educational policies. In particular, the results of large-scale assessment studies that enable international comparisons are followed with interest by all stakeholders of education. PISA (Program for International Student Assessment) and TIMSS (Trends in International Mathematics and Science Study) are large-scale studies that aim to make comparisons between countries and can affect educational policies at national and international levels. The comparability of the results, especially in international assessments, is of great importance in the evaluation of countries. To be able to interpret the findings from different groups who took the same measurement instrument, the measurement instrument should have the same meaning for all groups. In this context, the concept of measurement invariance emerges. Drasgow (1984) defines measurement invariance as the similarity of the relationships between observed test scores and latent traits across all subgroups.
The data obtained by the measurement instruments are not always complete. For reasons caused by the examinee, the measurement instrument, or the administrator, some data may be missing on data sets.
Missing values arise as a problem since they directly affect the results of the statistical analyses of data sets. As in all other statistical analyses, in the measurement invariance studies, the missing data needs to be checked and managed before the analyses. The presence of missing data can affect the results of many analyses, including confirmatory factor analysis. Since excluding examinees with missing values from data sets will reduce the sample size, the power to generalize the results to the population decreases. In addition, the presence of missing values can cause type I and type II errors. Even the difference in the methods used to handle the missing data problem may lead to different findings from the analysis (Harrington, 2009).
Many techniques have been developed to handle missing data. Allison (2001) classified the missing data handling methods as traditional methods, methods based on maximum likelihood, and multiple imputation approaches. Listwise deletion (LD) obtains a complete data set by removing all cases with unobserved data in any of the variables. If the missing data has the missing completely at random (MCAR) mechanism, the standard error estimates will be close to those of the real data, since the data set obtained by removing the cases with missing data will be a random sample of the original data set (Allison, 2003). However, if the missing values are spread across different observations, the sample size will be greatly reduced. This can cause problems even if the missing data has the MCAR mechanism (Enders, 2010). Serial mean imputation (SMI) replaces each missing value with the mean of the observed data in the variable where the missing value is located (Little & Rubin, 2002). Since the mean of the variable is imputed, the mean value of the variable does not change. However, the method reduces the distance of the imputed values from the mean to zero, and it underestimates the variance (Enders, 2010; Tabachnick & Fidell, 2013). In the regression imputation (RI) method, missing values are imputed with a regression equation obtained from the observed variables. However, the imputed values have some disadvantages: they fit better than expected because they are estimated from the other variables, and they reduce the variance because the imputed value will most likely be close to the mean. Moreover, when the other variables are not good predictors of the variable with missing values, there is no difference between regression imputation and mean imputation (Tabachnick & Fidell, 2013).
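The three traditional methods described above can be illustrated with a minimal sketch. It uses simulated two-variable data rather than the PISA data, and the variable names, sample size, and 20% missing rate are assumptions chosen only for illustration. The printout shows how mean imputation keeps the mean but shrinks the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated two-variable data set (hypothetical, not the PISA data).
n = 1000
x = rng.normal(0.0, 1.0, n)
y = 0.6 * x + rng.normal(0.0, 0.8, n)

# Delete about 20% of y completely at random (MCAR).
missing = rng.random(n) < 0.20
y_obs = np.where(missing, np.nan, y)

# Listwise deletion: keep only the cases with an observed y.
keep = ~np.isnan(y_obs)
y_ld = y_obs[keep]

# Mean imputation: replace each missing value with the observed mean.
y_mean = np.where(missing, np.nanmean(y_obs), y_obs)

# Regression imputation: predict missing y from x using the complete cases.
slope, intercept = np.polyfit(x[keep], y_obs[keep], 1)
y_reg = np.where(missing, intercept + slope * x, y_obs)

print("cases kept by LD:", y_ld.size)
print("variance, complete data:", y.var(ddof=1))
print("variance, listwise deletion:", y_ld.var(ddof=1))
print("variance, mean imputation:", y_mean.var(ddof=1))
print("variance, regression imputation:", y_reg.var(ddof=1))
```

Because the imputed values sit exactly at the observed mean, the sum of squared deviations is unchanged while the number of cases grows, so the variance after mean imputation is always smaller than the variance of the observed cases.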
Expectation maximization (EM), a method based on maximum likelihood, alternates between two steps, expectation (E) and maximization (M), which are applied sequentially and are based on a series of regressions. The disadvantage of this method is that the standard errors obtained from it are not consistent with the actual standard errors (Allison, 2003). In the multiple imputation (MI) method, unlike the EM method, random variance is added to the values estimated by regression. However, because of the added random variance, different results can be obtained each time (Allison, 2003).
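Because each imputed data set in MI gives slightly different estimates, the estimates are usually combined into one. The text does not say how the imputed data sets were combined in this study, so the sketch below shows a common choice, Rubin's rules, as an assumption. The function name `pool_rubin` and the loading estimates are hypothetical.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors (Rubin's rules)."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()                  # pooled point estimate
    within = u.mean()                 # average within-imputation variance
    between = q.var(ddof=1)           # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, np.sqrt(total)

# Hypothetical factor-loading estimates from m = 5 imputed data sets.
est = [0.71, 0.73, 0.70, 0.72, 0.74]
se = [0.020, 0.021, 0.020, 0.022, 0.021]
pooled, pooled_se = pool_rubin(est, np.square(se))
print("pooled estimate:", pooled)
print("pooled standard error:", pooled_se)
```

The pooled standard error is larger than any single-imputation standard error here because it also carries the variation between imputations.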
There are two commonly used approaches in measurement invariance tests: confirmatory factor analysis and item response theory (Reise, Widaman & Pugh, 1993). Measurement invariance is generally examined by the multiple-group confirmatory factor analysis (MGCFA) method, which includes hierarchical steps (Whitaker & McKinney, 2007). In order to control the measurement invariance between groups with MGCFA method, configural invariance which requires equality of factor structures between groups, metric invariance which requires equality of factor loadings between groups, scalar invariance which requires equality of intercepts between groups, and strict invariance which requires equality of residual variances between groups must be tested hierarchically (Schoot, Lugtig & Hox, 2012).
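The hierarchical constraints listed above can be written compactly for the two-group case. The symbols below follow conventional MGCFA notation and are an assumption, since the text does not define its own: x is the vector of observed item scores, τ the intercepts, Λ the factor loadings, ξ the latent factors, δ the residuals with covariance matrix Θ, and g the group index.

```latex
\begin{align*}
\mathbf{x}^{(g)} &= \boldsymbol{\tau}^{(g)} + \boldsymbol{\Lambda}^{(g)} \boldsymbol{\xi}^{(g)} + \boldsymbol{\delta}^{(g)},
\qquad \operatorname{Cov}\bigl(\boldsymbol{\delta}^{(g)}\bigr) = \boldsymbol{\Theta}^{(g)}, \qquad g = 1, 2 \\
\text{Configural:}\quad & \boldsymbol{\Lambda}^{(1)} \text{ and } \boldsymbol{\Lambda}^{(2)} \text{ have the same pattern of free and fixed loadings} \\
\text{Metric:}\quad & \boldsymbol{\Lambda}^{(1)} = \boldsymbol{\Lambda}^{(2)} \\
\text{Scalar:}\quad & \boldsymbol{\Lambda}^{(1)} = \boldsymbol{\Lambda}^{(2)}, \quad \boldsymbol{\tau}^{(1)} = \boldsymbol{\tau}^{(2)} \\
\text{Strict:}\quad & \boldsymbol{\Lambda}^{(1)} = \boldsymbol{\Lambda}^{(2)}, \quad \boldsymbol{\tau}^{(1)} = \boldsymbol{\tau}^{(2)}, \quad \boldsymbol{\Theta}^{(1)} = \boldsymbol{\Theta}^{(2)}
\end{align*}
```

Each level keeps the constraints of the previous one and adds a new set, which is why the models are nested and can be compared step by step.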

Purpose of the Study
The purpose of this study was to investigate the effect of missing data handling methods on measurement invariance of multi-dimensional structures. In this context, the answer to the following problem is sought: "What is the effect of listwise deletion (LD), serial mean imputation (SMI), regression imputation (RI), expectation maximization (EM), and multiple imputation (MI) methods used to handle missing data on the measurement invariance in data sets with different percentages of missingness?".

General Background
In the literature, Reise, Widaman, and Pugh (1993) investigated the effects of confirmatory factor analysis and item response theory models on the invariance of psychological measures. The actual psychological data collected from Minnesota and China were examined by both methods, and their advantages and disadvantages were investigated. Cheung and Rensvold (2002) investigated how the GFI goodness of fit statistic changed in MGCFA, which is generally used in measurement invariance studies. As a result of the invariance study performed on simulation data consisting of two groups, it was suggested to use the ∆CFI, ∆Gamma, and ∆McDonald's indices from 20 different fit indices based on GFI. Chen, Wang, and Chen (2012) conducted a simulation study on data sets with different rates of missingness in order to compare the missing data handling methods in exploratory and confirmatory factor analysis. In the study, where six different methods were examined, all the methods produced appropriate results for exploratory and confirmatory factor analyses. It was concluded that the most suitable method for exploratory factor analysis was EM. In the case of less than 20% missing, no statistically significant difference was found between the methods. However, when the missing data is more than 30%, it is suggested to use the SMI and linear trend methods. It is seen that studies on measurement invariance are generally based on real data among different groups such as gender and culture (Schnabel, Kelava, Vijver & Seifert, 2015; Wang, Willett & Eccles, 2011). Some studies also compared the goodness of fit indices used when examining measurement invariance (Chen, 2007; Cheung & Rensvold, 2002).
Studies on the effect of missing data on test and item parameters and model data fit (Akbaş & Tavşancıl, 2015; Çüm & Gelbal, 2015; Demir, 2013; Köse, 2014) were conducted. However, there are not many studies about the effect of missing data handling methods on measurement invariance under different conditions. In one of these studies, Selvi, Alıcı & Uzun (2020) examined the effect of EM, RI, and SMI methods on measurement invariance on the data obtained from the School Attitude Scale developed by Alıcı (2013) under the condition of 5% missing. The findings of the study show that different methods can change measurement invariance decisions. The researchers suggested that more research be done on different missing data structures and different proportions of missing data.
When the studies related to the missing data handling methods are examined, it is seen that they generally aim to determine which method is more successful in handling missing values (Allison, 2003; Chen, Wang & Chen, 2012; Downey & King, 1998; Olinsky, Chen & Harlow, 2003). The data sets used are generally simulation data, and it is seen that the successful methods change in data sets with different sample sizes and different percentages of missingness. Missing data studies have recently increased. The problem of missing data is no longer ignored, and efforts are being made to solve the problem.
In this context, it is thought that examining the performance of the missing data handling methods at different missing rates in measurement invariance studies on multi-dimensional structures is important in terms of shedding light on the problem of missing data in measurement invariance studies. Five methods frequently used in research are discussed within the scope of this study.

Participants
The sample consisted of 10857 15-year-old students (5109 from Turkey and 5748 from Singapore) who participated in the PISA 2015 administration. Students who fully responded to the items on "enjoyment of science, instrumental motivation, and epistemological beliefs about science" were included in the study. Measurement invariance studies between Turkey and Singapore were conducted on a complete data set of 10857 students in total.
Since PISA results are generally used for cross-country comparisons, it was decided to evaluate the measurement invariance between countries in the data set. The data of Turkey and Singapore were chosen because their mean science scores lie at approximately equal distances from the OECD average: Singapore has a mean science score of 556, Turkey has a mean science score of 425, and the OECD average is 493. It was also taken into account that Singapore is the most successful country in terms of mean science score. Similarly, the percentage of variation in science performance explained by students' socio-economic status was also considered.

Data Collection Instruments/Data Collection Methods/Data Collection Techniques
The data used in this study was obtained from the PISA 2015 administration organized by OECD and aimed to evaluate the educational systems of countries. PISA is an administration to measure the level of knowledge and skills necessary for students to participate in modern society. In addition to focusing on key areas such as science, mathematics, and reading, the 2015 administration included collaborative problem solving and financial literacy as an innovative field (OECD, 2016).
In this study, the model including the items of enjoyment of science, instrumental motivation, and epistemological beliefs was used. Enjoyment of science is represented by five items, instrumental motivation by four items, and epistemological beliefs by six items. Each item has four response categories: strongly disagree, disagree, agree, and strongly agree. Some sample items are shown in Table 1.
[...] chi-square=8840.290) in the data set with 10% missing, and p=0.645 (chi-square=23308.247) in the data set with 20% missing were found. Accordingly, it can be said that the missing data in all data sets have the MCAR mechanism. Afterwards, the LD, SMI, RI, EM, and MI (with five imputations) methods were applied to each data set to handle the missing data problem, and inter-country measurement invariance was examined by the MGCFA approach on the completed data sets.
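Generating MCAR missingness from a complete data set can be sketched as follows. The response matrix here is randomly generated rather than the PISA data, and the helper `make_mcar` is hypothetical. Spreading the missing cells uniformly over the whole matrix is also an assumption, since the text does not state how the missing cells were distributed across items.

```python
import numpy as np

def make_mcar(data, rate, seed=None):
    """Return a float copy of `data` with a fraction `rate` of cells set to NaN."""
    rng = np.random.default_rng(seed)
    out = np.array(data, dtype=float)
    n_missing = int(round(rate * out.size))
    # Choose cells uniformly at random, without regard to any observed value.
    idx = rng.choice(out.size, size=n_missing, replace=False)
    out.flat[idx] = np.nan
    return out

# Hypothetical complete responses: 10857 students x 15 four-category items.
rng = np.random.default_rng(1)
complete = rng.integers(1, 5, size=(10857, 15))

incomplete = {rate: make_mcar(complete, rate, seed=42) for rate in (0.05, 0.10, 0.20)}
for rate, d in incomplete.items():
    print(f"target {rate:.2f}, realized {np.isnan(d).mean():.4f}")
```

Because the deleted cells are chosen independently of the responses, the mechanism is MCAR by construction.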
For cross-country measurement invariance, enjoyment of science, instrumental motivation, and epistemological beliefs model is shown in Figure 1. Before starting the analysis, it is necessary to check the missing values, normality, outliers, and multicollinearity in the data set. The kurtosis and skewness values of each data were examined for normality assumption. According to the findings, the skewness values of the variables ranged from -0.942 to -0.471, and the kurtosis values ranged from -0.296 to 0.913. Tabachnick and Fidell (2013) stated that the closeness of kurtosis and skewness values to zero shows that the distribution is close to normal distribution. According to obtained kurtosis and skewness values, it can be said that each variable was distributed normally. To determine the outliers, z distributions were examined. |z|>3.29 indicates that the variable contains outliers (Tabachnick & Fidell, 2013). According to the findings, z scores of the variables ranged between -2.78 and 1.42. In this case, it can be concluded that there are no outliers in the data set. VIF and tolerance values were examined to determine if there was a multicollinearity problem. VIF values ranged between 2.178 and 4.882, and the tolerance values ranged between 0.205 and 0.459. Based on this finding, it was concluded that there is no multicollinearity problem in the data set.
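The screening steps above can be sketched with NumPy. The data are simulated and the function `screen` is hypothetical. Skewness and kurtosis are computed with simple moment formulas, which may differ slightly from the bias-corrected values reported by statistical packages. VIF is taken from the diagonal of the inverse correlation matrix, and tolerance is its reciprocal.

```python
import numpy as np

def screen(data):
    """Return skewness, excess kurtosis, max |z|, VIF and tolerance per column."""
    x = np.asarray(data, dtype=float)
    z = (x - x.mean(axis=0)) / x.std(axis=0)     # z scores (population SD)
    skew = (z ** 3).mean(axis=0)                 # moment-based skewness
    kurt = (z ** 4).mean(axis=0) - 3.0           # moment-based excess kurtosis
    max_abs_z = np.abs(z).max(axis=0)
    # VIF of variable j is the j-th diagonal element of the inverse
    # correlation matrix; tolerance is 1 / VIF.
    vif = np.diag(np.linalg.inv(np.corrcoef(x, rowvar=False)))
    return skew, kurt, max_abs_z, vif, 1.0 / vif

# Hypothetical correlated items sharing one common component.
rng = np.random.default_rng(0)
common = rng.normal(size=(2000, 1))
items = common + rng.normal(size=(2000, 5))

skew, kurt, max_abs_z, vif, tol = screen(items)
print("skewness:", np.round(skew, 3))
print("excess kurtosis:", np.round(kurt, 3))
print("outlier flag (|z| > 3.29):", max_abs_z > 3.29)
print("VIF:", np.round(vif, 3), "tolerance:", np.round(tol, 3))
```

The reciprocal relation explains why the reported ranges mirror each other: a VIF of 4.882 corresponds to a tolerance of about 0.205, and a VIF of 2.178 to about 0.459.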
In order to compare the results obtained from a measurement instrument applied to groups with different characteristics, it is important to ensure the measurement invariance between groups. There are different approaches to test measurement invariance, such as MGCFA and item response theory. In this study, the measurement invariance was examined with the MGCFA approach with ML estimator. MGCFA aims to compare the means, variance, and covariance of the latent variable between the groups while  (Asparouhov & Muthen, 2014). In this context, configural invariance, metric invariance, scalar invariance, and strict invariance were tested hierarchically. ∆CFI was examined to determine whether measurement invariance was provided at each stage. A difference of less than .01 supports the less parameterized model (Chung et al., 2016).
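The ∆CFI decision rule used at each stage can be expressed as a short function. The CFI values below are hypothetical and are not the values from this study, and the function name `highest_invariance` is an assumption. The sketch also assumes that the configural model itself fits acceptably.

```python
def highest_invariance(cfi, threshold=0.01):
    """Return the most restrictive level supported by the |delta CFI| rule."""
    levels = ["configural", "metric", "scalar", "strict"]
    achieved = levels[0]  # assumes the configural model fits acceptably
    for prev, cur in zip(levels, levels[1:]):
        # A CFI change larger than the threshold rejects the added constraints.
        if abs(cfi[cur] - cfi[prev]) > threshold:
            break
        achieved = cur
    return achieved

# Hypothetical CFI values for the four nested models.
cfi = {"configural": 0.950, "metric": 0.948, "scalar": 0.941, "strict": 0.905}
print(highest_invariance(cfi))
```

With these illustrative values the change from the scalar to the strict model exceeds .01, so the function stops at scalar invariance.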

RESULTS
In this section, the findings of the research are given. Firstly, the reference values to compare the data sets with different percentages of missingness were obtained by performing a hierarchical measurement invariance in the complete data set.
Before moving on to measurement invariance studies in the whole data set, confirmatory factor analysis was performed in the Turkey and Singapore data sets separately, and model-data fits were examined. Fit indices obtained from the Turkey and Singapore data sets are presented in Table 2. When Table 2 is examined, it is seen that the data for both countries fit the model. After that, a cross-country measurement invariance study was conducted for the complete data set, and reference values were obtained. Reference values obtained from the complete data set are provided in Table 3. When the fit indices in Table 3 were examined, it was seen that configural, metric, and scalar invariance were achieved in the complete data set (|∆CFI|≤.01), but strict invariance was not. The fit indices from the reference data set were used for comparison with the completed data sets. Then, measurement invariance was tested in the data sets with 5%, 10%, and 20% missing that were completed with the LD, SMI, RI, EM, and MI methods.

Influence of Missing Data Handling Methods on Measurement Invariance in the Data Set with 5% Missing
The data set with 5% missing was completed with LD, SMI, RI, EM, and MI methods, and measurement invariance was hierarchically tested on the completed data sets. The fit indices obtained at each stage of measurement invariance according to the different methods are provided in Table 4. When the fit indices in the table were examined, it was seen that the first three stages of measurement invariance between countries were achieved in all data sets (|∆CFI|≤.01), but strict invariance was not. When the fit indices obtained for each method were compared with the reference values given in Table 3, it was observed that the indices obtained from the SMI, RI, EM, and MI methods were more similar to the reference values. In contrast, the LD and SMI methods produced χ²/df values lower than the reference value. All indices, especially ∆CFI, were compared with the reference data set, and the methods giving results more similar to the reference values were determined. The RI and MI methods yielded the closest results.
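The descriptive comparison with the reference values can be sketched as a ranking by absolute distance. All numbers below are made-up placeholders and the method labels A, B, and C are deliberately generic, so they do not stand for any result in Table 3 or Table 4. The function `rank_by_closeness` is hypothetical.

```python
def rank_by_closeness(reference, results, index="delta_cfi"):
    """Order methods from closest to farthest from the reference value."""
    return sorted(results, key=lambda m: abs(results[m][index] - reference[index]))

# Placeholder values for illustration only (not results from the study).
reference = {"delta_cfi": -0.030}
results = {
    "A": {"delta_cfi": -0.024},
    "B": {"delta_cfi": -0.029},
    "C": {"delta_cfi": -0.033},
}
print(rank_by_closeness(reference, results))
```

The same function can be applied to any other fit index by passing a different `index` key, as long as both dictionaries contain it.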

Influence of Missing Data Handling Methods on Measurement Invariance in the Data Set with 10% Missing
The data set with 10% missing was completed with LD, SMI, RI, EM, and MI methods. Measurement invariance was hierarchically tested on the completed data sets. The fit indices obtained at each stage of measurement invariance according to the different methods are provided in Table 5. When the fit indices in the table were examined, it was seen that, as in the reference data set, all the invariance stages except strict invariance were achieved with all the missing data handling methods (|∆CFI|≤.01). When the fit indices obtained for each method were compared with the reference values given in Table 3, it was seen that the EM method gave results very close to the reference values. In contrast, the LD and SMI methods produced χ²/df values lower than the reference value, and the SMI method produced CFI and TLI values higher than the reference values.

Influence of Missing Data Handling Methods on Measurement Invariance in the Data Set with 20% Missing
The data set with 20% missing was completed with LD, SMI, RI, EM, and MI methods, and measurement invariance between countries was hierarchically tested on the completed data sets. The fit indices obtained from the measurement invariance studies are provided in Table 6. When the fit indices in the table were examined, it was seen that all the invariance stages except strict invariance were achieved with all the missing data handling methods (|∆CFI|≤.01). When the fit indices obtained for each method were compared with the reference values given in Table 3, it was seen that the MI method gave results close to the reference values. The MI method produced a χ²/df value close to the reference value, whereas the LD and SMI methods produced χ²/df values lower than the reference value and the EM method produced a higher one.

DISCUSSION and CONCLUSION
In this study, the effect of completing data sets with missing values with LD, SMI, RI, EM, and MI methods on measurement invariance was investigated. In the cross-country measurement invariance studies performed on the data sets completed with the different missing data handling methods, it was observed at all missing percentages that all the invariance stages except strict invariance were achieved, in accordance with the complete data set. Although the data sets were completed with different methods, no result showed a measurement invariance decision between countries that differed from the reference data set.
The research was limited in terms of missing data handling methods, missing data mechanisms, and measurement invariance approaches. LD, SMI, RI, EM, and MI methods were used as missing data handling methods. The data sets have MCAR mechanism. Data sets with multi-dimensional structures were used in the study. And, measurement invariance was handled by MG-CFA approach. The findings and discussion in this study are based on a single data set obtained from the PISA 2015 administration.
No replications were performed in the study, which should be considered a limitation.
In the literature, methods based on the likelihood approach and the multiple imputation approach are proposed as the strategy of handling the missing data in CFA models (Allison, 2003; Brown, 2006). The findings of the research show that, in accordance with the literature, the EM method, which is based on the likelihood approach, and the MI method yielded more successful results.
Selvi, Alıcı & Uzun (2020) tested measurement invariance with structural equation modeling in the complete data matrix and in data sets whose missing values were handled with EM, regression-based imputation, and mean substitution methods. They concluded that different methods can change the decisions of measurement invariance. In the findings of this study, however, none of the methods changed the measurement invariance decisions. Allison (2003) stated that MI has good statistical properties and that it can be used in almost any situation. Schafer and Graham (2002) recommended the EM algorithm for maximum likelihood and the MI method. Similar to these studies, the results obtained from the EM and MI methods were found to be closer to the reference data in this study.
As a result of comparing the fit indices obtained from each data set with the fit indices obtained from the complete data set, the data sets completed with RI and MI in the data set with 5% missing yielded closer results to the reference values. In the data set with 10% missing, closer results were obtained from the EM method than the other methods. And in the data set with 20% missing, the missing data handling method which gave the closest results to the reference values was MI. While making comparisons, based on ∆CFI change, the methods whose fit indices give the closest results to the reference values were determined descriptively. As a result of the research, recommendations for implementation are as follows: In the measurement invariance studies to be performed in multi-dimensional data sets, data sets with 5% missing can be completed by RI and MI methods. The EM method works better than other methods if there are around 10% missing. And, if the data set has about 20% missing, the MI method can be used to complete the data set.