Comparative Analysis of Common Statistical Models Used for Value-Added Assessment of School Performance

The purpose of this study was to compare three popular value-added models used in measuring school effectiveness in terms of their distinguishing characteristics. The simple fixed effects model (SFEM) and two hierarchical linear models (UHLMM and AHLMM) were analyzed using value-added measures obtained from a common data set with two years of standard assessment data. Value-added measures obtained from these three models were analyzed to determine the impact of the differences among the models. Correlational analyses were also conducted to see whether there were meaningful relationships among these value-added models. The SFEM and UHLMM produced very similar rank orders of school effects, while the SFEM and AHLMM had only a moderate correlation. Thus, there was not much difference between the SFEM and the two HLM models in terms of the rank orders of schools.


INTRODUCTION
Over the past few decades, there has been growing interest in the effectiveness and accountability of schools around the world. As an example, this has been the case with the U.S., especially since the adoption of the No Child Left Behind act of 2001, which requires states to measure student academic achievement and to report on progress using Adequate Yearly Progress (AYP) measures (Amrein-Beardsley, 2008). This system is based on an approach which gives rewards to schools that make contributions to students' learning and sanctions those that do not make any improvement on student test scores. Early applications of this state-wide assessment have focused on the current status of students. The current-status approach compares different cohorts of students at a single point in time (Doran & Izumi, 2004). It simply uses the percentage of students who passed the state test at the end of the school year.
Educators recognize that a one-time test score is not always a useful way to estimate school effects on student performance. Differences among schools may be due to student and school variables that are not measured in tests but that influence test scores. Current-status methods do not, for example, take socioeconomic factors into account when assessing schools' effectiveness. Although these methods are located at the heart of the state accountability system, there are at least two reasons why they are invalid and inappropriate to use for the purpose of school comparisons.
First, students come to school with different backgrounds. In other words, there is no random assignment of students to schools (Doran & Izumi, 2004) yet the statistical methodology underlying this approach assumes random assignment. This results in making unfair comparisons between disadvantaged and advantaged schools in terms of socioeconomic status.
Second, current-status methods are cumulative. They reflect the impact of learning obtained from all previous schools on students' performance scores (Doran & Izumi, 2004) but they do not differentiate current effects from previous effects. Thus, we cannot hold only the latest school accountable for a student's good or poor test score if the student has changed schools in the past. As Ballou, Sanders, and Wright (2004) note, holding schools accountable based on mean achievement levels makes no sense, when students enter those schools with large mean differences in achievement.
It is widely accepted that status-based accountability systems are likely to be flawed, resulting in inaccurate judgments of school quality (Doran & Izumi, 2004; Tekwe et al., 2004). As the shortcomings of this method have become increasingly apparent, an alternative way of assessing school effectiveness using growth models has gained acceptance. This new method focuses on the improvement students in the school made during the year. Instead of considering how cohort groups have increased in knowledge, measuring individual student progress over time from one time point to the next is more reasonable in terms of "learning," which is meant to be "change." Growth models are designed to generate estimates from these kinds of data (Doran & Izumi, 2004).
In this regard, researchers have developed a method called value-added analysis (VAA) which enables them to use individual student achievement scores over time in order to identify effective schools. As Tekwe et al. (2004) define it, value-added is a term used to label methods of assessing school or teacher effects on student progress from one year to the next, with that measure then used as the basis for a performance assessment system (p. 31). Pioneers of VAA claim that VAA generates fairer and more accurate estimates than those generated by state tests that measure only the achievement of a single year. The primary purpose of VAA is to determine the impact of teachers or schools on the progress of their students (Raudenbush, 2004). To do this, VAA computes gain scores by taking the differences between students' scores on state tests from one year to the next (Sanders et al., 2002).
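The gain-score computation described above can be illustrated with a small sketch; the student identifiers and score values below are hypothetical, chosen only to show the year-to-year difference:

```python
import pandas as pd

# Hypothetical two-year test records: one row per student per year.
scores = pd.DataFrame({
    "student": ["s1", "s1", "s2", "s2"],
    "year":    [2002, 2003, 2002, 2003],
    "score":   [310, 335, 290, 322],
})

# Pivot to one row per student, then take the year-to-year difference
# as the gain score used in value-added analysis.
wide = scores.pivot(index="student", columns="year", values="score")
wide["gain"] = wide[2003] - wide[2002]
print(wide["gain"].to_dict())  # {'s1': 25, 's2': 32}
```

Gain scores computed this way are only meaningful when the two years' scores sit on a common vertical scale, which is why vertically equated tests are a prerequisite for VAMs.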
The VAA approach evaluates schools based simply on how they increased the level of their students' knowledge. The two basic ideas underlying value-added measurement are that it is calculated for each individual nested within the schools and that it is based on changes in student performance from one year to the next (Ladd & Walsh, 2002). Another advantage cited of VAA is that, unlike the current-status method, it can control the effect of confounding variables such as student and school socioeconomic status that may influence the test scores. In this way, it is an attempt to minimize the influence of experiences, privilege, and ethnicity on student performance.
In general, value-added models (VAMs) are a class of statistical modeling procedures that analyze students' standardized test scores over time to identify the degree to which a student's progress is a function of their own characteristics or of the characteristics of their school (Doran & Izumi, 2004). VAMs have recently received a great deal of interest from both policy makers and researchers due to a belief that these models can adequately determine how individuals are growing over time. Another important model is one developed by Raudenbush and Bryk (1986) and Aitkin and Longford (1986), which relies on hierarchical linear models to measure student growth. Although there are several VAMs which are based on different statistical assumptions (Braun, 2004; Tekwe et al., 2004), the most popular has been the TVAAS (Olson, 2004). For any of these models to be useful in VAA, however, the test scores must be vertically scaled (Ballou et al., 2004; Doran & Cohen, 2005). That is, the test scores must all be expressed on a common scale that extends over the time periods included in the analysis. In brief, longitudinal data, annual assessment, and vertically equated tests are said to be basic elements of VAMs. Typically, standardized assessment scores are used in VAM studies. Though no VAM has yet been shown to be clearly superior to another, VAMs are considered to be fairer and more accurate than conventional methods (Doran & Izumi, 2004).
To date, several alternative models, ranging from simple gain scores to complex mixed models, have been suggested by researchers with regard to assessment of school effectiveness. However, there have been a limited number of studies which make comparisons among these different models (Ballou et al., 2004; McCaffrey et al., 2003; Tekwe et al., 2004). Selection of the most useful model for an accountability analysis requires determining which model is most accurate. Fortunately, a few important studies have been conducted to determine the most desirable model for computing school effects. The Journal of Educational and Behavioral Statistics published one volume solely concerning VAA and popular VAMs (Wainer, 2004). The papers in that volume concluded that there are numerous acceptable models as opposed to only a single acceptable model. Tekwe et al. (2004), Ballou et al. (2004), and McCaffrey et al. (2003) describe differences among VAMs. As these studies have noted, compared to other methods, VAMs are less biased and produce more precise estimates. Although there is a lack of comparative studies showing which VAM is better than the others, the LMEM model has been used frequently for accountability purposes. Ballou et al. (2004) conducted a simulation study to evaluate the TVAAS model, which is based on the LMEM. Results indicated that the TVAAS uses a highly parsimonious model that omits controls for contextual factors, such as SES and demographics, that influence achievement. Unlike the LMEM, HLM models include school and student variables and attempt to control such factors by statistical adjustment (Bryk & Raudenbush, 1992). Sanders et al. (2002) noted that inclusion of these factors in HLM affects the school estimates, resulting in measures of schools biased towards zero. Sanders' LMEM model does not account for these variables; instead, it attempts to eliminate the need for such controls by use of multiple measures on each student (Ballou et al., 2004).
Sanders found that the inclusion of these factors in the model did not result in a significant difference between the two models (Ballou et al., 2004). Results of a simulation study comparing the general model, which is similar to the AHLMM, with those of a layered model, which is similar to the LMEM, however, suggested that the AHLMM fit the data better than the layered model (McCaffrey et al., 2003). Tekwe et al. (2004) found little or no benefit from use of more complex models. The simpler SFEM provided results that were more accurate than estimates from the other models. Results also indicated that the AHLMM would be preferred when there is a need to control the effects of student and school variables, and that selection of one of the two models should be based on non-empirical considerations.
Although VAMs have been shown to be an important tool for accountability systems, a number of researchers have criticized the application of VAMs for determining school or teacher effectiveness. An important criticism of VAMs is that they do not yet completely solve the problem of randomization (Wiley, 2006). Another criticism of VAMs concerns the precision of the value-added estimates obtained from longitudinal data sets. Schochet and Chiang (2010) examined the likely system error rates for measuring teacher and school performance in the upper elementary grades using ordinary least squares (OLS) and Empirical Bayes (EB) methods applied to student test score gain data.
Similarly, Guarino, Reckase, and Wooldridge (2015) investigated the accuracy of the value-added estimates of teachers obtained from commonly used value-added models. They found that no one method accurately captures true teacher effects and classifies teachers under realistic conditions. In addition, the VAM approach has been shown to be invalid when there is endogeneity, which may be due to correlation between the random effect in the hierarchical model and some of its covariates (Manzi, San Martín, & Van Bellegem, 2014). Another criticism of VAMs concerns the data requirements of these models. As mentioned above, vertically equated test results from multiple years are basic elements of VAMs. This makes VAMs useful for a single developmental scale. However, most VAMs cannot be used with multiple test instruments (on different scales) administered within a school year. A few researchers have discussed how to use VAMs to analyze longitudinal student achievement data obtained from multiple instruments (Green, 2010; Rivkin, Hanushek, & Kain, 2005).
There have been numerous studies that show the strengths of VAMs over conventional methods. However, the question remains whether simpler models are as efficient as more complex models (Doran & Fleischman, 2005). The several models introduced in VAA calculate value-added measures based on different assumptions. The SFEM and UHLMM do not account for school/non-school variables, while the AHLMM attempts to control these factors by statistical adjustments. In this study, the impacts of school and non-school factors on school-level value-added scores were compared using an empirical data set, with an eye to better understanding problems associated with model complexity. Three popular VAMs (i.e., SFEM, UHLMM, and AHLMM) were examined. The models selected for the present study are similar to those in a previous study conducted by Tekwe et al. (2004), who also examined the LMEM in addition to the models compared here. The LMEM was excluded from our study due to its data requirements.

Instrumentation
Data for this study were taken from the 2002 and 2003 statewide mathematics and reading test results of the Florida Comprehensive Assessment Test (FCAT) for Grades 6 to 8. Separate analyses were done for each grade. The FCAT is a criterion-referenced test that aims to assess student achievement in the higher-order cognitive skills represented in the Sunshine State Standards (Florida Department of Education, 2003) in reading, mathematics, writing, and science. The FCAT includes three types of questions: multiple-choice items, gridded-response items, and open-ended items. The FCAT scaled scores used in this study were vertically scaled, thus making them appropriate for VAA.

Sample
Separate analyses were performed for each of the grade cohorts for Grades 6, 7, and 8 in a large Florida school district with 44 secondary schools for 2002 and 2003. Only standard curriculum students were used in the analyses. Special education students with any exceptionality and students in the limited English proficiency (LEP) program for two or fewer years were excluded for the following reasons. Generally, it is impossible to collect from students with severe cognitive disabilities the two years of scores that are required for most of the VAMs. In addition, students with limited English cannot show their real performance on the state test, and this may have a negative effect on the value-added measures of schools. Students whose reported ages were outside the acceptable age range for a given grade were also excluded from the analyses. Listwise deletion was applied to exclude these students' information.
A total of 60,718 students were available for analyses after the exclusions: 19,611 for Grade 6, 20,433 for Grade 7, and 20,674 for Grade 8. Non-school variables for socioeconomic status and minority status were included in the data set. Socioeconomic status information was provided in the form of the student's eligibility for the free-or-reduced lunch program. Minority status is a school-level variable based on the proportion of African-American or non-African-American students in the school. Descriptive statistics based on grade and subject combination are presented in Table 1.

Value-Added Models Used in This Study
As noted above, VAMs have the capability of controlling for the effects of non-school variables as well as prior performance. In this study, results for three commonly used VAMs were compared: a simple fixed effects model and two hierarchical linear models. It should be noted that the layered mixed effects model (LMEM) is another popular VAM that is useful for data sets collected from students attending multiple schools. This model was not examined here because the data set in this study does not include students attending multiple schools within a school year. This makes the present study different from Tekwe et al. (2004).

Simple fixed effects model (SFEM)
Fixed effects models (FEM) used for VAA assume school effects to be fixed rather than random. These have the advantage of being the simplest VAMs, requiring less computation than the others. As a result, estimates from FEM are more easily understood by policymakers and educators with little statistics experience (Wiley, 2006). The simple fixed effects model (SFEM) is an extension of the FEM. One concern with this model is that it does not incorporate student-level covariates and does not apportion variance for students who have attended multiple schools. Thus it does not produce any shrunken estimates. As the SFEM uses only two years of data in a single subject, however, its application is very straightforward.
Model parameterization:

$$d_{ijs} = \mu_s + \sum_{m=1}^{43} \alpha_{ms} x_{im} + \epsilon_{ijs} \qquad (1)$$

where $d_{ijs} = y_{ijs2} - y_{ijs1}$ is a simple change score obtained from the difference between two examinations of student $j$ in school $i$ on the same subject area $s$; $y_{ijst}$ is the test score on subject area $s$ ($s = 1, 2$) at time $t$ ($t = 1, 2$) for student $j$ ($j = 1, \ldots, n_i$) in school $i$ ($i = 1, \ldots, 44$); $x_{im}$ is the effect coding at time $t = 2$ for school $k$ ($k = 1, \ldots, 44$) with coding numbers $m$ ($m = 1, \ldots, 43$), where $x_{im} = 1$ for $i = m$ and $i \neq 44$, $x_{im} = 0$ for $i \neq m$ and $i \neq 44$, and $x_{im} = -1$ for $i = 44$; and $\epsilon_{ijs}$ is the random error for student $j$ in school $i$ for subject area $s$.
It is assumed that $\epsilon_{ijs} \sim N(0, \sigma_s^2)$.
The coefficient $\alpha_{ks}$ in Equation 1 is the value-added component in subject area $s$ for school $k$.
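Because the effect coding sums to zero across schools, each SFEM fixed effect reduces to a school's mean change score expressed as a deviation from the mean of the school means, so the value-added components can be computed directly. A minimal numerical sketch (not the authors' code; the gain scores for three hypothetical schools are invented for illustration):

```python
import numpy as np

# Invented gain scores for three hypothetical schools; a real analysis
# would use all 44 schools in the district.
gains = {
    "A": np.array([12.0, 18.0, 15.0]),
    "B": np.array([5.0, 9.0]),
    "C": np.array([20.0, 26.0, 23.0, 27.0]),
}

# With effect (sum-to-zero) coding, the OLS fixed effect for each school
# is its mean gain minus the mean of the school means.
school_means = {k: v.mean() for k, v in gains.items()}
grand = np.mean(list(school_means.values()))
value_added = {k: m - grand for k, m in school_means.items()}

print(value_added)
```

Schools whose students gained more than the district-wide average of school means receive positive value-added components, and the components sum to zero across schools, mirroring the effect-coding constraint.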
Hierarchical linear models
Hierarchical linear models (HLM) require hierarchically ordered nested data. The hierarchical nature of the structure is that students are considered nested within classes and classes nested within schools. Due to the nature of the data used in education, HLM has been used extensively for analysis of school effects (Raudenbush & Bryk, 2002). HLM is a special type of the general mixed model family and can be used to obtain value-added measures. These models demand more computation than the SFEM, but unlike the SFEM, HLM-based models produce shrunken estimates.
The HLM analysis consists of four parts, as follows (Raudenbush & Bryk, 1988-1989):
i. Apportioning variation between and within units of analysis
ii. Assessing the homogeneity of regression assumption
iii. Testing for compositional effects
iv. Assessing the effect of the method

Traditional regression methods assume that individuals are independent of each other, although students in the same school might have similar results when compared to students from different schools. Unlike traditional linear models, HLM can handle this violation of the independence assumption.
In this study, two different types of HLM were examined, unadjusted HLM (UHLMM) with random intercept and adjusted HLM (AHLMM). The AHLMM consists of two equations called student-level and school-level models. The two-level HLM provides an analytical framework for examining the effects of schools on student outcomes. An extension of two-level model (i.e., three-level HLM) can be used to obtain value-added estimates of schools and teachers using a data set structure which has students nested within teachers and teachers nested within schools.

Unadjusted hierarchical linear model (UHLMM)
UHLMM uses the unadjusted change score with a random intercept. This model consists of a two-level HLM described by the following equations.

Student-level model:

$$d_{ijs} = \beta_{0is} + \epsilon_{ijs}$$

where $d_{ijs}$ is the change score defined as in Equation 1, $\beta_{0is}$ is a random intercept associated with school $i$, and $\epsilon_{ijs}$ is a random error.

School-level model:
$$\beta_{0is} = \gamma_{0s} + u_{0is}$$

where $\gamma_{0s}$ is the mean of the random intercepts and $u_{0is}$ is the random effect of school $i$ on the random intercept for subject area $s$. The terms $u_{0is}$ and $\epsilon_{ijs}$ are assumed to be independent, and both are assumed to be normally distributed.
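The shrunken estimates that distinguish the HLM-based models from the SFEM follow the standard empirical-Bayes logic: each school's raw effect is pulled toward the overall mean by a reliability weight that shrinks small schools more. A sketch under assumed variance components (the raw effects, school sizes, and variances below are invented for illustration):

```python
import numpy as np

# Illustrative raw school effects (mean gain deviations) and school sizes.
raw = np.array([8.0, -3.0, 1.5])       # hypothetical raw school effects
n   = np.array([150, 12, 400])         # students per school

tau2   = 4.0    # assumed between-school variance
sigma2 = 100.0  # assumed within-school (student-level) variance

# Empirical-Bayes shrinkage: the reliability weight tau2 / (tau2 + sigma2/n)
# is close to 1 for large schools and small for small schools, so small
# schools are pulled most strongly toward the overall mean (zero here).
weight = tau2 / (tau2 + sigma2 / n)
shrunken = weight * raw

for r, w, s in zip(raw, weight, shrunken):
    print(f"raw={r:+.1f}  weight={w:.2f}  shrunken={s:+.2f}")
```

The tiny 12-student school is shrunk most, which is why HLM-based rankings are less sensitive to noisy estimates from small schools than SFEM rankings.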

Adjusted hierarchical linear model (AHLMM)
The AHLMM is adjusted for student-level and school-level covariates.

Student-level model:
$$d_{ijs} = \beta_{0is} + \beta_{1s} y_{ijs1} + \beta_{2s} M_{ij} + \beta_{3s} P_{ij} + \epsilon_{ijs}$$

where $\beta_{0is}$ is a random intercept associated with school $i$ and subject area $s$; $M_{ij}$ is an indicator of minority status (yes or no) for student $j$ in school $i$; $P_{ij}$ is an indicator of poverty, in which eligibility for the free-or-reduced lunch program is considered to indicate poverty (yes or no), for student $j$ in school $i$; $\beta_{1s}$, $\beta_{2s}$, and $\beta_{3s}$ are the fixed effects of the previous year's test score, minority status, and poverty on the learning gain in subject area $s$; and $\epsilon_{ijs}$ is a random error.
School-level model:

$$\beta_{0is} = \gamma_{0s} + \gamma_{1s} \bar{y}_{i1} + \gamma_{2s} W_i + u_{0is}$$

where $\bar{y}_{i1}$ is the mean input score for school $i$, $W_i$ is the percentage of students in poverty in school $i$, $u_{0is}$ is the random error associated with the value of the random intercept for the subject area test $s$ and school $i$ in the student-level model, and the $\gamma$'s are fixed-effect coefficient parameters. The within- and between-school error terms, $\epsilon_{ijs}$ and $u_{0is}$, are assumed to be independent.

RESULTS
Assumptions and characteristics of each of the VAMs used in this study are shown in Table 2, so the differences among the models can be seen there. Interpretations of results for each model are based on the distinguishing characteristics of that model. Correlations between VAM measures of schools generated from each model are given in Table 3. Schools were ranked based on their VAM estimates from the different models, so the correlational results provide information about the rank order of school effects generated from each model. Tables with these rankings are also presented in the Appendices.
With respect to the assumption of school effects as random, the SFEM is the only model that treats school effects as fixed. Therefore, it is appropriate to compare the SFEM to the UHLMM, which differs only in that it considers the school effect to be random. The most important finding evident in Table 3 is the very high correlation between SFEM and UHLMM value-added estimates (r = .99) in all cohorts. This suggests that the two models provide the same rank ordering of schools. Thus, it is possible to conclude that there was no difference between treating school effects as random or fixed in terms of the rank order of school effects.
A second concern in measuring school effectiveness is whether to include school and non-school covariates in the models. Among the models in this study, only the AHLMM can take both student-level and school-level effects into account. Apart from this characteristic, the AHLMM and UHLMM are identical. As a result, we can make inferences based on the comparison of these two models. As can be seen in Table 3, there were moderate correlations, ranging from .54 to .85, between the AHLMM and UHLMM for the different cohorts. This indicates that including school and non-school variables in the AHLMM had a clear impact on the VAA estimates.
Another comparison with the AHLMM can be made with SFEM. This comparison will help to see the effects of employing shrinkage or including school and non-school variables in the AHLMM model. Correlations between these two models showed moderate values ranging from .55 to .85. These results suggest there is a noticeable difference between SFEM and AHLMM. Although the AHLMM is appropriate when seeking to adjust for confounding variables, the only thing we can really conclude is that there was a difference between the rank orders of schools based on these two models.
Strong correlations were observed between results generated by the SFEM and UHLMM, but much more modest correlations were observed between the AHLMM and all other models. We conclude on the basis of these results that there was not much difference between the SFEM and hierarchical models in terms of the rank order of school estimates.
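The rank-order comparisons above amount to computing rank correlations between the school estimates produced by each pair of models. A minimal sketch of such a comparison (ties are ignored for simplicity, and the five school estimates shown are invented, not taken from Tables 5 and 6):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks
    (assumes no ties among the values)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# Hypothetical value-added estimates for five schools from two models.
sfem  = np.array([1.2, -0.4, 0.8, 2.1, -1.0])
uhlmm = np.array([1.0, -0.3, 0.7, 1.8, -0.9])  # same ordering as sfem

print(spearman(sfem, uhlmm))  # 1.0 — identical rank order of schools
```

A coefficient of 1.0 corresponds to the SFEM-UHLMM situation reported here (identical rank orders), while moderate values such as the observed .54 to .85 indicate that some schools change rank when the AHLMM's covariate adjustments are applied.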
Once a model is chosen, value-added measures can be converted to standardized grades to determine the relative performance of the teachers within each school (or attributed to each school). To obtain standardized grades, value-added measures were divided by their standard errors to form z-scores, which were assigned grade point average (GPA) values using the following criteria from Tekwe et al. (2004): if z > 2, assign a grade of A and 4 growth points; if 1 < z ≤ 2, assign a grade of B and 3 growth points; if -1 < z ≤ 1, assign a grade of C and 2 growth points; if -2 < z ≤ -1, assign a grade of D and 1 growth point; if z ≤ -2, assign a grade of F and 0 growth points.
Results of the standardized grade conversions are presented in Table 4.
Since grades from the SFEM and UHLMM models were found to be similar, we present only results for the SFEM and AHLMM in Table 4. Results in Table 4 suggest that large schools with higher value-added estimates tended to have lower GPA values than smaller schools with lower value-added estimates, although it was also possible that large schools with lower value-added estimates could have higher GPA values.
Individual school estimates and their rankings were obtained for each grade from three different VAMs. Only estimates for Grade 6 are presented (see Tables 5 and 6 in Appendices A and B). (Estimates for Grades 7 and 8 are available on request from the first author.). For the SFEM, estimates can be interpreted as the difference between the school specific sample average change and the average changes overall. Estimates from the UHLMM are shrunken estimates of school effects from the SFEM. These can be calculated as estimates of the best linear unbiased predictors of the random effects for each school and each grade. Value-added estimates of the AHLMM were also calculated as estimates of best linear unbiased predictors.
The ranks of the school estimates from the SFEM were similar to those of the school estimates from the UHLMM. It is interesting to note that estimates from both models were very similar. This result also suggests that there was little difference in estimating school effects as either random or fixed. Results from the AHLMM had moderate agreement with results from SFEM. Results from each of the models suggested that VAM rankings of schools differed across different grades. Results compared for each grade, however, were very consistent with the results of correlational analyses.

DISCUSSION and CONCLUSION
The purpose of the present study was to determine whether there were similarities or differences among three models commonly used for value-added assessment of schools. The simplest model was the SFEM, which treats school effects as fixed. Two hierarchical linear models were also included. Each model has distinguishing characteristics and different assumptions. Value-added estimates of individual schools obtained from these models were analyzed to compare the results produced by the different models.
The primary question was whether results from simpler models, such as the SFEM, were as effective as those from more complex models, such as the AHLMM, in terms of school rankings. Previous research has found little difference between the results of simple and complex value-added models, with correlations between estimates from SFEM and AHLMM models ranging from .55 to .85 (Tekwe et al., 2004). Results from this study were somewhat consistent with previous research in that the simple model produced rank orders of school effects similar to those of the more complex AHLMM. Based on these results, it may be concluded that simple models were as effective as more complex models at estimating value-added effects of schooling. Further, simpler models generally could be used in place of more complex models such as the AHLMM. There is typically a desire for simpler statistical models among policy makers as well as the general public. Results of the present study tend to support the use of simpler models such as the SFEM in value-added accountability systems.
Another concern in value-added studies is to determine the impact of the inclusion of school and student background variables into models on model estimates. Among the models in this study, only the AHLMM includes statistical adjustments for these potentially confounding variables. Tekwe et al. (2004) suggested that both inclusion and exclusion of these variables during the analysis result in biased estimates of schools. In this study, the estimates from the AHLMM model were compared to estimates from other models to determine the effects of these covariates. No major differences were observed between results of the AHLMM, the UHLMM and the SFEM. Correlations between estimates from the AHLMM and SFEM ranged from .55 to .85. Correlations between results from the AHLMM and the UHLMM also ranged from .54 to .85. These correlations were mostly consistent with results from previous research. Consistent with previous research, inclusion of these covariates did have an effect on value-added estimates. The omission of covariates from the model appeared to bias parameter estimates when students were stratified by those covariates (McCaffrey et al., 2003).
The present study also reported standardized GPA grading and rankings of each school based on value-added estimates from each model. These results were consistent with the correlational analysis. VAM-based rankings of schools showed differences across grades. It should be noted that the conclusions drawn from this study cannot be generalized to teachers or to other test conditions. Although value-added models are believed to be useful in school accountability systems, the credibility of these methods has been questioned by a number of researchers (AERA, 2015; Amrein-Beardsley, 2014; Ballou & Springer, 2015; Guzman, 2016; American Statistical Association [ASA], 2014). Amrein-Beardsley (2014) emphasized that VAMs have several problems with reliability, validity, and bias, affecting their fairness and transparency. In addition to these serious problems, the theoretical and methodological assumptions of VAMs have also been questioned in the literature. Thus, school (or teacher) performance should not be based only on value-added measures obtained from any of the VAMs described in this study. As Amrein-Beardsley (2014) suggested, multiple measures and more holistic evaluation systems should be used for school evaluations rather than relying only on VAMs.