Using Generalizability Theory to Investigate the Reliability of Scores Assigned to Students in English Language Examination in Nigeria

Olufunke

Abstract
The study investigated the reliability of scores assigned to students in English language in the National Examinations Council (NECO). The population consisted of all students who sat for the NECO Senior School Certificate Examination (SSCE) in 2017 in Nigeria, from which a sample of 311,138 was selected using the proportionate stratified sampling technique. The Optical Marks Record (OMR) sheets containing the responses of the examinees served as the instrument for the study. The data were analyzed using the lme4 package of the R language and environment for statistical computing, factor analysis, and the Tucker index of factor congruence. The psychometric properties of the data were determined by estimating the generalizability (g) coefficient, the phi (Φ) coefficient, and construct validity. The results indicated a g-coefficient of 0.90 and a Φ coefficient of 0.87, an indication of highly reliable scores. The results also showed that decreasing the number of items decreased both the g- and phi coefficients in the D-study. The construct validity of 0.99 obtained from the results affirms the credibility of the items. Hence, it was concluded that the scores were dependable and generalizable.


INTRODUCTION
Generalizability theory is a statistical method used to analyze the results of psychometric tests, such as performance tests like the objective structured clinical examination, written or computer-based knowledge tests, rating scales, or self-assessment and personality tests (Breithaupt, 2011). It recognizes that multiple sources of error, such as error attributed to items, occasions, and forms, may operate simultaneously in a single measurement process. The basic approach underlying generalizability theory (g-theory) is therefore to decompose an observed score into a component for the universe score and one or more error components. Its main purpose is to generalize from the observation at hand to the appropriate universe of observations. It is also advantageous in that it can estimate the reliability of the mean rating for each examinee while simultaneously accounting for both inter-rater and intra-rater inconsistencies as well as discrepancies due to various possible interactions, which is impossible in Classical Test Theory (CTT) (Brennan, 2001). In generalizability theory, the various sources of error contributing to the inaccuracy of measurement are explored, making it a valuable tool for judging the methodological quality of an assessment method and improving its precision. It offers the opportunity to disentangle the error components of measurement and is concerned with the reliability or dependability of behavioral measurement, that is, the certainty with which a score can be generalized.
Breithaupt (2011) identified two types of measurement error in the examination of items and test scores: random error and systematic error. Systematic error is a source of bias in scores and an issue of validity, while random error is measurement error that can be estimated in reliability studies. Such estimates permit the test developer to determine the possible size and sources of construct-irrelevant variation in test scores. It is assumed that the skill, trait, or ability measured is a relatively stable, defined quantity during testing; variation in obtained scores is therefore usually attributed to sources of error, which poses the challenge of determining the psychometric properties of a test. The goal of psychometric analysis is to estimate and, where possible, minimize the error variance so that the observed score (X) is a good measure of the true score (T). Understanding whether test error is due to high variance is important in measurement. It is generally assumed that an exact or true value exists, based on how what is being measured is defined; though the exact true value may not be known, attempts can be made to approach the ideal value. In CTT, any observed score is seen as the combination of a true component and a random error component, even though the error could arise from various sources; however, only a single source of measurement error can be examined at any given time. CTT treats error as random and cannot be used to differentiate systematic error from random error. Generalizability theory also focuses on the universe score, the average score that would be expected across all possible variations of the measurement procedure (e.g., different raters, forms, or items). This universe score is taken to represent the value of a particular attribute for the object of measurement (Crocker & Algina, 2008). The universe is defined by all possible conditions of the facets of the study.
It also gives the opportunity to judge whether the score differences observed between subjects could be generalized to all items and occasions (de Gruijter & van der Kamp, 2008). This means that g-theory helps to determine whether the means observed over a sample of items and a sample of occasions can be generalized to the theoretical universe of items and occasions. Since g-theory focuses on the simultaneous influence of multiple sources of measurement error variance, it more closely fits the interests of researchers.
The reliability coefficients under CTT usually focus on the consistency of test results. For instance, test-retest reliability considers only the times/occasions of testing, parallel-forms reliability considers only the forms of the test, and internal consistency considers the items as the only source of error. Some authors (Mushquash & O'Connor, 2006; Webb, Shavelson, & Haertel, 2006) noted that the effects of various sources of variance can be tested using CTT models, within which it is only possible to examine a single source of measurement error at a given time, but that it is impossible to examine the interaction effects that occur among these different sources of error. Generalizability theory is particularly useful in this regard: each feature of the measurement situation that constitutes a source of error in test scores is termed a facet. The inadequate accounting for numerous sources of error pointed out by several authors (Brennan, 2001; Johnson & Johnson, 2009) and researchers' dissatisfaction with CTT's inability to identify possible sources of error and examine them simultaneously led to the development of g-theory as an extension of CTT. It offers a broader framework than CTT for estimating reliability and errors of measurement. Generalizability theory involves two types of study: the generalizability study (G-study) and the decision study (D-study). The main purpose of a G-study is to estimate the components of score variance associated with various sources, while a D-study uses these estimated variance components to evaluate and optimize among alternatives for subsequent measurement. Two types of decisions and corresponding error variances, relative and absolute, are distinguished in g-theory, whereas only relative decisions are made in CTT (Brennan, 2001; Yin & Shavelson, 2008). Alkharusi (2012) explained that an observed score for any student obtained through some measurement procedure could be decomposed into the true score and a single error.
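The score decomposition described above can be illustrated with simulated data. The following Python sketch is purely illustrative (all numbers are hypothetical and not from the study): each observed score in a single-facet persons-by-items design is the sum of a grand mean, a person (universe-score) effect, an item effect, and a residual that confounds the person-by-item interaction with random error.

```python
import random

random.seed(42)

# Hypothetical single-facet (persons x items) data, for illustration only.
n_persons, n_items = 200, 50
mu = 0.65                                             # grand mean
person_eff = [random.gauss(0, 0.10) for _ in range(n_persons)]   # nu_p
item_eff = [random.gauss(0, 0.25) for _ in range(n_items)]       # nu_i
residual = [[random.gauss(0, 0.40) for _ in range(n_items)]      # e_pi
            for _ in range(n_persons)]

# X_pi = mu + nu_p + nu_i + e_pi  -- the basic g-theory decomposition
scores = [[mu + person_eff[p] + item_eff[i] + residual[p][i]
           for i in range(n_items)] for p in range(n_persons)]
```

In a G-study the variances of these (unobserved) components are estimated from the observed scores alone; the D-study then asks how reliability would change under different numbers of conditions per facet.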
Since the performances of students in the National Examinations Council (NECO) Senior School Certificate Examination (SSCE) are based on the sum of their total scores, that is, on CTT, there is a need to consider the psychometric properties (difficulty, discrimination, reliability, validity) of the test when making decisions on the observable performance of candidates, in order to improve test construction, administration and analysis. Reliability and validity are two technical properties that indicate the quality and usefulness of tests, as well as major factors to be considered in the construction of test items for examinations. Junker (2012) described reliability as the extent to which a test would produce consistent results if it were administered again under the same conditions. It also reflects how dependably a test measures a specific characteristic. This consistency is of three types: over time (test-retest reliability), across items (internal consistency), and across different raters (inter-rater reliability). Many reasons can be adduced for an individual not obtaining exactly the same test score every time he or she takes the test, including the test taker's temporary psychological or physical state, multiple raters and test forms. These factors are sources of chance, or random measurement error, in the assessment process. If there were no random errors of measurement, the individual would get the same test score, that is, the individual's true score, each time. The degree to which test scores are unaffected by measurement errors is an indication of the reliability of the test.
Reliability is threatened when errors occur in measurement. When a measure is consistent over time and across items, one may conclude that the scores represent what they are intended to; yet there is more to it, because a measure can be reliable but not valid. Reliability and validity are therefore both needed to assure adequate measurement of the construct of interest. Validity refers to what characteristic the test measures and how well the test measures that characteristic; in other words, it determines the extent to which a measure adequately represents the underlying construct that it is supposed to measure. Valid conclusions cannot be drawn from a test score unless one is sure that the test is reliable, and even when a test is reliable, it may not be valid. Therefore, care should be taken to ensure that any test selected is both reliable and valid for the situation. The accuracy and validity of the interpretation of test results are determined by the inferences made from test scores. Validity of inferences is concerned with the negative consequences of test-score interpretation that are traceable to construct underrepresentation or construct-irrelevant variance. The focus should be on the theoretical dimensions of the construct a test intends to measure, in order to prevent inappropriate consequences of test-score interpretation. Generally, in testing, it is necessary to consider how test takers' abilities can be inferred from their test scores. Students' marks are always affected by various types of measurement error, which reduce the accuracy of measurement. The magnitude of measurement error is incorporated in the concept of reliability of test scores, where reliability itself quantifies the consistency of scores over replications of a measurement procedure. It is often expected that test-score variation should be due only to test takers' differing abilities and task demands.
In reality, however, it has been shown that test takers' scores are most of the time affected by other factors, including test procedures, personal attributes other than abilities, and other random factors. A single score obtained on one occasion, on a particular form of a test, with a single administration, as done by NECO, is not fully dependable, because it is unlikely to match that person's average score over all acceptable occasions, test forms, and administrations. A person's score would usually differ on other occasions, on other test forms, or with different administrators. Which are the most serious sources of inconsistency or error? Where feasible, the error variance arising from each identified source should be estimated. Despite the strengths of g-theory, it has not been widely applied to estimate the dependability of scores of students in secondary school examinations in Nigeria.
In Nigeria, at the end of secondary school education, students are expected to write certification examinations such as the SSCE conducted by the West African Examinations Council (WAEC) and NECO, or the National Business and Technical Certificate Education (NBTCE) examinations conducted by the National Business and Technical Examinations Board (NABTEB). NECO conducts the SSCE in June/July and November/December every year. It was established in 1999 to reduce the workload of WAEC, especially to mitigate the burden of testing a large number of candidates, and to democratize external examinations by providing candidates with a credible alternative. While some Nigerians saw NECO's arrival as an opportunity for candidates to choose an examination body to patronize, others doubted its capacity to conduct reliable examinations that could command widespread national and international respect and acceptability.
English language education is a colonial legacy that has become deeply entrenched in Nigerian heritage and has apparently become indispensable. It is widely recognized as an instrument par excellence for sociocultural and political integration as well as economic development. Its use as a second language and as the language of education provided speedy access to modern developments in science and technology (Olusoji, 2012). It is for these reasons that much importance is attached to English language education nationwide and at all levels of the nation's educational system. To date, the English language remains the major medium of instruction at all levels of education in Nigeria, and no student can proceed to the tertiary level without a minimum of a pass in the English language. In addition, considering the importance of the English language as an international language and its influence on Nigerian secondary school students' performance, it is imperative that generalizability theory be used to examine the credibility of secondary school examinations; hence this study.

Purpose of the Study
The objectives of the study are to:
1. determine the generalizability coefficient of the English Language items;
2. estimate the phi (dependability) coefficient of the English Language items;
3. determine the validity of the English Language items; and
4. conduct a D-study to determine the generalizability and phi coefficients based on the results of the G-study.

METHOD
The study adopted the ex post facto research design, which examines cause and effect through the selection and observation of existing variables without any manipulation of existing relations.

Sample
The total population of students who sat for the NECO SSCE English Language examination in 2017 in Nigeria was 1,037,129, out of which 311,138 candidates constituted the study sample. The sample was selected using a proportionate stratified sampling technique: thirty percent of the candidates were randomly selected from each state. The details are presented in Table 1.

Data Collection Techniques
The data used in the study were the responses of the candidates to the 100-item multiple-choice test in the NECO June/July 2017 English language SSCE in Nigeria, as recorded on the Optical Marks Record (OMR) sheets obtained from the NECO office.

Instrument
The instrument used for the study was the OMR sheets for the NECO June/July 2017 English language objective items. The OMR sheets contained the responses of examinees to the NECO June/July 2017 English Language objective items, Paper III. The English Language examination is a dichotomously scored multiple-choice examination consisting of 100 items, each with five options. The responses of the examinees were scored 1 for a correct response and 0 for an incorrect one, so the minimum possible score for an examinee was zero and the maximum was 100.

Data Analysis
The data were analyzed using the lme4 package of the R language and environment for statistical computing, factor analysis, and the Tucker index of factor congruence. The generalizability study was conducted by fitting linear mixed-effects models with lme4 to obtain the g-coefficient and the phi coefficient. Factor analysis was conducted to identify a single dimension underlying the English language test for the male and female samples; the factor loadings extracted for the two samples were then compared using the Tucker index of factor congruence.
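The study estimated variance components with lme4's linear mixed-effects models in R. For a fully crossed single-facet (persons x items) design, the same components can also be obtained from a two-way ANOVA via the classical expected-mean-squares method. The following is a minimal Python sketch of that equivalent calculation, not the authors' code:

```python
def variance_components(scores):
    """ANOVA (expected-mean-squares) variance component estimates for a
    fully crossed persons x items design.

    scores: list of lists, scores[p][i] numeric (e.g., 0/1 item scores).
    Returns (sigma2_p, sigma2_i, sigma2_res), where sigma2_res confounds
    the person-by-item interaction with random error.
    """
    n_p, n_i = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    p_means = [sum(row) / n_i for row in scores]
    i_means = [sum(scores[p][i] for p in range(n_p)) / n_p for i in range(n_i)]

    ss_p = n_i * sum((m - grand) ** 2 for m in p_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in i_means)
    ss_tot = sum((scores[p][i] - grand) ** 2
                 for p in range(n_p) for i in range(n_i))
    ss_res = ss_tot - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    # Expected mean squares: MS_p = n_i*sigma2_p + sigma2_res, etc.
    sigma2_res = ms_res
    sigma2_p = max((ms_p - ms_res) / n_i, 0.0)
    sigma2_i = max((ms_i - ms_res) / n_p, 0.0)
    return sigma2_p, sigma2_i, sigma2_res

# Toy check: person effects 0/1, item effects 0/2, no error.
sigma2_p, sigma2_i, sigma2_res = variance_components([[0, 2], [1, 3]])
```

Negative estimates are truncated at zero, a common convention when a mean square falls below the residual mean square.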

RESULTS
A one-facet (p × i) design of generalizability theory was adopted to determine the generalizability coefficient, because there is a single facet, the items (i), while the persons (p) are the objects of measurement. To conduct the analysis under generalizability theory, two levels of analysis were performed as recommended by Shavelson and Webb (1991): the generalizability (G) study and the decision (D) study. First, the G-study was conducted, and thereafter the D-study was conducted based on the result of the G-study for the extraction of the generalizability coefficient. The analysis was carried out by fitting linear mixed-effects models using the lme4 package (Bates, Mächler, Bolker and Walker, 2015) of the R language and environment for statistical computing. Table 2 shows that the variance component for candidates (i.e., the universe-score variance) accounts for only 0.0142, or 6.0% of all the variance, which is rather low. Furthermore, the variance component for the items (0.0747, or 31.6% of the total variance) is large relative to the universe-score variance but smaller than the residual variance (0.1472, or 62.3% of the total variance). Figure 1 presents a histogram of the percentage of items that each candidate answered correctly. The figure shows that none of the participants got all the items correct or incorrect and that the overwhelming majority of participants answered 60% to 70% of the items correctly (i.e., 60 to 70 correct answers). This tight clustering accounted for the observed low universe-score variance. Table 3 shows the proportion of candidates answering each item of the 100-item 2017 NECO English language test correctly. The proportion correct per item ranges from .02 to .91, which reflects a lot of variation and corroborates the high percentage of variance accounted for by the items.
The large residual variance captures both the person-by-item interaction and random error, which we are unable to disentangle. Some items may have been answered more easily by some participants; there may have been systematic variation, such as the physical environment in which the test was administered, or other random variation, such as fatigue during the assessment. Whatever the case, these sources could not be separated from one another within this variance component.

Generalizability Coefficient of 2017 NECO English Language Test
The generalizability coefficient is analogous to the reliability coefficient in CTT: it is the ratio of the universe-score variance to the expected observed-score variance. For relative decisions and a random-effects design, the generalizability coefficient is calculated as

Eρ² = σ²(p) / (σ²(p) + σ²(δ)), with σ²(δ) = σ²(pi,e) / nᵢ (1)

where σ²(p) is the variation of students' test scores (the universe-score variance) and σ²(δ) is the relative error variance (Desjardins & Bulut, 2018). To determine the dependability coefficient, a D-study was conducted based on the G-study carried out for objective 1, and the dependability of the NECO test was then extracted from the D-study. As in the case of the generalizability coefficient, the lme4 package was used for the analysis. The dependability coefficient is calculated as

Φ = σ²(p) / (σ²(p) + σ²(Δ)), with σ²(Δ) = (σ²(i) + σ²(pi,e)) / nᵢ (2)

where σ²(p) is the universe-score variance and σ²(Δ) is the absolute error variance (Desjardins & Bulut, 2018). Table 5 presents the result. As can be seen from Tables 4 and 5, the G and phi coefficients for the 100-item fully crossed random design were estimated as .90 and .87, respectively. Table 6 shows the D-study results obtained by reducing the number of items. When the number of items was reduced from 90 to 80, the relative error variance increased from 0.0017 to 0.0019 and the absolute error variance from 0.0024 to 0.0028, while the g-coefficient decreased from .90 to .88 and the phi coefficient from .86 to .84. The D-study is particularly useful in determining which combination of measurement conditions can be employed to obtain reliable scores.
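The two coefficients and the D-study can be computed directly from the variance components reported in the G-study (persons 0.0142, items 0.0747, residual 0.1472). The following Python sketch, which is not the authors' R code, recovers the reported coefficients to within rounding of the published variance components:

```python
# Variance components reported in the G-study (persons, items, residual).
sigma2_p, sigma2_i, sigma2_res = 0.0142, 0.0747, 0.1472

def g_and_phi(n_items):
    """G and phi coefficients for a D-study with n_items items
    in a fully crossed random p x i design."""
    rel_err = sigma2_res / n_items                # relative error variance
    abs_err = (sigma2_i + sigma2_res) / n_items   # absolute error variance
    g = sigma2_p / (sigma2_p + rel_err)
    phi = sigma2_p / (sigma2_p + abs_err)
    return g, phi

# D-study: both coefficients shrink as the number of items is reduced.
for n in (100, 90, 80):
    g, phi = g_and_phi(n)
    print(f"n_items={n}: g={g:.2f}, phi={phi:.2f}")
```

Because the items term σ²(i) enters only the absolute error variance, Φ is always at most the g-coefficient, which matches the pattern in Tables 4 to 6.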
Two levels of analysis were conducted to determine the extent to which the test measured the same trait among male and female students. Factor analysis was conducted to identify a single dimension underlying the English language test for the male and female samples; the factor loadings extracted for the two samples were then compared using the Tucker index of factor congruence. The congruence coefficient is the cosine of the angle between two vectors and can be interpreted as a standardized measure of the proportionality of the elements in both vectors. It is evaluated as

φ(x, y) = Σᵢ xᵢyᵢ / √(Σᵢ xᵢ² · Σᵢ yᵢ²) (3)

where xᵢ and yᵢ are the loadings of variable i on factors x and y, respectively, i = 1, 2, 3, …, n (in this case, n = 100). Usually, the two vectors are columns of a pattern matrix. How large, then, should the coefficient be before two factors from two samples can be considered highly similar? Lorenzo-Seva and Ten Berge (2006) suggested that a value in the range of .85-.94 corresponds to fair similarity, while a value higher than .95 implies that the two factors or components compared can be considered equal. The estimated factor loadings and other parameters for the estimation of the congruence index are presented in the Appendix.
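The congruence index is straightforward to compute once the two loading vectors are available. The following Python sketch uses short hypothetical loading vectors for illustration; the study's actual 100-item loadings for the male and female samples are those given in the Appendix:

```python
from math import sqrt

def tucker_congruence(x, y):
    """Tucker's congruence coefficient between two loading vectors:
    the cosine of the angle between them."""
    num = sum(a * b for a, b in zip(x, y))
    den = sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den

# Hypothetical loadings, for illustration only (not from the Appendix).
male = [0.61, 0.55, 0.48, 0.70, 0.52]
female = [0.60, 0.57, 0.47, 0.69, 0.53]
similarity = tucker_congruence(male, female)
```

Because the coefficient depends only on the proportionality of the loadings, multiplying either vector by a positive constant leaves it unchanged.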
The table in the Appendix shows the parameters for estimating the Tucker congruence index; these parameters were substituted into Equation 3. The Tucker congruence index of similarity between the factors estimated for the male and female candidates' samples was .99, indicating that the factor underlying the performance of male candidates was almost identical to that underlying the female candidates' performance. The implication is that the construct validity of the 2017 NECO English language test was very high: the test measured, to a great extent, the proficiency of students in the English language, with no other nuisance factor(s).

DISCUSSION AND CONCLUSION
The findings of this study showed the magnitude of error in generalizing from a candidate's score on the 2017 NECO English language test to a universe score, as shown in Table 2. All 100 dichotomously scored items were analyzed using generalizability theory (G-theory) in a single-facet crossed study of persons (p) crossed with items (i). The variance component for candidates (i.e., the universe-score variance) accounts for a small percentage of all the variance, corresponding to the largely similar scores obtained by the examinees. To reach more reliable results, it is generally desirable that a test contain more items of moderate difficulty and relatively few very easy or very difficult items; most of the items in this test were of moderate difficulty. Accordingly, none of the examinees answered all the items correctly or incorrectly, and the majority answered between 60% and 70% of the items correctly. This tight clustering accounted for the observed low universe-score variance. Furthermore, the variance component for the items is large relative to the universe-score variance but smaller than the residual variance. The proportion of correct responses per item shows considerable variation, which corroborates the high percentage of variance accounted for by the items. The large residual variance captures both the person-by-item interaction and random error, which cannot be disentangled. The high estimated variance component for persons crossed with items and error indicates that almost two-thirds of the variability (random error) lies within this relationship and provides an estimate of the changes in the relative standing of a person from item to item (see Table 2). This result agrees with the findings of de Vries (2012) that the majority of error variance for the examination could be due to the interaction of persons with items, and that lowering this variance would increase dependability.
For relative decisions and a random-effects design, the generalizability coefficient (.90) indicates highly reliable scores. The dependability coefficient, Φ (.87), an index that reflects the contribution of the measurement procedure to the dependability of the examination, was also high. As Brennan (2003) and Strube (2002) noted, values approaching one (1) indicate that the scores of interest can be differentiated with a high degree of accuracy despite random fluctuations in the measurement conditions. An important advantage of Φ is that it can be used to determine the sources of error that reduce classification accuracy and the methods that best improve such classifications, although most authors examined variability across facets to determine which one would be of greater benefit to generalizability. These results are consistent with the findings of Gugiu, Gugiu and Baldus (2012), Fosnacht and Gonyea (2018), Tasdelen-Teker, Sahin and Baytemir (2016), Nalbantoglu-Yilmaz (2017), Kamis and Dogan (2018) and Rentz (1987), who reported that the acceptable standard for dependability is ≥ .70.
The study contrasts with the findings of Uzun, Aktas, Asiret and Yorulmaz (2018), de Vries (2012) and Solano-Flores and Li (2006), who argued that each test item poses a unique set of linguistic challenges and each student has a unique set of linguistic strengths and weaknesses, so that a certain number of items would be needed to obtain dependable scores. Based on the Tucker congruence index of similarity between the factors estimated for the male and female candidates' samples (.99), the factor underlying the performance of male candidates was almost identical to that underlying the female candidates' performance. This implies that the examination measures, to a great extent, the proficiency of students in the English Language. The result agrees with Zainudin (2012), who reported that the factor loading for an instrument must be higher than or equal to .50. Also, Lorenzo-Seva and Ten Berge (2006) suggested that a value in the range of .85-.94 corresponds to fair similarity, while a value higher than .95 implies that the two factors or components compared can be considered equal.

Conclusion
The study showed that reliability was high, which established that the scores assigned to candidates were dependable and generalizable. Item validity was also high because the test measured the underlying construct, which underscores the credibility of the items.

Recommendation
Prospective users of a measurement procedure are therefore advised to consider explicitly the various sources of variation. They have to state whether they are interested in making absolute or relative decisions and whether they wish to generalize over all facets or only certain facets of a measurement procedure. There is also a need to apply this approach to all school subjects to ensure the generalizability of certification examinations.