A META-ANALYTIC RELIABILITY GENERALIZATION STUDY OF THE OXFORD HAPPINESS SCALE

The purpose of this study was to conduct a meta-analytic reliability generalization of the short and long forms of the Oxford Happiness Scale (OHS) for the Turkish sample. In addition, the effect of different moderator variables on the reliability coefficients was examined. A number of criteria were set to determine the studies to be included in the meta-analysis. A total of 95 Cronbach's Alpha coefficients obtained from 92 studies selected according to these criteria were included in the meta-analysis. In the data analysis, reliability generalization based on meta-analysis was used. The effect of moderator variables on the variability of the reliability estimates, treated as effect sizes, was examined by Analog ANOVA. The mean alpha was .81 across all studies, .76 for the short form, and .87 for the long form of the OHS. In addition, it was concluded that the number of items had a statistically significant effect on the reliability estimates in terms of heterogeneity of true effect sizes, and that sample type had a statistically significant effect on the reliability estimates for the OHS (long form). However, sample type had no effect on the reliability estimates for the OHS-S (short form), and field of study had no effect on either the short-form or the long-form reliability estimates.


INTRODUCTION
The place and importance of measurement and assessment in education and psychology are indisputable. Accordingly, education and psychology are unthinkable without the field of measurement and evaluation. Two conceptions underlie the field of measurement and evaluation: these are reliability and validity. The aim of the classical test theory is to present a model to estimate the accuracy of test score measures. And the accuracy is related to the reliability of the test (McDonald, 1999). In short, reliability is the degree of being free from random error of measures. In addition, reliability means consistency of the scores received by the same individuals participating in the same or equivalent tests (Anastasi, 1982). A number of calculations are required to interpret reliability. At this point, the concepts of reliability index and reliability coefficient appear. While the reliability index refers to the relationship between observed scores and true scores, the reliability coefficient refers to the relationship between the scores from the parallel forms. Based on the mathematical relationship between the two concepts, it can be said that the reliability coefficient is the ratio between the true score variance and the observed score variance (Crocker & Algina, 2008). There are formulas suggested by researchers in the calculation of the reliability coefficient. Some coefficients require a single test administration, while others require more than one test administration. One of the most useful characteristics of internal consistency calculations is that it is based on only single test administration (Kline, 2005). 
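As an illustration of an internal consistency coefficient obtained from a single test administration, the following minimal sketch computes Cronbach's Alpha from an item-score matrix. The function name `cronbach_alpha` and the small data set `demo_scores` are hypothetical and are not taken from any of the studies discussed here.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's Alpha for a list of respondents, each a list of k item scores.

    Uses alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores)).
    """
    n_items = len(scores[0])
    if n_items < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in scores]) for i in range(n_items)]
    # Sample variance of the total (summed) scores.
    total_var = variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical responses of five people to three items (illustration only).
demo_scores = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 1], [3, 3, 3]]
alpha_demo = cronbach_alpha(demo_scores)
```

Because the coefficient depends on the item variances and the total-score variance in a particular sample, it is a property of the scores rather than of the test itself, which is the motivation for reliability generalization.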
Some of the formulas that are used in calculating the reliability coefficient in the context of internal consistency are as follows: KR-20, KR-21 (Kuder & Richardson, 1937), Guttman Lambda (λ3) or Cronbach's Alpha (α) (Cronbach, 1951; Guttman, 1945), Kristof's coefficient (Kristof, 1963), Stratified alpha (Cronbach, Schönemann, & McKie, 1965), Heise and
When the studies in the literature were examined, there were studies which analyzed reliability generalization and also whether the reliability coefficient differs according to variables such as test length (or the number of items), sample size, sample type, gender, reliability coefficient type, study language, race, marital status, and age. Some of these studies found that variables such as number of items and sample type were sources of variability in reliability (e.g., Caruso, 2000; Caruso et al., 2001; Hanson et al., 2002; Henson et al., 2001; Hess et al., 2014; Nilsson et al., 2002; Shields & Caruso, 2004; Yin & Fan, 2000). In contrast, some studies concluded that these variables did not affect the reliability coefficient (e.g., Graham et al., 2011; Hess et al., 2014; Thompson & Cook, 2002; Wallace & Wheeler, 2002). As seen in previous studies, the reliability of the measures obtained with different scales can be affected by different variables. In this study, building on these studies, a meta-analytic RG analysis was carried out for the Turkish sample. Similar to the literature, it was investigated how well the reliability coefficients generalize across different numbers of items, sample types, and fields of study, and whether the reliability coefficients were affected by these variables. Within the scope of the study, the aim was to analyze the reliability generalization of the long and short forms of the Oxford Happiness Scale (OHS) (Argyle, Martin, & Lu, 1995; Hills & Argyle, 2002).
The reason for choosing this scale in the study was that the studies in the field of positive psychology have increased in recent years, and happiness is one of the concepts that are frequently researched in the field of positive psychology (Compton & Hoffman, 2019). Also, considering that the feeling of happiness has an effect on many aspects of individuals' lives, it is extremely important to measure the structure of happiness reliably. When both Turkish and non-Turkish literatures were examined, it was seen that the OHS is frequently used to measure happiness (e.g. Demir, 2020;Francis & Crea, 2018;Lin, Imani, Griffiths & Pakpour, 2020;Okur & Totan, 2019;Yıldırım & Sezer, 2020). Considering that the reliability values of the studies using the OHS in the Turkish literature have been in a wide range (.29 -.97), these differences should be investigated, and the reliability should be generalized for the Turkish sample. It is thought that RG studies can contribute as important sources of information for test administrators and researchers by Vacha-Haase, Henson, and Caruso (2002). In line with all this, it is important to bring this study into Turkish literature and the field of education and psychology.

METHOD
This meta-analysis study was performed according to the PRISMA guidelines. Accordingly, two authors searched the databases independently, identified the studies by screening the titles and the abstracts, removed the duplicates, and assessed the full-text articles for inclusion in the meta-analysis. This section describes the data collection tools, sample, coding of study characteristics, and data analysis.

Data Collection Tool
The studies that were searched at the databases of Google Scholar, YOK (Higher Education Institution in Turkey) national thesis/dissertation center, EBSCOhost via Gazi University Central Library, and finally Aydin Adnan Menderes University Library's databases (e.g. BMJ, Dergipark, DOAJ, Clinical

Oxford Happiness Scale
In the psychology, educational and social sciences, different measurement tools have been developed in order to measure happiness according to the increase in the studies on the concept of happiness. One of the most commonly used measurement tools in measuring happiness is the Oxford Happiness Inventory (OHI). This scale was developed by Argyle, Martin, and Crossland (1989) and Argyle et al. (1995).
OHI has been developed similarly to the format of the Beck Depression Inventory. The inventory has consisted of 29 personal well-being items by reversing 20 items from Beck Depression inventory and adding nine items that reflect different aspects of happiness. The cross-cultural comparison of OHI has been made by applying to students in Australia, Canada, and America (Francis, Brown, Lester, & Philipchalk, 1998). At the same time, it has been adapted for many different cultures such as Israel and China (Francis & Katz, 2000;Lu & Shih, 1997). However, since this inventory was developed by applying it to clinical patients, it was observed that individual responses were directed towards one of the two main items when administered to non-patients. The means for a substantial portion of items could be below the corresponding standard deviations. This showed that the responses could be distributed uniformly, and the items might not be able to fully contribute to the measurement of happiness. To overcome these situations, Hills and Argyle (2002) revised the inventory and constituted the OHS.
OHS consists of 29 items rated on a 6-point Likert scale ranging from strongly agree to strongly disagree. Half of the scale items are reversed, which is thought to decrease the possibility of individuals responding in an acquiescent or biased way. In addition, in the same study, an 8-item short form of OHS was developed for situations when setting was limited (Hills & Argyle, 2002).
The adaptation study of OHS to Turkish was conducted by . They examined the psychometric properties of the scale by administering it to 491 university students. The validity of OHS was investigated by criterion-related validity methods and by exploratory and confirmatory factor analyses (EFA, CFA), while the reliability was investigated by internal consistency, split-half, and composite reliability methods. Accordingly, the Cronbach's Alpha coefficient and the composite reliability coefficient were both .91, and the reliability coefficient obtained by the split-half method was .86.
When the validity studies were examined, as a result of the EFA, a single-factor structure was obtained, as it was in its original form. It was concluded that the single-factor structure of the scale was preserved with CFA. The findings revealed that the Turkish form of OHS showed similar psychometric features to its original form.
The short form of OHS, which consists of eight items, was adapted to Turkish by Doğan and Çötok (2011). They applied the scale to 532 university students and evaluated the psychometric properties via EFA and CFA, internal consistency, and test-retest methods. In the item analysis, item 4 was excluded from the scale because its item-total correlation was less than .30. The subsequent reliability and validity analyses were conducted with the remaining seven items. The Cronbach's Alpha coefficient calculated from the data obtained from 321 students was .74. In the test-retest reliability study, OHS-S was applied to 81 students at a two-week interval, and the correlation between the two administrations was .85.
The EFA showed that the scale has a single-factor structure, as its original form does. It was concluded that the single-factor structure of the scale was confirmed by CFA. As a result, it was determined that OHS-7 was a valid and reliable measurement tool to measure the happiness of Turkish students.
In this review, studies administering the long or short form of the OHS, which was adapted to Turkish and analyzed in terms of validity and reliability, were searched. After these phases of the meta-analysis, the study group consisted of 92 studies, of which 27 were theses and 65 were articles, in accordance with the criteria determined by the researchers. A total of 95 Cronbach's Alpha coefficients were obtained from the 92 studies, which are presented in Appendix B. The selected studies were read and classified by two authors.
When Table 1, which shows the descriptive features of the included studies, was examined, it was seen that seven studies were published between 2011-2015, and 85 studies were published between 2016-2020. Twenty studies were in English, and 72 studies were in Turkish. The short form was used in 56 of the studies and the long form of the scale in 36 of them. In addition, the sample type was coded as student for 50 studies and non-student for 42 studies. There were four studies with sample sizes ≤ 100, 15 studies with sample sizes between 100 and 200, and 73 studies with sample sizes > 200. Finally, when the field of study was examined, it was observed that most studies (63) were in the field of social sciences and the fewest studies (11) were in the field of sport sciences. There were 18 articles or theses in psychology/health sciences.

Coding of Study Characteristics
After selecting the studies according to the inclusion criteria to the meta-analysis, the following sample and study characteristics were recorded by the researchers: (i)name of the article or thesis, (ii)name of the author(s) who conducted the study, (iii)year of the article or thesis, (iv)publication language of the study, (v)type of the study (article/thesis), (vi)type of the scale (the short form/the original form), (vii) reliability coefficient, (viii)type of reliability, (ix)sample size/the number of participants in the sample, (x)the number of items on the scale, (xi)fields of study and (xii)participant characteristics.
A total of 108 reliability coefficients were obtained from 97 studies. Of the 108, 104 were coefficient alpha; the remaining four were test-retest, split-half, and composite reliability estimates. However, the present study did not characterize the scores by reliability type because of the small number of reliability estimates other than coefficient alpha. Also, in some studies, items had been removed or not all items had been used, and studies reporting a number of items different from the 7 and 29 items of the original scale forms were excluded. Therefore, 92 studies remained when the studies that did not use all of the items were eliminated, and 95 alpha coefficients were obtained from these studies. Finally, these 95 coefficient alpha values were analyzed for reliability generalization.
The inter-coder reliability was also examined for the data coded by the two authors according to the determined variables and criteria. The inter-coder reliability was calculated by the percent of agreement and Krippendorff's Alpha coefficient. For this, the two coders coded the same 10 studies and 11 reliability coefficients. These statistics were computed with SPSS 23 and the SPSS macro developed by Hayes and Krippendorff (2007) for Krippendorff's Alpha coefficient. As a result of the analyses, the percent of agreement was .95, and the Krippendorff's Alpha coefficient was .94. These values indicate that the inter-coder reliability is high. Krippendorff (2004) suggested that Krippendorff's Alpha coefficient should be at least .80, and he stated that alpha ≥ .667 is acceptable. Accordingly, inter-coder reliability is considered appropriate. Also, conflicts between the coders were examined by the authors, and it was determined that they were caused by keyboard entry errors. The detected conflicts were resolved.
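The two agreement statistics can be sketched as follows. This is a simplified illustration for two coders, nominal codes, and no missing data; it is not the SPSS macro of Hayes and Krippendorff (2007), and the function names and the example codes are hypothetical.

```python
from collections import Counter


def percent_agreement(coder_a, coder_b):
    """Share of units on which the two coders assigned the same code."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("coders must rate the same non-empty set of units")
    return sum(a == b for a, b in zip(coder_a, coder_b)) / len(coder_a)


def krippendorff_alpha_nominal(coder_a, coder_b):
    """Krippendorff's Alpha = 1 - D_o / D_e for two coders and nominal codes."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("coders must rate the same non-empty set of units")
    n_values = 2 * len(coder_a)  # total number of coded values
    # Each disagreeing unit adds two off-diagonal entries to the coincidence matrix.
    observed = sum(2 for a, b in zip(coder_a, coder_b) if a != b) / n_values
    counts = Counter(coder_a) + Counter(coder_b)
    expected = sum(
        counts[c] * counts[k] for c in counts for k in counts if c != k
    ) / (n_values * (n_values - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one category is used")
    return 1 - observed / expected


# Hypothetical codes for four units (illustration only).
codes_a = ["student", "student", "non-student", "non-student"]
codes_b = ["student", "student", "non-student", "student"]
agreement = percent_agreement(codes_a, codes_b)
k_alpha = krippendorff_alpha_nominal(codes_a, codes_b)
```

Unlike the percent of agreement, Krippendorff's Alpha corrects for the agreement expected by chance, which is why it is lower than the raw agreement in this small example.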

Data Analysis
Reliability generalization studies provide reliability estimates that allow comparison between studies. In addition, they examine the potential causes of variability in score reliability across studies (Graham et al., 2006). In this RG study, the generalizability of Cronbach's Alpha coefficients was investigated. Reliability coefficients such as Cronbach's Alpha are variance-accounted-for statistics and are therefore in a squared-correlation metric (Thompson & Vacha-Haase, 2000). Since the distribution of correlations is not normal and has problematic standard errors, they must be transformed. Therefore, the raw alpha coefficients were transformed by Fisher's z-transformation. Although Fisher's z-transformation was suggested for reliability coefficients calculated as Pearson correlations (e.g., test-retest, parallel forms) (Sánchez-Meca, López-López & López-Pina, 2013), recent studies have shown that Fisher's z performed well and was very similar to other transformations in terms of empirical coverage probability (Romano, Kromrey, & Hibbard, 2010).
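The transformation step can be sketched as below. The functions are a minimal illustration rather than the software used in the study; in particular, the sampling variance 1/(n - 3) is the usual approximation for a Fisher-transformed Pearson correlation and is assumed here, since the text does not state which variance formula was applied.

```python
import math


def fisher_z(alpha):
    """Fisher's z = 0.5 * ln((1 + alpha) / (1 - alpha)), i.e. atanh(alpha)."""
    if not -1 < alpha < 1:
        raise ValueError("coefficient must lie strictly between -1 and 1")
    return math.atanh(alpha)


def inverse_fisher_z(z):
    """Back-transform a z value to the original coefficient metric."""
    return math.tanh(z)


def z_sampling_variance(n):
    """Assumed sampling variance of z for a sample of size n (requires n > 3)."""
    if n <= 3:
        raise ValueError("sample size must exceed 3")
    return 1.0 / (n - 3)


# Example with the alpha of .91 reported for the Turkish long form (n = 491).
z_long = fisher_z(0.91)
v_long = z_sampling_variance(491)
```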
The random effects model (REM), which assumes that the between-studies variance is greater than zero, was used because the studies included in the research were obtained from different samples, fields, and years. Also, REM is more realistic for real-world applications (Field, 2003). In RG studies, several heterogeneity estimators are used for REM. Some of these estimators are Hunter-Schmidt, Hedges, DerSimonian and Laird, and estimators based on maximum likelihood (Maximum Likelihood, ML; Restricted ML, REML). In this study, the between-study variance, τ², was estimated by the DerSimonian and Laird method.
The heterogeneity of Cronbach's Alphas was assessed by calculating the I² index as a function of the Q statistic. The Q statistic was applied to test the assumption of homogeneity among the alpha coefficients. The I² index is a measure of the amount of heterogeneity (Higgins & Thompson, 2002). I² values of approximately 25%, 50%, and 75% can be taken to reflect low, moderate, and large heterogeneity, respectively (Huedo-Medina, Sánchez-Meca, Marín-Martínez, & Botella, 2006).
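The relationship between Q and I² can be sketched with the usual formula I² = 100 × (Q − df) / Q, truncated at zero. The function name is hypothetical; the check uses the total Q value and degrees of freedom reported in the Results section of this study.

```python
def i_squared(q, df):
    """I-squared (in percent) as a function of the Q statistic and its df."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Reported total heterogeneity test of this study: Q(94) = 2639.66.
i2_total = i_squared(2639.66, 94)
```

With these inputs the function reproduces, to two decimals, the I² value of 96.44% reported for the total scale.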
To interpret the results, the mean effect sizes, their lower and upper confidence intervals obtained with Fisher z-transformation were back-transformed to the original metric of alpha coefficient. The predicted alpha coefficients were evaluated according to the .70 criterion level determined by Nunnally and Bernstein (1994). Values of .70 and above indicate that there is sufficient reliability for the internal consistency of the scale. The effect of the moderator variables on the variability of the reliability estimates was performed through Analog ANOVA. These moderator variables are type of scale (OHS, OHS-S), type of sample (student, non-student), and field of study (social sciences, psychology/health sciences, sport sciences). In addition, the variables of sample type and study field were analyzed as moderators separately for both OHS and OHS-S.
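The pooling and back-transformation steps can be sketched as follows. This is a minimal illustration that assumes random-effects weights 1 / (v + τ²), the approximation v = 1 / (n − 3) for the variance of a Fisher-transformed coefficient, and a normal-theory 95% interval; these choices, the function name, and the third study in the example are assumptions, not details taken from the study.

```python
import math


def pooled_alpha(alphas, sample_sizes, tau2, z_crit=1.96):
    """Random-effects mean alpha with a confidence interval.

    Pools in the Fisher z metric and back-transforms the mean and the
    interval limits to the alpha metric with tanh.
    """
    z_values = [math.atanh(a) for a in alphas]
    within_vars = [1.0 / (n - 3) for n in sample_sizes]  # assumed variance of z
    weights = [1.0 / (v + tau2) for v in within_vars]
    sum_w = sum(weights)
    mean_z = sum(w * z for w, z in zip(weights, z_values)) / sum_w
    se = math.sqrt(1.0 / sum_w)
    lower, upper = mean_z - z_crit * se, mean_z + z_crit * se
    return math.tanh(mean_z), math.tanh(lower), math.tanh(upper)


# Two coefficients from the adaptation studies (.74, n = 321; .91, n = 491)
# plus one hypothetical study; tau2 = 0.08 is the value reported in the Results.
mean_alpha, ci_low, ci_high = pooled_alpha([0.74, 0.91, 0.85], [321, 491, 200], tau2=0.08)
```

Because the interval is built in the z metric and then back-transformed, it is not symmetric around the mean alpha, and its limits can be compared directly with the .70 criterion.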

RESULTS
In this study, a meta-analysis of 95 Cronbach's Alpha coefficients was performed with moderator variables determined by examining the literature. The distribution of alpha values in the primary studies is shown separately for each scale type in Figure 2. Without weighting, the mean alpha coefficients are .85 (SD = 0.08) and .74 (SD = 0.10) for OHS and OHS-S, respectively. The skewness and kurtosis coefficients are -2.50 (SE = 0.39) and 6.73 (SE = 0.76) for OHS, and -1.95 (SE = 0.31) and 9.13 (SE = 0.62) for OHS-S. Table 2 presents descriptive results for the estimates of alpha coefficients overall and by moderator variable, back-transformed to the original metric of the alpha coefficient. Table 2 also shows the 95% confidence interval for the estimated mean Cronbach's Alpha and the highest and lowest alpha values of the studies included in the RG meta-analysis.

Figure 2. Distributions of Alpha Coefficients for OHS and OHS-S
As shown on the bottom line of Table 2, the reliability, or mean effect size, of total OHS scores yielded a mean coefficient of .81, with a lower limit of .78 and an upper limit of .82 for the 95% confidence interval. In addition, the reliability of total scores ranged from .29 to .97. Although there was a wide distribution of reliability estimates, the mean reliability estimate and the limits of the confidence interval are acceptable score reliabilities. The total reliability estimates tend to be large and heterogeneous.
Table 2 also presents the mean alpha coefficients obtained for the moderator variables. When the mean alpha coefficient was analyzed according to the type of scale, the mean alpha for the OHS-S was .76, with a lower limit of .73 and an upper limit of .78 for the 95% confidence interval. The mean effect size for the OHS was .87, with a lower limit of .85 and an upper limit of .88 for the 95% confidence interval. The reliability scores ranged between .50-.94 for the OHS and .29-.97 for the OHS-S. The ranges showed that the OHS-S in particular had lower coefficients than the OHS. The minimum reliability coefficients were below the .70 criterion for both types of scale (Nunnally & Bernstein, 1994). Nevertheless, the mean effect sizes and their 95% confidence interval limits were at an acceptable level for both types of scale. When the mean effect sizes were compared, the mean alpha coefficient for the OHS-S was smaller than that for the OHS.

Journal of Measurement and Evaluation in Education and
When Table 2 was examined according to characteristic of sample, the mean effect size estimates were higher in non-student sample for OHS-S. Despite that, the mean alpha value was higher in student sample for OHS. For OHS, the mean effect sizes were .89 and .85, respectively, in student sample and non-student sample. On the other hand, the mean effect sizes were found .75 and .77 respectively in student and non-student sample for OHS-S. For both types of sample there were wide distributions of reported alpha coefficients except OHS in student sample. The lowest reported alpha coefficient (α = .29) was in student sample. And so, the minimum mean effect size was calculated as .75 (95%CI, .71-.78) in this sample.
With regard to field of study, the mean alpha estimates were reported for three categories. The mean effect sizes obtained with the alpha coefficients of OHS were as follows: Social Sciences (α = .87, 95%CI, .85-.88), Psychology/Health Sciences (α = .89, 95%CI, .85-.92), and Sport Sciences (α = .86, 95%CI, .81-.89). The mean effect sizes of OHS-S were .77 (95%CI, .73-.80) for Social Sciences, .74 (95%CI, .67-.79) for Psychology/Health Sciences, and .72 (95%CI, .61-.81) for Sport Sciences. According to these results, for OHS-S the mean alpha estimate for Social Sciences was greater than for the other fields, although the estimates were close across all categories of field of study. For OHS, the reliability estimates were also close, and the highest mean alpha value was in the field of psychology/health sciences.
In this study, the heterogeneity of Cronbach's Alpha values was investigated. To this end, the I² index for the amount of heterogeneity and the Q test of homogeneity were calculated for the total scale. According to the results, the Q test was statistically significant, with high heterogeneity. The estimate of Q for the scale was Qtotal(94) = 2639.66, p < .001. The between-study variance, τ², was estimated as 0.08 by the DerSimonian and Laird method. The I² index indicated that 96.44% of the variability in the reliability coefficients was due to variability among the true effect sizes, which is large. In the next step, the effect of the sub-group moderator variables was examined.
The Analog ANOVA was performed to examine the effect of the moderator variables on the variability of the reliability estimates. First, whether the mean alpha coefficients differ according to the type of the scale was analyzed. The result is presented in Table 3. As shown in Table 3, Qtotal of the coefficient alpha values was 2639.66 (p = .00), which was statistically significant. Therefore, the true variance estimate of the reliability coefficients was statistically significant across all of the studies. In addition, the variance within groups was statistically significant at the level of p < .05, since Qwithin was 1503.98 (p = .00). When the difference between the groups was examined, the Qbetween(REM) value was 50.75 (p = .00), which was significant. Accordingly, the alpha coefficient was related to the scale type. When the variance in the alpha coefficient explained by the scale type was examined, a .43 proportion (1135.69/2639.66), or 43.02%, of the true variance was explained by the scale type. Based on this value, the proportion of variance in the alpha coefficient explained by the scale type alone is high.
When heterogeneity was examined separately for the two scale types, it was high in both, because the Q statistics (QOHS and QOHS-S) were statistically significant at the level of p < .05, and I² was 94.76% and 91.34% for OHS-S and OHS, respectively. This may be due to the lack of classification according to other variables ignored in this Q test, such as administration conditions, sample size, administration year, and research type. Because of the significant variance between the scale forms, it is more meaningful to examine the effect of the moderator variables separately for OHS-S and OHS.
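The partition of heterogeneity used in the Analog ANOVA can be sketched under the fixed effect model, where Qtotal = Qwithin + Qbetween. The functions and the toy groups below are hypothetical illustrations and are not the software used in the study; the last lines only recompute the proportion from the Q values reported in Table 3.

```python
def q_statistic(effects, variances):
    """Weighted sum of squared deviations from the inverse-variance mean."""
    weights = [1.0 / v for v in variances]
    mean = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return sum(w * (y - mean) ** 2 for w, y in zip(weights, effects))


def analog_anova(groups):
    """groups maps a label to (effects, variances).

    Returns (q_total, q_within, q_between, proportion_explained).
    """
    all_effects = [y for effects, _ in groups.values() for y in effects]
    all_variances = [v for _, variances in groups.values() for v in variances]
    q_total = q_statistic(all_effects, all_variances)
    q_within = sum(q_statistic(e, v) for e, v in groups.values())
    q_between = q_total - q_within
    proportion = q_between / q_total if q_total > 0 else 0.0
    return q_total, q_within, q_between, proportion


# Toy groups in which all heterogeneity lies between the groups.
toy = {"long": ([1.0, 1.0], [0.1, 0.1]), "short": ([2.0, 2.0], [0.1, 0.1])}
toy_total, toy_within, toy_between, toy_prop = analog_anova(toy)

# Recomputing the proportion from the reported values for scale type.
reported_between = 2639.66 - 1503.98
reported_proportion = reported_between / 2639.66
```

The recomputed difference agrees with the reported numerator of 1135.69 up to rounding of the Q values, and the proportion matches the reported .43.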
As seen in Table 3, Qtotal for the OHS-S form data is also significant, which means that the weighted sum of squares is much larger than would be expected (df = k - 1) from random, within-study variation alone. Forest plots for the long form and the short form are presented separately in Figure 3 and Figure 4. When Figure 3 and Figure 4 were examined, the reliability coefficients of all individual studies were statistically significant. In addition, the reliability coefficients were generally in the range of .80-1.00 for OHS and .60-1.00 for OHS-S.
The significance of the difference in alpha coefficients according to the type of sample was analyzed by Analog ANOVA for OHS-S. As shown in Table 4, Qtotal of the coefficient alpha values was 1088.35 (p = .00), which was statistically significant. Therefore, the variance was statistically significant across all of the studies that used OHS-S. In addition, the variance within groups was statistically significant at the level of p < .05, since Qwithin was 982.82 (p = .00). When the difference between the groups was examined, the Qbetween(FEM) value was 105.53 (p < .05). Accordingly, whether or not the sample consisted of students had a statistically significant effect on the variability of the alpha coefficients when FEM was used. In this case (105.53/1088.35), a .10 proportion, or 10%, of the true variance was explained by sample groups. However, this group difference did not hold under the REM analysis. As can be seen in Table 4, when the REM approach was used, this variance was no longer significant. Accordingly, whether or not the sample consisted of students did not have a statistically significant effect on the variability of the alpha coefficients when REM was used. Thus, under REM, the alpha coefficient was not related to the sample type for OHS-S.
When heterogeneity was examined for the different sample types, it was high in both sample types for studies that used OHS-S, because the Q statistics (Qstudent and Qnonstudent) were statistically significant at the level of p < .05, and I² was 93.88% and 94.93% for the student and non-student samples, respectively. This may be due to the lack of classification according to other variables ignored in this Q test, such as study field, administration conditions, sample size, administration year, and research type. Further studies are needed to explain the remaining heterogeneity in the OHS-S form reliability estimates.
Table 5 presents the significance of the difference in alpha coefficients according to the type of sample for OHS. As shown in Table 5, Qtotal of the coefficient alpha values was 415.63 (p = .00), which was statistically significant. In addition, the studies separated by sample type were also heterogeneous within groups: Qwithin was 369.31 (p = .00, p < .05). When the difference between the groups was examined, the Qbetween(FEM) value was 46.32 (p < .05). Accordingly, whether or not the sample consisted of students had a statistically significant effect on the variability of the alpha coefficients when FEM was used. In this case (46.32/415.63), a .11 proportion, or 11.14%, of the true variance was explained by sample groups for OHS. Moreover, this group difference persisted under the REM analysis: as can be seen in Table 5, when the REM approach was used, the Qbetween(REM) value was 7.90 (p < .05), which was significant. Accordingly, whether or not the sample consisted of students still had a statistically significant effect on the variability of the alpha coefficients when REM was used.
Therefore, although the heterogeneity within groups was significant, the mean alpha values of the studies separated according to sample type differed significantly from each other, and the alpha coefficient was related to the sample type for OHS. In addition, based on the proportion of true variance, the share of the true variance in the alpha coefficient explained by sample type alone is low. When heterogeneity was examined for the different sample types, it was high in both, because the Q statistics (Qstudent and Qnonstudent) were statistically significant at the level of p < .05, and the I² values were 83.03% and 92.48% for the student and non-student samples, respectively. This may be due to the lack of classification according to other variables ignored in this Q test, such as field of study, administration conditions, sample size, administration year, and research type. Further studies are needed to explain the remaining heterogeneity in the OHS long-form reliability estimates.
The Analog ANOVA was performed to examine whether the alpha coefficients showed a statistically significant difference according to field of study for OHS-S. As seen in Table 6, Qtotal of the coefficient alpha values was 1088.35 (p = .00), which was statistically significant. Therefore, the variance was statistically significant across all of the studies that used OHS-S. In addition, the variance within groups was statistically significant at the level of p < .05, since Qwithin was 1079.06 (p = .00). When the difference between the groups was examined, the Qbetween(FEM) value was 9.29 (p < .05). Accordingly, the field of study had a statistically significant effect on the variability of the alpha coefficients when FEM was used.
In this case (9.29/1088.35), a .01 proportion, or 1.00%, of the true variance was explained by field of study for OHS-S. However, this group difference did not hold under the REM analysis. As can be seen in Table 6, when the REM approach was used, this variance was no longer significant. Accordingly, the field of study did not have a statistically significant effect on the variability of the alpha coefficients when REM was used. Thus, under REM, the alpha coefficient was not related to the field of study for OHS-S. When heterogeneity was examined for the different fields of study, it was high in social sciences and psychology/health sciences for studies that used OHS-S, because the Q statistics (QSocial and QPsychology/Health) were statistically significant at the level of p < .05, and I² was 95.99% and 90.74% for social sciences and psychology/health sciences, respectively. As mentioned before, this may be due to the lack of classification according to other variables ignored in this Q test, such as sample type, administration conditions, sample size, administration year, and research type. Further studies are needed to explain the remaining heterogeneity in the OHS-S form reliability estimates. In addition, although heterogeneity was high in the fields of social sciences and psychology/health sciences, there was low heterogeneity in the field of sport sciences.
The Analog ANOVA was performed to examine whether the alpha coefficients showed a statistically significant difference according to field of study for OHS. As seen in Table 7, Qtotal of the coefficient alpha values was 415.63 (p = .00), which was statistically significant. Therefore, the variance was statistically significant across all of the studies that used OHS.
In addition, it can be said that the variance within groups was statistically significant at the level of p < .05 since Qwithin was 412.64 (p = .00).

Journal of Measurement and Evaluation in Education and
When the difference between the groups was examined, the Qbetween(FEM) value was 2.99 (p > .05). The Qbetween(REM) value was 1.83 (p = .40), which was also not statistically significant. Accordingly, whether the field of study was social sciences, psychology/health sciences, or sport sciences did not significantly affect the variability in alpha coefficients under either model. So, it can be said that the alpha coefficient was not related to field of study for OHS. Indeed, only a .01 proportion (2.99/415.63), or about 1%, of the true variance was explained by field of study for OHS. When heterogeneity was examined for the different fields of study, it was high in social sciences, psychology/health sciences, and sport sciences for studies that used OHS, since the Q statistics (QSocial, QPsychology/Health, and QSport) were statistically significant at the level of p < .05, and the I² values were 89.54%, 96.36%, and 93.42% for social sciences, psychology/health sciences, and sport sciences, respectively. This may be due to other variables that were not taken into account in this Q test. Some of these variables can be sample type, administration conditions, sample size, administration year, research type, etc. Further studies are needed to explain the remaining heterogeneity in OHS long-form reliability estimates. Also, the absence of a significant difference between these fields is consistent with the high level of heterogeneity within the groups. As mentioned before, scale type was a large source of variability in the alpha coefficient. On the other hand, the moderator variables sample type and field of study did not appear to be important sources of variability in the alpha coefficients. The reason for the high level of heterogeneity within the same fields of study, sample types, and scale types is that the studies come from different universes.
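The arithmetic behind these heterogeneity statistics can be sketched as follows. This is a minimal illustration, not the authors' CMA procedure: it assumes the usual fixed-effect partition Qtotal = Qwithin + Qbetween and the usual definition of I² as (Q - df)/Q. The function names are hypothetical, and because the text does not report degrees of freedom for the subgroups, the I² example uses made-up values. The REM value of Qbetween cannot be obtained from this partition.

```python
def q_between(q_total: float, q_within: float) -> float:
    """Between-groups Q under the fixed-effect partition Q_total = Q_within + Q_between."""
    return q_total - q_within


def proportion_explained(q_total: float, q_within: float) -> float:
    """Share of total heterogeneity attributed to the moderator (Q_between / Q_total)."""
    return q_between(q_total, q_within) / q_total


def i_squared(q: float, df: int) -> float:
    """I^2 in percent: share of observed variation beyond sampling error, floored at 0."""
    if q <= 0:
        return 0.0
    return max(0.0, 100.0 * (q - df) / q)


# Q values reported in the text for field of study (Tables 6 and 7).
ohs_s_share = proportion_explained(1088.35, 1079.06)  # OHS-S, short form
ohs_share = proportion_explained(415.63, 412.64)      # OHS, long form
print(round(ohs_s_share, 4), round(ohs_share, 4))

# Hypothetical Q and df, only to show how I^2 is formed.
print(i_squared(100.0, 10))
```

Both shares are below 1%, which matches the rounded .01 proportions reported above.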
In this study, the meta-analysis was performed with only published articles and theses. Since published studies generally have a high or significant effect size, including only these studies in the meta-analysis may cause publication bias. Therefore, publication bias was examined with the rank correlation and regression tests for funnel plot asymmetry and with the classic fail-safe N method. In the fail-safe N method, assuming that the mean effect of the studies to be added is zero, the number of studies that would have to be added for the p-value to become non-significant is calculated; this number is called the fail-safe N. If only a few studies are needed, there may be a concern that the effect is actually zero (Borenstein, Hedges, Higgins, & Rothstein, 2013). The fail-safe N was calculated as 4975 (p < .01). According to this result, the number of studies that would have to be added for the summary effect to become non-significant was quite high. The other approach, the funnel plot, is shown in Figure 4. In the funnel plot, studies are expected to be distributed symmetrically around the summary effect size.
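The logic of the classic fail-safe N can be sketched as follows. This is a minimal illustration under stated assumptions, not the CMA implementation: it uses Rosenthal's formulation with a Stouffer combined Z, the one-tailed critical value 1.645 is an assumed default, and the z-values in the example are hypothetical because the per-study z-values are not reported in the text.

```python
import math


def fail_safe_n(z_values, z_crit=1.645):
    """Number of zero-effect studies needed to make the combined Z non-significant.

    With k studies, the Stouffer combined Z is sum(z) / sqrt(k). Adding N studies
    with z = 0 gives sum(z) / sqrt(k + N); setting this equal to z_crit and solving
    for N yields N = (sum(z) / z_crit)^2 - k.
    """
    k = len(z_values)
    total = sum(z_values)
    n = (total / z_crit) ** 2 - k
    # A combined result that is already non-significant needs no added studies.
    return max(0, math.ceil(n))


# Hypothetical z-values for ten studies; not taken from the paper.
example_z = [2.0] * 10
print(fail_safe_n(example_z, z_crit=2.0))
```

A large result relative to the number of observed studies is what the text interprets as robustness of the summary effect.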

Although the studies appeared to be distributed approximately symmetrically to the right and left of the summary effect size, this interpretation is subjective (Borenstein et al., 2013). The rank correlation and regression tests were therefore performed for a more objective interpretation. According to Egger's regression test, the regression intercept was not significant (intercept = -2.94, p = .09). Thus, the null hypothesis that the regression intercept does not differ from zero was not rejected. Begg and Mazumdar's rank correlation test also supported the absence of asymmetry in the funnel plot: Kendall's tau was not significant (Kendall's tau = -0.057, p = .41). It can be interpreted that there was no asymmetry in the funnel plot. In addition, according to Duval and Tweedie's trim and fill test, there was no difference between the observed effect size and the adjusted effect size estimated to correct for publication bias. Because the studies were distributed symmetrically on both sides of the overall effect size, this difference was zero. So the statistical tests for funnel plot asymmetry did not show any evidence of publication bias. Therefore, it can be said that the results were unlikely to be due to publication bias.
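The idea behind Egger's regression test can be sketched as follows. This is a minimal illustration with hypothetical data, not the analysis run in CMA: each study's standardized effect (effect divided by its standard error) is regressed on its precision (one divided by the standard error) with ordinary least squares, and an intercept far from zero suggests funnel plot asymmetry. The significance test of the intercept is omitted here, and the function name is hypothetical.

```python
def egger_intercept(effects, std_errors):
    """Return (intercept, slope) of the OLS regression of effect/SE on 1/SE."""
    x = [1.0 / se for se in std_errors]                   # precision
    y = [e / se for e, se in zip(effects, std_errors)]    # standardized effect
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical standard errors and effects; not taken from the paper.
ses = [0.05, 0.08, 0.10, 0.15, 0.20]
symmetric = [1.1] * len(ses)               # same effect regardless of precision
biased = [1.1 + 2.0 * se for se in ses]    # less precise studies report larger effects

print(egger_intercept(symmetric, ses))
print(egger_intercept(biased, ses))
```

In the symmetric case the intercept is essentially zero and the slope recovers the common effect; in the biased case the intercept picks up the dependence of the effect on the standard error.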

DISCUSSION and CONCLUSION
In this study, a meta-analytical reliability generalization analysis was conducted on OHS and OHS-S. In addition, it was investigated whether Cronbach's Alpha was affected by sample type, scale type, and field of study. The results of this study showed that the mean Cronbach's Alpha coefficients obtained from both the OHS and OHS-S were at an acceptable level. The fact that these coefficients are high is an indication of the usability of the scale by both practitioners and researchers. When it was examined whether the reliability coefficient was affected by scale type, a significant difference was observed between the two scale forms according to both REM and FEM. This difference was in favor of OHS. Accordingly, it can be said that the measures obtained with the OHS are, in general, more reliable. It is generally thought that measurements become more sensitive and more reliable as the number of items increases, and the results of this study support this idea. Similar results were found in some other studies. For example, Henson et al. (2001) and Nilsson et al. (2002) observed that reliability was higher for the long form. Henson et al. (2001) stated that as the length of the test increases, the reliability estimates increase in all subscales except one. Also, Nilsson et al. (2002) found that the reliability coefficients of the CDMSE long form were higher than those of the short form. In contrast, Hanson et al. (2002), Hess et al. (2014), and Vacha-Haase (1998) observed that reliability was higher for the short form. Hanson et al. (2002) observed that the mean reliability coefficient obtained from the short form was slightly higher for both the client and therapist versions of the Working Alliance Inventory. Hess et al. (2014) and Vacha-Haase (1998) similarly concluded, as a result of RG analysis, that the measures obtained from the short form were more reliable.
As can be seen, while the effect of test length on reliability varies across studies, it was observed that the reliability coefficient increases significantly as test length increases for OHS in the Turkish sample. It is recommended that a test be as short as possible for usability and long enough for acceptable reliability (McDonald, 1999). In this regard, the OHS is considered appropriate in terms of usability, as it does not take much time to respond to. Also, as a result of this study, it is clear that the mean alpha coefficient of OHS for the Turkish sample is higher than that of OHS-S. In line with all of these, administering OHS instead of OHS-S may be more suitable in terms of reliability for the Turkish sample. However, this may vary with the variance explained by the scale, the properties of the administration group, administration conditions, etc.
When the effects of sample type on reliability were examined, it was seen that reliability was significantly different for students and non-students for the OHS-S according to FEM. But the true variance explained by sample groups was low for OHS-S. Although there was a significant effect in FEM, this group difference could be overcome by REM analysis. Therefore, researchers and practitioners may be advised to use REM analysis for such group differences. As a result, reliability was not significantly different for students and non-students for the OHS-S when REM was used. For the OHS, in contrast, reliability was significantly different for students and non-students according to both REM and FEM. Similar to the REM result for OHS-S, Hess et al. (2014), Thompson and Cook (2002), and Wallace and Weller (2002) found that reliability was not affected by sample type. While Hess et al. (2014) classified sample types as student and professional, Thompson and Cook (2002) distinguished undergraduate, graduate, and faculty samples. Both studies stated that there was no variability between the reliability coefficients of the groups. Also, Graham et al. (2011) concluded that the relationship between the proportion of college students in the sample and the reliability coefficient of the scores obtained with the Locke-Wallace Marital Adjustment Test (LWMAT), Kansas Marital Satisfaction Scale (KMS), Quality of Marriage Index (QMI), and Marital Opinion Questionnaire (MOQ) was not significant. Contrary to these studies, and similar to the results for the OHS long form, Caruso et al. (2001), Vacha-Haase (1998), and Yin and Fan (2000) concluded that sample type (student/non-student) affected reliability coefficients.
As can be seen, while the effect of sample type on reliability varies across studies, the reliability coefficient did not differ between student and non-student samples for OHS-S according to the REM analysis in this study. The difference between the alpha coefficients was almost negligible, at about two per thousand for OHS-S. Also, the mean alpha coefficients were high in both groups. These results indicate that the OHS-S can be used with both students and non-students in the Turkish sample. For OHS, in contrast, the reliability coefficient differed between student and non-student samples according to both FEM and REM. This difference between the alpha coefficients was about four per thousand. But the proportion of the true variance in the alpha coefficient explained by sample type alone was quite low. Therefore, this difference was not of practical importance. In addition, the mean alpha for the student sample was higher than that for the non-student sample, and the mean alpha coefficients were at an acceptable level for both OHS and OHS-S. The fact that the OHS was developed by administering it to students may be a factor in this. So, OHS is more suitable for students, but it can be used for both sample types. To summarize, researchers and practitioners can be advised that both OHS and OHS-S can be used for both student and non-student groups in the Turkish sample.
Finally, within the scope of the research, the mean alpha in different fields of study was examined, together with whether the mean difference in reliability estimates between these fields was significant for both OHS and OHS-S. In OHS-S, the highest mean alpha was found in the field of social sciences, while the lowest was found in sport sciences. For OHS, the highest mean alpha was found in the field of psychology/health sciences, while the lowest was again found in sport sciences. Overall, no major changes were observed in the reliability coefficients across fields for any OHS form according to REM. But in the FEM analysis, the mean difference in reliability estimates was significant for OHS-S. When it was examined how much true variance was explained by field of study, the explained variance was low and not important for OHS-S. Although there was a significant effect in FEM, this group difference could be overcome by REM analysis. Therefore, researchers and practitioners may be advised to use REM analysis for such group differences. As a result, reliability was not significantly different across fields of study for the OHS-S when REM was used. Based on this, it can be said that field of study generally does not affect reliability estimation for any OHS form. In the literature, Vicent et al. (2019) analyzed the effect of study focus on reliability estimation by classifying study focus as applied and psychometric, and concluded that the effect of study focus on reliability estimation for the CAPS sub-dimensions was significant. However, they found that the share of variability in reliability explained by study focus was low, and in the meta-regression analysis it was the variable that explained the least variance.
The reason why this study differs from Vicent et al.'s (2019) research may be that study focus and field of study are distinct concepts and are classified differently. The different measurement tools may also have contributed to the difference in results. Another study in the literature classified study type as medical and nonmedical, and research design as psychometric/others and experimental/others (Barnes et al., 2002). But the authors stated that they did not examine the relationship between reliability and study type; they examined how much reliability was reported in journals in different contexts. Also, they found very low correlations between the internal consistency coefficient and the contexts of psychometric/others or experimental/others. These low correlations are similar to the findings of this study, but although research design and field of study are similar, they are not the same. Consequently, reliability coefficients did not differ significantly across fields of study according to REM and were acceptable for all of them. Therefore, it is thought that the OHS forms can be used in different fields.
The reliability estimates of OHS and OHS-S were at an acceptable level in the present study. However, as mentioned above, since each measurement depends on the particular conditions of the sample or setting, the results of this study are specific to these conditions. Therefore, in addition to consulting RG studies, researchers need to calculate reliability from their own data (Capraro & Capraro, 2002). To summarize, administering OHS instead of OHS-S may be more suitable in terms of reliability for the Turkish sample. Also, the scale forms can be used for both student and non-student samples and for each field of study. In general, it is suggested that REM analysis be performed for these variables, since some group differences can be overcome when REM is used. Finally, the decision on which form of OHS to use may vary with the variance explained by the scale, the properties of the administration group, administration conditions, etc.
The limitations of this study are that Cronbach's Alpha coefficients were transformed into Fisher Z scores, that only scale type, sample type, and field of study were examined as sources of variability in reliability, and that the analysis was performed in the CMA program. Future investigations can examine and compare reliability estimates obtained with other reliability estimators, such as those of Hakstian-Whalen (1976) and Bonett (2002).
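The three transformations named above can be sketched as follows. This is a minimal illustration based on how these transformations are commonly given in the reliability generalization literature: Fisher's Z as the inverse hyperbolic tangent of alpha, the Hakstian-Whalen transformation as the cube root of (1 - alpha), and Bonett's transformation as the natural logarithm of (1 - alpha). The exact variants and sampling variances used in CMA are not stated in the text, so the function names and forms below are assumptions. The example values are the mean alphas reported in the abstract.

```python
import math


def fisher_z(alpha: float) -> float:
    """Fisher Z transformation: 0.5 * ln((1 + alpha) / (1 - alpha))."""
    return math.atanh(alpha)


def fisher_z_inverse(z: float) -> float:
    return math.tanh(z)


def hakstian_whalen(alpha: float) -> float:
    """Hakstian-Whalen transformation: cube root of (1 - alpha)."""
    return (1.0 - alpha) ** (1.0 / 3.0)


def hakstian_whalen_inverse(t: float) -> float:
    return 1.0 - t ** 3


def bonett(alpha: float) -> float:
    """Bonett transformation: ln(1 - alpha), defined for alpha < 1."""
    return math.log(1.0 - alpha)


def bonett_inverse(value: float) -> float:
    return 1.0 - math.exp(value)


# Mean alphas reported in the abstract: short form, overall, long form.
for a in (0.76, 0.81, 0.87):
    print(a, fisher_z(a), hakstian_whalen(a), bonett(a))
```

In each approach, analyses are run on the transformed scale and the pooled value is back-transformed to the alpha metric, so comparing the three amounts to comparing the back-transformed means and their intervals.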