Investigating the Effect of Rater Training on Differential Rater Function in Assessing Academic Writing Skills of Higher Education Students *

This study aimed to examine the effect of rater training on differential rater function (rater error) in the process of assessing the academic writing skills of higher education students. The study was conducted with a pre-test/post-test control group quasi-experimental design. The study group consisted of 45 raters, 22 in the experimental group and 23 in the control group. The raters were pre-service teachers who had not participated in any rater training before, and it was verified that they had similar experience in assessment. The data were collected using an analytical rubric developed by the researchers and an opinion-based writing task prepared by the International English Language Testing System (IELTS). Within the scope of the research, the compositions of 39 students written in a foreign language (English) were assessed. The Many Facet Rasch Model was used for the analysis of the data, and this analysis was conducted under a fully crossed design. The findings of the study revealed that the rater training was effective on differential rater function, and suggestions based on these results were presented.


INTRODUCTION
Academic writing is defined as a type of text in which thoughts are logically structured and justified (Bayat, 2014). According to another definition, academic writing is explaining the individual's views, ideas, feelings, observations, experiments, and experiences based on his/her world of thought, in accordance with the rules of the language, by planning them in line with the individual's interest in the chosen subject (Göçer, 2010). It can be seen from these definitions that academic writing requires many skills and involves a complex process. Academic writing draws on multiple language skills that require the simultaneous use of mental, motor, and affective skills (Çekici, 2018). Essays, theses, and research reports written by students in higher education are included among academic writing types (Gillet, Hammond & Martala, 2009). Academic writing aims to convey complex thoughts, abstract concepts, and high-level mental processes (Zwiers, 2008). In this context, when academic writing is considered as the realization of higher-level mental skills, it is important to assess academic writing validly and reliably (Carter, Bishop & Kravits, 2002).
The tools that are used to assess students' academic writing skills must be authentic, which makes it difficult to choose writing tasks. Selected writing tasks need to have a place in students' lives, and if this is neglected, there is a risk of under-representation or a poor definition of the construct in the assessment of academic writing skills (Cumming, 2013, 2014). One of the research areas frequently studied in the assessment of academic writing skills is the development and assessment of students' academic writing skills in English as a second language (Aryadoust, 2016). It can be stated that one of the important concerns about performance-based assessment is the issue of objectivity in the process of assessing individual performance and determining the situation, because it is very difficult to assess objectively with performance-based assessment methods compared to traditional ones (Romagnano, 2001). Many methods have been proposed in the literature to ensure objectivity in performance-based assessment. These methods can be listed as automated scoring (Attali, Bridgeman & Trapani, 2010; Burstein et al., 1998), using more than one rater (Gronlund, 1977, p. 85; Kubiszyn & Borich, 2013, p. 170), using rubrics (Dunbar, Brooks & Miller, 2006; Ebel & Frisbie, 1991, p. 194; Kutlu, Doğan & Karakaya, 2014, p. 51; Oosterhof, 2003, p. 81), and rater training (Haladyna, 1997, p. 143; İlhan & Çetin, 2014; Lumley & McNamara, 1995). Each of these methods has advantages and disadvantages, and strengths and weaknesses, compared to the others. Haladyna (1997) emphasized that it is difficult to ensure consistency among raters regardless of the method used. In other words, regardless of the method used, there is always the possibility that external variables other than individual performance affect (interfere with) the assessments in performance assessment. These inconsistencies that occur in the process of assessing individual performance are defined as "rater effect/bias" (Farrokhi, Esfandiari & Vaez Dalili, 2011; Haladyna, 1997, p. 139; İlhan, 2015, p. 3).
If one or more rater errors occur during the assessment of individual performance, the error in the estimation of students' ability levels will be high; in other words, the estimations obtained will not be reliable. Rater errors that occur during the assessment of individual performance also have negative effects on validity. Rater errors pose a direct threat to validity since they contribute construct-irrelevant variance (Kassim, 2011; Brennan, Gao & Colton, 1995; Congdon & McQueen, 2000; Farrokhi et al., 2011). Therefore, it is important to minimize or control the interference of rater errors in assessments (Kim, 2009; Linacre, 1994). Rater training, which is an effective method for reducing rater errors, was used in this study (Feldman, Lazzara, Vanderbilt & DiazGranados, 2012; Haladyna, 1997; Hauenstein & McCusker, 2017; Stamoulis & Hauenstein, 1993; Weigle, 1998; Zedeck & Cascio, 1982). Rater training is widely used to reduce rater errors involved in assessments (Brijmohan, 2016). Many methods/designs for rater training have been suggested in the literature. In this study, rater error training (RET) and frame of reference training (FRT) were combined in the training of raters.
The main purpose of rater training is to enable raters to develop a common understanding of student performance and assessment criteria (Eckes, 2008; Shale, 1996). In other words, rater training supports a valid and reliable assessment of individual performance (Moser, Kemter, Wachsmann, Köver & Soucek, 2016). Since the scores students get from an open-ended exam reflect both the performance of the student and the rater's interpretation of that performance, rater effects create a persistent validity concern for test results (Ellis, Johnson & Papajohn, 2002; McNamara, 1996). When decisions taken based on test results are vital, rater errors should be identified, and these behaviours should be reduced to an acceptable level (Ellis et al., 2002).
Generalizability theory and item response theory are often used to statistically identify the rater errors involved in the measurements during performance assessment. Differential rater function (DRF) is defined as the tendency of a rater to give higher or lower scores to some individuals than to others, depending on various characteristics such as gender, age, and cultural factors (Wesolowski, Wind, & Engelhard, 2015). For example, a rater may award more points to successful individuals. Because the interference of differential rater function in the measurements is a systematic error, it has a negative effect on the validity of the measurements. DRF refers to a situation in which students with the same underlying ability level are not likely to receive the same level of scores from raters because of their group membership. Thus, a biased rater favours or disfavours a particular group of students compared to another group, for example, when scoring students' writing skills. DRF often enters the measurements when group memberships are known. However, some studies have reported that DRF was also involved in the measurements when group membership was not known (Jin & Wang, 2017).
When the literature was examined, it was found that raters whose assessments involved severity, leniency, or central tendency errors in the process of assessing individual performance generally exhibited DRF as well (Johnson et al., 2008; Myford & Wolfe, 2003; Wind & Guo, 2019). Studies investigating the involvement of DRF in performance assessment are quite limited. Wolfe and McVay (2012) found that 10% of the raters displayed more than one rater error when 40 raters assessed the essays of 120 students; some raters displayed severity, leniency, and DRF together. The study of Engelhard and Myford (2003) revealed that DRF was involved in raters' measurements when assessing students' academic writing skills according to their gender, race, and the language they speak. Wesolowski, Wind, and Engelhard (2015) found that DRF was involved in the measurements of 24 expert raters assessing students' jazz band performances. In the study conducted by Kim et al. (2012), it was found that very severe and very lenient raters generally displayed DRF. In Liu and Xie's (2014) study, 12 different scenarios were used in the process of assessing students' second language academic writing skills, and it was determined that raters showed DRF according to the scenarios. Schaefer (2008) found that errors of severity, leniency, and DRF were all involved in the process of assessing student essays. Because DRF is frequently involved in the measurements, rater training has been used to reduce this error in performance assessment. One study showed that rater training given in the process of assessing students' oral presentation skills was effective. Fahim and Bijani's (2011) study revealed that rater training given in the process of assessing students' academic writing skills in the second language decreased rater x criterion interactions. On the other hand, in the study conducted by Kondo (2010), it was found that rater training given in the process of assessing second language academic writing skills did not have a significant effect on DRF. In this context, it was noticed that different results were obtained depending on the rater training pattern used and the performance assessed.

Study Group
The study group consisted of a total of 45 raters, 23 in the control group and 22 in the experimental group. The raters were pre-service English teachers studying in a university's English Language Teaching Department. It was assumed that the participating pre-service teachers could assess academic writing skills since they were in the last year of their education. The average age of the raters was 21.84. A personal information form was prepared to determine whether the participants had rated before or had participated in a rater training program, and they were asked some demographic questions. It was determined that the participants had not participated in any rater training program before, that their rating experiences were similar, and that they were all inexperienced in rating. Since experimental studies examine the effectiveness of the experimental process rather than aiming to generalize to a population, no population or representative sample was defined. The raters assessed the essays written by 39 students who were continuing their education in the first year of the same department. These students had taken the advanced writing and reading courses in their first year, and they were all at B1 level. The essays were collected by an academician working in the same department from the students in her course, and the students participated in the study voluntarily. While the students were writing the essays, they were informed that these essays would not be graded, and they were asked not to write their names, student numbers, or ID numbers on the papers.

Data Collection Tools
Writing task

The student essays within the scope of the research were obtained using the opinion-based writing task published as an example by the International English Language Testing System (IELTS) (Appendix A) (IELTS, n.d.). These writing tasks are prepared in many different areas to improve students' academic writing skills in English. Their main purpose is to help students reach, in a short time, a level at which they can write essays. The writing tasks are prepared in two different categories, academic and general, and the individual chooses one of them according to his/her area of interest. The main reason for choosing this writing task is the idea that, since it represents real-life situations, it will contribute to the validity and reliability of the measurements obtained in the process of assessing individual performance. Students were given 40 minutes for the writing task, and they were asked to write an essay consisting of at least 250 words. The essays written by the students were numbered randomly, reproduced, and distributed to the raters.

Rubric (for academic writing)

In the process of assessing student essays, the analytical rubric developed by the researchers was used. A systematic process was followed in the development of the rubric so that it would contribute to the validity and reliability of the measurements. In this context, the suggestions of Goodrich (2000), Haladyna (1997), Kutlu et al. (2014), and Moskal (2000) were taken into consideration in the rubric development process. The literature was reviewed while determining the rubric's criteria, and sample rubrics in the studies of Weigle (2002), Hughes (2003), Brown (2004), Brown (2007), and Brookhart (2013) were comprehensively examined. After the literature review, a draft form consisting of a total of 20 sub-criteria under seven fundamental criteria was prepared, and the opinions of 11 experts in academic writing skills were consulted. The Lawshe (1975) approach was used to provide evidence for the content validity of the measurements obtained from the rubric, and the content validity ratio (CVR) was calculated for each criterion. A criterion was accepted as having sufficient content validity when its CVR was 0.591 or above (Wilson, Pan & Schumsky, 2012). In line with the opinions of the field experts, the final version of the rubric consisting of six basic criteria and 16 sub-criteria was obtained (Appendix B). Because most students did not give a title to their essays even though they were told to do so, the sub-criterion 'Title of Essay' was not included in the many facet Rasch analysis.
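To illustrate the Lawshe procedure described above, the following Python sketch computes the content validity ratio for a panel of 11 experts and applies the 0.591 cut-off; the criterion names and expert counts are hypothetical, not the study's actual ratings.

```python
# Minimal sketch of Lawshe's (1975) content validity ratio screening.
# Expert counts below are illustrative placeholders.

def content_validity_ratio(n_essential: int, n_experts: int) -> float:
    """CVR = (n_e - N/2) / (N/2), where n_e is the number of experts who
    rated the criterion as 'essential' and N is the panel size."""
    half = n_experts / 2
    return (n_essential - half) / half

N_EXPERTS = 11        # panel size reported in the study
CVR_CUTOFF = 0.591    # critical value for 11 experts (Wilson, Pan & Schumsky, 2012)

# Hypothetical essentiality counts for a few draft criteria
essential_counts = {"organization": 11, "coherence": 10, "title_of_essay": 7}

for criterion, n_essential in essential_counts.items():
    cvr = content_validity_ratio(n_essential, N_EXPERTS)
    verdict = "retain" if cvr >= CVR_CUTOFF else "revise or drop"
    print(f"{criterion}: CVR = {cvr:.3f} -> {verdict}")
```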
After collecting the evidence for the content validity of the measures obtained from the rubric, exploratory factor analysis (EFA) was performed for construct validity. The assumptions of EFA were tested and found to be met (for the relevant data, CVR = 0.70; Bartlett's sphericity test χ2(df) = 956.427 (105); p < 0.001). In the data set, there were no extreme values or missing data, the relationships between the criteria were linear, and, except for two of them, the criteria showed a normal distribution. A review of the literature on how large the sample should be for exploratory factor analysis reveals many different opinions. Guadagnoli and Velicer (1988) stated that these views were based neither on theory nor on experimental studies, and, in their Monte Carlo simulation study on the sample size required for exploratory factor analysis, they emphasized that the factor loadings of the variables were more important than the sample size. Accordingly, they stated that samples of fewer than 50 people would produce consistent results, regardless of the number of variables, provided the factor loadings were 0.80 or higher (Guadagnoli & Velicer, 1988). Although the sample size was less than 50 in this study, it was considered appropriate to perform an exploratory factor analysis since the factor loadings of all variables except three were greater than 0.80. Exploratory factor analysis was conducted on the average of the scores given by the 45 raters to the 39 essays. As a result of the analysis, the criteria were collected under a single factor that explained 70.05% of the variance (the factor loadings of the criteria for the relevant data set were 0.842, 0.855, 0.936, 0.968, 0.644, 0.860, 0.960, 0.987, 0.945, 0.605, 0.911, 0.891, 0.899, 0.861, and 0.622).
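A minimal sketch of how such a single-factor EFA on the rater-averaged scores could be run in Python is given below; it assumes the factor_analyzer package and a hypothetical file mean_criterion_scores.csv holding the 39 x 15 matrix of averaged criterion scores, so the file name and variable names are illustrative only.

```python
# Sketch of the EFA step: assumption checks plus a one-factor solution.
import pandas as pd
from factor_analyzer import (FactorAnalyzer, calculate_bartlett_sphericity,
                             calculate_kmo)

mean_scores = pd.read_csv("mean_criterion_scores.csv")  # hypothetical export

# Assumption checks analogous to those reported in the paper
chi_square, p_value = calculate_bartlett_sphericity(mean_scores)
_, kmo_total = calculate_kmo(mean_scores)
print(f"Bartlett chi2 = {chi_square:.3f}, p = {p_value:.4f}, KMO = {kmo_total:.2f}")

# Single-factor solution, as found for the rubric criteria
efa = FactorAnalyzer(n_factors=1, rotation=None, method="minres")
efa.fit(mean_scores)
loadings = efa.loadings_.ravel()
explained = efa.get_factor_variance()[1][0]  # proportion of variance, factor 1
print("Loadings:", loadings.round(3))
print(f"Variance explained: {explained:.2%}")
```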
Since the factor loading obtained for each criterion differed (congeneric measurements), the McDonald ω coefficient (McDonald, 1999) was used as the reliability evidence for the measurements, because it gives consistent results under these conditions (Osburn, 2000). The McDonald ω coefficient was found to be 0.971 (95% confidence interval: 0.956-0.980). Considering the reliability and validity evidence obtained for the analytical rubric, it can be argued that the measurements obtained with this tool are reliable and that the inferences made based on these measurements are valid.
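For readers who want to see how ω is obtained from a congeneric, single-factor solution, the sketch below applies the standard formula to the standardized loadings reported above; it is not the authors' own script, and the result will differ slightly from the 0.971 reported in the paper, which was presumably computed from the raw data.

```python
# Minimal sketch of McDonald's omega for a unidimensional, congeneric scale.
import numpy as np

def mcdonald_omega(loadings) -> float:
    """omega = (sum of loadings)^2 / [(sum of loadings)^2 + sum of uniquenesses],
    with uniqueness = 1 - loading^2 for standardized loadings."""
    lam = np.asarray(loadings, dtype=float)
    common = lam.sum() ** 2
    unique = (1.0 - lam ** 2).sum()
    return common / (common + unique)

# Standardized loadings for the 15 criteria, as reported in the EFA results
lam = [0.842, 0.855, 0.936, 0.968, 0.644, 0.860, 0.960, 0.987,
       0.945, 0.605, 0.911, 0.891, 0.899, 0.861, 0.622]
print(f"McDonald's omega = {mcdonald_omega(lam):.3f}")
```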

Experimental Process
Before starting the experimental process, to determine the starting levels of the experimental and control groups, the students' essays were distributed to the raters and the scores they gave were taken as a pre-test; statistical differences between the groups were examined with the independent samples t-test and the Many Facet Rasch Model. As a result of the analysis, it was found that both groups exhibited similar rater errors in the process of assessing student essays, and the rater errors involved in the measurements were close to each other. In addition, before starting the experimental process, the analytical rubric was introduced to both the experimental and control groups, and its use in the scoring process was explained. Later, both groups were told what academic writing skill is, what its general characteristics are, and how it relates to the developed rubric. These procedures were carried out to ensure that the experimental and control groups started at a similar level. In this way, an attempt was made to minimize the interference of other sources of variance (such as the measurement tool) in the measurements obtained in the process of assessing academic writing skills. The raters were not informed of whether they were in the experimental or control group. Then, the student essays were distributed to the experimental and control groups, and they were given one week to assess the essays. One week later, the student essays were collected, and the scores were entered into the computer for analysis.

Rater training
To create a common understanding among raters while assessing individual performance, rater error training (RET) and frame of reference training (FRT), which are recommended in the literature, were combined. The two trainings were combined because RET is able to define rater behaviours and errors but is not effective on rating accuracy, whereas FRT is successful in improving rating accuracy (Murphy & Balzer, 1989; Sulsky & Day, 1992). In other words, both rater training patterns were chosen because they complement each other. The basic assumption of the RET design is that familiarity with common rater errors, together with encouraging raters to avoid these errors, will directly reduce rater errors and therefore lead to more effective performance assessment (Woehr & Huffcutt, 1994). Although rater errors such as severity and leniency decrease under the RET pattern, findings indicate that rating accuracy may also decrease (Bernardin & Pence, 1980). The FRT pattern is based on the premise that the performance being assessed is multidimensional (Selden, Sherrier & Wooters, 2012). Therefore, all sub-dimensions of performance should be defined, and behavioural examples representing these dimensions should be given to the raters. The basic principle of the FRT pattern is to train the raters so that the assessed performance dimensions have defined standards. Thus, the scores given by the rater can be matched with the actual scores of the student (Woehr & Huffcutt, 1994). The rater training was completed in four weeks in total, one hour each week within the measurement and evaluation course.
In the first week, the purpose, scope, and importance of rater training were introduced within the framework of RET. Then, the target audiences and the methods used were introduced in the rater training, and the first stage was completed. The second stage included information about the most common rater errors of the performance assessment process and the effects of these errors on validity and reliability. Finally, for rater training, in-group discussions were made based on a few examples. Thus, the first week of rater training was completed.
In the second week, the possible sources of rater errors involved in the measurements in the performance assessment process were explained, and the actions to be taken to reduce these errors were specified. These suggestions were determined by reviewing the literature, and the sample applications were shared with the experimental group. With this process, the RET part of the rater training was completed, and the FRT part was started. First, the academic writing skill, which was assessed by the raters, was defined. The sub-dimensions of this skill and which criteria correspond to the sub-dimensions in the rubric were explained. Then, the raters in the experimental group were asked to give representative behaviours regarding the dimensions of academic writing skill. They were then asked to discuss these representative behaviours in the group.
In the third week, as a continuation of the second week, examples regarding the dimensions of academic writing skill were given, and in-group discussions continued. After this stage, based on the pre-test results of the raters, the best, middle, and low-level student compositions were determined. These compositions were reproduced and distributed to the raters in the experimental group, who were asked to re-assess them. The raters were not informed about whether the essays were good or bad. After the assessment process, raters were randomly selected and asked about the scores they gave and the reasons for giving these scores. Later, the same question was asked to other raters in the experimental group. This process was carried out with a focus on the criteria with the highest standard error according to the pre-test measurements; the main goal was to create a common understanding among the raters.

In the last week, the activities of the third week were continued with different raters. The compositions of three students, which had been determined beforehand according to the pre-test results, were assessed by an academician. Raters were asked to explain how many points the field expert (academician) gave according to the determined criteria, and in-group discussions were conducted on this basis. After all stages, rater training was completed, and the students' compositions (39) were given to the experimental and control groups again for the post-test measurements (the duration for assessment was one week). Participation in all stages of the experimental process and scoring was voluntary. Also, additional points were added to the raters' final course grades to encourage participation.

Data Analysis
During the data analysis process, EFA and the Lawshe technique were applied to provide evidence for the validity of the measurements obtained from the measurement tool developed. Then, many facet Rasch analyses were performed, and the Mann-Whitney U test was run on the logit values obtained from this analysis. EFA could be performed because the raters' scores showed a normal distribution; however, since the logit values obtained from the MFRM were not normally distributed, the Mann-Whitney U test was used for the group comparisons. MFRM analysis was preferred because it reports the interactions between facets at the individual level. Since all raters assessed all students' compositions on all criteria, MFRM was conducted under a fully crossed design. Detailed information about MFRM is presented below.
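As an illustration of the group comparison step, the following Python sketch runs a Mann-Whitney U test on two sets of bias-size (logit) values and derives the normal-approximation z and the effect size r = |z| / sqrt(N); the arrays are simulated placeholders, not the study's exported FACETS values.

```python
# Minimal sketch of the Mann-Whitney U comparison of bias sizes between groups.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
experimental_bias = rng.normal(loc=-0.10, scale=0.40, size=160)  # placeholder logits
control_bias = rng.normal(loc=-0.25, scale=0.40, size=203)       # placeholder logits

u_stat, p_value = mannwhitneyu(experimental_bias, control_bias,
                               alternative="two-sided")

# Normal approximation for z, and the effect size r = |z| / sqrt(N)
n1, n2 = len(experimental_bias), len(control_bias)
mu_u = n1 * n2 / 2
sigma_u = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (u_stat - mu_u) / sigma_u
r = abs(z) / np.sqrt(n1 + n2)
print(f"U = {u_stat:.2f}, p = {p_value:.4f}, z = {z:.2f}, r = {r:.2f}")
```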

Many facet Rasch model
MFRM emerged as an extension of the basic Rasch model. Unlike the basic Rasch model, many sources of variability (facets) such as rater, item, task, individual, and time are placed on a single scale (Kim et al., 2012; Linacre, 1993; Linacre, 1996). In addition, the interactions between sources of variability can be examined within MFRM (Kassim, 2007). MFRM is a linear model that calibrates all parameters and converts the observations on the rating scale to an equal-interval logit scale (Bond & Fox, 2015). The logistic transformation of log odds ratios allows measures from contexts such as peer assessment, status-determination criteria, and open-ended items to be analysed within the same framework (Esfandiari, 2015).
Another advantage of MFRM is that it offers information that classical test theory and generalizability theory cannot provide (Lunz, Wright & Linacre, 1990). MFRM can provide the researcher with detailed information about each facet. For example, for each rater in a group assessing individuals' performance, it can show what the rating is (observed value) and what the rating should be (expected value). Because MFRM provides such detailed feedback, it is possible to determine which raters are performing well or poorly and what kind of intervention is required. Based on these advantages, rater errors can be determined before the rater training, and the training can be targeted at these errors. Thus, the validity and reliability of the measurements can be increased.
Considering rater x student composition (pxb) interactions, the measurement model is defined as follows:

$$\ln\left(\frac{P_{bpkx}}{P_{bpk(x-1)}}\right) = B_b - C_p - D_k - F_x - I_{pb}$$

where $P_{bpkx}$ is the probability that composition $b$, rated by rater $p$ on criterion $k$, receives a rating in category $x$ rather than in category $x-1$; $B_b$ is the ability reflected in composition $b$; $C_p$ is the severity of rater $p$; $D_k$ is the difficulty of criterion $k$; $F_x$ is the difficulty of category $x$ relative to category $x-1$; and $I_{pb}$ is the interaction (bias) term between the rater facet and the student composition facet.
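To make the adjacent-category formulation above concrete, the short Python sketch below (not taken from the paper) converts a set of facet parameters into category probabilities under the rating scale form of the model; all parameter values are illustrative logits, not estimates from the study.

```python
# Minimal numerical sketch of category probabilities under the MFRM above.
import numpy as np

def category_probabilities(B_b, C_p, D_k, F, I_pb=0.0):
    """Return P(X = x) for x = 0..m, where F holds the category
    thresholds F_1..F_m (the empty sum for category 0 is implied)."""
    F = np.asarray(F, dtype=float)
    step_logits = B_b - C_p - D_k - F - I_pb              # one term per step 1..m
    cum = np.concatenate(([0.0], np.cumsum(step_logits)))  # category 0 gets 0
    expcum = np.exp(cum)
    return expcum / expcum.sum()

# Illustrative values: an able student, a slightly severe rater, an easy
# criterion, thresholds for a 4-category scale, and a small pxb bias term.
probs = category_probabilities(B_b=1.2, C_p=0.4, D_k=-0.3,
                               F=[-1.0, 0.0, 1.0], I_pb=0.2)
print(probs.round(3), "sum =", probs.sum())
```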
Since MFRM belongs to the Rasch model family, it must meet the assumptions of Rasch models (Eckes, 2015; Farrokhi, Esfandiari & Schaefer, 2012; Farrokhi et al., 2011). The assumptions to be met for MFRM are unidimensionality, local independence, and model-data fit. As stated in the data collection tools section, the rubric had a single-factor structure. For the local independence assumption, the G² statistics proposed by Chen and Thissen (1997) were used. The standardized LD χ² values ranged from -0.4 to 4.5, the marginal fit χ² values were close to zero, and local independence was therefore supported. Standardized residual values were examined for model-data fit. The total number of observations for the pre-test application was 39 x 45 x 15 (composition x rater x criterion) = 26,325. Model-data fit was achieved for the pre-test application, since the number of standardized residual values outside the ±2 range was 1,067 (4.05%) and the number outside the ±3 range was 164 (0.62%). For the post-test application, the total number of observations was 26,322 (3 missing data); the number of standardized residual values outside the ±2 range was 995 (3.78%), and the number outside the ±3 range was 186 (0.71%).
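The residual-based fit check described above can be summarized in a few lines of Python; the sketch below assumes a flat export of standardized residuals (one value per composition x rater x criterion observation) and uses simulated values as a stand-in for the real output.

```python
# Minimal sketch of the standardized-residual model-data fit check.
import numpy as np

rng = np.random.default_rng(1)
std_residuals = rng.standard_normal(26_325)   # placeholder for the 39 x 45 x 15 values

n = std_residuals.size
outside_2 = int(np.sum(np.abs(std_residuals) > 2))
outside_3 = int(np.sum(np.abs(std_residuals) > 3))
print(f"|z| > 2: {outside_2} ({outside_2 / n:.2%}), |z| > 3: {outside_3} ({outside_3 / n:.2%})")

# Rule of thumb used in the paper: roughly at most 5% of standardized residuals
# outside +/-2 and at most 1% outside +/-3 indicate acceptable model-data fit.
```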

RESULTS
The findings are presented under two headings: before (pre-test) and after (post-test) rater training. For the MFRM analysis, group-level statistics are presented first, followed by individual-level statistics.

Investigating DRF Status of Raters in Experimental and Control Groups Before Rater Training
The estimated chi-square value for the statistical indicator of rater x student composition (pxb) interactions at the group level was found to be significant (χ²(df) = 5,298.40 (1755), p < 0.05).
The significance of the chi-square value indicates that differential rater function was involved in the measurements at the group level during the assessment of student compositions. After determining that DRF was involved in the measurements at the group level in the pxb interaction, statistics at the individual level were examined. In MFRM, t statistics are used to test the significance of interactions between sources of variability: the t-value obtained from the MFRM interaction analysis is compared with the critical t-value, and interactions with a t-value outside the ±2 range indicate differential rater function (Linacre, 2017). The number of possible interactions in the control group was 897 (23 x 39), and the number of significant interactions was 203 (22.63%). The number of possible interactions in the experimental group was 858 (22 x 39), and the number of significant interactions was 160 (18.65%). When the t statistic takes a negative value, it is defined as differential rater severity; when it takes a positive value, it refers to differential rater leniency. Table 1 presents the frequencies and percentages of the raters in the experimental and control groups by type of significant interaction. Table 1 shows that the levels of DRF involved in the measurements of the experimental and control groups were close to each other. The statistical significance of the differential rater severity and leniency of the raters in the control and experimental groups was tested using the bias size values obtained in the MFRM interaction analysis, and the results are given in Table 2. As seen in Table 2, the levels of DRF involved in the measurements of the raters in the experimental and control groups before the rater training were statistically similar (for DRS, U = 3872.00, Z = -1.90, p > 0.05; for DRL, U = 3307.00, Z = -0.74, p > 0.05).
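A minimal sketch of how the Table 1 counts could be tallied from the interaction output is given below; it assumes a hypothetical export file pxb_interactions.csv with columns group, rater, composition, and t (all names are illustrative, not the study's actual files).

```python
# Sketch of counting significant pxb interactions and classifying DRS vs DRL.
import pandas as pd

pxb = pd.read_csv("pxb_interactions.csv")   # hypothetical export of pxb t-values

significant = pxb[pxb["t"].abs() > 2].copy()
significant["type"] = significant["t"].apply(
    lambda t: "differential severity" if t < 0 else "differential leniency")

summary = (significant.groupby(["group", "type"]).size()
           .unstack(fill_value=0))
summary["total"] = summary.sum(axis=1)
print(summary)
```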

Investigating DRF Status of Raters in Experimental and Control Groups after Rater Training
After the experimental procedure, the estimated chi-square value for the statistical indicator of rater x student composition (pxb) interactions at the group level was found to be significant (χ²(df) = 4,084.90 (1755), p < 0.05). This finding shows that, despite rater training, differential rater function still interfered with the measurements in the raters' performance assessment process.
Since DRF was involved in the group-level measurements, statistics at the individual level were examined; therefore, the t statistics for the pxb interactions were inspected. While 163 of the 897 possible interactions (18.17%) of the control group were significant, 110 (12.82%) of the 858 possible interactions of the experimental group were found to be significant. Table 3 presents the frequencies and percentages of the raters in the experimental and control groups for the differential rater function involved in the measurements during the performance assessment process after the rater training. The levels of DRF involved in the measurements of the raters in the experimental and control groups differed after the rater training while assessing student compositions. The statistical significance of the differential rater severity and leniency of the raters in the control and experimental groups was tested using the bias size values obtained in the MFRM interaction analysis, and the results are given in Table 4. After rater training, the difference between the groups in the level of differential rater severity involved in the measurements was statistically significant, while the difference in differential rater leniency was not (for DRS, U = 2072.50, Z = -2.72, p < 0.05; for DRL, U = 1476.50, Z = -1.38, p > 0.05). According to this result, rater training had a small effect (r = 0.22) on differential rater severity, but no effect on differential rater leniency.
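For transparency, the effect size quoted above follows the usual conversion r = |Z| / sqrt(N); the sketch below reproduces this arithmetic, but note that the exact N (the number of bias terms entering the comparison) is not reported here, so the value used is a hypothetical placeholder chosen only to show the calculation.

```python
# Sketch of the effect-size conversion r = |Z| / sqrt(N) for the DRS comparison.
import math

Z = -2.72   # Mann-Whitney z for differential rater severity (post-test)
N = 153     # hypothetical total number of DRS bias terms across both groups
r = abs(Z) / math.sqrt(N)
print(f"r = {r:.2f}")   # approximately the small effect (r = 0.22) reported above
```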

To observe the effect of rater training on pxb interactions, the numbers of significant interactions of the raters in the experimental group in the pre- and post-tests are given in Table 5. As seen in Table 5, while assessing student compositions after rater training, the significant interactions of 14 raters (1, 4, 5, 6, 7, 9, 10, 13, 15, 16, 17, 18, 20, and 21) decreased (they were positively affected by the training); the significant interactions of 7 raters (2, 3, 8, 11, 12, 14, and 19) increased (they were negatively affected by the training); and the significant interactions of 1 rater (22) remained constant.
To make Table 5 easier to interpret, a graphical representation of the pxb interactions is given in Figure 1. As seen in Figure 1, the red lines representing the raters' pre-test values mostly fell outside the ±2 range. After rater training, the blue lines representing the raters' post-test ratings fell outside the ±2 range less often. According to Figure 1, some compositions were subject to more rater bias than others; for example, raters were more severe in assessing composition number 37 than the other compositions. Overall, it can be said that the rater training had a positive effect on rater errors in general and, as a result, contributed to the validity of the measurements.
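A Figure-1-style plot can be produced with a few lines of matplotlib; the sketch below assumes two arrays of pxb interaction t-values (pre- and post-training) ordered by composition number, and uses simulated stand-in data rather than the study's values.

```python
# Minimal sketch of a pre/post comparison plot of pxb interaction t-values.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
compositions = np.arange(1, 40)
t_pre = rng.normal(0, 2.0, size=39)    # placeholder pre-test t-values
t_post = rng.normal(0, 1.3, size=39)   # placeholder post-test t-values

plt.figure(figsize=(9, 4))
plt.plot(compositions, t_pre, color="red", marker="o", label="Pre-test")
plt.plot(compositions, t_post, color="blue", marker="o", label="Post-test")
plt.axhline(2, linestyle="--", color="grey")   # +/-2 significance bounds
plt.axhline(-2, linestyle="--", color="grey")
plt.xlabel("Composition number")
plt.ylabel("pxb interaction t-value")
plt.legend()
plt.tight_layout()
plt.show()
```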

DISCUSSION and CONCLUSION
This study aimed to investigate the effect of rater training on the DRF involved in measurements while assessing second language academic writing skills. In this context, the findings obtained before and after rater training were examined. Before rater training, the DRF involved in the measurements was similar in the experimental and control groups while assessing students' compositions. Similar DRF effects were found in both group-level and individual-level statistics. Approximately one-fifth of the pxb interactions in the experimental and control groups showed DRF. Research supports this finding, indicating that DRF is frequently involved in measurements in the performance assessment process (Liu & Xie, 2014; Schaefer, 2008; Wesolowski et al., 2015; Wolfe & McVay, 2012). While assessing students' compositions, the DRF involved in the measurements appeared in two forms: differential rater severity and differential rater leniency. This study found that raters mostly showed differential rater severity. The literature suggests that the DRF involved in measurements during the performance assessment process is a combination of both severity and leniency behaviour, and that DRF generally arises from too severe or too lenient raters (Kim et al., 2012). Considering that there were more severe raters in the current study, the predominance of differential rater severity is in line with the literature.
During the process of assessing student compositions after the rater training, the level of DRF involved in the measurements was examined. While the amount of change in the control group was minimal, a significant change was found in the experimental group. Although the levels of the two types of DRF involved in the measurements of the experimental and control groups were statistically similar before the rater training, they differed statistically after the rater training. It was found that differential rater leniency was not affected by the experimental process, but differential rater severity was. In other words, rater training was effective on the differential rater severity component of DRF. Studies by Fahim and Bijani (2011), May (2008), and Yan (2014) similarly found rater training to be effective on DRF. Van Dyke (2008), in contrast, found that differential rater leniency interfered with the measures in the performance assessment process, but differential rater severity did not. There are two main possible reasons for the difference between the current study and the one conducted by Van Dyke (2008): the raters consisted of different groups, and the performance assessed was different.
The results of this study can be summarized as follows:
 During the process of assessing compositions, DRF was involved in the measurements and accounted for approximately one-fifth of the pxb interactions.
 Raters in the experimental and control groups exhibited similar DRF before rater training.
 Rater training had an impact on the differential rater severity type of DRF, and this effect of rater training on DRF was small.
Based on these results, some suggestions are made for future studies and researchers:
 In the present study, two different rater training patterns were combined. Considering that there are many different rater training patterns in the literature, different combinations can be used to examine the effects of rater training on DRF.
 A large experimental group was used in this study. The literature emphasizes that the training of smaller (n = 5-6) groups is more effective. Thus, it may be useful to use small groups in future studies.
 Rater training can be used to reduce DRF and thereby contribute to the validity and reliability of the measurements obtained in performance assessments used in placement and selection exams.