How Reliable Is It to Automatically Score Open-Ended Items? An Application in the Turkish Language *

The use of open-ended items, especially in large-scale tests, creates difficulties in scoring. However, this problem can be overcome with an approach based on the automated scoring of open-ended items. The aim of this study was to examine the reliability of the data obtained by scoring open-ended items automatically. One objective was to compare different machine learning algorithms for automated scoring (support vector machines, logistic regression, multinomial Naive Bayes, long short-term memory, and bidirectional long short-term memory). The other objective was to investigate how the reliability of automated scoring changes with the data rate used in testing the automated scoring system (33%, 20%, and 10%). While examining the reliability of automated scoring, a comparison was made with the reliability of the data obtained from human raters. In this study, which presents the first automated scoring attempt for open-ended items in the Turkish language, Turkish test data of the Academic Skills Monitoring and Evaluation (ABIDE) program administered by the Ministry of National Education were used. Cross-validation was used to test the system. Three coefficients of agreement were used to assess reliability: the percentage of agreement; the quadratic-weighted Kappa, which is frequently used in automated scoring studies; and Gwet's AC1 coefficient, which is not affected by the prevalence problem in the distribution of data across categories. The results showed that automated scoring algorithms can be utilized. The best algorithm for automated scoring was found to be bidirectional long short-term memory. The long short-term memory and multinomial Naive Bayes algorithms performed worse than the support vector machines, logistic regression, and bidirectional long short-term memory algorithms. In automated scoring, the coefficients of agreement at the 33% test data rate were slightly lower than at the 10% and 20% test data rates, but they remained within the desired range.


INTRODUCTION
Individuals take numerous tests throughout their lives. Tests reveal differences in individuals' knowledge, skills, and abilities, and thus decisions can be made about them (Geisinger & Usher-Tate, 2016). In recent years, the use of more than one item format in tests has become more popular. In this approach, referred to as a mixed-format test, open-ended items with or without restricted responses are used in addition to multiple-choice items. In multiple-choice items, individuals encounter one correct answer and several incorrect answers to a problem. In open-ended items with restricted responses, individuals answer questions with a few words, sentences, or paragraphs, while in items with unrestricted responses, they respond at any length they want (Downing, 2009). The combined use of these item types makes it possible to offset the limitations of each format (Messick, 1993). For example, using only multiple-choice items in tests affects the teaching and learning process and leads individuals to study for multiple-choice tests. This situation can restrict original, critical, and higher-level thinking skills. However, the use of open-ended items can overcome this limitation.
A review of the literature shows that automated scoring procedures have been carried out in languages other than Turkish. The studies of Gierl et al. (2014), Adesiji et al. (2016), and Taghipour and Tou Ng (2016) can be given as examples of studies using different machine learning algorithms. Gierl et al. (2014) used the SVM algorithm based on supervised machine learning for automated scoring, Adesiji et al. (2016) utilized a structure consisting of three modules based on unsupervised machine learning, and Taghipour and Tou Ng (2016) utilized three recurrent neural network algorithms based on supervised machine learning (basic recurrent units, gated recurrent units, and LSTM units). Differences in language structure are a factor that may affect automated scoring. Therefore, automated scoring in the Turkish language should be investigated. The Altaic language family, which includes Turkish, has features such as vowel harmony, agglutination through suffixes, a sentence order in which the modifier precedes the modified, and adjective clauses that do not change according to case, gender, or number. Nouns following numerals that indicate plurality do not take plural suffixes, and gender is not marked in words. Because these features differ from those of other language families, automated scoring studies in the Altaic language family need to be reviewed. Jang et al. (2014) conducted research on the Korean language and Ishioka and Kameda (2006) on the Japanese language. In both studies, algorithms with manually defined features were used. The current research is original in that it is the first automated scoring attempt in the Turkish language.

METHOD
In this study, a correlational research method was adopted, since the reliability of the scores of human raters and the reliability of the scores of automated scoring algorithms were compared. Creswell (2012) states that in correlational research it is possible to see how a change in one variable affects another variable.

The Development of the Software Used in Research
In the study, an automated scoring software developed by a team including the researcher was used. While the software was being developed, the open-ended items with restricted responses from the Turkish test of the "Monitoring the Measurement and Evaluation Applications, Research and Development Project" administered by the Ministry of National Education (MoNE) were used. The Turkish test used at this stage is independent of the ABIDE tests analyzed later in the research. This test is for fifth-grade students and includes five open-ended items. While preparing the software, five open-ended items with restricted responses, scored 0-1 or 0-1-2, were used. In this test, all student answers were graded by two raters, and when necessary, a final score was obtained by referring the answer to a senior rater. Rubrics were used in the scoring process.
The results of two of the items used in the development of the software are presented as examples. The item with two categories (item 16) and its rubric are included in Appendix-A, and the item with three categories (item 20) and its rubric are included in Appendix-B. Data of 303 students for item 16 and 637 students for item 20 in the Turkish test were used. Since item 20 was scored in three categories, more data were used. The automated scoring system was created using Python on the Linux operating system, and trials were conducted. Five algorithms were used in automated scoring: SVM, LR, MNB, LSTM, and BLSTM. Two libraries, Keras and scikit-learn, were utilized in the software. 90% of the data was used to train the system and 10% to test the system. Random sampling was used with cross-validation. With 10-fold cross-validation, the training and test sets were rotated ten times so that they differed from each other, each response was scored automatically once, and the percentages of agreement were calculated over these scores. Thus, 303 scoring results were obtained in the trial conducted on 303 data, and 637 scoring results were obtained in the trial performed on 637 data. The usability of the software was investigated by examining the agreement between the automated scores and the final scores of the human raters. Table 1 includes the results of the dichotomously scored (0-1) item 16 and the polytomously scored (0-1-2) item 20.
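The exact feature representation and model settings of the software are not reported here; the following minimal sketch only illustrates how the classical algorithms (SVM, LR, and MNB) could be evaluated with 10-fold cross-validation in scikit-learn, assuming a simple TF-IDF representation of the student answers. The loader function, file name, and all hyperparameters are hypothetical.

```python
# Minimal sketch (not the authors' exact implementation): evaluating the
# classical algorithms with 10-fold cross-validation and the percentage of
# agreement, assuming a TF-IDF representation of the answers.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Hypothetical loader: returns the answer texts and the human raters' final scores.
answers, final_scores = load_item_responses("item16.csv")
final_scores = np.asarray(final_scores)

models = {
    "SVM": LinearSVC(),
    "LR": LogisticRegression(max_iter=1000),
    "MNB": MultinomialNB(),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for name, clf in models.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)
    # Each answer is scored automatically once, while it sits in the held-out fold.
    predicted = cross_val_predict(pipe, answers, final_scores, cv=cv)
    agreement = np.mean(predicted == final_scores) * 100  # percentage of agreement
    print(f"{name}: {agreement:.1f}% agreement with the final scores")
```

The percentage printed for each algorithm corresponds to the agreement between the automated scores and the final human scores of the kind reported in Table 1.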
When Table 1 is examined, it is seen that the percentages of agreement obtained for item 16 are quite high. The algorithms showing the highest percentage of agreement for item 16 were LSTM and BLSTM. The percentages of agreement obtained for item 20 were found to be sufficient. The algorithm showing the best agreement for item 20 was BLSTM. These results showed that the system would be sufficient for scoring open-ended items with restricted responses. Thus, the automated scoring process was started for the ABIDE data sets within the scope of this research.

Research Data Source
The data source of the study was the 8th-grade administration of the Academic Skills Monitoring and Evaluation (ABIDE) Project implemented by MoNE in Turkey in 2016. The tests, which aim to examine students' higher-order thinking skills, include both multiple-choice items and open-ended items with restricted responses. The research was conducted on the open-ended items with restricted responses in the Turkish tests of the A1 and B1 booklets. Nine items in the A1 test and 10 items in the B1 test are open-ended. Five of the open-ended items in the A1 and B1 tests are common. Open-ended items are scored as 0-1 or 0-1-2. The open-ended items were scored by two human raters. If there was no agreement between the scores, the answer was sent to a senior rater. Thus, the final scores were obtained. Rubrics were used while scoring. The Cramer's V coefficients of the open-ended items in the A1 and B1 booklets were reported to range between .83 and .98 and between .87 and .99, respectively. Coefficients above .80 indicate that rater consistency is high (MoNE, 2017a; MoNE, 2017b). Sample items and rubrics from the ABIDE test are included in Appendix-C and Appendix-D.

Transfer of the Data to Computer Environment
First, the data described above were requested from MoNE. Based on this request, the data of 1000 randomly selected students were shared with the researchers. The data include the score matrices of two different rater groups, the final scores, and the student answers in JPEG format. The student answer sheets were transcribed into the computer environment manually, because the students' handwriting is often cursive and difficult to read, so optical character recognition (OCR) systems cannot be adequately utilized. In addition, manual entry eliminates errors caused by OCR programs. To ensure that the manually entered data matched the student answers, the data were checked by a study group of undergraduate students, and transcription errors were corrected. Student responses were transcribed verbatim and were not otherwise corrected.

Data Analysis
Before analyzing the research data, the data of the 1000 students obtained from MoNE were examined. Data were entered so that the scores obtained from the open-ended items were distributed across the categories in a balanced way. This was done to avoid, as much as possible, the prevalence problem (imbalance in the distribution across categories) of the open-ended items in the data. Nine open-ended items for the A1 booklet and ten open-ended items for the B1 booklet were taken into consideration, and 697 data from the A1 booklet and 701 data from the B1 booklet were entered. Then, students who answered at least half of the open-ended items in the test were selected. After this process, the missing data rate was calculated for each open-ended item. The data were cleaned so that the missing data rate remained below 5%. This was done to prevent the coefficients of agreement from being inflated in automated scoring. While cleaning the data, the distribution across categories was taken into account. Since some categories contain few data, care was taken not to exclude individuals who scored in these categories, as far as possible. Considering the criteria mentioned above, the data of 84 people from the A1 booklet and 96 people from the B1 booklet were removed. Then, the scores given to the students by human rater group 1 and human rater group 2 were examined. Some students were also excluded from the study because of missing scores encountered here; six students were excluded from each of the A1 and B1 booklets. Finally, the amount of missing data in the multiple-choice items was evaluated, and students who did not answer more than half of the total number of items in the test and more than half of the multiple-choice items were excluded from the study. Thereby, the missing data rate remained below 5%. No data were excluded from the A1 booklet, and the data of 15 people were excluded from the B1 booklet. Consequently, 90 people from the A1 booklet and 117 people from the B1 booklet were excluded. Thus, the data preparation process was completed, and the automated scoring process was started with 607 data from the A1 booklet and 584 data from the B1 booklet.
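As an illustration of the screening described above, the sketch below shows the two rate-based filters (answering at least half of the open-ended items, and keeping item-level missing rates below 5%) in pandas. The file name and column layout are hypothetical, and the category-balance and rater-score checks were applied case by case in the study rather than by a simple rule.

```python
# Illustrative sketch of the rate-based screening steps (file and column
# names are hypothetical; NaN marks an unanswered open-ended item).
import pandas as pd

df = pd.read_csv("abide_a1_open_ended.csv")
open_ended_cols = [c for c in df.columns if c.startswith("item_")]

# 1) Keep students who answered at least half of the open-ended items.
answered = df[open_ended_cols].notna().sum(axis=1)
df = df[answered >= len(open_ended_cols) / 2]

# 2) Report items whose missing-data rate is still 5% or more; in the study,
#    further case-by-case removals kept every item below this threshold.
missing_rate = df[open_ended_cols].isna().mean()
print(missing_rate[missing_rate >= 0.05])
```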

Automated scoring of ABIDE open-ended data
In the automated scoring phase, the automated scoring system was trained using some of the final scores. In this way, the system learned how to score from the human raters, and the scoring characteristics were mapped onto the system. Then, the data that were not used in training the system were scored automatically. No feature was defined manually in the software. The data rate used in training/testing the system was a factor whose effect was examined in the research. The data rates used for testing were 10%, 20%, and 33%; therefore, the data rates used in training the system were 90%, 80%, and 67%, respectively. Accordingly, for the A1 booklet, 61, 121, and 200 of the 607 data were used to test the system, and 546, 486, and 407 of the 607 data were used to train it, respectively. A similar calculation can be made for the B1 booklet. When calculating the results, 10-fold cross-validation was used for the 10% test data rate, 5-fold cross-validation for the 20% test data rate, and 3-fold cross-validation for the 33% test data rate. In this way, the training and test data were rotated, and all 607 data for the A1 booklet and all 584 data for the B1 booklet were eventually used as test data. When comparing the results with other studies, the numbers of data rather than the data rates should be used; the results are reported here as rates because they correspond directly to the cross-validation design and improve clarity.
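The neural algorithms (LSTM and BLSTM) were built with Keras, but their architecture and hyperparameters are not specified here; the sketch below is therefore only an assumed, illustrative configuration. It also shows how the 10%, 20%, and 33% test data rates map onto 10-, 5-, and 3-fold cross-validation, so that every response ends up in a test fold exactly once. The loader function is hypothetical.

```python
# Illustrative sketch (assumed architecture and settings, not the authors'
# reported configuration): a Keras BLSTM scorer evaluated with the fold
# counts that correspond to the 10%, 20%, and 33% test data rates.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

TEST_RATE_TO_FOLDS = {0.10: 10, 0.20: 5, 0.33: 3}  # 90/10, 80/20, 67/33 splits

answers, final_scores = load_item_responses("item.csv")  # hypothetical loader
y = np.asarray(final_scores)

# For simplicity the tokenizer is fitted on all answers at once.
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(answers)
X = pad_sequences(tokenizer.texts_to_sequences(answers), maxlen=100)
num_classes = len(np.unique(y))

def build_blstm():
    model = Sequential([
        Embedding(input_dim=5000, output_dim=64),
        Bidirectional(LSTM(32)),          # reads each answer in both directions
        Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

for test_rate, n_folds in TEST_RATE_TO_FOLDS.items():
    predictions = np.empty_like(y)
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
    for train_idx, test_idx in folds.split(X, y):
        model = build_blstm()
        model.fit(X[train_idx], y[train_idx], epochs=10, batch_size=32, verbose=0)
        predictions[test_idx] = model.predict(X[test_idx]).argmax(axis=1)
    agreement = np.mean(predictions == y) * 100
    print(f"test rate {test_rate:.0%} ({n_folds}-fold CV): {agreement:.1f}% agreement")
```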
To evaluate the automated scoring results, their agreement with the final scores of the human raters was calculated. For comparison, the agreement of human rater group 1 and human rater group 2 with the final scores was also examined. Each item was examined separately.

Coefficients of agreement
While examining the agreement between raters, the percentage of agreement (PA), quadratic-weighted Kappa (QWK), and Gwet's AC1 coefficients were used. Detailed information is given below.

Percentage of Agreement: The percentage of agreement is a coefficient that can be understood and interpreted easily. It can also be calculated simply and quickly. Therefore, it was included in the research. In this method, the series of scores that the participants receive from the first and second raters are compared, the ratio of the number of ratings on which the raters fully agree to the number of all ratings is calculated, and the result is stated as a percentage. The results obtained range from 0% to 100%. This coefficient is criticized for not taking into account agreements that may occur by chance, because this can inflate the apparent agreement. It also does not reflect the degree of disagreement between raters. This method can be used at all scale levels (nominal, ordinal, interval/ratio) and when the number of score categories is two or more (Araujo & Born, 1985; Goodwin, 2001; Graham, Milanowski & Miller, 2012; Meyer, 1999). Although there is no definite rule, researchers agree that the percentage of agreement should be above 80% (Hartmann, 1977).
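In formula form, if $x_{1i}$ and $x_{2i}$ denote the scores given to response $i$ by the first and second raters and $N$ is the number of responses rated by both, the description above corresponds to:

$$PA = \frac{100}{N}\sum_{i=1}^{N} \mathbb{1}\{x_{1i} = x_{2i}\},$$

where $\mathbb{1}\{\cdot\}$ equals 1 when the two scores are identical and 0 otherwise.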

Quadratic Weighted Kappa: The Kappa coefficient is one of the most commonly used coefficients of agreement. It takes into account the probability of agreements that may occur by chance between raters, but it does not take into account the degree of disagreement between raters. For this reason, the Kappa coefficient has been weighted. When weighting the Kappa coefficient, weights are assigned according to the degree of disagreement. The two most commonly used weighting schemes are linear and quadratic. In linear weighting, weights are proportional to the deviation (difference) between the raters' scores, while in quadratic weighting, weights are proportional to the square of this deviation. Since it is easy to interpret, the quadratic-weighted Kappa (QWK) is quite common in practice, and it is frequently used in automated scoring research. Therefore, it was included in this research. This coefficient, which can be used when there are two or more score categories, can be misleadingly low when one score category is much more frequent than the others. This situation is defined as a prevalence problem in the literature and is the most frequently reported problem related to the Kappa coefficient. Besides prevalence, bias also affects the Kappa value. The bias problem arises when there is a difference between the frequencies of the raters' evaluations of a situation (Byrt, Bishop & Carlin, 1993; Eugenio & Glass, 2004). The quadratic-weighted Kappa can also be used to evaluate the agreement between automated scoring system scores and the human raters' agreed scores, and it typically takes values ranging from 0 to 1. A coefficient of 0 indicates no agreement beyond chance between the raters, while a coefficient of 1 indicates very good agreement; the value may drop below 0 when there is less agreement among the raters than would arise by chance (Altman, 1991; Brenner & Kliebsch, 1996; Graham, Milanowski & Miller, 2012; Preston & Goodman, 2012; Sim & Wright, 2005; Vanbelle, 2016). Landis and Koch (1977) specified a criterion for the interpretation of the Kappa coefficient, and Altman (1991) adapted this criterion. Accordingly, values are interpreted as follows: <.20 as "poor", .21-.40 as "fair", .41-.60 as "moderate", .61-.80 as "good", and .81-1.00 as "very good" agreement. Williamson, Xi, and Breyer (2012) suggest that the agreement between human raters and automated scoring systems should be over .70. The equations used by Wang, Wei, Zhou, and Huang (2018) and Preston and Goodman (2012) were used to calculate the quadratic-weighted Kappa value; detailed information can be obtained from these sources.
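The paper refers readers to Wang et al. (2018) and Preston and Goodman (2012) for the equations; for reference, the standard form of the quadratic-weighted Kappa for $k$ score categories is:

$$w_{ij} = \frac{(i-j)^2}{(k-1)^2}, \qquad \kappa_w = 1 - \frac{\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\, O_{ij}}{\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\, E_{ij}},$$

where $O_{ij}$ is the observed proportion of responses placed in category $i$ by one rater and category $j$ by the other, and $E_{ij}$ is the proportion expected by chance from the product of the two raters' marginal distributions.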
Gwet's AC1 Coefficient: Gwet's AC1 coefficient (Gwet, 2008) was developed in response to the paradoxes encountered with Cohen's Kappa coefficient. Skewness (prevalence) in the distribution of the data across categories, bias caused by the raters, and differences in the sensitivity and specificity of the raters reduce the ability of the Kappa value to detect the agreement between raters (Eugenio & Glass, 2004; Gwet, 2008). The AC1 coefficient differs from the Kappa coefficient in that it estimates the expected rate of chance agreement from the average marginal probability of each category. Thus, compared with the Kappa value, it is less affected by these paradoxes and is more stable against skewness between categories, that is, the variability between categories (Hoek & Scholman, 2017).
When there is imbalance and a lack of symmetry in the categories, the AC1 coefficient is more efficient at detecting the agreement between raters (Shankar & Bangdiwala, 2014). Gwet's AC1 coefficient can be used with categorical data regardless of the number of raters (Wongpakaran, Wongpakaran, Wedding & Gwet, 2013). The AC1 coefficient takes lower values than the percentage of agreement and higher values than the Kappa coefficient (Lacy, Watson, Riffe & Lovejoy, 2015). Gwet's AC1 coefficient can be interpreted through the criteria defined by Landis and Koch (1977) for the Kappa coefficient (Senay, Delisle, Raynauld, Morin & Fernandes, 2015; Siriwardhana, Walters, Rait, Bazo-Alvarez & Weerasinghe, 2018). Hoek and Scholman (2017) recommend that researchers report the AC1 value along with the Kappa value. In addition, Haley (2007) states that the AC1 coefficient is an efficient way to evaluate automated scoring systems. Therefore, this coefficient was included in the current study. The equation used to calculate Gwet's AC1 coefficient can be found in Gwet (2016).
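For reference, in the two-rater case with $Q$ score categories, Gwet's (2008) AC1 is computed as:

$$AC_1 = \frac{p_a - p_e}{1 - p_e}, \qquad p_e = \frac{1}{Q-1}\sum_{q=1}^{Q} \pi_q (1 - \pi_q), \qquad \pi_q = \frac{p_{1q} + p_{2q}}{2},$$

where $p_a$ is the observed proportion of agreement and $p_{1q}$ and $p_{2q}$ are the proportions of responses that raters 1 and 2 assign to category $q$. Estimating the chance-agreement term $p_e$ from the average marginal proportions $\pi_q$ is what makes AC1 less sensitive to prevalence than Kappa.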
When interpreting the coefficients of agreement, the prevalence of scores and the bias of raters are crucial; therefore, prevalence and bias indexes were calculated. Byrt, Bishop, and Carlin (1993) state that it is essential to take the prevalence and bias indexes into consideration so that the Kappa coefficient is not misleading. The prevalence index varies between -1 and 1; since its absolute value is used, values close to 1 decrease the Kappa value. The absolute value of the bias index varies between 0 and 1, and an increase in the bias index increases the Kappa value (Byrt, Bishop & Carlin, 1993). The prevalence and bias coefficients of all open-ended items in the A1 and B1 booklets were examined. The prevalence coefficients of items 2, 7, 14, and 19 in the A1 booklet and items 3 and 5 in the B1 booklet are high; consequently, the QWK values of these items are expected to be lower than the real level of agreement. Items 10 and 11 in the A1 booklet and items 8, 9, and 18 in the B1 booklet have the lowest prevalence coefficients, so their QWK values are expected to be closer to the real level of agreement. The bias values of all items in the A1 and B1 booklets are very low; therefore, it is very unlikely that the QWK values are higher than the real level of agreement.
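As a concrete reference for the dichotomous case, when the two raters' scores for an item are summarized in a 2×2 table with cell $a$ (both raters score 1), cell $d$ (both score 0), and cells $b$ and $c$ (the two disagreement cells), the indexes of Byrt, Bishop, and Carlin (1993) are:

$$\text{Prevalence index} = \frac{a - d}{n}, \qquad \text{Bias index} = \frac{b - c}{n}, \qquad n = a + b + c + d.$$

A prevalence index close to $\pm 1$ means that almost all responses fall into a single score category, which is exactly the situation that depresses QWK in the items listed above.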
The percentage of agreement, QWK, and AC1 coefficients were calculated with the "irr" (Gamer, Lemon, Fellows & Singh, 2010), "rel" (LoMartire, 2017), and "Metrics" (Hamner & Frasco, 2018) packages of the R program (R Core Team, 2018), respectively. The performances of the algorithms were compared by averaging the coefficients of agreement across all items. In addition, the performance of the algorithms was reviewed by averaging across the data rates used in testing the system.

FINDINGS
The coefficients of agreement for the open-ended items in the A1 booklet were first calculated between human rater groups 1 and 2 and the final scores. Then, the agreement between five different automated scoring algorithms and the final scores was examined while varying the data rates used in testing the automated scoring system. The results for the A1 booklet are shown in Table 2. The interpretation of one item (item 2) from the A1 booklet is given as an example; this item illustrates a situation in which there is a prevalence problem. The results for the other items in the A1 booklet can be evaluated from Table 2. In Table 2, the three coefficients with the highest agreement values are shown in bold, and the three coefficients with the lowest agreement values are shown in italics for each type of agreement coefficient.
When the values for item 2 in Table 2 are examined, it is seen that the percentage of agreement between the first human rater group and the final scores was .980, the AC1 index was .976, and the QWK value was .880. The percentage of agreement between the second human rater group and the final scores was .979, the AC1 index was .975, and the QWK value was .862.
When the agreement between the automated scores and the final scores of the human raters is examined at the 10% test data rate, the highest percentage of agreement, .941, was obtained with the BLSTM algorithm, followed by the MNB algorithm with .921. The lowest percentage of agreement, .913, was obtained with the LSTM algorithm. The percentages of agreement were close to each other and at acceptable levels (>.80). For the AC1 index, the algorithm with the highest agreement was the BLSTM algorithm with .931, followed by the LR algorithm with .910. The lowest AC1 value, .904, was observed for the SVM and LSTM algorithms. The AC1 values were close to each other and indicated very good agreement (>.80) for all algorithms. The highest QWK value was .569, obtained with the BLSTM algorithm, followed by the MNB algorithm with .448. The lowest QWK value, .061, was observed for the LSTM algorithm, followed by the LR algorithm with .223. The QWK values varied considerably among the algorithms, with a range of .508, and differed from the AC1 index and the percentage of agreement. Evaluated as a whole, the QWK values indicate that the BLSTM and MNB algorithms showed moderate agreement (.41-.60), the LR and SVM algorithms fair agreement (.21-.40), and the LSTM algorithm poor agreement (<.20).
At the 20% test data rate, the BLSTM algorithm showed the highest percentage of agreement with .942, while the MNB algorithm showed the lowest with .913. The percentages of agreement of all algorithms were very close to each other and at an acceptable level (>.80). In terms of the AC1 index, the highest agreement was found for the BLSTM algorithm with .933 and the lowest for the MNB algorithm with .899. The AC1 values were generally close, and all of them indicated very good agreement (>.80). For the QWK values, the algorithm with the highest agreement was the BLSTM algorithm with .593, and the algorithm with the lowest agreement was the LSTM algorithm with .147, followed by SVM with .212. As at the 10% test data rate, the QWK values were low and differed between the algorithms; their range at the 20% test data rate was .446. Overall, the QWK values show that the BLSTM algorithm had moderate agreement (.41-.60), the MNB, LR, and SVM algorithms fair agreement (.21-.40), and the LSTM algorithm poor agreement (<.20).
At the 33% test data rate, the highest percentage of agreement was obtained with the BLSTM algorithm (.934) and the lowest with the SVM algorithm (.909). In general, the percentages of agreement were high, close to each other, and acceptable (>.80). The AC1 indexes were also generally high; the highest agreement was in the BLSTM algorithm with .924 and the lowest in the SVM algorithm with .899. The values obtained for all algorithms were close to each other and indicated very good agreement (>.80). For the QWK values, the highest agreement was obtained with the BLSTM algorithm (.522), and the lowest two agreements were obtained with the SVM algorithm (.128) and the LSTM algorithm (.000). At the 33% test data rate, the QWK values were low and varied widely between algorithms, with a range of .522. These values show that the BLSTM algorithm had moderate agreement (.41-.60), the MNB and LR algorithms fair agreement (.21-.40), and the LSTM and SVM algorithms poor agreement (<.20). Figure 1 shows the agreement values obtained for item 2 in the A1 booklet according to the automated scoring algorithms and test data rates.

Figure 1. Graph showing Agreement Values for Item 2 in A1 Booklet according to Automated Scoring Algorithms and Test Data Rates
When Figure 1 is examined, for item 2, the QWK coefficient was considerably lower than the AC1 values and the percentage of agreement at all test data rates and for all automated scoring algorithms. The low QWK values, which approached .000 under some conditions, were due to the prevalence problem; therefore, QWK was not taken into consideration for this item. This was one of the situations predicted in the research. When all test data rates and automated scoring algorithms were compared, the agreement values were slightly higher at the 20% test data rate and slightly lower at the 33% test data rate, but the differences between them were very small. The percentages of agreement were above .80, the acceptable limit, under all conditions. The AC1 index indicated very good agreement under all conditions (>.80). The AC1 values were interpreted using the same criteria as the Kappa coefficient; accordingly, all AC1 coefficients were higher than the expected level of agreement between automated scoring and human raters (>.70; Williamson et al., 2012). Considering all the conditions for item 2 in Table 2, the highest percentage of agreement (.942) and the highest AC1 value (.933) were obtained with the BLSTM algorithm at the 20% test data rate. These values were close to the percentage of agreement and AC1 value between the human rater groups and the final scores. Due to the prevalence problem encountered in item 2, the QWK values calculated between the human raters and the final scores were also low; this situation was reflected even more negatively in the machine learning results.
The coefficients of agreement for the open-ended items in the B1 booklet were calculated in the same way as for the A1 booklet. The results are shown in Table 3. The interpretation of one item (item 5) in the B1 booklet is given as an example; the results for the other items in the B1 booklet can be evaluated from Table 3. In Table 3, the three coefficients with the highest agreement values are shown in bold, and the three coefficients with the lowest agreement values are shown in italics for each type of agreement coefficient.

For the 33% test data rate, the highest percentage of agreement was obtained with the BLSTM algorithm (.892), and the lowest with the LSTM algorithm (.784). The percentage of agreement was acceptable (>.80) for all algorithms except LSTM and MNB. According to the AC1 indexes, the highest agreement was in the BLSTM algorithm with .853; the lowest was in the LSTM algorithm with .718, followed by the MNB algorithm with .720. In terms of the AC1 indexes, very good agreement (>.80) was achieved for the BLSTM and SVM algorithms, and good agreement (.61-.80) for the LR, LSTM, and MNB algorithms. According to the QWK coefficient, the highest agreement was obtained with the BLSTM algorithm (.904), and the lowest two agreements were obtained with the MNB algorithm (.744) and the LSTM algorithm (.783). The QWK values indicated very good agreement (>.80) for the BLSTM, LR, and SVM algorithms and good agreement (.61-.80) for the LSTM and MNB algorithms. The QWK values were also greater than the AC1 indexes at the 33% test data rate.

Table 3 notes. * Common items in the A1 and B1 booklets. Note 1: P1: first rater group scores, P2: second rater group scores, PF: final scores. Note 2: PA: percentage of agreement, AC1: Gwet's AC1 coefficient, QWK: quadratic-weighted Kappa. Note 3: CV: cross-validation; 10%, 20%, and 33% show the test data rate. Note 4: Items 5, 6, 8, and 9 in this table correspond to items 7, 8, 10, and 11 in the A1 booklet, respectively.

When Figure 2 is examined, under all conditions, the coefficients of agreement of the MNB algorithm are lower than those of the other algorithms, while the coefficients of agreement of the BLSTM algorithm are higher than those of the other algorithms. The QWK value indicated very good agreement (>.80) at all test data rates for the BLSTM, LR, and SVM algorithms and at the 10% and 20% test data rates for the LSTM algorithm. It showed good agreement (.61-.80) at all test data rates for the MNB algorithm and at the 33% test data rate for the LSTM algorithm. Under all conditions, the AC1 values showed very good agreement (>.80) for the BLSTM and SVM algorithms and good agreement (.61-.80) for the LR, MNB, and LSTM algorithms. All AC1 coefficients for item 5 were lower than the QWK coefficients. The percentage of agreement showed acceptable values at all test data rates for the BLSTM, LR, and SVM algorithms and at the 10% and 20% test data rates for the LSTM algorithm. According to Williamson, Xi, and Breyer's (2012) criterion that the Kappa coefficient of agreement between human raters and automated scoring should be at least .70, the QWK values were acceptable for all algorithms and test data rates. When the same criterion was applied to the AC1 coefficient, acceptable values were also achieved for all algorithms and test data rates.
For item 5, the highest percentage of agreement (.918), AC1 value (.888), and QWK coefficient (.925) were obtained with the BLSTM algorithm at the 10% test data rate. These values are close to the percentage of agreement, AC1, and QWK values between the human rater groups and the final scores.

In order to make a general comparison between the automated scoring algorithms, the performance of the algorithms was averaged across the items. Table 4 shows the performances of the automated scoring algorithms at the different test data rates and the averages of these performances. In Table 4, for each coefficient of agreement, the coefficients showing the highest agreement at each test data rate and in the average performance are shown in bold, and the coefficients showing the lowest agreement are shown in italics.

When the averages over all test data rates are examined for each automated scoring algorithm and coefficient of agreement, the algorithm with the highest percentage of agreement and the highest AC1 and QWK values is BLSTM. In addition to having an acceptable percentage of agreement, the BLSTM algorithm showed very good agreement according to the AC1 coefficient and good agreement according to the QWK coefficient. The SVM, LR, and MNB algorithms had acceptable percentages of agreement and indicated good agreement according to the AC1 and QWK coefficients. The LSTM algorithm did not have an acceptable percentage of agreement, but it indicated good agreement in terms of the AC1 index and moderate agreement in terms of the QWK coefficient. As a result of both the evaluation of the item averages and the item-level evaluations, the best three automated scoring conditions were the BLSTM algorithm at the 10% test data rate, the BLSTM algorithm at the 20% test data rate, and the BLSTM algorithm at the 33% test data rate. Figure 3 shows the averages of the algorithms according to the test data rates. When Figure 3 is examined, the MNB and LSTM algorithms performed slightly worse than the other algorithms; the lowest performance was observed for the LSTM algorithm and the highest for the BLSTM algorithm.

RESULTS AND DISCUSSION
This research compared automated scoring algorithms while varying the data rate used in testing the system. For this purpose, the SVM, LR, MNB, LSTM, and BLSTM algorithms were compared with each other at 10%, 20%, and 33% test data rates. When comparing the algorithms, the agreement of the human raters with the final scores was taken into account; thus, the difference between human raters and automated scoring could be determined. For the ABIDE data, the results showed that the best automated scoring was achieved with the BLSTM algorithm. The LSTM and MNB algorithms had lower agreement values than the SVM, LR, and BLSTM algorithms. In their experiments on various classification algorithms, Kumar and Rama Sree (2014) found that the Naive Bayes algorithm had lower percentages of agreement than the LR and SVM algorithms, which supports the present findings. Gierl et al. (2014) stated that the QWK value was very good in the automated scoring process performed with the SVM algorithm; in the current study, the SVM algorithm indicated good agreement. Taghipour and Tou Ng (2016), comparing recurrent neural networks for automated scoring, found that the algorithm with the highest QWK value (.746) was LSTM, with the closest QWK value obtained by the BLSTM algorithm (.699). Similarly, in the current study, the QWK value of the BLSTM algorithm indicated good agreement; however, the LSTM algorithm showed only moderate agreement according to the QWK value. The reason may be that processing sentences in one direction (LSTM) versus in both directions (BLSTM) yields different results in the Turkish language. Even though the comparisons according to the test data rates showed that the coefficients of agreement decreased slightly at the 33% test data rate, the SVM, LR, MNB, and BLSTM algorithms indicated good or very good agreement under all conditions.
When the comparison was made according to the lowest acceptable agreement for automated scoring, the LR and BLSTM algorithms were at the desired level, and the SVM algorithm was very close to it. Considering the percentage of agreement of the system created in this research, it can be stated that this system performed better than the unsupervised machine learning-based method of Adesiji et al. (2016). Thus, it was concluded that open-ended items in the Turkish language can be scored automatically by selecting an appropriate automated scoring algorithm based on supervised machine learning. Although the automated scoring systems developed for languages with features similar to Turkish are not based on supervised machine learning, they have been used for similar purposes: Ishioka and Kameda (2006) and Jang et al. (2014) found a high level of correlation between the automated scoring system and human scores in the Japanese and Korean languages, respectively.
The automated scoring system created for the Turkish language can be used in large-scale tests. It was also stated that the automated scoring system created for Korean, a language similar to Turkish, can be used in large-scale tests (Jang et al., 2014). Based on the findings of the research, the recommendations for researchers and practitioners are as follows:
1. Automated scoring, which was attempted for the first time in the Turkish language and appears to be usable, can be adopted in large-scale tests after further development of the system and a pilot scheme; in this way, exam costs can be reduced and results can be announced more quickly.
2. Among the automated scoring algorithms, BLSTM and LR algorithms can be preferred for data having similar characteristics to the data used in this study.
3. In automated scoring, it can be suggested that MNB and LSTM algorithms should not be used in data having characteristics similar to the data used in this study.
4. This research reflects automated scoring results obtained with at least 400 training data. In future studies, the effect on the coefficients of agreement can be evaluated by carrying out automated scoring with fewer training data. Moreover, after an automated scoring process with a large number of training data in large samples (>1000 or >3000), the effect on automated scoring can be examined by gradually reducing the training data.
5. In subsequent studies, automated scoring results obtained when spelling errors in the data are corrected can be compared with results obtained when they are not corrected.
6. In subsequent studies conducted on paper-pencil tests, the results obtained by data entry via OCR systems and manual data entry can be compared.
7. Within the scope of the research, items with two and three categories were studied. In later studies, the results of automated scoring systems can be examined when the number of categories is increased.