Comparison of Different Ability Estimation Methods Based on 3 and 4PL Item Response Theory

This research analyzed dichotomous (two-category) Item Response Theory (IRT) models in terms of different ability estimation methods. The research was based on responses to the 20 items of the Mathematics subtest of the TEOG (National Transition from Primary to Secondary Education) exam taken by 8th-grade students in 2015-2016. The study group consisted of 4000 students randomly selected from the students who participated in the TEOG exam. Ability estimations and the standard error values of these estimations were calculated from the data. These estimations were compared by two-way analysis of variance (ANOVA) for repeated measurements. The findings revealed that the four-parameter logistic (4PL) model fit the data better. In terms of ability estimation methods, the accuracy of Weighted Likelihood Estimation (WLE) was higher than that of Maximum A Posteriori (MAP) and Expected A Posteriori (EAP). The WLE and MAP ability estimation methods gave the lowest standard error values under the 4PL and 3PL models, respectively. The highest marginal reliability coefficient for the 3PL model was obtained with the MAP estimations, while for the 4PL model it was obtained with the WLE estimations. It was concluded that the accuracy of the ability scores obtained by the WLE estimation method under the 4PL model was higher.


Introduction
Item Response Theory (IRT) describes the relation between an individual's ability level, the characteristics of an item, and the individual's responses to that item. IRT is based on the assumption that individuals' abilities can be estimated independently of the items (Hambleton & Swaminathan, 1985). For dichotomous responses, IRT models comprise the Rasch model and the 1, 2, and 3 Parameter Logistic (PL) models. In addition to these models, the IRT literature includes the 4PL model. Analyses of test item characteristics under IRT have shown that the use of additional item parameters increases the accuracy and precision of the estimates of the parameters characterizing individuals. Kılıç (1999) found that the 3PL model was more compatible with the 1993 ÖSS (National Student Selection Exam in Turkey) data than the 1PL and 2PL models. Similarly, ability values estimated under the 3PL model for the Turkish and Social Sciences subtests of the 2002 OKS (National Secondary School Institutions Student Selection and Placement Test in Turkey) were shown to be more invariant than the values estimated under the 1PL and 2PL models (Can, 2003). These and other studies drew similar inferences by estimating item parameters under the 1, 2, and 3PL models only (Barton & Lord, 1981; Baykul, 1979; Berberoğlu, 1988; Can, 2003; Kılıç, 1999; Reise & Waller, 2003; Yapar, 2003; Yeğin, 2003).
The 3PL model, which is quite popular among IRT models, is one of the unidimensional IRT models developed for dichotomous responses. In this model, developed by Birnbaum (1968), the probability of a correct response to item i for an individual j at ability level θ is calculated as follows:
$$P_i(\theta_j) = c_i + (1 - c_i)\,\frac{e^{a_i(\theta_j - b_i)}}{1 + e^{a_i(\theta_j - b_i)}}$$
Here, $a_i$ is the discrimination parameter for item i, $b_i$ is the difficulty of item i, and $c_i$ is the probability of a correct response for an individual at the lowest ability level, that is, success by chance. The b parameter usually takes a value between -2.00 and +2.00. Although the a parameter can theoretically range from −∞ to +∞, it usually takes a value between 0 and 2 (Hambleton & Swaminathan, 1985). Items with a negative a parameter should be omitted from the test (DeMars, 2010). In the 1PL and 2PL models, the probability of a correct response approaches 0 when an individual at a low ability level responds to a difficult item, and approaches 1 when an individual at a high ability level responds to an easy item. Nevertheless, this assumption may not always hold. An individual who knows nothing could still select the correct answer by chance (Bar-Hillel, Budescu, & Attali, 2005; Gardner-Medwin & Gahan, 2003; Yen, Ho, Chen, Chou, & Chen, 2010). In addition, students at a high ability level may on occasion miss items that they should have answered correctly because they are anxious, careless, or distracted by poor testing conditions, or because they otherwise answer the item incorrectly (Hockemeyer, 2002; Rulison & Loken, 2009). Under these conditions, the 3PL model may assign a low ability estimate to a high-ability student who makes a careless mistake on an easy item (Barton & Lord, 1981; Rulison & Loken, 2009). More specifically, the lower asymptote in the 3PL model can accommodate a situation in which a student at a low ability level guesses correctly on a difficult item. However, the upper asymptote of 1 in the 3PL model assigns a probability approaching 0 to the event that a student at a high ability level fails an easy item.
Another IRT model is the 4PL model, developed by adding an inattention parameter to the 3PL model (Barton & Lord, 1981). According to this model, the probability of a correct response to item i is as follows:
$$P_i(\theta_j) = c_i + (d_i - c_i)\,\frac{e^{a_i(\theta_j - b_i)}}{1 + e^{a_i(\theta_j - b_i)}}$$
In this equation, the upper asymptote $d_i$ is the inattention parameter. In addition to the a, b, and c parameters, the d parameter (the upper asymptote) is allowed to take values less than 1.00; theoretically, it can lie between 0.00 and 1.00. With an upper asymptote below 1.00, the position of a high-ability student on the ability scale does not change significantly when that student answers an easy item incorrectly. In other words, the d parameter captures the probability that a student at a high ability level answers a low-difficulty item incorrectly. Barton and Lord (1981) examined whether changing the upper asymptote could increase measurement precision on standardized tests. Rulison and Loken (2009) showed that the 4PL model (with the upper asymptote d=0.98) may decrease estimation error for students at a high ability level who get off to a bad start. In this case, the 4PL model offers individuals an opportunity to recover from inattentive errors in a Computerized Adaptive Test. To study the general implementation of the 4PL model in detail, Loken and Rulison (2010) estimated item parameters for this model and evaluated its fit and performance with empirical data from outside standardized testing. In that research, the 4PL model was successfully applied to an empirical measure of adolescent delinquency.
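To make the role of the two asymptotes concrete, the following minimal Python sketch evaluates the response probability under the 3PL and 4PL models for a single item. The function name `icc_4pl` and the item parameter values are assumptions chosen for illustration; they are not taken from the study, and no scaling constant is applied to the logit.

```python
import math

def icc_4pl(theta, a, b, c=0.0, d=1.0):
    """Probability of a correct response under the 4PL model.

    With d = 1.0 this reduces to the 3PL model; with c = 0.0 as well,
    it reduces to the 2PL model.
    """
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical easy item (illustrative values, not from the study).
a, b, c = 1.5, -1.0, 0.2

for theta in (-3.0, 0.0, 3.0):
    p3 = icc_4pl(theta, a, b, c)           # 3PL: upper asymptote fixed at 1
    p4 = icc_4pl(theta, a, b, c, d=0.98)   # 4PL: upper asymptote d = 0.98
    print(f"theta={theta:+.1f}  3PL={p3:.4f}  4PL={p4:.4f}")
```

For high θ the 3PL probability of an incorrect response shrinks toward 0, whereas under the 4PL it stays near 1 − d, which is why a single careless error penalizes a high-ability examinee less under the 4PL.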
The Weighted Likelihood Estimation (WLE) method maximizes a weighted likelihood function over the range of possible ability values. The weight function in this method acts as a bias correction term (Warm, 1989).
According to the Maximum A Posteriori (MAP) method, the ability estimate of an individual is the value that maximizes the posterior probability density function. This method yields lower standard error values even when an individual answers all items correctly or all items incorrectly (Hambleton & Swaminathan, 1985).
Unlike the MAP method, the Expected A Posteriori (EAP) method is not iterative. Both the EAP and MAP methods use the posterior distribution, but EAP takes the mean of the posterior distribution whereas MAP takes its mode. With this method, the assumption of normality and complex iterative calculations are not required at every stage of the estimation. EAP also yields ability estimates for individuals who answer none or all of the test items correctly, that is, individuals with zero and full scores (Embretson & Reise, 2000; Hambleton, Swaminathan & Rogers, 1991).
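As a rough illustration of how the three estimators described above differ, the following Python sketch computes MAP, EAP, and WLE on a grid of θ values for one response pattern. It is not the procedure used in the study. The item parameters, the grid bounds, the standard normal prior, and all function names are assumptions made for the example. The WLE part follows Warm's (1989) formulation, in which the derivative of the log weight is J/(2I), and obtains the log weight by trapezoidal integration over the grid.

```python
import math

def p_4pl(theta, a, b, c=0.0, d=1.0):
    """4PL probability of a correct response (d = 1.0 gives the 3PL)."""
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Log-likelihood of a 0/1 response pattern at ability theta."""
    total = 0.0
    for u, (a, b, c, d) in zip(responses, items):
        p = p_4pl(theta, a, b, c, d)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

def warm_term(theta, items):
    """J / (2 I): derivative of Warm's log weight, from analytic 4PL derivatives."""
    info, j = 0.0, 0.0
    for a, b, c, d in items:
        s = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        p = c + (d - c) * s
        dp = (d - c) * a * s * (1.0 - s)                         # first derivative of P
        d2p = (d - c) * a * a * s * (1.0 - s) * (1.0 - 2.0 * s)  # second derivative of P
        info += dp * dp / (p * (1.0 - p))
        j += dp * d2p / (p * (1.0 - p))
    return j / (2.0 * info)

def estimate(responses, items, lo=-4.0, hi=4.0, n=801):
    """Grid-based MAP, EAP (standard normal prior) and WLE estimates."""
    step = (hi - lo) / (n - 1)
    grid = [lo + k * step for k in range(n)]
    loglik = [log_likelihood(t, responses, items) for t in grid]

    # MAP: mode of the posterior (log-likelihood plus log prior).
    logpost = [ll - 0.5 * t * t for ll, t in zip(loglik, grid)]
    map_est = grid[max(range(n), key=logpost.__getitem__)]

    # EAP: mean of the posterior, computed without iteration.
    top = max(logpost)
    weights = [math.exp(v - top) for v in logpost]
    eap_est = sum(t * w for t, w in zip(grid, weights)) / sum(weights)

    # WLE: maximize log-likelihood plus log weight; the log weight is the
    # cumulative trapezoidal integral of J / (2 I) along the grid.
    term = [warm_term(t, items) for t in grid]
    logw = [0.0]
    for k in range(1, n):
        logw.append(logw[-1] + 0.5 * step * (term[k - 1] + term[k]))
    weighted = [ll + lw for ll, lw in zip(loglik, logw)]
    wle_est = grid[max(range(n), key=weighted.__getitem__)]

    return {"MAP": map_est, "EAP": eap_est, "WLE": wle_est}

# Hypothetical items as (a, b, c, d); values are illustrative only.
items = [(1.2, -1.0, 0.2, 0.98), (0.9, -0.5, 0.2, 0.98), (1.5, 0.0, 0.2, 0.98),
         (1.1, 0.5, 0.2, 0.98), (1.3, 1.0, 0.2, 0.98)]
print(estimate([1, 1, 1, 0, 1], items))
```

Because MAP and EAP include a prior centered at 0, they tend to pull estimates toward the prior mean, while WLE applies only a bias correction to the likelihood; this is one reason the three methods can give systematically different means for the same data.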
Studies that estimated the 1, 2, and 3PL models and tested model fit suggest that the 3PL model is the best-fitting one. One of these is Çelik (2001), which analyzed the fit of the 1, 2, and 3PL models using data from the Mathematics and Science subtests of the National Secondary School Institutions Student Selection and Placement Test administered by the Republic of Turkey Ministry of National Education (MEB). That research concluded that the best-fitting model for the Mathematics subtest was the 3PL model. Another study on this subject is Önder (2007), which explored the best-fitting IRT model using data from the Science Test of the Özdebir ÖSS 2004 D-II exam. Similarly, Taşdelen Teker, Kelecioğlu and Eroğlu (2013) found that, among the dichotomous IRT models, the 3PL model fit best for data from the Science subtest of the 2009 Placement Test. However, only a few studies have also incorporated the 4PL model for binary scored items within the framework of IRT. In one of these studies, item and ability parameters estimated under the 1, 2, 3, and 4PL models were compared. That research revealed that estimation under the 4PL model had a lower standard error than under the other three models and that the ability parameter was estimated more accurately under this model (Magis, 2013). Another study, which used the Low Self-Esteem (LSE) scale of the Minnesota Multiphasic Personality Inventory Adult Form (MMPI-2), suggested that parameters were better estimated under the 4PL model (Reise & Waller, 2003). Ability estimation under the 4PL model has also been recommended for studies using Computerized Adaptive Testing (CAT) because it yields lower standard error values (Rulison & Loken, 2009; Yen, Ho, Laio, Chen & Kuo, 2012).
Within this general framework, the present research compared ability estimation methods under the 3PL model, which accounts for the probability that an individual at a low ability level answers an item correctly, and the 4PL model, which additionally accounts for the probability that an individual at a high ability level answers an easy item incorrectly due to inattention. The aim of this research was therefore to determine the best-fitting IRT model and ability estimation method based on real data. In line with this aim, the research questions of the present study were as follows:
1. What are the ability estimations made according to the ability estimation models and methods, and the standard error values of these estimations?
2. Which of the 3PL and 4PL models fits the data best?
3. Does the accuracy of the ability estimations vary significantly according to the estimation models and methods?
4. Do the standard error values of the ability estimations vary significantly according to the estimation models and methods?

Method
This was a descriptive study that comparatively analyzed model-data fit and the accuracy of item-parameter estimations under the 3 and 4PL models.

Study Group
This research was based on data obtained from the Mathematics subtest of the TEOG exam held in the 2015-2016 school year. TEOG is an exam administered to 8th-grade students in two semesters and consists of Turkish, Mathematics, Science, History of Turkish Revolution, Foreign Language, and Religion subtests. The analysis was carried out on a study group of 4000 students, selected at random for this research by the Directorate General for Measurement, Assessment, and Examination Services of the Republic of Turkey Ministry of National Education after cases with missing data and full scores had been removed.

Data Collection Tool
The data collection tool was the 20-item Mathematics subtest of the TEOG exam administered to 8th-grade students in the fall semester of the 2015-2016 school year. Test and item statistics for the Mathematics subtest according to Classical Test Theory (CTT) are given in Table 1.
(MEB, 2016) x̄b: mean of item difficulty; x̄a: mean of item discrimination index
As seen in Table 1, the Mathematics subtest is of medium difficulty and can distinguish between the lower and upper groups as desired. The test has high reliability.

Analysis of Data
Before the EAP, WLE, and MAP ability estimation methods were compared under the 3 and 4PL models using data from the Mathematics subtest of TEOG held in 2015, this research tested whether the IRT assumptions were met. To this end, the unidimensionality assumption was examined with exploratory factor analysis (EFA) in Mplus 8, and the resulting fit values are shown in Table 2. As seen in Table 2, the TEOG Mathematics subtest had a unidimensional factor structure. The eigenvalue graph obtained from the EFA is shown in Figure 1.

Figure 1. Eigenvalues graph as a result of EFA
As seen in Figure 1, only one factor had an eigenvalue greater than 1. The sharp drop visible in the graph also indicates that the Mathematics subtest is unidimensional.
To confirm the unidimensional structure of the Mathematics subtest, confirmatory factor analysis was applied. The results were as follows: [χ2=1577.492*, df=170, χ2/df=9.28, RMSEA=0.04, CFI=0.97, TLI=0.96]. The calculated goodness-of-fit values indicated that the unidimensional structure of the Mathematics subtest was valid for this research (Cole, 1987; Kline, 2005).
Q3 statistics were calculated to test the local independence assumption. The Q3 statistics for all item pairs formed from the 20 items were below the critical value of 0.20 (Q3min= -0.13, Q3max= 0.11; DeMars, 2010, p.50; De Ayala, 2009, p.134). These results indicate that the items are statistically independent and that the local independence assumption is met.
Item parameters estimated according to the 3 and 4PL models are shown in Table 3. As seen in Table 3, under the 3PL model the a parameter varied from -0.51 to 2.76, the b parameter from -2.97 to 2.24, and the c parameter from 0.00 to 0.36. Under the 4PL model, the item discrimination parameter varied from -0.47 to 1.84, the difficulty parameter from -2.87 to 3.03, the pseudo-chance parameter from 0.00 to 0.37, and the d parameter from 0.52 to 1.00. Most items had high discrimination under both models, and most items had c parameter values different from zero. However, item 16 should be excluded from the test because its discrimination value was negative under both models. Item difficulty estimates under the 4PL model were lower than those under the 3PL model. d parameter estimates lower than 1.00 indicate the extent to which students at a high ability level answered the item incorrectly. According to Table 3, the d parameter took values different from 1.00. The item with the highest probability of an incorrect response due to inattention was item 17, with d=0.52.
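The Q3 check for local independence mentioned above can be sketched as follows: for each item pair, Q3 is the correlation across persons of the residuals between observed responses and model-implied probabilities. This Python example is only an illustration under assumed inputs. The toy responses, ability estimates, item parameters, and function names are hypothetical, and the 0.20 cutoff is the critical value cited in the text.

```python
import math

def p_4pl(theta, a, b, c=0.0, d=1.0):
    """4PL probability of a correct response (d = 1.0 gives the 3PL)."""
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def q3_statistics(responses, thetas, items):
    """Q3 for every item pair; responses[j][i] is the 0/1 score of person j on item i."""
    residuals = [[u - p_4pl(t, *item) for u, item in zip(row, items)]
                 for row, t in zip(responses, thetas)]
    q3 = {}
    for i in range(len(items)):
        for k in range(i + 1, len(items)):
            q3[(i, k)] = pearson([r[i] for r in residuals],
                                 [r[k] for r in residuals])
    return q3

# Hypothetical toy data: 6 persons, 3 items given as (a, b, c, d).
thetas = [-1.5, -0.5, 0.0, 0.5, 1.0, 1.5]
items = [(1.0, -0.5, 0.2, 1.0), (1.2, 0.0, 0.2, 0.98), (0.8, 0.5, 0.2, 1.0)]
responses = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 1], [1, 1, 1], [1, 1, 0]]

q3 = q3_statistics(responses, thetas, items)
flagged = [pair for pair, value in q3.items() if abs(value) >= 0.20]
print(q3, flagged)
```

With only six persons the toy values are unstable; the point is the computation, not the numbers.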
In the analysis of the research data, ability estimations and their standard error values were first obtained for each estimation method. Subsequently, the item and test information functions at each ability point were calculated, and marginal reliability coefficients were obtained for each estimation method. Analyses were carried out using the mirt (Multidimensional Item Response Theory) package in R (Chalmers, 2013). Furthermore, SPSS 20 was used to test the differences between the estimation methods. Significance tests were carried out at the 0.001 level.

Findings on the Ability Estimations and Standard Error Values
First, descriptive statistics of the research variables were calculated. Mean, minimum, maximum, skewness, and kurtosis values were calculated for the ability estimations made under the 3 and 4PL models and for their standard error values. The results are given in Table 4. As seen in Table 4, the estimation method with the highest mean ability estimate under both the 3 and 4PL models was WLE (0.10 and 0.13, respectively). For the 3PL model, the highest mean standard error of the ability estimations was also obtained with the WLE method (x̄=0.47). For the 4PL model, the highest mean standard error was obtained with the EAP method (x̄=0.45). The ranges of the ability estimations made under the 3 and 4PL models were close to each other for each ability estimation method.

Findings on the Fitness of the 3 and 4PL Models to Data
To find out which of the 3 and 4PL models was more compatible with the data, the models were compared pairwise using the calculated -2loglik, AIC, BIC, and RMSEA values. The results are given in Table 5. As seen in Table 5, the -2loglik, AIC, and BIC values calculated for the 4PL model were lower than those for the 3PL model. This indicated that the 4PL model fit better than the 3PL model. The difference between the 4PL and 3PL models was also statistically significant (χ2(20)=127.54, p<.05). Thus, the 4PL model fit the data better than the 3PL model.
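The reported chi-square difference can be checked with a likelihood-ratio test: the difference in -2loglik between nested models is referred to a chi-square distribution whose degrees of freedom equal the number of added parameters (here 20, one d parameter per item). The sketch below is an illustration only. It uses the closed-form upper-tail probability that holds for even degrees of freedom, and the function name is an assumption.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df.

    For df = 2m: Q(x) = exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k      # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total

delta_deviance = 127.54   # difference in -2loglik reported in the text
delta_params = 20         # one extra (d) parameter for each of the 20 items
p_value = chi2_sf_even_df(delta_deviance, delta_params)
print(p_value, p_value < 0.05)
```

The resulting p-value is far below the .05 threshold, consistent with the significant difference reported for the two models.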

Findings on the Accuracy of the Ability Estimations Values
It was examined whether the accuracy of the ability estimations made with the EAP, WLE, and MAP methods under the 3 and 4PL models differed significantly. Considering the descriptive statistics given in Table 4 and the large size of the study group, the ability estimations and their standard error values were judged to be normally distributed. The results of the two-way analysis of variance (ANOVA) for repeated measurements are given in Table 6. As seen in Table 6, the interaction between IRT model and ability estimation method was significant [F(2, 15996)=34.42, p<0.001]. This indicated that the model and the estimation method jointly affected individuals' ability scores; in other words, the differences between estimation methods depended on the IRT model used. Another finding was a significant difference [F(2, 15996)=1621.26, p<0.001] between the mean scores obtained with the different ability estimation methods across the 3 and 4PL models. In other words, the ability estimations changed significantly depending on whether the EAP, WLE, or MAP method was used, regardless of the IRT model. The Bonferroni test, one of the multiple comparison tests, was applied to determine which ability estimation methods differed from each other. The mean ability scores of all estimation methods were found to be statistically different from each other. According to the results of this test, under the 3PL model the mean of the WLE ability estimations (x̄=0.10) was higher than the means of the EAP (x̄=0.00) and MAP (x̄=0.08) ability estimations. Under the 4PL model, the mean of the WLE ability estimations (x̄=0.13) was likewise higher than the means of the EAP (x̄=0.002) and MAP (x̄=0.08) ability estimations. Furthermore, the model variable had no significant effect on the ability estimation scores [F(1, 7998)=0.12, p>0.001]. According to this finding, individuals' ability estimation scores did not change significantly depending on whether the 3PL or the 4PL model was used.

Findings on the Accuracy of Ability Estimations and Their Standard Error Values
Two-way analysis of variance (ANOVA) was applied to determine the differences between the standard error values of the MAP, EAP, and WLE ability estimation methods under the 3 and 4PL models. The results are given in Table 7. As seen in Table 7, the ability estimation model, the ability estimation method, and the model-method interaction all had significant effects on the standard error values of the ability estimations [F(1, 23994)model=1121.17, p<0.001; F(2, 23994)estimation method=429.27, p<0.001; F(2, 23994)model-estimation method=412.00, p<0.001]. The Bonferroni test, one of the multiple comparison tests, was applied to examine the differences between the ability estimation models, the ability estimation methods, and the model-method interaction. All estimation methods were found to be statistically different from each other. First, comparing the mean standard errors across ability estimation models, the mean standard error of the ability estimations made under the 3PL model (x̄=0.44) was higher than that of the ability estimations made under the 4PL model (x̄=0.37). Secondly, comparing the ability estimation methods, the standard error values obtained by the EAP method (x̄=0.45) were higher than those obtained by the WLE method (x̄=0.40) and the MAP method (x̄=0.37). Finally, regarding the effect of the model-method interaction on the standard error values of the ability estimations, for the 3PL model the highest mean standard error was obtained with the WLE method (x̄=0.47) and the lowest with the MAP method (x̄=0.41). For the 4PL model, the highest mean standard error was obtained with the EAP method (x̄=0.45) and the lowest with the WLE method (x̄=0.33). Marginal reliability coefficients calculated for each estimation method under the 3 and 4PL models are given in Table 8.

Table 7. Two-Way Analysis of Variance (ANOVA) Results of the Standard Error Values of the Ability Estimations Made According to the 3 and 4PL Model
As seen in Table 8, for the 3PL model the highest and lowest marginal reliability coefficients were obtained with the ability scores estimated by MAP and WLE, respectively, whereas for the 4PL model the highest and lowest marginal reliability coefficients were obtained with the ability scores estimated by WLE and EAP, respectively. Under both IRT models, the marginal reliability coefficients of the ability scores estimated by the MAP and EAP methods were very close to each other.

Discussion
This research used the responses of 4000 students to the Mathematics subtest of the TEOG exam in the 2015-2016 school year to examine which of the 3 and 4PL models fit the data better, and to analyze the ability estimations obtained with the MAP, EAP, and WLE methods under these models, together with their standard error values and marginal reliability coefficients.
Model-data fit was compared using the -2loglik, AIC, BIC, and RMSEA indices. Three of these indices (-2loglik, AIC, and BIC) indicated that the 4PL model fit better than the 3PL model. The RMSEA value was the same for the 3 and 4PL models. This finding was in line with those reported in previous studies. Loken and Rulison (2010) also carried out parameter estimation with the 4PL model and found that it fit better than the 3PL model. Similarly, Erdemir (2015) reported that the 4PL model was the best-fitting model in terms of model-data fit. In contrast, Barton and Lord (1981) and Yalçın (2018) suggested that the 3PL model fit better than the 4PL model. Barton and Lord (1981) attributed this to the fact that the d parameter could not be estimated freely and was therefore fixed at a single value for all items. Furthermore, Yalçın (2018) carried out parameter estimation with a different type of model, the MixIRT model.
The research also analyzed the accuracy of individuals' ability estimations based on the scores obtained by the MAP, EAP, and WLE methods under the 3 and 4PL models. The results showed that the accuracy of the ability estimation scores differed significantly across the estimation methods. The accuracy of the scores obtained by the WLE method was higher than that of the scores obtained by the MAP method, which in turn was higher than that of the scores obtained by the EAP method. There are contradictory findings in the relevant literature. Çetin and Çelikten (2016) reported that the methods making the most accurate estimations were, in order, MAP, EAP, WLE, and ML. Both the present study and the cited study showed that MAP made more accurate estimations than EAP. This finding was also supported by various studies (Wang & Vispoel, 1998; Wang & Wang, 2001; Finch & French, 2012; Seong, Kim & Cohen, 1997). On the other hand, Borgatto et al. (2015) reported that the WLE method gave the best results when estimating the abilities of individuals at a high ability level on low-difficulty tests. According to the item analysis performed for the TEOG exam used in the present research, the item difficulty values were at a low level. This finding was in line with the findings of Wang and Wang (2001), who argued that the WLE method made estimations with lower bias than the EAP and MAP methods for fixed-length tests in CAT applications.
In the present research, the IRT model, the ability estimation method, and the model-method interaction were found to have significant effects on the standard error values of the ability estimations. The standard error of the ability estimations made under the 3PL model was higher than that of the ability estimations made under the 4PL model. In other words, the lowest standard error was obtained with the ability estimations made under the 4PL model. This finding was in line with those obtained in similar studies (Liao, Ho, Yen, & Cheng, 2012; Rulison & Loken, 2009). For instance, when Erdemir (2015) used the 4PL model instead of the 3PL model, the standard error of the ability estimate became lower. Accordingly, it can be inferred that the accuracy of the estimation increased. Another finding was that, in terms of the standard error values of the estimations, the methods ranked in the order EAP, WLE, and MAP. This finding supports the view that the systematic error of the MAP estimation method was higher than the systematic error of the EAP estimation method (Çetin & Çelikten, 2016). In this sense, it can be inferred that the accuracy of estimation increased as the ability range increased. Moreover, regarding the model-method interaction affecting the standard error values of the ability estimations, the highest mean standard error under the 3PL model was obtained with WLE estimation, while the highest mean standard error under the 4PL model was obtained with EAP estimation.
Lastly, the highest marginal reliability coefficient value according to the 3PL model was calculated with the ability values obtained by the MAP estimation method while the highest marginal reliability coefficient value according to the 4PL model was calculated with the ability values obtained using WLE estimation method. In this sense, marginal reliability coefficients showed similarity with the order of accuracy of standard errors of the ability estimations. This might be caused by the mean reversion of the marginal reliability coefficient.

Conclusion and Implication
In this research, IRT-based model-data fit, ability estimations and their standard error values, and the marginal reliability coefficients of the test were analyzed using the responses of 8th-grade students to the 20 items of the Mathematics subtest of the TEOG exam in the 2015-2016 school year. Overall, the findings showed that the 4PL model fit better and that the standard error values of the WLE and MAP ability estimation methods were lowest under the 4PL and 3PL models, respectively. Furthermore, the reliability coefficients obtained with these estimation methods under the respective models were higher.
Individuals' estimated ability scores are used in the evaluation stage of large-scale exams such as TEOG, which are very important for determining success and competence and for selection and placement. Accordingly, it can be suggested that calculating these scores under the 4PL model with the WLE ability estimation method may provide more accurate results. Carrying out similar research on the large-scale exams held since 2016 may contribute to the precision of results. Moreover, only the EAP, MAP, and WLE ability estimation methods were analyzed in this research. The results may be extended by testing other types of Bayesian methods.