Bibliometric Analysis of Articles on Computerized Adaptive Testing

Accepted: 17.05.2021

Presenting items suited to each test taker's ability level with the support of computer programs, instead of paper-and-pencil tests, may help to obtain more accurate results for students. Computerized adaptive tests (CAT), which are developed on certain assumptions in this direction, aim to create an optimum test for every person taking the exam. It is therefore essential to examine the development process of such important exams and to monitor which studies contributed to this development in which years. CiteSpace is a program developed to map information fields, explain the relationship between different disciplines, examine and estimate the studies in a certain period of time, uncover the latest studies and predict emerging trends based on the analysis of bibliographic records of related publications. This study aims to find out in which areas and time periods articles about CAT were produced and which articles had a significant effect in these periods. The CiteSpace program was used to carry out a document/article co-citation analysis. Articles on CAT published between 1946 and 2016 were searched with search terms combined by the “or” connector. A total of 637 articles were used, and the analyses were finalized according to the networks. As a result of the research, clusters were determined based on the relationships in the citations, and the most cited and most important articles among studies on CAT were presented.


Introduction
The measurement and evaluation process can be handled in three classes, diagnostic, formative and summative, according to its purpose (Crisp, 2007). Different measurement and evaluation tools are used in all of these processes, and tests are among the tools most widely used to achieve their goals. In the in-class measurement and evaluation process, tests generally take the form of paper-and-pencil tests. However, paper-and-pencil tests may not always provide the desired level of quality. For example, a student's ability may not be determined correctly from a difficult test, and correct inferences about the psychometric properties of a test may not be obtained when it is administered to a successful group. In addition, it may not be economical to apply the paper-and-pencil test to everyone at the same time and with the same number of questions. Such negative situations may cause misinterpretations and false inferences about individuals or the test. In this respect, presenting items that are suitable for each individual's own ability level with the support of computer programs, instead of paper-and-pencil tests, may help to obtain more accurate results for students.
There are basically two approaches to the development of tests: Classical Test Theory (CTT) and Item Response Theory (IRT) (Hambleton & Jones, 1993; Thompson & Weiss, 2011). Each approach has advantages depending on the purpose for which it is used. In CTT, tests are developed depending on the group and applied to each student at the same difficulty and discrimination level. However, the difference in the levels of students, which is the most important element of education, should be taken into consideration. In tests developed according to the IRT approach, the abilities of the individuals can be determined independently from the items asked, and information about the items of the test can be obtained regardless of the success of the group (Hambleton, Swaminathan, & Rogers, 1991).
With the development of computer technology, the calculations of IRT approaches have been built into computer systems, and computerized adaptive testing (CAT) has begun to be used (Meijer & Nering, 1999; Thompson & Weiss, 2011; Weiss & Kingsbury, 1984). In CAT applications, questions with appropriate psychometric characteristics are presented to determine the ability level of each individual (Wise, et al., 1992). Once the individual's ability is determined, the exam is completed. The exam will therefore be of different length for each individual (Meijer & Nering, 1999). Thus, it will be possible to make a more qualified measurement with fewer items (Meijer & Nering, 1999; Wainer, 1993). Although the development or evaluation of CAT requires large samples and is costly, CAT applications have more advantages than disadvantages (Meijer & Nering, 1999). Owing to these benefits, CAT is applied in many areas. Studies on CAT in measurement and evaluation in education can be gathered under the following areas: comparison of item selection methods (Barrada, et al., 2010; Deng, Ansley & Chang, 2010; Han, 2010, 2012; Lee & Dodd, 2012; Sulak, 2013; Veldkamp, 2010); comparison with paper-pencil test results (Kalender, 2011; Kezer, 2013; Smits, Cuijpers & van Straten, 2011); comparison of termination rules (Babcock & Weiss, 2012; Choi, Grady & Dodd, 2011; Eroğlu, 2013; Yao, 2013); determination of item pool characteristics (He, Diao & Hauser, 2014; Lee & Dodd, 2012; van der Linden & Xiong, 2013); the study of differential item functioning (Gierl, Lai & Li; González-Betanzos, Abad, & Barrada, 2014; Piromsombat, 2014); and investigation of the relationship with cognitive diagnosis models (Cheng, 2009; Huebner, 2010; Hsu & Wang, 2015).
In addition to such studies, a study showing for which disciplines and time periods the studies about CAT in the literature have been designed and conducted will give an idea to many researchers and help them to direct their fields of study. It is important to examine the development process of such important exams and to monitor which studies in the relevant literature contributed to this development in which years. Therefore, this study aims to reveal the most common study areas of CAT, the most common time periods and the most important articles of the specified period of time. This study will shed light on the studies related to interdisciplinary CAT applications and hence is believed to contribute to the literature.

Method
The aim of the research is to classify the articles on CAT and its applications and to reveal the network structure between these articles. In addition, it aims to reveal which articles were most popular in specific periods of the timeline. For this purpose, it has been designed as bibliometric research to obtain quantitative measurements and indicators. Bibliometric studies are used to compare research in numerous areas (Besimoğlu, 2015) and to evaluate and follow scientific processes (Gmür, 2003; Mongeon & Paul-Hus, 2016; Santos, 2015; Van Raan, 2005). They are intended to unearth the relationships between documents and to examine the development of a research topic with co-citation methods (Tsay, Xu & Wu, 2003; Yu, Chang & Yu, 2016).

The Data of The Research
There are three databases representing different approaches: Web of Science (WoS), Scopus and Google Scholar. WoS and Scopus are commercial databases that provide current data by evaluating citations and articles (Feng, Zhang, Du & Wang, 2015; Jacso, 2005; Seyedghorban, Jekanyika-Matanda & LaPlaca, 2015). Google Scholar has been an open source since 2004 (Jacso, 2005). Scopus is built from records extracted from Elsevier databases such as Geobase, Biobase and Embase, and enriched with citation information (Agapiou & Lysandrou, 2015; Archambault, et al., 2009; Fingerman, 2006). WoS is interpreted as a much more scientific and comprehensive multidisciplinary content research platform than Scopus (Fingerman, 2006; http://thomsonreuters.com/thomson-reuters-web-ofscience/). The data required for the research were obtained from the Web of Science™ Core Collection database, from the articles covered by the SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH and ESCI indices. In the study, the terms "computerized adaptive testing", "computerized adaptive exams", "computer adaptive testing", "computer adaptive test" and "computer adaptive exams" were combined with the "or" connector and searched for the years 1946-2016. Only articles solely on the subject under consideration were included in this study. A total of 800 articles were obtained from the database. After repeated articles were excluded, the analyses were continued with 637 articles.
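The search and de-duplication steps described above can be illustrated with a minimal Python sketch. It is not the procedure actually used in the study: the record layout (dictionaries with `title` and `year` keys) and the rule for treating two records as repeats (same normalized title and year) are assumptions made only for illustration.

```python
# Search terms exactly as listed in the text.
SEARCH_TERMS = [
    "computerized adaptive testing",
    "computerized adaptive exams",
    "computer adaptive testing",
    "computer adaptive test",
    "computer adaptive exams",
]


def build_query(terms):
    """Quote each phrase and join the phrases with the OR connector."""
    return " OR ".join(f'"{term}"' for term in terms)


def deduplicate(records):
    """Keep the first record for each (normalized title, year) pair.

    Assumption: a repeated article is one whose title, ignoring case and
    extra whitespace, and whose year match an earlier record.
    """
    seen = set()
    unique = []
    for record in records:
        key = (" ".join(record["title"].lower().split()), record["year"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

In the study, the corresponding step reduced the 800 retrieved records to the 637 articles that were analysed.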

Analysis of Data
In bibliometric research, there are usually three types of co-citation analysis: journal co-citation, document co-citation and author co-citation. The basic assumption behind co-citation analysis is that documents cite the successful work done in their subject area (Tsay, et al., 2003). If two documents or authors appear in the same bibliography (reference list), they are co-cited. The more often two publications are cited together, the stronger the link between them, on the assumption that their content is similar (Feng, et al., 2015; Gmür, 2003; Tsay, et al., 2003). In this study, document co-citation is used.
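The counting idea behind document co-citation can be sketched in a few lines of Python. This is a hedged illustration of the definition given above, not CiteSpace's implementation; the function name and the input format (one list of cited-document identifiers per citing article) are assumptions.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists):
    """Count how often each pair of documents is cited together.

    reference_lists: one list of cited-document identifiers per citing
    article (hypothetical input format).
    """
    counts = Counter()
    for refs in reference_lists:
        # Each unordered pair in one bibliography is one co-citation.
        for a, b in combinations(sorted(set(refs)), 2):
            counts[(a, b)] += 1
    return counts
```

For three citing articles with reference lists `["A", "B", "C"]`, `["A", "B"]` and `["B", "C"]`, the pairs (A, B) and (B, C) are each co-cited twice and (A, C) once, so the A-B and B-C links would be drawn as the stronger ones.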
CiteSpace is a Java application that analyses and visualizes the large network structures obtained in bibliometric research (Chen, 2006; Feng, et al., 2015; Zhao & Wang, 2011). The program, developed by Chaomei Chen, produces co-citation networks of nodes and links. It is an effective program for measuring relationships and links between sources such as authors, articles, institutes, terms and keywords (Tsay, et al., 2003; Seyedghorban, et al., 2015; Zhao & Wang, 2011). It was developed to map information fields, explain the relationship between different disciplines, examine and estimate the studies in a certain time period, uncover the most recent studies and use these to predict emerging trends based on the analysis of the bibliographic records of related publications (Chen, 2014; Feng, et al., 2015; Khan & Niazi, 2017; Liu, Yin, Liu & Dunford, 2015; Zhao & Wang, 2011). The present study was carried out by analysing 637 articles published between 1984 and 2016 with the CiteSpace program.
In CiteSpace, clusters are formed according to the similarities of the references cited by the published articles. There are three different algorithms for naming clusters: TF*IDF, LLR and MI. These algorithms serve to characterize the nature of the cluster to be identified (Chen, 2014). The program uses TF*IDF by default. LLR is based on the log-likelihood ratio, while the MI algorithm is based on mutual information (Chen, 2014). In this study, the clusters were named according to TF*IDF (term frequency by inverse document frequency), based on the words in the abstracts of the articles.
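To make the TF*IDF idea concrete, the following Python sketch scores terms for each cluster with a textbook TF*IDF variant. It is an assumption-laden illustration, not CiteSpace's labelling code: each cluster is treated as one "document" made of pooled abstract terms, and the function name and input format are hypothetical.

```python
import math
from collections import Counter


def tfidf_labels(cluster_terms, top_n=1):
    """Return the top_n highest-scoring terms for each cluster.

    cluster_terms: dict mapping a cluster id to the list of terms pooled
    from the abstracts of its articles (hypothetical input format).
    """
    n_clusters = len(cluster_terms)
    # Document frequency: in how many clusters does each term occur?
    doc_freq = Counter()
    for terms in cluster_terms.values():
        doc_freq.update(set(terms))

    labels = {}
    for cluster_id, terms in cluster_terms.items():
        term_freq = Counter(terms)
        scores = {
            term: (count / len(terms)) * math.log(n_clusters / doc_freq[term])
            for term, count in term_freq.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        labels[cluster_id] = ranked[:top_n]
    return labels
```

A term that is frequent in one cluster but rare in the others receives a high score, while a term that appears in every cluster (such as "test" in a CAT corpus) scores zero and is never chosen as a label.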
With the CiteSpace program, the structural development of important studies over time can be observed. Timeline visualization can be used to view new trends and developmental patterns (Kim & Chen, 2015; Santos, 2015). In this respect, the timeline of the studies about CAT was included in the research.

Findings
The articles obtained from WoS were published between 1984 and 2016. The 30 disciplines in which the highest number of articles on CAT were published, and the number of articles published in each, are given in Table 1. When Table 1 is examined, it is observed that the disciplines with the highest number of articles about CAT are psychology, mathematical methods and health. However, studies are carried out in a wide variety of disciplines. CAT appears to be used more in disciplines where it is important to evaluate the individual's level independently from the group.
A line between nodes shows the co-citation relationship between the two articles. There is a total of 166 nodes in Figure 2. Nodes and citation links vary in colour and size. The size of a node is proportional to the number of citations. Each line represents a link between two nodes in the network, and the thickness of the line indicates the strength of the co-citation relationship. CiteSpace conveys time periods through colours. Blue shows the earliest years, green the middle years, and orange and red the most recent years. Darker shades of the same colour represent earlier time periods, and lighter shades show later times (Khan & Niazi, 2017). As shown in Figure 2, the large nodes, such as Ware (2000), Hart DL. (2005) and Cella D. et al. (2007), have more citations than other articles. The most cited articles and the information related to these articles are given in Table 2. As seen in Table 2, the most cited articles are mainly in the first cluster. The most cited article is Cella D. et al.'s (2007), entitled "The Patient-Reported Outcomes Measurement Information System (PROMIS): Progress of an NIH Roadmap Cooperative Group during its first two years". In this article, the researchers summarized the organization and scientific activity of the PROMIS network during its first two years.
In the study, six clusters were obtained, and the clusters were named according to TF*IDF (term frequency by inverse document frequency), based on the words in the abstracts of the articles. The size and colour of each cluster differ from those of the other clusters. Cluster # 0 is the largest cluster. In addition, the articles in cluster # 0 and the articles in cluster # 4 are highly interrelated.
As a result of the clustering process, two coefficients indicate the quality of the network obtained from the analysis of the 637 articles that include the concept of CAT: "Silhouette" and "Modularity Q". Silhouette values for the six clusters ranged from 0.697 to 0.969. A value of 0.3719 was obtained for the mean Silhouette value, and 0.6585 for Modularity Q. These two values are expected to be higher than 0.5 as an indicator of a good network structure. The Modularity Q value is high, and this value gives information about whether the articles in the network are logically divided into clusters. The mean Silhouette value shows the homogeneity of the clusters. A high Silhouette value indicates that the cluster members are more stable. However, when a cluster is small, a high Silhouette value does not necessarily mean that the cluster is homogeneous. For example, when there are only 7 elements in cluster # 9 and the Silhouette value is 1, this may mean that a single author referred to all 7 articles (Chen, 2014). As shown in Table 3, the largest cluster is cluster # 0, with 31 articles and a Silhouette value of 0.841. The mean publication year of the works in this cluster is 2009. This cluster is labelled "patient-reported outcome" according to TF*IDF. The most cited study is "Development of a PROMIS item bank to measure pain interference" by Amtmann D, Cook KF, Jensen MP, et al. (2010).
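For readers unfamiliar with the two coefficients, their standard textbook definitions are reproduced below. These are the commonly used forms of the silhouette coefficient and of modularity; the exact variants computed by CiteSpace are not stated in the text, so the symbols used here are assumptions introduced for illustration.

```latex
% Silhouette of node i:
%   a(i): mean dissimilarity of i to the other members of its own cluster
%   b(i): lowest mean dissimilarity of i to the members of any other cluster
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1

% Modularity Q of a partition of the network:
%   A_{ij}: weight of the link between nodes i and j
%   k_i = \sum_j A_{ij}, \qquad m = \tfrac{1}{2} \sum_{i,j} A_{ij}
%   \delta(c_i, c_j) = 1 if i and j belong to the same cluster, 0 otherwise
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)
```

A silhouette close to 1 means a node is much closer to its own cluster than to any other, and a Q close to 1 means that links fall within clusters far more often than would be expected by chance.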
The second cluster is # 1, which is labelled "item selection" according to TF*IDF. This cluster contains 30 articles. The most cited study in this cluster is "A method for the comparison of item selection rules in computerized adaptive testing" by Barrada, J.R., Olea, J., Ponsoda, V. & Abad, F.J. (2010). Information about the other clusters is given in Table 3. The methodological development of CAT applications is mostly observed in clusters # 1 and # 5, while in the other clusters CAT applications in the field of health come to the fore.
Timeline visualization is examined for the 6 clusters. Each node in the timelines represents an important article. The rings and colours of the articles give information about "betweenness" centrality, citation frequency or citation "burstiness" (Khan & Niazi, 2017). The articles that are important in this respect stand out with their rings. The size of the node is proportional to the number of citations. Purple nodes indicate high betweenness centrality, which marks an article as an important turning point. Citation burstiness is shown in red (Khan & Niazi, 2017).
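Betweenness centrality, which the purple nodes highlight, counts how often a node lies on the shortest paths between other nodes. The Python sketch below computes it with Brandes' algorithm for a small unweighted, undirected network. It is an illustration under stated assumptions (adjacency given as a symmetric dictionary, no link weights, no normalization), not the computation performed inside CiteSpace.

```python
from collections import deque


def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes' algorithm).

    adj: dict mapping each node to its neighbours; the graph is assumed
    to be undirected, so every link must be listed in both directions.
    """
    centrality = {v: 0.0 for v in adj}
    for source in adj:
        # Breadth-first search from source, counting shortest paths.
        order = []
        preds = {v: [] for v in adj}
        n_paths = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        n_paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes back.
        dependency = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]
    # Every unordered pair of endpoints was counted twice.
    return {v: c / 2 for v, c in centrality.items()}
```

In a chain of three articles A, B and C where only B is linked to both others, B lies on the single shortest path between A and C and therefore has the highest betweenness, which is the sense in which such a node acts as a turning point between groups of studies.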

Figure 3. Co-citation timeline for the 6 clusters
The developmental scheme of each cluster was given separately according to the timelines. As some of the important articles in the clusters stand out, the relationship of these studies with each other is shown by networks. The articles that have high betweenness centralities are given below:
Cella D., Yount, S., Rothrock, N., Gershon, R., Cook, K., Reeve, B., et al. (2007)
These articles in Table 4 represent important turning points in the clusters to which they belong. It is seen that the most important turning points were between 2005 and 2010. In some years, more than one study constitutes a turning point.
As shown in Figure 3, the absence of red rings around the nodes indicates that there is no citation burstiness. Therefore, the citation numbers of the articles did not show a sudden increase in a short time.
As seen from the timeline, articles published after 2010 are mostly about patient-reported health and anxiety and are related to each other. The articles related to item selection have advanced since 2010, and it is possible to say that further studies on these issues will be made.

Discussion and Conclusion
Article co-citation analysis is a statistical method used to analyse the structure underlying a prominent topic and to reveal the citations and attributes of the articles (Tsay, et al., 2003; Yu, et al., 2016). It is also used to visualize scientific research, identify emerging content, and predict future research (Song, Zhang & Dong, 2016). The current study was planned to see the structures in articles on CAT applications and to learn which areas include more articles of this sort and how these articles are related to each other. Using 637 articles related to CAT applications published between 1984 and 2016, this study draws attention to some important points visually. Article co-citation analysis was performed using the CiteSpace program. The results and some suggestions can be listed as follows: It is seen that the articles were produced after 1995 and that their number increased more intensively after 2010. Technological developments and advances in the uses of technology may have contributed to this trend.
CATs have been used in health, education and psychology. However, CAT applications are mostly used in health areas. Indeed, the most cited articles, such as Cella et al. (2007) and Reeve et al. (2007), belong to the field of health. When the content of these articles is examined, it is seen that CAT applications in the field of health were used to define the psychometric properties of the individual in determining the pain threshold and to examine patient-reported outcomes.
In addition, studies have been carried out to determine the best methods and to compare these methods. As can be seen from the results, the use of CAT applications in more than one field will help to enrich interdisciplinary studies and help researchers to see different perspectives relating to their research.
As a result of the citation analysis, six significant clusters were reached. These clusters contained articles from different fields, which appears to encourage work across disciplines. Therefore, if the words to be used in the analysis are chosen to address more than one area, researchers can conduct more comprehensive research. In this way, researchers will be more active in the fields or journals on the subject of their study and have the opportunity to apply the methods in different fields.
Timelines are provided to give an idea of future research and to show the relationship between the 6 clusters obtained. In this timeline, it can be said that the studies on CAT are closely related to each other. In particular, the studies in cluster # 0, which includes intensive studies in the field of health, are strongly related to the studies in the four clusters following it, and the articles in clusters # 2 and # 3 frequently refer to each other. The articles displayed in purple mark important turning points in those years and therefore show when the important articles appeared. The most important turning point identified in the present study is the research that evaluated the properties of an item bank with CAT for PROMIS. This study is deemed important for many other related studies.
With the timeline, it is possible to anticipate new studies in the field of CAT and to see the past studies that helped in the development of these studies. CAT applications were first studied in the articles in cluster # 5, referred to as "Adaptive Test", and were then used in clinical studies in the 2000s.
With the development of technology, computer applications have been included in many fields. CAT applications developed in accordance with Item Response Theory can be used to determine each individual's level independently from the group. This is very important for education and health, since it is vital in these fields to evaluate each individual separately. Therefore, the use of CAT applications especially in these areas will provide more qualified results. This research sheds light on the work of scholars doing research in the field of CAT, who will be able to reach information about which studies were completed in which time period in the field. For these reasons, bibliometric studies will remain a crucial instrument for seeing the deficiencies in any topic or field and thus for carrying out studies to eliminate them. Researchers can use this kind of analysis in particular while carrying out literature reviews in their field of interest.