The Effects of Log Data on Students’ Performance

This study aimed to assess the relationships between response times (RTs), the number of actions taken to solve a given item, and student performance. In addition, the interaction between students’ information and communications technology (ICT) competency, reading literacy, and log data (time and number of actions) was examined in order to gain additional insights into the relations between student performance and log data. The sample consisted of 2 348 students who participated in the triennial international large-scale assessment of the Programme for International Student Assessment (PISA). For the current study, 18 items in one cluster of the 91st booklet were chosen. To achieve the aim of the study, the explanatory item response modeling (EIRM) framework based on generalized linear mixed modeling (GLMM) was used. The results showed that students who spent more time on items and those who took more actions on items were more likely to answer the items correctly. However, this effect did not vary across items and students. Moreover, only the interaction between reading and the number of actions was found to have a positive effect on students’ overall performance.


INTRODUCTION
Depending on the stakes or context of the tests, students adopt different test-taking behaviors. Much psychometric research has been undertaken to explore these behaviors. With the growing use of technology in testing, it has become possible to analyze test-takers' behaviors in detail in relation to many psychometric aspects. Given the feasibility of administering computerized assessments in education, computer-generated log files can provide rich information in this context.
A student log file records all the data produced by the student during testing. Log files make it possible to see beyond students' overall performance by determining, for example, what actions have been undertaken and how much time has been spent on a specific item. The information gathered in log files offers a different perspective on students' performance and cognitive behaviors (Greiff, Wüstenberg & Avvisati, 2015). Moreover, log files can offer valuable feedback about students' learning and cognitive abilities (Greiff et al., 2014). Many recent studies have shown that students' log files provide validity evidence (e.g., Lee & Jia, 2014; Wise & DeMars, 2005), possible associations with student performance (Goldhammer et al., 2014; Greiff et al., 2015), and a better understanding of non-traditional competences (Azzolini, Bazoli, Lievore, Schizzerotto, & Vergolini, 2019).
Within students' log data, response time (RT) in particular has been the subject of many studies in psychology and psychometrics (e.g., Goldhammer, Naumann & Greiff, 2015; Lee & Haberman, 2016). RT has been used to gain a better understanding of mental activity in psychology, and its use in testing has also been on the rise over the last few decades (Schnipke & Scrams, 2002). This is because time plays an important role in examining the process of answering items in detail. In this sense, RT has been examined as an indicator of test-taking motivation/engagement (Wise & DeMars, 2005), rapid-guessing behavior (Lee & Jia, 2014), or a characteristic of student performance (Goldhammer et al., 2014).
In their study, Goldhammer et al. (2014) examined the time effect in reading and problem solving using the items of the Programme for the International Assessment of Adult Competencies (PIAAC). They found that the time effect depended on item difficulty and test-takers' ability. In this sense, the time had a positive effect on problem-solving items while the opposite relationship was found for reading items. With a similar purpose, item RT was investigated using a computerized version of Raven's Advanced Progressive Matrices (RAPM) test (Goldhammer et al., 2015). According to the findings of the study, item RT had a negative effect on the overall performance of test-takers. However, this effect differed in that it was highly negative for easy items among higher-performing test-takers, but not high enough for difficult items and lower-performing test-takers. In another study (Greiff, Niepel, Scherer & Martin, 2016) using students' RT, it was revealed that spending an extremely low or high level of time led to lower performance in complex problem-solving. Lee and Haberman (2016) used RT to investigate test-taking behaviors in an international language assessment and found that the behaviors and RTs of examinees from different countries did not generally follow a stable trend. On the other hand, in their study, higher-performing examinees showed a more stable trend within each country in terms of RTs. In another study by Dodonova & Dodonov (2013), the relationship between cognitive ability and RT of individuals was examined using the RAPM test. The result of their research showed that higher-performing individuals had lower RTs than lower-performing individuals; however, this association changed in relation to more difficult items.
The aim of the current study was likewise to model RT as a characteristic of student performance and to examine the effect of the number of actions taken to solve a given item using the Programme for International Student Assessment (PISA) 2015 data. In addition, the interaction between students' information and communications technology (ICT) competency, reading literacy, and log data (time and number of actions) was examined in order to gain additional insights into the relations between student performance and log data. Only a limited number of studies have investigated the interactions between log data and other possible indicators, such as reading ability or technological competencies, which can play a role in shaping this data. Thus, to extract more information from students' log data, the current study aimed to assess the relationships between RTs, the actions taken to solve a given item, and student performance.
Considering the results of the above-mentioned studies and the effort required to give correct answers to the items in PISA, it was assumed in this study that RT has a positive effect on overall student performance. Therefore, it was expected that the more time students spent on items, the higher their probability of answering the items correctly would be. Since spending less time on items is considered to indicate rapid guessing and lower levels of test engagement, it was also expected that students with higher ability would spend more time on items. Moreover, it was assumed that RT increased with item difficulty regardless of students' ability, given the results of various studies (e.g., Goldhammer & Klein-Entink, 2011; Goldhammer et al., 2014; Klein-Entink, Fox & van der Linden, 2009) indicating that the difficulty of items had a moderating effect on performance. Students' reading ability can also affect RT, since an item needs to be read before a response is given. The interaction between reading performance and time will vary depending on the reading load of the items. However, in the current study, it was assumed that this interaction would have a negative effect on student performance. Apart from reading ability and understanding, students' RT may also be affected by their level of ICT competence, since solving an item in computerized tests such as PISA requires students to press buttons, drag and drop, and select from lists (Organisation for Economic Co-operation and Development [OECD], 2017a). Thus, it was expected that students with lower ICT competence would spend more time on items, and it was assumed that the interaction between ICT competence and time would negatively affect overall student performance.
Although extensive research has been carried out on the relationship between RT and test-takers' ability, only a limited number of studies (He, von Davier, & Han, 2018; Herborn, Stadler, Mustafić & Greiff, 2018) were found in the literature regarding how the number of actions taken to solve a given item affects student performance. Since these studies were conducted in the context of problem-solving behaviors, additional research can be undertaken to find associations between the number of actions taken by students while answering items during testing and students' overall performance. In this way, it would be possible to compare the effects of log data such as the number of actions in different types of assessments. For instance, unlike paper-pencil assessments, PISA 2015 required students to undertake several actions in order to answer the items. Hence, it was expected that taking more actions on items would have a positive effect on overall student performance. It was also assumed that the number of actions increased with item difficulty regardless of students' ability. In addition, students' ICT competencies might have affected the number of actions taken when answering items in PISA 2015. Students with higher ICT competence who take more actions to answer items might be able to solve problems better, whereas for those with lower ICT competence, undertaking irrelevant actions would make no difference in answering the items correctly. Thus, it was assumed that the interaction between ICT competence and the number of actions would positively affect student performance. Likewise, it was expected that the interaction between the number of actions and reading would have a positive effect. In this sense, the following four research questions were addressed:
1. Does time have a significant effect on overall student performance?
2. Does the interaction between reading, ICT competence, and time have a significant effect on overall student performance?
3. Does the number of actions have a significant effect on overall student performance?
4. Does the interaction between reading, ICT competence, and the number of actions have a significant effect on overall student performance?

Journal of Measurement and Evaluation in Education and Psychology

METHOD
The aim of this study was to investigate the effects of log data on students' performance. To achieve this aim, explanatory item response modeling was used. RT and the number of actions were modeled as covariates. The sample, data collection instruments, and data analysis are described in the following sections.

Sample
The sample consisted of students who participated in the triennial international large-scale assessment of PISA in 2015, which assesses the key knowledge and skills of 15-year-old students, focusing on reading, mathematics, and science literacy. PISA also uses questionnaires in order to obtain information regarding various aspects of students, schools, and countries. In PISA 2015, apart from students from schools in 15 countries unable to fulfill the technological requirements, all participants completed the tests and questionnaires via computer. Thus, students' log files were available in PISA 2015. In each cycle of PISA, one of the core domains is tested in detail, and in 2015, the major domain was science.
In order to avoid item position effects, the 2 348 students who answered 27 items in the same order in one cluster of the 91st test booklet, which was taken by the largest number of students, were chosen for this study. However, some students had to be excluded from the analysis because they had not completed or taken the ICT competency questionnaire (n = 635) or had an extremely large number of actions or RTs (n = 147); therefore, the final sample consisted of 1 566 students (51% female; mean age = 15.78, SD = 0.29).

Items
In PISA 2015, the scientific literacy items focused on three competencies (explain phenomena scientifically, evaluate and design scientific enquiry, and interpret data and evidence scientifically) (OECD, 2017b). In this cycle of PISA, some items required the completion of interactive tasks, meaning that students needed to manipulate variables in the simulations given with the items (OECD, 2017a). Each student first received two 30-minute booklets of science tasks and two 30-minute booklets for the other domains (OECD, 2017c).
Since the 91st booklet was taken by the largest number of students in PISA 2015, the items in one cluster of this booklet were chosen for the current study. Of the 27 items in this cluster, two polytomous items, one item without timing data, and six items with low item discrimination values were not included; therefore, 18 science items were selected for the analyses. In this study, log data regarding the response times and the number of actions for those items were included. The response time variable indicates how much time was spent answering each item, and the number of actions variable indicates how many actions (such as clicks, keypresses, and drag/drop events) a student took to answer a given item.
Reading literacy and ICT competence were also utilized as predictors in this study. Reading literacy is defined by OECD (2017b) as "understanding, using, reflecting on and engaging with written texts, in order to achieve one's goals, develop one's knowledge and potential, and participate in society" (p. 51). In PISA 2015, three aspects (access and retrieve, integrate and interpret, reflect and evaluate) were defined to assess reading literacy by using mixed response format items. Students' perceived ICT competence was assessed by asking them several questions regarding their level of comfort in using various digital devices (OECD, 2017b). An index variable was calculated from these responses for each student in PISA 2015, and this index was used in the present study.

Data Analysis
To achieve the aim of the study, the explanatory item response modeling (EIRM) framework based on generalized linear mixed modeling (GLMM) (De Boeck et al., 2011; De Boeck & Wilson, 2004) was used. Within this framework, properties of items and persons are modeled as explanatory covariates in order to explain individuals' responses more broadly (Wilson, De Boeck & Carstensen, 2008).
In the context of EIRM, responses are treated as repeated observations nested within students. Unlike traditional item response theory (IRT) models, EIRM allows including item-and person-level covariates in the measurement model to explain variances in the latent abilities of individuals. In the framework of GLMM, EIRM is the complex extension of the Rasch model (Rasch, 1960), "in which the clustering of item responses within respondents is a function of item-specific fixed effects and one person-specific random effect" (Briggs, 2008, p. 93). More detailed information about how GLMM is formulated as a Rasch model can be found in Rijmen, Tuerlinckx, De Boeck, and Kuppens (2003) and Briggs (2008).
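As an illustration of the formulation quoted above, the Rasch model can be written as a GLMM with a logit link. The notation below ($Y_{pi}$ for the scored response of person $p$ to item $i$, $\theta_p$, $\beta_i$, $\sigma^2_{\theta}$) is introduced here for illustration and follows a common convention; it is not taken from the cited sources.

```latex
\[
\operatorname{logit} P(Y_{pi} = 1 \mid \theta_p)
  = \ln \frac{P(Y_{pi} = 1 \mid \theta_p)}{1 - P(Y_{pi} = 1 \mid \theta_p)}
  = \theta_p + \beta_i ,
\qquad
\theta_p \sim N(0, \sigma^2_{\theta})
\]
```

In this sketch, $\beta_i$ is the item-specific fixed effect, interpreted as item easiness (difficulty is $-\beta_i$), and $\theta_p$ is the single person-specific random effect.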
In this study, RT and the number of actions were modeled as covariates separately. For data preparation, the time variable was first log-transformed, as suggested in the literature, in order to obtain a better model fit (van der Linden, 2009). The number of actions, reading literacy, and ICT competence variables were also normalized. Outliers (147 students) were excluded from the data analysis. After this process, the data were converted into the long format using the "reshape" package (Wickham, 2012) in R (R Development Core Team, 2018).
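The preparation steps described above can be sketched in a few lines. The study used R and the "reshape" package; the Python below is only an illustrative stand-in. The column names (`resp_`, `time_`, `act_`), the toy values, and the choice to standardize with the sample standard deviation across all student-item rows are assumptions, not details reported in the study.

```python
import math
from statistics import mean, stdev


def z_scores(values):
    """Standardize values to mean 0 and (sample) standard deviation 1."""
    center = mean(values)
    spread = stdev(values)
    return [(v - center) / spread for v in values]


def to_long(wide_rows, item_ids):
    """Reshape one row per student into one row per student-item pair."""
    long_rows = []
    for row in wide_rows:
        for item in item_ids:
            long_rows.append({
                "student": row["student"],
                "item": item,
                "response": row["resp_" + item],
                # RT is log-transformed before modeling
                "log_time": math.log(row["time_" + item]),
                "actions": row["act_" + item],
            })
    return long_rows


# Hypothetical toy data: two students, two items (not PISA values)
wide = [
    {"student": "s1", "resp_i1": 1, "time_i1": 42.0, "act_i1": 5,
     "resp_i2": 0, "time_i2": 30.5, "act_i2": 2},
    {"student": "s2", "resp_i1": 0, "time_i1": 18.2, "act_i1": 1,
     "resp_i2": 1, "time_i2": 75.0, "act_i2": 8},
]

long_data = to_long(wide, ["i1", "i2"])

# Standardize the number of actions across all student-item rows (assumption)
for row, z in zip(long_data, z_scores([r["actions"] for r in long_data])):
    row["actions_z"] = z

for row in long_data:
    print(row)
```

Each student contributes one row per item in the long format, which is the layout a mixed model with responses nested within students expects.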
First, the fit of the data to the Rasch model was examined by obtaining the related fit indices and checking the other required assumptions. Since the Infit and Outfit indices for the items ranged between 0.5 and 1.5 (De Ayala, 2009), item fit was confirmed. For unidimensionality, the average RMSEA value was found to be .03, which is below .05, indicating that the data fit a one-factor model. When the local independence assumption was checked with Yen's Q3 statistic, the residual correlations for all pairs of items were found to be below .20, indicating that the item responses are locally independent. These assumptions were examined using the "sirt" package (Robitzsch, 2019) in R. After the assumptions were met, explanatory IRT models were tested using the "lme4" package (Bates, Maechler & Bolker, 2012) in R. Following the approach of Goldhammer et al. (2014, 2015) and as described by Desjardins and Bulut (2018), all explanatory IRT models tested separately for the time and action variables in this study are as follows:
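A general form consistent with this approach is given here as a hedged sketch; it is not the exact specification of each fitted model. It adds a log-data covariate $x_{pi}$ (log RT or the normalized number of actions of person $p$ on item $i$) to a model with random person and random item intercepts. All symbols, including the person-level covariate $z_p$ (reading literacy or ICT competence), are notation assumed for illustration.

```latex
\[
\operatorname{logit} P(Y_{pi} = 1)
  = \beta_0 + b_{0p} + b_{0i}
  + (\beta_1 + b_{1p} + b_{1i})\, x_{pi}
\]
\[
(b_{0p}, b_{1p})^{\top} \sim N(\mathbf{0}, \Sigma_{\mathrm{person}}),
\qquad
(b_{0i}, b_{1i})^{\top} \sim N(\mathbf{0}, \Sigma_{\mathrm{item}})
\]
\[
\text{with an interaction:}\quad
\operatorname{logit} P(Y_{pi} = 1)
  = \beta_0 + b_{0p} + b_{0i}
  + \beta_1 x_{pi} + \beta_2 z_p + \beta_3\, x_{pi} z_p
\]
```

Under these assumptions, a baseline model sets $\beta_1 = b_{1p} = b_{1i} = 0$, a fixed-effect model frees $\beta_1$ only, and random-slope models additionally free $b_{1i}$ (variation across items) or $b_{1p}$ (variation across students). Here $b_{0p}$ plays the role of the person ability and $\beta_0 + b_{0i}$ the role of item easiness.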

RESULTS
According to the results of the initial analysis, all items fit the Rasch model, and the correlation between students' abilities estimated using the items selected in this study and the performance scores obtained from PISA was found to be .91. The coefficient alpha value was calculated as .81, meaning that the items had high internal consistency. The item statistics, item parameters, and fit statistics are given in Table 1. As shown in Table 1, the easiness of the items ranged from -1.02 to 2.05, with the average difficulty being 0.24, which means that the items were of moderate difficulty overall. The results from the EIRMs for RT are presented in Table 2, and those for the number of actions are given in Table 3.
As can be seen in Tables 2 and 3, the overall effects of RT and the number of actions were statistically significant (βtime = 0.04, βaction = 0.33, p < .001). The positive effects indicated that students spending more time on items and those taking more actions on items were more likely to answer the items correctly. However, when RT and the number of actions were included as random effects in addition to being fixed effects, the estimated effects of these variables were not significant (βtime = 0.02, βaction = -0.15, p > .05). This finding shows that the effects of RT and the number of actions were not linearly associated with the abilities of students and the difficulties of items. The results indicate that the variation in RT and the number of actions taken by higher-performing students on easy or difficult items differed from that of lower-performing students on easy or difficult items. Thus, the variability of RT and the number of actions was unequal across items and students.
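To make the size of the reported fixed effects concrete, the coefficients can be converted from the logit scale to odds ratios. The sketch below is illustrative only: the helper names are hypothetical, and the baseline logit of 0 (a 50% chance of a correct answer) is an assumed reference point, not a value from the study.

```python
import math


def odds_ratio(beta):
    """Multiplicative change in the odds for a one-unit covariate increase."""
    return math.exp(beta)


def probability(logit):
    """Inverse logit: convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-logit))


beta_time = 0.04    # reported fixed effect of (log-transformed) RT
beta_action = 0.33  # reported fixed effect of the (normalized) number of actions

or_time = odds_ratio(beta_time)
or_action = odds_ratio(beta_action)

# Assumed baseline: a student-item pair with logit 0, i.e. P(correct) = .50
p_base = probability(0.0)
p_more_actions = probability(0.0 + beta_action)

print(f"odds ratio, time:    {or_time:.3f}")
print(f"odds ratio, actions: {or_action:.3f}")
print(f"P(correct) at baseline:            {p_base:.3f}")
print(f"P(correct) one unit more actions:  {p_more_actions:.3f}")
```

Because the number of actions was normalized, a one-unit increase corresponds to one standard deviation under the assumption that normalization meant z-scoring; for RT, one unit is one unit of log time.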
The models including interactions between log data and reading literacy and ICT competency showed that all interactions were non-significant predictors, except for the interaction between the number of actions taken and reading literacy. This finding shows that students' level of ICT competency did not differ depending on RT and the number of actions taken by students in order to answer the items correctly. However, students with higher reading literacy performance took a greater number of actions.

[Tables 4 and 5: model comparisons. Values recoverable from the last row of the first table: 33237.8, 33278.9, -16613.9, 6.26*; from the second table: 7.31*. * p < 0.05, ** p < 0.01, *** p < 0.001. Note: All other models were compared with Model 0.]

As seen in Tables 4 and 5, Model 3 showed the best fit in terms of the AIC and BIC fit statistics. It should be noted that Model 1, which has the related variable as a random effect at the item level, seems to fit the data better than the other models.
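For readers comparing models by AIC and BIC, the standard definitions can be sketched as follows. The log-likelihood of -16613.9 is the value that appears in the model comparison output; the parameter count of 5 is an assumption, chosen only because it reproduces the AIC value of 33237.8 that appears alongside it. The function names are hypothetical.

```python
import math


def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: k * ln(n) - 2 * log-likelihood."""
    return n_params * math.log(n_obs) - 2 * log_lik


log_lik = -16613.9  # value from the model comparison output
n_params = 5        # assumption: not stated in the text

aic_value = aic(log_lik, n_params)
print(f"AIC = {aic_value:.1f}")
```

Lower values indicate better fit after penalizing model complexity. BIC penalizes additional parameters more heavily than AIC whenever the number of observations exceeds about 7 (ln n > 2), which is why the two criteria can rank models differently; computing it here would require the number of scored responses, which is not stated.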

DISCUSSION and CONCLUSION
The aim of this study was to assess the relationships between RTs, the number of actions taken to solve a given item, and student performance. In addition, the interaction between students' ICT competency, reading literacy, and log data (time and number of actions) was examined in order to gain additional insights into the relations between student performance and log data. The results showed that students who spent more time on items and those who took more actions on items were more likely to answer the items correctly. However, this effect did not vary across items and students.
In this study, it was assumed that RT and the number of actions had a positive effect on overall student performance. As hypothesized, the results revealed that students spending more time on items and those taking more actions on items were more likely to answer the items correctly. It was also assumed that RTs depended on item difficulty and student ability. Unexpectedly, this effect did not vary across items and students, and broadly, this finding did not support the findings of other studies (Dodonova & Dodonov, 2013; Goldhammer & Klein-Entink, 2011; Goldhammer et al., 2015; Lasry, Watkins, Mazur & Ibrahim, 2013; Verbić & Tomić, 2009), which found a negative relationship between RT and the abilities of individuals on a particular test. Furthermore, those studies found that RT varied significantly across items and across individuals with different levels of ability; however, since they investigated tests measuring cognitive skills, RTs may play a different role in those tests. This inconsistency may be due to the item structure used in PISA. The science items used in PISA differ from cognitive tests in terms of context. Similarly, Lee and Haberman (2016), investigating RT as a pacing and speededness indicator in an international language assessment, found that the RTs of examinees from different countries did not generally follow a stable trend. Like PISA items, the Scholastic Aptitude Test (SAT) measures not a cognitive structure but something closer to achievement in a particular field, and some studies (Klein-Entink et al., 2009) did not find a relationship between RT and student performance on the SAT. Hence, it may be concluded that item types, and more specifically the aim of the test, also affect RT. Another possible explanation could be the testing conditions (Lee & Jia, 2014). As Goldhammer et al. (2014) stated, "when collecting time information across tasks and individuals that are heterogeneous in difficulty and skill level, respectively, the role of time and its interpretation may differ" (p. 624), and the same finding occurred in this study. All of the points discussed concerning RT can also be applied to the number of actions. However, further evidence is certainly needed to understand the effect of the number of actions on answering items. Given that not all items in PISA were released, future studies could use other types of items and tests in which item features can be examined in more detail while looking for an effect on RT and the number of actions.
In the present study, several interaction effects were examined. It was assumed that the interaction between RT, reading, and ICT competence would have a negative effect on student performance. However, none were found to have a significant effect on student performance, and these results are likely to be related to the previous findings. Given the non-uniform distribution of RTs among items and students, the RTs of students with higher reading ability or ICT competency would also have a similarly non-uniform distribution. The finding related to students' reading ability supports the work of Goldhammer et al. (2014) and Petscher, Mitchell, and Foorman (2015). In the study by Petscher et al. (2015), the variability of RTs of students with higher reading ability showed more functional information compared to students with lower or moderate ability. On the contrary, Su and Davison (2019) found that students with high reading ability had lower RTs while answering the items correctly. However, since only science literacy items were selected for this study, students' abilities could have played a different role in RTs on items. Moreover, this result may be due to students' test-taking behaviors. Wise (2006) argued that students adopting rapid-guessing behavior spent less time on items, especially those with a high reading load. As Wu, Chen, and Stone (2018) stated, students' test-taking behavior is not a trait but a reaction to a particular test, and students' RTs and other performances depend on test features. In this sense, the non-significant interactions between those variables cannot be generalized to other assessments, and PISA can be classified as a low-stakes assessment. Therefore, future studies with similar purposes may use high-stakes tests in order to explore those interaction effects.
In the current study, it was also expected that the interaction between the number of actions, reading competence, and ICT competence would have a positive effect on student performance. While the interaction between ICT and the number of actions did not have a significant effect on overall student performance, the interaction between reading and the number of actions was found to have a positive effect on students' overall performance. In this sense, it could be argued that ICT competence and the number of actions are not related in terms of students' likelihood of answering items correctly. The study by Lasry et al. (2013) demonstrated that students with lower confidence spent more time on items. Following the same logic, it was assumed that students' ICT competence could play a role in students' performance together with the number of actions they had taken. The non-significant result is likely to be related to the variation of those features among students with different levels of ability. In contrast, a positive interaction effect between the number of actions and reading was found in the current study. That is, the effect of the number of actions on overall performance was higher for students with higher reading ability. This may be because students with high reading ability tend to take more actions by trying harder on items, which has a high impact on their overall science performance.
The present study proposes that the effect of time does not have a uniform trend across items and students. However, it should be noted that in this study, only a limited number of items were included in order to avoid possible item position effects; thus, the results and interpretations of this study may not cover all booklets used in PISA. Therefore, other types of research design should be implemented in the future to generalize these findings. Many other interaction effects could be included in order to explain the role of RT and the number of actions on students' performance, as explained variances found in the study suggest that there are further variables having a role in the students' log data and performance. Future studies can include other possible interactions to explain relationships between those variables. Furthermore, it would be interesting to test the role of RT and the number of actions with other IRT-based models. This could provide more detailed information to replicate this study, allowing for not only multiple-choice items but also constructed response items to be included.