Monte Carlo Simulation Studies in Item Response Theory with the R Programming Language R Programlama Dili ile Madde Tepki Kuramında Monte Carlo Simülasyon Çalışmaları

Monte Carlo simulation studies play an important role in operational and academic research in educational measurement and psychometrics. Item response theory (IRT) is a psychometric area in which researchers and practitioners often use Monte Carlo simulations to address various research questions. Over the past decade, R has been one of the most widely used programming languages in Monte Carlo studies. R is a free, open-source programming language for statistical computing and data visualization. Many user-created packages in R allow researchers to conduct various IRT analyses (e.g., item parameter estimation, ability estimation, and differential item functioning) and expand these analyses to comprehensive simulation scenarios where the researchers can investigate their specific research questions. This study aims to introduce R and demonstrate the design and implementation of Monte Carlo simulation studies using the R programming language. Three IRT-related Monte Carlo simulation studies are presented. Each simulation study involved a Monte Carlo simulation function based on the R programming language. The design and execution of the R commands is explained in the context of each simulation study.


INTRODUCTION
Monte Carlo simulation studies are the key elements of operational and academic research in educational measurement and psychometrics.Both academic researchers and psychometricians often choose to simulate data instead of collecting empirical data because (a) it is impractical and costly to collect the empirical data while manipulating several conditions (e.g., sample size, test length, and test characteristics); (b) it is not possible to investigate the real impact of the study conditions without knowing the true characteristics of the items and examinees (e.g., item parameters, examinee ability Despite the increasing use of the R programming language for conducting various statistical and psychometric analyses, many researchers are still unfamiliar with the capabilities of R for conducting Monte Carlo simulation studies.As Hallgren (2013) pointed out, the use of simulation studies should be available to researchers with a broad range of research expertise and technical skills.Researchers should be familiar with how to address research questions that simulations can answer best (Feinberg & Rubright, 2016).Given the growing demand for IRT-related research in the field of educational measurement, this study aims to demonstrate how to use R for the design and implementation of Monte Carlo simulation studies in IRT, specifically for individuals with minimal experience in running simulation studies in R. The purposes of this study are threefold.First, we introduce readers the packages and functions in R for simulating response data and analyzing the simulated data using various IRT models.Second, we summarize the principles of Monte Carlo simulations and recommend some guidelines for conducting Monte Carlo simulation studies.Third, we illustrate the logic and procedures involved in conducting IRT-related Monte Carlo simulation studies in R with three examplesincluding the R codes for simulating item response data, analyzing the simulated data, and summarizing the analysis results.The examples will target three different uses of Monte Carlo simulation studies in IRT, including item parameter recovery, evaluating the accuracy of a method for detecting differential item functioning (DIF), and investigating the unidimensionality assumption.Each simulation study uses different criteria to evaluate the simulation results (e.g., accuracy, power, and Type I error rate).For the sake of simplicity and conciseness, the readers of this study are assumed knowledgeable about (1) the basics of the R programming language and (2) the fundamentals of IRT.The readers who are not familiar with the Monte Carlo simulation studies in IRT are referred to Harwell et al. (1996) and Feinberg and Rubright (2016) for a comprehensive review.In addition, the When generating random samples using the R functions mentioned above, the randomization of the generated values is systematically controlled based on the random number generator (RNG).The RNG algorithm assigns a particular integer to each random sample.In the R programming language, the integer associated with random samples is called "seed".The user can select a particular seed when generating a random sample and then use the same seed again whenever the same random sample needs to be obtained.The set.seed function can be used for selecting a particular seed in R (e.g., set.seed(1111), where 1111 is the seed).The set.seed function plays an important role in simulations studies because it allows the researcher to create a reproducible simulation.

R Packages for Estimating IRT Models
R has many user-created psychometric packages that allow researchers and practitioners to conduct statistical analysis using psychometric models and methods.The CRAN website has a directory of the R packages categorized by topic, which is called "Task View".One of these task views, the Psychometric Task View, is specifically dedicated to psychometric methods (see https://cran.rproject.org/web/views/Psychometrics.html),such as IRT, classical test theory, factor analysis, and structural equation modeling.A lot of the packages in the Psychometrics Task View focus on the estimation of IRT models, such as unidimensional and multidimensional IRT models, nonparametric IRT modeling, differential item functioning, and computerized adaptive testing (see Ünlü and Yanagida [2011] for a review of the CRAN Psychometrics Task View).Rusch, Mair, and Hatzinger (2013) also provided a detailed summary of the R packages for conducting IRT analysis.The primary R packages for estimating item parameters and person abilities in IRT include mirt (Chalmers, 2012), eRm (Mair, Hatzinger, & Maier, 2016), irtoys (Partchev, 2016), and ltm (Rizopoulos, 2006).There are also specific packages for a particular IRT analysissuch as lordif (Choi, Gibbons, & Crane, 2016) and difR (Magis, Beland, Tuerlinckx, & De Boeck, 2010) for differential item functioning; catR (Magis & Raiche, 2012) and mirtCAT (Chalmers, 2016) for computerize adaptive testing; and equate (Albano, 2016) for test equating.In addition to these packages, we encourage the readers to browse through the Psychometric Task View for other types of IRT methods available in R.

Guidelines for Conducting Monte Carlo Simulation Studies
Monte Carlo simulation studies can be used for investigating a wide range of research questions, such as evaluating the accuracy of existing statistical models under unfavorable conditions (e.g., small sample and non-normality), answering a novel statistical question, or understanding the empirical 1.The researcher determines a set of simulation factors expected to influence the operation of a particular statistical procedure.The simulation factors can be either fully crossed or partially crossed.If the simulation factors are fully crossed, then a data set needs to be generated for every possible combination of the simulation factors.If, however, they are partially crossed, only some simulation factors are assumed to be interacting with each other.
2. A series of assumptions are made about the nature of data to be generated (e.g., types of variables and probability distributions underlying the selected variables).These assumptions are crucial to the authenticity of the simulation study because the quality of the simulation outcomes depends on the extent to which the selected assumptions are realistic.4. Statistical analyses are performed on the simulated data sets and the parameter estimates of interest from these analyses are recorded.The parameter estimates can be p-values, coefficients, or a particular element of the statistical model.
5. Finally, the estimated parameters are evaluated based on a criterion or a set of criteriasuch as Type I error, power (or hit rate), correlation, bias, and root mean squared error (RMSE).
The researcher can report the findings of the simulation study in different ways (e.g., a narrative format, tables, or graphics).The researcher should determine how the simulation results will be communicated to the target audience based on the size of the simulation study (i.e., the number of simulation factors), the complexity of the simulation design (e.g., several fully-crossed simulation factors or a simpler design with one or two factors), and the type of the reporting outlet (e.g., technical reports, journal articles, or presentations).
It should be noted that the five steps summarized above could be slightly different for each simulation study, depending on the research questions that need to be addressed.

Principles of Monte Carlo Simulation Studies
Apart from the guidelines summarized above, there are also three principles that the researchers need to consider when designing and conducting a Monte Carlo simulation study.These principles are authenticity, feasibility, and reproducibility.
The authenticity of a Monte Carlo simulation study refers to the degree to which the simulation study reflects the real conditions.For example, assume that a researcher wants to investigate the impact of test length on ability estimates obtained from a particular IRT model.The researcher selects 30, 60, 90, and 300 items as the hypothesized values for the test length factor.Because a 300-item test is quite unlikely to occur in real life, the researcher should probably consider eliminating this option from the simulation study.The authenticity of a Monte Carlo simulation study is also related to the necessity of the simulation factors.Continuing with the same example, the researcher might consider sample size as a potential factor for the simulation study on the recovery of ability estimates; but sample size is known to have no effect on the estimation of ability (or latent trait) when item parameters are already known (e.g., Bulut, 2013;Bulut, Davison, & Rodriguez, 2017).Therefore, the researcher would not need to include sample size as a simulation factor in the study.
The feasibility of a Monte Carlo study refers to the balance between the goals of the simulation study and the scope of the simulation study.The combination of many simulation factors and a high number of replications may often lead to a highly complex simulation study that is hard to complete and summarize within a reasonable period.Therefore, the researcher should determine which simulation factors are essential and how many replications can be accomplished based on the scope of the simulation study.For example, assume that a researcher plans to use test length (10, 20, 40, or 60 items), sample size (100, 250, 500, or 1000 examinees), ability distribution (normal, positively skewed, or negatively skewed), and inter-dimensional correlation (r = .2,r = .5,or r = .7)as the simulation factors.If the simulation factors were fully crossed, then there would be 4 x 4 x 3 x 3 = 144 cells in the simulation design.If the researcher conducted 10,000 replications for each cell, the entire simulation process would result in 1,440,000 unique data sets that need to be analyzed and summarized.Depending on the complexity of the statistical analysis, this simulation study might take several weeks (or possibly months) to complete, even with the parallel computing feature available in R and other statistical software programs.
The reproducibility of a Monte Carlo simulation study refers to the likelihood that the researcher who conducted the simulation study can replicate the same findings at a later time, or that other researchers who have access to the simulation parameters can replicate (or at least approximate) the findings.To ensure reproducibility, the researcher should specify the seed before generating data and store a record of the selected seeds in the simulation study.The researcher can either use the seeds to replicate the findings later on or give them to other researchers interested in replicating their findings (Feinberg & Rubright, 2016).However, it should be noted that even using the same seeds might not guarantee that identical simulation results will be obtained because the mechanism of the random number generator can differ from one computer to another, or across the different versions of the same software program.

METHOD
The following sections of this study will demonstrate three Monte Carlo simulation studies about item parameter recovery, differential item functioning, dimensionality in IRT-based assessment forms.
Each study focuses on a set of research questions in the context of IRT and aims to address the research questions through a Monte Carlo simulation study.Each study consists of three steps: data generation, statistical analysis, and summarizing the simulation results.The implementation of these steps will be demonstrated using R.The readers are strongly encouraged to run the examples in their own computers by copying and pasting the provided R codes into the R console.It should be noted that most R packages are regularly updated by their creators and/or maintainers, and thus the R functions presented in this study are subject to change in the future.Therefore, we recommend the readers to check out the R packages used in this study before using the Monte Carlo simulation functions.For this study, we use the latest version of Microsoft R Open (version 3.4.0).The readers are strongly encouraged to use this particular version of Microsoft R Open to ensure the reproducibility of the Monte Carlo studies presented in the following sections.

Simulation Study 1: Item Parameter Recovery in IRT
In this study, we aim to investigate to what extent the accuracy of estimated item parameters in the unidimensional three-parameter logistic (3PL) IRT model depends on the number of examinees who respond to the items (sample size) and the number of items (test length).In addition, we want to find out which item parameter (item difficulty, item discrimination, and guessing) is the most robust against changes in sample size and test length.To address these research questions, we design a small-scale Monte Carlo simulation study in which sample size and test length are the two simulation factors.The simulation study will be based on a fully crossed design with three sample sizes (500, 1000, or 2000 examinees) and three test lengths (10, 20, or 40 items), resulting in 3 x 3 = 9 cells in total.For each cell, 100 replications will be conducted with unique item parameters and person abilities in each replication.For the evaluation of the recovery of true item parameters, we use bias and RMSE: where K is the total test length,  ̂ is the estimated item parameter for item i (i = 1, 2, …, K), and   is the true item parameter for item i.The average bias and RMSE values over 100 replications will be reported for each of the nine simulation cells.

Data generation
The item response data will be simulated using the 3PL model.The mathematical formulation of the 3PL model can be shown as follows: where   (θ) is the probability of an examinee with the ability of θ responding to item i correctly, bi is the item difficulty parameter for item i,   is the item discrimination parameter for item i,   is the lower asymptote (also known as the pseudo-guessing parameter), e is the base of the natural logarithm approximated at 2.178, and D is a constant of 1.7 to transform the logistic IRT scale into the normal ogive scale (Camilli, 1994;Crocker & Algina, 1986).
Following the suggestions from previous studies regarding data simulation for the 3PL model (e.g., Harwell & Baker, 1991;Feinberg & Rubright, 2016;Mislevy & Stocking, 1989;Mooney, 1997), the item difficulty parameters are drawn from a normal distribution, ~(0, 1); the item discrimination parameters are drawn from a log-normal distribution, ~(0.3,0.2); and the lower asymptote parameters are drawn from a beta distribution, ~(20, 90).Furthermore, the ability parameters are drawn from a standard normal distribution, θ~(0, 1).For simulating dichotomous item responses and estimating the item parameters based on the 3PL model, we will use the mirt package (Chalmers, 2012) in R. To install and activate the mirt package, the following commands should be first run: install.packages("mirt") library("mirt") Then, we define a simulation function called itemrecovery, which generates item parameters, simulates dichotomous item response data using the generated item parameters, estimates the item parameters of the 3PL model for the simulated data, and finally computes bias and RMSE values for each set of estimated item parameters: In the itemrecovery function, there are three input values that need to be specified by the researcher: nitem as the number of items (i.e., test length), sample.sizeas the number of examinees (i.e., sample size), and seed as the seed for the random number generator.The itemrecovery function begins with setting the seed for the random values that are going to be generated, which will ensure reproducibility of the simulated data.Next, the item parameters are randomly generated based on the distribution characteristics explained earlier using the rlnorm, rnorm, and rbeta functions.Each set of the generated parameters (called a, b, and c) is saved as a matrix with a single column.The simdata function from the mirt package simulates dichotomous item responses according to the 3PL model using the generated item parameters.More details about the simdata function can be obtained by running the ?simdata command in the R console.Then, we estimate item parameters using the mirt function, extract the estimated item parameters from the model using the coef function, and save the parameters in a data frame called parameters.More details about the estimation process in the mirt function can be obtained by running the ?mirt command in the R console.At the end of the function, we compute the bias and RMSE values for each item parameter, save the values into a data set called result, and return the result data set as the outcome of the simulation.To enable the itemrecovery function, we can either copy and paste the entire function into the R console and hit the "enter" button, or select all the entire function in the R script file, right-click on the selected lines, and choose "Run line or selection" to execute the commands in the R console.
The next step is to conduct the simulation study using the itemrecovery function.First, we randomly generate 100 integers to be used as the random seeds in the study.We sample random integers ranging from 0 to 1,000,000 using the sample.intfunction and store the generated values in a data set called myseed.We export the seeds into a text file called "simulation seeds.txt".This file will be saved in the current working directory designated by the user.To save the document in a specific folder, a complete folder path should be provided, such as "C:/Users/username/Desktop/simulation seeds.txt".Note that when defining a folder path, forward slash (/) should be used instead of a backslash (\).Next, we define an empty data set (i.e., result) that will store the simulation results out of 100 replications.The final step of the simulation study is to run the simulation within a loop and save the results into the result data set.for (i in 1:length(myseed)){ } creates a loop to run a procedure 100 times (i.e., the same length of myseed).In this study, we want to run the itemrecovery function 100 times using a different seed from myseed for each replication.Once all the iterations are complete, the result data set will consist of one hundred rows (one row per iteration).At the end, we use the colMeans function to find the average bias and RMSE values across 100 replications.Because we use the round(colMeans(result),3), all of the values will be rounded off to three decimal digits.For each of the nine cells planned for this simulation study, we set the nitem and sample.sizevalues accordingly and re-run the R script presented above.When the simulations for all of the cells are complete (i.e., the R script has been run nine times in total), we report the results shown in Table 1.The simulation results show that the lower asymptote (i.e., guessing) parameter had the smallest bias and RMSE, whereas the item discrimination parameter indicated the largest bias and RMSE.As sample size and test length increased, bias and RMSE decreased for all of the item parameters.In addition to Table 1, we can also present the simulation results graphically using the lattice package (Sarkar, 2008) in R. First, we manually enter the simulation results in Table 1 into an empty Excel spreadsheet using a long format and save the spreadsheet with a .csvextension using the "Save As" option under the "File" menu.The saved file is called "simulation results.csv".Figure 1 shows a screenshot of the "simulation.csv"file.

274
After we read the data set "simulation results.csv" in R, we define the two variables (SampleSize and TestLength) as categorical variables by using the as.factor function.Next, we install the lattice package and then activate it using the library command.Finally, we use the xyplot function in the lattice package (Sarkar, 2008) to create an interaction plot.This plot will demonstrate the relationship between bias, RMSE, sample size, and test length for each item parameter.In the xyplot function, we first select the variable for the y axis (either bias or RMSE) and the variable for the x axis (test length), the variable that defines multiple panels (sample size), and the group variable (parameters).In addition, xlab defines the label for the x axis, type = "a" indicates an interaction plot, the elements in auto.keydefine the position of the legend, whether or not data points should be shown, whether or not lines should be shown, and the number of columns for the legend, and the elements in par.settingsdefines the colours (lty) and thickness (lwd) of the lines in the plot.The details of the xyplot function can be obtained by running the ?xyplot command in the R console.Figures 2 and 3 show the interaction plots for bias and RMSE, respectively.The second simulation study aims to investigate the detection of differential item functioning (DIF) in the context of multidimensional IRT models.In educational testing, DIF occurs when the probability of responding to a dichotomous item correctly varies between focal and reference groups (e.g., male and female students), after controlling for examinees' ability levels.If the item is polytomous, then the probabilities of obtaining no credit, a partial credit (e.g., 1 point in a two-point item), or a full credit (e.g., 2 points in a two-point item) are expected to differ between the focal and reference groups, after controlling for examinees' ability levels.There are two types of DIF in test items: uniform and nonuniform DIF.If the focal group consistently underperforms or outperforms the reference group, then the item is flagged for having uniform DIF.If, however, the direction of bias changes between the focal and reference groups along the ability continuum, the item is flagged for having nonuniform DIF (Lee, Bulut, & Suh, 2016).
There are many methods in the literature to detect uniform and nonuniform DIF in the context of unidimensional IRT models.These methods include the Mantel-Haenszel method (Mantel & Haenszel, 1959), simultaneous item bias test (SIBTEST; Shealy & Stout, 1993), Raju's differential functioning of items and tests (DFIT; Raju, van der Linden, & Fleer, 1995); and the multiple indicators multiple causes (MIMIC) model (Finch, 2005;Woods & Grimm, 2011).However, when the definition of DIF is extended to a multidimensional assessment that simultaneously measures two or more abilities, there are only a few DIF methods in the literature, such as multidimensional MIMICinteraction model (Lee et al., 2016), IRT likelihood ratio test (Suh & Cho, 2014), and multidimensional SIBTEST (MULTISIB;Stout, Li, Nandakumar, & Bolt, 1997).
In this Monte Carlo simulation study, we use the IRT likelihood ratio test described by Suh and Cho (2014) for detecting uniform and nonuniform DIF in the context of multidimensional Graded Response Model (MGRM).The mathematical formulation of MGRM for a polytomous item with  + 1 response categories on an M-dimensional test becomes:

276
where   * () is the probability of selecting the response option k ( = 1, … , ) in item i for an examinee with the ability vector of  = [ 1 , … ,   ], bik is the boundary parameter of the k th category of item i,   is the item discrimination parameter for item i on dimension m ( = 1, … , ), and D is a constant of 1.7 to transform the logistic IRT scale into the normal ogive scale.
In this study, we assume a two-dimensional test with a simple structure.The consists of 30 items, where the first set of 15 items is loaded on the first dimension and the second set of 15 items is loaded on the second dimension.To simulate polytomous item responses, we use item characteristics similar to those from Jiang, Wang, and Weiss's (2016) recent study on MGRM.Item discrimination parameters are randomly drawn from a uniform distribution, ~(1.1,2.8); the first category boundary parameter is randomly drawn from a uniform distribution,  1 ~(0.67,2) and the other two category boundary parameters are created by subtracting a value randomly drawn from a uniform distribution,  2 =  1 − (0.67,1.34) and  3 =  2 − (0.67,1.34).In addition, the ability parameters are drawn from a multivariate normal distribution, ~(0, ) where is  the variancecovariance matrix of the abilities.
Three simulation factors are manipulated in this study: sample size, DIF magnitude, and interdimensional correlation.Sample size is manipulated for the reference (R) and focal (F) groups as R1000/F200, R1500/F500, and R1000/F1000.DIF magnitude is manipulated as 0, 0.3, or 0.6 logit difference for uniform and nonuniform DIF.DIF magnitude is added to the category boundary parameters for uniform DIF and to the item discrimination parameters for nonuniform DIF.We assume that the focal group is at a disadvantage due to uniform or nonuniform DIF.Inter-dimensional correlation refers to the correlation between the two ability dimensions.Inter-dimensional correlation is manipulated as ρ=0, ρ=.3, or ρ=.5.Based on the value of the inter-dimensional correlation, the offdiagonal elements of the variance-covariance matrix () are replaced with ρ, while the diagonal elements remain as "1".
For simulating polytomous item responses, estimating the item parameters based on MGRM, and running the IRT likelihood ratio tests, we will use the MASS package (Venables & Ripley, 2002) and the mirt package (Chalmers, 2012) in R (R Core Team, 2017).In addition, we will use the doParallel package (Revolution Analytics & Weston, 2015) to benefit from the parallel computing to further speed up the estimation process.To install and then activate these packages, the following commands should be first run in the R console:

install.packages("doParallel") library("doParallel") library("mirt") library("MASS")
Next, we define a function called detectDIF, which generates item parameters and simulates polytomous item responses based on MGRM, estimates the item parameters using the simulated data, runs IRT likelihood ratio tests to detect uniform and nonuniform DIF on a particular set of items, and computes the true positive rates (i.e., power) and false positive rates (i.e., Type I error) as the evaluation criteria.
The detectDIF function requires four input values: sample.sizedefines the size of reference and focal groups (e.g., sample.size= c(1000, 200) for the reference group of 1000 examinees and the focal group of 200 examinees); DIF.size defines the magnitude of uniform DIF and nonuniform DIF (e.g., DIF.size = c(0.3,0) for 0.3 difference in the category threshold parameters as uniform DIF and DIF.size = c(0, 0.3) for 0.3 difference in the discrimination parameters as nonuniform DIF); cor specifies the correlation between the two ability dimensions (e.g., cor = 0.5 for a correlation of 0.5 between the two abilities); and seed is the user-defined seed for the data generation process.Based on the selected input values, the function generates a 30-item, polytomously-scored test in which items 1, 7, 15, 16, 23, and 30 are tested for uniform and nonuniform DIF.These items are particularly selected because they represent a combination of low-, medium-, and high-difficulty as well as low-, medium-, and high-discrimination parameters.

277
The IRT likelihood ratio test examines the likelihood difference between two nested IRT models based on a chi-square test with degrees of freedom equal to the difference between the numbers of estimated item parameters between the two IRT models (see Suh and Cho [2014] for more details on this procedure).To investigate uniform DIF with the IRT likelihood ratio test, we first estimate mod0 that assumes that all items except for items 1,7,15,16,23,30 are invariant between the focal and reference group.Next, we estimate mod1 that constrains the category boundary parameters b1, b2, and b3 to be equal between the focal and reference group for each of the six DIF items and tests whether there is a significant change in the model likelihood due to the equality constraints.Significant likelihood changes from mod0 to mod1 indicate that the items being tested exhibit uniform DIF.Nonuniform DIF is examined by comparing mod0 against mod1 and mod2, which constrain the item discrimination parameters a1 and a2 to be equal between the focal and reference groups, respectively.Significant likelihood changes between these models indicate that the items being tested exhibit nonuniform DIF.
The detectDIF function returns a data frame in which the aforementioned six items and the p-values from the IRT likelihood ratio test for each item are listed.When one of the input values for DIF.size is larger than zero, the function returns the p-values for detecting DIF correctly (i.e., power) for the six items listed above.If, however, DIF.size = c(0, 0), then the function returns the p-values for detecting DIF falsely (i.e., Type I error) for the items.In this study, we assume that DIF occurs either in the category boundary parameters or in the item discrimination parameters.Therefore, one of the values in DIF.size will always be zero when running the analysis for power.#Uniform and nonuniform DIF for items 1, 7, 15, 16, 23, and 30 b.foc <-b.ref b.foc[c(1,7,15,16,23,30),] <-b.foc [c(1,7,15,16,23,30) foc[c(16,23,30),2] <-a.foc [c(16,23,30) (1,7,15,16,23,30) For this study, we use 30 replications, as in Jiang et al.'s (2016) simulation study with MGRM.Despite using only 30 replications, this simulation study is more complex compared to the first simulation study presented earlier because in addition to estimating item parameters from a two-dimensional MGRM, we conduct a series of IRT likelihood ratio tests to examine uniform and nonuniform DIF across six items (items 1, 7, 15, 16, 23, and 30).To increase the speed of the entire simulation process, we use the doParallel package.First, we generate a set of 30 random seeds ranging from 0 to 10000 and save the generated seeds in a data set called "myseed".Next, using a computer with a multi-core processor, we allocate multiple cores for our simulation study.To check the number of processors in a computer, the researcher can first run detectCores().To assign a particular number of cores, registerDoParallel() should be used.For example, to allocate 8 cores for the simulation study, registerDoParallel(8) should be used.Once this command is executed, the simulation process can be completed using 8 cores rather than a single core, which is the default setting in R.Although it is possible to use all available cores in a computer, this could be problematic because using all available cores would slow down the operation of the computer significantly, especially when performing other tasks to in addition the simulation study.The parallel computing is particularly useful when an iterative computing processsuch as a simulation studyis implemented.Because the current simulation study requires 30 replications, estimating multiple replications simultaneously is expected to reduce the duration of simulation significantly (e.g., with 8 cores, it would be theoretically 8 times faster than a regular estimation with a single core).

279
Once the parallel computing process is set up, the final step of this simulation study is to run the simulation study by changing the input values in the detectDIF function.The foreach function from the doParallel package is used for creating a loop of 30 replications.After each replication, the results will be combined in a single data set called result.We use the ifelse function to create a binary variable for the items in the result data set that have a p-value less than .05(i.e., significant DIF based on the IRT likelihood ratio test).The mean function will return the proportion of the items that have indicated significant DIF (i.e., average power or Type I error depending on the condition).The following R commands demonstrate an example for running a particular combination of the simulation factors.For each replication, a random seed from myseed is selected.Then, the detectDIF function is executed using sample sizes of 1000 and 200 for the reference and focal groups, the DIF magnitude of 0.3 for uniform DIF, and a correlation of ρ=.3 between the two ability dimensions.After 30 replications are complete, the average proportion of significant IRT likelihood ratio tests across six items is reported with two decimal points (using the round function).Table 2 shows a summary of the findings across all simulation factors.The results show that sample size and DIF magnitude are positively associated with the power of the IRT likelihood ratio test when detecting uniform and nonuniform DIF in MGRM.As sample size and DIF magnitude increased, power rates of the IRT likelihood ratio for detecting both uniform and nonuniform DIF test also increased.Unlike sample size and DIF magnitude, the effect of inter-dimensional correlation does not seem to be consistent regarding power rates.When DIF magnitude is large (i.e., 0.6), the correlation between the two dimensions has no effect on power rates for uniform DIF.However, the correlation between the dimensions affects power rates slightly for nonuniform DIF.Overall, the IRT likelihood ratio test appears to detect uniform DIF more precisely than nonuniform DIF.Type I error rates appear to be reasonable, although small sample size condition (R1000/F200) seems to have higher Type I error rates than the other two sample size conditions.As for the power rates, the effect of interdimensional correlation is not consistent regarding Type I error rates.The results in Table 2 can also be summarized with a scatterplot to demonstrate the relationships between the simulation factors more visually (see the graphical example in Simulation Study 1 for more details.).

Summary
This simulation study investigated the effects of sample size, DIF magnitude, and inter-dimensional correlation in detecting uniform and nonuniform DIF for a multidimensional polytomous IRT model (i.e., MGRM).Power rates in detecting DIF correctly and Type I error rates in detecting DIF falsely are used as the evaluation criteria.For the demonstration purposes, we only used 30 replications but the number of replications could be easily increased with the help of parallel computing.This would result in more reliable simulation results.Future studies can expand the simulation factors of the current study to have a more comprehensive analysis.For example, different values for DIF magnitude, sample size, and inter-dimensional correlations can be used.Furthermore, new simulation factors can be included.For example, three or higher dimensional structures can be used to see the impact of the number of dimensions, which would also allow examining the impact of varying interdimensional correlations.In addition, instead of a simple structure, a complex test structure with items associated with multiple dimensions can be assumed.

Simulation Study 3: Investigating Unidimensionality
The third simulation study aims to investigate test dimensionality.Unidimensionality is essential for the test theories like as Classical Test Theory or Item Response Theory.Therefore, the investigation of test dimensionality is very important.The unidimensionality assumption requires that there is a single latent trait underlying a set of test items.As Hambleton, Swaminathan, and Rogers (1991) pointed out, the unidimensionality assumption may not hold for most measurement instruments in education, psychology, and other social sciences due to complex cognitive and non-cognitive factors, such as motivation, anxiety, and ability to work quickly.Therefore, we can expect that at least one minor extra factor confounds unidimensionality.However, finding a major component or a factor underlying the data is adequate to meet the unidimensionality assumption.
Monte Carlo studies can be very convenient for investigating the factors affecting unidimensionality under various conditions.In the following Monte Carlo study, we aim to examine the impact of sample size, the number of items associated with a secondary (nuisance) dimension, and inter-dimensional correlations on the detection of unidimensionality.Two-dimensional response data with a simple structure are generated.While most items are assumed to be associated with the first ability dimension, the number of items (10, 20, or 30 items) associated with a secondary dimension is manipulated as a simulation factor.Second, the correlation between the two ability dimensions (ρ=.3, ρ=.6, or ρ=.9) is manipulated.As the third simulation factor, sample size (500, 1000, or 3000 examinees) is modified because sample size is considered an important factor for the accuracy of dimensionality analyses.The three simulation factors are fully crossed, resulting in 3 x 3 x 3 = 27 cells in total.One hundred replications are conducted for each cell.
A multidimensional two-parameter logistic IRT model (M2PL) is used for data generation.The M2PL model can be written as follows: where   () is the probability of responding to item i correctly for an examinee with the ability vector of  = [ 1 , … ,   ],   is the discrimination parameter of item i related to ability dimension m (m = 1, 2, …, M), and   is the difficulty parameter of item i.In this study, the item discrimination parameters are randomly drawn from a uniform distribution, ~(1.1,2.8), the item difficulty parameters are randomly drawn from a uniform distribution, ~(0.67,2.00), and the ability values are obtained from a multivariate normal distribution, ~(0, ) where is  the variancecovariance matrix of the abilities.Each generated data set is analyzed with NOHARM explanatory factor analysis (McDonald, 1997), which is an effective method for finding the number of underlying dimensions in item response data (Finch & Habing, 2005).NOHARM is implemented with one factor restriction using the sirt package (Robitzsch, 2017).The average root mean square error of approximation (RMSEA) is used as the evaluation criterion.RMSEA values smaller than .05are usually considered a close fit, whereas RMSEA values equal or greater than .10 are considered a poor fit (Browne & Cudeck, 1993;Hu & Bentler, 1999).
For this study, we define a function called detectDIM, which draws item difficulty and item discrimination parameters according the distributions explained earlier, simulates two-dimensional

281
response data with a simple structure based on the M2PL model, fits an exploratory factor model with a one-factor restriction (i.e., unidimensional model) to the simulated data, and extracts the RMSEA value as the evaluation criterion.Before using the detectDIM function, the following packages must be installed and activated: install.packages("sirt") library("sirt") library("mirt") library("MASS") The detectDIM function requires five input values: sample.sizefor the number of examinees, testLength1 for the number of items with a non-zero loading on the first ability dimension, testLength2 for the number of items with a non-zero loading on the second ability dimension, cor for the correlation between the two ability dimensions, and seed for setting the random number generator.The detectDIM function is shown below:

3.
Multiple data sets are generated based on the simulation factors and the assumptions about the nature of the data.The process of generating multiple data sets is often called replication.Monte Carlo simulation studies often involve multiple replications (a) to acquire the sampling distribution of parameter estimates, (b) to reduce the chance of obtaining implausible results from a single data set, and (c) to have the option to resample the true parameters based on the assumption made in Step 2.

Figure 1 .
Figure 1.A Screenshot of the First Nine Rows of the "simulation results.csv"File

#
Figure 2. The Interaction Plot for Bias, Test Length, and Sample Size

Figure 3 .
Figure 3.The Interaction Plot for RMSE, Test Length, and Sample Size

Bulut, O., Sünbül, Ö. / Monte Carlo Simulation Studies in Item Response Theory with the R Programming Language ___________________________________________________________________________________ ___________________________________________________________________________________________________________________
ISSN: 1309 -6575 Eğitimde ve Psikolojide Ölçme ve Değerlendirme Dergisi Journal of Measurement and Evaluation in Education and Psychology

Table 1 .
Simulation Results from the Item Parameter Recovery Study

Journal of Measurement and Evaluation in Education and Psychology ___________________________________________________________________________________
___________________________________________________________________________________________________________________ ISSN: 1309 -6575 Eğitimde ve Psikolojide Ölçme ve Değerlendirme Dergisi Journal of Measurement and Evaluation in Education and Psychology

Bulut, O., Sünbül, Ö. / Monte Carlo Simulation Studies in Item Response Theory with the R Programming Language ___________________________________________________________________________________
___________________________________________________________________________________________________________________ ISSN: 1309 -6575 Eğitimde ve Psikolojide Ölçme ve Değerlendirme Dergisi Journal of Measurement and Evaluation in Education and Psychology

Table 2 .
Power and Type I Error Rates for Detecting DIF in MGRM

/ Monte Carlo Simulation Studies in Item Response Theory with the R Programming Language ___________________________________________________________________________________
Bulut, O., Sünbül, Ö. ___________________________________________________________________________________________________________________ ISSN: 1309 -6575 Eğitimde ve Psikolojide Ölçme ve Değerlendirme Dergisi Journal of Measurement and Evaluation in Education and Psychology