Drawing a Sample with Desired Properties from Population in R Package “drawsample”

The aim of this study is to develop an R package called drawsample, which is used to draw samples with desired properties from a real data set. A sample with the desired properties can be drawn by purposive sampling after specifying several conditions, such as deviation from normality (skewness and kurtosis) and sample size. Different applications of the package drawsample are illustrated using real data from the “Science and Technology (Score_1)” and “Social Studies (Score_2)” subtests of the 6th Grade Public Boarding and Scholarship Examinations (PBSE). As the importance given to research with real data has increased in recent years, a good approach would be to draw a sample from the population. With this package, researchers are expected to be able to draw samples as close as possible to the desired properties from a population or a large sample. Samples drawn from real data with the package drawsample are expected to provide an alternative, as well as a complement, to simulation studies.


INTRODUCTION
In the field of measurement and evaluation in education and psychology, the distribution of scores plays an important role in describing groups. In addition, testing for normality is crucial because many procedures of statistical inference are based on the assumption of normality. However, as Erceg-Hurn and Mirosevich (2008) pointed out, the assumption of normality is rarely met when analyzing real data. In applications, therefore, non-normal distributions are more common than normal distributions (Blanca, Arnau, López-Montiel, Bono, & Bendayan, 2013; Geary, 1947; Micceri, 1989; Olivier & Norberg, 2010; Pearson, 1932). Because the normality assumption so often fails, violations of normality and distribution types have been the focus of many researchers working on important issues such as test equating, computer adaptive testing, differential item functioning, classification, and latent score estimation (Custer, Omar, & Pomplun, 2006; Finney & DiStefano, 2006; Gotzmann, 2011; Kieftenbeld & Natesan, 2012; Kirisci, Hsu, & Yu, 2001; Kolen, 1985; Kogar, 2018; Seong, 1990; Uysal, 2014; Yıldırım, 2015).
In the process of collecting data in a study, researchers may obtain different types of distributions. For example, mathematics achievement scores in selection exams usually deviate from a normal distribution and are skewed to the right (Ministry of National Education-MoNE, 2020; Student Selection and Placement Center-SSCP, 2019). Suppose a researcher plans to investigate how mathematics scores from a selection exam relate to antecedent and subsequent factors, and the intended statistical analysis requires the normality assumption. The researcher could not make use of the data because the results would be questionable. Since a random sample selected from these data would also be skewed to the right, drawing such a sample from this population would not solve the problem either. The scenario may also be the opposite. For example, researchers may aim to test violations of the normality assumption in a psychometric analysis, while the data they collected are normally distributed.
In empirical research, the process of data collection is challenging. The sample may not be representative of the population distribution; alternatively, it may not be normally distributed, or it may be unsuitable for the desired distribution. The literature contains many studies in which the data set was manipulated to meet the assumption of normality or another desired distribution. For example, Gelbal (1994), in accordance with the purpose of his research, examined the test scores of approximately two thousand fifth-grade students who took both a Turkish language test and a mathematics test. To obtain the desired distributions, approximately five hundred students were removed from each test. Doğan and Tezbaşaran (2003) selected participants with the required attributes to ensure the desired distribution. The researchers stated that random and purposive sampling techniques were used in the selection of the samples. The students were drawn from a population consisting of students who had taken the Secondary Education Institutions Student Selection and Placement Examination in 2001. The samples drawn were random, right-skewed, left-skewed, flattened, and normally distributed, ranging in size from 2,353 to 29,244. In the skewed samples, the absolute values of skewness (±1.00) and kurtosis (1.37) were kept equal across samples to increase the accuracy of comparisons. Similarly, Şahin and Yıldırım (2018), in obtaining ability parameters, chose both right-skewed and left-skewed ability distributions from real data. The real data were obtained from the mathematics subtest of the Placement Test (SBS) administered in 2012. The right-skewed samples were selected randomly because the original data set was already right-skewed (skewness = 1.05). For the left-skewed data sets, the intended sample distribution was achieved through purposive sampling, and groups whose skewness was approximately -1.00 were chosen for all samples.
In addition to the above, many researchers in the literature have chosen to draw samples from a real data set (population) in accordance with the purpose of their studies (Courville, 2004; Doğan & Kılıç, 2018; Fan, 1998; Nartgün, 2002; Reyhanlıoğlu Keçeoğlu, 2018). For future studies that sample from a population, it is important to have a function that makes sample selection easier and brings the sample closer to the desired properties. In fact, some research in the literature suggests studying non-normal ability distributions or samples with different levels of ability (Çelikten & Çakan, 2019). An examination of these studies shows a need for a tool that enables researchers to draw samples with the desired properties from a large data set.

Purpose of the Study
In this study, the package drawsample was developed. It draws a sample based on total score or ability parameter information, in accordance with the desired sample size and deviation from normality (skewness and kurtosis). With this package, researchers are expected to be able to draw samples as close as possible to the desired properties from a population or a large sample, which should pave the way for studies on different distribution-related topics in the literature. The package makes it possible for researchers to draw samples with desired properties from large data sets in order to conduct statistical analyses under different conditions.

Fleishman's Power Method
In this section, Fleishman's (1978) power method, which is used to obtain the desired measures of deviation from normality (skewness and kurtosis), is explained briefly. Fleishman (1978) used a cubic transformation of a standard normal variable to create a distribution with pre-specified moments. Fleishman's (1978) power method, $Y = a + bX + cX^2 + dX^3$, was used to generate a non-normal distribution, where $Y$ is a non-normal deviate with specified skewness and kurtosis, $X$ is a standard normal deviate, and $a$, $b$, $c$, and $d$ are constants for transforming the standard normal variable into a variable with known skewness and kurtosis (Kirisci, 2001). For the normal distribution, these constants are 0.0, 1.0, 0.0, and 0.0, respectively ($Y = X$). Fleishman (1978) tabulated these coefficient values for selected skewness and kurtosis values. In writing the function in R, the values in this table were used to obtain the non-normal distributions. The values in this table can also be obtained with the find_constants() function in the "SimMultiCorrData" (Fialkowski, 2018) package in R. The find_constants() function calculates Fleishman's third-order or Headrick's (2002) fifth-order constants, which convert a standard normal random variable into a continuous variable with a given skewness and standardized kurtosis. A usage example for a skewness of 0 and a standardized kurtosis of 0 is given in Table 1.
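The cubic transformation described above can be sketched in a few lines of base R. This is only an illustration, not the package's code: the helper name fleishman_transform() is hypothetical, the default a = -c reflects the usual zero-mean constraint of the method, and the second set of constants is arbitrary rather than taken from Fleishman's table.

```r
# Hypothetical helper: Fleishman-type cubic transformation of standard normal draws.
# By default a = -c, which keeps the mean of Y at zero.
fleishman_transform <- function(x, b, c, d, a = -c) {
  a + b * x + c * x^2 + d * x^3
}

set.seed(123)
x <- rnorm(1000)  # standard normal deviates

# Normal case from the text: constants 0, 1, 0, 0 give Y = X
y_normal <- fleishman_transform(x, b = 1, c = 0, d = 0)

# Arbitrary illustrative constants (NOT values from Fleishman's table):
# a positive quadratic term pushes the distribution to the right-skewed side
y_skewed <- fleishman_transform(x, b = 0.9, c = 0.2, d = 0.01)

# Moment-based skewness for a quick check
skew_of <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}
c(normal = skew_of(y_normal), skewed = skew_of(y_skewed))
```

In practice, the constants would come from Fleishman's table (or from a solver such as find_constants()) rather than being chosen by hand.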

Because calling the function used in the example in Table 1 each time lengthens processing, an R object named "constants_table" was created from the values obtained with this function.

Skewness and Kurtosis Statistics
The first four moments of a distribution, namely the mean, variance, skewness, and kurtosis, are the most important characteristics of frequency distributions (D'Agostino, Belanger, & D'Agostino, 1990).
The skewness and kurtosis statistics, which are based on the third and fourth moments, are given in Equations 1 and 2. These equations are used routinely; for example, SAS and SPSS use them to report skewness and kurtosis in their descriptive statistics output (D'Agostino, Belanger, & D'Agostino, 1990).
There are many R packages to calculate the skewness and kurtosis values. In this study, the describe() function in the psych package was used to calculate skewness and kurtosis values. Table 2 shows the example of calculating descriptive statistics of the vectors of "normal_dis" and "skew_dis" generated by rnorm() and rbeta() functions, respectively. As shown in Table 2, the describe() function has 13 different outputs. From the output of this function, the skewness and kurtosis values can be extracted, as shown in Table 3.
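As a rough, self-contained illustration of this step, the sketch below computes moment-based skewness and excess kurtosis in base R for two vectors named as in the text. The rbeta() shape parameters are assumptions (the source does not state them), and the simple moment estimators used here can differ slightly from the values reported by describe(), which applies its own small-sample conventions.

```r
set.seed(2020)
normal_dis <- rnorm(1000, mean = 0, sd = 1)
skew_dis   <- rbeta(1000, shape1 = 2, shape2 = 8)  # assumed shapes; right-skewed

# Moment-based skewness: third central moment over variance^(3/2)
skewness_m <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

# Moment-based excess kurtosis: fourth central moment over variance^2, minus 3
kurtosis_excess <- function(v) {
  m <- mean(v)
  mean((v - m)^4) / mean((v - m)^2)^2 - 3
}

stats <- rbind(
  normal_dis = c(skew = skewness_m(normal_dis), kurtosis = kurtosis_excess(normal_dis)),
  skew_dis   = c(skew = skewness_m(skew_dis),   kurtosis = kurtosis_excess(skew_dis))
)
round(stats, 2)
```

With the psych package installed, the corresponding values would be read from the "skew" and "kurtosis" columns of the describe() output.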

Drawing Samples
The most commonly used function for selecting samples in R is the sample() function in the base package. This function takes a sample of the specified size from a given vector, either with or without replacement. In this study, the sample_n() function of the dplyr package (Wickham, François, Henry, & Müller, 2019) is used to select samples. The sample_n() function has arguments similar to those of the sample() function in the base package. The sample() function works with vectors, while the sample_n() function works with data sets. The sample_n() function has the "weight" argument instead of the "prob" argument of the sample() function. The value of the "weight" argument can be any column in the data set or data frame. To demonstrate the use of the sample_n() function, the "example1" data set, consisting of four variables with 100 observations, was created. The variables in "example1" are "id," "gender," "math_score," and "science_score." To create a new data frame in which students with higher science scores are more likely to appear, the "weight" argument was set to this variable (science_score). Table 4 shows an example of using the sample_n() function. In Table 4, the "example1" data set was created, and summary information about it was printed. In creating the "example2" data set, the students were weighted according to the "science_score" variable, and the sample was drawn. The summary information for "example2" shows that the minimum, quartiles, median, and mean of "science_score" are higher than in "example1".
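The weighted selection described above can be sketched with base R alone, so that the example runs without dplyr. The simulated scores are assumptions made for illustration; the sample() call with "prob" plays roughly the role of sample_n(example1, 50, weight = science_score).

```r
set.seed(42)

# Simulated stand-in for the "example1" data set (values are illustrative)
example1 <- data.frame(
  id            = 1:100,
  gender        = sample(c("F", "M"), 100, replace = TRUE),
  math_score    = round(runif(100, 0, 100)),
  science_score = round(runif(100, 1, 100))
)

# Draw 50 rows without replacement; rows with higher science scores
# are more likely to be selected because they carry larger weights
rows <- sample(nrow(example1), size = 50, replace = FALSE,
               prob = example1$science_score)
example2 <- example1[rows, ]

summary(example1$science_score)
summary(example2$science_score)
```

Because the weights only change selection probabilities, the summaries of "example2" tend to be shifted upward, but any single draw can deviate from that tendency.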
In the drawsample package, the draw_sample() function was developed to obtain a sample with the desired sample size and distribution properties (skewness and kurtosis). The code of this function is explained below.

R CODE FOR draw_sample() FUNCTION
The draw_sample() function, which has 6 arguments, was written to draw a sample with the desired properties. The arguments of the function are given in Table 5. When choosing the "skew" and "kurts" arguments in Table 5, Fleishman's Power Method Weights table must be consulted, because Fleishman coefficients are absent for some combinations, such as a skewness of 1 with a kurtosis of 0. The minimum and maximum kurtosis values that can be used with each skewness coefficient, derived from Fleishman's (1978) Power Method Weights table, are presented in Table 6. For example, if the skewness coefficient is set to 2, the kurtosis coefficient must be between 5 and 20.
R commands for the draw_sample() function are given in Table 7. In this function, the value of the "dist" argument must be a data frame with two columns: student IDs in the first column and student total test scores or abilities (thetas) in the second column. With the names(dist) command, the columns of the imported object are named "id" and "x" (Table 7, Line 8). Then, the variable "x" is extracted in Line 10, so that "x" becomes a vector, which is more convenient to work with. If the desired sample size, the argument "n", is larger than the length of the data, the function gives the following error: "Cannot take a sample larger than the length of the data". For example, if the imported data contain 1,000 observations and the user requests a sample of 2,000, the function gives the error and stops running (Lines 13 to 16).
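The input handling described above can be sketched as follows. The helper name check_input() is hypothetical and the code is a simplified stand-in, not the package's actual listing; only the column names and the error message are taken from the text.

```r
# Hypothetical helper mirroring the checks described in the text
check_input <- function(dist, n) {
  stopifnot(is.data.frame(dist), ncol(dist) == 2)
  names(dist) <- c("id", "x")  # first column: IDs, second column: scores or thetas
  x <- dist$x                  # scores as a plain vector
  if (n > length(x)) {
    stop("Cannot take a sample larger than the length of the data")
  }
  dist
}

set.seed(1)
scores <- data.frame(student = 1:1000,
                     total   = sample(0:25, 1000, replace = TRUE))

ok <- check_input(scores, n = 500)   # passes: 500 <= 1000
names(ok)

# Requesting 2,000 from 1,000 rows triggers the error; capture its message
msg <- tryCatch(check_input(scores, n = 2000),
                error = function(e) conditionMessage(e))
msg
```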

Journal of Measurement and Evaluation in Education and
The values in Fleishman's (1978)
Before the repeat loop, an empty vector is created to hold a distribution with the skewness and kurtosis values entered by the user. Within the repeat loop (Lines 38 to 53), the reference distribution with these skewness and kurtosis values is formed. First, an object called "reference", normally distributed with a mean of 0 and a standard deviation of 1, is generated (Line 41). The "reference_v2" object is then formed by applying the b, c, and d coefficients in the table to the "reference" object. When the skewness and kurtosis values of "reference_v2" are equal to the values entered by the user, the loop is stopped, and "reference_v2" is assigned to "reference_v3" (Line 50). If the calculated values are not equal to the values defined by the user, "reference_v3" is left empty, and the loop is repeated. The resulting distribution is then rescaled according to the minimum and maximum values of the user's data set (Line 65); this rescaled "reference_v4" distribution forms the basis for the function's work.
The draw_sample() function aims to form a similar distribution from the values in the user's data set, based on the "reference_v4" object. On Lines 67-69, the outputs of hist(reference_v4) are used for this purpose. The starting and ending points of each bar of the histogram are assigned to "x_break", the number of bars in the histogram to "n_break", and the number of elements in each bar to "x_counts".
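A minimal sketch of this reference-distribution step is given below, for the normal case only (constants 0, 1, 0, 0 and targets of 0). It follows the description above rather than the package's listing: the moment estimators, the iteration cap, the linear rescaling, and the assumed score range of 0 to 25 are all illustrative assumptions.

```r
skew_of <- function(v) { m <- mean(v); mean((v - m)^3) / mean((v - m)^2)^1.5 }
kurt_of <- function(v) { m <- mean(v); mean((v - m)^4) / mean((v - m)^2)^2 - 3 }

set.seed(7)
n <- 500
target_skew <- 0; target_kurt <- 0
coef_b <- 1; coef_c <- 0; coef_d <- 0   # normal-case constants from the text

reference_v3 <- NULL
tries <- 0
repeat {
  tries <- tries + 1
  reference <- rnorm(n)  # standard normal draws
  # cubic transformation with a = -c
  reference_v2 <- -coef_c + coef_b * reference +
    coef_c * reference^2 + coef_d * reference^3
  # accept the draw once skewness and kurtosis match to one decimal place
  if (round(skew_of(reference_v2), 1) == target_skew &&
      round(kurt_of(reference_v2), 1) == target_kurt) {
    reference_v3 <- reference_v2
    break
  }
  if (tries >= 10000) stop("No matching reference distribution found")  # safety cap
}

# Rescale linearly to the (assumed) score range of the user's data
x_min <- 0; x_max <- 25
reference_v4 <- (reference_v3 - min(reference_v3)) /
  diff(range(reference_v3)) * (x_max - x_min) + x_min

# Histogram summary that later guides the category-wise sampling
h <- hist(reference_v4, plot = FALSE)
x_break  <- h$breaks          # bar boundaries
n_break  <- length(h$counts)  # number of bars
x_counts <- h$counts          # elements per bar
```

Matching to one decimal place mirrors the rounding of "skew" and "kurts" in the function's first lines; a stricter tolerance would simply make the loop run longer.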
The vector "x" is categorized by "x_break" and stored as "x_v1". The categorized object is added as a new column to the user's data set. The number of individuals in each category is assigned to the "x_n" object. These operations are defined on Lines 71-73. The number of individuals in each category is crucial for determining whether the function can select a sample with the user's desired properties without resampling. When every category in the data set contains at least as many individuals as the corresponding category of the reference distribution, the function can run without resampling, using the default value of the "replacement" argument. This is checked between Lines 73 and 79. If the number of individuals in at least one category of the data set is less than the number in the corresponding category of the reference distribution, the function gives an error: "Cannot take a sample form that data without replacement. Please change replacement = TRUE." In this situation, the function can be used by changing the value of the "replacement" argument. The code up to Line 83 prepares for drawing the sample. The sample itself is drawn in the for loop on Lines 89-105. For data manipulation in the loop, the filter() and sample_n() functions of the dplyr package (Wickham, François, Henry, & Müller, 2019) are used. Two empty matrices are created on Lines 83 and 84: "new_sample" for the scores of the individuals selected in the for loop and "ID_list" for their identity information. In both matrices, the number of rows is the number of categories ("n_break"), and the number of columns is the maximum number of individuals in these categories.
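The category-wise drawing described above can be sketched in base R as follows. This is a simplified stand-in for the package's loop: it uses cut() and sample.int() instead of filter() and sample_n(), collects the selected rows in a list rather than in the "new_sample" and "ID_list" matrices, and relies on simulated data and a simulated reference distribution chosen only for illustration.

```r
set.seed(1)

# Simulated user data: 2,000 students with scores between 0 and 25
dist <- data.frame(id = 1:2000, x = round(runif(2000, 0, 25)))

# Stand-in for the rescaled reference distribution (200 values, roughly normal)
reference_v4 <- pmin(pmax(rnorm(200, mean = 12.5, sd = 3), 0), 25)
h <- hist(reference_v4, plot = FALSE)
x_break <- h$breaks; x_counts <- h$counts; n_break <- length(x_counts)

# Categorize the user's scores with the reference histogram's bar boundaries
dist$x_v1 <- cut(dist$x, breaks = x_break, include.lowest = TRUE)
x_n <- as.vector(table(dist$x_v1))  # individuals available per category

replacement <- FALSE
if (!replacement && any(x_n < x_counts)) {
  stop("Not enough individuals in at least one category; set replacement = TRUE")
}

# Draw, from each category, as many individuals as the reference bar contains
picked <- vector("list", n_break)
for (k in seq_len(n_break)) {
  rows <- which(as.integer(dist$x_v1) == k)
  if (x_counts[k] > 0) {
    picked[[k]] <- rows[sample.int(length(rows), x_counts[k],
                                   replace = replacement)]
  }
}
new_sample <- dist[unlist(picked), c("id", "x")]
nrow(new_sample)
```

Scores that fall outside the reference range receive no category and are therefore never selected, which is why the drawn sample follows the shape of the reference histogram.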
draw_sample <- function(dist, n, skew, kurts, replacement = FALSE, output_name = c("sample", "default")) {
  # rename the data
  skew <- round(skew, 1)
  kurts <- round(kurts, 1)
The output of the function is formed between Lines 141 and 144. The output is a three-component list consisting of the descriptive statistics of the data and the sample, the drawn sample, and histograms of the data and of the sample. The first component, "desc", is a matrix of descriptive statistics formed on Lines 136-139 with the describe() function in the psych package (Revelle, 2018). This matrix includes the mean, standard deviation, skewness, and kurtosis of the population, the sample, and the reference distribution. The second component, "sample", is a tibble (Wickham, Francois, & Müller, 2016) and is built on Lines 112-117. It includes the ids and x scores of the sampled individuals. The third component, "graph", is formed on Lines 120-134 with the histogram() function in the "lattice" package (Sarkar, 2008). It includes two histograms, one for the "population" (imported data) and one for the "sample" (extracted data), and it can be extracted from the output in the same way as the other components.
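To make the shape of this output concrete, the following base-R sketch assembles a list with the same three component names. It is only an analogy: it uses simulated data and a simple random draw, a plain data frame instead of a tibble, hist() objects instead of lattice graphs, and it omits the reference-distribution row of the "desc" matrix.

```r
set.seed(11)
population <- data.frame(id = 1:5000, x = rbinom(5000, size = 25, prob = 0.7))
drawn <- population[sample(nrow(population), 500), ]  # simple random draw for illustration

# Mean, standard deviation, moment-based skewness and excess kurtosis
describe_min <- function(v) {
  m <- mean(v)
  c(mean = m,
    sd = sd(v),
    skew = mean((v - m)^3) / mean((v - m)^2)^1.5,
    kurtosis = mean((v - m)^4) / mean((v - m)^2)^2 - 3)
}

output <- list(
  desc   = rbind(population = describe_min(population$x),
                 sample     = describe_min(drawn$x)),
  sample = drawn,
  graph  = list(population = hist(population$x, plot = FALSE),
                sample     = hist(drawn$x, plot = FALSE))
)

round(output$desc, 2)   # descriptive statistics matrix
head(output$sample)     # ids and scores of the drawn individuals
```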

EXAMPLES WITH REAL DATA
In the examples, related functions and outputs are presented based on the "Science and Technology" and "Social Studies" subtests data of the 6th Grade Public Boarding and Scholarship Examinations (PBSE) in 2013. At the secondary school level, the PBSE test consists of 100 multiple-choice test items, which include 25 items in each subtest (Turkish, Mathematics, Science and Technology, and Social Studies). It was administered in two booklet types, A and B (MoNE, 2013).
In 2013, 242,598 students participated in the PBSE at the 6th-grade level, and 121,523 (50.09%) received booklet A. Of the students, 133,866 (55.18%) were female and 108,732 (44.82%) were male. Within the scope of the study, 5,000 randomly selected students who took booklet A were treated as the "population." Of this group, 2,745 (54.90%) are female. The data were obtained from the Directorate General for Measurement, Assessment and Examination Services of the Ministry of National Education with written permission. The total score distributions for each test were examined, and two data sets were then used for the demonstration. The Science and Technology subtest was chosen as an example of a left-skewed distribution, and the Social Studies subtest as an example of a platykurtic distribution. In each example, a sample of 500 students was drawn from the population for the related subtest, with attention to whether the samples had the desired distributional properties. The functions and outputs for this process are given in Tables 8-14 and Figures 1-3.
In the first two examples, samples with a normal distribution and with a negatively skewed and leptokurtic distribution, respectively, were drawn from the Science and Technology subtest data. The command for the first example is shown in Table 8.
[,c(1,2)], n=500, skew = 0, kurts = 0, output_name = c("sample","1"))
First, the package drawsample is installed and loaded. Then the object "example_data", which is provided automatically by the package, is loaded. It has three columns containing the IDs and the total scores on the PBSE subtests of 5,000 students: the first column contains the IDs (1:5000), the second column contains "Score_1" (Science and Technology subtest), and the third column contains "Score_2" (Social Studies subtest).

As seen in Figure 1, the distribution of the sample drawn from the total scores of the Science and Technology subtest was very close to the normal distribution. The command for the second example is shown in Table 10. In the second example, unlike the first, the values of skew and kurts are changed to -1 and 2, respectively.
2)], n=500,skew = -1, kurts = 5, output_name = c("sample","2"))
Table 11 shows the descriptive statistics of the distribution of 500 students drawn from the total score distribution of Science and Technology and the output for some of the students in the drawn sample.
The next two examples with real data draw a sample with a right-skewed and leptokurtic distribution (skewness = 1.5 and kurtosis = 3) from the total scores of the Social Studies subtest. The command required for this situation is presented in Table 12.
[,c(1,3)], n=500,skew = 1.5, kurts = 3, output_name = c("sample","3"))
When the code in Table 12 is run, the function cannot draw a sample with the desired properties from the provided data without resampling, so it gives an error and suggests allowing resampling. The "replacement" argument, which is FALSE by default, is therefore set to TRUE to meet the distribution conditions, as shown in Table 13.
[,c(1,3)], n=500,skew = 1.5, kurts = 3,replacement = TRUE, output_name = c("sample","3"))

Resampling is allowed when TRUE is entered for the "replacement" argument. In other words, an individual in the "population" may be selected into the "sample" more than once in order to achieve the desired distribution. Table 14 shows the descriptive statistics of the distribution of 500 students drawn from the total score distribution of the Social Studies subtest and the output for some of the students in the extracted sample. In this case, the dist data frame contains the columns "ID" and "Score_2", which hold the student identity and the total score on the Social Studies subtest. According to the descriptive statistics in Table 14, the total score distribution of the Social Studies subtest for the 5,000 students (population) is slightly left-skewed; the desired sample is right-

Evaluating the Function's Stability
Measures of skewness and kurtosis are used to determine whether indicators meet normality assumptions (Kline, 2005). Skewness describes the extent to which a frequency distribution diverges from symmetry. Kurtosis is a measure of how flat the top of a symmetric distribution is compared with a normal distribution of the same variance. A normal distribution has a skewness of 0 and a kurtosis of 3 ('excess' kurtosis of 0). The original kurtosis value is sometimes called kurtosis (proper), and West, Finch, and Curran (1995) proposed an absolute kurtosis (proper) value > 7 as a reference for substantial departure from normality. Most statistical packages, such as SPSS, report 'excess' kurtosis, obtained by subtracting 3 from the kurtosis (proper). In this study, 'excess' kurtosis is used for practical reasons. Distributions that are more flat-topped than the normal distribution are called platykurtic, and their kurtosis (proper) values are less than 3. Distributions that are less flat-topped than the normal distribution are called leptokurtic, and their kurtosis (proper) values are more than 3 (Flott, 1995; Wuensch, 2005).
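The relation between kurtosis (proper) and 'excess' kurtosis can be checked numerically with a short base-R sketch. The three simulated distributions (uniform, normal, and Student's t with 5 degrees of freedom) are illustrative choices, not data from the study.

```r
# Kurtosis (proper): fourth central moment divided by the squared variance
kurtosis_proper <- function(v) {
  m <- mean(v)
  mean((v - m)^4) / mean((v - m)^2)^2
}

set.seed(99)
flat   <- runif(10000)       # flat-topped: platykurtic
bell   <- rnorm(10000)       # normal reference
peaked <- rt(10000, df = 5)  # heavy-tailed: leptokurtic

proper <- c(flat   = kurtosis_proper(flat),
            bell   = kurtosis_proper(bell),
            peaked = kurtosis_proper(peaked))
excess <- proper - 3         # 'excess' kurtosis as reported by most packages

round(rbind(proper, excess), 2)
```

The platykurtic sample falls below 3 (negative excess), the normal sample sits near 3 (excess near 0), and the leptokurtic sample lies above 3 (positive excess).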
There is no consensus in the literature about the skewness and kurtosis values that indicate normality. It is widely accepted that absolute skewness and kurtosis values up to one indicate normality (Büyüköztürk, Çokluk, & Köklü, 2014; Huck, 2012; Ramos et al., 2018). However, some authors suggest that much larger values of skewness and kurtosis can still indicate normality (Brown, 2006; Kim, 2013; West et al., 1995). Furthermore, kurtosis is generally of interest only when dealing with approximately symmetrical distributions. Skewed distributions are always leptokurtic. In addition, kurtosis can be thought of as a measure that is adjusted to remove the effect of skewness (Blest, 2003). Moreover, social science researchers are concerned with the deviation of a distribution from symmetry rather than with its flatness. High kurtosis should also prompt the researcher to look for outliers in one or both tails of the distribution (Wuensch, 2005). For these reasons, although the possible skewness and kurtosis values can be selected in the draw_sample() function, the samples it produces are very close to the target skewness values but not to the target kurtosis values. We recommend that users choose kurtosis values closest to 0 for normal distributions, higher than 3 for leptokurtic distributions, and lower than 3 for platykurtic distributions. If the aim of the researcher is to obtain data with outliers, the kurtosis value can be increased up to 20 according to the number of outliers.
To determine how close the drawn sample is to the reference distribution, a function called draw_sampleRMSE() was written. This function draws samples from the data with different set.seed values, as many times as the specified number of replications. The function's output is the skewness and
1:10000,1))
result <- drawsample::draw_sample(dist = df, n = n, skew = skew, kurts = kurts, output_name = c("sample", paste(i)))$desc
}, error = function (
To illustrate the stability of draw_sample(), two simulated data sets are used. First, negatively skewed and platykurtic data with a sample size of 10,000, called "datfra", were generated with the rbeta() function.
Then, 100 different samples were drawn from "datfra" with different set.seed values using the draw_sampleRMSE() function. After the skewness and kurtosis values were calculated for each sample, the RMSE values and descriptive statistics were presented in Table 16 for the skewness values, and only descriptive statistics were presented for kurtosis. In the first example in Table 16, normal distributions are drawn from the negatively skewed and leptokurtic distribution. The means of the skewness and kurtosis values of the distributions produced in this example are quite close to the target value, 0. The skewness values vary between -0.09 and 0.19, and the kurtosis values vary between -0.56 and 0.16. The RMSE calculated for the skewness value was 0.078.
In the second example in Table 16, positively skewed and leptokurtic distributions are drawn from the negatively skewed and leptokurtic distribution. It is seen that the mean skewness value of the distributions produced in this example is quite close to the determined value, 1. However, the mean kurtosis value of the distributions produced in this example is larger than 3, as expected for leptokurtic distributions. The skewness values vary between 0.6 and 1.19, and RMSE calculated for the skewness value was determined as 0.172.
In the third example in Table 16, negatively skewed and platykurtic distributions are drawn from the negatively skewed and leptokurtic distribution. The mean skewness value of the distributions produced in this example is quite close to the target value, -0.5. However, the mean kurtosis value of the distributions produced in this example is smaller than 3, as expected for platykurtic distributions.
Second, positively skewed and platykurtic data with a sample size of 10,000, called "datfra2", were generated with the rbeta() function. Then, 100 different samples were drawn from "datfra2" with different set.seed values using the draw_sampleRMSE() function. After the skewness and kurtosis values were calculated for each sample, the RMSE values and descriptive statistics were presented in Table 17 for the skewness values, and only descriptive statistics were presented for kurtosis. In Table 17, positively skewed and leptokurtic distributions are drawn from the positively skewed and leptokurtic distribution. The mean of the skewness values of the distributions produced in this example is quite close to the target value, 2. The skewness values vary between 1.51 and 2.27, and the kurtosis values are higher than 3. The RMSE calculated for the skewness value was 0.174. As a result, it was found that the function gives more consistent results at more common skewness values (between -1 and +1).
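The RMSE summary used in this stability check can be sketched as follows. This is not the draw_sampleRMSE() function: to stay self-contained, the sketch replaces each call to draw_sample() with a plain rnorm() draw of 500 values, and the helper names rmse() and skew_of() are hypothetical.

```r
# Root mean squared error of replicated estimates around a target value
rmse <- function(estimates, target) sqrt(mean((estimates - target)^2))

# Moment-based skewness
skew_of <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

set.seed(2024)
target_skew <- 0

# 100 replications; each rnorm() draw stands in for one sample that
# draw_sample() would return for a target skewness of 0
skews <- replicate(100, skew_of(rnorm(500)))

round(c(mean = mean(skews), min = min(skews), max = max(skews),
        RMSE = rmse(skews, target_skew)), 3)
```

A small RMSE means the replicated skewness values stay close to the target across seeds, which is the sense in which Tables 16 and 17 describe the function as stable.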

INSTALLING THE drawsample PACKAGE
The R package drawsample can be installed from CRAN with the install.packages("drawsample") command. The package automatically provides the example data set "example_data". Additionally, the package's files are available from the GitHub repository https://github.com/atalayk/drawsample.

FINAL REMARKS
In this study, an R package drawsample has been developed to draw samples with desired properties from a given distribution. Contrary to simulation studies, the importance given to studies with real data has increased in recent years. It is thought that using the drawn samples obtained from the real data with drawsample package will provide an alternative to simulation studies as well as a complement for these studies. In addition, since the real data is used instead of the simulation studies, the descriptive characteristics of the study groups can be examined. Thus, it may be possible to examine the demographic characteristics of the individuals making up the sample.
In this study, four examples with real data are presented. The examples suggest that the sample drawn from the real data is very close to the desired properties. However, it should be noted that drawing samples that perfectly match the desired properties is not as easy with real data sets as it is with simulated data sets. Apart from the examples discussed in the study, two simulated data sets were generated to evaluate the stability of draw_sample(). Samples were then drawn from these data sets under four cases. For each case, 100 replications of draw_sample() were performed, and RMSE values are reported. As a limitation, draw_sample() yields more inconsistent

Researchers can access data sets from across the web through the "https://toolbox.google.com/datasetsearch" search engine, as well as large public data sets such as TIMSS (Trends in International Mathematics and Science Study), PIRLS (The Progress in International Reading Literacy Study), and PISA (The Program for International Student Assessment). Various studies can be conducted by drawing samples from these data sets based on distribution properties. In situations like this, a good approach would be to draw a sample from the population. As authors, we are open to all kinds of suggestions for the development of the drawsample package.