Sample Size Determination and Optimal Design of Randomized/Non-equivalent Pretest-posttest Control-group Designs

A recent systematic review of experimental studies conducted in Turkey between 2010 and 2020 reported that small sample sizes had been a significant drawback (Bulus & Koyuncu, 2021). A small fraction of the studies in the review were randomized pretest-posttest control-group designs; the overwhelming majority were non-equivalent pretest-posttest control-group designs (no randomization). They had an average sample size below 70 across domains and outcomes. Designing experimental studies with such small sample sizes implies a strong (and perhaps erroneous) assumption about the minimum relevant effect size (MRES) of an intervention; that is, that a standardized treatment effect of Cohen's d < 0.50 is not relevant to education policy or practice. Thus, an introduction to sample size determination for randomized/non-equivalent pretest-posttest control-group designs is warranted. This study describes the nuts and bolts of sample size determination (or power analysis). It also derives expressions for optimal design under differential costs per treatment and control unit and implements these expressions in an Excel workbook. Finally, this study provides convenient tables to guide sample size decisions for MRES values in the range 0.20 ≤ Cohen's d ≤ 0.50.


Introduction
One crucial question in education policy and practice is whether a program, product, or service produces favorable outcomes. The first step to answering such a research question is to solicit funding from stakeholders in a grant proposal to cover research expenses. The description of the research design in the grant proposal should convince stakeholders (and peers in the publication process) that the study employs rigorous methodological procedures and that the sample is not fundamentally flawed so as to produce biased or inconclusive results.
In education policy research, experiments are indispensable research designs that can establish a cause-effect relationship between an independent variable (e.g., receiving a program, product, or service) and an outcome variable (e.g., academic achievement) (Campbell & Stanley, 1963; Cook et al., 2002; Mosteller & Boruch, 2004). An experiment's main characteristic is that researchers can manipulate the independent variable to isolate its effect from unobserved confounders. In the simplest form, this is achieved by randomly assigning subjects in the sample to treatment and control groups. Randomization assures that the effects of unobserved confounders on the outcome, a significant threat to the internal validity of experiments, cancel out on average (Campbell & Stanley, 1963; Cook et al., 2002; Mosteller & Boruch, 2004). In this case, treatment and control groups do not systematically differ (especially in large samples). This type of design is referred to as a true experiment. However, randomization is not always feasible. For example, in education research it is common to assign entire clusters (e.g., classrooms) to treatment and control groups without randomization. In this case, the treatment effect may be contaminated with unobserved confounders; in other words, treatment and control groups may systematically differ. This type of design is a non-equivalent design (see Campbell & Stanley, 1963; Oakes & Feldman, 2001) and is categorized as a weak experiment in the literature. Nonetheless, weak experiments can be manipulated to mimic true experiments by matching subjects on the pretest or covariates (Fraenkel et al., 2011; see also Campbell & Stanley, 1963). This type of design is referred to as a quasi-experiment.

Recent reviews of experiments in Turkey indicated that they had inadequate sample sizes (e.g., Bulus & Koyuncu, 2021; Yildirim et al., 2019). The overwhelming majority of the reviewed experiments in Bulus and Koyuncu (2021) and Yildirim et al. (2019) were small-scale weak or quasi-experiments. Most of them were based on convenience sampling, where intact classrooms received the treatment or control protocols (often one classroom each). The average sample size was 70 for experiments reviewed in Bulus and Koyuncu (2021) and 54 for those reviewed in Yildirim et al. (2019). Such small sample sizes imply a strong (and perhaps erroneous) assumption about an intervention's minimum relevant effect size (MRES) before an experiment is undertaken; in other words, that a standardized treatment effect of Cohen's d < 0.50 is not relevant to education policy or practice. MRES is related to the question "What is the minimum treatment effect that is meaningful and relevant to education policy and practice?", and its value should be carefully justified.
The result of a small-scale experiment is sometimes "too good to be true." There are several potential sources of bias inherent to small-scale experiments. For example, the treatment effect in a small-scale experiment could be overestimated due to publication bias (Hedges, 1992; Vevea & Hedges, 1995), the small-study effect (Sterne et al., 2000), overfitting, where the model picks up noise (Yarkoni, 2017), teaching the treatment group to perform superiorly on a researcher-developed test, a shorter pretest-posttest interval (Slavin, 2008), baseline incomparability, classroom or school confounding, researcher bias (such as choosing the more able subjects for the treatment group), or a combination of these. Bulus and Koyuncu (2021) reported large treatment effects for 106 experiments targeting cognitive outcomes (Cohen's d = 1.02, on average) and for 81 experiments targeting affective outcomes (Cohen's d = 1.01, on average). The authors did not adjust effect size estimates for the pretest. Yildirim et al. (2019) also reported large treatment effects of learning strategies on academic achievement based on a random-effects meta-analysis of 28 experiments (Cohen's d = 1.21, on average). The authors did not explicitly state whether they adjusted effect size estimates for the pretest. We do not know whether the effects reported in Bulus and Koyuncu (2021) and Yildirim et al. (2019) were artifacts (due to the potential sources of bias mentioned earlier) or actual effects. Effect sizes of this magnitude, if considered artifacts, cannot be explained by failure to adjust for the pretest alone. If these are actual effects, it begs the question of why these programs are not scaled up.
One effective way to resolve this ambiguity and ameliorate the potential sources of bias mentioned earlier is to conduct an experiment with a sufficient sample size. A sufficient sample size allows the experiment to detect a minimum effect relevant to policy and practice with sufficient statistical power (the probability of detecting an effect when there is an effect in the underlying population). This study mainly describes formulas and software to determine sample size for the randomized pretest-posttest control-group design (true experiment) and the non-equivalent pretest-posttest control-group design (weak experiment). It derives expressions for the optimal design of true experiments under differential costs per treatment and control unit and provides a convenient Excel workbook for this purpose (Optimal Design: https://osf.io/uerbw/download). Moreover, it provides convenient tables to guide sample size decisions for MRES values in the range 0.20 ≤ Cohen's d ≤ 0.50 (Appendix and Supplement: https://osf.io/t2as3/download).
In what follows, first, the approximate standard error of the treatment effect for several types of experimental designs is described. Approximate standard errors are required for power analysis routines. When approximate standard errors are formulated in terms of known design parameters such as MRES, treatment group allocation rate, and the explanatory power of covariates, one can conveniently find the minimum required sample size (MRSS) for true and weak experiments given those design parameters. Second, illustrative examples are provided to find MRSS depending on common design characteristics. Finally, key points are discussed and summarized.

Approximate Standard Error Formulas for Power Analysis
To answer the crucial question "At least how many participants are needed in treatment and control groups to detect an effect that is relevant to policy and practice?", one needs a guesstimate of the standard error of the treatment effect. Fortunately, there are many important studies in this line of work. Several scholars derived expressions for approximate standard errors as a function of known design parameters such as total sample size, treatment group allocation rate, and the explanatory power of covariates (e.g., Bloom, 2006; Dong & Maynard, 2013; Oakes & Feldman, 2001). Expressions for approximate standard errors for true and weak experiments are described momentarily.
The approximate standard error expressions presented in this study apply to several experimental designs described in Campbell and Stanley (1963) and Fraenkel et al. (2011) when an Analysis of Variance (ANOVA) or Analysis of Covariance (ANCOVA) model is the method of choice. Randomized posttest-only control-group and randomized pretest-posttest control-group designs are categorized as true experiments (Campbell & Stanley, 1963; Fraenkel et al., 2011). The static-group comparison design (SCD; Campbell & Stanley, 1963) and the static-group pretest-posttest design (SPPD; Fraenkel et al., 2011) are categorized as weak experiments. SCD and SPPD designs are also known as non-equivalent designs; there is no guarantee that treatment and control groups are comparable at baseline in non-equivalent designs (see Campbell & Stanley, 1963; Oakes & Feldman, 2001). This study adopts the latter naming convention: non-equivalent posttest-only control-group design for SCD and non-equivalent pretest-posttest control-group design for SPPD.

True Experiments
In a simple true experiment, subjects are randomly assigned to treatment and control groups. While treatment group subjects benefit from a program, product, or service, no procedures are undertaken for the control group except the administration of questionnaires. Information is collected at baseline (e.g., a pretest) to control bias resulting from baseline differences (mostly in small-scale weak or quasi-experiments) and to improve the precision of the estimate. In the end, outcomes in the two groups are compared to gauge the effectiveness of the intervention.

Randomized Pretest-posttest Control-group Design
The diagram of the randomized pretest-posttest control-group design is shown below. R refers to the randomization process, X refers to the implementation of the treatment protocol, and O refers to the observation of the pretest (before X) or the posttest (after X).

Treatment group:  R  O  X  O
Control group:    R  O     O
The following procedures are followed in this type of design: (i) subjects are randomized into the treatment and control groups, (ii) a pretest questionnaire is administered before subjects receive treatment and control protocols, (iii) treatment and control group protocols are administered, and (iv) a posttest questionnaire is administered after subjects receive treatment and control protocols. Control group subjects could receive the business-as-usual approach or another intervention different from the treatment group. Data collected from this type of design can be analyzed via an ANCOVA model. The approximate standard error of the standardized treatment effect estimate ($\hat{\delta}$) takes the form of

$$SE(\hat{\delta}) \approx \sqrt{\frac{1-R^2}{p(1-p)n}} \quad (1)$$

with $n - g - 2$ degrees of freedom (Bloom, 2006, p. 12; Dong & Maynard, 2013, p. 45). R² is the proportion of variance in the posttest explained by the pretest. p is the treatment group allocation rate (the proportion of subjects in the treatment group). n is the total sample size in the treatment and control groups. g indicates the number of covariates (g = 1 when the pretest is the only covariate). To determine MRSS for this type of design, one can use the PowerUpR (Bulus et al., 2021) R package or the PowerUp! (Dong & Maynard, 2021) Excel workbook. These freeware will be described in the software illustration section momentarily.
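To make Equation 1 concrete, the following minimal R sketch computes the approximate standard error and the implied power rate for a two-tailed test (se_eq1 is an illustrative helper, not a PowerUpR function; all parameter values are assumptions for the example):

# Approximate standard error of the standardized treatment effect (Equation 1)
se_eq1 <- function(R2, p, n) sqrt((1 - R2) / (p * (1 - p) * n))
es <- 0.25; n <- 394; g <- 1                 # assumed design parameters
se <- se_eq1(R2 = 0.22, p = 0.50, n = n)
df <- n - g - 2                              # degrees of freedom
tcrit <- qt(0.975, df)                       # two-tailed critical value, alpha = .05
# Power from the noncentral t distribution
1 - pt(tcrit, df, ncp = es / se) + pt(-tcrit, df, ncp = es / se)  # approx. .80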

Randomized Posttest-only Control-group Design
The diagram of the randomized posttest-only control-group design is shown below.

Treatment group:  R  X  O
Control group:    R     O
The following procedures are followed in this type of design: (i) subjects are randomized into the treatment and control groups, (ii) treatment and control group protocols are administered, and (iii) a posttest questionnaire is administered after subjects receive treatment and control protocols. Similarly, control group subjects could receive the business-as-usual approach or another intervention different from the treatment group. Data collected from this type of design can be analyzed via an ANOVA model. Per the G*Power 3.1 guide (p. 49), the approximate standard error for the treatment effect takes the form of

$$SE(\hat{\delta}) \approx \sqrt{\frac{1}{p(1-p)n}} \quad (2)$$

with $n - 2$ degrees of freedom. The remaining parameters are defined earlier. The relevant specification in G*Power is "Test family: t-tests" and "Statistical test: Means: Difference between two independent means (two groups)." Note that when pretest information is not available in Equation 1 (R² = 0 and g = 0), it converges to Equation 2. Alternatively, one can use the PowerUpR (Bulus et al., 2021) R package or the PowerUp! (Dong & Maynard, 2021) Excel workbook for this purpose; in that case, set R² = 0 and g = 0 in PowerUpR and PowerUp!
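As a quick check of the convergence claim, one can call power.ira() with the pretest switched off (a sketch with illustrative values; at the same sample size of 394, removing the pretest drops the power rate from about 80% to roughly 70%):

library(PowerUpR)
# Posttest-only design: no pretest, so r2 = 0 and g = 0
power.ira(alpha = .05, two.tailed = TRUE, es = .25, g = 0, r2 = 0, p = .50, n = 394)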

Optimal Design of True Experiments
Conducting an experiment can be costly. Naturally, costs for the treatment group can be higher than costs for the control group. When the cost per subject differs between treatment and control groups, it is desirable to sample less from the group with higher costs. Higher costs associated with the treatment group may arise from new materials, new approaches to learning, hiring experts, and other overhead costs needed to develop and implement an intervention. Overhead costs for treatment and control groups can be divided by the number of subjects in each group and added to the subject-specific costs. In this case, each subject in the treatment and control groups is associated with a differential cost, so it is reasonable to sample fewer subjects from the treatment group and more subjects from the control group. In what follows, analytic expressions are derived to find the optimal p and n given the total cost or budget.
Let C_TRT and C_CTRL be the cost per subject in the treatment and control groups, respectively. Let C_TOT be the total cost or budget. The total cost is the sum of the costs for the treatment and control groups. Costs for the treatment and control groups can be expressed as the subject-level cost in each group multiplied by the number of subjects in each group. There are $pn$ subjects in the treatment group and $(1-p)n$ subjects in the control group.
Then, the total cost can be expressed as

$$C_{TOT} = pn\,C_{TRT} + (1-p)n\,C_{CTRL} \quad (3)$$

Re-arranging Equation 3, n can be expressed as

$$n = \frac{C_{TOT}}{p\,C_{TRT} + (1-p)\,C_{CTRL}} \quad (4)$$

Plugging Equation 4 for n into Equation 1, the squared standard error can be expressed as

$$SE^2(\hat{\delta}) = \frac{(1-R^2)\left[p\,C_{TRT} + (1-p)\,C_{CTRL}\right]}{p(1-p)\,C_{TOT}} \quad (5)$$

To find the optimal p that minimizes the squared standard error in Equation 5, one takes the derivative of $SE^2(\hat{\delta})$ with respect to p:

$$\frac{\partial\,SE^2(\hat{\delta})}{\partial p} = \frac{(1-R^2)\left[p^2(C_{TRT}-C_{CTRL}) + 2p\,C_{CTRL} - C_{CTRL}\right]}{p^2(1-p)^2\,C_{TOT}} \quad (6)$$

Setting Equation 6 to zero and solving for p produces the optimal p as

$$p^{*} = \frac{\sqrt{C_{CTRL}}}{\sqrt{C_{TRT}} + \sqrt{C_{CTRL}}} \quad (7)$$

Equation 7 can be further simplified. Defining the cost ratio as $r = C_{TRT}/C_{CTRL}$, then

$$p^{*} = \frac{1}{1+\sqrt{r}} \quad (8)$$

Equations 4 and 8 can be used to devise a randomized pretest-posttest control-group design optimally. First, one needs information on the cost ratio. Once the cost ratio is known, the optimal p can be obtained from Equation 8. In the second step, the optimal p can be plugged into Equation 4 to get an estimate for n.
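A minimal R sketch of Equations 4 and 8 (function and variable names are illustrative; the companion Optimal Design Excel workbook implements the same arithmetic):

# Optimal allocation under differential per-subject costs
optimal_design <- function(c_trt, c_ctrl, c_tot) {
  r <- c_trt / c_ctrl                            # cost ratio
  p <- 1 / (1 + sqrt(r))                         # Equation 8: optimal allocation rate
  n <- c_tot / (p * c_trt + (1 - p) * c_ctrl)    # Equation 4: affordable sample size
  list(p = p, n = floor(n))
}
optimal_design(c_trt = 20, c_ctrl = 5, c_tot = 2000)  # p = 1/3, n = 200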

Weak Experiments
Although weak experiments are presented here, they are not the first choice for producing knowledge for evidence-based practices. They should be preferred only when randomization is not feasible. They are described below for interested readers.

Non-equivalent Pretest-posttest Control-group Design
The diagram of the non-equivalent pretest-posttest control-group design is shown below. The absence of R indicates that the two groups are naturally occurring rather than randomly assigned.

Treatment group:  O  X  O
Control group:    O     O
The following procedures are followed in this type of design: (i) a pretest questionnaire is administered to subjects in two naturally occurring groups (e.g., classrooms) before they receive treatment and control protocols, (ii) treatment and control group protocols are administered to these two groups, and (iii) a posttest questionnaire is administered after these two groups receive treatment and control protocols, respectively. Note that there is no randomization. Data collected from this type of design can also be analyzed via an ANCOVA model. The approximate standard error for the treatment effect is adapted from Oakes and Feldman (2001, p. 15) as

$$SE(\hat{\delta}) \approx \sqrt{\frac{1-R^2}{p(1-p)(1-\rho_{pt}^2)n}} \quad (9)$$

with $n - g - 2$ degrees of freedom. Unlike earlier designs, $\rho_{pt}^2$ is the squared point-biserial correlation between the pretest variable and the treatment indicator. It represents the proportion of variance in the pretest explained by the treatment indicator.
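A small R helper makes the inflation explicit (a sketch; se_eq9 is a hypothetical name, and the values below are illustrative): relative to Equation 1, baseline non-equivalence multiplies the standard error by $1/\sqrt{1-\rho_{pt}^2}$, so the required sample size grows by roughly a factor of $1/(1-\rho_{pt}^2)$.

# Approximate standard error for the non-equivalent design (Equation 9)
se_eq9 <- function(R2, p, n, rho_pt) {
  sqrt((1 - R2) / (p * (1 - p) * (1 - rho_pt^2) * n))
}
# Inflation relative to the randomized design (Equation 1)
se_eq9(R2 = .38, p = .50, n = 314, rho_pt = .243) / sqrt((1 - .38) / (.25 * 314))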

Non-equivalent Posttest-only Control-group Design
The diagram of the non-equivalent posttest-only control-group design is shown below.

Treatment group:  X  O
Control group:       O
The following procedures are followed in this type of design: (i) treatment and control group protocols are administered to two naturally occurring groups, and (ii) a posttest questionnaire is administered after these two groups receive treatment and control protocols, respectively. There is no randomization. Data collected from this type of design can also be analyzed via an ANOVA model. The approximate standard error for the treatment effect can be obtained by re-expressing Equation 9 as

$$SE(\hat{\delta}) \approx \sqrt{\frac{1}{p(1-p)(1-\rho_{pt}^2)n}} \quad (10)$$

with $n - 2$ degrees of freedom. One could rightly argue that $\rho_{pt}^2$ does not apply to this formulation because pretest information is not collected. Although pretest information is not collected, differences between treatment and control groups at baseline would still affect the standard error of the treatment effect. Thus, it would be good practice to have a guesstimate for $\rho_{pt}^2$ and determine sample size accordingly. Other parameters are defined earlier.

Sample Size Determination in True Experiments
In this section, the nuts and bolts of sample size determination for the randomized pretest-posttest control-group design are described. First, in the software illustrations section, PowerUpR and PowerUp! are used to determine the sample size for a hypothetical intervention. Second, in the optimal design section, a step-by-step guide is provided to optimally design a hypothetical intervention, along with a description of the Optimal Design Excel workbook accompanying this article. Finally, in the table illustration section, the relevant table in the Appendix is used to determine sample size without any software packages.

Software Illustrations
There are a few points to consider when determining the minimum required sample size (MRSS):
• Type I error rate can be defined as the probability of finding a treatment effect in the sample when there is no effect in the underlying population. It is usually specified as 5%, the default value in PowerUpR (alpha = .05).
• Power rate can be defined as the probability of finding a treatment effect in the sample when there is an effect in the underlying population. It is usually specified as 80% in social science, which is the default value in PowerUpR (power = .80).
• Whether the hypothesis test is one-tailed or two-tailed. Generally, a two-tailed hypothesis test is performed, assuming that the intervention could either be beneficial or detrimental; this is the default in PowerUpR (two.tailed = TRUE).
• The minimum relevant effect size (MRES), standardized according to Cohen's d. MRES is usually defined as 0.20 or 0.25 in education research; the latter is the default value in PowerUpR (es = 0.25). An MRES of 0.25 means that a minimum meaningful treatment effect bumps an average student's score up by ten percentile points.
• Treatment group allocation rate (p) is defined as the proportion of subjects in the treatment group. Allocating half of the sample to the treatment group produces the smallest variance (or maximum power rate), which is the default value in PowerUpR (p = .50).
• The proportion of variance in the posttest explained by the pretest and other covariates (R²). R² should rely on earlier literature or existing data targeting the same outcome. There is not much research in Turkey that provides R² values for planning experimental designs beyond Bulus and Koyuncu (2021). Brunner et al. (2018) analyzed PISA data for 81 countries, including Turkey, and provided design parameters for planning cluster-randomized trials; their results apply to 15-year-old students. If the interest is the explanatory power of socio-demographic variables for high school students, the R² values reported at the student level can possibly be used. Socio-demographic variables explain a small amount of variance in academic achievement (median R² = .05), affect and motivation (median R² = .01), and learning strategies (median R² = .01) at the student level. The correlation between the pretest and the posttest tends to be higher for affective outcomes because, in comparison to cognitive outcomes, they tend to persist over time. This tendency toward a stronger relationship manifests itself as higher R² values. In fact, for true experiments, Bulus and Koyuncu (2021, p. 32) reported average values of R² = .38 for affective outcomes and R² = .22 for cognitive outcomes (r2 = .38 or r2 = .22 in PowerUpR).
MRSS computations can be performed considering the information presented above. For this purpose, the PowerUpR R package and the PowerUp! Excel workbook will be used. These two freeware have the same naming conventions and employ the same algorithms to determine MRSS. Although these statistical packages are mainly designed for multilevel randomized experiments, they also include a function for the randomized pretest-posttest control-group design under the "Individual Random Assignment" function or module. Considering the MRSS result for an intervention targeting a cognitive outcome, for example, one can report the power analysis procedure in a paragraph as follows: For this randomized pretest-posttest control-group design, we assume that the pretest explains 22% of the posttest variance (Bulus & Koyuncu, 2021). We further assume that the hypothesis test is two-tailed, the Type I error rate is 5%, and the power rate is 80%. Under these conditions, based on PowerUpR (Bulus et al., 2021) or PowerUp! (Dong & Maynard, 2013), a sample of 394 subjects equally allocated to treatment and control groups is needed to detect an effect size as small as 0.25.

Optimal Design under Differential Costs
The task of undertaking an experiment can be costly. Expenses can either be covered by the researcher or solicited from funding agencies. In either case, one can optimally allocate subjects to treatment and control groups if the costs associated with treatment and control units are available. The Optimal Design Excel workbook accompanying this article implements the optimal design formulas presented in this study. The step-by-step approach to the optimal design of the randomized pretest-posttest control-group design is presented in Figures 3 to 6. The Optimal Design Excel workbook can also be used to optimally devise a randomized posttest-only control-group design.
Assume that the reserved budget is 2000₺, which cannot be increased (fixed budget). Further, assume that the costs associated with each treatment and control unit are 20₺ and 5₺, respectively. Defining these values in the Optimal Design Excel workbook (yellow highlighted cells) produces a sample size of 200 with an allocation rate of p = 0.33 (see Step 1 in Figure 3). We know this is the allocation that produces minimum variance (or maximum power) compared to alternative allocations under identical budget constraints. However, we still do not know what power rate this allocation will produce. The question is: What is the power rate for the optimal allocation rate (p = .33) and sample size (n = 200)? Using PowerUpR, the power rate is computed as 47% (see Step 2 in Figure 4). If the total cost or budget is fixed at 2000₺, this is the best we can do.
Step 2: Check the power rate in PowerUpR or PowerUp! given the optimal p and n produced in Step 1. Specify other design parameters according to your study field. If the total cost or budget is fixed, stop here.
power.ira(alpha = .05, two.tailed = TRUE, es = .25, g = 1, r2 = .22, p = .33, n = 200)

Suppose the total cost or budget is flexible. In that case, we can demonstrate that we opted for a cost-efficient allocation by exploring alternatives. The allocation rate does not change because it depends only on the per-unit costs in the treatment and control groups. The question is: What is the sample size and the total cost for a power rate of 80% given the optimal allocation rate (p = .33)? PowerUpR produces a sample size of 445, which will cost 4450₺ (see Step 3 in Figure 5).
Step 3: For the desired power rate (80%), find the required sample size given the optimal p produced in Step 1. Then, re-estimate the total cost or budget. The next question is: What would the sample size have been for a power rate of 80% had we used a balanced allocation (p = .50), and how much would that cost? Had we used a p = .50 allocation rate instead of p = .33, we would have needed 394 subjects, which would have cost 4925₺ (see Step 4 in Figure 6).
Step 4: For the desired power rate (80%), find the required sample size (n) with the balanced allocation rate (p = .50). Then, re-estimate the total cost or budget. Using the optimal allocation rate of p = .33, we save 475₺ while preserving a power rate of 80%. Researchers can decide whether they should spend the extra 475₺ and go with the more balanced sample. Sometimes, severely unbalanced samples produce unstable estimates in the analysis of variance. Readers are referred to Bulus and Dong (2021a) for the optimal design of more complicated experimental designs. Researchers can use the cosa R package (also available through https://cosa.shinyapps.io/index/; Bulus & Dong, 2021b) for this purpose.
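The cost comparison in Steps 3 and 4 amounts to the following arithmetic (a minimal R sketch; the sample sizes 445 and 394 are taken from the PowerUpR runs above):

# Total cost = (treatment subjects x cost per treatment unit) +
#              (control subjects x cost per control unit)
total_cost <- function(n, p, c_trt = 20, c_ctrl = 5) {
  n * p * c_trt + n * (1 - p) * c_ctrl
}
total_cost(n = 445, p = 1/3)   # optimal allocation: 4450
total_cost(n = 394, p = 1/2)   # balanced allocation: 4925
# Savings from the optimal allocation: 4925 - 4450 = 475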

Table Illustration
Tables 1A-7A in the Appendix tabulate the main factors affecting MRSS. MRSS depends on whether the hypothesis test is one-tailed or two-tailed, the Type I error rate (α), the treatment group allocation rate (p), the explanatory power of the pretest (R²), and the minimum relevant effect size (MRES). Tables are reproduced for MRES values ranging from 0.20 to 0.50. There are two rationales for these specifications. First, an MRSS capable of detecting MRES = 0.20 is an acceptable standard in education research; 0.20 is considered the minimum meaningful effect according to Cohen's d when no theory guides the MRES specification. Second, Bulus and Koyuncu (2021) found that the average sample size for experiments conducted in Turkey between 2010 and 2020 is insufficient to detect MRES values of 0.50 and below. Type I error rate (α) specifications are based on common reporting guidelines in scholarly work (* p < .05, ** p < .01, and *** p < .001). The treatment group allocation rate (p) ranges from .35 to .50 because differential costs may impel researchers to draw more subjects from the less costly control group; p = .50 produces the smallest MRSS (minimum variance or maximum power) under no cost considerations. R² can be as high as .70, according to values reported in Hedges and Hedberg (2013). Thus, the explanatory power of the pretest (R²) ranges from 0 to .70.

Let us find the MRSS for an experiment targeting an affective outcome. The default option for linear regression or the t-test in SPSS and R produces p-values for two-tailed hypothesis testing. Thus, we look at the rows in the "Two-tailed" section (see Figure 7). One could argue that an MRES value of 0.25 is the minimum meaningful improvement in education policy and practice. An MRES = 0.25 means that an intervention could bump up an average student's score from the 50th percentile to the 60th percentile. Thus, Table 2A in the Appendix is chosen. Bulus and Koyuncu (2021) reported that the explanatory power of the pretest for affective outcomes is .38 on average, a value between R² = .35 and R² = .40 (see Figure 7). It is common to deem a program effective if the p-value for the treatment effect is below .05. Thus, the row with α = .05 is chosen (see Figure 7). Without any cost considerations, it is ideal to choose a balanced sample (p = .50).
For R² = .35, we need 328 subjects, whereas for R² = .40 we need 303 subjects. A difference of .05 in R² corresponds to a difference of 25 subjects in MRSS. R² = .38 is .02 units (2/5 of the difference) away from R² = .40, so the sample size will be approximately 2/5 of 25 (10 subjects) more. As a result, 303 + 10 = 313 subjects are needed in total. Note that this number agrees with the MRSS found in the software illustration section. An MRSS of 313 is the minimum required number; surely more subjects can be recruited. Finally, rounding up to 314 for a balanced allocation, one could randomly allocate 157 subjects to the treatment group and the remaining 157 subjects to the control group.
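Written out, the interpolation above is

$$n_{R^2=.38} \approx n_{.40} + \frac{.40 - .38}{.40 - .35}\,(n_{.35} - n_{.40}) = 303 + \frac{2}{5} \times 25 = 313$$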
One can report the power analysis procedure in a paragraph as follows: For this randomized pretest-posttest control-group design, we assume that the pretest explains 38% of the posttest variance (Bulus & Koyuncu, 2021). We further assume that the hypothesis test is two-tailed, the Type I error rate is 5%, and the power rate is 80%. Under these conditions, based on Table 2A in Bulus (2021), we decided on a sample of 314 subjects equally allocated to treatment and control groups to detect an effect size as small as 0.25.

Table Illustration for Weak Experiments
There is no known software to determine MRSS for the non-equivalent pretest-posttest control-group design (R² > 0) or the non-equivalent posttest-only control-group design (R² = 0) yet. Researchers can use Tables S1-S28 in the Supplement for this purpose. Using the same specifications as in Figure 7, except that treatment and control groups are now not equivalent on the pretest score, we can find the MRSS for a non-equivalent pretest-posttest control-group design. Assume that the point-biserial correlation between the pretest and the treatment indicator is 0.243, translating into a standardized pretest difference of 0.50 between treatment and control groups. From the INDEX worksheet in Figure 8, one can choose Table S8 for this purpose.
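The 0.243 figure follows from the standard conversion between a standardized mean difference and a point-biserial correlation, $\rho_{pt} = d/\sqrt{d^2 + 1/(p(1-p))}$. A quick R check (d_to_rho is an illustrative helper, assuming a balanced allocation):

# Point-biserial correlation implied by a standardized pretest difference d
d_to_rho <- function(d, p = 0.5) d / sqrt(d^2 + 1 / (p * (1 - p)))
d_to_rho(0.50)  # approximately 0.243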
Figure 8. Finding the relevant table from the Supplemental Excel workbook based on MRES and pretest difference specifications.

For R² = .35, we need 349 subjects, whereas for R² = .40 we need 322 subjects (see Figure 9). A difference of .05 in R² corresponds to a difference of 27 subjects in MRSS. R² = .38 is .02 units (2/5 of the difference) away from R² = .40, so the sample size will be approximately 2/5 of 27 (~11 subjects) more. As a result, 322 + 11 = 333 subjects are needed in total. Twenty more subjects are needed compared to the earlier example with the randomized pretest-posttest control-group design due to the pretest differences between treatment and control groups. One can report the power analysis procedure in a paragraph as follows: This non-equivalent pretest-posttest control-group design assumes that the pretest explains 38% of the posttest variance (Bulus & Koyuncu, 2021). We further assume a point-biserial correlation of .243 between the pretest and the treatment indicator, translating into a standardized pretest difference of 0.50 between treatment and control groups. We further assume that the hypothesis test is two-tailed, the Type I error rate is 5%, and the power rate is 80%. Under these conditions, based on Table S8 in Bulus (2021), we decided on a sample of 334 subjects (167 in the treatment group and 167 in the control group) to detect an effect size as small as 0.25.

Discussion
Researchers can use G*Power for randomized posttest-only control-group designs. They can also use PowerUpR or PowerUp! by setting R² = 0 and g = 0 for this purpose. Collecting pretest information and other covariates means that R² > 0, which reduces the required sample size for an experiment. As for randomized pretest-posttest control-group designs, researchers can use PowerUpR or PowerUp! by setting R² > 0 and g > 0 depending on the explanatory power of the pretest and covariates. G*Power and PowerUpR results are comparable when the explanatory power of the pretest or covariates is zero (R² = 0). PowerUpR allows R² > 0, whereas there is no convenient option in G*Power for pretest adjustment. Results differ by one or two units in some cases, possibly due to internal rounding differences during intermediate computations. It is possible to convert G*Power results for R² = 0 to other scenarios with R² > 0: if one multiplies G*Power results for R² = 0 by the term (1 - R²), they will obtain sample sizes comparable to PowerUpR. For example, to detect MRES = 0.20 using a two-tailed test with α = .05, p = .50, and R² = .50, PowerUpR produces an MRSS of 394 (see Table 1A in the Appendix). G*Power produces an MRSS of 788 with the same specifications. If we multiply the result from G*Power by (1 - R²), we get 394, which is the same as the result produced by PowerUpR.
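The conversion is a one-liner (a sketch; the result may differ from either package by a unit or two because of rounding):

# Convert a G*Power MRSS computed with R2 = 0 to a design with R2 > 0
convert_gpower <- function(n_gpower, r2) ceiling(n_gpower * (1 - r2))
convert_gpower(788, r2 = .50)  # 394, matching PowerUpR and Table 1A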
Alternatively, one can use Tables 1A through 7A in the Appendix for the randomized posttest-only control-group design (R² = 0 and g = 0) and randomized pretest-posttest control-group designs (R² > 0 and g > 0). There are some evident trends in the MRSS values reported in Tables 1A-7A. Two-tailed hypothesis tests require larger sample sizes than one-tailed hypothesis tests. The smaller the Type I error rate (α), the larger the sample size requirement. A balanced sample (p = .50) requires a smaller sample size than an unbalanced sample (though one may favor unbalanced samples under differential costs). The bigger the value of R², the smaller the sample size requirement. Finally, detecting a smaller MRES requires a larger sample size.
There is no known software to find MRSS for the non-equivalent posttest-only control-group design (R² = 0) and the non-equivalent pretest-posttest control-group design (R² > 0). One can use Tables S1 through S28 in the Supplemental Excel workbook for this purpose. The trends observed in Tables 1A-7A for true experiments also apply to Tables S1-S28 for weak experiments. For a small point-biserial correlation between the pretest and the treatment indicator, in other words, a small standardized difference on the pretest between treatment and control groups, MRSS values hardly differ between the tables in the Appendix and the tables in the Supplement. For a moderate to large correlation, in other words, a moderate standardized difference on the pretest between treatment and control groups, the differences between the tables in the Appendix and those in the Supplement become noticeable. Weak experiments typically require larger sample sizes.
Weak experiments can be manipulated before an intervention so that treatment and control groups are comparable on the pretest. One such procedure is known as matching. Subjects can be matched not only on the pretest but also on other relevant covariates. These designs are referred to as quasi-experimental designs (Fraenkel et al., 2011). The corresponding quasi-experimental designs would be the matching-only pretest-posttest control-group and matching-only posttest-only control-group designs (Fraenkel et al., 2011). Retaining only matched pairs and discarding the remaining subjects will reduce the sample size and result in a loss of power. Assuming that the pretest difference between treatment and control groups is negligible after matching, one can use Tables 1A-7A to determine MRSS values and plan the sample size accordingly. There are other methods to ensure that treatment and control groups are comparable: propensity score matching (Rosenbaum & Rubin, 1983), prognostic scores (Hansen, 2006, 2008; Wyss et al., 2015), prognostic propensity scores (Leacy & Stuart, 2013), coarsened exact matching (Iacus et al., 2012), and inverse probability of treatment weighting (Huber, 2014). The description of these methods is beyond the scope of this study; readers are referred to the references.
The formulas described in this study, the software illustrations, and the MRSS values in Tables 1A-7A and S1-S28 assume that observations are independent of each other. This assumption is often violated in practice because students are nested within classrooms (or teachers), and classrooms are nested within schools. Students in the same classroom or school tend to perform similarly; in other words, their scores are correlated due to contextual effects. The design and analysis of experiments with a nested structure require specialized statistical tools. A growing body of studies considers this nested structure in the design of experiments (e.g., Bloom, 2006; Dong & Maynard, 2013; Hedges & Rhoads, 2010; Konstantopoulos, 2008a, 2008b; Raudenbush & Liu, 2000; Schochet, 2008; Spybrook, 2007; and many others). To find MRSS for such complex experimental designs, researchers can use PowerUpR or PowerUp!

Conclusion
This study elaborated on the nuts and bolts of sample size determination (or power analysis) in true experiments (randomized pretest-posttest control-group design and randomized posttest-only control-group design) and weak experiments (non-equivalent pretest-posttest control-group design and non-equivalent posttest-only control-group design). In addition, illustrations provided step-by-step guidance on using the G*Power, PowerUpR, and PowerUp! freeware to determine MRSS for true experiments. Furthermore, the optimal design of true experiments was illustrated using the companion Optimal Design Excel workbook. Finally, this study provided MRSS values for common scenarios in Tables 1A-7A for true experiments and Tables S1-S28 for weak experiments.
G*Power and PowerUpR produce the same results for randomized posttest-only control-group designs. G*Power results can be converted to PowerUpR results by multiplying them by (1 - R²). PowerUpR and PowerUp! cover a broader range of experimental designs; either of them can be used to plan a randomized pretest-posttest control-group design. The software illustration section defined the relevant design parameters and discussed reasonable values for them. One crucial design parameter is the minimum relevant effect size (MRES); effects below the benchmark MRES would not be of interest to education policy and practice. When no data or literature is available for a benchmark MRES value, 0.20 or 0.25 can be used. The second crucial parameter is the R² value, defined as the proportion of variance in the posttest explained by the pretest. R² values should rely on earlier studies of a similar kind. When no information is available, researchers can use R² = .22 for cognitive outcomes and R² = .38 for affective outcomes. These values are based on 155 experimental studies reviewed in Bulus and Koyuncu (2021).
This study also provided optimal design formulas for randomized pretest-posttest control-group designs under a differential cost assumption. When treatment units are more expensive than control units and the total cost or budget is fixed, researchers can find the optimal p and n. The optimal p depends on the cost ratio (cost per treatment unit / cost per control unit), and n depends on the total cost or budget given p. Suppose the total cost or budget is flexible. In this case, researchers can explore the options described in the illustration: they can compare the total cost with that of a balanced design (p = .50) and decide whether it is worth pursuing an unbalanced design. If the additional cost induced by the balanced design is modest, it is probably better to use a balanced design. The optimal design formulas are implemented in the Optimal Design Excel workbook accompanying this article.
Finally, the MRSS values in Tables 1A-7A allow researchers unfamiliar with R programming and Excel workbooks to decide on an MRSS for the randomized pretest-posttest control-group design and the randomized posttest-only control-group design. There is no known software for finding MRSS in the non-equivalent pretest-posttest control-group design and the non-equivalent posttest-only control-group design; Tables S1-S28 in the Supplement serve this purpose.

First, we need to install the PowerUpR package in the R environment and load it into the current session using the code below (or any other package installation routine). The GitHub code repository has the most recent version of the package; the package can also be downloaded from the CRAN repository. The function that allows MRSS computation in PowerUpR is mrss.ira(); earlier versions of the PowerUpR package available on CRAN use the mrss.ira1r1() name. Considering the R² values from Bulus and Koyuncu (2021), MRSS for an intervention targeting an affective outcome (e.g., affect and motivation) or a cognitive outcome (e.g., achievement) can be computed as shown below. If one opts for the PowerUp! Microsoft Excel workbook, it should be downloaded from https://www.causalevaluation.org/uploads/7/3/3/6/73366257/powerup.xlsm. MRSS can be computed for each type of outcome using the PowerUp! Module IRA with identical specifications (see Figures 1 and 2).
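A minimal sketch of these steps (the r2 values of .38 and .22 come from Bulus and Koyuncu, 2021; the other arguments shown are PowerUpR defaults):

# Install and load PowerUpR (CRAN version)
install.packages("PowerUpR")
library(PowerUpR)

# MRSS for an intervention targeting an affective outcome (r2 = .38)
mrss.ira(alpha = .05, two.tailed = TRUE, power = .80, es = .25, g = 1, r2 = .38)

# MRSS for an intervention targeting a cognitive outcome (r2 = .22)
mrss.ira(alpha = .05, two.tailed = TRUE, power = .80, es = .25, g = 1, r2 = .22)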
Figure 1. MRSS for an intervention targeting an affective outcome.

Figure 2. MRSS for an intervention targeting a cognitive outcome.

Figure 3. Step 1 in the Optimal Design Excel workbook: find the optimal p and n for the randomized pretest-posttest control-group design.
Degrees of freedom: 197
Standardized standard error: 0.095
Type I error rate: 0.05
Type II error rate: 0.535
Two-tailed test: TRUE

Figure 4. Step 2 in the Optimal Design Excel workbook: checking the power rate in PowerUpR.

Figure 7. Finding MRSS from tables in the Appendix (or the Supplemental Excel workbook) based on MRES and R² specifications.

Figure 9. Finding MRSS from the Supplemental Excel workbook based on MRES, R², and pretest difference specifications.