Correcting Fallacies about Validity as the Most Fundamental Concept in Educational and Psychological Measurement

Validity is the most fundamental consideration in educational and psychological testing. It is a crucial concept in psychometrics, yet it is still misunderstood and misused. Conceptions of validity have evolved over the last 100 years. Validity is the degree to which evidence and theory support the adequacy and appropriateness of the proposed interpretations and uses of the scores obtained from the test or measurement instrument applied to a particular population or sample. In short, validity is not a property of a test or measurement instrument itself; it is a property of the proposed interpretations and uses of the scores. Thus, such statements as ‘the test is valid’, ‘the validity of scale’ or ‘the scores are valid’ should not be used. The most authoritative source regarding the development and evaluation of educational and psychological tests is the Standards for Educational and Psychological Testing, referred to briefly as the Standards. The view that there are three kinds of validity (content validity, criterion-related validity and construct validity), supported in the 1966 Standards, was abandoned in the 1999 Standards.


INTRODUCTION
The field of educational and psychological testing is replete with fallacies, urban legends and misconceptions, and the concepts of reliability and validity have had their share of these (Bademci, 2007, 2014; Goodwin & Goodwin, 1999; Phelps, 2009). Yet validity is the most fundamental consideration in educational and psychological measurement. In other words, measurement is at the core of scientific research and validity is at the heart of measurement (Bademci, 2013; Viswanathan, 2005).
Validity is the most important concept in educational and psychological testing, but it has long been the most misunderstood and widely misused one (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999, 2014; Frisbie, 2005; Rogers, 1995). Moreover, validity has evolved and continues to evolve (Kane, 2001; Messick, 1989). Conceptions of validity have changed remarkably over the past 100 years (Angoff, 1988; Kane, 2006).

Current Definitions of Validity, Validation, and Reliability
Validity and validation are two closely related but different concepts used in measurement (Kane, 2006; Newton & Shaw, 2014). Validity is the degree to which evidence and theory support the adequacy and appropriateness of the proposed interpretations and uses of the scores obtained from the test or measurement instrument applied to a particular population or sample (Bademci, 1999, 2019). Validation, on the other hand, is the process by which evidence for the validity of score interpretations is collected (Bademci, 1999, 2017b). Reliability, in turn, is the reproducibility or consistency of the scores obtained from the test or measurement instrument applied to a particular population or sample (Bademci, 1999, 2011). It must be borne in mind that score reliability is necessary but not sufficient for score interpretation validity (Thompson, 2003).

MODERN VIEW ON VALIDITY AND CORRECTING FALLACIES ABOUT VALIDITY
Validity is a property of the proposed interpretations and uses of the scores; in other words, validity is not a property of a test or measurement instrument itself or of test scores (Bademci, 1999, 2017a; Cronbach, 1971; Furr & Bacharach, 2008; Kane, 2006). Therefore, fallacious expressions such as 'validity of the test', 'the test is valid', 'the validity of scale', 'the validity of measurement instrument (or method)', 'the measurement procedure is valid', 'assessment validity', 'the validity of raters', 'the validity of exam', 'the validity of test scores', 'the scores are valid' and so on should never be used (AERA, APA, & NCME, 1985; Bademci, 2007). For example, the question "Is the test valid?" is incorrect; the appropriate question is "Is the interpretation of the scores from the test valid?" Today, there is a broad consensus that validity concerns the interpretations made on the basis of test scores, not the tests themselves (AERA, APA, & NCME, 1999, 2014; Cizek, 2016; Cronbach, 1971; Kane, 2006; Messick, 1989). At the core of this consensus is the view that what is valid or invalid is the interpretation of test scores (Cronbach, 1971; Newton, 2012).
Validity is a matter of degree; that is, validity is not an all-or-none concept (Bademci, 1999, 2019; Kane, 2013; Nunnally, 1978). Instead, the validity of the interpretation of the scores should be stated in degrees such as high validity, medium validity, low validity or no validity (Linn, 2010; Linn & Gronlund, 1995). That is to say, validity is not presented as a dichotomy (valid or not), because it is a continuum, one end of which is anchored by interpretations of scores that simply are not justified (Koretz, 2008). Like reliability, validity also depends on the population or sample; in other words, it is always specific to a particular population, sample or group (Bademci, 1999, 2011; Linn & Gronlund, 1995). It should not be overlooked that "…validity information varies with the group tested…" (Linn & Gronlund, 1995, p. 77).
Validity is an evaluation argument and includes an evaluative judgement; it is founded on empirical evidence and theoretical rationales (Bademci, 1999, 2017a; Linn & Miller, 2005; Messick, 1989; Osterlind, 2006). In other words, validity requires an evaluation of the degree to which the proposed interpretations and uses of the scores are justified by supporting evidence (Linn & Miller, 2005). The philosophical bases of validity theory have also changed over the years. The traditional psychometric viewpoint on validity, put forward in the early twentieth century, was rooted in positivism; by contrast, contemporary validity theory and validation practice, which hold that validity is a property of the interpretations made from scores, have been strongly influenced by constructivism (constructive realism, especially since the 1980s) (Bademci, 1999, 2017a; Messick, 1989; Mislevy, 2018; Sijtsma, 2009).

CONTEMPORARY VALIDITY AND 1999 STANDARDS: REJECTION OF THE HOLY TRINITY OF VALIDITY (CONTENT VALIDITY, CRITERION-RELATED VALIDITY, AND CONSTRUCT VALIDITY)
The most authoritative source regarding the development and evaluation of educational and psychological tests is the Standards for Educational and Psychological Testing (AERA et al., 1985, 1999, 2014; APA et al., 1966), referred to briefly as the Standards. The most significant change in the concept of validity occurred in the 1985 Standards: validity is a unitary concept (AERA, APA, & NCME, 1985; Algina & Penfield, 2009; Bademci, 1999, 2007; Messick, 1989). "The trinitarian doctrine" or "the holy trinity" of validity (Guion, 1980), which holds that there are three kinds of validity, namely content validity, criterion-related validity and construct validity, and which was supported in the 1966 Standards, was rejected and abandoned in the 1999 Standards (APA, AERA, & NCME, 1966; AERA, APA, & NCME, 1999; Bademci, 1999, 2017b).
The 1999 Standards, which represent the modern view of validity as a unitary concept based on various types of validity evidence, present the types of validity evidence under the title "sources of validity evidence": 1) evidence based on test content, 2) evidence based on response processes, 3) evidence based on internal structure, 4) evidence based on relations to other variables, and 5) evidence based on consequences of testing [evidence for validity and consequences of testing] (AERA, APA, & NCME, 1999, 2014). The latest edition of the Standards was published in 2014. The types of validity evidence are summarized below.

Sources of Validity Evidence
Evidence based on test content "can be obtained from an analysis of the relationship between the content of a test and the construct it is intended to measure" (AERA, APA, & NCME, 2014, p. 14). Such evidence includes "traditional content validity studies and alignment studies that require independent subject matter experts (SMEs) to review and rate test items according to their content relevance, representativeness, or alignment to curricular objectives as well as practice (job) analyses in the case of employment, licensure, or certification tests" (Sireci & Faulkner-Bond, 2015, pp. 221-222).
Evidence based on response processes is evidence "concerning the fit between the construct and the detailed nature of the performance or response actually engaged in by test takers" (AERA, APA, & NCME, 2014, p. 15). Validity evidence of this type includes think-aloud protocols, cognitive interviews that rely on examinees' verbalizations about their own thinking processes, eye-movement patterns and timing of responses (Ercikan & Pellegrino, 2017; Urbina, 2014).
Evidence based on internal structure comes from "analyses of the relationships of responses to different items on the test. The central idea is to investigate whether the relationships among item scores or score on parts of the test are as expected from the theory of the construct" (Algina & Penfield, 2009, p. 118). In other words, "analyses of the internal structure of a test can indicate the degree to which the relationships among test items and test components conform to the construct on which the proposed test score interpretations are based" (AERA, APA, & NCME, 2014, p. 16). Approaches or methods for gathering such evidence include factor analysis, item response theory, multidimensional scaling, differential item functioning, structural equation modeling, and cluster analysis (AERA, APA, & NCME, 1999, 2014; Algina & Penfield, 2009; Osterlind, 2006). In addition, strategies involving generalizability theory, internal consistency methods and other indexes of score reliability have been suggested as validity evidence of this type (Osterlind, 2006; Urbina, 2014). Thus, Sireci and Soto (2016) remarked "Internal structure evidence also evaluates the 'strength' or 'salience' of the major dimensions underlying an assessment, and this salience has a relationship to internal consistency reliability" (p. 152). Urbina (2014) noted "…for example, a test is designed to assess a unidimensional construct such as spelling ability or test anxiety. For these kinds of instruments, high internal consistency coefficients, like the coefficient alpha…, support the contention of unidimensionality" (p. 185). Nevertheless, Crocker and Algina (1986) noted "…alpha should not be interpreted as a measure of the test's unidimensionality" (p. 142). Bademci (2014) also emphasized that "Unidimensionality may be examined using exploratory factor analysis or especially confirmatory factor analysis…But, Cronbach's alpha should not be used as a measure of unidimensionality [or homogeneity]…Cronbach's alpha should be used to estimate of the score reliability based on the internal consistency among the [item] scores after unidimensionality is examined" (p. 23). However, it must be borne in mind that reliability serves as an integral component to the interpretation of the scores in many validation studies (Algina & Penfield, 2009).
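The caution that a high alpha does not establish unidimensionality can be illustrated with a minimal sketch. It assumes the usual formula for coefficient alpha, alpha = k/(k-1) × (1 − sum of item variances / variance of total score), and a hypothetical 12-item correlation matrix made up of two mutually uncorrelated six-item clusters with within-cluster correlations of .60. The function names and the matrix are illustrative assumptions and are not taken from the cited sources.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Coefficient alpha from a persons-by-items table of item scores."""
    k = len(scores[0])
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def alpha_from_covariance(cov):
    """Coefficient alpha from an item covariance (or correlation) matrix.

    The variance of the total score equals the sum of all matrix entries,
    and the sum of the item variances equals the trace.
    """
    k = len(cov)
    trace = sum(cov[i][i] for i in range(k))
    total = sum(sum(row) for row in cov)
    return k / (k - 1) * (1 - trace / total)


def two_block_correlation(items_per_block, within_r):
    """Hypothetical matrix: two item clusters, uncorrelated with each other."""
    k = 2 * items_per_block

    def r(i, j):
        if i == j:
            return 1.0
        same_block = (i < items_per_block) == (j < items_per_block)
        return within_r if same_block else 0.0

    return [[r(i, j) for j in range(k)] for i in range(k)]


# Assumed values: 6 items per cluster, within-cluster correlation .60.
alpha_two_dims = alpha_from_covariance(two_block_correlation(6, 0.6))
print(round(alpha_two_dims, 3))  # 0.818
```

Under these assumed values alpha is about .82 even though the items form two unrelated dimensions, which is consistent with the recommendation to examine dimensionality by factor analysis before alpha is interpreted.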
Evidence based on relations to other variables refers to analyses of the relationship between test scores and other variables. In other words, "In many cases, the intended interpretation for a given use implies that the construct should be related to some other variables, and, as a result, analyses of the relationship of test scores to variables external to the test provide another important source of validity evidence" (AERA, APA, & NCME, 2014, p. 16). Such evidence can include multitrait-multimethod studies, test-criterion relationships (predictive and concurrent studies), validity generalization studies and contrasted groups studies (AERA, APA, & NCME, 1999, 2014; Reynolds & Livingston, 2012; Suen & Rzasa, 2004). However, Algina and Penfield (2009) noted "…validation methods making use of correlational approaches (e.g., the correlation of multiple tests and multi-trait multi-method studies) can be impacted by the reliability of the obtained test scores, and thus the proper estimation of the reliability of the scores is an important consideration in interpreting the obtained validity evidence" (p. 119).
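The dependence of correlational validity evidence on score reliability can be made concrete with the classical correction for attenuation. The following is a minimal sketch under classical test theory assumptions (measurement errors uncorrelated with each other and with true scores); the function names and the numerical values are hypothetical and are not drawn from the cited sources.

```python
from math import sqrt


def attenuated_correlation(true_r, rel_x, rel_y):
    """Expected observed correlation between two sets of scores, given the
    correlation between true scores and the two score reliabilities."""
    return true_r * sqrt(rel_x * rel_y)


def disattenuated_correlation(observed_r, rel_x, rel_y):
    """Estimate of the true-score correlation from an observed correlation
    and the two score reliabilities (correction for attenuation)."""
    return observed_r / sqrt(rel_x * rel_y)


# Assumed values: true-score correlation .60, reliabilities .81 and .64.
observed = attenuated_correlation(0.60, 0.81, 0.64)
print(round(observed, 3))  # 0.432
print(round(disattenuated_correlation(observed, 0.81, 0.64), 3))  # 0.6
```

With these assumed reliabilities, a true-score correlation of .60 would appear as an observed test-criterion correlation of about .43, so a modest validity coefficient may partly reflect unreliable scores rather than a weak relationship between the constructs.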
Evidence based on consequences of testing refers to the evaluation of the intended (positive and negative) and unintended (positive and negative) consequences associated with interpretations and uses of test scores (AERA, APA, & NCME, 2014; Sireci & Faulkner-Bond, 2015). Examples of evidence based on consequences of testing include increased student dropout, increased teacher stress, improved student achievement, and enhanced teacher and student motivation (Linn, 2010). The set of standards introduced in the 1999 Standards has been retained in full, and enhanced, in the 2014 Standards (AERA, APA, & NCME, 1999, 2014).

IN LIEU OF CONCLUSION: VALIDITY IS A UNITARY CONCEPT
In the contemporary view of validity, distinct types of validity such as content validity, criterion-related validity and construct validity are rejected. As the 1999 Standards and the 2014 Standards point out, validity is a unitary concept supported by various types of validity evidence: evidence based on test content, evidence based on response processes, evidence based on internal structure, evidence based on relations to other variables, and evidence based on consequences of testing [evidence for validity and consequences of testing] (AERA, APA, & NCME, 1999, 2014). Contemporary validity and the sources of validity evidence are shown in Figure 1.

Figure 1. Validity and the sources of validity evidence
In addition, the radical changes related to validity and reliability were first brought onto Turkey's agenda, within the framework of a paradigm change, by Bademci (1999, 2004, 2017a) 23 years ago.