Overheads for Unit 3--Chapter 4 (Validity)
Exercise: A Principal’s Question
You say that your math course teaches students to reason well
mathematically. What evidence can you provide that it actually does so?
Three Key Concepts in Judging the Quality of an Assessment
Why should you be bothered with these concepts, anyway?
- Appreciate why all assessments contain error
- Know the various sources of error
- Understand that different kinds of assessments are prone to different kinds of error
- Build assessments with less error
- Know how to measure error, if need be
- Know what is safe—and not safe—to conclude from assessment results
- Decide when certain assessments should not be used
Definition: The appropriateness of how an assessment's scores are interpreted and used
That is, to what extent does your assessment measure what you say it does [and is as useful as you claim]?
- Stated another way:
To what extent are the interpretations and uses of a test justified by evidence about its meaning and consequences?
*Appropriate "use" of tests is a controversial recent addition to the
definition of "validity." That is probably why your textbook is cautious
in how it defines it.
Very important points. Validity is:
- a matter of degree ("how valid")
- always specific to a particular purpose ("validity for…")
- a unitary concept (four kinds of evidence to make one judgment: "how valid?")
- must be inferred from evidence; cannot be directly measured
Four interrelated kinds of evidence: content, construct, criterion, and consequence.
Questions Guiding Validation
- What are my learning objectives?
- Did my test really address those objectives?
- Do the students' test scores really mean what I intended?
- What may have influenced their scores?
- Did testing have the intended effects?
- What were the consequences of the testing process and scores?
What is an achievement domain?
A carefully specified set or range of learning
outcomes (content and mental skills).
In short, your set of instructional targets.
Definition: The extent to which an assessment’s tasks provide a
relevant and representative sample of the domain of outcomes
you are intending to measure.
- most useful type of validity evidence for classroom tests
- domain is defined by learning objectives
- items chosen with table of specifications
- is an attempt to build validity into the test rather
than assess it after the fact
- sample can be faulty in many ways:
  - inappropriate vocabulary
  - unclear directions
  - omits higher-order skills
  - fails to reflect the content or weight of what was actually taught
- "face validity" (superficial appearance) or a label does not provide evidence of validity
- assumes that test administration and scoring were adequate
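The "table of specifications" mentioned above is simply a content-by-skill grid used to plan how items sample the domain. A minimal sketch (all topics, cognitive levels, and item counts here are invented for illustration):

```python
# Minimal sketch of a table of specifications: rows are content areas,
# columns are cognitive levels, cells are the number of items planned.
# All topics and counts below are hypothetical examples.
table = {
    "fractions":     {"knowledge": 4, "application": 3, "reasoning": 1},
    "decimals":      {"knowledge": 3, "application": 3, "reasoning": 2},
    "word problems": {"knowledge": 1, "application": 2, "reasoning": 3},
}

# The grid makes the sampling plan explicit: you can check that the
# test's emphasis matches the emphasis of instruction before writing items.
total_items = sum(sum(levels.values()) for levels in table.values())
reasoning_items = sum(levels["reasoning"] for levels in table.values())
print(total_items)      # 22
print(reasoning_items)  # 6
```

Checking the totals this way shows whether, say, higher-order skills are underrepresented relative to your objectives.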
What is a construct?
A hypothetical quality or trait (e.g., extraversion,
intelligence, mathematical reasoning ability) that we use to explain some
pattern of behavior (e.g., good at making new friends, learns quickly,
good in all math courses).
Definition: The extent to which an assessment measures
the construct (e.g., reading ability, intelligence, anxiety)
the test purports to measure
Some kinds of evidence:
- see whether items behave consistently (if the test is meant to measure a single construct)
- analyze mental processes required
- compare scores of known groups
- compare scores before and after treatment (do they change in the way your theory says they will and will not?)
- correlate scores with other constructs (do they correlate well—and poorly—in the pattern expected?)
- usually assessed after the fact
- usually requires test scores
- is a complex, extended logical process; cannot be quantified
What is a criterion?
A valued performance or outcome (e.g., scores high on a standardized
achievement test in math, later does well in an algebra class) that we
believe might—or should—be related to what we are measuring (e.g.,
knowledge of basic mathematical concepts).
Definition: The extent to which a test’s scores correlate with some valued performance outside the test (the criterion)
The word "criterion" is used in a second sense in testing, so don't get
them confused. In this context it means some outcome that we want to
predict. In the other sense, it is a performance standard against which we
are comparing students' scores. In the latter sense, it is used to
distinguish "criterion-referenced" interpretations of test scores from
"norm-referenced" ones. "Susan reads at the proficient level" would
be a criterion-referenced interpretation. ("She reads better than 65% of
other students" would be a norm-referenced interpretation.)
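The two interpretations above can be made concrete in a few lines. A hedged sketch, with invented scores and an invented "proficient" cutoff:

```python
# Hypothetical class scores and one student's score (all values invented).
scores = [12, 15, 18, 20, 21, 23, 25, 27, 28, 30]
susan = 25

# Criterion-referenced interpretation: compare the score to a fixed
# performance standard (here, a hypothetical cutoff of 24 points).
proficient_cutoff = 24
is_proficient = susan >= proficient_cutoff

# Norm-referenced interpretation: compare the score to other students
# (here, the percentage of the group scoring below her).
percentile = 100 * sum(s < susan for s in scores) / len(scores)

print(is_proficient)  # True  -> "Susan is proficient"
print(percentile)     # 60.0  -> "Susan scored above 60% of the class"
```

Note that the same raw score yields two different statements, depending on whether it is referenced to a standard or to a group.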
- concurrent correlations (relate scores to a different, current measure of the criterion)
- predictive correlations (predict a future criterion performance)
What is a correlation?
A statistic that indicates the degree of relationship between any two
sets of scores obtained from the same group of individuals (e.g.,
correlation between height and weight).
A correlation is called a:
- validity coefficient when used in calculating criterion-related evidence of validity
- reliability coefficient when used in calculating the reliability of test scores
- always requires test scores
- is quantified (i.e., a number)
- must be interpreted cautiously because
- irrelevant factors can raise or lower validity coefficients (unreliability, spread of scores, etc.)
- often hard to find a good "criterion"
- can be used to create "expectancy tables"
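As a concrete sketch of a validity coefficient, the Pearson correlation between test scores and a later criterion can be computed directly; the scores and algebra grades below are hypothetical examples, not real data:

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two paired lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical data: classroom math test scores and later algebra grades
# (the criterion). Used this way, r is a predictive validity coefficient.
test_scores    = [55, 60, 65, 70, 75, 80, 85, 90]
algebra_grades = [70, 68, 75, 74, 80, 78, 88, 90]

r = pearson_r(test_scores, algebra_grades)
print(round(r, 2))  # 0.94
```

A coefficient this high would rarely appear in practice; as the bullets above note, restriction in the spread of scores or unreliability in either measure would lower it.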
What is a consequence?
Any effect that your assessment has—or fails to have—that is important to you or the other people involved.
Definition: The extent to which the assessment serves its intended purpose (e.g., improves performance) and avoids negative side-effects (e.g., distorts the curriculum)
Possible types of evidence:
- did it improve performance? motivation? independent learning?
- did it distort the focus of instruction?
- did it encourage or discourage creativity? exploration? higher level thinking?
- usually gathered after assessment is given
- scores may be interpreted correctly, but the test may still have negative side-effects
- have to weigh the consequences of not using the assessment (even if it has negative side-effects). Is the alternative any better—or maybe worse?
- judging consequences is a matter of values, not psychometrics
Sources of Threats to Validity
- tests themselves
- administration and scoring
- nature of group or criterion
Can you give examples of each?