MODULE 29
Standardized Tests and Scores

Outline

Types of Standardized Tests
  Categories of Standardized Tests
  Criterion-referenced and Norm-referenced Tests
Basic Concepts of Measurement
  Central Tendency and Variability
  Normal Distribution
Types of Test Scores
  Raw Scores
  Percentile Scores
  Grade-equivalent Scores
  Standard Scores
Characteristics of "Good" Tests
  Validity
  Reliability
Summary
Key Concepts
Case Studies: Reflect and Evaluate

Learning Goals

1. Describe the purpose of four broad categories of standardized tests and how standardized tests are used by teachers.
2. Explain the difference between criterion-referenced and norm-referenced tests.
3. Explain the basic properties of a normal distribution.
4. Describe four types of test scores, and explain the advantages and limitations of each type of score.
5. Explain why validity and reliability are two important qualities of tests and why teachers need this information about tests to interpret test scores.
TYPES OF STANDARDIZED TESTS
How many standardized tests have you taken in your life? Can you remember why you took them? Perhaps you took
the SAT or ACT to apply for college, or the Praxis I exam to be admitted to your undergraduate education program.
Before we consider why educators at all levels use standardized tests, we first need to define exactly what makes a
test standardized. Standardized tests are distinguished by two qualities (Gregory, 2007; Haladyna, 2002):
1. They are created by testing experts at test publishing companies.
2. All students are given the test by a trained examiner under the same (hence "standardized") conditions. For
example, all students are given the same directions, test items, time limits, and scoring procedures.
You’re probably very familiar with tests that are not standardized, such as the classroom tests you have taken
since elementary school. Classroom tests, often created by individual teachers, measure specific learning that occurs
within a classroom and typically focus on the district’s curriculum. Teachers may use classroom tests as a formative
or a summative assessment of students’ knowledge (Linn & Gronlund, 2000). Assessment includes any and all
procedures used to collect information and to make inferences or judgments about an individual or a program
(Reynolds, Livingston, & Willson, 2006). Formative assessments, such as homework assignments and quizzes,
enable teachers to plan for instruction and monitor student progress throughout a grading period. To assess student
achievement at the end of an instructional unit or grading period, teachers use summative assessments such as tests
and cumulative projects.
Like some classroom tests, standardized tests typically are used for summative assessments, but they focus on
broader areas of learning such as overall mathematical achievement rather than mathematical progress within a
grading period. For a summary of the differences between classroom and standardized tests, see Table 29.1.
Classroom assessment: See page 467.
Classroom tests: See page 481.
TABLE 29.1  Comparison of Classroom Tests and Standardized Tests

Purpose. Classroom test: formative and summative. Standardized test: typically summative.
Content. Classroom test: specific to content covered in the classroom over a specific time frame. Standardized test: specific or general topics across many districts or states.
Source of items. Classroom test: created or written by the classroom teacher. Standardized test: created by a panel of professional experts.
Administration procedures. Classroom test: can be flexible for students with disabilities and special needs. Standardized test: standardized across all settings and individuals.
Length. Classroom test: usually short, less than an hour. Standardized test: usually very long, several hours.
Scoring procedures. Classroom test: typically teacher-scored. Standardized test: typically machine-scored.
Reliability. Classroom test: typically low. Standardized test: typically high.
Scores. Classroom test: individual's number or percent correct (raw score). Standardized test: compared to predetermined criteria or norm group (converted from raw score).
Grading. Classroom test: used to assign a course grade. Standardized test: used to determine general ability or achievement; not used to assign a course grade.

Source: Haladyna, 2002.
What standardized tests do you remember taking in elementary through high school?
What purpose do you think they served? Think about these tests as you read about
the categories of standardized tests.
Categories of Standardized Tests
Standardized tests have several purposes. Some standardized tests—called single-subject survey tests—contain
several subtests that assess one domain-specific content area, such as mathematics. Other standardized tests contain
a battery of several tests used in conjunction with one another to provide a broader, more general picture of
performance that may include competencies such as vocabulary, spelling, reading comprehension, mathematics
computation, and mathematics problem solving. Standardized tests fall into one of four broad categories based on
their purpose (Chatterji, 2003), as described here and summarized in Table 29.2.
1. Standardized achievement tests assess current knowledge, which can include learning outcomes and skills either in general or in a specific domain. Standardized achievement tests do not necessarily match the curriculum of any particular state or school district. Instead, they are used to identify the strengths and weaknesses of individual students as well as school districts. Readiness tests, like achievement tests, measure young children's current level of skill in various academic (reading, math, vocabulary) and nonacademic (motor skills, social skills) domains and are used to make placement and curricular decisions in the early elementary grades.

TABLE 29.2  Examples of Standardized Tests in Four Broad Areas

Standardized achievement tests
  Iowa Test of Basic Skills (ITBS): Battery of achievement tests for grades K–8
  Tests of Achievement and Proficiency (TAP): Battery of achievement tests for grades 9–12
  Metropolitan Achievement Test (MAT): Battery of achievement tests for grades K–12

Standardized aptitude tests
  Differential Aptitude Test (DAT): Battery of tests to predict educational goals for students in grades 7–12
  Scholastic Assessment Test (SAT): Single test to predict academic performance in college
  General Aptitude Test Battery (GATB): Battery of aptitude tests used to predict job performance
  Armed Services Vocational Aptitude Battery (ASVAB): Battery of aptitude tests used to assign armed service personnel to jobs and training programs

Career or educational interest inventories
  Strong Interest Inventory (SII): Instrument used with high school and college students to identify occupational preferences
  Kuder General Interest Survey (KGIS): Instrument used with students in grades 6–12 to determine preferences in broad areas of interest

Personality tests
  NEO Five-Factor Inventory (NEO-FFI): Instrument to assess an individual on five theoretically derived dimensions of personality
  Minnesota Multiphasic Personality Inventory-2 (MMPI-2): Instrument used to aid in clinical diagnosis

Source: Chatterji, 2003.
2. Standardized aptitude tests assess future potential—or
capacity to learn—in general or in a specific domain. Aptitude
tests are used for admission or selection purposes to place students
in particular schools (e.g., private schools or colleges) or specific
classrooms or courses (e.g., advanced mathematics). Standardized
intelligence tests are considered aptitude tests because their
purpose is to predict achievement in school.
3. Career or educational interest inventories assess individual
preferences for certain types of activities. These inventories
typically are used to assist high school and college students in
planning their postsecondary education, as well as to assist
companies and corporations in selecting employees. Some of these
tests are also considered aptitude tests because they may predict
future success.
4. Personality tests assess an individual’s characteristics, such as
interests, attitudes, values, and patterns of behavior. Personality
tests are limited in their educational use because psychologists and
counselors with graduate-level training primarily use them for
diagnosis of clinical disorders and because most personality tests
are appropriate only for individuals age 18 or older.
Most standardized tests administered by teachers are given in a
group format. Group-administered tests are relatively easy to
administer and score, making them cost-effective. Individually
administered tests, such as personality tests and IQ tests, require
expert training, time to administer, and time to score and interpret, all
of which lead to greater cost. Although teachers typically are not
trained to administer these individual tests, they will encounter the
test scores of individually administered tests in meetings to determine
the eligibility of students for special education and related services.
Criterion-referenced and Norm-referenced Tests
The interpretation of test scores includes understanding not only what
the test measures—general versus specific knowledge or current
knowledge versus future potential—but also how test scores should
be evaluated. A test score is a measurement: a quantitative or descriptive value assigned during the process of assessment. But a test score by itself cannot be interpreted apart from how it is evaluated. Evaluation is the subjective judgment or interpretation of a measurement or test score (Haladyna, 2002). For example, a student might take a test and answer 20 of 30 questions correctly (measurement), but whether that score is interpreted as a "good" score, an "improvement" from a previous score, or "substantially below" the expected score is a matter of evaluation. Standardized
tests typically are designed so that any test score can be evaluated by
comparing it either to a specific standard (criterion) or to data
compiled from the test scores of many similar individuals (norm).
• Criterion-referenced tests compare an individual's score to a
preset criterion, or standard of performance, for a learning
objective. Many times criterion-referenced tests are used to test
mastery of specific skills or educational goals in order to provide
information about what an individual does and does not know. On
criterion-referenced tests, test developers include test items based
on their relevance to specific academic skills and curricula. The
criteria are chosen because, together, they represent a level of
expert knowledge. Lawyers, doctors, nurses, and teachers must
take standardized tests and meet a specified criterion in order to
become licensed or certified for their profession. Some tests that
students take—such as state mastery tests—are also
criterion-referenced tests.
• Norm-referenced tests compare the individual test-taker's
performance to the performance of a group of similar test-takers,
called the norm sample. A norm sample is a large group of
individuals who represent the population of interest on
characteristics such as gender, age, race, and socioeconomic status.
For example, a norm sample for a standardized test can be all fifth
graders nationally, all fifth graders in a state, or all fifth graders in
a district. Norm samples for nationally used standardized tests,
such as the achievement tests listed in Table 29.2, need to be large
(about 100,000 test-takers) and representative of the population of
students in order to allow accurate interpretations. The test items
on norm-referenced tests are designed to differentiate, to the
greatest degree possible, between individual test-takers. For
example, a norm-referenced mathematics
achievement test might be used to select the top elementary school students in a school district for a gifted program with limited seats or space.

Intelligence measured as IQ: See page 400.

TABLE 29.3  Comparison of Criterion-referenced and Norm-referenced Tests

Purpose. Criterion-referenced: to determine mastery at a specified level. Norm-referenced: to compare a score to the performance of similar test-takers.
Content. Criterion-referenced: specific to a domain or content area. Norm-referenced: broad domain or content area.
Item selection. Criterion-referenced: similar level of difficulty. Norm-referenced: wide variability in difficulty level.
Scores. Criterion-referenced: number or percent correct as compared to criteria. Norm-referenced: standard score, percentile, or grade-equivalent score as compared to the norm group.

Source: Gregory, 2007.
The major difference between the two types of tests is the purpose
or situation for which each type of test is most useful, as summarized
in Table 29.3. Many standardized group-administered achievement
tests provide teachers with both a criterion-referenced and a
norm-referenced interpretation of scores, as shown in Figure 29.1.
When dual interpretation is not available, the type of test that is used
will depend on the purpose. Criterion-referenced tests provide
information about mastery of material but do not allow comparisons
among test-takers. In contrast, norm-referenced tests do not provide
information about the mastery or the strengths and weaknesses of a
particular individual, but they do provide ample information for
comparing test scores across individuals using several basic concepts
of measurement, as discussed next.
BASIC CONCEPTS OF MEASUREMENT
In order to interpret test scores accurately, teachers must understand
some basic concepts of measurement that are used in conjunction
with one another to evaluate individual students as well as to evaluate
groups of students, such as classrooms or school districts.
Central Tendency and Variability
One basic measure needed to form evaluations or make comparisons
is central tendency—the score that is typical or representative of the
entire group. Let’s examine a set of classroom or standardized test
scores. Suppose you teach a class of 11 students who receive these
scores on their first exam: 63, 65, 72, 75, 76, 78, 83, 87, 87, 92, and
98. What measure will tell you the central tendency of this set of
numbers? The three most common statistical descriptions of central tendency are the mean, median, and mode, each computed in the brief sketch that follows this list:
1. Mean: Divide the sum of all the scores by the total number of
scores to find the mean, or simple average. Summing the 11 scores
(sum = 876) and dividing by 11 gives a mean of 79.64.
2. Median: Find the middle score in a series of scores listed from
smallest to largest. In this case, the median is 78, the middle value,
because five scores are on either side. In a group with an even
number of scores, the median is the midpoint, or average, of the two
middle scores.
3. Mode: Find the most frequently occurring score in the group. In
this group, the mode is 87, the only score that occurs more than
once. A group of scores can be bimodal—having two modes—when two different scores occur equally often and more often than any other score.
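For readers who want to check the arithmetic, here is a brief Python sketch (illustrative only, using the standard library) that computes all three measures for the 11 exam scores above.

    from statistics import mean, median, multimode

    scores = [63, 65, 72, 75, 76, 78, 83, 87, 87, 92, 98]

    print(mean(scores))       # 79.636..., the simple average (79.64 when rounded)
    print(median(scores))     # 78, the middle value of the ordered scores
    print(multimode(scores))  # [87], the most frequently occurring score(s)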
The mean, median, and mode all provide information about the
typical score within a group but do not provide information about the
variability—how widely the scores are distributed, or spread out,
within a particular group. Compare these two groups of test scores:
Figure 29.1: A Simple Standardized Test Score Report. Many achievement tests provide two kinds of scores. Criterion-referenced scores, listed here under "Performance on Objectives," measure a student's mastery of material. Norm-referenced scores, at the bottom of the report, allow some comparisons among test-takers.
[Figure 29.1 reproduces a simulated TerraNova, Third Edition, Complete Battery Individual Profile for a fourth-grade student. The "Performance on Objectives" section reports a criterion-referenced Objectives Performance Index (OPI) for each objective in Reading, Language, Mathematics, Science, and Social Studies, with low, moderate, and high mastery ranges; the OPI estimates the number of items the student could be expected to answer correctly if there had been 100 items for that objective. The "Norm-Referenced Scores" section at the bottom reports scale scores, national percentiles, and national stanines for each content area and for the total score.]
Figure 29.2: Normal Distributions with Large (orange) and Small (blue) Standard Deviations. With a small standard deviation, most scores are close to the mean score of the group; with a large standard deviation, scores are more spread out.
Class 1 scores: 6, 7, 7, 8, 8 Class 2 scores: 4, 7, 7, 8, 10
Both classes have a mean test score of 7.2, but the scores in the second class
show considerably more variation. The range is a simple measure of variability
calculated as the difference between the highest and lowest scores. For Class 1,
the range is 2 (8 minus 6), while for Class 2, the range is 6 (10 minus 4).
The most commonly used measure of variability among scores, the standard deviation (SD), describes how far scores typically fall from the mean of the group. The standard deviation is more difficult to compute than the range: it equals the square root of the average squared deviation of the scores from the mean. This sounds more complex than it is. The computation is less
important than understanding the standard deviation for test score interpretation. Figure 29.2 shows the difference in
variability for two groups of scores with small and large standard deviations.
• A small standard deviation indicates that most scores are close to the mean score of the group.
For a classroom test, the teacher might hope that all students would score well and close to one another,
indicating that all students are mastering the course objectives.
• A large standard deviation suggests that the scores are more spread out. On standardized achievement tests, a
large degree of variability is not only typical but optimal, because the test items are designed to make fine
discriminations in achievement among a population of students.
In the example with Class 1 and Class 2, the standard deviations are .84 (Class 1) and 2.17 (Class 2). With a small
set of numbers, the variability may be obvious. However, with large groups of scores, such as the thousands of
scores of students taking a standardized achievement test, the standard deviation provides a precise numerical summary of how widely the scores are distributed.
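As an illustration, the following short Python sketch computes the range and standard deviation for the two classes; the statistics module's stdev uses the sample formula, which matches the values of .84 and 2.17 reported above.

    from statistics import mean, stdev

    class1 = [6, 7, 7, 8, 8]
    class2 = [4, 7, 7, 8, 10]

    for name, scores in (("Class 1", class1), ("Class 2", class2)):
        score_range = max(scores) - min(scores)  # highest score minus lowest score
        print(f"{name}: mean = {mean(scores)}, range = {score_range}, SD = {stdev(scores):.2f}")

    # Class 1: mean = 7.2, range = 2, SD = 0.84
    # Class 2: mean = 7.2, range = 6, SD = 2.17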
Normal Distribution
A frequency distribution is a summary of how often each score occurs in a group. The scores can be depicted visually in a
histogram, or bar graph. For example, Figure 29.3 depicts the final grades in an educational psychology course.
Final grades are indicated along the horizontal axis (x-axis), and the number of students receiving each final grade is
indicated along the vertical axis (y-axis). As Figure 29.3 shows, 7 students failed the course, 17 students received a
grade of D, 45 students received a C, 37 students received a B, and 15 students received an A. In this figure more
scores fall to the right (higher scores) and fewer scores fall to the left (lower scores) of the midpoint, indicating that
the scores are skewed. Skewness, or the symmetry or asymmetry of a frequency distribution, tells how a test is
working. Negatively skewed distributions (with long tails to the left), such as in Figure 29.3, indicate that the scores
are piled up at the high end. Positively skewed distributions indicate that the scores are piled up at the low end (long
tail to the right).
Classroom tests with negative skewness are what teachers hope to achieve (i.e., mastery by most students). In
standardized testing, positively skewed distributions suggest that the test had too many difficult questions, and
negatively skewed distributions suggest that the test had too many easy items.
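The direction of skew can also be checked numerically. The sketch below is illustrative only; it codes the letter grades 0 through 4 for F through A, an assumption made just for this example.

    from statistics import mean, pstdev

    # Final grades from Figure 29.3, coded numerically: F = 0, D = 1, C = 2, B = 3, A = 4
    counts = {0: 7, 1: 17, 2: 45, 3: 37, 4: 15}
    grades = [g for g, n in counts.items() for _ in range(n)]

    m, sd = mean(grades), pstdev(grades)
    skew = mean([((g - m) / sd) ** 3 for g in grades])  # third standardized moment

    print(round(skew, 2))  # about -0.27: negative, so scores pile up at the high end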
For standardized tests, we expect a frequency distribution that is symmetrical and bell-shaped, called a normal distribution (see Figure 29.4). Normal distributions are apparent in scores on the SAT and on IQ tests (see Figures 29.4b and 29.4c). A normal distribution has several properties, illustrated by the brief simulation that follows the list below:
Figure 29.3: Histogram of Final Grades in an Educational Psychology Course. Final grades are indicated along the horizontal
axis (x-axis), and the number of students receiving each
final grade is indicated along the vertical axis (y-axis).
Figure 29.4: Normal Distribution Curves. For standardized tests, a symmetrical, bell-shaped frequency distribution is considered normal. Normal distributions are apparent in scores on SAT and IQ tests.
[Figure 29.4a shows the percentage of cases under portions of the normal curve (0.13%, 2.14%, 13.59%, 34.13%, 34.13%, 13.59%, 2.14%, and 0.13% across the standard deviations), together with the corresponding cumulative percentages, percentile equivalents, z-scores (-4.0 to +4.0), T-scores (20 to 80), and stanines (1 to 9, containing 4%, 7%, 12%, 17%, 20%, 17%, 12%, 7%, and 4% of scores). Figure 29.4b shows the same curve scaled to SAT scores from 200 to 800; Figure 29.4c shows it scaled to IQ scores from 55 to 145.]
• The mean, median, and mode are equal and appear at the midpoint of the distribution, indicating that half the scores are above the mean and half the scores are below the mean.
• Approximately 68% of scores occur within one standard deviation above and below the mean.
• Two standard deviations above and below the mean include approximately 95% of scores.
• Three standard deviations above and below the mean include approximately 99% of scores.
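These percentages can be verified with a brief simulation (illustrative only). The sketch below draws a large sample from a normal distribution with a mean of 100 and a standard deviation of 15, the IQ metric discussed later in this module, and counts the scores falling within one, two, and three standard deviations of the mean.

    import random

    random.seed(1)
    mu, sd, n = 100, 15, 100_000
    scores = [random.gauss(mu, sd) for _ in range(n)]

    for k in (1, 2, 3):
        within = sum(1 for s in scores if abs(s - mu) <= k * sd)
        print(f"within {k} SD of the mean: {within / n:.1%}")  # about 68%, 95%, and 99.7%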
TYPES OF TEST SCORES
Raw Scores
Both classroom and standardized tests first yield a raw score, which typically is the number or percentage of correct
answers. For evaluating the results of classroom tests, raw scores typically are used. For standardized
criterion-referenced tests, raw scores are compared to the preset criterion for interpretation (e.g., pass/fail,
mastery/nonmastery). For standardized norm-referenced tests, raw scores
more commonly are converted or transformed by the test developers
to help provide consistent evaluation and ease of interpretation of
scores by parents and teachers. Next we consider several common
norm-referenced test scores.
Percentile Scores
Percentile scores (or ranks) are derived by listing all raw scores from
highest to lowest and providing information on the percentage of
test-takers in the norm sample who scored below or equal to that raw
score. For example, a percentile score of 80 means that the test-taker
scored as well as or better than 80% of all test-takers in the norm
sample. Be careful not to confuse the percentage of correct answers
on a test with the percentile score, which compares individual scores
among the norm sample. For example, an individual could correctly
answer 65 of 100 questions (65%) on a test, but the percentile ranking
of that score would depend on the performance of other test-takers. If
a raw score of 65 is the mean in a normal distribution of scores (the
middle value in the bell curve), then the raw score of 65 would have a
percentile score of 50, meaning that 50% of the norm group scored
below or equal to 65.
One problem with percentile scores is that they are not equally
distributed across the normal curve. A small difference in raw scores
in the middle of a distribution of scores can result in a large percentile
score difference, while at the extremes of the distribution (the upper
and lower tails) larger raw score differences between students are
needed to increase the percentile ranking. This means that percentile
scores overestimate differences in the middle of the normal curve and
underestimate differences at either end of the normal distribution.
As an example, take another look at Figure 29.4b, the normal
distribution of SAT scores for each subscale. Assume the following
percentile scores:
Student A received a score of 500 → 50 percentile
Student B received a score of 600 → 84.1 percentile
Student C received a score of 700 → 97.7 percentile
Student D received a score of 800 → 99.9 percentile
Based on percentile scores, student B appears to have markedly
outperformed student A (percentile score of 84.1 compared to 50),
and students C and D appear to have performed very similarly
(percentile scores of 97.7 and 99.9). In actuality, the difference in
performance between students is exactly the same (100 points).
Hence, comparisons should not be made between two students’
percentile scores. The interpretation of percentile scores should only
involve comparing one student’s score to the performance of the
entire norm group (e.g., student A performed better than or equal to
50% of all test-takers).
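Under the normal-curve assumption used in the SAT example above, these percentile scores can be reproduced with Python's standard library. This is a sketch only; the mean of 500 and standard deviation of 100 are the SAT subscale values described earlier.

    from statistics import NormalDist

    sat = NormalDist(mu=500, sigma=100)  # SAT subscale: mean 500, SD 100

    for score in (500, 600, 700, 800):
        percentile = sat.cdf(score) * 100  # percent of the norm group scoring at or below this score
        print(score, round(percentile, 1))

    # 500 -> 50.0, 600 -> 84.1, 700 -> 97.7, 800 -> 99.9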
Grade-equivalent Scores
Grade-equivalent scores are based on the median score for a
particular grade level of the norm group. For example, if the median
score for all beginning sixth graders in a norm group taking a
standardized achievement test is 100, then all students scoring 100 are
considered to have a grade-equivalent (GE) score of 6.0, or beginning
of sixth grade. The 6 denotes grade level, and the 0 denotes the
beginning of the school year. Because there are 10 months in a school
year, the decimal represents the month in the school year. Suppose
the median score for all sixth graders in the seventh month of the
school year is 120. Then a student earning a score of 120 would have
a GE score of 6.7.
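Because a GE score is essentially a lookup against the norm group's median scores, the conversion can be sketched in a few lines. The sketch below is illustrative only: it uses the two median values from the example above and assumes simple linear interpolation between them, which an actual test publisher may or may not use.

    # Norm-group median raw scores at two points in sixth grade (from the example above)
    norm_medians = {6.0: 100, 6.7: 120}

    def grade_equivalent(raw_score):
        # Interpolate a GE score between the two known norm-group medians
        (ge_lo, med_lo), (ge_hi, med_hi) = sorted(norm_medians.items())
        fraction = (raw_score - med_lo) / (med_hi - med_lo)
        return round(ge_lo + fraction * (ge_hi - ge_lo), 1)

    print(grade_equivalent(100))  # 6.0
    print(grade_equivalent(120))  # 6.7
    print(grade_equivalent(110))  # 6.3, roughly midway between the two medians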
GE scores often are misused because individuals assume they are
mathematical statistics for interpreting students’ performance.
However, GE scores function more like labels—they cannot be
added, subtracted, multiplied, or divided. Because each GE score for
a test (or a subtest of a test battery) is derived from the median raw
score of a norm group, median raw scores will vary from year to year,
from test to test, and from subtest to subtest within the same test
battery. Therefore, GE scores cannot be used to compare students’
improvement from year to year, students’ relative strengths and
weaknesses from test to test, or even students’ scores on subtests of a
standardized test. GE scores can be used only to describe whether a
student is performing above grade level, at grade level, or below
grade level.
There is a risk of misinterpreting grade-equivalent scores. A
person might conclude that a student who scores above his or her
actual grade level is able to be successful in an advanced curriculum
or that a student who scores below grade level should be held back a
grade. Suppose a second-grade student has a GE score of 5.2 on a
reading achievement test. We would say that the second-grade
student scored as would an average fifth-grade student in the second
month of school, if the fifth-grade student took a reading test
appropriate for second graders. In other words, all we can say is that
the second grader is above average for his or her grade in reading
achievement, not that the second grader "reads on a fifth-grade level."
The misinterpretation of GE scores stems from two problems:
1. The computation of GE scores does not use any information
about the variability in scores within a distribution. The median
score of all beginning sixth graders may be 100, but there is great
variability in scores among the students. Not all beginning sixth
graders will score 100. An expectation by the school district,
teacher, or government that all students should reach that score is
unrealistic (Anastasi & Urbina, 1997).
2. The variability in GE scores increases as grade level increases,
with students at lower grade levels relatively homogeneous in
performance and students at middle and high school levels
showing a wide variation (Anastasi & Urbina, 1997). So a
first-grade student who scores one year below grade level may be
substantially below peers in achievement, while a ninth-grade
student who scores one year below grade level actually may be
average in achievement.
Because of the likely misinterpretation and misuse of
grade-equivalent scores, most educators and psychologists do not
recommend using them.
Standard Scores
Standard scores are used to simplify score differences and, in some
instances, more accurately describe them than can either percentiles
or grade-equivalent scores. For example, standard scores are used in
interpreting IQ tests and the SAT by converting the raw scores using
the mean and standard deviation. When the scores are converted, the
mean for IQ tests is 100 and the standard deviation is 15, while for
each subscale of the SAT the mean score is approximately 500 and
the standard deviation is 100.
A common standard score is calculated by using the mean and the
standard deviation (SD) to convert raw scores into z-scores. When
raw scores are converted into z-scores using the simple formula
z = (raw score – mean) / SD
the z-score distribution always has a mean score of 0 and a standard
deviation of 1, leading to z-scores that range from –4.00 to +4.00, as
shown in Figure 29.4a. Because z-scores are based on units of
standard deviations, comparisons across student scores are more
precise than with percentiles or grade-equivalent scores and are less
likely to be misinterpreted.
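A minimal sketch (illustrative only) applies this formula to the two scales just mentioned, showing how scores from different tests land on the same z-score metric.

    def z_score(raw, mean, sd):
        # How many standard deviations the raw score lies above or below the mean
        return (raw - mean) / sd

    print(z_score(600, mean=500, sd=100))  # 1.0: one SD above the mean on an SAT subscale
    print(z_score(85, mean=100, sd=15))    # -1.0: one SD below the mean on an IQ scale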
Because negative numbers can create some concern and
confusion—and also have a negative connotation when interpreting
students’ ability or achievement—another common standardized
score is the T-score. The T-score is also based on the number of
standard deviations, but it has a mean of 50 and a standard deviation
of 10 (z-scores are multiplied by 10, and then 50 is added to the
product). Again, looking at Figure 29.4a, we see that a T-score of 60
represents one standard deviation above the mean (+1.00 z-score) and
a T-score of 40 represents one standard deviation below the mean
(–1.00 z-score). T-scores typically are not used with standardized
achievement or aptitude tests, but they are commonly used with
personality tests and behavioral instruments, particularly those that
assist in the diagnosis of disorders (Gregory, 2007).
Stanine scores are based on percentile rank but convert raw scores
to a single-digit system from 1 to 9 that can be easily interpreted
using the normal curve, as shown in Figure 29.4a (Gregory, 2007).
The statistical mean is always 5 and comprises the middle 20 percent
of scores, although scores of 4, 5, and 6 are all interpreted or
evaluated as average. Stanines of 1, 2, and 3 are considered below
average, and stanines of 7, 8, and 9 are considered above average.
Because the scores are based on percentile rank, they do not provide
better comparisons across scores than do z-scores and T-scores.
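Both T-scores and stanines can be derived from a z-score with simple transformations. The sketch below is illustrative; the stanine line uses the common half-standard-deviation approximation rather than an exact percentile-band lookup.

    def t_score(z):
        # T-score: multiply the z-score by 10 and add 50 (mean 50, SD 10)
        return 10 * z + 50

    def stanine(z):
        # Approximate stanine: bands half a standard deviation wide, centered on 5
        return max(1, min(9, round(2 * z + 5)))

    for z in (-1.0, 0.0, 1.0, 2.5):
        print(f"z = {z:+.1f}  T = {t_score(z):.0f}  stanine = {stanine(z)}")

    # z = -1.0 gives T = 40 and stanine 3; z = +1.0 gives T = 60 and stanine 7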
Teachers typically are asked to interpret standardized test scores
for parents. Given a choice, which type of test score would you
prefer to use in providing your interpretation and why? How
comfortable would you be explaining the various other test
scores?
CHARACTERISTICS OF "GOOD" TESTS
Several characteristics of tests and test scores are important for
appropriate test-score interpretation, including:
• standardized test administration (as mentioned at the beginning of this module),
• large and representative norm samples for norm-referenced tests, and
• the use of standard scores when interpreting performance.
Teachers should evaluate two additional characteristics before
selecting tests to use or interpreting test scores: validity and
reliability. Without adequate evidence that test scores are reliable and
valid, test-score interpretations are meaningless. Let’s explore each
concept in more detail.
Validity
How do we know that a test score accurately reflects what the test is
intended to measure? To answer this question, we need to be able to
evaluate the validity of the test score. Validity is the extent to which
an assessment actually measures what it is intended to measure,
yielding accurate and meaningful interpretations from the test score.
Keep in mind that validity refers to the test score, not the test itself
(the collection of items in a test booklet). Consider these examples:
• Just because a test score is intended to predict intelligence, such as an IQ score of 120, does not mean that it fulfills that purpose.
• A test may be valid for most individuals, but the test score might
be invalid for a particular individual. For example, a standardized
achievement test score would not be valid for a student who takes
the test without wearing his or her prescription eyeglasses. Similarly,
a non–English speaking student taking a test written in English is
unlikely to receive an accurate interpretation of his or her
achievement in a particular subject area based on the test score
(Haladyna, 2002).
Validity is not an all-or-none characteristic (valid or invalid), and it
can never be proved. Rather, it varies depending on the extent of the
research evidence supporting a test score’s validity. All validity is
considered construct validity, or the degree to which an
unobservable, intangible quality or characteristic (construct) is
measured accurately. The construct validity of a test score can be
supported by several types of evidence:
1. Content validity evidence provides information about the extent
to which the test items accurately represent all possible items for
assessing the variable of interest (Reynolds et al., 2006). For
example, do the 50 items on a standardized eighth-grade math
achievement test adequately represent the content of eighth-grade
mathematics? The issue of content validity is also relevant to
classroom tests because most teachers choose a subset of questions
from a pool of possible questions they could ask to represent the
knowledge base for a particular learning goal.
2. Criterion-related validity evidence shows that the test score is
related to some criterion—an outcome thought to measure the
variable of interest (Reynolds et al., 2006). For example, aptitude
tests used to predict college success should be related to subsequent
GPA in college, an outcome measure related to a student’s general
aptitude (Gregory, 2007). Two types of criterion-related validity are:
• concurrent validity evidence, based on the test score and another
criterion assessed at approximately the same time, such as a math
achievement test score and the student’s current grade in math; and
• predictive validity evidence, based on the test score and another
criterion assessed in the future, such as an aptitude test and later
college GPA.
3. Convergent validity evidence shows whether the test score is
related to another measure of the construct. For example, a new test
designed to measure intelligence should be correlated with a score
on an established intelligence test.
4. Discriminant validity evidence demonstrates that a test score is
not related to another test score that assesses a different construct.
For example, a reading test would not be expected to correlate with a
test of mental rotations or spatial abilities.
5. Theory-based validity evidence provides information that the
test scores are consistent with a theoretical aspect of the construct
(e.g., older students score higher than younger students on an
achievement test).
Use of standardized tests with non–English speaking students: See page 545.
Validity of classroom assessments: See page 470.
Reliability
If a standardized aptitude test is given to a student on Monday and
again on Friday, would you expect the test scores to be different,
similar, or exactly the same? We would expect both test scores to be
similar, because it would be highly improbable for the student to
receive the exact same score twice or to have two wildly divergent
scores. This consistency, called the reliability of the test score or
measurement, is measured on a continuum from high to low. A
reliability index can be computed in a number of ways depending on
the type of test and the test-scoring procedures. For example,
administering the same aptitude test on Monday and Friday is a type
of reliability procedure called test-retest. The computed relationship
between the test and retest scores provides a reliability index. All
reliability indexes, or reliability coefficients, range from 0 to 1, with
higher numbers indicating higher reliability (Haladyna, 2002):
• .90 or above is considered highly reliable,
• between .80 and .90 is considered good, and
• below .80 is considered questionable.
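As an illustration of how a test-retest reliability coefficient is obtained, the sketch below correlates two administrations of the same test. The scores are hypothetical and exist only for this example; real reliability studies use large samples and published procedures.

    from statistics import correlation  # Pearson r; available in Python 3.10 and later

    monday = [78, 85, 62, 90, 71, 88, 67, 95]  # hypothetical first administration
    friday = [80, 83, 65, 92, 70, 85, 70, 93]  # hypothetical retest for the same students

    r = correlation(monday, friday)
    print(round(r, 2))  # about 0.98: the two administrations rank students very consistently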
To better understand reliability, let’s consider another form of
measurement—your bathroom scale. Have you ever stepped on the
scale to weigh yourself, read the number, and then thought ―That
can’t be right‖? You step right back onto the same scale, and a
slightly different weight registers (maybe one you prefer, maybe not).
The difference in the weights is due to measurement error.
Measurement error is the accumulation of imperfections that are
found in all measurements. Test scores, like all other measurements,
are an imperfect type of measurement. Measurement error on tests
can result from a number of sources (Gregory, 2007; Reynolds et al.,
2006):
• item selection (e.g., clarity in the wording of questions),
• test administration (e.g., a test administrator who has a harsh tone of voice and increases student anxiety),
• individual factors (e.g., anxiety, illness, fatigue), or
• test scoring (e.g., subjective, judgment-based assessments such as essays).
Even though these sources of measurement error are unpredictable,
developers of standardized tests estimate the amount of error expected
on a given test, called the standard error of measurement (SEM)
(also called the margin of error in public surveys such as political
polls). The statistical calculation for determining the standard error of
measurement is based on the reliability coefficient of the test and the
standard deviation of test scores from the scores of a norm group.
This calculation is not as important as how SEM can be used to
interpret test scores. With an individual test score, SEM can help
determine the confidence interval, or the range in which the
individual’s true score (i.e., true ability) lies. Consider this example:
A student receives a raw score of 25 on a standardized
achievement test with SEM 4.
If we calculate a 68% confidence interval (the raw score plus or
minus the SEM), the student’s score range is 21 to 29.
We can say with 68% confidence that the student’s true score is
between 21 and 29 on this standardized achievement test.
We have used a 68% confidence interval with a raw score as a
simple explanation of how SEM helps determine a student’s true
score. However, remember that most standardized tests results will
report a 95% or 99% confidence interval and that the confidence
intervals will use standard scores (z-scores, T-scores, stanines, etc.)
rather than raw scores. Many psychologists and test developers
recommend using confidence intervals to remind professionals,
parents, and researchers that measurement error is present in all test
scores (Gregory, 2007).
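A short sketch (illustrative only) shows both pieces: the SEM computed from a reliability coefficient and a score standard deviation using the standard formula SEM = SD x sqrt(1 - reliability), and the confidence interval built around an obtained score. The 1.96 multiplier for the 95% interval comes from the normal curve.

    import math

    def sem(sd, reliability):
        # Standard error of measurement from the score SD and the reliability coefficient
        return sd * math.sqrt(1 - reliability)

    def confidence_interval(obtained, sem_value, z=1.0):
        # Range likely to contain the true score: z = 1.0 gives ~68%, z = 1.96 gives ~95%
        return obtained - z * sem_value, obtained + z * sem_value

    print(confidence_interval(25, 4))              # (21.0, 29.0): the 68% interval from the example
    print(confidence_interval(25, 4, z=1.96))      # about (17.2, 32.8): a 95% interval
    print(round(sem(sd=15, reliability=0.90), 1))  # 4.7: SEM for an IQ-type scale with reliability .90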
Reliability of classroom assessments: See page 483.
Measurement Error. All measurements, including weights on bathroom
scales and standardized test scores, have imperfections.
One note of caution is needed regarding the relationship between validity and
reliability. Say, for example, that your bathroom scale is very reliable but you later discover
that its measurement is consistently off by 10 pounds. This shows that consistent results
(reliability) can be found with a measure that does not accurately assess the construct of
interest (validity). In short, reliability does not lead to validity. Reliability is necessary—any
test that is valid also must measure the construct of interest consistently—but it is not
sufficient for achieving validity.
A bathroom scale can consistently read 350 pounds for a person who actually weighs 110
pounds. In this case also, the scale is reliable but not valid. If a standardized aptitude test
accurately predicts success in college (validity), the results should be consistent (reliability)
across multiple testing sessions, such as early or late in twelfth grade. If the test results lack
reliability, validity also is undermined.
Teachers need to evaluate the validity and reliability of a test before attempting to make
test score interpretations. The publishers of most standardized tests report the validity evidence and reliability coefficients that teachers and school districts use to determine which tests are "good."
Assume that your school district is using a highly reputable standardized test with a reliability coefficient above .90. One of your students performs well, as you would expect, at the beginning of the year but then performs very poorly on the same test at the end of the year. Is this student's test score valid? What might affect the reliability and validity of this student's test score?
Measurement Error Is Present in All Test Scores. The standard error of measurement (SEM), or margin of error, provides information used to determine a confidence interval or range in which the true score would fall. In this political poll, the President's approval rating is 37% (disapproval 63%) with a margin of error of plus or minus 5%, so the actual approval rating is somewhere between 32% and 42%.
Summary
Describe the purpose of four broad categories of standardized tests and how standardized tests are used by teachers. Standardized achievement tests are used to assess the degree of current knowledge or learning in either broad or domain-specific areas. Standardized aptitude tests assess an individual's future potential to learn in general or in domain-specific areas. Career or educational interest inventories assess preferences related to certain types of activities. Personality tests are used to assess individual characteristics. Teachers are most likely to administer group tests, which are both relatively easy to administer and cost effective. Teachers may encounter individually administered test results when determining special education eligibility.

Explain the difference between criterion-referenced and norm-referenced tests. Although some standardized tests provide both criterion-referenced and norm-referenced interpretations, the type of test or interpretation used is based on the purpose of the test. Criterion-referenced tests provide information about the mastery and the strengths and weaknesses of individual students, such as whether a particular student meets certification requirements. In contrast, norm-referenced tests allow comparisons among student scores that may be used in making decisions such as selecting the top students from a group.

Explain the basic properties of a normal distribution. The normal distribution is a special type of frequency distribution. Although some frequency distributions may be skewed, with more scores falling on the higher or lower end, the normal distribution is bell-shaped and symmetrical. The three central tendencies—mean, median, and mode—are equal to one another and appear at the midpoint of a normal distribution. The variability among scores is standard such that 68% of scores are within one standard deviation of the mean, 95% of scores are within two standard deviations of the mean, and 99% of scores are within three standard deviations of the mean.

Describe four types of test scores, and explain the advantages and limitations of each. (1) Raw scores are the number of correct answers or percentage of correct answers. They provide adequate information for classroom tests but are more difficult to interpret when comparing scores across students or groups of students. (2) Percentile scores are based on the percentage of test-takers who scored below or equal to a student's raw score. Percentile scores provide information about how well an individual performed in comparison to a group but should not be used to compare different students' scores. (3) Grade-equivalent scores represent the median score for particular grade levels, indicating whether a student is scoring at grade level, below grade level, or above grade level. GE scores are commonly misinterpreted; hence, experts do not recommend their use. (4) Standard scores are derived from percentile rank scores by converting them into a single-digit system (i.e., stanines) or from raw scores by converting them into scores based on a specified mean and standard deviation (i.e., z-scores and T-scores). Standard scores typically are used for ease of interpretation, and those based on the mean and standard deviation also allow accurate comparisons among scores.

Explain why validity and reliability are two important qualities of tests and why teachers need this information about tests to interpret test scores. To determine the quality of a test, teachers should evaluate the validity evidence and reliability of test scores. Validity refers to the extent to which a test measures what it is intended to measure. Reliability of test scores refers to the consistency of the measurement, with highly reliable tests having minimal measurement error and low-quality tests having high measurement error. Teachers can use information about validity and reliability evidence to determine the quality of a test and to make decisions about whether the test should be used. The standard error of measurement also can be used to determine a confidence interval, rather than depending on a single raw or standard score. Confidence intervals remind teachers and other professionals that some measurement error is present in all tests—even high-quality tests.
Key Concepts

career or educational interest inventories, central tendency, concurrent validity, confidence interval, construct validity, content validity, convergent validity, criterion-referenced tests, criterion-related validity, discriminant validity, evaluation, frequency distribution, grade-equivalent scores, mean, measurement, measurement error, median, mode, norm-referenced tests, normal distribution, norm sample, percentile scores, personality tests, predictive validity, range, raw score, reliability, skewness, standard deviation (SD), standard error of measurement (SEM), standardized achievement tests, standardized aptitude tests, standardized tests, standard scores, stanine scores, theory-based validity, T-score, validity, variability, z-scores
Case Studies: Reflect and Evaluate

Early Childhood: "Kindergarten Readiness"
These questions refer to the case study on page 516.
1. The BRIGANCE® K & 1 Screen-II measures gross and
fine motor skills, color recognition, knowledge of body
parts, counting, oral comprehension, and many literacy and numeracy skills. In what way is this standardized
readiness test like a standardized achievement test?
2. What basic concepts of measurement are used to create the range Amy refers to when explaining to her
roommate that average typically means a range in scores?
3. Explain why grade-equivalent scores can be confusing to parents like Ms. Jackson. What types of scores could
be used to better compare achievement differences among students?
4. Maria's mother is concerned about how Maria's test score will be interpreted and used. What characteristic of "good tests" is a concern for Maria's mother? Is her concern justified? Why or why not?
5. Suppose Maria's percentile ranking on the BRIGANCE is the 38th percentile. How would you interpret this score in relation to other students? What if another student scored at the 49th percentile? How would you compare the performance of this student to Maria's?
6. Define validity in your own words. Explain whether Maria's readiness test results would be valid if she took the
English version of the BRIGANCE. What if she took the English version with her sister as interpreter?
Elementary School:
"Keyboard Courage"
These questions refer to the case study on page 518.
1. Mr. Whitney mentions the difference between norm-referenced and criterion-referenced tests. Explain whether he
is accurate in his interpretation about how the test scores are used.
2. Based on the normal distribution and the information Mr. Washington provides regarding the test scores being
half a standard deviation below the mean, how poorly are the students doing in comparison to students across the
country?
3. If the test scores had been half a standard deviation above the national mean, would Principal Bowman be as
concerned? Why or why not?
4. Explain what is wrong with Ms. Cong's interpretation of percentile scores.
5. Explain how the average percentile score could have increased from 46 to 48 while average scores fell below the
state cutoff levels.
6. Ms. Rivadeneyra suggests that the test scores do not accurately reflect the students' abilities. What characteristic of good tests is involved here? Explain how this characteristic was influenced by the events near the school last year.
Middle School:
"Teachers Are Cheating?"
These questions refer to the case study on page 520.
1. Why did Mr. Rients take so much time to prepare for the standardized testing session? What might happen if a teacher didn't prepare by reading the instructions ahead of time and noting the time limits?
2. Assume that the national test scores represent the normal distribution. Based on Acting Principal Garrison's reply that the previous year's scores were only half a standard deviation above the mean, how accurate was Mr. Rients's interpretation that the test scores were "way above the national average"? Explain your answer.
3. Lisa uses the percentile score to provide information about how the school's test scores have jumped back and forth over the past two years. Explain why this may not be the best test score to use for comparing annual progress.
4. Assume that several weeks after the testing session Lisa announces that the school's average stanine score for reading was 7. How does this compare to last year's test scores?
5. What characteristic of good tests is Lisa referring to when she says that "test scores shouldn't jump back and forth so drastically"? What characteristic of good tests is she referring to when she adds "at least not if the test is doing its job"? Why is it important for teachers to know about these characteristics?
High School:
"SAT Scores"
These questions refer to the case study on page 522.
1. What type of standardized test is the SAT? Why might a
student score high on an achievement test but not on the SAT?
2. Explain how a norm-referenced test, such as the SAT, can be
used as a criterion-referenced test by colleges and universities for
determining admissions.
3. Based on the information in the module about SAT scores, explain how much variability there was in the four students' test scores on the math subscale as presented in the case.
4. Assume that another student received a score of 700 on the
math subscale. What would be the equivalent stanine score? What
would be the equivalent z-score?
5. Assume that Trevor takes the SAT again next month and receives a score of 800 on the math subscale. What does the difference in his two scores indicate about the quality of the test scores? Based on the information presented in the case, what might account for the difference in Trevor's scores over such a short length of time?