MODULE 29
Standardized Tests and Scores

Outline

Types of Standardized Tests
  Categories of Standardized Tests
  Criterion-referenced and Norm-referenced Tests
Basic Concepts of Measurement
  Central Tendency and Variability
  Normal Distribution
Types of Test Scores
  Raw Scores
  Percentile Scores
  Grade-equivalent Scores
  Standard Scores
Characteristics of "Good" Tests
  Validity
  Reliability
Summary
Key Concepts
Case Studies: Reflect and Evaluate

Learning Goals

1. Describe the purpose of four broad categories of standardized tests and how standardized tests are used by teachers.
2. Explain the difference between criterion-referenced and norm-referenced tests.
3. Explain the basic properties of a normal distribution.
4. Describe four types of test scores, and explain the advantages and limitations of each type of score.
5. Explain why validity and reliability are two important qualities of tests and why teachers need this information about tests to interpret test scores.
TYPES OF STANDARDIZED TESTS
How many standardized tests have you taken in your life? Can you remember why you took them? Perhaps you took
the SAT or ACT to apply for college, or the Praxis I exam to be admitted to your undergraduate education program.
Before we consider why educators at all levels use standardized tests, we first need to define exactly what makes a
test standardized. Standardized tests are distinguished by two qualities (Gregory, 2007; Haladyna, 2002):
1. They are created by testing experts at test publishing companies.
2. All students are given the test by a trained examiner under the same (hence "standardized") conditions. For
example, all students are given the same directions, test items, time limits, and scoring procedures.
You’re probably very familiar with tests that are not standardized, such as the classroom tests you have taken
since elementary school. Classroom tests, often created by individual teachers, measure specific learning that occurs
within a classroom and typically focus on the district’s curriculum. Teachers may use classroom tests as a formative
or a summative assessment of students’ knowledge (Linn & Gronlund, 2000). Assessment includes any and all
procedures used to collect information and to make inferences or judgments about an individual or a program
(Reynolds, Livingston, & Willson, 2006). Formative assessments, such as homework assignments and quizzes,
enable teachers to plan for instruction and monitor student progress throughout a grading period. To assess student
achievement at the end of an instructional unit or grading period, teachers use summative assessments such as tests
and cumulative projects.
Like some classroom tests, standardized tests typically are used for summative assessments, but they focus on
broader areas of learning such as overall mathematical achievement rather than mathematical progress within a
grading period. For a summary of the differences between classroom and standardized tests, see Table 29.1.
Classroom assessment: See page 467.
Classroom tests: See page 481.
TABLE 29.1  Comparison of Classroom Tests and Standardized Tests

Purpose. Classroom test: formative and summative. Standardized test: typically summative.
Content. Classroom test: specific to content covered in the classroom over a specific time frame. Standardized test: specific or general topics across many districts or states.
Source of items. Classroom test: created or written by the classroom teacher. Standardized test: created by a panel of professional experts.
Administration procedures. Classroom test: can be flexible for students with disabilities and special needs. Standardized test: standardized across all settings and individuals.
Length. Classroom test: usually short, less than an hour. Standardized test: usually very long, several hours.
Scoring procedures. Classroom test: typically teacher-scored. Standardized test: typically machine-scored.
Reliability. Classroom test: typically low. Standardized test: typically high.
Scores. Classroom test: individual's number or percent correct (raw score). Standardized test: compared to predetermined criteria or norm group (converted from raw score).
Grading. Classroom test: used to assign a course grade. Standardized test: used to determine general ability or achievement; not used to assign a course grade.

Source: Haladyna, 2002.
What standardized tests do you remember taking in elementary through high school?
What purpose do you think they served? Think about these tests as you read about
the categories of standardized tests.
Categories of Standardized Tests
Standardized tests have several purposes. Some standardized tests—called single-subject survey tests—contain
several subtests that assess one domain-specific content area, such as mathematics. Other standardized tests contain
a battery of several tests used in conjunction with one another to provide a broader, more general picture of
performance that may include competencies such as vocabulary, spelling, reading comprehension, mathematics
computation, and mathematics problem solving. Standardized tests fall into one of four broad categories based on
their purpose (Chatterji, 2003), as described here and summarized in Table 29.2.
1. Standardized achievement tests assess current knowledge, which can include learning outcomes and skills either in general or in a specific domain. Standardized achievement tests do not necessarily match the curriculum of any particular state or school district. Instead, they are used to identify the strengths and weaknesses of individual students as well as school districts. Readiness tests, like achievement tests, measure young children's current level of skill in various academic (reading, math, vocabulary) and nonacademic (motor skills, social skills) domains and are used to make placement and curricular decisions in the early elementary grades.

TABLE 29.2  Examples of Standardized Tests in Four Broad Areas

Standardized achievement tests
  Iowa Test of Basic Skills (ITBS): Battery of achievement tests for grades K–8
  Tests of Achievement and Proficiency (TAP): Battery of achievement tests for grades 9–12
  Metropolitan Achievement Test (MAT): Battery of achievement tests for grades K–12

Standardized aptitude tests
  Differential Aptitude Test (DAT): Battery of tests to predict educational goals for students in grades 7–12
  Scholastic Assessment Test (SAT): Single test to predict academic performance in college
  General Aptitude Test Battery (GATB): Battery of aptitude tests used to predict job performance
  Armed Services Vocational Aptitude Battery (ASVAB): Battery of aptitude tests used to assign armed service personnel to jobs and training programs

Career or educational interest inventories
  Strong Interest Inventory (SII): Instrument used with high school and college students to identify occupational preferences
  Kuder General Interest Survey (KGIS): Instrument used with students in grades 6–12 to determine preferences in broad areas of interest

Personality tests
  NEO Five-Factor Inventory (NEO-FFI): Instrument to assess an individual on five theoretically derived dimensions of personality
  Minnesota Multiphasic Personality Inventory-2 (MMPI-2): Instrument used to aid in clinical diagnosis

Source: Chatterji, 2003.
2. Standardized aptitude tests assess future potential—or
capacity to learn—in general or in a specific domain. Aptitude
tests are used for admission or selection purposes to place students
in particular schools (e.g., private schools or colleges) or specific
classrooms or courses (e.g., advanced mathematics). Standardized
intelligence tests are considered aptitude tests because their
purpose is to predict achievement in school.
3. Career or educational interest inventories assess individual
preferences for certain types of activities. These inventories
typically are used to assist high school and college students in
planning their postsecondary education, as well as to assist
companies and corporations in selecting employees. Some of these
tests are also considered aptitude tests because they may predict
future success.
4. Personality tests assess an individual’s characteristics, such as
interests, attitudes, values, and patterns of behavior. Personality
tests are limited in their educational use because psychologists and
counselors with graduate-level training primarily use them for
diagnosis of clinical disorders and because most personality tests
are appropriate only for individuals age 18 or older.
Most standardized tests administered by teachers are given in a
group format. Group-administered tests are relatively easy to
administer and score, making them cost-effective. Individually
administered tests, such as personality tests and IQ tests, require
expert training, time to administer, and time to score and interpret, all
of which lead to greater cost. Although teachers typically are not
trained to administer these individual tests, they will encounter the
test scores of individually administered tests in meetings to determine
the eligibility of students for special education and related services.
Criterion-referenced and Norm-referenced Tests
The interpretation of test scores includes understanding not only what
the test measures—general versus specific knowledge or current
knowledge versus future potential—but also how test scores should
be evaluated. A test score is a measurement: a quantitative or descriptive value assigned during the process of assessment. But a test score by itself cannot be interpreted apart from how it is evaluated. Evaluation is the subjective judgment or interpretation of a measurement or test score (Haladyna, 2002). For example, a student might take a test and answer 20 of 30 questions correctly (measurement), but whether that score is interpreted as a "good" score, an "improvement" from a previous score, or "substantially below" the expected score is a matter of evaluation. Standardized
tests typically are designed so that any test score can be evaluated by
comparing it either to a specific standard (criterion) or to data
compiled from the test scores of many similar individuals (norm).
• Criterion-referenced tests compare an individual's score to a
preset criterion, or standard of performance, for a learning
objective. Many times criterion-referenced tests are used to test
mastery of specific skills or educational goals in order to provide
information about what an individual does and does not know. On
criterion-referenced tests, test developers include test items based
on their relevance to specific academic skills and curricula. The
criteria are chosen because, together, they represent a level of
expert knowledge. Lawyers, doctors, nurses, and teachers must
take standardized tests and meet a specified criterion in order to
become licensed or certified for their profession. Some tests that
students take—such as state mastery tests—are also
criterion-referenced tests.
• Norm-referenced tests compare the individual test-taker's
performance to the performance of a group of similar test-takers,
called the norm sample. A norm sample is a large group of
individuals who represent the population of interest on
characteristics such as gender, age, race, and socioeconomic status.
For example, a norm sample for a standardized test can be all fifth
graders nationally, all fifth graders in a state, or all fifth graders in
a district. Norm samples for nationally used standardized tests,
such as the achievement tests listed in Table 29.2, need to be large
(about 100,000 test-takers) and representative of the population of
students in order to allow accurate interpretations. The test items
on norm-referenced tests are designed to differentiate, to the
greatest degree possible, between individual test-takers. For
example, a norm-referenced mathematics
achievement test might be used to select the top elementary school students in a school district for a gifted program with limited seats or space.

Intelligence measured as IQ: See page 400.

TABLE 29.3  Comparison of Criterion-referenced and Norm-referenced Tests

Purpose. Criterion-referenced: to determine mastery at a specified level. Norm-referenced: to compare a score to the performance of similar test-takers.
Content. Criterion-referenced: specific to a domain or content area. Norm-referenced: broad domain or content area.
Item selection. Criterion-referenced: similar level of difficulty. Norm-referenced: wide variability in difficulty level.
Scores. Criterion-referenced: number or percent correct as compared to criteria. Norm-referenced: standard score, percentile, or grade-equivalent score as compared to the norm group.

Source: Gregory, 2007.
The major difference between the two types of tests is the purpose
or situation for which each type of test is most useful, as summarized
in Table 29.3. Many standardized group-administered achievement
tests provide teachers with both a criterion-referenced and a
norm-referenced interpretation of scores, as shown in Figure 29.1.
When dual interpretation is not available, the type of test that is used
will depend on the purpose. Criterion-referenced tests provide
information about mastery of material but do not allow comparisons
among test-takers. In contrast, norm-referenced tests do not provide
information about the mastery or the strengths and weaknesses of a
particular individual, but they do provide ample information for
comparing test scores across individuals using several basic concepts
of measurement, as discussed next.
BASIC CONCEPTS OF MEASUREMENT
In order to interpret test scores accurately, teachers must understand
some basic concepts of measurement that are used in conjunction
with one another to evaluate individual students as well as to evaluate
groups of students, such as classrooms or school districts.
Central Tendency and Variability
One basic measure needed to form evaluations or make comparisons
is central tendency—the score that is typical or representative of the
entire group. Let’s examine a set of classroom or standardized test
scores. Suppose you teach a class of 11 students who receive these
scores on their first exam: 63, 65, 72, 75, 76, 78, 83, 87, 87, 92, and
98. What measure will tell you the central tendency of this set of
numbers? The three most common statistical descriptions of central tendency are the mean, median, and mode, each computed in the brief sketch that follows this list:
1. Mean: Divide the sum of all the scores by the total number of
scores to find the mean, or simple average. Summing the 11 scores
(sum = 876) and dividing by 11 gives a mean of 79.64.
2. Median: Find the middle score in a series of scores listed from
smallest to largest. In this case, the median is 78, the middle value,
because five scores are on either side. In a group with an even
number of scores, the median is the midpoint, or average, of the two
middle scores.
3. Mode: Find the most frequently occurring score in the group. In
this group, the mode is 87, the only score that occurs more than
once. A group of scores can be bimodal—having two modes—when two different scores occur equally often and more often than any other score.
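For readers who want to check the arithmetic, here is a brief Python sketch (illustrative only, using the standard library) that computes all three measures for the 11 exam scores above.

    from statistics import mean, median, multimode

    scores = [63, 65, 72, 75, 76, 78, 83, 87, 87, 92, 98]

    print(mean(scores))       # 79.636..., the simple average (79.64 when rounded)
    print(median(scores))     # 78, the middle value of the ordered scores
    print(multimode(scores))  # [87], the most frequently occurring score(s)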
The mean, median, and mode all provide information about the
typical score within a group but do not provide information about the
variability—how widely the scores are distributed, or spread out,
within a particular group. Compare these two groups of test scores:
Figure 29.1: A Simple Standardized Test Score Report. Many achievement tests provide two kinds of scores. Criterion-referenced scores, listed here under "Performance on Objectives," measure a student's mastery of material. Norm-referenced scores, at the bottom of the report, allow some comparisons among test-takers.
[Figure 29.1 reproduces a simulated TerraNova, Third Edition, Complete Battery Individual Profile for a fourth-grade student. The "Performance on Objectives" section reports a criterion-referenced Objectives Performance Index (OPI) for each objective in Reading, Language, Mathematics, Science, and Social Studies, with low, moderate, and high mastery ranges; the OPI estimates the number of items the student could be expected to answer correctly if there had been 100 items for that objective. The "Norm-Referenced Scores" section at the bottom reports scale scores, national percentiles, and national stanines for each content area and for the total score.]
Figure 29.2: Normal Distributions with Large (orange) and Small (blue) Standard Deviations. With a small standard deviation, most scores are close to the mean score of the group; with a large standard deviation, scores are more spread out.
Class 1 scores: 6, 7, 7, 8, 8 Class 2 scores: 4, 7, 7, 8, 10
Both classes have a mean test score of 7.2, but the scores in the second class
show considerably more variation. The range is a simple measure of variability
calculated as the difference between the highest and lowest scores. For Class 1,
the range is 2 (8 minus 6), while for Class 2, the range is 6 (10 minus 4).
The most commonly used measure of variability among scores, the standard deviation (SD), describes how far scores typically fall from the mean of the group. The standard deviation is more difficult to compute than the range: it equals the square root of the average squared deviation of the scores from the mean. This sounds more complex than it is. The computation is less
important than understanding the standard deviation for test score interpretation. Figure 29.2 shows the difference in
variability for two groups of scores with small and large standard deviations.
• A small standard deviation indicates that most scores are close to the mean score of the group.
For a classroom test, the teacher might hope that all students would score well and close to one another,
indicating that all students are mastering the course objectives.
• A large standard deviation suggests that the scores are more spread out. On standardized achievement tests, a
large degree of variability is not only typical but optimal, because the test items are designed to make fine
discriminations in achievement among a population of students.
In the example with Class 1 and Class 2, the standard deviations are .84 (Class 1) and 2.17 (Class 2). With a small
set of numbers, the variability may be obvious. However, with large groups of scores, such as the thousands of
scores of students taking a standardized achievement test, the standard deviation provides a precise numerical summary of how widely the scores are distributed.
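As an illustration, the following short Python sketch computes the range and standard deviation for the two classes; the statistics module's stdev uses the sample formula, which matches the values of .84 and 2.17 reported above.

    from statistics import mean, stdev

    class1 = [6, 7, 7, 8, 8]
    class2 = [4, 7, 7, 8, 10]

    for name, scores in (("Class 1", class1), ("Class 2", class2)):
        score_range = max(scores) - min(scores)  # highest score minus lowest score
        print(f"{name}: mean = {mean(scores)}, range = {score_range}, SD = {stdev(scores):.2f}")

    # Class 1: mean = 7.2, range = 2, SD = 0.84
    # Class 2: mean = 7.2, range = 6, SD = 2.17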
Normal Distribution
A frequency distribution is a summary of how often each score occurs in a group. The scores can be depicted visually in a
histogram, or bar graph. For example, Figure 29.3 depicts the final grades in an educational psychology course.
Final grades are indicated along the horizontal axis (x-axis), and the number of students receiving each final grade is
indicated along the vertical axis (y-axis). As Figure 29.3 shows, 7 students failed the course, 17 students received a
grade of D, 45 students received a C, 37 students received a B, and 15 students received an A. In this figure more
scores fall to the right (higher scores) and fewer scores fall to the left (lower scores) of the midpoint, indicating that
the scores are skewed. Skewness, or the symmetry or asymmetry of a frequency distribution, tells how a test is
working. Negatively skewed distributions (with long tails to the left), such as in Figure 29.3, indicate that the scores
are piled up at the high end. Positively skewed distributions indicate that the scores are piled up at the low end (long
tail to the right).
Classroom tests with negative skewness are what teachers hope to achieve (i.e., mastery by most students). In
standardized testing, positively skewed distributions suggest that the test had too many difficult questions, and
negatively skewed distributions suggest that the test had too many easy items.
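The direction of skew can also be checked numerically. The sketch below is illustrative only; it codes the letter grades 0 through 4 for F through A, an assumption made just for this example.

    from statistics import mean, pstdev

    # Final grades from Figure 29.3, coded numerically: F = 0, D = 1, C = 2, B = 3, A = 4
    counts = {0: 7, 1: 17, 2: 45, 3: 37, 4: 15}
    grades = [g for g, n in counts.items() for _ in range(n)]

    m, sd = mean(grades), pstdev(grades)
    skew = mean([((g - m) / sd) ** 3 for g in grades])  # third standardized moment

    print(round(skew, 2))  # about -0.27: negative, so scores pile up at the high end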
For standardized tests, we expect a frequency distribution that is symmetrical and bell-shaped, called a normal distribution (see Figure 29.4). Normal distributions are apparent in scores on the SAT and on IQ tests (see Figures 29.4b and 29.4c). A normal distribution has several properties, illustrated by the brief simulation that follows the list below:
Figure 29.3: Histogram of Final Grades in an Educational Psychology Course. Final grades are indicated along the horizontal
axis (x-axis), and the number of students receiving each
final grade is indicated along the vertical axis (y-axis).
Figure 29.4: Normal Distribution Curves. For standardized tests, a symmetrical, bell-shaped frequency distribution is considered normal. Normal distributions are apparent in scores on SAT and IQ tests.
[Figure 29.4a shows the percentage of cases under portions of the normal curve (0.13%, 2.14%, 13.59%, 34.13%, 34.13%, 13.59%, 2.14%, and 0.13% across the standard deviations), together with the corresponding cumulative percentages, percentile equivalents, z-scores (-4.0 to +4.0), T-scores (20 to 80), and stanines (1 to 9, containing 4%, 7%, 12%, 17%, 20%, 17%, 12%, 7%, and 4% of scores). Figure 29.4b shows the same curve scaled to SAT scores from 200 to 800; Figure 29.4c shows it scaled to IQ scores from 55 to 145.]
• The mean, median, and mode are equal and appear at the midpoint of the distribution, indicating that half the scores are above the mean and half the scores are below the mean.
• Approximately 68% of scores occur within one standard deviation above and below the mean.
• Two standard deviations above and below the mean include approximately 95% of scores.
• Three standard deviations above and below the mean include approximately 99% of scores.
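These percentages can be verified with a brief simulation (illustrative only). The sketch below draws a large sample from a normal distribution with a mean of 100 and a standard deviation of 15, the IQ metric discussed later in this module, and counts the scores falling within one, two, and three standard deviations of the mean.

    import random

    random.seed(1)
    mu, sd, n = 100, 15, 100_000
    scores = [random.gauss(mu, sd) for _ in range(n)]

    for k in (1, 2, 3):
        within = sum(1 for s in scores if abs(s - mu) <= k * sd)
        print(f"within {k} SD of the mean: {within / n:.1%}")  # about 68%, 95%, and 99.7%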
TYPES OF TEST SCORES
Raw Scores
Both classroom and standardized tests first yield a raw score, which typically is the number or percentage of correct
answers. For evaluating the results of classroom tests, raw scores typically are used. For standardized
criterion-referenced tests, raw scores are compared to the preset criterion for interpretation (e.g., pass/fail,
mastery/nonmastery). For standardized norm-referenced tests, raw scores
more commonly are converted or transformed by the test developers
to help provide consistent evaluation and ease of interpretation of
scores by parents and teachers. Next we consider several common
norm-referenced test scores.
Percentile Scores
Percentile scores (or ranks) are derived by listing all raw scores from
highest to lowest and providing information on the percentage of
test-takers in the norm sample who scored below or equal to that raw
score. For example, a percentile score of 80 means that the test-taker
scored as well as or better than 80% of all test-takers in the norm
sample. Be careful not to confuse the percentage of correct answers
on a test with the percentile score, which compares individual scores
among the norm sample. For example, an individual could correctly
answer 65 of 100 questions (65%) on a test, but the percentile ranking
of that score would depend on the performance of other test-takers. If
a raw score of 65 is the mean in a normal distribution of scores (the
middle value in the bell curve), then the raw score of 65 would have a
percentile score of 50, meaning that 50% of the norm group scored
below or equal to 65.
One problem with percentile scores is that they are not equally
distributed across the normal curve. A small difference in raw scores
in the middle of a distribution of scores can result in a large percentile
score difference, while at the extremes of the distribution (the upper
and lower tails) larger raw score differences between students are
needed to increase the percentile ranking. This means that percentile
scores overestimate differences in the middle of the normal curve and
underestimate differences at either end of the normal distribution.
As an example, take another look at Figure 29.4b, the normal
distribution of SAT scores for each subscale. Assume the following
percentile scores:
Student A received a score of 500 → 50 percentile
Student B received a score of 600 → 84.1 percentile
Student C received a score of 700 → 97.7 percentile
Student D received a score of 800 → 99.9 percentile
Based on percentile scores, student B appears to have markedly
outperformed student A (percentile score of 84.1 compared to 50),
and students C and D appear to have performed very similarly
(percentile scores of 97.7 and 99.9). In actuality, the difference in
performance between students is exactly the same (100 points).
Hence, comparisons should not be made between two students’
percentile scores. The interpretation of percentile scores should only
involve comparing one student’s score to the performance of the
entire norm group (e.g., student A performed better than or equal to
50% of all test-takers).
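Under the normal-curve assumption used in the SAT example above, these percentile scores can be reproduced with Python's standard library. This is a sketch only; the mean of 500 and standard deviation of 100 are the SAT subscale values described earlier.

    from statistics import NormalDist

    sat = NormalDist(mu=500, sigma=100)  # SAT subscale: mean 500, SD 100

    for score in (500, 600, 700, 800):
        percentile = sat.cdf(score) * 100  # percent of the norm group scoring at or below this score
        print(score, round(percentile, 1))

    # 500 -> 50.0, 600 -> 84.1, 700 -> 97.7, 800 -> 99.9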
Grade-equivalent Scores
Grade-equivalent scores are based on the median score for a
particular grade level of the norm group. For example, if the median
score for all beginning sixth graders in a norm group taking a
standardized achievement test is 100, then all students scoring 100 are
considered to have a grade-equivalent (GE) score of 6.0, or beginning
of sixth grade. The 6 denotes grade level, and the 0 denotes the
beginning of the school year. Because there are 10 months in a school
year, the decimal represents the month in the school year. Suppose
the median score for all sixth graders in the seventh month of the
school year is 120. Then a student earning a score of 120 would have
a GE score of 6.7.
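Because a GE score is essentially a lookup against the norm group's median scores, the conversion can be sketched in a few lines. The sketch below is illustrative only: it uses the two median values from the example above and assumes simple linear interpolation between them, which an actual test publisher may or may not use.

    # Norm-group median raw scores at two points in sixth grade (from the example above)
    norm_medians = {6.0: 100, 6.7: 120}

    def grade_equivalent(raw_score):
        # Interpolate a GE score between the two known norm-group medians
        (ge_lo, med_lo), (ge_hi, med_hi) = sorted(norm_medians.items())
        fraction = (raw_score - med_lo) / (med_hi - med_lo)
        return round(ge_lo + fraction * (ge_hi - ge_lo), 1)

    print(grade_equivalent(100))  # 6.0
    print(grade_equivalent(120))  # 6.7
    print(grade_equivalent(110))  # 6.3, roughly midway between the two medians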
GE scores often are misused because individuals assume they are
mathematical statistics for interpreting students’ performance.
However, GE scores function more like labels—they cannot be
added, subtracted, multiplied, or divided. Because each GE score for
a test (or a subtest of a test battery) is derived from the median raw
score of a norm group, median raw scores will vary from year to year,
from test to test, and from subtest to subtest within the same test
battery. Therefore, GE scores cannot be used to compare students’
improvement from year to year, students’ relative strengths and
weaknesses from test to test, or even students’ scores on subtests of a
standardized test. GE scores can be used only to describe whether a
student is performing above grade level, at grade level, or below
grade level.
There is a risk of misinterpreting grade-equivalent scores. A
person might conclude that a student who scores above his or her
actual grade level is able to be successful in an advanced curriculum
or that a student who scores below grade level should be held back a
grade. Suppose a second-grade student has a GE score of 5.2 on a
reading achievement test. We would say that the second-grade
student scored as would an average fifth-grade student in the second
month of school, if the fifth-grade student took a reading test
appropriate for second graders. In other words, all we can say is that
the second grader is above average for his or her grade in reading
achievement, not that the second grader "reads on a fifth-grade level."
The misinterpretation of GE scores stems from two problems:
1. The computation of GE scores does not use any information
about the variability in scores within a distribution. The median
score of all beginning sixth graders may be 100, but there is great
variability in scores among the students. Not all beginning sixth
graders will score 100. An expectation by the school district,
teacher, or government that all students should reach that score is
unrealistic (Anastasi & Urbina, 1997).
2. The variability in GE scores increases as grade level increases,
with students at lower grade levels relatively homogeneous in
performance and students at middle and high school levels
showing a wide variation (Anastasi & Urbina, 1997). So a
first-grade student who scores one year below grade level may be
substantially below peers in achievement, while a ninth-grade
student who scores one year below grade level actually may be
average in achievement.
Because of the likely misinterpretation and misuse of
grade-equivalent scores, most educators and psychologists do not
recommend using them.
Standard Scores
Standard scores are used to simplify score differences and, in some
instances, more accurately describe them than can either percentiles
or grade-equivalent scores. For example, standard scores are used in
interpreting IQ tests and the SAT by converting the raw scores using
the mean and standard deviation. When the scores are converted, the
mean for IQ tests is 100 and the standard deviation is 15, while for
each subscale of the SAT the mean score is approximately 500 and
the standard deviation is 100.
A common standard score is calculated by using the mean and the
standard deviation (SD) to convert raw scores into z-scores. When
raw scores are converted into z-scores using the simple formula
z = (raw score – mean) / SD
the z-score distribution always has a mean score of 0 and a standard
deviation of 1, leading to z-scores that range from –4.00 to +4.00, as
shown in Figure 29.4a. Because z-scores are based on units of
standard deviations, comparisons across student scores are more
precise than with percentiles or grade-equivalent scores and are less
likely to be misinterpreted.
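A minimal sketch (illustrative only) applies this formula to the two scales just mentioned, showing how scores from different tests land on the same z-score metric.

    def z_score(raw, mean, sd):
        # How many standard deviations the raw score lies above or below the mean
        return (raw - mean) / sd

    print(z_score(600, mean=500, sd=100))  # 1.0: one SD above the mean on an SAT subscale
    print(z_score(85, mean=100, sd=15))    # -1.0: one SD below the mean on an IQ scale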
Because negative numbers can create some concern and
confusion—and also have a negative connotation when interpreting
students’ ability or achievement—another common standardized
score is the T-score. The T-score is also based on the number of
standard deviations, but it has a mean of 50 and a standard deviation
of 10 (z-scores are multiplied by 10, and then 50 is added to the
product). Again, looking at Figure 29.4a, we see that a T-score of 60
represents one standard deviation above the mean (+1.00 z-score) and
a T-score of 40 represents one standard deviation below the mean
(–1.00 z-score). T-scores typically are not used with standardized
achievement or aptitude tests, but they are commonly used with
personality tests and behavioral instruments, particularly those that
assist in the diagnosis of disorders (Gregory, 2007).
Stanine scores are based on percentile rank but convert raw scores
to a single-digit system from 1 to 9 that can be easily interpreted
using the normal curve, as shown in Figure 29.4a (Gregory, 2007).
The statistical mean is always 5 and comprises the middle 20 percent
of scores, although scores of 4, 5, and 6 are all interpreted or
evaluated as average. Stanines of 1, 2, and 3 are considered below
average, and stanines of 7, 8, and 9 are considered above average.
Because the scores are based on percentile rank, they do not provide
better comparisons across scores than do z-scores and T-scores.
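Both T-scores and stanines can be derived from a z-score with simple transformations. The sketch below is illustrative; the stanine line uses the common half-standard-deviation approximation rather than an exact percentile-band lookup.

    def t_score(z):
        # T-score: multiply the z-score by 10 and add 50 (mean 50, SD 10)
        return 10 * z + 50

    def stanine(z):
        # Approximate stanine: bands half a standard deviation wide, centered on 5
        return max(1, min(9, round(2 * z + 5)))

    for z in (-1.0, 0.0, 1.0, 2.5):
        print(f"z = {z:+.1f}  T = {t_score(z):.0f}  stanine = {stanine(z)}")

    # z = -1.0 gives T = 40 and stanine 3; z = +1.0 gives T = 60 and stanine 7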
Teachers typically are asked to interpret standardized test scores
for parents. Given a choice, which type of test score would you
prefer to use in providing your interpretation and why? How
comfortable would you be explaining the various other test
scores?
CHARACTERISTICS OF "GOOD" TESTS
Several characteristics of tests and test scores are important for
appropriate test-score interpretation, including:
• standardized test administration (as mentioned at the beginning of this module),
• large and representative norm samples for norm-referenced tests, and
• the use of standard scores when interpreting performance.
Teachers should evaluate two additional characteristics before
selecting tests to use or interpreting test scores: validity and
reliability. Without adequate evidence that test scores are reliable and
valid, test-score interpretations are meaningless. Let’s explore each
concept in more detail.
Validity
How do we know that a test score accurately reflects what the test is
intended to measure? To answer this question, we need to be able to
evaluate the validity of the test score. Validity is the extent to which
an assessment actually measures what it is intended to measure,
yielding accurate and meaningful interpretations from the test score.
Keep in mind that validity refers to the test score, not the test itself
(the collection of items in a test booklet). Consider these examples:
• Just because a test score is intended to predict intelligence, such as an IQ score of 120, does not mean that it fulfills that purpose.
• A test may be valid for most individuals, but the test score might
be invalid for a particular individual. For example, a standardized
achievement test score would not be valid for a student who takes
the test without wearing his or her prescription eyeglasses. Similarly,
a non–English speaking student taking a test written in English is
unlikely to receive an accurate interpretation of his or her
achievement in a particular subject area based on the test score
(Haladyna, 2002).
Validity is not an all-or-none characteristic (valid or invalid), and it
can never be proved. Rather, it varies depending on the extent of the
research evidence supporting a test score’s validity. All validity is
considered construct validity, or the degree to which an
unobservable, intangible quality or characteristic (construct) is
measured accurately. The construct validity of a test score can be
supported by several types of evidence:
1. Content validity evidence provides information about the extent
to which the test items accurately represent all possible items for
assessing the variable of interest (Reynolds et al., 2006). For
example, do the 50 items on a standardized eighth-grade math
achievement test adequately represent the content of eighth-grade
mathematics? The issue of content validity is also relevant to
classroom tests because most teachers choose a subset of questions
from a pool of possible questions they could ask to represent the
knowledge base for a particular learning goal.
2. Criterion-related validity evidence shows that the test score is
related to some criterion—an outcome thought to measure the
variable of interest (Reynolds et al., 2006). For example, aptitude
tests used to predict college success should be related to subsequent
GPA in college, an outcome measure related to a student’s general
aptitude (Gregory, 2007). Two types of criterion-related validity are:
• concurrent validity evidence, based on the test score and another
criterion assessed at approximately the same time, such as a math
achievement test score and the student’s current grade in math; and
• predictive validity evidence, based on the test score and another
criterion assessed in the future, such as an aptitude test and later
college GPA.
3. Convergent validity evidence shows whether the test score is
related to another measure of the construct. For example, a new test
designed to measure intelligence should be correlated with a score
on an established intelligence test.
4. Discriminant validity evidence demonstrates that a test score is
not related to another test score that assesses a different construct.
For example, a reading test would not be expected to correlate with a
test of mental rotations or spatial abilities.
5. Theory-based validity evidence provides information that the
test scores are consistent with a theoretical aspect of the construct
(e.g., older students score higher than younger students on an
achievement test).
Use of standardized tests with non–English speaking students: See page 545.
Validity of classroom assessments: See page 470.
Reliability
If a standardized aptitude test is given to a student on Monday and
again on Friday, would you expect the test scores to be different,
similar, or exactly the same? We would expect both test scores to be
similar, because it would be highly improbable for the student to
receive the exact same score twice or to have two wildly divergent
scores. This consistency, called the reliability of the test score or
measurement, is measured on a continuum from high to low. A
reliability index can be computed in a number of ways depending on
the type of test and the test-scoring procedures. For example,
administering the same aptitude test on Monday and Friday is a type
of reliability procedure called test-retest. The computed relationship
between the test and retest scores provides a reliability index. All
reliability indexes, or reliability coefficients, range from 0 to 1, with
higher numbers indicating higher reliability (Haladyna, 2002):
• .90 or above is considered highly reliable,
• between .80 and .90 is considered good, and
• below .80 is considered questionable.
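As an illustration of how a test-retest reliability coefficient is obtained, the sketch below correlates two administrations of the same test. The scores are hypothetical and exist only for this example; real reliability studies use large samples and published procedures.

    from statistics import correlation  # Pearson r; available in Python 3.10 and later

    monday = [78, 85, 62, 90, 71, 88, 67, 95]  # hypothetical first administration
    friday = [80, 83, 65, 92, 70, 85, 70, 93]  # hypothetical retest for the same students

    r = correlation(monday, friday)
    print(round(r, 2))  # about 0.98: the two administrations rank students very consistently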
To better understand reliability, let’s consider another form of
measurement—your bathroom scale. Have you ever stepped on the
scale to weigh yourself, read the number, and then thought ―That
can’t be right‖? You step right back onto the same scale, and a
slightly different weight registers (maybe one you prefer, maybe not).
The difference in the weights is due to measurement error.
Measurement error is the accumulation of imperfections that are
found in all measurements. Test scores, like all other measurements,
are an imperfect type of measurement. Measurement error on tests
can result from a number of sources (Gregory, 2007; Reynolds et al.,
2006):
• item selection (e.g., clarity in the wording of questions),
• test administration (e.g., a test administrator who has a harsh tone of voice and increases student anxiety),
• individual factors (e.g., anxiety, illness, fatigue), or
• test scoring (e.g., subjective, judgment-based assessments such as essays).
Even though these sources of measurement error are unpredictable,
developers of standardized tests estimate the amount of error expected
on a given test, called the standard error of measurement (SEM)
(also called the margin of error in public surveys such as political
polls). The statistical calculation for determining the standard error of
measurement is based on the reliability coefficient of the test and the
standard deviation of test scores from the scores of a norm group.
This calculation is not as important as how SEM can be used to
interpret test scores. With an individual test score, SEM can help
determine the confidence interval, or the range in which the
individual’s true score (i.e., true ability) lies. Consider this example:
A student receives a raw score of 25 on a standardized
achievement test with SEM 4.
If we calculate a 68% confidence interval (the raw score plus or
minus the SEM), the student’s score range is 21 to 29.
We can say with 68% confidence that the student’s true score is
between 21 and 29 on this standardized achievement test.
We have used a 68% confidence interval with a raw score as a
simple explanation of how SEM helps determine a student’s true
score. However, remember that most standardized tests results will
report a 95% or 99% confidence interval and that the confidence
intervals will use standard scores (z-scores, T-scores, stanines, etc.)
rather than raw scores. Many psychologists and test developers
recommend using confidence intervals to remind professionals,
parents, and researchers that measurement error is present in all test
scores (Gregory, 2007).
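A short sketch (illustrative only) shows both pieces: the SEM computed from a reliability coefficient and a score standard deviation using the standard formula SEM = SD x sqrt(1 - reliability), and the confidence interval built around an obtained score. The 1.96 multiplier for the 95% interval comes from the normal curve.

    import math

    def sem(sd, reliability):
        # Standard error of measurement from the score SD and the reliability coefficient
        return sd * math.sqrt(1 - reliability)

    def confidence_interval(obtained, sem_value, z=1.0):
        # Range likely to contain the true score: z = 1.0 gives ~68%, z = 1.96 gives ~95%
        return obtained - z * sem_value, obtained + z * sem_value

    print(confidence_interval(25, 4))              # (21.0, 29.0): the 68% interval from the example
    print(confidence_interval(25, 4, z=1.96))      # about (17.2, 32.8): a 95% interval
    print(round(sem(sd=15, reliability=0.90), 1))  # 4.7: SEM for an IQ-type scale with reliability .90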
Reliability of classroom assessments: See page 483.
Measurement Error. All measurements, including weights on bathroom
scales and standardized test scores, have imperfections.
One note of caution is needed regarding the relationship between validity and
reliability. Say, for example, that your bathroom scale is very reliable but you later discover
that its measurement is consistently off by 10 pounds. This shows that consistent results
(reliability) can be found with a measure that does not accurately assess the construct of
interest (validity). In short, reliability does not lead to validity. Reliability is necessary—any
test that is valid also must measure the construct of interest consistently—but it is not
sufficient for achieving validity.
A bathroom scale can consistently read 350 pounds for a person who actually weighs 110
pounds. In this case also, the scale is reliable but not valid. If a standardized aptitude test
accurately predicts success in college (validity), the results should be consistent (reliability)
across multiple testing sessions, such as early or late in twelfth grade. If the test results lack
reliability, validity also is undermined.
Teachers need to evaluate the validity and reliability of a test before attempting to make
test score interpretations. The publishers of most standardized tests report the validity evidence and reliability coefficients that teachers and school districts use to determine which tests are "good."
Assume that your school district is using a highly reputable standardized test with a reliability coefficient above .90. One of your students performs well, as you would expect, at the beginning of the year but then performs very poorly on the same test at the end of the year. Is this student's test score valid? What might affect the reliability and validity of this student's test score?
Measurement Error Is Present in All Test Scores. The standard error of measurement (SEM), or margin of error, provides information used to determine a confidence interval or range in which the true score would fall. In this political poll, the President's approval rating is 37% (disapproval 63%) with a margin of error of plus or minus 5%, so the actual approval rating is somewhere between 32% and 42%.
Summary
Describe the purpose of four broad categories of standardized tests and how standardized tests are used by teachers. Standardized achievement tests are used to assess the degree of current knowledge or learning in either broad or domain-specific areas. Standardized aptitude tests assess an individual's future potential to learn in general or in domain-specific areas. Career or educational interest inventories assess preferences related to certain types of activities. Personality tests are used to assess individual characteristics. Teachers are most likely to administer group tests, which are both relatively easy to administer and cost effective. Teachers may encounter individually administered test results when determining special education eligibility.

Explain the difference between criterion-referenced and norm-referenced tests. Although some standardized tests provide both criterion-referenced and norm-referenced interpretations, the type of test or interpretation used is based on the purpose of the test. Criterion-referenced tests provide information about the mastery and the strengths and weaknesses of individual students, such as whether a particular student meets certification requirements. In contrast, norm-referenced tests allow comparisons among student scores that may be used in making decisions such as selecting the top students from a group.

Explain the basic properties of a normal distribution. The normal distribution is a special type of frequency distribution. Although some frequency distributions may be skewed, with more scores falling on the higher or lower end, the normal distribution is bell-shaped and symmetrical. The three central tendencies—mean, median, and mode—are equal to one another and appear at the midpoint of a normal distribution. The variability among scores is standard such that 68% of scores are within one standard deviation of the mean, 95% of scores are within two standard deviations of the mean, and 99% of scores are within three standard deviations of the mean.

Describe four types of test scores, and explain the advantages and limitations of each. (1) Raw scores are the number of correct answers or percentage of correct answers. They provide adequate information for classroom tests but are more difficult to interpret when comparing scores across students or groups of students. (2) Percentile scores are based on the percentage of test-takers who scored below or equal to a student's raw score. Percentile scores provide information about how well an individual performed in comparison to a group but should not be used to compare different students' scores. (3) Grade-equivalent scores represent the median score for particular grade levels, indicating whether a student is scoring at grade level, below grade level, or above grade level. GE scores are commonly misinterpreted; hence, experts do not recommend their use. (4) Standard scores are derived from percentile rank scores by converting them into a single-digit system (i.e., stanines) or from raw scores by converting them into scores based on a specified mean and standard deviation (i.e., z-scores and T-scores). Standard scores typically are used for ease of interpretation, and those based on the mean and standard deviation also allow accurate comparisons among scores.

Explain why validity and reliability are two important qualities of tests and why teachers need this information about tests to interpret test scores. To determine the quality of a test, teachers should evaluate the validity evidence and reliability of test scores. Validity refers to the extent to which a test measures what it is intended to measure. Reliability of test scores refers to the consistency of the measurement, with highly reliable tests having minimal measurement error and low-quality tests having high measurement error. Teachers can use information about validity and reliability evidence to determine the quality of a test and to make decisions about whether the test should be used. The standard error of measurement also can be used to determine a confidence interval, rather than depending on a single raw or standard score. Confidence intervals remind teachers and other professionals that some measurement error is present in all tests—even high-quality tests.
Key Concepts

career or educational interest inventories, central tendency, concurrent validity, confidence interval, construct validity, content validity, convergent validity, criterion-referenced tests, criterion-related validity, discriminant validity, evaluation, frequency distribution, grade-equivalent scores, mean, measurement, measurement error, median, mode, norm-referenced tests, normal distribution, norm sample, percentile scores, personality tests, predictive validity, range, raw score, reliability, skewness, standard deviation (SD), standard error of measurement (SEM), standardized achievement tests, standardized aptitude tests, standardized tests, standard scores, stanine scores, theory-based validity, T-score, validity, variability, z-scores
Case Studies: Reflect and Evaluate

Early Childhood: "Kindergarten Readiness"
These questions refer to the case study on page 516.
1. The BRIGANCE® K & 1 Screen-II measures gross and
fine motor skills, color recognition, knowledge of body
parts, counting, oral comprehension, and many literacy and numeracy skills. In what way is this standardized
readiness test like a standardized achievement test?
2. What basic concepts of measurement are used to create the range Amy refers to when explaining to her
roommate that average typically means a range in scores?
3. Explain why grade-equivalent scores can be confusing to parents like Ms. Jackson. What types of scores could
be used to better compare achievement differences among students?
4. Maria's mother is concerned about how Maria's test score will be interpreted and used. What characteristic of "good tests" is a concern for Maria's mother? Is her concern justified? Why or why not?
5. Suppose Maria's percentile ranking on the BRIGANCE is the 38th percentile. How would you interpret this score in relation to other students? What if another student scored at the 49th percentile? How would you compare the performance of this student to Maria's?
6. Define validity in your own words. Explain whether Maria's readiness test results would be valid if she took the
English version of the BRIGANCE. What if she took the English version with her sister as interpreter?
Elementary School:
"Keyboard Courage"
These questions refer to the case study on page 518.
1. Mr. Whitney mentions the difference between norm-referenced and criterion-referenced tests. Explain whether he
is accurate in his interpretation about how the test scores are used.
2. Based on the normal distribution and the information Mr. Washington provides regarding the test scores being
half a standard deviation below the mean, how poorly are the students doing in comparison to students across the
country?
3. If the test scores had been half a standard deviation above the national mean, would Principal Bowman be as
concerned? Why or why not?
4. Explain what is wrong with Ms. Cong's interpretation of percentile scores.
5. Explain how the average percentile score could have increased from 46 to 48 while average scores fell below the
state cutoff levels.
6. Ms. Rivadeneyra suggests that the test scores do not accurately reflect the students' abilities. What characteristic of good tests is involved here? Explain how this characteristic was influenced by the events near the school last year.
Middle School:
"Teachers Are Cheating?"
These questions refer to the case study on page 520.
1. Why did Mr. Rients take so much time to prepare for the standardized testing session? What might happen if a teacher didn't prepare by reading the instructions ahead of time and noting the time limits?
2. Assume that the national test scores represent the normal distribution. Based on Acting Principal Garrison's reply that the previous year's scores were only half a standard deviation above the mean, how accurate was Mr. Rients's interpretation that the test scores were "way above the national average"? Explain your answer.
3. Lisa uses the percentile score to provide information about how the school's test scores have jumped back and forth over the past two years. Explain why this may not be the best test score to use for comparing annual progress.
4. Assume that several weeks after the testing session Lisa announces that the school's average stanine score for reading was 7. How does this compare to last year's test scores?
5. What characteristic of good tests is Lisa referring to when she says that "test scores shouldn't jump back and forth so drastically"? What characteristic of good tests is she referring to when she adds "at least not if the test is doing its job"? Why is it important for teachers to know about these characteristics?
High School:
"SAT Scores"
These questions refer to the case study on page 522.
1. What type of standardized test is the SAT? Why might a
student score high on an achievement test but not on the SAT?
2. Explain how a norm-referenced test, such as the SAT, can be
used as a criterion-referenced test by colleges and universities for
determining admissions.
3. Based on the information in the module about SAT scores, explain how much variability there was in the four students' test scores on the math subscale as presented in the case.
4. Assume that another student received a score of 700 on the
math subscale. What would be the equivalent stanine score? What
would be the equivalent z-score?
5. Assume that Trevor takes the SAT again next month and receives a score of 800 on the math subscale. What does the difference in his two scores indicate about the quality of the test scores? Based on the information presented in the case, what might account for the difference in Trevor's scores over such a short length of time?