MODULE 27
Test Construction and Use

Outline
Characteristics of High-Quality Classroom Tests
  • Validity
  • Reliability
  • Fairness and Equivalence
  • Practicality
Developing a Test Blueprint
Developing Test Items
  • Alternate-Choice (True/False) Items
  • Matching Exercises
  • Multiple-choice Items
  • Short-answer/Completion Items
  • Essay Tasks
Test Analysis and Revision
Summary
Key Concepts
Case Studies: Reflect and Evaluate

Learning Goals
1. Discuss the importance of validity, reliability, fairness/equivalence, and practicality in test construction.
2. Explain how a test blueprint is used to develop a good test.
3. Discuss the usefulness of each test item format.
4. Compare and contrast the scoring considerations for the five test item formats.
5. Describe the benefits of item analysis and revision.
CHARACTERISTICS OF HIGH-QUALITY CLASSROOM TESTS
Tests
are only one form of assessment that you may find useful in your
classroom. Because teachers use tests quite frequently, you will need
to become comfortable designing and evaluating classroom tests.
Writing good test items takes considerable time and practice.
Researchers have identified principles of high-quality test
construction (Gronlund, 2003; McMillan, 2007; National Research
Council, 2001). Despite this, countless teachers violate these
principles when they develop classroom tests. Poorly constructed
tests compromise the teacher’s ability to obtain an accurate
assessment of students’ knowledge and skills, while high-quality
assessments yield reliable and valid information about a student’s
performance (McMillan, 2007). Let’s examine four facets by which to
judge the quality of tests:
• validity,
• reliability,
• fairness and equivalence, and
• practicality.
Validity
Validity
typically is defined as the degree to which a test measures what it
is intended to measure. Validity is judged in relation to the purpose
for which the test is used. For example, if a test is given to assess
a social studies unit on American government, then the test questions
should focus on information covered in that unit. Teachers can
optimize the validity of a test by evaluating the test’s content
validity and creating an effective layout for the test items.
Content
validity.
Content
validity refers
to evidence that a test accurately represents a content domain—or
reflects what teachers have actually taught (McMillan, 2007). A test
cannot cover everything a student has learned, but it should provide
a representative sample of learning (Weller, 2001). A test would have
low content validity if it covered only a few ideas from lessons and
focused on extraneous information or if it covered one portion of
assigned readings but completely neglected other material that was
heavily emphasized in class. When developing a test, teachers can
consider several content validity questions (Nitko & Brookhart,
2007):
• Do the test questions emphasize the same things that were emphasized in day-to-day instruction?
• Do the test questions cover all levels of instructional objectives included in the lesson(s)?
• Does the weight assigned to each type of question reflect its relative value among all other types?
Test
layout.
The layout, or physical appearance, of a test also can affect the
validity of the test results. For example, imagine taking a test by
reading test items written on the board by the teacher or listening
to questions dictated by the teacher. The teacher’s handwriting or
students’ ability to see the board may hinder test performance,
lowering the validity of the test score. Writing test items on the
board or dictating problems also may put students with auditory or
vision problems at a particular disadvantage. The dictation process
can slow down test taking, making it very difficult for students to
go back to check over their answers a final time before submitting
them. Have you ever taken a test for which the instructions were
unclear? A test-taker’s inability to understand the instructions
could lower the level of performance, reducing the validity of the
test results.
Teachers
can follow these guidelines when deciding how to design a test’s
layout:
• In general, tests should be typed so that each student has a printed copy. Exceptions would be dictated spelling tests or testing of listening and comprehension skills.
• The test should begin with clear directions at the top of the first page or on a cover page. Typical directions include the number and format of items (e.g., multiple-choice, essay), any penalty for guessing, the amount of time allowed for completion of the test, and perhaps mention of test-taking strategies the teacher has emphasized (e.g., "Read each question completely before answering" or "Try to answer every question").
• Test items should be grouped by format (e.g., all multiple-choice items in one section, all true/false items in another section), and testing experts usually recommend arranging like items in order of increasing difficulty. Arranging items from easiest to hardest decreases student anxiety and increases performance (Tippets & Benson, 1989).
(Cross-reference: Validity as it applies to standardized tests, see page 534.)
Reliability. Consistency among scores on a test given twice indicates reliability.

Student:     Claire  Jee  Kristy  Doug  Nick  Taylor  Abby  Marcus
Monday:        87     83    74     77    93     88     68     70
Wednesday:     88     80    73     78    92     88     69     72
Reliability
Reliability
refers to the consistency of test results. If a teacher were to give
a test to students on Monday and then repeat that same test on
Wednesday (without additional instruction or practice in the
interim), students should perform consistently from one day to the
next. Let’s consider several factors that affect test reliability.
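Before turning to those factors, consider the Monday and Wednesday scores in the figure above. One common way to quantify this kind of test-retest consistency is to correlate the two sets of scores; the minimal Python sketch below does so for the eight scores shown, and a value near +1.0 indicates highly consistent performance.

    # Minimal sketch: test-retest consistency as a correlation between two
    # administrations of the same test. Scores are taken from the figure above.
    from math import sqrt

    monday    = [87, 83, 74, 77, 93, 88, 68, 70]
    wednesday = [88, 80, 73, 78, 92, 88, 69, 72]

    def pearson(x, y):
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
        sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
        return cov / (sd_x * sd_y)

    print(f"Test-retest correlation: {pearson(monday, wednesday):.2f}")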
Test
length and time provided for test taking.
Would you rather take a test that has 5 items or one that has 25
items? Tests with more items generally are more reliable than tests
with fewer items. However, longer assessments may not always be a
practical choice given other constraints, such as the amount of time
available for assessment. When you design tests, make sure that every
student who has learned the material well has sufficient time to
complete the assessment. Consider the average amount of time it will
take students to complete each question, and then adjust the number
of test questions accordingly. Table 27.1 provides some target time
requirements for different types of test items based on typical test
taking at the middle school or high school level. Keep in mind that
elementary school students require shorter tests than older students.
The time allotted during the school day for each subject might be
less than the average class period in middle or high school. So
elementary school students might have only 30 minutes to take a test
(requiring fewer items), while students in middle school or high
school might have a 50-minute period. Students in the elementary
grades also need shorter tests because they have shorter attention
spans and tire more quickly than older students.
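As a rough planning aid, the per-item times in Table 27.1 can be turned into an estimate of whether a draft test fits the available period. The sketch below is only an illustration: the item mix and the 50-minute period are assumed values, and the minutes per item are taken loosely from the ranges in Table 27.1.

    # Rough planning sketch: estimate total testing time for a draft item mix.
    # Minutes per item are taken loosely from the ranges in Table 27.1; the item
    # counts and the 50-minute period are assumed values for illustration.
    MINUTES_PER_ITEM = {
        "true/false": 0.4,                  # 20-30 seconds
        "multiple choice (factual)": 0.8,   # 40-60 seconds
        "short answer": 3.0,                # 2-4 minutes
        "short essay": 17.5,                # 15-20 minutes
    }

    draft_test = {
        "true/false": 10,
        "multiple choice (factual)": 20,
        "short answer": 5,
        "short essay": 1,
    }

    period = 50  # minutes available for the test
    total = sum(MINUTES_PER_ITEM[kind] * count for kind, count in draft_test.items())
    print(f"Estimated time: {total:.0f} of {period} minutes")
    if total > period:
        print("Too long for the period: trim items or swap formats.")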
Frequency
of testing.
The frequency of testing—or how often students are tested—also
affects reliability. The number and type of items included on tests
may be influenced by the amount of material that has been covered
since the last assessment was given. A review of research on
frequency of testing provides several conclusions (Dempster, 1991):
• Frequent testing encourages better retention of information and appears to be more effective than a comparable amount of time spent studying and reviewing material.
• Tests are more effective in promoting learning if students are tested soon after they have learned material and then retested on the same material later.
• The use of cumulative questions on tests is key to effective learning. Cumulative questions give students an opportunity to recall and apply information learned in previous units.
Objectivity
of scoring.
Objectivity
refers
to the degree to which two or more qualified evaluators agree on
what rating or score to assign to a student’s performance. Some
types of test items, such as multiple choice, true/false, and
matching, tend to be easier to score objectively than short-answer
items and essays. Objective scoring increases the reliability of the
test score. This does not mean that the more subjective item formats
should be eliminated, because they have their own advantages, as we
will see later in the module.
(Cross-reference: Reliability of test results, see page 535.)
TABLE 27.1  Time Requirements for Certain Assessment Tasks

Type of task                             Approximate time per item
True/false                               20–30 seconds
Multiple choice (factual)                40–60 seconds
One-word fill-in                         40–60 seconds
Multiple choice (complex)                70–90 seconds
Matching (5 stems/6 choices)             2–4 minutes
Short answer                             2–4 minutes
Multiple choice (with calculations)      2–5 minutes
Word problems (simple arithmetic)        5–10 minutes
Short essays                             15–20 minutes
Data analysis/graphing                   15–25 minutes
Drawing models/labeling                  20–30 minutes
Extended essays                          35–50 minutes

Source: Reprinted from Nitko & Brookhart, 2007, p. 119.
Fairness
and Equivalence
Fairness
is the degree to which all students have an equal opportunity to
learn material and demonstrate their knowledge and skill (Yung,
2001). Consider this multiple-choice question:
Which ball has the smallest diameter?
a. basketball
b. soccer ball
c. lacrosse ball
d. football
If
students from a lower-income background answer this geometry question
incorrectly, is it because they haven’t mastered the concept of
diameter or because they lack prior experience with some of the
items? Do you think girls also might lack experience with some of
these items? Females, as a group, do not score as high as males on
tests that reward mechanical or physical skills (Patterson, 1989) or
mathematical, scientific, or technical skills (Moore, 1989).
African-American, Latino, and Native-American students, as well as
students for whom English is a second language, do not as a group
perform as well as Anglos on formal tests (Garcia & Pearson,
1994). Asian Americans, who have a reputation as high achievers in
American society and who tend to score higher than Anglo students in
math, score lower on verbal measures (Tsang, 1989).
A
high-quality assessment should be free of bias,
or
a systematic error in test items that leads students from certain
subgroups (ethnic, socioeconomic, gender, religious, disability) to
perform at a disadvantage (Hargis, 2006). Tests that include items
containing bias reduce the validity of the test score. Additional
factors that tend to disadvantage students from diverse cultural
and linguistic backgrounds during testing include:
• speededness, or the inability to complete all items on a test as a result of prescribed time limitations (Mestre, 1984);
• test anxiety and testwiseness (Garcia, 1991; Rincon, 1980);
• differential interpretation of questions and foils (Garcia, 1991); and
• unfamiliar test conditions (Taylor, 1977).

Fairness. Tests should be free of bias that would lead certain subgroups to perform at a disadvantage.

(Cross-reference: Test fairness and test bias, see page 547.)
To
ensure that all students have an equal opportunity to demonstrate
their knowledge and skills, teachers may need to make individual
assessment accommodations with regard to format, response, setting,
timing, or scheduling (Elliott, McKevitt, & Kettler, 2002).
In
addition to assuring fairness among individuals within a particular
classroom, assessments must demonstrate fairness from one school year
to the next or even one class period to the next. If you decide to
use different questions on a test on a particular unit, you should
try to ensure equivalence from one exam to the next. Equivalence
means that students past and present, or students in the same course
but different class periods, are required to know and perform tasks
of similar (but not identical) complexity and difficulty in order to
earn the same grade (Nitko & Brookhart, 2007). This assumes that
the content or learning goals have not changed and that the analysis
of results from past assessments was satisfactory.
Practicality
Assessing
students is very important, but the time devoted to assessment should
not interfere with providing high-quality instruction. When creating
assessments, teachers need to consider issues of practicality—the
extent to which a particular form of assessment is economical and
efficient to create, administer, and score. For example, essay
questions tend to be less time consuming to construct than good
multiple-choice questions. However, multiple-choice questions can be
scored much more quickly and are easier to score objectively.
Performance tasks such as group projects or class presentations can
be very difficult to construct properly, but there are times when
these formats allow the teacher to better assess what students have
learned. When deciding what format to choose, consider whether:
• the format is relatively easy to construct and not too time consuming to grade,
• the time spent using the testing format could be better spent on teaching, and
• another format could meet assessment goals but be more efficient.
Think
about a particular test you might give in the grade level you intend
to teach. How will you ensure its validity, reliability, and
fairness? Evaluate your test’s practicality.
DEVELOPING A TEST BLUEPRINT
To
increase reliability, validity, fairness, and equivalence, try making
a blueprint prior to developing a test. A test
blueprint
is an assessment planning tool that describes the content the test
will cover and the way students are expected to demonstrate their
understanding of that content. When it is presented in a table
format, as shown in Table 27.2, it is called a table
of specifications.
On the table, the row headings (down the left margin) indicate major
topics that the assessment will cover. The column headings (across
the top) list the six classification levels of Bloom’s (1956)
taxonomy of cognitive objectives, which provide a framework for
composing tests. The first three categories (knowledge,
comprehension, and application) are often called lower-level
objectives, and the last three categories (analysis, synthesis, and
evaluation) are called higher-level objectives. Think of these six
categories as a comprehensive framework for considering different
cognitive goals that need to be met when planning for instruction.
Each cell within the table of specifications itemizes a specific
learning goal. These learning goals get more complex as you move from
left to right. In the far left column, the student might be asked to
define terms, while in the far right column the student is asked to
synthesize or evaluate information in a meaningful way. For each
cell, the teacher also needs to decide how many questions of each
type to ask, as well as how each item will be weighted (how many
points each item will be worth). The weight of each item should
reflect its value or importance (Gronlund, 2003; Nitko &
Brookhart, 2007).
When
planning a test, it is not necessary to cover all six levels of
Bloom’s taxonomy. It is more important that the test’s coverage
matches the learning goals and emphasizes the same concepts or skills
the teacher focused on in day-to-day instruction. Teaching is most
effective when lesson plans, teaching activities, and learning goals
are aligned and when all three are aligned with state standards.
Assessment is most effective when it matches learning goals and
teaching activities.
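A table of specifications can also be kept in a simple machine-readable form so that item counts and point weights are easy to total and compare against the intended emphasis. The sketch below is a hypothetical illustration; the topics echo Table 27.2, but the levels, item counts, and point values shown are assumed for the example.

    # Minimal sketch: a test blueprint (table of specifications) as data.
    # Each entry records a content topic, the cognitive level assessed, the
    # number of items planned, and points per item; all values are illustrative.
    blueprint = [
        {"topic": "Types of force",         "level": "Knowledge",   "items": 2, "points_each": 1},
        {"topic": "Two-dimensional forces", "level": "Application", "items": 6, "points_each": 2},
        {"topic": "Interaction of masses",  "level": "Synthesis",   "items": 1, "points_each": 5},
    ]

    total_items = sum(cell["items"] for cell in blueprint)
    total_points = sum(cell["items"] * cell["points_each"] for cell in blueprint)
    print(f"{total_items} items, {total_points} points planned")
    for cell in blueprint:
        share = 100 * cell["items"] * cell["points_each"] / total_points
        print(f"{cell['topic']:<26}{cell['level']:<13}{share:.0f}% of points")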
(Cross-reference: Bloom's taxonomy and its application to instructional planning, see page 360.)
(Cross-reference: Performance assessment, see page 498.)
TABLE 27.2  Sample Test Blueprint for a High School Science Unit

Column headings (major categories of the cognitive taxonomy): Knowledge, Comprehension, Application, Analysis, Synthesis, Evaluation. Each cell of the table lists a learning goal and the number of test items planned for it; the goals are grouped below by content topic.

1. Historical concepts of force
   - Identify the concepts of force and list the empirical support for each concept. (2 test items)
   - Classify each force as a vector or scalar quantity, given its description. (2 test items)
2. Types of force
   - Define each type of force and the term velocity. (2 test items)
   - Define the resultant two-dimensional force in terms of one-dimensional factors. (2 test items)
3. Two-dimensional forces
   - Find the x and y components of resultant forces on an object. (6 test items)
4. Three-dimensional forces
   - Define the resultant three-dimensional force in terms of one-dimensional factors. (1 test item)
   - Calculate the gravitational forces acting between two bodies. (8 test items)
5. Interaction of masses
   - Define the terms inertial mass, weight, active gravitational mass, and passive gravitational mass. (6 or 7 test items)
   - Develop a definition of mass that explains the difference between inertial mass and weight. (1 test item)

Source: Adapted from Kryspin & Feldhusen, 1974, p. 42.
Assuming that learning goals do not change, the same blueprint can be used to create multiple tests for use across class periods or school years.
Use
the sample table of specifications in Table 27.2 to create your own
test blueprint for a particular unit at the grade level you intend to
teach. What topics will your test cover, and how will you require
students to demonstrate their knowledge?
DEVELOPING TEST ITEMS
After
you have selected the content to be covered on the test, it is time
to consider the format of test items to use. Teachers have several
test formats available: alternate response, matching items, multiple
choice, short answer, and essay. Alternate response (e.g.,
true/false), matching items, and multiple choice are called
recognition
tasks
because they ask students to recognize correct information among
irrelevant or incorrect statements. These types of items are also
referred to as objective
testing
formats because they have one correct answer. Short answer/completion
(fill in the blanks) and essay items are
recall tasks,
requiring students to generate the correct answers from memory. These
types of items are also considered a subjective
testing
format, with the scoring more open to interpretation. Objective and
subjective formats differ in three important ways:
1.
The
amount of time to take the test.
While teachers may be able to ask 30 multiple-choice questions in a
50-minute class period, they may be able to ask only four essay
questions in the same amount of time.
2.
The
amount of time to score the test.
Scoring of objective formats is straightforward, needing only an
answer key with the correct responses. If teachers use optical
scanning sheets (“bubble sheets”) for objective test responses,
they can scan many sheets in a short amount of time. A middle school
or high school teacher who uses the same 50-item test for several
class periods can scan large batches of bubble sheets in minutes.
Because extended short answers or essay questions involve more
subjective judgments, they are more time-consuming to grade. Teachers
may choose to use essays for classes with fewer students.
3.
The
objectivity in grading.
Because extended short answers or essay questions are subjective,
teachers can improve their objectivity in scoring these formats by
using a rubric. A
rubric,
such as the example in Figure 27.1, is an assessment tool that
provides preset criteria for scoring student responses, making
grading simpler and more transparent. Rubrics ensure consistency of
grading across students or grading sessions (when teachers grade a
set of essays, stop, and return to grading).
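To make the idea concrete, the sketch below shows one hypothetical way a rubric such as the position-essay example in Figure 27.1 could be encoded and used to total a score. The criteria and maximum points follow that figure; the scoring function and the sample essay scores are assumptions for illustration.

    # Minimal sketch: a scoring rubric as preset criteria with maximum points.
    # Criteria and maximums follow the position-essay rubric in Figure 27.1;
    # the awarded points below are a made-up example for one student's essay.
    rubric = {
        "Format (typed, double-spaced)": 2,
        "States a clear position for or against": 3,
        "Appropriate level of detail describing stated position": 5,
        "Arguments to support response from resources": 5,
        "Grammar, spelling, punctuation, and clarity of writing": 5,
        "References in appropriate format for resources": 5,
    }

    def score_essay(awarded):
        """Total the awarded points, capping each criterion at its maximum."""
        return sum(min(points, rubric[criterion]) for criterion, points in awarded.items())

    one_essay = dict(rubric)  # start from full marks...
    one_essay["Grammar, spelling, punctuation, and clarity of writing"] = 3  # ...then deduct
    print(f"Score: {score_essay(one_essay)} / {sum(rubric.values())}")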
Teachers
should select the format that provides the most direct assessment of
the particular skill or learning outcome being evaluated (Gronlund,
2003). As you will see next, each type of item format has its own
unique characteristics. However, some general rules of item
construction apply across multiple formats. All test items should:
• measure the required skill or knowledge;
• focus on important, not trivial, subject area content;
• contain accurate information, including correct spelling;
• be clear, concise, and free of bias; and
• be written at an appropriate level of difficulty and an appropriate reading level.
Alternate-Choice
(True/False) Items
An
alternate-choice
item
presents a proposition that a student must judge and mark as either
true or false, yes or no, right or wrong.
Figure 27.1: Rubrics. Rubrics ensure consistency of grading when subjective test item formats are used.

Rubric for a Position Essay

Criteria                                                           Points possible   Points earned
Format (typed, double-spaced)                                             2
States a clear position for or against                                    3
Provides appropriate level of detail describing stated position           5
Includes arguments to support response from resources                     5
Grammar, spelling, punctuation and clarity of writing                     5
Includes references in appropriate format for resources                   5
TOTAL                                                                     25
BOX 27.1  Guidelines for Writing True/False Items
1.
Use short statements and simple vocabulary and sentence structure.
Consider the following true/false item: “The
true/false item is more subject to guessing but it should be used in
place of a multiple-choice item, if well constructed, when there are
a lack of plausible distractors.”
Long, complex sentences such as this one are confusing and therefore
more difficult to judge as true or false.
2.
Include only the central idea in each statement. Consider the
following true-false item: “The
true-false item, which is preferred by testing experts, is also
called an alternate-response item.”
This item has two ideas to evaluate: being favored by experts and
being called an alternative-response item.
3.
Use negative statements sparingly, and avoid double negatives. These
also can be confusing.
4.
Use precise wording so that the statement can unequivocally be judged
true or false.
5.
Write false statements that reflect common misconceptions held by
students who have not achieved learning goals.
6.
Avoid using statements that reproduce verbatim sentences from the
textbook or reading material. Students may be able to judge the item
true or false simply by memorizing and not understanding. This
reduces the validity of the student’s test score. Does the student
really
know the material?
7.
Statements of opinion should be attributed to some source (e.g.,
according to the textbook; according to the research on . . . ).
8.
When cause-effect is being assessed, use only true propositions. For
example, use “Exposure
to ultraviolet rays can cause skin cancer”
rather than “Exposure
to ultraviolet rays does not cause skin cancer.”
Evaluating that the positively worded statement is true is less
confusing than evaluating that the negatively worded statement is
false.
9.
Do not overqualify the statement in a way that gives away the
answer. Avoid using specific determiners (e.g., always,
never, all, none, only, usually, may,
and sometimes
tend to be true).
10.
Make true and false items of comparable length and make the same
number of true and false items.
11.
Randomly sort the items into a numbered sequence so that they are
less likely to appear in a repetitive, predictable pattern (e.g.,
avoid T F T F . . . or . . . TT FF TT FF).
Sources:
Ebel, 1979; Gronlund, 2003; Lindvall & Nitko, 1975.
Alternate-choice items are recognition tasks because students
only need to recognize whether the statement matches a fact that they
have in their memory. A true/false question might state:
An
equilateral triangle has three sides of equal length. T F
Teachers
also can design items that require multiple true/false responses. For
example:
Scientists
who study earthquakes have learned that:
1.
the surface of the earth is in constant motion due to forces inside
the planet. T F
2.
an earthquake is the vibrations produced by breaking rocks along a
fault line. T F
3.
the time and place of earthquakes are easy to predict. T F
Alternate-choice
questions are optimal when the subject matter lends itself to an
either-or response. They allow teachers to ask a large number of
questions in a short period of time (a practicality issue), making it
possible to cover a wide range of topics within the domain being
assessed. However, an obvious disadvantage of this format is that
students have a 50% chance of getting the answer correct simply by
guessing. Good alternate-choice questions are harder to write than you might expect; teachers can use the recommendations given in Box 27.1.
Matching
Exercises
A
matching
exercise
presents students with directions for matching a list of premises and
a list of responses. The student must match each premise with one of
the responses. Matching is a recognition task because the answers are
present in the test item. A simple matching exercise might look like
this:
Directions: In the left column are events that preceded the start of the Revolutionary War. For each event, choose the date it occurred from the right column, and place the letter identifying it on the line next to the event description.

Description of events (premise list):
______ 1. British troops fire on demonstrators in the Boston Massacre, killing five.
______ 2. British Parliament passes the Tea Act.
______ 3. Parliament passes the Stamp Act, sparking protests in the American colonies.
______ 4. Treaty of Paris ends French power in North America.

Dates (response list):
a. 1763
b. 1765
c. 1768
d. 1770
e. 1773
f. 1778
Matching
exercises are very useful for assessing a student’s ability to make
associations or see relationships between two things (e.g., words and
their definitions, individuals and their accomplishments, events in
history and dates). This format provides a space-saving and objective
way to assess learning goals. It is versatile in that words or
phrases can be matched to symbols or pictures (e.g., matching a
country name to the outline of that country on a map). Well-designed
matching exercises can assess students’ comprehension of concepts,
ideas, and principles. However, teachers often fall into the trap of
using matching only for memorized lists (such as names and dates) and
do not develop matching exercises that assess higher-level thinking.
To construct effective matching items, consider these guidelines:
• Clearly explain the intended basis for matching, as in the directions for the sample item above.
• Use short lists of responses and premises.
• Arrange the response list in a logical order. For example, in the sample item, dates are listed in chronological order.
• Identify premises with numbers and responses with letters.
• Construct items so that longer phrases appear in the premise list and shorter phrases appear in the response list.
• Create responses that are plausible items for each premise. A response that clearly does not fit any premise gives a hint to the correct answer.
• Avoid "perfect" one-to-one matching by including one or more responses that are incorrect choices or by using a response as the correct answer for more than one premise.
Multiple-choice
Items
Each
multiple-choice
item
contains a stem,
or introductory statement or question, and a list of choices, called
response
alternatives. Multiple-choice
items are recognition tasks because the correct answer is provided
among the choices. A typical multiple-choice question looks like
this:

What is the main topic of the reading selection Responding to Learners?
a. academic achievement
b. question and response techniques
c. managing student behavior

The
response alternatives include a keyed
alternative (the
correct answer) and distractors,
or incorrect alternatives. The example includes three choices: one
keyed alternative and two distractors. Other multiple-choice formats
may include a four-choice option (a, b, c, d) or five-choice option
(a, b, c, d, e). Box 27.2 presents a set of detailed guidelines for
developing a multiple-choice format that addresses content, style,
and tips for writing the stem and the choices.
Of
the item formats that serve as recognition tasks, multiple-choice
items are preferred by most assessment experts because this format
offers many advantages. The multiple-choice format can be used to
assess a great variety of learning goals, and the questions can be
structured to assess factual knowledge as well as higher-order
thinking.
BOX 27.2  Guidelines for Writing Multiple-choice Items
Content
concerns:
1.
Every item should reflect specific content and a single specific
mental behavior, as called for in test specifications.
2.
Base each item on important content to learn; avoid trivial
content.
3.
Use novel material to test higher-level learning. Paraphrase
textbook language or language used during instruction in a test item
to avoid testing for simple recall.
4.
Keep the content of each item independent from content of other items
on the test.
5.
Avoid overly specific and overly general content when writing
items.
6.
Avoid opinion-based items.
7.
Avoid trick items.
8.
Keep vocabulary simple for the group of students being tested.
Writing
the stem:
9.
Ensure that the wording of the stem is very clear.
10.
Include the central idea in the stem instead of in the response
alternatives. Minimize the amount of reading in each item. Instead of
repeating a phrase in each alternative, try to include it as part of
the stem.
11.
Avoid window dressing (excessive verbiage).
12.
Word the stem positively; avoid negatives such as NOT or EXCEPT. If
a negative word is used, use the word cautiously and always ensure
that the word appears capitalized and boldface.
Writing
the response alternatives:
13.
Offering three response alternatives is adequate.
14.
Make sure that only one of the alternatives is the right answer.
15.
Make all distractors plausible. Use typical student misconceptions as your distractors.
16. Phrase choices positively; avoid negatives such as NOT.
17.
Keep alternatives independent; they should not overlap.
18.
Keep alternatives homogeneous in content and grammatical structure.
For example, if one alternative is stated as a negative—“No
running in the hall”—all alternatives should be phrased as
negatives.
19.
Keep all alternatives roughly the same length. If the correct answer
is substantially longer or shorter than the distractors, this may
serve as a clue for test-takers.
20. "None of the above" should be avoided or used sparingly.
21. Avoid "All of the above."
22.
Vary the location of the correct answer in the list of alternatives
(e.g., the correct answer should not always be C).
23.
Place alternatives (A, B, C, D) in logical or numerical order. For
example, if a history question lists dates as the response
alternatives, they should be listed in chronological order.
24.
Avoid giving clues to the right answer, such as:
a.
specific determiners, including always,
never, completely,
and absolutely
b.
choices identical to or resembling words in the stem
c.
grammatical inconsistencies that cue the test-taker to the correct
choice
d.
obvious correct choice
e.
pairs or triplets of options that clue the test-taker to the correct
choice
f.
blatantly absurd, ridiculous options
Style concerns:
25. Edit and proof items.
26.
Use correct grammar, punctuation, capitalization, and spelling.
Source:
Adapted from Haladyna, Downing, & Rodriguez, 2002.
Because multiple-choice questions do not require writing, students
who are poor writers have a more equal playing field for
demonstrating their understanding of the content than they have when
answering essay questions. All students also have less chance to
guess the correct answer in a multiple-choice format than they do
with true/false items or a poorly written matching exercise. Also,
the distractor that a student incorrectly chooses can give the
teacher insight into the student’s degree of misunderstanding.
However,
because multiple-choice items are recognition tasks, this format does
not require the student to recall information independently.
Multiple-choice questions are not the best option for
assessing
writing skills, self-expression, or synthesis of ideas or in
situations where you want students to demonstrate their work (e.g.,
showing steps taken to solve a math problem). Also, poorly written
multiple-choice questions can be superficial, trivial, or limited to
factual knowledge.
The
guidelines for writing objective items, such as those given in Box
27.1 and Box 27.2, are especially important for ensuring the validity
of classroom tests. Poorly written test items can give students a
clue to the right answer. For example, specific
determiners—extraneous
clues to the answer such as always, never, all, none, only, usually,
may—can enable students who may not know the material well to
correctly answer some true/false or multiple-choice questions using
testwiseness. Test-wiseness
is an ability to use test-taking strategies, clues from poorly
written test items, and prior experience in test taking to improve
one’s score. Learned either informally or through direct
instruction, it improves with grade level, experience in test taking,
and motivation to do well on an assessment (Ebel & Frisbie, 1991;
Sarnacki, 1979; Slakter, Koehler, & Hampton, 1970).
Short-answer/Completion
Items
Short-answer/completion
items
come in three basic varieties. The question
variety
presents a direct question, and students are expected to supply a
short answer (usually one word or phrase), as shown here:

1. What is the capital of Kentucky? Frankfort
2. What does the symbol Ag represent on the periodic table? Silver
3. How many feet are in one yard? 3

The completion variety presents an incomplete sentence and requires students to fill in the blank, as in the next examples:
1. The capital of Kentucky is Frankfort.
2. 3(2 + 4) = 18
The
association
variety
(sometimes called the identification variety) presents a list of
terms, symbols, labels, and so on for which students have to recall
the corresponding answer, as shown below:

Element     Symbol
Barium      Ba
Calcium     Ca
Chlorine    Cl
Short-answer
items are relatively easy to construct. Teachers can use these
general guidelines to help them develop effective short-answer items:
• Use the question variety whenever possible, because it is the most straightforward short-answer design and is the preferred option of experts.
• Be sure the items are clear and concise so that a single correct answer is required.
• Put the blank toward the end of the line for easier readability.
• Limit blanks within a short-answer question to one or two.
• Specify the level of precision (a word, a phrase, a few sentences) expected in the answer so students understand how much to write in their response.
While
short-answer items typically assess students’ lower-level skills,
such as recall of facts, they also can be used to assess more complex
thinking skills if they are well designed. The short-answer format
lowers a student’s probability of getting an answer correct by
random guessing, a more likely scenario with alternate-choice and multiple-choice items.
Short-answer
items also are relatively easy to score objectively, especially when
the correct response is a one-word answer. Partial credit can be
awarded if a student provides a response that is close to the correct
answer but not 100% accurate. You occasionally may find that
students provide unanticipated answers. For example, if you ask “Who
discovered America?”, student responses might
include
“Christopher Columbus,” “Leif Erikson,” “the Vikings,” or
“explorers who sailed across the ocean.” You then would have to
make a subjective judgment about whether such answers should receive
full or partial credit. Because reliability decreases as scoring
becomes more subjective, teachers should use a scoring key when
deciding on partial credit to maintain scoring consistency from one
student to the next.
Essay
Tasks
Essay
tasks allow assessment of many cognitive skills that cannot be
assessed adequately, if at all, through more objective item formats.
Essay tasks can be classified into two types. Restricted
response essay
tasks limit the content of students’ answers as well as the form of
their responses. A restricted response task might state: List
the three parts of the memory system and provide a short statement
explaining how each part operates.
Extended
response essay
tasks require students to write essays in which they are free to
express their thoughts and ideas and to organize the information as
they see fit. With this format, there usually is no single correct
answer; rather the accuracy of the response becomes a matter of
degree. Teachers can use these guidelines to develop effective essay
questions:
• Cover the appropriate range of content and learning goals. One essay question may address one learning goal or several.
• Create essay questions that assess application of knowledge and higher-order thinking, not simply recall of facts.
• Make sure the complexity of the item is appropriate to the student's developmental level. Elementary school students might not be required to write lengthy essays in essay booklets, whereas middle school or high school students would be expected to write more detailed essays.
• Specify the purpose of the task, the length of the response, time limits, and evaluation criteria. For example, instead of an essay task that states "Discuss the advantages of single-sex classrooms," a high school teacher might phrase the task as: "You are addressing the school board. Provide three arguments in favor of single-sex classrooms." The revised essay question provides a purpose for the response and specifies the amount students need to write. Teachers also should specify how essays will be evaluated, for example, whether spelling and grammar count toward the grade and how students' opinions will be evaluated.
Whether
to use a restricted response format or an extended response format
depends on the intended purpose of the test item as well as
reliability and practicality issues. The restricted response format
narrows the focus of the assessment to a specific, well-defined
area. The level of specificity makes it more likely that students
will interpret the question as intended. This makes scoring easier,
because the range of possible responses is also restricted. Scoring
is more reliable because it is easier to be very clear about what
constitutes a correct answer. On the other hand, if you want to know
how a student organizes and synthesizes information, the narrowly
focused, restricted response format may not serve the assessment
purpose well. Extended response questions are suitable for assessing
students’ writing skill and/or subject matter knowledge. If your
learning goals involve skills such as organizing ideas, critically
evaluating a certain position or argument, communicating feelings, or
demonstrating creative writing skill, the extended response format
provides an opportunity for students to demonstrate these skills.
Because
extended responses are subjective, this format generally has poor
scoring reliability. Given the same essay response to evaluate,
several different teachers might award different scores, or the same
teacher might award different scores to student essays at the
beginning and end of a pile of tests. When the scores given are
inconsistent from one response to the next, the validity of the
assessment results is lessened. Also, teachers tend to evaluate the
essays of different students according to different criteria,
evaluating one essay in terms of its high level of creativity and
another more critically in terms of grammar and spelling. A
significant disadvantage is that grading extended essay responses is
a very time-consuming process, especially if the teacher takes the
time to provide detailed written feedback to help students improve
their work.
Restricted
and extended response essay items have special scoring considerations
unique to the essay format. Essay responses, especially extended
ones, tend to have poor scoring reliability and lower practicality.
As discussed earlier, a scoring rubric can help teachers score essay
answers more fairly and consistently. The following guidelines offer
additional methods for ensuring consistency:
• Use a set of anchor essays—student essays that the teacher selects as examples of performance at different levels of a scoring rubric (Moskal, 2003). For example, a teacher may have a representative A essay, B essay, and so on. Anchors increase reliability because they provide a comparison set for teachers as they score student responses. A set of anchor essays without student names can be used to illustrate the different levels of the scoring rubric to both students and parents.
• If an exam has more than one essay question, score all students on the first question before moving on to the next question. This method increases the consistency and uniformity of your scoring.
• Score subject matter content separately from other factors such as spelling or neatness.
• To increase fairness and eliminate bias, score essays anonymously by having students write their names on the back of the exam.
• Take the time to provide effective feedback on essays.
Think
about a particular unit on which your students might be tested in the
grade level you intend to teach. What test item formats would you
choose to use, and why? Would your choice of item formats be
different for a pretest and for a test given at the end of a unit?
TEST ANALYSIS AND REVISION
No
test is perfect. Good teachers evaluate the tests they use and make
necessary revisions to improve them. When using objective test
formats such as alternate response and multiple choice, teachers can
evaluate how well test items function by using item
analysis,
a process of collecting, summarizing, and using information from
student responses to make decisions about each test item. When
teachers use optical scanning sheets (“bubble sheets”) for test
responses, they can use a computer program not only to score the
responses, but also to generate an item analysis. Item analysis
provides two statistics that indicate how test items are functioning:
an item difficulty index and an item discrimination index.
The
item
difficulty index
reports the proportion of the group of test-takers who answered an
item correctly, ranging from 0 to 1. Items that are functioning
appropriately should have a moderate item difficulty index to
distinguish students who have grasped the material from those who
have not. This increases the validity of the test score. As a rule of
thumb, a moderate item difficulty can range from
.3
to .7. However, the optimal item difficulty level must take guessing
into account. For example, with a four-choice (a, b, c, d)
multiple-choice item, the chance of guessing is 25%. The optimal
difficulty level for this type of item is the midpoint between
chance (25%) and 100%, or 62.5%. So item difficulties for a
four-choice multiple-choice item should be close to .625.
Item
difficulty indexes that are very low (e.g., 0–.3) indicate that
very few students answered correctly. This information can identify
particular concepts that need to be retaught, provide clues about the
strengths and weaknesses of instruction, or indicate test items that
are poorly written and need to be revised or removed. Item difficulty
indexes that are very high (e.g., .8–.9) indicate that the majority
of students answered the items correctly, suggesting that the items
were too easy. While we want students to perform well, items that are
too easy do not discriminate between the students who know the
material well and those who do not, which is one purpose of assessing
student performance.
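The difficulty index and the guess-adjusted optimal level described above involve only simple arithmetic. The sketch below is a minimal illustration; the student response data are hypothetical.

    # Minimal sketch: item difficulty index and the guess-adjusted optimal level.
    # Each value in responses records whether a student answered the item
    # correctly; the data are hypothetical.
    responses = [True, True, False, True, True, False, False, True]

    def difficulty_index(correct_flags):
        """Proportion of test-takers who answered the item correctly (0 to 1)."""
        return sum(correct_flags) / len(correct_flags)

    def optimal_difficulty(num_choices):
        """Midpoint between the chance of guessing and 1.0 (e.g., .625 for 4 choices)."""
        chance = 1 / num_choices
        return (chance + 1.0) / 2

    p = difficulty_index(responses)
    print(f"Item difficulty: {p:.2f}; optimal for a 4-choice item: {optimal_difficulty(4):.3f}")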
Item
discrimination indexes add this crucial piece of information. An item
discrimination index describes
the extent to which a particular test item differentiates
high-scoring students from low-scoring students. It is calculated as
the difference between the proportion of the upper group (highest
scorers) who answered a particular item correctly and the proportion
of the lower group (lowest scorers) who answered that same item
correctly. The resulting index ranges from –1 to +1. If a test is
well constructed, we would expect all test items to be positively
discriminating, meaning that those students in the highest scoring
group get the items correct while those in the lowest scoring group
get the items wrong. Test items with low (below .4), zero, or
negative discrimination indexes reduce a test score’s validity and
should be rewritten or replaced. A low item discrimination index
indicates that the item cannot accurately discriminate between
students who know the material and those who don’t. An item
discrimination index of zero indicates that the item cannot
discriminate at
all
between high scorers and low scorers. A test with many zero
discriminations does not provide a valid measure of student
achievement. Items with negative discrimination indexes indicate that
lower-scoring students tended to answer the items correctly while
higher-scoring students tended to get them wrong. If a test contains
items with negative discriminations, the total score on the exam will
not provide useful information.
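The discrimination index is simply a difference between two proportions. The minimal sketch below uses hypothetical upper-group and lower-group results.

    # Minimal sketch: item discrimination index as the difference between the
    # proportion correct in the upper-scoring group and the lower-scoring group.
    # The group results below are hypothetical.
    def discrimination_index(upper_correct, upper_total, lower_correct, lower_total):
        """Ranges from -1 to +1; positive values mean high scorers did better."""
        return upper_correct / upper_total - lower_correct / lower_total

    # Of 10 top scorers, 9 answered the item correctly; of 10 bottom scorers, 4 did.
    d = discrimination_index(9, 10, 4, 10)
    print(f"Discrimination index: {d:+.2f}")  # +0.50, a positively discriminating item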
When
item analyses suggest that particular test items did not function as
expected, the source of the problem must be investigated. The item
analysis may indicate that the problem stems from:
• the item itself. For example, multiple-choice items may have poorly functioning distractors, ambiguous alternatives, questions that invite random guessing, or items that have been keyed incorrectly.
• student performance. Students may have misread the item and misunderstood it. When developing a new test item, the teacher may think it is worded clearly and has one distinctly correct answer. However, item analysis might reveal that students interpreted that test item differently than intended.
• teacher performance. Low item difficulties, which indicate that students did not grasp the material, may suggest that the teacher's performance needs to be improved. Perhaps concepts needed further clarification or the teacher needs to consider a different approach to presenting the material.
The
process of discarding certain items that did not work well, revising
other items, and testing a few new items on each test eventually
leads to a much higher quality test (Nitko & Brookhart, 2007).
Once “good” items have been selected, teachers can use software
to store test items in a computer file so they can select a subset
of test items, make revisions, assemble tests, and print out tests
for use in the classroom. Certain software products even allow the
teacher to sort items according to their alignment with curriculum
standards. Computer applications vary in quality, cost,
user-friendliness, and amount of training required for their use.
An
item analysis shows that three of your test items have very low item
difficulties, and you plan to revise these before you use the test
next time. Because these poorly functioning items affect students’
test scores, what can you do to improve the validity of your current
students’ test scores?
Summary
Discuss the importance of validity, reliability, fairness/equivalence, and practicality in test construction.
Classroom
tests should be evaluated based on their validity, reliability,
fairness/equivalence, and practicality. Teachers must consider how
well the test measures what it is supposed to measure (validity), how
consistent the results are (reliability), the degree to which all
students have an equal opportunity to learn and demonstrate their
knowledge and skill (fairness/equivalence), and how economical and
efficient the test is to create, administer, and score
(practicality).
Explain how a test blueprint is used to develop a good test.
A
test blueprint is an assessment planning tool that describes the
content the test will cover and the way students are expected to
demonstrate their understanding of that content. A test blueprint, or
table of specifications, helps teachers develop good tests because
it matches the test to instructional objectives and actual
instruction. Test blueprints take into consideration the importance
of each learning goal, the content to be assessed, the material that
was emphasized during instruction, and the amount of time available
for students to complete the test.
Discuss
the usefulness of each test item format.
Alternate-choice
items allow teachers to cover a wide range of topics by asking a
large number of questions in a short period of time. Matching
exercises can be useful for assessing a student’s ability to make
associations or see relationships between two things. Multiple-choice
items are preferred by assessment experts because they focus on
reading and thinking but do not
require
writing, give teachers insight into students’ degree of
misunderstanding, and can be used to assess a variety of learning
goals. Short-answer questions are relatively easy to construct, can
assess both lower-order and higher-order thinking skills, and
minimize the chance that students will answer questions correctly by
randomly guessing. Essay questions can provide an effective
assessment of many cognitive skills that cannot be assessed
adequately, if at all, with more objective item formats.
Compare and contrast the scoring considerations for the five test item formats.
Objective
formats such as alternate choice, matching, and multiple choice have
one right answer and tend to be relatively quick and easy to score.
Short-answer/completion questions, if well designed, also can be
relatively easy to score as long as they are written clearly and
require a very specific answer. Essay questions, especially extended
essay formats, tend to have poor scoring reliability and lower
practicality because scoring is subjective and can be
time-consuming. Scoring rubrics allow teachers to score essay answers
more fairly and consistently.
Describe
the benefits of item analysis and revision.
Item
analysis determines whether a test item functions as intended,
indicates areas where students need clarification of concepts, and
points out where the curriculum needs to be improved in future
presentations of the material. The process of discarding certain
items that did not work well, revising other items, and testing a few
new items on each test eventually leads to a much higher quality
test.
Key Concepts
alternate-choice item, content validity, distractors, equivalence, extended response essay, fairness, item analysis, item difficulty index, item discrimination index, matching exercise, multiple-choice item, objective testing, objectivity, practicality, recall tasks, recognition tasks, reliability, response alternatives, restricted response essay, rubric, short-answer/completion items, specific determiners, stem, subjective testing, table of specifications, test blueprint, validity
Case Studies: Reflect and Evaluate
Early
Childhood: “The
Zoo”
These
questions refer to the case study on page 458.
1.
Sanjay and Vivian do not do any paper-and-pencil testing in this scenario. What factors should they consider when deciding whether to
sce-nario. What factors should they consider when deciding whether to
use traditional tests as a form of assessment?
2.
If Vivian and Sanjay choose to use tests or quizzes with the
children, what steps should they take to ensure the validity of their
results?
3.
What issues could potentially interfere with the reliability of test
results among preschoolers?
4.
Most of the preschoolers in Vivian and Sanjay’s classroom do not
yet know how to read independently. How would this impact test
construction and use for this age group?
5.
Given your response to question 4, how might Vivian and Sanjay assess
a specific set of academic skills (e.g., letter or number
recognition) with their students in a systematic way?
6.
The lab school is located in a large city and has a diverse group of
students. How might issues of fairness come into play when designing
assessments for use with these students?
Elementary
School: “Writing
Wizards”
These
questions refer to the case study on page 460.
1.
Brigita provides an answer key for the Grammar Slammer activity
(quiz) and reviews it with the class.
How
does the use of such a key impact the level of objectivity in scoring
responses?
2.
Brigita uses a matching exercise (quiz) to evaluate students’
understanding of their weekly vocabulary words. What are the
advantages and disadvantages of this form of assessment?
3.
If Brigita decides she wants to vary her quiz format by using
multiple-choice questions instead, what factors should she keep in mind
in order to write good multiple-choice questions?
4.
Brigita uses a combination of traditional tests and applied writing
activities to assess her students’ writing skills. Are there other
subject areas in a fourth-grade classroom in which tests would be a
useful assessment choice? Explain.
5.
Imagine that Brigita will be giving a social studies test and wants
to incorporate writing skills as part of this assessment. What are
the advantages and disadvantages of the essay format?
6.
Based on what you read in the module, what advice would you give
Brigita about how to score the responses on the social studies test
referred to in question 5?
Middle
School: “Assessment:
Cafeteria Style”
These
questions refer to the case study on page 462.
1.
From Ida’s perspective, how do the development, implementation, and
grading of the multiple-choice test rate in terms of practicality?
2.
Do 50 multiple-choice questions seem an appropriate number for a
middle school exam in a
50-minute
class period? Why or why not?
3.
What are the advantages of using multiple-choice questions rather
than one of the other question formats available (alternate choice,
matching, short answer, or essay)? Would your answer vary for
different subjects?
4.
What are some limitations of using only multiple-choice questions to
test students’ understanding of course content?
5.
Ida provided a rubric to help her students better understand what was
expected from them on the project option. What could she have done to
clarify her expectations for those students who choose to take the
exam?
6.
If you were the one writing the questions for Ida’s exam, what are
some guidelines you would follow to make sure the questions are well
constructed?
High
School: “Innovative
Assessment Strategies”
These
questions refer to the case study on page 464.
1.
What are some advantages tests have that might explain why so many
teachers at Jefferson High rely on them as a primary means of
assessment?
2.
In the New Hampshire humanities course described in the memo,
students are asked to design their own test as part of their final
project. How could these students use a test blueprint to develop a
good test?
3.
What criteria should the New Hampshire teacher use to evaluate the
quality of the tests designed by students?
4.
Imagine that a New Hampshire teacher gave a final exam using a
combination of the test questions created by students. What could the
teacher do to determine how the questions actually functioned?
5.
The California teacher used a “dream home” project to assess his
students’ conceptual understanding of area relationships. Is it
possible to assess this level of understanding by using
multiple-choice questions on a test? Explain your answer.
6.
Imagine that the Rhode Island social studies teacher wants students
to pay close attention to one another’s oral history presentations,
so she announces that students will be given a test on the material.
If she wants to test basic recall of facts and wants the test to be
quick and easy to grade, which item formats would you suggest she use
on her test? Explain.