MODULE 27
Test Construction and Use

Outline
Characteristics of High-Quality Classroom Tests
• Validity
• Reliability
• Fairness and Equivalence
• Practicality
Developing a Test Blueprint
Developing Test Items
• Alternate-Choice (True/False) Items
• Matching Exercises
• Multiple-choice Items
• Short-answer/Completion Items
• Essay Tasks
Test Analysis and Revision
Summary
Key Concepts
Case Studies: Reflect and Evaluate

Learning Goals
1. Discuss the importance of validity, reliability, fairness/equivalence, and practicality in test construction.
2. Explain how a test blueprint is used to develop a good test.
3. Discuss the usefulness of each test item format.
4. Compare and contrast the scoring considerations for the five test item formats.
5. Describe the benefits of item analysis and revision.
CHARACTERISTICS OF HIGH-QUALITY
CLASSROOM TESTS
Tests are only one form of assessment that you may find useful in
your classroom. Because teachers use tests quite frequently, you will
need to become comfortable designing and evaluating classroom tests.
Writing good test items takes considerable time and practice.
Researchers have identified principles of high-quality test
construction (Gronlund, 2003; McMillan, 2007; National Research
Council, 2001). Despite this, countless teachers violate these
principles when they develop classroom tests. Poorly constructed tests
compromise the teacher’s ability to obtain an accurate assessment of
students’ knowledge and skills, while high-quality assessments yield
reliable and valid information about a student’s performance
(McMillan, 2007). Let’s examine four facets by which to judge the
quality of tests:
• validity,
• reliability,
• fairness and equivalence, and
• practicality.
Validity
Validity typically is defined as the degree to which a test measures
what it is intended to measure. Validity is judged in relation to the
purpose for which the test is used. For example, if a test is given to
assess a social studies unit on American government, then the test
questions should focus on information covered in that unit. Teachers
can optimize the validity of a test by evaluating the test’s content
validity and creating an effective layout for the test items.
Content validity. Content validity refers to evidence that a test
accurately represents a content domain—or reflects what teachers
have actually taught (McMillan, 2007). A test cannot cover
everything a student has learned, but it should provide a
representative sample of learning (Weller, 2001). A test would have
low content validity if it covered only a few ideas from lessons and
focused on extraneous information or if it covered one portion of
assigned readings but completely neglected other material that was
heavily emphasized in class. When developing a test, teachers can
consider several content validity questions (Nitko & Brookhart,
2007):
• Do the test questions emphasize the same things that were emphasized in day-to-day instruction?
• Do the test questions cover all levels of instructional objectives included in the lesson(s)?
• Does the weight assigned to each type of question reflect its relative value among all other types?
Test layout. The layout, or physical appearance, of a test also can
affect the validity of the test results. For example, imagine taking a
test by reading test items written on the board by the teacher or
listening to questions dictated by the teacher. The teacher’s
handwriting or students’ ability to see the board may hinder test
performance, lowering the validity of the test score. Writing test items
on the board or dictating problems also may put students with
auditory or vision problems at a particular disadvantage. The dictation
process can slow down test taking, making it very difficult for
students to go back to check over their answers a final time before
submitting them. Have you ever taken a test for which the instructions
were unclear? A test-taker’s inability to understand the instructions
could lower the level of performance, reducing the validity of the test
results.
Teachers can follow these guidelines when deciding how to design a test’s
layout:
• In general, tests should be typed so that each student has a printed copy. Exceptions would be dictated spelling tests or testing of listening and comprehension skills.
• The test should begin with clear directions at the top of the first page or on a cover page. Typical directions include the number and format of items (e.g., multiple-choice, essay), any penalty for guessing, the amount of time allowed for completion of the test, and perhaps mention of test-taking strategies the teacher has emphasized (e.g., "Read each question completely before answering" or "Try to answer every question").
• Test items should be grouped by format (e.g., all multiple-choice items in one section, all true/false items in another section), and testing experts usually recommend arranging like items in order of increasing difficulty. Arranging items from easiest to hardest decreases student anxiety and increases performance (Tippets & Benson, 1989).
Validity as it applies to standardized tests: See page 534.
[Figure: Reliability. Consistency among scores on a test given twice indicates reliability. Eight students' scores on the same test given Monday and Wednesday: Claire 87/88, Jee 83/80, Kristy 74/73, Doug 77/78, Nick 93/92, Taylor 88/88, Abby 68/69, Marcus 70/72.]
Reliability
Reliability refers to the consistency of test results. If a teacher were
to give a test to students on Monday and then repeat that same test on
Wednesday (without additional instruction or practice in the interim),
students should perform consistently from one day to the next. Let’s
consider several factors that affect test reliability.
Test length and time provided for test taking. Would you rather
take a test that has 5 items or one that has 25 items? Tests with more
items generally are more reliable than tests with fewer items.
However, longer assessments may not always be a practical choice
given other constraints, such as the amount of time available for
assessment. When you design tests, make sure that every student who
has learned the material well has sufficient time to complete the
assessment. Consider the average amount of time it will take students
to complete each question, and then adjust the number of test
questions accordingly. Table 27.1 provides some target time
requirements for different types of test items based on typical test
taking at the middle school or high school level. Keep in mind that
elementary school students require shorter tests than older students.
The time allotted during the school day for each subject might be less
than the average class period in middle or high school. So elementary
school students might have only 30 minutes to take a test (requiring
fewer items), while students in middle school or high school might
have a 50-minute period. Students in the elementary grades also need
shorter tests because they have shorter attention spans and tire more
quickly than older students.
Frequency of testing. The frequency of testing—or how often
students are tested—also affects reliability. The number and type of
items included on tests may be influenced by the amount of material
that has been covered since the last assessment was given. A review
of research on frequency of testing provides several conclusions
(Dempster, 1991):
• Frequent testing encourages better retention of information and appears to be more effective than a comparable amount of time spent studying and reviewing material.
• Tests are more effective in promoting learning if students are tested soon after they have learned material and then retested on the same material later.
• The use of cumulative questions on tests is key to effective learning. Cumulative questions give students an opportunity to recall and apply information learned in previous units.
Objectivity of scoring. Objectivity refers to the degree to which
two or more qualified evaluators agree on what rating or score to
assign to a student’s performance. Some types of test items, such as
multiple choice, true/false, and matching, tend to be easier to score
objectively than short-answer items and essays. Objective scoring
increases the reliability of the test score. This does not mean that the
more subjective item formats should be eliminated, because they have
their own advantages, as we will see later in the module.
Reliability of test results: See page 535.
TABLE 27.1  Time Requirements for Certain Assessment Tasks

Type of task                             Approximate time per item
True/false                               20–30 seconds
Multiple choice (factual)                40–60 seconds
One-word fill-in                         40–60 seconds
Multiple choice (complex)                70–90 seconds
Matching (5 stems/6 choices)             2–4 minutes
Short answer                             2–4 minutes
Multiple choice (with calculations)      2–5 minutes
Word problems (simple arithmetic)        5–10 minutes
Short essays                             15–20 minutes
Data analysis/graphing                   15–25 minutes
Drawing models/labeling                  20–30 minutes
Extended essays                          35–50 minutes

Source: Reprinted from Nitko & Brookhart, 2007, p. 119.
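To see how the per-item times in Table 27.1 translate into an overall test length, here is a minimal sketch (not part of the original text) that budgets a planned mix of items against a class period; the item counts are hypothetical and the per-item times are midpoints of the ranges in the table:

```python
# Hypothetical sketch: estimate whether a planned mix of items fits a class period.
# Per-item times are midpoints of the ranges in Table 27.1, expressed in minutes.
ITEM_TIME_MINUTES = {
    "true_false": 25 / 60,               # 20-30 seconds
    "multiple_choice_factual": 50 / 60,  # 40-60 seconds
    "short_answer": 3.0,                 # 2-4 minutes
    "short_essay": 17.5,                 # 15-20 minutes
}

def estimated_test_time(item_counts):
    """Return the estimated minutes needed for a test with the given item counts."""
    return sum(ITEM_TIME_MINUTES[fmt] * n for fmt, n in item_counts.items())

# Example: a planned middle school test for a 50-minute period.
planned = {"true_false": 10, "multiple_choice_factual": 20,
           "short_answer": 4, "short_essay": 1}
minutes = estimated_test_time(planned)
print(f"Estimated time: {minutes:.0f} minutes")  # about 50 minutes
```

In this example the planned mix already fills the 50-minute period, leaving no slack for students to review their answers, so the teacher would trim items before finalizing the test.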
Fairness and Equivalence
Fairness is the degree to which all students have an equal opportunity
to learn material and demonstrate their knowledge and skill (Yung,
2001). Consider this multiple-choice question:
Which ball has the smallest diameter?
a. basketball
b. soccer ball
c. lacrosse ball
d. football
If students from a lower-income background answer this geometry
question incorrectly, is it because they haven’t mastered the concept
of diameter or because they lack prior experience with some of the
items? Do you think girls also might lack experience with some of
these items? Females, as a group, do not score as high as males on
tests that reward mechanical or physical skills (Patterson, 1989) or
mathematical, scientific, or technical skills (Moore, 1989).
African-American, Latino, and Native-American students, as well as
students for whom English is a second language, do not as a group
perform as well as Anglos on formal tests (Garcia & Pearson, 1994).
Asian Americans, who have a reputation as high achievers in
American society and who tend to score higher than Anglo students
in math, score lower on verbal measures (Tsang, 1989).
A high-quality assessment should be free of bias, or a
systematic error in test items that leads students from certain
subgroups (ethnic, socioeconomic, gender, religious, disability) to
perform at a disadvantage (Hargis, 2006). Tests that include items
containing bias reduce the validity of the test score. Additional
factors that tend to disadvantage students from diverse cultural
and linguistic backgrounds during testing include:
• speededness, or the inability to complete all items on a test as a result of prescribed time limitations (Mestre, 1984);
• test anxiety and testwiseness (Garcia, 1991; Rincon, 1980);
• differential interpretation of questions and foils (Garcia, 1991); and
• unfamiliar test conditions (Taylor, 1977).

[Figure: Fairness. Tests should be free of bias that would lead certain subgroups to perform at a disadvantage.]

Test fairness and test bias: See page 547.
To ensure that all students have an equal opportunity to demonstrate
their knowledge and skills, teachers may need to make individual
assessment accommodations with regard to format, response, setting,
timing, or scheduling (Elliott, McKevitt, & Kettler, 2002).
In addition to assuring fairness among individuals within a
particular classroom, assessments must demonstrate fairness from one
school year to the next or even one class period to the next. If you
decide to use different questions on a test on a particular unit, you
should try to ensure equivalence from one exam to the next.
Equivalence means that students past and present, or students in the
same course but different class periods, are required to know and
perform tasks of similar (but not identical) complexity and difficulty
in order to earn the same grade (Nitko & Brookhart, 2007). This
assumes that the content or learning goals have not changed and that
the analysis of results from past assessments was satisfactory.
Practicality
Assessing students is very important, but the time devoted to
assessment should not interfere with providing high-quality
instruction. When creating assessments, teachers need to consider
issues of practicality—the extent to which a particular form of
assessment is economical and efficient to create, administer, and
score. For example, essay questions tend to be less time consuming to
construct than good multiple-choice questions. However,
multiple-choice questions can be scored much more quickly and are
easier to score objectively. Performance tasks such as group projects
or class presentations can be very difficult to construct properly, but
there are times when these formats allow the teacher to better assess
what students have learned. When deciding what format to choose,
consider whether:
• the format is relatively easy to construct and not too time consuming to grade,
• the time spent using the testing format could be better spent on teaching, and
• another format could meet assessment goals but be more efficient.
Think about a particular test you might give in the grade level
you intend to teach. How will you ensure its validity, reliability,
and fairness? Evaluate your test's practicality.
DEVELOPING A TEST BLUEPRINT
To increase reliability, validity, fairness, and equivalence, try making
a blueprint prior to developing a test. A test blueprint is an
assessment planning tool that describes the content the test will cover
and the way students are expected to demonstrate their understanding
of that content. When it is presented in a table format, as shown in
Table 27.2, it is called a table of specifications. On the table, the row
headings (down the left margin) indicate major topics that the
assessment will cover. The column headings (across the top) list the
six classification levels of Bloom’s (1956) taxonomy of cognitive
objectives, which provide a framework for composing tests. The first
three categories (knowledge, comprehension, and application) are
often called lower-level objectives, and the last three categories
(analysis, synthesis, and evaluation) are called higher-level objectives.
Think of these six categories as a comprehensive framework for
considering different cognitive goals that need to be met when
planning for instruction. Each cell within the table of specifications
itemizes a specific learning goal. These learning goals get more
complex as you move from left to right. In the far left column, the
student might be asked to define terms, while in the far right column
the student is asked to synthesize or evaluate information in a
meaningful way. For each cell, the teacher also needs to decide how
many questions of each type to ask, as well as how each item will be
weighted (how many points each item will be worth). The weight of
each item should reflect its value or importance (Gronlund, 2003;
Nitko & Brookhart, 2007).
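To make the structure of a table of specifications concrete, here is a minimal sketch (not from the text) that represents blueprint cells as topic and cognitive-level pairs and totals the item counts and point weights; the topics, counts, and weights are hypothetical:

```python
# Hypothetical test blueprint: each cell maps a content topic and a cognitive level
# (from Bloom's taxonomy) to a number of items and the points each item is worth.
blueprint = {
    ("Types of force", "Knowledge"):           {"items": 2, "points_each": 1},
    ("Two-dimensional forces", "Application"): {"items": 6, "points_each": 2},
    ("Interaction of masses", "Synthesis"):    {"items": 1, "points_each": 5},
}

total_items = sum(cell["items"] for cell in blueprint.values())
total_points = sum(cell["items"] * cell["points_each"] for cell in blueprint.values())
print(f"{total_items} items, {total_points} points")  # 9 items, 19 points
```

Totaling the cells this way makes it easy to check that the point weights reflect the relative importance of each learning goal before any items are written.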
When planning a test, it is not necessary to cover all six levels of
Bloom’s taxonomy. It is more important that the test’s coverage
matches the learning goals and emphasizes the same concepts or
skills the teacher focused on in day-to-day instruction. Teaching is
most effective when lesson plans, teaching activities, and learning
goals are aligned and when all three are aligned with state standards.
Assessment is most effective when it matches learning goals and
teaching activities. Assuming that learning goals do not change, the same blueprint can be used to create multiple tests for use across class periods or school years.

Bloom's taxonomy and its application to instructional planning: See page 360.
Performance assessment: See page 498.
TABLE 27.2  Sample Test Blueprint for a High School Science Unit
(In the original table, columns correspond to the major categories of the cognitive taxonomy: Knowledge, Comprehension, Application, Analysis, Synthesis, Evaluation.)

1. Historical concepts of force
   • Identify the concepts of force and list the empirical support for each concept. (2 test items)
   • Classify each force as a vector or scalar quantity, given its description. (2 test items)
2. Types of force
   • Define each type of force and the term velocity. (2 test items)
3. Two-dimensional forces
   • Define the resultant two-dimensional force in terms of one-dimensional factors. (2 test items)
   • Find the x and y components of resultant forces on an object. (6 test items)
4. Three-dimensional forces
   • Define the resultant three-dimensional force in terms of one-dimensional factors. (1 test item)
   • Calculate the gravitational forces acting between two bodies. (8 test items)
5. Interaction of masses
   • Define the terms inertial mass, weight, active gravitational mass, and passive gravitational mass. (6 or 7 test items)
   • Develop a definition of mass that explains the difference between inertial mass and weight. (1 test item)

Source: Adapted from Kryspin & Feldhusen, 1974, p. 42.
Use the sample table of specifications in Table 27.2 to create your own test blueprint for a particular unit at the grade level you intend to teach. What topics will your test cover, and how will you require students to demonstrate their knowledge?
DEVELOPING TEST ITEMS
After you have selected the content to be covered on the test, it is time to consider the format of test items to use.
Teachers have several test formats available: alternate response, matching items, multiple choice, short answer, and
essay. Alternate response (e.g., true/false), matching items, and multiple choice are called recognition tasks because
they ask students to recognize correct information among irrelevant or incorrect statements. These types of items are
also referred to as objective testing formats because they have one correct answer. Short answer/completion (fill in
the blanks) and essay items are recall tasks, requiring students to generate the correct answers from memory. These
types of items are also considered a subjective testing format, with the scoring more open to interpretation.
Objective and subjective formats differ in three important ways:
1. The amount of time to take the test. While teachers may be able to ask 30 multiple-choice questions in a
50-minute class period, they may be able to ask only four essay questions in the same amount of time.
2. The amount of time to score the test. Scoring of objective formats is straightforward, needing only an answer
key with the correct responses. If teachers use optical scanning sheets ("bubble sheets") for objective test
responses, they can scan many sheets in a short amount of time. A middle school or high school teacher who uses
the same 50-item test for several class periods can scan large batches of bubble sheets in minutes. Because
extended short answers or essay questions involve more subjective judgments, they are more time-consuming to
grade. Teachers may choose to use essays for classes with fewer students.
3. The objectivity in grading. Because extended short answers or essay questions are subjective, teachers can
improve their objectivity in scoring these formats by using a rubric. A rubric, such as the example in Figure
27.1, is an assessment tool that provides preset criteria for scoring student responses, making grading simpler and
more transparent. Rubrics ensure consistency of grading across students or grading sessions (when teachers grade
a set of essays, stop, and return to grading).
Teachers should select the format that provides the most direct assessment of the particular skill or learning
outcome being evaluated (Gronlund, 2003). As you will see next, each type of item format has its own unique
characteristics. However, some general rules of item construction apply across multiple formats. All test items
should:
• measure the required skill or knowledge;
• focus on important, not trivial, subject area content;
• contain accurate information, including correct spelling;
• be clear, concise, and free of bias; and
• be written at an appropriate level of difficulty and an appropriate reading level.
Alternate-Choice (True/False) Items
An alternate-choice item presents a proposition that a student must judge and mark as either true or false, yes or no, right or wrong.
Figure 27.1: Rubrics. Rubrics ensure consistency of grading when subjective test item formats are used.

Rubric for a Position Essay

Criteria                                                          Points possible   Points earned
Format (typed, double-spaced)                                     2
States a clear position for or against                            3
Provides appropriate level of detail describing stated position   5
Includes arguments to support response from resources             5
Grammar, spelling, punctuation and clarity of writing             5
Includes references in appropriate format for resources           5
TOTAL                                                             25
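As a rough illustration of how a rubric standardizes scoring, the sketch below applies the criteria and point values from Figure 27.1 to one hypothetical set of awarded points; the code is illustrative and not part of the text:

```python
# Points possible for each criterion, taken from the rubric in Figure 27.1.
POINTS_POSSIBLE = {
    "Format (typed, double-spaced)": 2,
    "States a clear position for or against": 3,
    "Provides appropriate level of detail describing stated position": 5,
    "Includes arguments to support response from resources": 5,
    "Grammar, spelling, punctuation and clarity of writing": 5,
    "Includes references in appropriate format for resources": 5,
}

def score_essay(points_earned):
    """Sum awarded points, refusing any award above the points possible for a criterion."""
    total = 0
    for criterion, possible in POINTS_POSSIBLE.items():
        earned = points_earned.get(criterion, 0)
        if earned > possible:
            raise ValueError(f"{criterion}: {earned} exceeds {possible} points possible")
        total += earned
    return total

# Hypothetical points awarded to one student's essay.
example = {
    "Format (typed, double-spaced)": 2,
    "States a clear position for or against": 3,
    "Provides appropriate level of detail describing stated position": 4,
    "Includes arguments to support response from resources": 4,
    "Grammar, spelling, punctuation and clarity of writing": 5,
    "Includes references in appropriate format for resources": 5,
}
print(score_essay(example), "out of", sum(POINTS_POSSIBLE.values()))  # 23 out of 25
```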
BOX 27.1  Guidelines for Writing True/False Items

1. Use short statements and simple vocabulary and sentence structure. Consider the following true/false item: "The true/false item is more subject to guessing but it should be used in place of a multiple-choice item, if well constructed, when there is a lack of plausible distractors." Long, complex sentences such as this one are confusing and therefore more difficult to judge as true or false.
2. Include only the central idea in each statement. Consider the following true/false item: "The true/false item, which is preferred by testing experts, is also called an alternate-response item." This item has two ideas to evaluate: being favored by experts and being called an alternate-response item.
3. Use negative statements sparingly, and avoid double negatives. These also can be confusing.
4. Use precise wording so that the statement can unequivocally be judged true or false.
5. Write false statements that reflect common misconceptions held by students who have not achieved learning goals.
6. Avoid using statements that reproduce verbatim sentences from the textbook or reading material. Students may be able to judge the item true or false simply by memorizing and not understanding. This reduces the validity of the student's test score. Does the student really know the material?
7. Statements of opinion should be attributed to some source (e.g., according to the textbook; according to the research on . . . ).
8. When cause-effect is being assessed, use only true propositions. For example, use "Exposure to ultraviolet rays can cause skin cancer" rather than "Exposure to ultraviolet rays does not cause skin cancer." Evaluating that the positively worded statement is true is less confusing than evaluating that the negatively worded statement is false.
9. Do not overqualify the statement in a way that gives away the answer. Avoid using specific determiners (e.g., always, never, all, none, only; statements with usually, may, and sometimes tend to be true).
10. Make true and false items of comparable length, and include the same number of true and false items.
11. Randomly sort the items into a numbered sequence so that they are less likely to appear in a repetitive, predictable pattern (e.g., avoid T F T F . . . or TT FF TT FF . . .).

Sources: Ebel, 1979; Gronlund, 2003; Lindvall & Nitko, 1975.
Alternate-choice items are recognition tasks because students only need to recognize whether the statement matches a fact that they have in their memory. A true/false question might state:

An equilateral triangle has three sides of equal length.  T  F

Teachers also can design items that require multiple true/false responses. For example:

Scientists who study earthquakes have learned that:
1. the surface of the earth is in constant motion due to forces inside the planet.  T  F
2. an earthquake is the vibrations produced by breaking rocks along a fault line.  T  F
3. the time and place of earthquakes are easy to predict.  T  F
Alternate-choice questions are optimal when the subject matter
lends itself to an either-or response. They allow teachers to ask a
large number of questions in a short period of time (a practicality
issue), making it possible to cover a wide range of topics within the
domain being assessed. However, an obvious disadvantage of this
format is that students have a 50% chance of getting the answer
correct simply by guessing. Good alternate-choice questions are harder to write than you might expect; teachers can use the recommendations given in Box 27.1.
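As one illustration, guidelines 10 and 11 in Box 27.1 (balance the number of true and false statements and order them unpredictably) can be checked and applied automatically; the sketch below uses a small hypothetical item pool and is not part of the text:

```python
import random

# Hypothetical pool of (statement, is_true) pairs for a short quiz.
items = [
    ("An equilateral triangle has three sides of equal length.", True),
    ("The time and place of earthquakes are easy to predict.", False),
    ("Exposure to ultraviolet rays can cause skin cancer.", True),
    ("Water boils at 50 degrees Celsius at sea level.", False),
]

true_count = sum(1 for _, is_true in items if is_true)
false_count = len(items) - true_count
assert true_count == false_count, "Guideline 10: use the same number of true and false items"

random.shuffle(items)  # Guideline 11: avoid a predictable answer pattern
for number, (statement, _) in enumerate(items, start=1):
    print(f"{number}. {statement}  T  F")
```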
Matching Exercises
A matching exercise presents students with directions for matching a
list of premises and a list of responses. The student must match each
premise with one of the responses. Matching is a recognition task
because the answers are present in the test item. A simple matching
exercise might look like this:
Directions: In the left column are events that preceded the start of the Revolutionary War. For each event, choose the date it occurred from the right column, and place the letter identifying it on the line next to the event description.

Description of events (premise list):
______ 1. British troops fire on demonstrators in the Boston Massacre, killing five.
______ 2. British Parliament passes the Tea Act.
______ 3. Parliament passes the Stamp Act, sparking protests in the American colonies.
______ 4. Treaty of Paris ends French power in North America.

Dates (response list):
a. 1763
b. 1765
c. 1768
d. 1770
e. 1773
f. 1778
Matching exercises are very useful for assessing a student’s ability
to make associations or see relationships between two things (e.g.,
words and their definitions, individuals and their accomplishments,
events in history and dates). This format provides a space-saving and
objective way to assess learning goals. It is versatile in that words or
phrases can be matched to symbols or pictures (e.g., matching a
country name to the outline of that country on a map). Well-designed
matching exercises can assess students’ comprehension of concepts,
ideas, and principles. However, teachers often fall into the trap of
using matching only for memorized lists (such as names and dates)
and do not develop matching exercises that assess higher-level
thinking. To construct effective matching items, consider these
guidelines:
• Clearly explain the intended basis for matching, as in the directions for the sample item above.
• Use short lists of responses and premises.
• Arrange the response list in a logical order. For example, in the sample item, dates are listed in chronological order.
• Identify premises with numbers and responses with letters.
• Construct items so that longer phrases appear in the premise list and shorter phrases appear in the response list.
• Create responses that are plausible items for each premise. A response that clearly does not fit any premise gives a hint to the correct answer.
• Avoid "perfect" one-to-one matching by including one or more responses that are incorrect choices or by using a response as the correct answer for more than one premise.
Multiple-choice Items
Each multiple-choice item contains a stem, or introductory statement
or question, and a list of choices, called response alternatives.
Multiple-choice items are recognition tasks because the correct
answer is provided among the choices. A typical multiple-choice
question looks like this:

What is the main topic of the reading selection Responding to Learners?
a. academic achievement
b. question and response techniques
c. managing student behavior
The response alternatives include a keyed alternative (the correct
answer) and distractors, or incorrect alternatives. The example
includes three choices: one keyed alternative and two distractors.
Other multiple-choice formats may include a four-choice option (a, b,
c, d) or five-choice option (a, b, c, d, e). Box 27.2 presents a set of
detailed guidelines for developing a multiple-choice format that
addresses content, style, and tips for writing the stem and the choices.
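To keep the terminology straight, here is a minimal sketch (not from the text) that represents a multiple-choice item as a simple data structure with a stem, response alternatives, and a keyed alternative; the answer key shown is hypothetical:

```python
from dataclasses import dataclass

@dataclass
class MultipleChoiceItem:
    stem: str            # the introductory statement or question
    alternatives: list   # all response alternatives, in presentation order
    keyed_index: int     # index of the keyed (correct) alternative

    @property
    def distractors(self):
        """Return the incorrect alternatives."""
        return [alt for i, alt in enumerate(self.alternatives) if i != self.keyed_index]

item = MultipleChoiceItem(
    stem="What is the main topic of the reading selection Responding to Learners?",
    alternatives=["academic achievement",
                  "question and response techniques",
                  "managing student behavior"],
    keyed_index=1,  # hypothetical answer key, for illustration only
)
print(item.distractors)  # the two distractors for this item
```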
Of the item formats that serve as recognition tasks, multiple-choice
items are preferred by most assessment experts because this format
offers many advantages. The multiple-choice format can be used to
assess a great variety of learning goals, and the questions can be
structured to assess factual knowledge as well as higher-order
thinking. Because multiple-choice questions do not require writing,
BOX 27.2  Guidelines for Writing Multiple-choice Items

Content concerns:
1. Every item should reflect specific content and a single specific mental behavior, as called for in test specifications.
2. Base each item on important content to learn; avoid trivial content.
3. Use novel material to test higher-level learning. Paraphrase textbook language or language used during instruction in a test item to avoid testing for simple recall.
4. Keep the content of each item independent from content of other items on the test.
5. Avoid overly specific and overly general content when writing items.
6. Avoid opinion-based items.
7. Avoid trick items.
8. Keep vocabulary simple for the group of students being tested.

Writing the stem:
9. Ensure that the wording of the stem is very clear.
10. Include the central idea in the stem instead of in the response alternatives. Minimize the amount of reading in each item. Instead of repeating a phrase in each alternative, try to include it as part of the stem.
11. Avoid window dressing (excessive verbiage).
12. Word the stem positively; avoid negatives such as NOT or EXCEPT. If a negative word is used, use the word cautiously and always ensure that the word appears capitalized and boldface.

Writing the response alternatives:
13. Offering three response alternatives is adequate.
14. Make sure that only one of the alternatives is the right answer.
15. Make all distractors plausible. Use typical student misconceptions as your distractors.
16. Phrase choices positively; avoid negatives such as NOT.
17. Keep alternatives independent; they should not overlap.
18. Keep alternatives homogeneous in content and grammatical structure. For example, if one alternative is stated as a negative (e.g., "No running in the hall"), all alternatives should be phrased as negatives.
19. Keep all alternatives roughly the same length. If the correct answer is substantially longer or shorter than the distractors, this may serve as a clue for test-takers.
20. None of the above should be avoided or used sparingly.
21. Avoid All of the above.
22. Vary the location of the correct answer in the list of alternatives (e.g., the correct answer should not always be C).
23. Place alternatives (A, B, C, D) in logical or numerical order. For example, if a history question lists dates as the response alternatives, they should be listed in chronological order.
24. Avoid giving clues to the right answer, such as:
   a. specific determiners, including always, never, completely, and absolutely
   b. choices identical to or resembling words in the stem
   c. grammatical inconsistencies that cue the test-taker to the correct choice
   d. obvious correct choice
   e. pairs or triplets of options that clue the test-taker to the correct choice
   f. blatantly absurd, ridiculous options

Style concerns:
25. Edit and proof items.
26. Use correct grammar, punctuation, capitalization, and spelling.

Source: Adapted from Haladyna, Downing, & Rodriguez, 2002.
Because multiple-choice questions do not require writing, students who are poor writers have a more equal playing field for
demonstrating their understanding of the content than they have when
answering essay questions. All students also have less chance to
guess the correct answer in a multiple-choice format than they do
with true/false items or a poorly written matching exercise. Also, the
distractor that a student incorrectly chooses can give the teacher
insight into the student’s degree of misunderstanding.
However, because multiple-choice items are recognition tasks, this
format does not require the student to recall information
independently. Multiple-choice questions are not the best option for
assessing writing skills, self-expression, or synthesis of ideas or in
situations where you want students to demonstrate their work (e.g.,
showing steps taken to solve a math problem). Also, poorly written
multiple-choice questions can be superficial, trivial, or limited to
factual knowledge.
The guidelines for writing objective items, such as those given in
Box 27.1 and Box 27.2, are especially important for ensuring the
validity of classroom tests. Poorly written test items can give students
a clue to the right answer. For example, specific
determiners—extraneous clues to the answer such as always, never,
all, none, only, usually, may—can enable students who may not know
the material well to correctly answer some true/false or
multiple-choice questions using testwiseness. Test-wiseness is an
ability to use test-taking strategies, clues from poorly written test
items, and prior experience in test taking to improve one’s score.
Learned either informally or through direct instruction, it improves
with grade level, experience in test taking, and motivation to do well
on an assessment (Ebel & Frisbie, 1991; Sarnacki, 1979; Slakter,
Koehler, & Hampton, 1970).
Short-answer/Completion Items
Short-answer/completion items come in three basic varieties. The question variety presents a direct question, and students are expected to supply a short answer (usually one word or phrase), as shown here:

1. What is the capital of Kentucky?  Frankfort
2. What does the symbol Ag represent on the periodic table?  Silver
3. How many feet are in one yard?  3

The completion variety presents an incomplete sentence and requires students to fill in the blank, as in the next examples:

1. The capital of Kentucky is  Frankfort .
2. 3(2 + 4) =  18

The association variety (sometimes called the identification variety) presents a list of terms, symbols, labels, and so on for which students have to recall the corresponding answer, as shown below:

Element     Symbol
Barium      Ba
Calcium     Ca
Chlorine    Cl
Short-answer items are relatively easy to construct. Teachers can
use these general guidelines to help them develop effective
short-answer items:
• Use the question variety whenever possible, because it is the most straightforward short-answer design and is the preferred option of experts.
• Be sure the items are clear and concise so that a single correct answer is required.
• Put the blank toward the end of the line for easier readability.
• Limit blanks within a short-answer question to one or two.
• Specify the level of precision (a word, a phrase, a few sentences) expected in the answer so students understand how much to write in their response.
While short-answer items typically assess students’ lower-level
skills, such as recall of facts, they also can be used to assess more
complex thinking skills if they are well designed. The short-answer
format lowers a student’s probability of getting an answer correct by
random guessing, a more likely scenario with alternate choice and
multiple choice.
Short-answer items also are relatively easy to score objectively,
especially when the correct response is a one-word answer. Partial
credit can be awarded if a student provides a response that is close to
the correct answer but not 100% accurate. You occasionally may find
that students provide unanticipated answers. For example, if you ask
"Who discovered America?", student responses might
include "Christopher Columbus," "Leif Erikson," "the Vikings," or "explorers who sailed across the ocean." You then would have to
make a subjective judgment about whether such answers should
receive full or partial credit. Because reliability decreases as scoring
becomes more subjective, teachers should use a scoring key when
deciding on partial credit to maintain scoring consistency from one
student to the next.
Essay Tasks
Essay tasks allow assessment of many cognitive skills that cannot be
assessed adequately, if at all, through more objective item formats.
Essay tasks can be classified into two types. Restricted response
essay tasks limit the content of students’ answers as well as the form
of their responses. A restricted response task might state: List the
three parts of the memory system and provide a short statement
explaining how each part operates. Extended response essay tasks
require students to write essays in which they are free to express their
thoughts and ideas and to organize the information as they see fit.
With this format, there usually is no single correct answer; rather the
accuracy of the response becomes a matter of degree. Teachers can
use these guidelines to develop effective essay questions:
• Cover the appropriate range of content and learning goals. One essay question may address one learning goal or several.
• Create essay questions that assess application of knowledge and higher-order thinking, not simply recall of facts.
• Make sure the complexity of the item is appropriate to the student's developmental level. Elementary school students might not be required to write lengthy essays in essay booklets, whereas middle school or high school students would be expected to write more detailed essays.
• Specify the purpose of the task, the length of the response, time limits, and evaluation criteria. For example, instead of an essay task that states "Discuss the advantages of single-sex classrooms," a high school teacher might phrase the task as: "You are addressing the school board. Provide three arguments in favor of single-sex classrooms." The revised essay question provides a purpose for the response and specifies the amount students need to write. Teachers also should specify how essays will be evaluated, for example, whether spelling and grammar count toward the grade and how students' opinions will be evaluated.
Whether to use a restricted response format or an extended
response format depends on the intended purpose of the test item as
well as reliability and practicality issues. The restricted response
format narrows the focus of the assessment to a specific, well-defined
area. The level of specificity makes it more likely that students will
interpret the question as intended. This makes scoring easier, because
the range of possible responses is also restricted. Scoring is more
reliable because it is easier to be very clear about what constitutes a
correct answer. On the other hand, if you want to know how a student
organizes and synthesizes information, the narrowly focused,
restricted response format may not serve the assessment purpose well.
Extended response questions are suitable for assessing students’
writing skill and/or subject matter knowledge. If your learning goals
involve skills such as organizing ideas, critically evaluating a certain
position or argument, communicating feelings, or demonstrating
creative writing skill, the extended response format provides an
opportunity for students to demonstrate these skills.
Because extended responses are subjective, this format generally
has poor scoring reliability. Given the same essay response to
evaluate, several different teachers might award different scores, or
the same teacher might award different scores to student essays at the
beginning and end of a pile of tests. When the scores given are
inconsistent from one response to the next, the validity of the
assessment results is lessened. Also, teachers tend to evaluate the
essays of different students according to different criteria, evaluating
one essay in terms of its high level of creativity and another more
critically in terms of grammar and spelling. A significant
disadvantage is that grading extended essay responses is a very
time-consuming process, especially if the teacher takes the time to
provide detailed written feedback to help students improve their work.
Restricted and extended response essay items have special scoring
considerations unique to the essay format. Essay responses, especially
extended ones, tend to have poor scoring reliability and lower
practicality. As discussed earlier, a scoring rubric can help teachers
score essay answers more fairly and consistently. The following
guidelines offer additional methods for ensuring consistency:
• Use a set of anchor essays: student essays that the teacher selects as examples of performance at different levels of a scoring rubric (Moskal, 2003). For example, a teacher may have a representative A essay, B essay, and so on. Anchors increase reliability because they provide a comparison set for teachers as they score student responses. A set of anchor essays without student names can be used to illustrate the different levels of the scoring rubric to both students and parents.
• If an exam has more than one essay question, score all students on the first question before moving on to the next question. This method increases the consistency and uniformity of your scoring.
• Score subject matter content separately from other factors such as spelling or neatness.
• To increase fairness and eliminate bias, score essays anonymously by having students write their names on the back of the exam.
• Take the time to provide effective feedback on essays.
Think about a particular unit on which your students might be
tested in the grade level you intend to teach. What test item
formats would you choose to use, and why? Would your choice
of item formats be different for a pretest and for a test given at
the end of a unit?
TEST ANALYSIS AND REVISION
No test is perfect. Good teachers evaluate the tests they use and make
necessary revisions to improve them. When using objective test
formats such as alternate response and multiple choice, teachers can
evaluate how well test items function by using item analysis, a
process of collecting, summarizing, and using information from
student responses to make decisions about each test item. When
teachers use optical scanning sheets ("bubble sheets") for test
responses, they can use a computer program not only to score the
responses, but also to generate an item analysis. Item analysis
provides two statistics that indicate how test items are functioning: an
item difficulty index and an item discrimination index.
The item difficulty index reports the proportion of the group of
test-takers who answered an item correctly, ranging from 0 to 1. Items
that are functioning appropriately should have a moderate item
difficulty index to distinguish students who have grasped the material
from those who have not. This increases the validity of the test score.
As a rule of thumb, a moderate item difficulty can range from
.3 to .7. However, the optimal item difficulty level must take guessing
into account. For example, with a four-choice (a, b, c, d)
multiple-choice item, the chance of guessing is 25%. The optimal
difficulty level for this type of item is the midpoint between chance
(25%) and 100%, or 62.5%. So item difficulties for a four-choice
multiple-choice item should be close to or around .625.
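Because the index is just a proportion, the calculation is easy to illustrate; the sketch below (not from the text) computes an item difficulty index from hypothetical response data along with the guessing-adjusted optimal difficulty described above:

```python
# Hypothetical responses to one four-choice item: 1 = answered correctly, 0 = incorrect.
responses = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0]

difficulty = sum(responses) / len(responses)  # proportion answering correctly
print(f"Item difficulty index: {difficulty:.2f}")  # 0.70

# Optimal difficulty is the midpoint between the chance level and 1.0.
num_choices = 4
chance = 1 / num_choices                      # 0.25 for a four-choice item
optimal = chance + (1 - chance) / 2
print(f"Optimal difficulty for {num_choices} choices: {optimal:.3f}")  # 0.625
```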
Item difficulty indexes that are very low (e.g., 0–.3) indicate that
very few students answered correctly. This information can identify
particular concepts that need to be retaught, provide clues about the
strengths and weaknesses of instruction, or indicate test items that are
poorly written and need to be revised or removed. Item difficulty
indexes that are very high (e.g., .8–.9) indicate that the majority of
students answered the items correctly, suggesting that the items were
too easy. While we want students to perform well, items that are too
easy do not discriminate between the students who know the material
well and those who do not, which is one purpose of assessing student
performance.
Item discrimination indexes add this crucial piece of information.
An item discrimination index describes the extent to which a
particular test item differentiates high-scoring students from
low-scoring students. It is calculated as the difference between the
proportion of the upper group (highest scorers) who answered a
particular item correctly and the proportion of the lower group (lowest
scorers) who answered that same item correctly. The resulting index
ranges from –1 to +1. If a test is well constructed, we would expect all
test items to be positively discriminating, meaning that those students
in the highest scoring group get the items correct while those in the
lowest scoring group get the items wrong. Test items with low (below
.4), zero, or negative discrimination indexes reduce a test score’s
validity and should be rewritten or replaced. A low item
discrimination index indicates that the item cannot accurately
discriminate between students who know the material and those who
don’t. An item discrimination index of zero indicates that the item
cannot discriminate at all between high scorers and low scorers. A
test with many zero discriminations does not provide a valid measure
of student achievement. Items with negative discrimination indexes
indicate that lower-scoring students tended to answer the items
correctly while higher-scoring students tended to get them wrong. If a
test contains items with negative discriminations, the total score on
the exam will not provide useful information.
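The discrimination calculation described above is equally simple; here is a minimal sketch using hypothetical upper- and lower-group responses, not data from the text:

```python
# Hypothetical results for one item: whether each student in the upper (highest-scoring)
# and lower (lowest-scoring) groups answered the item correctly.
upper_group = [1, 1, 1, 0, 1, 1, 1, 1]   # 7 of 8 correct
lower_group = [0, 1, 0, 0, 1, 0, 0, 1]   # 3 of 8 correct

p_upper = sum(upper_group) / len(upper_group)
p_lower = sum(lower_group) / len(lower_group)
discrimination = p_upper - p_lower
print(f"Item discrimination index: {discrimination:.2f}")  # 0.50, a positively discriminating item
```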
When item analyses suggest that particular test items did not
function as expected, the source of the problem must be investigated.
The item analysis may indicate that the problem stems from:
• the item itself. For example, multiple-choice items may have poorly functioning distractors, ambiguous alternatives, questions that invite random guessing, or items that have been keyed incorrectly.
• student performance. Students may have misread the item and misunderstood it. When developing a new test item, the teacher may think it is worded clearly and has one distinctly correct answer. However, item analysis might reveal that students interpreted that test item differently than intended.
• teacher performance. Low item difficulties, which indicate that students did not grasp the material, may suggest that the teacher's performance needs to be improved. Perhaps concepts needed further clarification or the teacher needs to consider a different approach to presenting the material.
The process of discarding certain items that did not work well,
revising other items, and testing a few new items on each test
eventually leads to a much higher quality test (Nitko & Brookhart,
2007). Once "good" items have been selected, teachers can use
software to store test items in a computer file so they can select a
subset of test items, make revisions, assemble tests, and print out tests
for use in the classroom. Certain software products even allow the
teacher to sort items according to their alignment with curriculum
standards. Computer applications vary in quality, cost,
user-friendliness, and amount of training required for their use.
An item analysis shows that three of your test items have very low item difficulties, and you plan to revise these before you use the test next time. Because these poorly functioning items affect students' test scores, what can you do to improve the validity of your current students' test scores?
Summary
Discuss the importance of validity, reliability, fairness/ equivalence, and practicality in test
construction. Classroom tests should be evaluated based on their validity, reliability,
fairness/equivalence, and practicality. Teachers must consider how well the test measures what it is
supposed to measure (validity), how consistent the results are (reliability), the degree to which all
students have an equal opportunity to learn and demonstrate their knowledge and skill
(fairness/equivalence), and how economical and efficient the test is to create, administer, and score
(practicality).
Explain how a test blueprint is used to develop a good test. A test blueprint is an assessment
planning tool that describes the content the test will cover and the way students are expected to
demonstrate their understanding of that content. A test blueprint, or table of specifications, helps teachers
develop good tests because it matches the test to instructional objectives and actual instruction. Test
blueprints take into consideration the importance of each learning goal, the content to be assessed, the
material that was emphasized during instruction, and the amount of time available for students to
complete the test.
Discuss the usefulness of each test item format.
Alternate-choice items allow teachers to cover a wide range of topics by asking a large number of
questions in a short period of time. Matching exercises can be useful for assessing a student's ability to
make associations or see relationships between two things. Multiple-choice items are preferred by
assessment experts because they focus on reading and thinking but do not
require writing, give teachers insight into students' degree of misunderstanding, and can be used to
assess a variety of learning goals. Short-answer questions are relatively easy to construct, can
assess both lower-order and higher-order thinking skills, and minimize the chance that students will
answer questions correctly by randomly guessing. Essay questions can provide an effective assessment
of many cognitive skills that cannot be assessed adequately, if at all, with more objective item formats.
Compare and contrast the scoring considerations for the
five test item formats. Objective formats
such as alternate choice, matching, and multiple choice have one right answer and tend to be relatively
quick and easy to score. Short-answer/completion questions, if well designed, also can be relatively easy
to score as long as they are written clearly and require a very specific answer. Essay questions,
especially extended essay formats, tend to have poor scoring reliability and lower practicality because
scoring is subjective and can be time-consuming. Scoring rubrics allow teachers to score essay
answers more fairly and consistently.
Describe the benefits of item analysis and revision.
Item analysis determines whether a test item functions as intended, indicates areas where students need
clarification of concepts, and points out where the curriculum needs to be improved in future
presentations of the material. The process of discarding certain items that did not work well, revising other
items, and testing a few new items on each test eventually leads to a much higher quality test.
Key Concepts
alternate-choice item, content validity, distractors, equivalence, extended response essay, fairness, item analysis, item difficulty index, item discrimination index, matching exercise, multiple-choice item, objective testing, objectivity, practicality, recall tasks, recognition tasks, reliability, response alternatives, restricted response essay, rubric, short-answer/completion items, specific determiners, stem, subjective testing, table of specifications, test blueprint, validity
Case Studies: Reflect and Evaluate
Early Childhood:
“The Zoo”
These questions refer to the case study on page 458.
1. Sanjay and Vivian do not do any paper-and-pencil testing in this scenario. What factors should they consider when deciding whether to use traditional tests as a form of assessment?
2. If Vivian and Sanjay choose to use tests or quizzes with the
children, what steps should they take to ensure the validity of their
results?
3. What issues could potentially interfere with the reliability of test results among
preschoolers?
4. Most of the preschoolers in Vivian and Sanjay's classroom do
not yet know how to read independently. How would this impact test
construction and use for this age group?
5. Given your response to question 4, how might Vivian and Sanjay
assess a specific set of academic skills (e.g., letter or number
recognition) with their students in a systematic way?
6. The lab school is located in a large city and has a diverse group
of students. How might issues of fairness come into play when
designing assessments for use with these students?
Elementary School:
“Writing Wizards”
These questions refer to the case study on page 460.
1. Brigita provides an answer key for the Grammar Slammer activity (quiz) and
reviews it with the class.
How does the use of such a key impact the level of objectivity in scoring
responses?
2. Brigita uses a matching exercise (quiz) to evaluate students'
understanding of their weekly vocabulary words. What are the
advantages and disadvantages of this form of assessment?
3. If Brigita decides she wants to vary her quiz format by using
multiple-choice questions instead, what factors should she keep in mind in order to write good multiple-choice questions?
4. Brigita uses a combination of traditional tests and applied writing
activities to assess her students' writing skills. Are there other
subject areas in a fourth-grade classroom in which tests would be a
useful assessment choice? Explain.
5. Imagine that Brigita will be giving a social studies test and wants
to incorporate writing skills as part of this assessment. What are the
advantages and disadvantages of the essay format?
6. Based on what you read in the module, what advice would you
give Brigita about how to score the responses on the social studies
test referred to in question 5?
Middle School:
“Assessment: Cafeteria Style”
These questions refer to the case study on page 462.
1. From Ida's perspective, how do the development,
implementation, and grading of the multiple-choice test rate in terms
of practicality?
2. Do 50 multiple-choice questions seem an appropriate number for a middle
school exam in a
50-minute class period? Why or why not?
3. What are the advantages of using multiple-choice questions
rather than one of the other question formats available (alternate
choice, matching, short answer, or essay)? Would your answer vary
for different subjects?
4. What are some limitations of using only multiple-choice
questions to test students' understanding of course content?
5. Ida provided a rubric to help her students better understand what
was expected from them on the project option. What could she have
done to clarify her expectations for those students who choose to
take the exam?
6. If you were the one writing the questions for Ida's exam, what
are some guidelines you would follow to make sure the questions
are well constructed?
High School:
“Innovative Assessment Strategies”
These questions refer to the case study on page 464.
1. What are some advantages tests have that might explain why
so many teachers at Jefferson High rely on them as a primary
means of assessment?
2. In the New Hampshire humanities course described in the
memo, students are asked to design their own test as part of their
final project. How could these students use a test blueprint to
develop a good test?
3. What criteria should the New Hampshire teacher use to
evaluate the quality of the tests designed by students?
4. Imagine that a New Hampshire teacher gave a
final exam using
a combination of the test questions created by students. What
could the teacher do to determine how the questions actually
functioned?
5. The California teacher used a "dream home" project to assess his students' conceptual understanding of area relationships. Is it
possible to assess this level of understanding by using
multiple-choice questions on a test? Explain your answer.
6. Imagine that the Rhode Island social studies teacher wants
students to pay close attention to one another's oral history
presentations, so she announces that students will be given a test
on the material. If she wants to test basic recall of facts and wants
the test to be quick and easy to grade, which item formats would
you suggest she use on her test? Explain.